Custom Grok Regex Matchers: Writing a Regular Expression to Parse Log Files

阿华AIGC实验室

2026-5-21

A Regex Approach to Parsing Rails-Style Logs

Hey there! Let's figure out how to parse these Rails-style log lines with a regex. First, let's take a look at your sample logs to understand their structure:

I, [2018-03-23T13:30:10.076546 #3107] INFO -- : method='HEAD' path='/healthcheck' format='*/*' ip= status=200 duration=0.03
I, [2018-03-23T13:31:23.488928 #3107] INFO -- : method='GET' path='/feed/bc822bc19.csv' format= ip='127.0.0.0' status=200 duration=0.04 host='feeds' user='-' params={} agent='' protocol='http'
I, [2018-03-23T13:31:30.956484 #3107] INFO -- : method='GET' path='/feed/ad4d93bee.csv' format= ip='127.0.0.0' status=200 duration=0.05 host='feeds' user='-' params={} agent='' protocol='http'
I, [2018-03-23T13:32:10.123399 #3107] INFO -- : method='HEAD' path='/healthcheck' format='*/*' ip= status=200 duration=0.03 host='feeds' user='-' params={} agent='' protocol='http'
I, [2018-03-23T13:33:46.362908 #3107] INFO -- : method='GET' path='/feed/e9cbe2f42e0a6.xml' format= ip='127.0.0.0' status=200 duration=0.02 host='feeds' user='-' params={} agent='' protocol='http'
I, [2018-03-23T13:34:10.060682 #3107] INFO -- : method='HEAD' path='/healthcheck' format='*/*' ip= status=200 duration=0.03 host='feeds' user='-' params={} agent='' protocol='http'
I, [2018-03-23T13:35:01.445029 #3107] INFO -- : method='GET' path='/feed/85b91d6f7.xml' format= ip='127.0.0.0' status=200 duration=0.02 host='feeds' user='-' params={} agent='' protocol='http'
I, [2018-03-23T13:35:04.486874 #3107] INFO -- : method='GET' path='/feed/34bda5b6f.csv' format= ip='127.0.0.0' status=200 duration=0.33 host='feeds' user...

Each line splits into two main sections:

- The header: Contains the timestamp, process ID, and log level
- The payload: A set of key-value pairs with request details. Values can be quoted (single quotes), unquoted, empty, or even truncated (like the last line's incomplete user...).
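To make the two-part structure concrete, here is a minimal Ruby sketch that splits one line at the literal " -- : " separator. The names SEPARATOR and split_log_line are hypothetical helpers introduced for illustration; they are not part of the original answer.

```ruby
# Hypothetical helper (an illustration, not the original answer's method):
# split one log line into header and payload at the literal " -- : " separator.
SEPARATOR = " -- : "

def split_log_line(line)
  # A limit of 2 keeps any later " -- : " inside the payload intact.
  header, payload = line.chomp.split(SEPARATOR, 2)
  { header: header, payload: payload }
end

sample = "I, [2018-03-23T13:30:10.076546 #3107] INFO -- : method='HEAD' path='/healthcheck' format='*/*' ip= status=200 duration=0.03"

parts = split_log_line(sample)
puts "Header:  #{parts[:header]}"
puts "Payload: #{parts[:payload]}"
```

If the separator is missing, payload comes back as nil, which is a cheap way to spot lines that are not in this format.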

A Practical Regex

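Here's a regex that captures the key fields in your sample logs, treats the trailing fields as optional, and tolerates the empty and truncated values shown above. It matches INFO lines only, because the I, prefix and the INFO level are written literally: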

^I, \[(?<timestamp>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d+) #(?<pid>\d+)\] INFO -- : method='(?<method>[^']+)' path='(?<path>[^']+)' format=(?<format>'[^']*'|\S*|) ip=(?<ip>'[^']*'|\S*|) status=(?<status>\d+) duration=(?<duration>\d+\.\d+)(?: host='(?<host>[^']+)')?(?: user='(?<user>[^']+)')?(?: params=(?<params>\{.*\}))?(?: agent='(?<agent>[^']*)')?(?: protocol='(?<protocol>[^']+)')?.*$

Breaking Down the Regex

Let's break down what each part does:

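Header Capture:
- (?<timestamp>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d+): Grabs the full ISO 8601-style timestamp, including the fractional seconds (microseconds in these samples)
- (?<pid>\d+): Captures the numeric process ID
- I, and INFO: Matched literally, so the log level is not captured and lines at other levels will not match

Payload Fields:
- Required fields (like method/path): method='(?<method>[^']+)' targets the quoted value reliably (since these always have quotes in your logs)
- Flexible fields (like format/ip): format=(?<format>'[^']*'|\S*|) handles three scenarios: quoted values, unquoted non-whitespace values, or empty entries. A quoted value is captured with its quotes, so strip them afterwards. The trailing empty alternative is redundant, because \S* already matches an empty value
- Optional fields (like host/user): Wrapped in (?: ...)? so the regex still matches lines that don't include these fields. They are only captured when they appear in the order written in the pattern
- Final .*: Catches truncated content (like the last line's user...) without breaking the rest of the match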
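The breakdown above can also be written into the pattern itself. The following is a sketch of the same regex in Ruby's free-spacing mode (the x flag), one piece per line. In that mode literal spaces and the # before the process ID have to be escaped. The constant name LOG_LINE_X is an assumption for illustration, and the redundant empty alternative in format and ip is dropped, which should not change what matches.

```ruby
# The same pattern in free-spacing mode. Whitespace in the pattern is ignored,
# so each literal space is written as "\ " and the literal hash as "\#".
LOG_LINE_X = /
  ^I,\ \[(?<timestamp>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d+)  # timestamp
  \ \#(?<pid>\d+)\]                      # process ID
  \ INFO\ --\ :                          # literal level and separator
  \ method='(?<method>[^']+)'            # required, always quoted
  \ path='(?<path>[^']+)'                # required, always quoted
  \ format=(?<format>'[^']*'|\S*)        # quoted, unquoted, or empty
  \ ip=(?<ip>'[^']*'|\S*)                # quoted, unquoted, or empty
  \ status=(?<status>\d+)
  \ duration=(?<duration>\d+\.\d+)
  (?:\ host='(?<host>[^']+)')?           # optional from here on
  (?:\ user='(?<user>[^']+)')?
  (?:\ params=(?<params>\{.*\}))?
  (?:\ agent='(?<agent>[^']*)')?
  (?:\ protocol='(?<protocol>[^']+)')?
  .*$                                    # tolerate truncated tails
/x

# First sample line: quoted format, empty ip, no optional fields.
line = "I, [2018-03-23T13:30:10.076546 #3107] INFO -- : method='HEAD' path='/healthcheck' format='*/*' ip= status=200 duration=0.03"

m = LOG_LINE_X.match(line)
p m[:format]  # quoted value, captured with its quotes
p m[:ip]      # empty string
p m[:host]    # nil, because the optional group did not take part
```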

Usage Example (Ruby, for a Rails Environment)

Since these look like Rails logs, here's how you can use this regex in Ruby to extract values:

log_line = "I, [2018-03-23T13:31:23.488928 #3107] INFO -- : method='GET' path='/feed/bc822bc19.csv' format= ip='127.0.0.0' status=200 duration=0.04 host='feeds' user='-' params={} agent='' protocol='http'"

regex = /^I, \[(?<timestamp>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d+) #(?<pid>\d+)\] INFO -- : method='(?<method>[^']+)' path='(?<path>[^']+)' format=(?<format>'[^']*'|\S*|) ip=(?<ip>'[^']*'|\S*|) status=(?<status>\d+) duration=(?<duration>\d+\.\d+)(?: host='(?<host>[^']+)')?(?: user='(?<user>[^']+)')?(?: params=(?<params>\{.*\}))?(?: agent='(?<agent>[^']*)')?(?: protocol='(?<protocol>[^']+)')?.*$/

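match = regex.match(log_line)

# regex.match returns nil for a line that does not fit the pattern, so guard before reading captures
raise ArgumentError, "log line did not match the pattern" if match.nil?

# Print the captured values; a quoted ip is captured with its quotes, so strip them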
puts "Timestamp: #{match[:timestamp]}"
puts "Process ID: #{match[:pid]}"
puts "HTTP Method: #{match[:method]}"
puts "Request Path: #{match[:path]}"
puts "IP Address: #{match[:ip].delete("'")}"
puts "Status Code: #{match[:status]}"
puts "Duration: #{match[:duration]}s"
puts "Host: #{match[:host]}"

This will output:

Timestamp: 2018-03-23T13:31:23.488928
Process ID: 3107
HTTP Method: GET
Request Path: /feed/bc822bc19.csv
IP Address: 127.0.0.0
Status Code: 200
Duration: 0.04s
Host: feeds
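In practice you will probably want typed values and a safe result for lines that do not match. Below is a minimal sketch under those assumptions: LOG_LINE holds the regex from above, while unquote and parse_log_line are hypothetical helper names. The timestamp is left as a string because the samples carry no time zone.

```ruby
LOG_LINE = /^I, \[(?<timestamp>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d+) #(?<pid>\d+)\] INFO -- : method='(?<method>[^']+)' path='(?<path>[^']+)' format=(?<format>'[^']*'|\S*|) ip=(?<ip>'[^']*'|\S*|) status=(?<status>\d+) duration=(?<duration>\d+\.\d+)(?: host='(?<host>[^']+)')?(?: user='(?<user>[^']+)')?(?: params=(?<params>\{.*\}))?(?: agent='(?<agent>[^']*)')?(?: protocol='(?<protocol>[^']+)')?.*$/

# Strip one pair of surrounding single quotes; treat empty values as nil.
def unquote(value)
  return nil if value.nil?
  stripped = value.sub(/\A'(.*)'\z/, '\1')
  stripped.empty? ? nil : stripped
end

# Returns a Hash of typed fields, or nil when the line does not match.
def parse_log_line(line)
  m = LOG_LINE.match(line)
  return nil if m.nil?

  {
    timestamp: m[:timestamp],
    pid:       m[:pid].to_i,
    method:    m[:method],
    path:      m[:path],
    format:    unquote(m[:format]),
    ip:        unquote(m[:ip]),
    status:    m[:status].to_i,
    duration:  m[:duration].to_f,
    host:      m[:host],   # nil when the optional group is absent
    user:      m[:user]
  }
end

lines = [
  "I, [2018-03-23T13:31:23.488928 #3107] INFO -- : method='GET' path='/feed/bc822bc19.csv' format= ip='127.0.0.0' status=200 duration=0.04 host='feeds' user='-' params={} agent='' protocol='http'",
  # Truncated sample line, kept exactly as it appears in the logs above.
  "I, [2018-03-23T13:35:04.486874 #3107] INFO -- : method='GET' path='/feed/34bda5b6f.csv' format= ip='127.0.0.0' status=200 duration=0.33 host='feeds' user...",
  "this line is not in the expected format"
]

lines.each { |line| p parse_log_line(line) }
```

For the truncated line the hash still carries host, while user comes back as nil. The last line yields nil instead of raising.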

Additional Tips

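- If your logs have other fields not covered here, add optional groups following the same pattern as (?: host='(?<host>[^']+)')?, placed in the order the fields appear in the line.
- Unquoted values are matched with \S*, which works because your logs use spaces only to separate key-value pairs. Unquoted values that contain spaces (not present in your samples) would need a different value pattern.
- A truncated line still matches, and every complete field before the cut-off is captured, as long as the cut falls after duration. A line cut off earlier will not match at all, because the fields up to duration are required.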
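If the set of fields keeps changing, one alternative to growing the regex is to scan the payload for key=value pairs. This is a different technique from the single regex above, sketched here under the assumptions that keys are word characters and that values are single-quoted, brace-delimited without nested braces, or free of whitespace. PAIR and parse_payload are hypothetical names.

```ruby
# Alternative sketch: generic key=value scan instead of one fixed regex.
# A value is a single-quoted string, a {...} block without nested braces,
# or a run of non-whitespace characters (possibly empty).
PAIR = /(\w+)=('[^']*'|\{[^}]*\}|\S*)/

def parse_payload(payload)
  payload.scan(PAIR).each_with_object({}) do |(key, raw), fields|
    # Strip one pair of surrounding single quotes, if present.
    fields[key] = raw.sub(/\A'(.*)'\z/, '\1')
  end
end

payload = "method='GET' path='/feed/bc822bc19.csv' format= ip='127.0.0.0' status=200 duration=0.04 host='feeds' user='-' params={} agent='' protocol='http'"

parse_payload(payload).each { |key, value| puts "#{key}: #{value.inspect}" }
```

All values come back as strings, and a truncated fragment such as user... is skipped because it contains no equals sign. The trade-off is that nothing is enforced as required, so validate the fields you depend on after parsing.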

The question discussed here comes from Stack Exchange; it was asked by Lucian Tarna.