Custom Grok regex matcher: writing a regular expression to parse log files
A regex approach for parsing Rails-style logs
Hey there! Let's figure out how to parse these Rails-style log lines with a regex. First, let's take a look at your sample logs to understand their structure:
I, [2018-03-23T13:30:10.076546 #3107] INFO -- : method='HEAD' path='/healthcheck' format='*/*' ip= status=200 duration=0.03
I, [2018-03-23T13:31:23.488928 #3107] INFO -- : method='GET' path='/feed/bc822bc19.csv' format= ip='127.0.0.0' status=200 duration=0.04 host='feeds' user='-' params={} agent='' protocol='http'
I, [2018-03-23T13:31:30.956484 #3107] INFO -- : method='GET' path='/feed/ad4d93bee.csv' format= ip='127.0.0.0' status=200 duration=0.05 host='feeds' user='-' params={} agent='' protocol='http'
I, [2018-03-23T13:32:10.123399 #3107] INFO -- : method='HEAD' path='/healthcheck' format='*/*' ip= status=200 duration=0.03 host='feeds' user='-' params={} agent='' protocol='http'
I, [2018-03-23T13:33:46.362908 #3107] INFO -- : method='GET' path='/feed/e9cbe2f42e0a6.xml' format= ip='127.0.0.0' status=200 duration=0.02 host='feeds' user='-' params={} agent='' protocol='http'
I, [2018-03-23T13:34:10.060682 #3107] INFO -- : method='HEAD' path='/healthcheck' format='*/*' ip= status=200 duration=0.03 host='feeds' user='-' params={} agent='' protocol='http'
I, [2018-03-23T13:35:01.445029 #3107] INFO -- : method='GET' path='/feed/85b91d6f7.xml' format= ip='127.0.0.0' status=200 duration=0.02 host='feeds' user='-' params={} agent='' protocol='http'
I, [2018-03-23T13:35:04.486874 #3107] INFO -- : method='GET' path='/feed/34bda5b6f.csv' format= ip='127.0.0.0' status=200 duration=0.33 host='feeds' user...
Each line splits into two main sections:
- The header: Contains the timestamp, process ID, and log level
- The payload: A set of key-value pairs with request details. Values can be quoted (single quotes), unquoted, empty, or even truncated (like the last line's incomplete user...).
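The header/payload split described above can be explored generically before committing to one large pattern. What follows is a minimal Ruby sketch, not part of the original answer: the names HEADER, PAIR and split_log_line are made up for illustration, and it assumes that unquoted values never contain whitespace.

```ruby
# Hypothetical names; a sketch for exploring the line structure only.
# Header: severity letter, timestamp, process ID, log level, then the payload.
HEADER = /\A(?<severity>[A-Z]), \[(?<timestamp>\S+) #(?<pid>\d+)\]\s+(?<level>[A-Z]+) -- : (?<payload>.*)\z/

# One key=value pair: the value is single-quoted, a {...} block,
# or a (possibly empty) run of non-whitespace characters.
PAIR = /(\w+)=('[^']*'|\{[^}]*\}|\S*)/

def split_log_line(line)
  header = HEADER.match(line.chomp)
  return nil unless header

  # Collect every pair into a hash, stripping the surrounding quotes.
  fields = header[:payload].scan(PAIR).to_h do |key, value|
    [key, value.delete_prefix("'").delete_suffix("'")]
  end

  { timestamp: header[:timestamp], pid: header[:pid], level: header[:level], fields: fields }
end

line = "I, [2018-03-23T13:30:10.076546 #3107] INFO -- : method='HEAD' path='/healthcheck' format='*/*' ip= status=200 duration=0.03"
result = split_log_line(line)
p result[:fields]
```

A scan like this tolerates fields in any order and ignores a trailing fragment such as user..., but it gives up the per-field validation that a single anchored regex provides.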
A practical regex
Here's a comprehensive regex that captures all the key fields in your logs, including optional ones, and handles all the edge cases we see:
^I, \[(?<timestamp>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d+) #(?<pid>\d+)\] INFO -- : method='(?<method>[^']+)' path='(?<path>[^']+)' format=(?<format>'[^']*'|\S*|) ip=(?<ip>'[^']*'|\S*|) status=(?<status>\d+) duration=(?<duration>\d+\.\d+)(?: host='(?<host>[^']+)')?(?: user='(?<user>[^']+)')?(?: params=(?<params>\{.*\}))?(?: agent='(?<agent>[^']*)')?(?: protocol='(?<protocol>[^']+)')?.*$
Breaking down the regex
Let's break down what each part does:
Header Capture:
- (?<timestamp>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d+): Grabs the full ISO 8601 timestamp including microseconds
- (?<pid>\d+): Captures the numeric process ID
Payload Fields:
- Required fields (like method/path): method='(?<method>[^']+)' targets the quoted value reliably (since these always have quotes in your logs)
- Flexible fields (like format/ip): format=(?<format>'[^']*'|\S*|) handles three scenarios: quoted values, unquoted non-whitespace values, or empty entries
- Optional fields (like host/user): Wrapped in (?: ...)? so the regex still matches lines that don't include these fields
- Final .*: Catches truncated content (like the last line's user...) without breaking the rest of the match
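To see the flexible and optional pieces in isolation, here is a small Ruby sketch. The constant names FLEXIBLE and OPTIONAL are illustrative, and the unquoted value csv is a made-up input, since the sample logs only show quoted and empty values for format.

```ruby
# Illustrative fragments cut out of the full pattern; the names are hypothetical.
FLEXIBLE = /format=(?<format>'[^']*'|\S*|) ip=/
OPTIONAL = /duration=(?<duration>\d+\.\d+)(?: host='(?<host>[^']+)')?/

# Quoted and empty values occur in the sample logs; "csv" is an invented unquoted value.
quoted   = FLEXIBLE.match("format='*/*' ip=")[:format]
empty    = FLEXIBLE.match("format= ip='127.0.0.0'")[:format]
unquoted = FLEXIBLE.match("format=csv ip=")[:format]

# The optional group takes part in the match only when the field is present.
with_host    = OPTIONAL.match("duration=0.04 host='feeds'")
without_host = OPTIONAL.match("duration=0.03")

p quoted, empty, unquoted
p with_host[:host], without_host[:host]
```

Note that the quoted alternative keeps the surrounding quotes inside the capture, and that a named group which did not take part in the match comes back as nil in Ruby rather than as an empty string.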
Usage example (Ruby, suited to a Rails environment)
Since these look like Rails logs, here's how you can use this regex in Ruby to extract values:
log_line = "I, [2018-03-23T13:31:23.488928 #3107] INFO -- : method='GET' path='/feed/bc822bc19.csv' format= ip='127.0.0.0' status=200 duration=0.04 host='feeds' user='-' params={} agent='' protocol='http'"
regex = /^I, \[(?<timestamp>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d+) #(?<pid>\d+)\] INFO -- : method='(?<method>[^']+)' path='(?<path>[^']+)' format=(?<format>'[^']*'|\S*|) ip=(?<ip>'[^']*'|\S*|) status=(?<status>\d+) duration=(?<duration>\d+\.\d+)(?: host='(?<host>[^']+)')?(?: user='(?<user>[^']+)')?(?: params=(?<params>\{.*\}))?(?: agent='(?<agent>[^']*)')?(?: protocol='(?<protocol>[^']+)')?.*$/

match = regex.match(log_line)

# Extract and clean values (remove quotes from fields that might have them)
puts "Timestamp: #{match[:timestamp]}"
puts "Process ID: #{match[:pid]}"
puts "HTTP Method: #{match[:method]}"
puts "Request Path: #{match[:path]}"
puts "IP Address: #{match[:ip].delete("'")}"
puts "Status Code: #{match[:status]}"
puts "Duration: #{match[:duration]}s"
puts "Host: #{match[:host]}"
This will output:
Timestamp: 2018-03-23T13:31:23.488928
Process ID: 3107
HTTP Method: GET
Request Path: /feed/bc822bc19.csv
IP Address: 127.0.0.0
Status Code: 200
Duration: 0.04s
Host: feeds
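When the pattern runs over a whole file, some lines may not match at all, and regex.match then returns nil. A possible way to guard against that is sketched below; parse_line is a hypothetical wrapper name, LOG_LINE is the same pattern as above, and the input is the truncated last line from the sample logs.

```ruby
# Same pattern as in the answer, stored in a constant (the constant name is illustrative).
LOG_LINE = /^I, \[(?<timestamp>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d+) #(?<pid>\d+)\] INFO -- : method='(?<method>[^']+)' path='(?<path>[^']+)' format=(?<format>'[^']*'|\S*|) ip=(?<ip>'[^']*'|\S*|) status=(?<status>\d+) duration=(?<duration>\d+\.\d+)(?: host='(?<host>[^']+)')?(?: user='(?<user>[^']+)')?(?: params=(?<params>\{.*\}))?(?: agent='(?<agent>[^']*)')?(?: protocol='(?<protocol>[^']+)')?.*$/

# Hypothetical wrapper: returns a hash of field name => value, or nil for non-matching lines.
def parse_line(line)
  match = LOG_LINE.match(line)
  return nil unless match # avoids NoMethodError when calling match[:...] on nil

  # named_captures uses string keys; optional groups that were skipped stay nil.
  match.named_captures.transform_values { |value| value&.delete("'") }
end

truncated = "I, [2018-03-23T13:35:04.486874 #3107] INFO -- : method='GET' path='/feed/34bda5b6f.csv' format= ip='127.0.0.0' status=200 duration=0.33 host='feeds' user..."
fields = parse_line(truncated)

puts fields["host"]         # complete field before the cut-off
puts fields["user"].inspect # field lost to truncation
puts parse_line("not a log line").inspect
```

Stripping quotes from every capture is a simplification; it would also remove a quote character that legitimately appears inside a value.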
Additional tips
- If your logs have other fields not covered here, just add optional groups following the pattern of (?: host='(?<host>[^']+)')? (use [^']* instead of [^']+ if the value can be empty, as agent is in your samples)
- For unquoted values with spaces (not present in your samples), you'd need to adjust the value patterns, but your logs use spaces as key-value separators, so non-whitespace works for unquoted entries
- Truncated lines still match as long as the cut-off comes after the duration field; every complete field before that point is captured, and a field cut off mid-value is simply left uncaptured
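If the list of optional fields grows, the repeated groups could be generated instead of typed by hand. The sketch below is one possible way to do that in Ruby; optional_groups is a hypothetical helper, referer is an invented field name used only for illustration, and the pattern is shortened to start at duration.

```ruby
# Hypothetical helper: builds one optional quoted-field group per name, in the order given.
def optional_groups(names)
  names.map { |name| "(?: #{name}='(?<#{name}>[^']+)')?" }.join
end

# "referer" is an invented field; it only shows how an extra group would be appended.
tail = Regexp.new("duration=(?<duration>\\d+\\.\\d+)" + optional_groups(%w[host user referer]) + ".*$")

in_order = tail.match("duration=0.04 host='feeds' user='-'")
p in_order[:host], in_order[:user], in_order[:referer]

# The groups are positional: a field that appears before its expected slot is not captured.
swapped = tail.match("duration=0.04 user='-' host='feeds'")
p swapped[:host], swapped[:user]
```

The second match shows a limitation of the optional-group approach in general: it relies on the fields always appearing in the same order, which holds for the sample logs.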
This question comes from Stack Exchange; asked by Lucian Tarna




