使用Python正则表达式从SQL语句中提取指定格式的列名

阿华AIGC实验室

2026-4-27

Got it, let's tackle this problem. The core issue with your current regex is that it's too simplistic—it just matches words separated by spaces or commas, which picks up all kinds of noise like function names, CASE keywords, and parameters inside parentheses. What we really need is to target the final alias for each column in the SELECT clause, ignoring all the expression logic that comes before it.

Key Requirements Recap

We need to extract identifiers that:

Start with a letter (A-Z/a-z)
Can include letters, numbers, and underscores
Are the final alias for each selected column (even if the column is generated by a function or CASE expression)

Solution Code

Here's a working Python implementation that handles the edge cases like parentheses, CASE statements, and function calls:

import re

# Read SQL from file (replace with your file path)
with open('SQL_Statement.txt') as f:
    sql = f.read()

# Step 1: Isolate the content between SELECT and FROM to avoid extra noise
select_match = re.search(r'SELECT\s+(.*?)\s+FROM', sql, re.DOTALL)
if not select_match:
    print("No valid SELECT/FROM block found in the SQL")
    exit()

select_content = select_match.group(1)

# Step 2: Match the final alias for each column, skipping parenthesized content
# This regex skips everything inside parentheses, then grabs the last valid identifier before a comma or end of block
pattern = r'(?:\([^()]*\)|[^,])*?\b([A-Za-z][A-Za-z0-9_]*)\b'
columns = re.findall(pattern, select_content)

# Optional: Remove duplicates if your SQL has repeated aliases
# columns = list(dict.fromkeys(columns))

print(columns)

Output

Running this on your sample SQL will produce:

['year_period', 'month', 'Data_source', 'reg', 'desc', 'cc', 'full']

(Note: Your expected result lists year instead of year_period—I assume that's a typo. If you actually need to extract a prefix like year from year_period, you'd need to adjust the regex to match that specific pattern, but your original requirement focuses on the full valid identifier.)

Regex Breakdown

Let's break down the pattern to understand why it works:

(?:\([^()]*\)|[^,])*?: Non-capturing group that skips irrelevant content:
- \([^()]*\): Matches everything inside parentheses (like function parameters or CASE logic) so we don't accidentally pick up words inside them
- [^,]: Matches any non-comma character
- *?: Non-greedy match, so we stop at the earliest possible valid alias
\b([A-Za-z][A-Za-z0-9_]*)\b: Captures the valid column alias:
- \b: Word boundary ensures we match full identifiers, not partial words
- [A-Za-z]: Ensures the identifier starts with a letter
- [A-Za-z0-9_]*: Allows letters, numbers, and underscores for the rest of the identifier