Can regex handle highlight replacement for nested/overlapping keywords in Python?

阿华AIGC实验室

2026-5-6

How to handle overlapping/nested keyword highlighting with regex in Python

Got it, let's break down your problem. The issue with your current code is that regex's default behavior is to match non-overlapping patterns. When you sort terms by length descending and use | to combine them, longer terms get matched first—but once a part of the string is consumed by a match, overlapping shorter terms can't be matched anymore (like "ame" in case 2, which overlaps with "gam").
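To see the consumption problem in isolation, here is a minimal sketch using the terms from case 2. The variable names `consumed` and `overlapping` are only for illustration. It also shows that a zero-width lookahead, which consumes no characters, reports both overlapping positions.

```python
import re

content = "game_event"

# The alternation consumes "gam", so scanning resumes at "e" and never sees "ame".
consumed = re.findall(r"gam|ame", content)
print(consumed)  # ['gam']

# A lookahead matches an empty string, so every start position is tried;
# group 1 captures the term found at that position.
overlapping = [m.span(1) for m in re.finditer(r"(?=(gam|ame))", content)]
print(overlapping)  # [(0, 3), (1, 4)]
```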

Let's fix this with two approaches, depending on whether you want merged (clean) highlighting or nested highlighting as you mentioned.

Approach 1: Merge overlapping intervals (ideal for clean, non-nested highlighting)

This method collects all the positions where your keywords appear, merges overlapping or adjacent intervals, then applies the <em> tags to the merged ranges. This gives you the "ideal result" you described for both cases.

Here's the code:

import re

def highlight_keywords(content, query_term):
    # Extract valid terms from the query
    terms = re.findall(r'[a-z0-9]+', query_term, re.IGNORECASE)
    if not terms:
        return content
    
    # Collect all match positions (start, end indices)
    matches = []
    for term in terms:
        # Use re.escape to handle special characters in terms
        pattern = re.escape(term)
        for match in re.finditer(pattern, content, flags=re.IGNORECASE):
            matches.append(match.span())
    
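    # Nothing matched: return the content unchanged (avoids an IndexError below)
    if not matches:
        return content
    
    # Sort matches by start position
    matches.sort()
    
    # Merge overlapping or adjacent intervals
    merged_intervals = [matches[0]]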
    for current_start, current_end in matches[1:]:
        last_start, last_end = merged_intervals[-1]
        if current_start <= last_end:
            # Overlap or adjacent, merge the intervals
            merged_intervals[-1] = (last_start, max(last_end, current_end))
        else:
            merged_intervals.append((current_start, current_end))
    
    # Build the highlighted content
    result = []
    prev_end = 0
    for start, end in merged_intervals:
        # Add the non-highlighted part before this interval
        result.append(content[prev_end:start])
        # Add the highlighted part
        result.append(f'<em>{content[start:end]}</em>')
        prev_end = end
    # Add any remaining non-highlighted content
    result.append(content[prev_end:])
    
    return ''.join(result)

# Test Case 1
content1 = "staging_datastorage"
query1 = "st ta ag"
print(highlight_keywords(content1, query1))
# Output: <em>stag</em>ing_da<em>tast</em>or<em>ag</em>e

# Test Case 2
content2 = "game_event"
query2 = "gam ame"
print(highlight_keywords(content2, query2))
# Output: <em>game</em>_event

How this works:

Extract terms: We first pull out all valid alphanumeric terms from your query.
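Collect matches: For each term, we find every occurrence and record its start/end positions. Because each term is searched separately, matches of different terms may overlap each other. Note that re.finditer still skips occurrences of the same term that overlap themselves (for example, the second "aba" in "ababa").
Merge intervals: We sort the positions and merge any overlapping or adjacent ranges. The spans are half-open, so "gam" (0, 3) and "ame" (1, 4) become a single range (0, 4), covering the entire "game" string.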
Build result: We construct the final string by adding non-highlighted parts, then highlighted parts for each merged interval.
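If a single term can overlap itself, the collection step can use a lookahead so that no occurrence is skipped. The following is a sketch under that assumption, not part of the function above; `find_all_spans` and `merge_spans` are hypothetical helper names.

```python
import re

def find_all_spans(content, terms):
    """Return sorted (start, end) spans for every occurrence of every term,
    including occurrences of one term that overlap each other."""
    spans = []
    for term in terms:
        # The lookahead itself matches an empty string, so the scan advances
        # one character at a time; group 1 holds the actual term text.
        pattern = re.compile(f"(?=({re.escape(term)}))", re.IGNORECASE)
        spans.extend(m.span(1) for m in pattern.finditer(content))
    return sorted(spans)

def merge_spans(spans):
    """Merge sorted half-open spans that overlap or touch."""
    merged = []
    for start, end in spans:
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

print([m.span() for m in re.finditer("aba", "ababa")])            # [(0, 3)]
print(merge_spans(find_all_spans("ababa", ["aba"])))              # [(0, 5)]
print(merge_spans(find_all_spans("game_event", ["gam", "ame"])))  # [(0, 4)]
```

Starting `merge_spans` from an empty list also means an input with no matches needs no special case.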

Approach 2: Allow nested highlighting (if you want overlapping tags)

If you specifically want nested tags (one <em> inside another), we can use a different approach that applies the highlighting repeatedly, so that a term can still be wrapped when it sits inside text that is already highlighted. Note that each pass inserts tags into the string, so the pattern must skip matches that an earlier pass has already wrapped; otherwise the loop would keep wrapping the same text forever.

Here's the code:

import re

def nested_highlight_keywords(content, query_term):
    terms = re.findall(r'[a-z0-9]+', query_term, re.IGNORECASE)
    if not terms:
        return content
    
    # Sort terms by length descending to prioritize longer matches first (even in nested cases)
    terms.sort(key=len, reverse=True)
    
    # Build one alternation of all terms.
    # (?<!<em>) skips a term that sits directly after an opening tag, i.e. one
    # that an earlier pass already wrapped; without it the loop would never end.
    # Caveat: terms that occur in the tag text itself (such as "em") are not protected.
    pattern = re.compile(
        r'(?<!<em>)(?!</em>)({})'.format('|'.join(re.escape(t) for t in terms)),
        re.IGNORECASE
    )
    
    # Repeat the substitution until the string stops changing
    while True:
        new_content = pattern.sub(r'<em>\1</em>', content)
        if new_content == content:
            break
        content = new_content
    
    return content

# Test Case 2: "gam" and "ame" overlap only partially, so they cannot nest
content2 = "game_event"
query2 = "gam ame"
print(nested_highlight_keywords(content2, query2))
# Output: <em>gam</em>e_event

# "am" lies inside "gam", so it gets an inner tag
print(nested_highlight_keywords("game_event", "gam am"))
# Output: <em>g<em>am</em></em>e_event

How this works:

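We use a negative lookbehind so that a term sitting directly after an opening <em> tag, meaning one we have already wrapped, is not wrapped again.
We loop the replacement until no more changes happen, which lets a term that lies inside an already highlighted term receive its own inner tag.
Partially overlapping terms such as "gam" and "ame" cannot nest: once "gam" is wrapped, "ame" no longer appears as a contiguous substring. Use Approach 1 for those.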
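One way to make sure the substitution never touches tag text is to split the string on tags and substitute only in the text pieces, one pass per term. This is a minimal sketch of that idea, assuming the content has no angle brackets of its own; `nested_highlight_by_segments` and `TAG` are names made up for illustration.

```python
import re

# Assumption: the only "<...>" sequences in the string are tags we inserted.
TAG = re.compile(r"(<[^>]+>)")

def nested_highlight_by_segments(content, query_term):
    # Deduplicate terms case-insensitively, then try longer terms first.
    found = re.findall(r"[a-z0-9]+", query_term, re.IGNORECASE)
    terms = list(dict.fromkeys(t.lower() for t in found))
    terms.sort(key=len, reverse=True)

    for term in terms:
        pattern = re.compile(re.escape(term), re.IGNORECASE)
        pieces = TAG.split(content)
        # With one capturing group in the split pattern, even indices are
        # plain text and odd indices are the tags themselves.
        for i in range(0, len(pieces), 2):
            pieces[i] = pattern.sub(lambda m: f"<em>{m.group(0)}</em>", pieces[i])
        content = "".join(pieces)
    return content

print(nested_highlight_by_segments("game_event", "gam am"))
# <em>g<em>am</em></em>e_event
print(nested_highlight_by_segments("gem", "em"))
# g<em>em</em>
```

Because each term gets exactly one pass, there is no loop to terminate, and a term like "em" cannot match inside a tag name.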

Why your original code failed

Your original regex uses re.sub with a combined pattern of sorted terms. Once "gam" is matched in case 2, those characters are consumed by the regex engine, so "ame" (which overlaps with "gam") can't be matched anymore. The interval merging approach avoids this by tracking all positions first, instead of relying on regex's non-overlapping matching behavior.

The question comes from Stack Exchange; it was asked by Xu Wang.