Can regex in Python handle highlight replacement for nested/overlapping keywords?
Let's break down your problem. Your current code fails because re.sub only ever reports non-overlapping matches. When you sort the terms by length (descending) and join them with |, longer terms are tried first, but once part of the string has been consumed by a match, a shorter term that overlaps it can no longer match (like "ame" in case 2, which overlaps with "gam").
Let's fix this with two approaches, depending on whether you want merged (clean) highlighting or nested highlighting as you mentioned.
Approach 1: Merge overlapping intervals (ideal for clean, non-nested highlighting)
This method collects all the positions where your keywords appear, merges overlapping or adjacent intervals, then applies the <em> tags to the merged ranges. This gives you the "ideal result" you described for both cases.
Here's the code:
```python
import re

def highlight_keywords(content, query_term):
    # Extract valid terms from the query
    terms = re.findall(r'[a-z0-9]+', query_term, re.IGNORECASE)
    if not terms:
        return content

    # Collect all match positions as (start, end) spans, end exclusive
    matches = []
    for term in terms:
        # Use re.escape to handle special characters in terms
        pattern = re.escape(term)
        for match in re.finditer(pattern, content, flags=re.IGNORECASE):
            matches.append(match.span())

    # Nothing matched: return the content unchanged
    # (otherwise matches[0] below would raise IndexError)
    if not matches:
        return content

    # Sort matches by start position
    matches.sort()

    # Merge overlapping or adjacent intervals
    merged_intervals = [matches[0]]
    for current_start, current_end in matches[1:]:
        last_start, last_end = merged_intervals[-1]
        if current_start <= last_end:
            # Overlap or adjacent, merge the intervals
            merged_intervals[-1] = (last_start, max(last_end, current_end))
        else:
            merged_intervals.append((current_start, current_end))

    # Build the highlighted content
    result = []
    prev_end = 0
    for start, end in merged_intervals:
        # Add the non-highlighted part before this interval
        result.append(content[prev_end:start])
        # Add the highlighted part
        result.append(f'<em>{content[start:end]}</em>')
        prev_end = end
    # Add any remaining non-highlighted content
    result.append(content[prev_end:])
    return ''.join(result)

# Test Case 1
content1 = "staging_datastorage"
query1 = "st ta ag"
print(highlight_keywords(content1, query1))
# Output: <em>stag</em>ing_da<em>tast</em>or<em>ag</em>e

# Test Case 2
content2 = "game_event"
query2 = "gam ame"
print(highlight_keywords(content2, query2))
# Output: <em>game</em>_event
```
How this works:
- Extract terms: We first pull out all valid alphanumeric terms from your query.
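- Collect matches: For each term, we find its occurrences with re.finditer and record their (start, end) spans. Because each term is searched separately, matches of different terms may overlap each other. Occurrences of the same term that overlap themselves are still found only at non-overlapping positions, so "aa" in "aaa" yields just one span.
- Merge intervals: We sort the spans and merge any that overlap or touch, so "gam" (0, 3) and "ame" (1, 4) become a single span (0, 4), covering the whole of "game". The end index is exclusive, as in Python slicing.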
- Build result: We construct the final string by adding non-highlighted parts, then highlighted parts for each merged interval.
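One caveat with the collection step: re.finditer never returns two overlapping matches of the same pattern, so a term that overlaps itself is found only once per stretch. A minimal sketch of a workaround, assuming a hypothetical helper named find_overlapping_spans (not part of the code above), wraps the term in a zero-width lookahead with a capture group so the scan advances one character at a time:

```python
import re

def find_overlapping_spans(term, content):
    """Return (start, end) spans of every occurrence of term,
    including occurrences that overlap each other.

    Hypothetical helper for illustration only.
    """
    # The lookahead consumes no characters, so the search resumes at the
    # next position; group 1 still records where the term itself sits.
    pattern = re.compile(f'(?=({re.escape(term)}))', re.IGNORECASE)
    return [match.span(1) for match in pattern.finditer(content)]

print(find_overlapping_spans("aa", "aaaa"))
# [(0, 2), (1, 3), (2, 4)]
print([m.span() for m in re.finditer("aa", "aaaa")])
# [(0, 2), (2, 4)]
```

Feeding these spans into the same sort-and-merge step would highlight the full run of a self-overlapping term.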
Approach 2: Allow nested highlighting (if you want overlapping tags)
If you specifically want nested tags like <em>g<em>am</em>e</em>_event, repeatedly running re.sub over already-tagged text is fragile: the inserted tags become part of the searched string, and "gam" and "ame" overlap only partially, so neither can be wrapped around the other as a whole. A more robust approach is to work on positions in the original string: record where each match starts and ends, then emit an opening tag at every start and a closing tag at every end. Because all the tags are the same element, the output is always balanced, and characters covered by several terms end up nested more deeply.
Here's the code:
```python
import re

def nested_highlight_keywords(content, query_term):
    terms = re.findall(r'[a-z0-9]+', query_term, re.IGNORECASE)
    if not terms:
        return content

    # Count how many matches open and close at each position of the
    # original string, so inserted tags are never searched themselves.
    opens = [0] * (len(content) + 1)
    closes = [0] * (len(content) + 1)
    # Deduplicate terms case-insensitively to avoid double-wrapping
    for term in {t.lower() for t in terms}:
        for match in re.finditer(re.escape(term), content, flags=re.IGNORECASE):
            opens[match.start()] += 1
            closes[match.end()] += 1

    # At each position: closing tags first, then opening tags, then the character
    result = []
    for i, char in enumerate(content):
        result.append('</em>' * closes[i] + '<em>' * opens[i] + char)
    # Close any matches that end at the very end of the string
    result.append('</em>' * closes[len(content)])
    return ''.join(result)

# Test Case 2
content2 = "game_event"
query2 = "gam ame"
print(nested_highlight_keywords(content2, query2))
# Output: <em>g<em>am</em>e</em>_event
```
How this works:
- We search only the original, untagged string, so the inserted <em> tags can never be matched by a term.
- At each position we emit the closing tags for matches that end there, then the opening tags for matches that start there, so overlapping terms produce nested tags and adjacent terms produce separate ones.
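To sanity-check what a nested result should look like, it can help to count how many term occurrences cover each character; the nesting depth of the <em> tags around a character should equal that count. A rough sketch, assuming a hypothetical helper named coverage_depth:

```python
import re

def coverage_depth(content, terms):
    """For each character of content, count how many term occurrences
    cover it. Hypothetical helper for illustration only.
    """
    depth = [0] * len(content)
    for term in terms:
        for match in re.finditer(re.escape(term), content, flags=re.IGNORECASE):
            for i in range(match.start(), match.end()):
                depth[i] += 1
    return depth

print(coverage_depth("game_event", ["gam", "ame"]))
# [1, 2, 2, 1, 0, 0, 0, 0, 0, 0]
```

Here "a" and "m" sit at depth 2, so they are the characters a nested result wraps twice, while "g" and "e" are wrapped once.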
Why your original code failed
Your original regex uses re.sub with a combined pattern of sorted terms. Once "gam" is matched in case 2, those characters are consumed by the regex engine, so "ame" (which overlaps with "gam") can't be matched anymore. The interval merging approach avoids this by tracking all positions first, instead of relying on regex's non-overlapping matching behavior.
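The original code is not reproduced here, so the following is only a reconstruction under that assumption: terms sorted by length, joined with |, and passed to re.sub in a single pass. The function name single_pass_highlight is made up for the illustration. It shows the failure on case 2:

```python
import re

def single_pass_highlight(content, query_term):
    """Assumed shape of the original approach: one alternation, one re.sub pass."""
    terms = sorted(
        re.findall(r'[a-z0-9]+', query_term, re.IGNORECASE),
        key=len,
        reverse=True,
    )
    if not terms:
        return content
    pattern = re.compile('|'.join(re.escape(t) for t in terms), re.IGNORECASE)
    # Each character can belong to at most one match, so once "gam" is
    # consumed the scan resumes at "e" and "ame" is never seen.
    return pattern.sub(lambda m: f'<em>{m.group(0)}</em>', content)

print(single_pass_highlight("game_event", "gam ame"))
# <em>gam</em>e_event
```

The trailing "e" stays unhighlighted, which is exactly the gap that collecting spans per term and merging them closes.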
The question comes from Stack Exchange; asked by Xu Wang.




