Handling multi-line regex patterns in Python and fixing failed multi-company-name substitution
Hey there, let's figure out why your regex substitution isn't working and fix it properly—especially for those company names with spaces.
First, why your initial code failed
Your first regex """ A | B | C | D """ has extra spaces around each letter. That means it's looking for " A " (with spaces before and after) instead of just "A". Since your text has "A is..." (no space before A) and "so is B" (no space after B), the regex can't find any matches. That's why the substitution does nothing.
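If you want to see the failure for yourself, here is a minimal sketch. The sample sentence is an assumption standing in for your real text; only the spaced pattern comes from your code.

```python
import re

# Hypothetical sample text standing in for the original input.
text = "A is a great company, so is B"

# The spaced pattern: every alternative includes its surrounding spaces.
spaced = re.compile(""" A | B | C | D """)
# Nothing matches: "A" has no space before it and "B" has none after it.
unchanged = spaced.sub("company", text)

# The same alternation without the stray spaces.
tight = re.compile("A|B|C|D")
replaced = tight.sub("company", text)

print(unchanged)  # same as the input
print(replaced)
```

The first call returns the input untouched, while the second one replaces both letters.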
Fixing the core issue & handling spaced company names
When dealing with company names that include spaces (like "Berkshire Hathaway"), you need to avoid two common pitfalls: extra spaces in your regex, and shorter names accidentally matching parts of longer ones. Here's a step-by-step solution:
1. Clean up your company name list
First, trim any accidental leading/trailing spaces from each company name. For example, your Australia & New Zealand Bank has a trailing space—remove that, otherwise the regex will look for that extra space which might not exist in your text.
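A small sketch of the trimming step, assuming a made-up two-entry list and sample sentence. It shows the untrimmed name failing to match once the text has punctuation where the stray space would be.

```python
import re

# Hypothetical list; the second entry carries an accidental trailing space.
raw_names = ["Wells Fargo", "Australia & New Zealand Bank "]
text = "Australia & New Zealand Bank."

# With the trailing space kept, the pattern needs "Bank " but the text has "Bank.".
untrimmed = re.compile("|".join(re.escape(n) for n in raw_names))
found_before = untrimmed.search(text) is not None

# str.strip() removes leading and trailing whitespace only; inner spaces stay.
clean_names = [n.strip() for n in raw_names]
trimmed = re.compile("|".join(re.escape(n) for n in clean_names))
found_after = trimmed.search(text) is not None

print(found_before, found_after)
```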
2. Sort names by length (longest first)
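If one name is a prefix of another (e.g., "AIG" and "AIG Group"), order matters: the regex engine tries the alternatives left to right and takes the first one that matches, so a shorter name listed first will match and leave the rest of the longer name behind. Sorting from longest to shortest ensures the full name is tried first.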
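Here is a short illustration of why the ordering matters, using the "AIG" / "AIG Group" pair from above. The sample sentence is invented for the demo.

```python
import re

# Hypothetical sample sentence.
text = "AIG Group reported earnings."

# Shorter name listed first: "AIG" wins and " Group" is left behind.
short_first = re.sub("AIG|AIG Group", "company", text)

# Longest first: the full name is tried before its prefix.
names = sorted(["AIG", "AIG Group"], key=len, reverse=True)
long_first = re.sub("|".join(re.escape(n) for n in names), "company", text)

print(short_first)
print(long_first)
```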
3. Compile the regex correctly
Use re.escape() so that any characters with special regex meaning in your company names (such as . or parentheses) are matched literally, then join the names with |. Here's the working code:
    import re

    # Your cleaned-up company name list
    company_names = [
        "Berkshire Hathaway",
        "Australia & New Zealand Bank",
        "Ind & Comm Bank of China",
        "BNP Paribas",
        "Wells Fargo",
        "AIG"
    ]

    # Sort names by length descending to prioritize longer matches
    sorted_names = sorted(company_names, key=lambda x: -len(x))

    # Build the regex pattern, escaping special characters in each name
    pattern = "|".join(re.escape(name) for name in sorted_names)

    # Compile the regex
    company_re = re.compile(pattern)

    # Test it with sample text
    text = "Berkshire Hathaway is a great company, so is Australia & New Zealand Bank and AIG."
    result = company_re.sub("company", text)
    print(result)
    # Output: company is a great company, so is company and company.
4. Optional: Avoid partial matches
If you want to make sure you don't accidentally match parts of other words (e.g., if your text has "XAIGY" and you don't want to replace "AIG" in it), use lookarounds so that a company name only matches when it is not directly next to a letter, digit, or underscore. Lookarounds don't consume any text, so the surrounding spaces and punctuation are left alone and the replacement can stay a plain string:
    # Wrap the alternation in negative lookarounds: no word character before or after
    names_alt = "|".join(re.escape(name) for name in sorted_names)
    pattern = r"(?<!\w)(?:{})(?!\w)".format(names_alt)
    company_re = re.compile(pattern)

    # Lookarounds are zero-width, so no backreferences are needed
    result = company_re.sub("company", text)
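To check the no-partial-match behaviour on its own, here is a self-contained sketch built on the negative lookarounds (?<!\w) and (?!\w). The shortened name list and the sample sentence (including "XAIGY") are made up for the test.

```python
import re

# Hypothetical short list and sample text for the check.
names = ["Berkshire Hathaway", "AIG"]
alternation = "|".join(
    re.escape(n) for n in sorted(names, key=len, reverse=True)
)

# (?<!\w) / (?!\w): the match must not touch a letter, digit, or underscore.
# Lookarounds consume nothing, so nearby spaces and punctuation stay put.
bounded = re.compile(r"(?<!\w)(?:" + alternation + r")(?!\w)")

text = "XAIGY is not AIG, and neither is Berkshire Hathaway."
result = bounded.sub("company", text)
print(result)
```

I went with (?<!\w) and (?!\w) rather than requiring whitespace on both sides because names in real text are often followed by a comma or a period, and a whitespace-only check would skip those.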
Using re.VERBOSE for readability
If you prefer writing your regex across multiple lines (like your initial """ """ approach), use the re.VERBOSE flag, but note that unescaped whitespace in the pattern is ignored in this mode. Escape spaces in company names with \ to keep them intact, and make the pattern a raw string (r"""...""") so Python passes those backslashes through to the regex engine unchanged:
    pattern_verbose = r"""
        Berkshire\ Hathaway|
        Australia\ &\ New\ Zealand\ Bank|
        Ind\ &\ Comm\ Bank\ of\ China|
        BNP\ Paribas|
        Wells\ Fargo|
        AIG
    """
    company_re_verbose = re.compile(pattern_verbose, re.VERBOSE)

    # This works exactly like the earlier version
    result = company_re_verbose.sub("company", text)
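If you'd rather not escape the spaces by hand, one option is to generate the verbose pattern from the list. This is only a sketch with a shortened, hypothetical name list and sample sentence; it relies on re.escape() backslash-escaping spaces, which is what keeps them significant under re.VERBOSE.

```python
import re

# Hypothetical shortened list.
names = ["Berkshire Hathaway", "Australia & New Zealand Bank", "AIG"]

# re.escape() backslash-escapes spaces, so they survive re.VERBOSE.
# One alternative per line keeps the printed pattern readable.
pattern_verbose = "\n| ".join(re.escape(n) for n in names)
company_re_verbose = re.compile(pattern_verbose, re.VERBOSE)

text = "Berkshire Hathaway and Australia & New Zealand Bank compete with AIG."
result = company_re_verbose.sub("company", text)
print(result)
```

The newline and the space around each | are unescaped, so verbose mode drops them, while the escaped spaces inside each name are matched literally.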
That should fix your substitution issues completely!
The question comes from Stack Exchange; asked by Lei Hao.




