如何在Pandas DataFrame中高效匹配关键词并返回具体匹配词
Efficiently Extract Matched Keywords from Pandas DataFrame Column
Got it, let's fix this without sacrificing speed—we can keep the efficiency of vectorized operations while capturing the exact keywords that match each row.
The problem with your iterows() approach is that it’s looping through every row in pure Python, which is glacial for large datasets. Instead, we can use Pandas’ built-in vectorized string methods (the same engine that makes str.contains() fast) combined with regex to extract all matching keywords in one go.
Step-by-Step Solution
- Clean up your keyword list: First, remove duplicates from
wordList(you’ve got repeats likeStringSixandStringSeven) to avoid redundant matches. - Build a safe regex pattern: Escape any special characters in your keywords (like
.,*, or+) to ensure they match literally, then join them into a regex pattern. - Extract matches with
str.findall: This vectorized method finds all keywords present in eachCASE_DESCRIPTIONentry. - Format results: Convert the list of matches into a comma-separated string, and handle empty matches if needed.
Here’s the code:
import pandas as pd import re # Your existing data and keyword list # case_content = pd.DataFrame(...) # wordList = ['StringOne','StringTwo','StringThree','StringFour','StringFive','StringSix','StringSeven','StringSix','StringSeven','StringEight','StringNine','StringTen'] # Step 1: Remove duplicate keywords and escape special characters unique_keywords = list(set(wordList)) safe_pattern = '|'.join([re.escape(word) for word in unique_keywords]) # Step 2: Extract all matching keywords for each row case_content['wordFound'] = case_content['CASE_DESCRIPTION'].str.findall( safe_pattern, flags=re.IGNORECASE # Remove this line if you need case-sensitive matching ) # Step 3: Convert lists to comma-separated strings, handle no matches case_content['wordFound'] = case_content['wordFound'].apply( lambda matches: ', '.join(sorted(set(matches))) # Sort for consistency (optional) ) case_content['wordFound'] = case_content['wordFound'].replace('', 'no match') # Replace empty strings if desired
Why This Works So Fast
str.findallis a vectorized operation, meaning it processes the entire column at once using optimized C-backed code—just likestr.contains(). Your 20k+ rows will be done in seconds, not minutes.- Using
set()twice ensures you don’t get duplicate entries in yourwordFoundcolumn. re.escapeprotects you from accidental regex behavior if your keywords include special characters (e.g., a keyword likeString.Onewould match "StringXOne" without escaping).
Optional Tweaks
- Skip the
sorted()call if you don’t care about the order of matched keywords to save a tiny bit of time. - Delete
flags=re.IGNORECASEif case sensitivity is required.
内容的提问来源于stack exchange,提问作者Sand20




