如何在Pandas DataFrame中高效匹配关键词并返回具体匹配词

阿华AIGC实验室

2026-5-8

Efficiently Extract Matched Keywords from Pandas DataFrame Column

Got it, let's fix this without sacrificing speed—we can keep the efficiency of vectorized operations while capturing the exact keywords that match each row.

The problem with your iterows() approach is that it’s looping through every row in pure Python, which is glacial for large datasets. Instead, we can use Pandas’ built-in vectorized string methods (the same engine that makes str.contains() fast) combined with regex to extract all matching keywords in one go.

Step-by-Step Solution

Clean up your keyword list: First, remove duplicates from wordList (you’ve got repeats like StringSix and StringSeven) to avoid redundant matches.
Build a safe regex pattern: Escape any special characters in your keywords (like ., *, or +) to ensure they match literally, then join them into a regex pattern.
Extract matches with str.findall: This vectorized method finds all keywords present in each CASE_DESCRIPTION entry.
Format results: Convert the list of matches into a comma-separated string, and handle empty matches if needed.

Here’s the code:

import pandas as pd
import re

# Your existing data and keyword list
# case_content = pd.DataFrame(...)
# wordList = ['StringOne','StringTwo','StringThree','StringFour','StringFive','StringSix','StringSeven','StringSix','StringSeven','StringEight','StringNine','StringTen']

# Step 1: Remove duplicate keywords and escape special characters
unique_keywords = list(set(wordList))
safe_pattern = '|'.join([re.escape(word) for word in unique_keywords])

# Step 2: Extract all matching keywords for each row
case_content['wordFound'] = case_content['CASE_DESCRIPTION'].str.findall(
    safe_pattern,
    flags=re.IGNORECASE  # Remove this line if you need case-sensitive matching
)

# Step 3: Convert lists to comma-separated strings, handle no matches
case_content['wordFound'] = case_content['wordFound'].apply(
    lambda matches: ', '.join(sorted(set(matches)))  # Sort for consistency (optional)
)
case_content['wordFound'] = case_content['wordFound'].replace('', 'no match')  # Replace empty strings if desired

Why This Works So Fast

str.findall is a vectorized operation, meaning it processes the entire column at once using optimized C-backed code—just like str.contains(). Your 20k+ rows will be done in seconds, not minutes.
Using set() twice ensures you don’t get duplicate entries in your wordFound column.
re.escape protects you from accidental regex behavior if your keywords include special characters (e.g., a keyword like String.One would match "StringXOne" without escaping).

Optional Tweaks

Skip the sorted() call if you don’t care about the order of matched keywords to save a tiny bit of time.
Delete flags=re.IGNORECASE if case sensitivity is required.

内容的提问来源于stack exchange，提问作者Sand20