
How to Efficiently Match Keywords in a Pandas DataFrame and Return the Specific Matched Words

Efficiently Extract Matched Keywords from Pandas DataFrame Column

Got it, let's fix this without sacrificing speed: we can keep the efficiency of column-level string operations while capturing the exact keywords that match each row.

The problem with your iterrows() approach is that it builds a Series object for every row and runs the matching logic in a pure-Python loop, which is very slow on large datasets. Instead, we can use Pandas' built-in string methods (the same .str accessor that provides str.contains()) combined with a regex to extract all matching keywords for the whole column in one call.
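To make the contrast concrete, here is a minimal sketch on made-up data. The `df` frame, the `keywords` list, and both helper function names are hypothetical and not taken from the original question. It only shows that the row loop and the column-level call return the same matches; it makes no timing claim.

```python
import re

import pandas as pd

# Hypothetical toy data, not from the original question
df = pd.DataFrame(
    {"CASE_DESCRIPTION": ["StringOne then StringTwo", "nothing here", "StringTwo only"]}
)
keywords = ["StringOne", "StringTwo"]
pattern = "|".join(re.escape(k) for k in keywords)


def matches_with_iterrows(frame):
    """Row-by-row version: iterrows() builds a Series for every row."""
    found = []
    for _, row in frame.iterrows():
        found.append(re.findall(pattern, row["CASE_DESCRIPTION"]))
    return found


def matches_with_findall(frame):
    """Column-level version: a single call over the whole column."""
    return frame["CASE_DESCRIPTION"].str.findall(pattern).tolist()


loop_result = matches_with_iterrows(df)
column_result = matches_with_findall(df)
print(loop_result == column_result)  # True
```

Both versions apply the same regex, so switching to `str.findall` changes the speed, not the result.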

Step-by-Step Solution

  1. Clean up your keyword list: First, remove duplicates from wordList (you’ve got repeats like StringSix and StringSeven) to avoid redundant matches.
  2. Build a safe regex pattern: Escape any special characters in your keywords (like ., *, or +) to ensure they match literally, then join them into a regex pattern.
  3. Extract matches with str.findall: This vectorized method finds all keywords present in each CASE_DESCRIPTION entry.
  4. Format results: Convert the list of matches into a comma-separated string, and handle empty matches if needed.

Here’s the code:

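import re

import pandas as pd

# Your existing data and keyword list
# case_content = pd.DataFrame(...)
# wordList = ['StringOne','StringTwo','StringThree','StringFour','StringFive','StringSix','StringSeven','StringSix','StringSeven','StringEight','StringNine','StringTen']

# Step 1: Remove duplicate keywords and escape special characters.
# Longest keywords go first so a keyword that is a prefix of another
# (e.g. 'String' vs 'StringOne') cannot shadow the longer one in the alternation.
unique_keywords = sorted(set(wordList), key=lambda word: (-len(word), word))
safe_pattern = '|'.join(re.escape(word) for word in unique_keywords)

# Step 2: Extract all matching keywords for each row.
# fillna('') stops missing descriptions from producing NaN, which would break Step 3.
case_content['wordFound'] = case_content['CASE_DESCRIPTION'].fillna('').str.findall(
    safe_pattern,
    flags=re.IGNORECASE  # Remove this line if you need case-sensitive matching
)

# Step 3: Convert lists to comma-separated strings, handle no matches.
# With IGNORECASE, 'stringone' and 'StringOne' are different strings, so map each
# match back to the keyword's original spelling before de-duplicating.
# (For case-sensitive matching, plain set(matches) is enough.)
canonical = {word.lower(): word for word in unique_keywords}
case_content['wordFound'] = case_content['wordFound'].apply(
    lambda matches: ', '.join(sorted({canonical.get(m.lower(), m) for m in matches}))  # Sort for consistency (optional)
)
case_content['wordFound'] = case_content['wordFound'].replace('', 'no match')  # Replace empty strings if desired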
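
As a quick sanity check, the sketch below runs the same steps on a few made-up rows, including a missing value and mixed-case text. The sample rows, the shortened `wordList`, and the `find_keywords` helper are assumptions for illustration only; your real data will differ.

```python
import re

import pandas as pd

# Hypothetical sample data standing in for the real case_content and wordList
case_content = pd.DataFrame(
    {
        "CASE_DESCRIPTION": [
            "Customer reported StringOne and stringtwo",
            "No keywords in this row",
            None,
            "StringTwo, StringTwo, StringTen",
        ]
    }
)
wordList = ["StringOne", "StringTwo", "StringTen", "StringTwo"]


def find_keywords(frame, words, column="CASE_DESCRIPTION"):
    """Return one comma-separated string of matched keywords per row."""
    unique = sorted(set(words), key=lambda w: (-len(w), w))
    pattern = "|".join(re.escape(w) for w in unique)
    # Map lower-cased spellings back to the keyword as written in the list
    canonical = {w.lower(): w for w in unique}
    found = frame[column].fillna("").str.findall(pattern, flags=re.IGNORECASE)
    joined = found.apply(
        lambda ms: ", ".join(sorted({canonical.get(m.lower(), m) for m in ms}))
    )
    return joined.replace("", "no match")


case_content["wordFound"] = find_keywords(case_content, wordList)
print(case_content["wordFound"].tolist())
# ['StringOne, StringTwo', 'no match', 'no match', 'StringTen, StringTwo']
```

The missing description and the row without keywords both end up as 'no match', and the repeated StringTwo appears only once.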

Why This Works So Fast

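  • str.findall applies the compiled regex to the whole column in a single call, so you avoid the per-row Series construction and Python-level loop overhead of iterrows(), just as with str.contains(). Your 20k+ rows should be done in seconds, not minutes, though the exact time depends on text length and the number of keywords.
  • Using set() twice (once on the keyword list, once on each row's matches) keeps duplicate entries out of your wordFound column. Note that with re.IGNORECASE, matches that differ only in case (e.g. stringone and StringOne) count as different strings unless you normalize them first.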
  • re.escape protects you from accidental regex behavior if your keywords include special characters (e.g., a keyword like String.One would match "StringXOne" without escaping).
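
The escaping point can be checked with the standard library alone. This is a small sketch; the `keyword` and `text` values are made up for illustration.

```python
import re

keyword = "String.One"  # hypothetical keyword containing a regex metacharacter
text = "StringXOne and String.One"

# Unescaped: '.' matches any character, so 'StringXOne' is a false positive
unescaped = re.findall(keyword, text)

# Escaped: '.' is matched literally
escaped = re.findall(re.escape(keyword), text)

print(unescaped)  # ['StringXOne', 'String.One']
print(escaped)    # ['String.One']
```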

Optional Tweaks

  • Skip the sorted() call if you don’t care about the order of matched keywords to save a tiny bit of time.
  • Delete flags=re.IGNORECASE if case sensitivity is required.
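
To see what the case-sensitivity switch changes in practice, here is a small sketch on a single made-up description (the `descriptions` Series is hypothetical). It also shows why case-insensitive matches can look like duplicates: findall returns the text as it appears in the row, not the keyword as you wrote it.

```python
import re

import pandas as pd

# Hypothetical one-row example
descriptions = pd.Series(["stringone and StringOne"])
pattern = re.escape("StringOne")

insensitive = descriptions.str.findall(pattern, flags=re.IGNORECASE).tolist()
sensitive = descriptions.str.findall(pattern).tolist()

print(insensitive)  # [['stringone', 'StringOne']]
print(sensitive)    # [['StringOne']]
```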

The question in this post comes from Stack Exchange; it was asked by Sand20.
