You need to enable JavaScript to run this app.
最新活动
大模型
产品
解决方案
定价
生态与合作
支持与服务
开发者
了解我们

Python跨DataFrame企业名称匹配技术问询:基于相似词汇识别的文本匹配需求

Partial Company Name Matching Between Two DataFrames

Hey there! Since you're new to text processing and NLP, let's walk through how to solve this partial name matching problem—this is a super common task in data cleaning, so you're in good company.

First, let's break down the core issue: your company names have inconsistencies like different suffixes (PLC vs plc comp), special characters (/,-,{}), and partial wording (Etihad company vs ETIHAD/LUFTHANSA/HD LLC). Direct exact matches won't work here, so we need two key steps: text preprocessing to standardize names, then fuzzy/partial matching to find overlaps.


Step 1: Standardize the Company Names (Preprocessing)

First, we'll clean both columns to remove noise that throws off matching. Here's a simple preprocessing function that handles most common issues:

import pandas as pd
from fuzzywuzzy import fuzz, process  # We'll use this for fuzzy matching later

def clean_company_name(name):
    # 1. Convert all text to lowercase to avoid case sensitivity
    cleaned = name.lower()
    # 2. Remove special characters (keep only letters, spaces, numbers)
    cleaned = ''.join([char for char in cleaned if char.isalnum() or char == ' '])
    # 3. Standardize common business suffixes (replace with empty string to focus on core names)
    common_suffixes = [
        'plc', 'corp', 'llc', 'ltd', 'corporation', 'company', 
        'comp', 'group', 'studio', 'entertainment', 'joint venture', 'strategic alliance'
    ]
    for suffix in common_suffixes:
        cleaned = cleaned.replace(suffix, '')
    # 4. Remove extra spaces
    cleaned = ' '.join(cleaned.split())
    return cleaned

# Apply cleaning to both DataFrames
df1['Cleaned_All_Name'] = df1['All_name'].apply(clean_company_name)
df2['Cleaned_Company'] = df2['Company'].apply(clean_company_name)

After this, your cleaned names will look much more consistent:

  • ETIHAD/LUFTHANSA/HD LLC becomes etihad lufthansa hd
  • Rolls-Royce plc comp becomes rollsroyce

Step 2: Fuzzy Matching to Find Matches

Now we'll use fuzzy string matching to compare the cleaned names. Fuzzy matching calculates how "close" two strings are (using the Levenshtein distance algorithm), which is perfect for partial matches.

We'll use the fuzzywuzzy library (install it first with pip install fuzzywuzzy python-Levenshtein—the latter speeds up the calculations).

Here's how to filter DF2 to only keep rows where the company has a match in DF1:

def has_match(cleaned_company, df1_cleaned_names, threshold=70):
    # Find the closest match in DF1's cleaned names
    best_match = process.extractOne(cleaned_company, df1_cleaned_names)
    # Check if the match score meets our threshold (0-100, higher = more similar)
    if best_match and best_match[1] >= threshold:
        return True
    return False

# Filter DF2 to get matching records
df_result = df2[df2['Cleaned_Company'].apply(lambda x: has_match(x, df1['Cleaned_All_Name'].tolist()))]

# Drop the cleaned column if you don't need it in the final output
df_result = df_result.drop('Cleaned_Company', axis=1)

# Print the result
print(df_result)

Adjusting the Threshold

The threshold=70 is a starting point. If you're getting too many false positives (matches that shouldn't be there), increase the threshold (try 80 or 85). If you're missing valid matches, lower it (try 60 or 65). Test with your data to find the sweet spot.


Alternative: Keyword-Based Matching

If you want a simpler approach (no external libraries), you can use keyword intersection. For example:

  1. Split each cleaned name into individual words (keywords)
  2. For each company in DF2, check if any of its keywords exist in any cleaned name from DF1

Here's a quick example:

# Create a set of all keywords from DF1's cleaned names
df1_keywords = set()
for name in df1['Cleaned_All_Name']:
    df1_keywords.update(name.split())

# Filter DF2: keep rows where at least one keyword matches DF1's keywords
def has_keyword_match(cleaned_company):
    company_keywords = set(cleaned_company.split())
    return len(company_keywords & df1_keywords) > 0

df_result = df2[df2['Cleaned_Company'].apply(has_keyword_match)]

This is faster for large datasets but might be less precise than fuzzy matching—great if you just need to catch obvious overlaps.


Key Notes for Beginners

  • Preprocessing is critical: Skipping this step will lead to way more missed matches or false positives.
  • Test with small subsets: Before running on your full large DataFrames, test the preprocessing and matching logic on a small sample to tweak thresholds and cleaning rules.
  • For very large data: Use rapidfuzz (a faster alternative to fuzzywuzzy) or combine keyword filtering with fuzzy matching to reduce the number of comparisons.

内容的提问来源于stack exchange,提问作者Anna Shevtsova

火山引擎 最新活动