基于Python Pandas清理Excel数据模糊重复项及行标准化技术咨询

阿华AIGC实验室

2026-5-12

Hey there! Let's walk through how to tackle fuzzy duplicate cleaning and row standardization using Python and Pandas. I'll break this down step-by-step, covering the full workflow from loading your Excel data to exporting the standardized result.

1. Setup Required Dependencies

First, install the libraries we'll need for handling Excel data and fuzzy matching. I recommend rapidfuzz over fuzzywuzzy since it's faster and more efficient:

pip install pandas openpyxl rapidfuzz

2. Load Your Excel Data

Start by importing Pandas and loading your dataset. We'll use openpyxl as the engine to handle .xlsx files:

import pandas as pd

# Load the Excel file
df = pd.read_excel("your_raw_data.xlsx", engine="openpyxl")

# Quick check of the first 5 rows to verify data loading
print("Original Data Preview:")
print(df.head())

3. Identify Fuzzy Duplicate Groups

We'll use rapidfuzz to find similar entries in your target column (e.g., "Product Name" or "Customer Name"). The token_set_ratio scorer works well for messy text—it ignores word order and extra punctuation:

from rapidfuzz import process, fuzz
from collections import defaultdict

# Define the column you want to standardize
target_column = "Product Name"
unique_entries = df[target_column].dropna().unique().tolist()

# Group similar entries
similar_groups = defaultdict(list)
visited_entries = set()

# Adjust this threshold based on your needs (80 is a good starting point)
similarity_threshold = 80

for entry in unique_entries:
    if entry not in visited_entries:
        # Find all entries that meet the similarity threshold
        matches = process.extract(
            entry, 
            unique_entries, 
            scorer=fuzz.token_set_ratio, 
            score_cutoff=similarity_threshold
        )
        matched_entries = [match[0] for match in matches]
        similar_groups[entry] = matched_entries
        visited_entries.update(matched_entries)

# Print groups to review (you can tweak the threshold if groups are too broad/narrow)
print("\nFuzzy Duplicate Groups:")
for potential_standard, duplicates in similar_groups.items():
    print(f"Standard Candidate: {potential_standard}")
    print(f"Similar Entries: {duplicates}\n")

4. Create a Standardization Mapping

Now we'll define which value in each group becomes the standard. You can either:

Automatically use the most frequent entry (great for most cases), or
Manually set standards if you have business-specific rules

Option 1: Auto-Generate Mapping (Most Frequent Entry)

# Get frequency count for each entry
entry_frequencies = df[target_column].value_counts().to_dict()

# Build the mapping: map all duplicates to the most frequent entry in their group
standardization_map = {}
for group in similar_groups.values():
    standard_entry = max(group, key=lambda x: entry_frequencies.get(x, 0))
    for duplicate in group:
        standardization_map[duplicate] = standard_entry

Option 2: Manual Mapping (For Custom Standards)

If auto-selection doesn't fit your needs, edit the mapping directly:

standardization_map = {
    "iPhone14": "Apple iPhone 14",
    "IPhone 14 Pro Max": "Apple iPhone 14 Pro Max",
    "Samsung S23 Ultra 5G": "Samsung Galaxy S23 Ultra",
    # Add more custom mappings here
}

5. Apply Standardization & Clean Duplicates

Apply the mapping to your DataFrame and remove exact duplicates after standardization:

# Add a new standardized column (or replace the original if preferred)
df[f"{target_column} (Standardized)"] = df[target_column].map(standardization_map)

# Drop exact duplicates based on the standardized column
cleaned_df = df.drop_duplicates(subset=[f"{target_column} (Standardized)"])

# Preview the cleaned data
print("\nCleaned & Standardized Data Preview:")
print(cleaned_df[[target_column, f"{target_column} (Standardized)"]].head())

6. Save the Final Results

Export both the original and cleaned data to a new Excel file for documentation:

with pd.ExcelWriter("standardized_final_data.xlsx", engine="openpyxl") as writer:
    df.to_excel(writer, sheet_name="Original Raw Data", index=False)
    cleaned_df.to_excel(writer, sheet_name="Cleaned & Standardized", index=False)

print("\nProcessing complete! Check 'standardized_final_data.xlsx' for results.")

7. Optional: Automate the Workflow (Reusable Function)

Wrap everything into a function for easy reuse across different datasets:

def standardize_fuzzy_duplicates(excel_path, target_col, threshold=80):
    # Load data
    df = pd.read_excel(excel_path, engine="openpyxl")
    unique_entries = df[target_col].dropna().unique().tolist()
    
    # Group similar entries
    similar_groups = defaultdict(list)
    visited = set()
    for entry in unique_entries:
        if entry not in visited:
            matches = process.extract(entry, unique_entries, scorer=fuzz.token_set_ratio, score_cutoff=threshold)
            matched_entries = [match[0] for match in matches]
            similar_groups[entry] = matched_entries
            visited.update(matched_entries)
    
    # Build mapping with most frequent entry as standard
    freq_counts = df[target_col].value_counts().to_dict()
    standard_map = {}
    for group in similar_groups.values():
        standard_entry = max(group, key=lambda x: freq_counts.get(x, 0))
        for dup in group:
            standard_map[dup] = standard_entry
    
    # Apply standardization
    df[f"{target_col} (Standardized)"] = df[target_col].map(standard_map)
    cleaned_df = df.drop_duplicates(subset=[f"{target_col} (Standardized)"])
    
    # Save output
    output_path = f"standardized_{excel_path}"
    with pd.ExcelWriter(output_path, engine="openpyxl") as writer:
        df.to_excel(writer, sheet_name="Original", index=False)
        cleaned_df.to_excel(writer, sheet_name="Standardized", index=False)
    
    return cleaned_df, standard_map

# Example usage
cleaned_data, mapping = standardize_fuzzy_duplicates("customer_data.xlsx", "Customer Name", threshold=85)

内容的提问来源于stack exchange，提问作者user10083589