Technical Q&A: Cleaning Fuzzy Duplicates and Standardizing Rows in Excel Data with Python Pandas
Hey there! Let's walk through how to tackle fuzzy duplicate cleaning and row standardization using Python and Pandas. I'll break this down step-by-step, covering the full workflow from loading your Excel data to exporting the standardized result.
First, install the libraries we'll need for handling Excel data and fuzzy matching. I recommend rapidfuzz over fuzzywuzzy since it's faster and more efficient:
pip install pandas openpyxl rapidfuzz
Start by importing Pandas and loading your dataset. We'll use openpyxl as the engine to handle .xlsx files:
import pandas as pd

# Load the Excel file
df = pd.read_excel("your_raw_data.xlsx", engine="openpyxl")

# Quick check of the first 5 rows to verify data loading
print("Original Data Preview:")
print(df.head())
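We'll use rapidfuzz to find similar entries in your target column (e.g., "Product Name" or "Customer Name"). The token_set_ratio scorer works well for messy text because it ignores word order and repeated words. On its own it is still case- and punctuation-sensitive, so we also pass processor=utils.default_process, which lowercases each string and strips non-alphanumeric characters before comparing. Note that process.extract returns only the top 5 matches unless you pass limit=None:
from collections import defaultdict

from rapidfuzz import fuzz, process, utils

# Define the column you want to standardize
target_column = "Product Name"
unique_entries = df[target_column].dropna().unique().tolist()

# Group similar entries
similar_groups = defaultdict(list)
visited_entries = set()

# Adjust this threshold based on your needs (80 is a reasonable starting point)
similarity_threshold = 80

for entry in unique_entries:
    if entry in visited_entries:
        continue
    # Find all entries that meet the similarity threshold
    matches = process.extract(
        entry,
        unique_entries,
        scorer=fuzz.token_set_ratio,
        processor=utils.default_process,
        score_cutoff=similarity_threshold,
        limit=None,  # the default limit of 5 would truncate large groups
    )
    # Each match is a (choice, score, index) tuple; skip entries already grouped
    matched_entries = [m[0] for m in matches if m[0] not in visited_entries]
    if entry not in matched_entries:
        # Guard for entries that normalize to an empty string and match nothing
        matched_entries.insert(0, entry)
    similar_groups[entry] = matched_entries
    visited_entries.update(matched_entries)

# Print groups to review (tweak the threshold if groups are too broad or too narrow)
print("\nFuzzy Duplicate Groups:")
for potential_standard, duplicates in similar_groups.items():
    print(f"Standard Candidate: {potential_standard}")
    print(f"Similar Entries: {duplicates}\n")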
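Heads-up on the grouping loop: it is greedy, so which group an entry lands in depends on the order entries are visited. One alternative is to treat every above-threshold pair as an edge and take connected components. Below is a minimal stdlib-only sketch of that idea. The helpers `normalize`, `token_jaccard`, and `group_connected` are hypothetical names of mine, and the Jaccard scorer is a simplified stand-in for a fuzzy scorer, not rapidfuzz's algorithm.

```python
import re
from collections import defaultdict


def normalize(text):
    """Lowercase, replace runs of non-alphanumeric characters with a space, trim."""
    return re.sub(r"[^0-9a-z]+", " ", text.lower()).strip()


def token_jaccard(a, b):
    """Similarity in [0, 100]: shared tokens divided by all tokens (stand-in scorer)."""
    tokens_a, tokens_b = set(normalize(a).split()), set(normalize(b).split())
    if not tokens_a or not tokens_b:
        return 0.0
    return 100.0 * len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def group_connected(entries, threshold=80):
    """Group entries linked, directly or through a chain, by scores >= threshold."""
    parent = {entry: entry for entry in entries}

    def find(x):
        # Follow parent links to the root, shortening the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Compare every pair once and merge the two groups when the pair matches.
    for i, a in enumerate(entries):
        for b in entries[i + 1:]:
            if token_jaccard(a, b) >= threshold:
                parent[find(b)] = find(a)

    groups = defaultdict(list)
    for entry in entries:
        groups[find(entry)].append(entry)
    return list(groups.values())


names = ["Acme Corp.", "ACME corp", "acme, corp", "Globex Inc", "Initech"]
groups = group_connected(names)
print(groups)
```

The pairwise loop is quadratic in the number of unique entries, which is fine for a few thousand values but slow beyond that. Transitive grouping can also chain together entries that are not directly similar, so review the groups before trusting them.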
Now we'll define which value in each group becomes the standard. You can either:
- Automatically use the most frequent entry (great for most cases), or
- Manually set standards if you have business-specific rules
Option 1: Auto-Generate Mapping (Most Frequent Entry)
# Get frequency count for each entry
entry_frequencies = df[target_column].value_counts().to_dict()

# Build the mapping: map all duplicates to the most frequent entry in their group
standardization_map = {}
for group in similar_groups.values():
    standard_entry = max(group, key=lambda x: entry_frequencies.get(x, 0))
    for duplicate in group:
        standardization_map[duplicate] = standard_entry
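When two entries in a group are equally frequent, `max` returns whichever comes first in the group, so the chosen standard depends on row order. If you want a repeatable choice, here is a small sketch with an explicit tie-break. The helper name `pick_standard` and the rule "prefer the longer name, then alphabetical order" are assumptions for illustration, not something your data requires.

```python
from collections import Counter


def pick_standard(group, frequencies):
    """Return the most frequent entry; break ties by longer name, then alphabetically."""
    return min(
        group,
        key=lambda entry: (-frequencies.get(entry, 0), -len(entry), entry),
    )


# Made-up values standing in for df[target_column]
values = ["iPhone 14", "iphone14", "iPhone 14", "Apple iPhone 14", "Apple iPhone 14"]
frequencies = Counter(values)

standard = pick_standard(["iphone14", "iPhone 14", "Apple iPhone 14"], frequencies)
print(standard)
```

Here "iPhone 14" and "Apple iPhone 14" both appear twice, and the longer name wins the tie.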
Option 2: Manual Mapping (For Custom Standards)
If auto-selection doesn't fit your needs, edit the mapping directly:
standardization_map = {
    "iPhone14": "Apple iPhone 14",
    "IPhone 14 Pro Max": "Apple iPhone 14 Pro Max",
    "Samsung S23 Ultra 5G": "Samsung Galaxy S23 Ultra",
    # Add more custom mappings here
}
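If you'd rather layer a few manual rules on top of the automatic mapping instead of replacing it, something like the sketch below may help. `apply_overrides` is a hypothetical helper and the sample dictionaries are made up. The idea is that an override for a standard name should also re-point every raw value that was automatically mapped to that name.

```python
def apply_overrides(auto_map, overrides):
    """Merge manual overrides into an automatic raw -> standard mapping."""
    merged = {}
    for raw, standard in auto_map.items():
        # If the automatic standard itself has an override, follow it.
        merged[raw] = overrides.get(standard, standard)
    # Manual entries always win over automatic ones.
    merged.update(overrides)
    return merged


auto_map = {
    "iPhone14": "iPhone 14",
    "iPhone 14": "iPhone 14",
    "Galaxy S23": "Galaxy S23",
}
overrides = {"iPhone 14": "Apple iPhone 14"}

standardization_map = apply_overrides(auto_map, overrides)
print(standardization_map)
```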
Apply the mapping to your DataFrame, then drop rows that share a standardized value (the first row for each value is kept). Values that are missing from the mapping keep their original text rather than becoming NaN:
# Add a new standardized column (or replace the original if preferred)
standardized_column = f"{target_column} (Standardized)"
df[standardized_column] = df[target_column].map(standardization_map).fillna(df[target_column])

# Keep the first row for each standardized value
cleaned_df = df.drop_duplicates(subset=[standardized_column])

# Preview the cleaned data
print("\nCleaned & Standardized Data Preview:")
print(cleaned_df[[target_column, standardized_column]].head())
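One thing to watch with `Series.map` and a dictionary: any value that is not a key in the dictionary comes back as NaN, which matters most with a hand-written mapping that covers only a few entries. A minimal sketch of the behaviour and a fallback, using made-up data:

```python
import pandas as pd

# Made-up sample: one mapped value, one unmapped value, one missing value
df = pd.DataFrame({"Product Name": ["iPhone14", "Pixel 8", None]})
mapping = {"iPhone14": "Apple iPhone 14"}

# Plain map: anything absent from the dictionary becomes NaN
mapped_only = df["Product Name"].map(mapping)

# Fallback: fill the gaps with the original text (truly missing cells stay missing)
with_fallback = mapped_only.fillna(df["Product Name"])

print(mapped_only.isna().tolist())
print(with_fallback.tolist())
```

Keep in mind that `drop_duplicates` treats all NaN cells in the subset column as equal, so rows without a name collapse into a single row unless you handle them separately.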
Export both the original and cleaned data to a new Excel file for documentation:
with pd.ExcelWriter("standardized_final_data.xlsx", engine="openpyxl") as writer:
    df.to_excel(writer, sheet_name="Original Raw Data", index=False)
    cleaned_df.to_excel(writer, sheet_name="Cleaned & Standardized", index=False)

print("\nProcessing complete! Check 'standardized_final_data.xlsx' for results.")
Wrap everything into a function for easy reuse across different datasets:
from collections import defaultdict
from pathlib import Path

import pandas as pd
from rapidfuzz import fuzz, process, utils


def standardize_fuzzy_duplicates(excel_path, target_col, threshold=80):
    # Load data
    df = pd.read_excel(excel_path, engine="openpyxl")
    unique_entries = df[target_col].dropna().unique().tolist()

    # Group similar entries
    similar_groups = defaultdict(list)
    visited = set()
    for entry in unique_entries:
        if entry in visited:
            continue
        matches = process.extract(
            entry,
            unique_entries,
            scorer=fuzz.token_set_ratio,
            processor=utils.default_process,
            score_cutoff=threshold,
            limit=None,  # the default limit of 5 would truncate large groups
        )
        matched_entries = [m[0] for m in matches if m[0] not in visited]
        if entry not in matched_entries:
            matched_entries.insert(0, entry)
        similar_groups[entry] = matched_entries
        visited.update(matched_entries)

    # Build mapping with most frequent entry as standard
    freq_counts = df[target_col].value_counts().to_dict()
    standard_map = {}
    for group in similar_groups.values():
        standard_entry = max(group, key=lambda x: freq_counts.get(x, 0))
        for dup in group:
            standard_map[dup] = standard_entry

    # Apply standardization; unmapped values keep their original text
    standardized_col = f"{target_col} (Standardized)"
    df[standardized_col] = df[target_col].map(standard_map).fillna(df[target_col])
    cleaned_df = df.drop_duplicates(subset=[standardized_col])

    # Save next to the input file, e.g. data/x.xlsx -> data/standardized_x.xlsx
    source = Path(excel_path)
    output_path = source.with_name(f"standardized_{source.name}")
    with pd.ExcelWriter(output_path, engine="openpyxl") as writer:
        df.to_excel(writer, sheet_name="Original", index=False)
        cleaned_df.to_excel(writer, sheet_name="Standardized", index=False)

    return cleaned_df, standard_map


# Example usage
cleaned_data, mapping = standardize_fuzzy_duplicates("customer_data.xlsx", "Customer Name", threshold=85)
The question comes from Stack Exchange; asked by user10083589.




