Python实现带权重的多客户数据集模糊匹配去重及工具咨询

阿华AIGC实验室

2026-5-27

Weighted Fuzzy Matching for Customer Deduplication in Python

Let’s break this down into a practical, actionable plan—since you’re dealing with large datasets (13k-18k records each) and need weighted field matching, the key is to balance accuracy with performance. Here’s how to approach it:

Step 1: Define Field Weights First

Start by assigning weights based on how uniquely each field identifies a customer. For your dataset, a reasonable weight distribution might look like this (sum to 1 for easy thresholding):

Phone/Mobile: 0.2 each (total 0.4) — phone numbers are highly unique, even with minor typos.
CustomerName: 0.3 — names are core identifiers but prone to typos (e.g., "John Doe" vs "Jon Doe").
Address: 0.15 — addresses have formatting variations but are strong indicators of the same customer.
Longitude/Latitude: 0.1 — geographic proximity is useful if coordinates are precise.
StoreName: 0.05 — low weight, since multiple customers can visit the same store.

Adjust these based on your data’s quality (e.g., if phone numbers are often missing, reduce their weight).

Step 2: Preprocess Data to Improve Matching

Clean every field to eliminate noise before matching:

Strings (Name, Address, StoreName): Lowercase, remove special characters, standardize abbreviations (e.g., "St." → "Street").
Phone/Mobile: Strip non-numeric characters (spaces, hyphens) to standardize formats.
Geo fields: Ensure coordinates are numeric and handle missing values gracefully.

Step 3: Choose a Matching Strategy with Blocking

Pairwise comparisons of all records would be O(n²) (18k records = ~162 million pairs)—way too slow. Use blocking to group records by a shared attribute first (e.g., first 3 letters of the customer name, or area code of the phone number) to only compare relevant pairs.

Step 4: Calculate Weighted Similarity Scores

For each field type, use the right similarity metric, multiply by its weight, then sum all scores to get a total similarity score. If the score exceeds a threshold (e.g., 0.7), mark the records as duplicates.

Top Python Libraries for This Task

These tools are tailored to handle weighted fuzzy matching and large datasets:

1. RapidFuzz

A fast, drop-in replacement for FuzzyWuzzy (written in C) that calculates string similarity (Levenshtein, Jaccard, etc.). Perfect for name/address matching without sacrificing speed.

2. RecordLinkage

A dedicated library for record deduplication and linkage. It supports blocking out of the box, lets you assign weights to field comparisons, and handles different data types (strings, exact matches, geographic distances) seamlessly. Great for structured datasets like yours.

3. Dedupe

Uses machine learning to learn optimal weights automatically from labeled data. If you have a small sample of known duplicates/non-duplicates, it will train a model to score pairs—saving you from manual weight tuning.

4. Geopy

Calculates geographic distances between longitude/latitude points (Haversine formula). Convert distance to a similarity score (e.g., 1 - (distance / max_relevant_distance)) to include location in your weighted score.

5. Pandas

Essential for data cleaning, grouping, and merging. You’ll use it to preprocess your datasets and manage the results of matching.

Example Workflow with RecordLinkage

Here’s a quick snippet to show how this comes together:

First, install the libraries:

pip install recordlinkage pandas geopy

Then, implement the matching:

import pandas as pd
import recordlinkage
from geopy.distance import geodesic

# Load your datasets
df1 = pd.read_csv("dataset1.csv")
df2 = pd.read_csv("dataset2.csv")
df3 = pd.read_csv("dataset3.csv")

# Preprocess fields (example for name and phone)
def clean_string(s):
    return s.lower().replace(",", "").replace(".", "").strip() if pd.notna(s) else ""

def clean_phone(p):
    return "".join(filter(str.isdigit, str(p))) if pd.notna(p) else ""

for df in [df1, df2, df3]:
    df["CleanName"] = df["CustomerName"].apply(clean_string)
    df["CleanPhone"] = df["Phone"].apply(clean_phone)
    df["CleanMobile"] = df["Mobile"].apply(clean_phone)
    df["CleanAddress"] = df["Address"].apply(clean_string)

# Combine all datasets into one for deduplication
combined_df = pd.concat([df1, df2, df3]).reset_index(drop=True)

# Set up blocking to reduce comparisons
indexer = recordlinkage.Index()
indexer.block("CleanName")  # Group by first few characters of cleaned name
candidates = indexer.index(combined_df)

# Define comparisons with weights
comparer = recordlinkage.Compare()
comparer.string("CleanName", "CleanName", method="levenshtein", threshold=0.8, weight=0.3)
comparer.exact("CleanPhone", "CleanPhone", weight=0.2)
comparer.exact("CleanMobile", "CleanMobile", weight=0.2)
comparer.string("CleanAddress", "CleanAddress", method="jaccard", threshold=0.7, weight=0.15)
# Add geo comparison (convert distance to score)
comparer.geodistance("Latitude", "Longitude", "Latitude", "Longitude", threshold=5, weight=0.1)

# Calculate weighted scores
features = comparer.compute(candidates, combined_df, combined_df)
features["TotalScore"] = features.sum(axis=1)

# Filter duplicates (adjust threshold based on your data)
duplicates = features[features["TotalScore"] >= 0.7]

# Now you can merge or remove duplicates from combined_df

Key Tips

Tune the Threshold: Test thresholds between 0.6 and 0.8 to find the right balance between false positives and negatives.
Handle Missing Data: Ensure missing fields don’t skew scores (e.g., assign a 0 similarity score for missing phone numbers).
Use Labeled Data: If possible, use Dedupe to train a model—this will give you more accurate weightings than manual assignment.

内容的提问来源于stack exchange，提问作者Dr Sima