编写Python脚本：基于主键比对两个CSV文件并验证对应数据差异

阿华AIGC实验室

2026-5-19

Python Script to Compare Two CSV Files by Primary Key

Got it, let's put together a practical, easy-to-use Python script that compares two CSV files using a primary key. This solution will:

Validate that the primary key is unique in both files
Identify records that exist only in the first CSV or only in the second
Flag data inconsistencies between matching records (same primary key but different values in other columns)
Generate a human-readable report of all differences

First, make sure you have pandas installed (it's perfect for handling CSV data efficiently):

pip install pandas

Here's the complete script with detailed comments:

import pandas as pd
from typing import Union, List

def compare_csvs(
    file1_path: str,
    file2_path: str,
    primary_key: Union[str, List[str]],
    output_report_path: str = "csv_comparison_report.txt"
) -> None:
    """
    Compare two CSV files based on a primary key, identify missing records and data mismatches.
    
    Args:
        file1_path: Path to the first CSV file
        file2_path: Path to the second CSV file
        primary_key: Single column name (string) or list of column names for composite primary key
        output_report_path: Path to save the comparison report
    """
    # Load CSV files into DataFrames
    df1 = pd.read_csv(file1_path)
    df2 = pd.read_csv(file2_path)
    
    # Validate primary key exists in both DataFrames
    if isinstance(primary_key, str):
        primary_key = [primary_key]
    
    missing_in_df1 = [key for key in primary_key if key not in df1.columns]
    missing_in_df2 = [key for key in primary_key if key not in df2.columns]
    
    if missing_in_df1 or missing_in_df2:
        raise ValueError(
            f"Primary key columns missing:\n"
            f"Missing in {file1_path}: {missing_in_df1}\n"
            f"Missing in {file2_path}: {missing_in_df2}"
        )
    
    # Validate primary key is unique in both files (since it's a primary key)
    if not df1.duplicated(subset=primary_key).sum() == 0:
        raise ValueError(f"Duplicate primary key values found in {file1_path}")
    if not df2.duplicated(subset=primary_key).sum() == 0:
        raise ValueError(f"Duplicate primary key values found in {file2_path}")
    
    # Merge DataFrames to find matches and missing records
    merged_df = pd.merge(
        df1, df2,
        on=primary_key,
        how="outer",
        suffixes=("_file1", "_file2"),
        indicator=True
    )
    
    # Split into categories
    only_in_file1 = merged_df[merged_df["_merge"] == "left_only"]
    only_in_file2 = merged_df[merged_df["_merge"] == "right_only"]
    matching_records = merged_df[merged_df["_merge"] == "both"]
    
    # Prepare the report
    report = []
    report.append("=== CSV Comparison Report ===\n")
    report.append(f"Files compared: {file1_path} vs {file2_path}")
    report.append(f"Primary key: {', '.join(primary_key)}\n")
    
    # Report missing records
    report.append(f"\n--- Records only in {file1_path} ---")
    if len(only_in_file1) == 0:
        report.append("None")
    else:
        report.append(f"Total: {len(only_in_file1)}")
        report.append(only_in_file1[primary_key].to_string(index=False))
    
    report.append(f"\n--- Records only in {file2_path} ---")
    if len(only_in_file2) == 0:
        report.append("None")
    else:
        report.append(f"Total: {len(only_in_file2)}")
        report.append(only_in_file2[primary_key].to_string(index=False))
    
    # Report data mismatches in matching records
    report.append("\n--- Data Mismatches in Matching Records ---")
    mismatches_found = False
    
    # Get columns that exist in both files (excluding primary key)
    common_columns = [col.split("_file1")[0] for col in matching_records.columns if "_file1" in col]
    
    for col in common_columns:
        col_file1 = f"{col}_file1"
        col_file2 = f"{col}_file2"
        
        # Find rows where values don't match (handle NaNs properly)
        mismatched_rows = matching_records[
            ~matching_records[col_file1].eq(matching_records[col_file2], na_equal=False)
        ]
        
        if len(mismatched_rows) > 0:
            mismatches_found = True
            report.append(f"\nColumn: {col}")
            report.append(f"Mismatched records: {len(mismatched_rows)}")
            # Show primary key and both values
            mismatch_details = mismatched_rows[primary_key + [col_file1, col_file2]]
            report.append(mismatch_details.to_string(index=False))
    
    if not mismatches_found:
        report.append("None - All matching records have consistent data")
    
    # Save the report to file
    with open(output_report_path, "w", encoding="utf-8") as f:
        f.write("\n".join(report))
    
    print(f"Comparison complete! Report saved to {output_report_path}")

# Example usage
if __name__ == "__main__":
    try:
        compare_csvs(
            file1_path="file1.csv",
            file2_path="file2.csv",
            primary_key="user_id",  # Replace with your primary key column(s)
            output_report_path="comparison_report.txt"
        )
    except Exception as e:
        print(f"Error during comparison: {str(e)}")

Key Features Explained:

Composite Primary Key Support: You can pass a list of columns (e.g., ["order_id", "product_id"]) if your primary key uses multiple columns.
Validation Checks: Ensures the primary key exists in both files and has no duplicates (critical for accurate comparison).
Comprehensive Report: The report includes:
- Records missing from each file
- Exact mismatches in each column for matching records
- Clear counts and readable details
NaN Handling: Properly compares values even if some fields are empty (NaN values are treated as mismatches unless both are NaN).

How to Use:

Replace the file1.csv, file2.csv, and user_id in the example usage with your actual file paths and primary key column(s).
Run the script—it will generate a text report with all differences.

Notes:

If you don't want to use pandas (e.g., for very large files where memory is an issue), you could use the built-in csv module to read files line by line and store records in dictionaries. However, pandas is much more efficient for most use cases.
If you need to auto-detect the primary key, you could add logic to check which column(s) have unique values in both files. For example:
```
def find_potential_primary_keys(df):
    return [col for col in df.columns if df[col].nunique() == len(df)]
```
Then compare the potential keys between the two files to find a common one.

内容的提问来源于stack exchange，提问作者Avinash Singh