You need to enable JavaScript to run this app.
最新活动
大模型
产品
解决方案
定价
生态与合作
支持与服务
开发者
了解我们

编写Python脚本:基于主键比对两个CSV文件并验证对应数据差异

Python Script to Compare Two CSV Files by Primary Key

Got it, let's put together a practical, easy-to-use Python script that compares two CSV files using a primary key. This solution will:

  • Validate that the primary key is unique in both files
  • Identify records that exist only in the first CSV or only in the second
  • Flag data inconsistencies between matching records (same primary key but different values in other columns)
  • Generate a human-readable report of all differences

First, make sure you have pandas installed (it's perfect for handling CSV data efficiently):

pip install pandas

Here's the complete script with detailed comments:

import pandas as pd
from typing import Union, List

def compare_csvs(
    file1_path: str,
    file2_path: str,
    primary_key: Union[str, List[str]],
    output_report_path: str = "csv_comparison_report.txt"
) -> None:
    """
    Compare two CSV files based on a primary key, identify missing records and data mismatches.
    
    Args:
        file1_path: Path to the first CSV file
        file2_path: Path to the second CSV file
        primary_key: Single column name (string) or list of column names for composite primary key
        output_report_path: Path to save the comparison report
    """
    # Load CSV files into DataFrames
    df1 = pd.read_csv(file1_path)
    df2 = pd.read_csv(file2_path)
    
    # Validate primary key exists in both DataFrames
    if isinstance(primary_key, str):
        primary_key = [primary_key]
    
    missing_in_df1 = [key for key in primary_key if key not in df1.columns]
    missing_in_df2 = [key for key in primary_key if key not in df2.columns]
    
    if missing_in_df1 or missing_in_df2:
        raise ValueError(
            f"Primary key columns missing:\n"
            f"Missing in {file1_path}: {missing_in_df1}\n"
            f"Missing in {file2_path}: {missing_in_df2}"
        )
    
    # Validate primary key is unique in both files (since it's a primary key)
    if not df1.duplicated(subset=primary_key).sum() == 0:
        raise ValueError(f"Duplicate primary key values found in {file1_path}")
    if not df2.duplicated(subset=primary_key).sum() == 0:
        raise ValueError(f"Duplicate primary key values found in {file2_path}")
    
    # Merge DataFrames to find matches and missing records
    merged_df = pd.merge(
        df1, df2,
        on=primary_key,
        how="outer",
        suffixes=("_file1", "_file2"),
        indicator=True
    )
    
    # Split into categories
    only_in_file1 = merged_df[merged_df["_merge"] == "left_only"]
    only_in_file2 = merged_df[merged_df["_merge"] == "right_only"]
    matching_records = merged_df[merged_df["_merge"] == "both"]
    
    # Prepare the report
    report = []
    report.append("=== CSV Comparison Report ===\n")
    report.append(f"Files compared: {file1_path} vs {file2_path}")
    report.append(f"Primary key: {', '.join(primary_key)}\n")
    
    # Report missing records
    report.append(f"\n--- Records only in {file1_path} ---")
    if len(only_in_file1) == 0:
        report.append("None")
    else:
        report.append(f"Total: {len(only_in_file1)}")
        report.append(only_in_file1[primary_key].to_string(index=False))
    
    report.append(f"\n--- Records only in {file2_path} ---")
    if len(only_in_file2) == 0:
        report.append("None")
    else:
        report.append(f"Total: {len(only_in_file2)}")
        report.append(only_in_file2[primary_key].to_string(index=False))
    
    # Report data mismatches in matching records
    report.append("\n--- Data Mismatches in Matching Records ---")
    mismatches_found = False
    
    # Get columns that exist in both files (excluding primary key)
    common_columns = [col.split("_file1")[0] for col in matching_records.columns if "_file1" in col]
    
    for col in common_columns:
        col_file1 = f"{col}_file1"
        col_file2 = f"{col}_file2"
        
        # Find rows where values don't match (handle NaNs properly)
        mismatched_rows = matching_records[
            ~matching_records[col_file1].eq(matching_records[col_file2], na_equal=False)
        ]
        
        if len(mismatched_rows) > 0:
            mismatches_found = True
            report.append(f"\nColumn: {col}")
            report.append(f"Mismatched records: {len(mismatched_rows)}")
            # Show primary key and both values
            mismatch_details = mismatched_rows[primary_key + [col_file1, col_file2]]
            report.append(mismatch_details.to_string(index=False))
    
    if not mismatches_found:
        report.append("None - All matching records have consistent data")
    
    # Save the report to file
    with open(output_report_path, "w", encoding="utf-8") as f:
        f.write("\n".join(report))
    
    print(f"Comparison complete! Report saved to {output_report_path}")

# Example usage
if __name__ == "__main__":
    try:
        compare_csvs(
            file1_path="file1.csv",
            file2_path="file2.csv",
            primary_key="user_id",  # Replace with your primary key column(s)
            output_report_path="comparison_report.txt"
        )
    except Exception as e:
        print(f"Error during comparison: {str(e)}")

Key Features Explained:

  • Composite Primary Key Support: You can pass a list of columns (e.g., ["order_id", "product_id"]) if your primary key uses multiple columns.
  • Validation Checks: Ensures the primary key exists in both files and has no duplicates (critical for accurate comparison).
  • Comprehensive Report: The report includes:
    • Records missing from each file
    • Exact mismatches in each column for matching records
    • Clear counts and readable details
  • NaN Handling: Properly compares values even if some fields are empty (NaN values are treated as mismatches unless both are NaN).

How to Use:

  1. Replace the file1.csv, file2.csv, and user_id in the example usage with your actual file paths and primary key column(s).
  2. Run the script—it will generate a text report with all differences.

Notes:

  • If you don't want to use pandas (e.g., for very large files where memory is an issue), you could use the built-in csv module to read files line by line and store records in dictionaries. However, pandas is much more efficient for most use cases.
  • If you need to auto-detect the primary key, you could add logic to check which column(s) have unique values in both files. For example:
    def find_potential_primary_keys(df):
        return [col for col in df.columns if df[col].nunique() == len(df)]
    
    Then compare the potential keys between the two files to find a common one.

内容的提问来源于stack exchange,提问作者Avinash Singh

火山引擎 最新活动