编写Python脚本:基于主键比对两个CSV文件并验证对应数据差异
Python Script to Compare Two CSV Files by Primary Key
Got it, let's put together a practical, easy-to-use Python script that compares two CSV files using a primary key. This solution will:
- Validate that the primary key is unique in both files
- Identify records that exist only in the first CSV or only in the second
- Flag data inconsistencies between matching records (same primary key but different values in other columns)
- Generate a human-readable report of all differences
First, make sure you have pandas installed (it's perfect for handling CSV data efficiently):
pip install pandas
Here's the complete script with detailed comments:
import pandas as pd from typing import Union, List def compare_csvs( file1_path: str, file2_path: str, primary_key: Union[str, List[str]], output_report_path: str = "csv_comparison_report.txt" ) -> None: """ Compare two CSV files based on a primary key, identify missing records and data mismatches. Args: file1_path: Path to the first CSV file file2_path: Path to the second CSV file primary_key: Single column name (string) or list of column names for composite primary key output_report_path: Path to save the comparison report """ # Load CSV files into DataFrames df1 = pd.read_csv(file1_path) df2 = pd.read_csv(file2_path) # Validate primary key exists in both DataFrames if isinstance(primary_key, str): primary_key = [primary_key] missing_in_df1 = [key for key in primary_key if key not in df1.columns] missing_in_df2 = [key for key in primary_key if key not in df2.columns] if missing_in_df1 or missing_in_df2: raise ValueError( f"Primary key columns missing:\n" f"Missing in {file1_path}: {missing_in_df1}\n" f"Missing in {file2_path}: {missing_in_df2}" ) # Validate primary key is unique in both files (since it's a primary key) if not df1.duplicated(subset=primary_key).sum() == 0: raise ValueError(f"Duplicate primary key values found in {file1_path}") if not df2.duplicated(subset=primary_key).sum() == 0: raise ValueError(f"Duplicate primary key values found in {file2_path}") # Merge DataFrames to find matches and missing records merged_df = pd.merge( df1, df2, on=primary_key, how="outer", suffixes=("_file1", "_file2"), indicator=True ) # Split into categories only_in_file1 = merged_df[merged_df["_merge"] == "left_only"] only_in_file2 = merged_df[merged_df["_merge"] == "right_only"] matching_records = merged_df[merged_df["_merge"] == "both"] # Prepare the report report = [] report.append("=== CSV Comparison Report ===\n") report.append(f"Files compared: {file1_path} vs {file2_path}") report.append(f"Primary key: {', '.join(primary_key)}\n") # Report missing records report.append(f"\n--- Records only in {file1_path} ---") if len(only_in_file1) == 0: report.append("None") else: report.append(f"Total: {len(only_in_file1)}") report.append(only_in_file1[primary_key].to_string(index=False)) report.append(f"\n--- Records only in {file2_path} ---") if len(only_in_file2) == 0: report.append("None") else: report.append(f"Total: {len(only_in_file2)}") report.append(only_in_file2[primary_key].to_string(index=False)) # Report data mismatches in matching records report.append("\n--- Data Mismatches in Matching Records ---") mismatches_found = False # Get columns that exist in both files (excluding primary key) common_columns = [col.split("_file1")[0] for col in matching_records.columns if "_file1" in col] for col in common_columns: col_file1 = f"{col}_file1" col_file2 = f"{col}_file2" # Find rows where values don't match (handle NaNs properly) mismatched_rows = matching_records[ ~matching_records[col_file1].eq(matching_records[col_file2], na_equal=False) ] if len(mismatched_rows) > 0: mismatches_found = True report.append(f"\nColumn: {col}") report.append(f"Mismatched records: {len(mismatched_rows)}") # Show primary key and both values mismatch_details = mismatched_rows[primary_key + [col_file1, col_file2]] report.append(mismatch_details.to_string(index=False)) if not mismatches_found: report.append("None - All matching records have consistent data") # Save the report to file with open(output_report_path, "w", encoding="utf-8") as f: f.write("\n".join(report)) print(f"Comparison complete! Report saved to {output_report_path}") # Example usage if __name__ == "__main__": try: compare_csvs( file1_path="file1.csv", file2_path="file2.csv", primary_key="user_id", # Replace with your primary key column(s) output_report_path="comparison_report.txt" ) except Exception as e: print(f"Error during comparison: {str(e)}")
Key Features Explained:
- Composite Primary Key Support: You can pass a list of columns (e.g.,
["order_id", "product_id"]) if your primary key uses multiple columns. - Validation Checks: Ensures the primary key exists in both files and has no duplicates (critical for accurate comparison).
- Comprehensive Report: The report includes:
- Records missing from each file
- Exact mismatches in each column for matching records
- Clear counts and readable details
- NaN Handling: Properly compares values even if some fields are empty (NaN values are treated as mismatches unless both are NaN).
How to Use:
- Replace the
file1.csv,file2.csv, anduser_idin the example usage with your actual file paths and primary key column(s). - Run the script—it will generate a text report with all differences.
Notes:
- If you don't want to use pandas (e.g., for very large files where memory is an issue), you could use the built-in
csvmodule to read files line by line and store records in dictionaries. However, pandas is much more efficient for most use cases. - If you need to auto-detect the primary key, you could add logic to check which column(s) have unique values in both files. For example:
Then compare the potential keys between the two files to find a common one.def find_potential_primary_keys(df): return [col for col in df.columns if df[col].nunique() == len(df)]
内容的提问来源于stack exchange,提问作者Avinash Singh




