编写Python函数：生成DataFrame列和并对比两DataFrame差异

阿华AIGC实验室

2026-5-9

Solution: Compare DataFrame Column Sums with Expected Totals

Got it, let's build a practical pandas-based function that checks off both your requirements—calculating column sums and validating them against your expected totals. I'll break this down step by step so it's easy to follow and adapt.

Step 1: Calculate Column Sums & Generate df1

First, we'll take your input DataFrame, compute sums for each numeric column, and append a "Total" row to create df1. Here's the code with comments to explain each part:

import pandas as pd

def calculate_and_validate_totals(input_df, df2):
    # Step 1: Compute column sums and build df1 with Total row
    # Calculate sums for all numeric columns (skips non-numeric like 'name')
    column_sums = input_df.sum(numeric_only=True)
    # Add the 'name' label for our Total row
    column_sums['name'] = 'Total'
    # Append the Total row to the original DataFrame to get df1
    df1 = input_df.append(column_sums.to_frame().T, ignore_index=True)

    # Step 2: Compare df1's Total row with df2's True_Total row
    # Extract the two rows we need for comparison
    total_row = df1[df1['name'] == 'Total'].squeeze()
    true_total_row = df2[df2['name'] == 'True_Total'].squeeze()

    # Define which columns to compare (exclude the 'name' column since it's just a label)
    compare_cols = [col for col in input_df.columns if col != 'name']

    # Track match status, differences, and mismatched columns
    match_status = {}
    difference_details = []
    mismatched_cols = []

    for col in compare_cols:
        calc_total = total_row[col]
        expected_total = true_total_row[col]
        
        if calc_total == expected_total:
            match_status[col] = "✅ Matched"
        else:
            match_status[col] = "❌ Mismatched"
            diff = calc_total - expected_total
            difference_details.append(f"Column '{col}': Calculated ({calc_total}) vs Expected ({expected_total}) | Difference: {diff}")
            mismatched_cols.append(col)

    # Print the results in a readable format
    print("=== Match Status Summary ===")
    for col, status in match_status.items():
        print(f"{col}: {status}")

    if difference_details:
        print("\n=== Difference Breakdown ===")
        for detail in difference_details:
            print(detail)
        
        print("\n=== Mismatched Columns ===")
        print(f"Columns with inconsistent totals: {', '.join(mismatched_cols)}")
    else:
        print("\n🎉 All columns match the expected True_Total values!")

    # Return df1 in case you need to use it for further analysis
    return df1

How to Test the Function

Let's use your example data to verify everything works as expected:

# Example input DataFrame (matches your structure with T, N1, N2 rows)
input_data = {
    'name': ['T', 'N1', 'N2'],
    'name1': [100, 10, 134],  # Sum: 244
    'name2': [50, 22, 50],    # Sum: 122
    'name3': [80, 62, 40],    # Sum: 182
    'name4': [30, 29, 20],    # Sum: 79
    'name5': [70, 49, 50]     # Sum: 169
}
input_df = pd.DataFrame(input_data)

# Example df2 with True_Total row
df2_data = {
    'name': ['True_Total'],
    'name1': [244],
    'name2': [122],
    'name3': [182],
    'name4': [79],
    'name5': [169]
}
df2 = pd.DataFrame(df2_data)

# Run the function
df1_result = calculate_and_validate_totals(input_df, df2)

Expected Output

When you run the test code, you'll see:

=== Match Status Summary ===
name1: ✅ Matched
name2: ✅ Matched
name3: ✅ Matched
name4: ✅ Matched
name5: ✅ Matched

🎉 All columns match the expected True_Total values!

If you tweak one of the expected values (e.g., set name1 to 245 in df2), the output will flag the mismatch clearly:

=== Match Status Summary ===
name1: ❌ Mismatched
name2: ✅ Matched
name3: ✅ Matched
name4: ✅ Matched
name5: ✅ Matched

=== Difference Breakdown ===
Column 'name1': Calculated (244) vs Expected (245) | Difference: -1

=== Mismatched Columns ===
Columns with inconsistent totals: name1

Quick Notes

The numeric_only=True flag ensures we don't try to sum non-numeric columns like 'name'.
squeeze() converts single-row DataFrames into Series, making column-wise comparison simpler.
You can adjust the output formatting or return results as variables instead of printing if you need to integrate this into a larger script.

内容的提问来源于stack exchange，提问作者Richard21