How can you use Pandas' drop_duplicates method to deduplicate conditionally based on the length of the Name field?

阿华AIGC实验室

2026-4-30

Conditional Deduplication with Pandas drop_duplicates

Yes, you can do this with drop_duplicates. It has no built-in way to deduplicate only the rows that meet a condition, so you combine it with boolean filtering. Let's break this down with your sample DataFrame.

First, let’s recap your data and requirement:

Your DataFrame df:

| Name    | State |
|---------|-------|
| Down    | NY    |
| Down    | NY    |
| Down    | NY    |
| Next In | NJ    |
| Next In | NJ    |
| Next In | NJ    |

Requirement: Only deduplicate rows where the Name field’s length exceeds 5 characters (so keep all "Down" rows, but reduce "Next In" to one unique row).
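To make the rest of the answer easy to try, here is a minimal sketch that rebuilds the sample data and shows the length of each Name. The variable `name_lengths` and the `name_len` column are illustrative names, not part of your original DataFrame.

```python
import pandas as pd

# Sample data taken from the question
df = pd.DataFrame({
    "Name": ["Down", "Down", "Down", "Next In", "Next In", "Next In"],
    "State": ["NY", "NY", "NY", "NJ", "NJ", "NJ"],
})

# Character count of each Name; the space in "Next In" counts as a character
name_lengths = df["Name"].str.len()

# Show the lengths next to the data (name_len is only for inspection)
print(df.assign(name_len=name_lengths))
```

"Down" has 4 characters and "Next In" has 7, so only the "Next In" rows fall under the deduplication rule.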

Solution 1: Split, Deduplicate, and Combine

The most straightforward approach is to split your DataFrame into two subsets, process each, then recombine:

Create a boolean mask that identifies rows where the Name length is greater than 5:
```
mask = df['Name'].str.len() > 5
```
Deduplicate only the rows that match the mask (using drop_duplicates on the Name column):
```
deduplicated_rows = df[mask].drop_duplicates(subset='Name', keep='first')
```
Keep all rows that don’t match the mask (no deduplication needed):
```
keep_all_rows = df[~mask]
```

Combine the two subsets and restore the original order (optional):
```
result_df = pd.concat([keep_all_rows, deduplicated_rows]).sort_index()
```
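Put together, the steps of Solution 1 might look like the sketch below. The helper name `dedupe_long_names` and its `min_len` parameter are hypothetical, introduced only to make the example self-contained.

```python
import pandas as pd

def dedupe_long_names(df, min_len=5):
    """Hypothetical helper: deduplicate only rows whose Name is longer than min_len."""
    mask = df["Name"].str.len() > min_len
    # Rows matching the mask: keep the first occurrence of each Name
    deduplicated_rows = df[mask].drop_duplicates(subset="Name", keep="first")
    # Rows not matching the mask: keep them all
    keep_all_rows = df[~mask]
    # Recombine; sort_index() puts rows back in their original order
    return pd.concat([keep_all_rows, deduplicated_rows]).sort_index()

# Sample data taken from the question
df = pd.DataFrame({
    "Name": ["Down", "Down", "Down", "Next In", "Next In", "Next In"],
    "State": ["NY", "NY", "NY", "NJ", "NJ", "NJ"],
})

result_df = dedupe_long_names(df)
print(result_df)
```

On the sample data this keeps all three "Down" rows and the first "Next In" row. Restoring order with sort_index() relies on the original index being sortable, as it is with the default integer index.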

Solution 2: Use `groupby` with Conditional Logic

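If you prefer to keep everything in one expression, you can group by Name and deduplicate only the groups whose Name meets the length condition. Calling sort_index() at the end restores the original row order, as in Solution 1. Note that recent pandas releases (2.2 and later) deprecate letting apply operate on the grouping column, so this pattern may emit a warning or need adjusting there:
```
# Within a long-name group, drop_duplicates() compares full rows (Name and State)
result_df = df.groupby('Name', group_keys=False).apply(
    lambda group: group.drop_duplicates() if group['Name'].str.len().iloc[0] > 5 else group
).sort_index()
```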
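If groupby with apply feels heavy, or behaves differently in your pandas version, here is an alternative sketch that is not part of the two solutions above. It gets the same result with a single boolean filter built on `DataFrame.duplicated`. The names `is_long` and `is_repeat` are illustrative.

```python
import pandas as pd

# Sample data taken from the question
df = pd.DataFrame({
    "Name": ["Down", "Down", "Down", "Next In", "Next In", "Next In"],
    "State": ["NY", "NY", "NY", "NJ", "NJ", "NJ"],
})

# True where the Name is longer than 5 characters
is_long = df["Name"].str.len() > 5
# True for every repeat of a Name after its first occurrence
is_repeat = df.duplicated(subset="Name", keep="first")

# Drop a row only when it is both long-named and a repeat
result_df = df[~(is_long & is_repeat)]
print(result_df)
```

Because this only filters rows, the original order and index are kept without a separate sorting step.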