How can Pandas' drop_duplicates method be used for conditional deduplication based on the length of the Name field?
Yes, you can achieve this with `drop_duplicates`. You just need to combine it with conditional filtering, since `drop_duplicates` has no built-in option for conditional deduplication in a single call. Let's break this down with your sample DataFrame.
First, let's recap your data and requirement.

Your DataFrame `df`:

| Name | State |
|---|---|
| Down | NY |
| Down | NY |
| Down | NY |
| Next In | NJ |
| Next In | NJ |
| Next In | NJ |

Requirement: Only deduplicate rows where the `Name` field's length exceeds 5 characters (so keep all "Down" rows, but reduce "Next In" to one unique row).
Solution 1: Split, Deduplicate, and Combine
The most straightforward approach is to split your DataFrame into two subsets, process each, then recombine:
- Create a boolean mask to identify rows where the `Name` length exceeds 5: `mask = df['Name'].str.len() > 5`
- Deduplicate only the rows that match the mask (using `drop_duplicates` on the `Name` column): `deduplicated_rows = df[mask].drop_duplicates(subset='Name', keep='first')`
- Keep all rows that don't match the mask (no deduplication needed): `keep_all_rows = df[~mask]`
- Combine the two subsets and restore the original order (optional): `result_df = pd.concat([keep_all_rows, deduplicated_rows]).sort_index()`
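
Put together, the steps above can be sketched as one runnable script. The DataFrame construction is an assumption reconstructed from the sample data in the question; the remaining lines are the four steps as listed.

```python
import pandas as pd

# Sample data rebuilt from the question (assumed to match the asker's df)
df = pd.DataFrame({
    "Name": ["Down", "Down", "Down", "Next In", "Next In", "Next In"],
    "State": ["NY", "NY", "NY", "NJ", "NJ", "NJ"],
})

# Step 1: True for rows whose Name is longer than 5 characters
mask = df["Name"].str.len() > 5

# Step 2: deduplicate only the long-name rows, keeping the first occurrence
deduplicated_rows = df[mask].drop_duplicates(subset="Name", keep="first")

# Step 3: short-name rows are kept untouched
keep_all_rows = df[~mask]

# Step 4: recombine and sort by the original index to restore row order
result_df = pd.concat([keep_all_rows, deduplicated_rows]).sort_index()

print(result_df)
```

Because `sort_index()` relies on the original index labels, this works best when the index is unique, as it is with the default `RangeIndex` here.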
Solution 2: Use groupby with Conditional Logic
If you prefer a more concise approach, you can use groupby to apply deduplication only to groups where Name meets the length condition:
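
    result_df = df.groupby('Name', group_keys=False).apply(
        lambda group: group.drop_duplicates()
        if group['Name'].str.len().iloc[0] > 5
        else group
    ).reset_index(drop=True)

Three caveats apply. `groupby` sorts the groups by key and `reset_index(drop=True)` discards the original index, so the original row order is not necessarily preserved. `drop_duplicates()` without `subset` compares all columns, so rows with the same `Name` but a different `State` would both be kept, unlike in Solution 1. Finally, recent pandas versions deprecate passing the grouping column to the function given to `apply`, so this snippet may emit a warning or need adjusting depending on the installed version.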
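
If a one-expression form is the goal, a different technique from the `groupby` approach is to combine the length mask with `DataFrame.duplicated`. This is a minimal sketch under the same assumed sample data; the names `is_long` and `is_repeat` are illustrative only.

```python
import pandas as pd

# Sample data rebuilt from the question (assumed to match the asker's df)
df = pd.DataFrame({
    "Name": ["Down", "Down", "Down", "Next In", "Next In", "Next In"],
    "State": ["NY", "NY", "NY", "NJ", "NJ", "NJ"],
})

# True for rows whose Name is longer than 5 characters
is_long = df["Name"].str.len() > 5

# True for every repeat of a Name after its first occurrence
is_repeat = df.duplicated(subset="Name", keep="first")

# Drop a row only when it is both long-named and a repeat
result_df = df[~(is_long & is_repeat)]

print(result_df)
```

This keeps the original row order and index without any concatenation or sorting, and it deduplicates on `Name` only, as Solution 1 does.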
What You’ll Get
For the sample data, both methods produce this final DataFrame:
| Name | State |
|---|---|
| Down | NY |
| Down | NY |
| Down | NY |
| Next In | NJ |
This matches the requirement: all short-named rows are preserved, while long-named duplicates are removed.
The question originates from Stack Exchange; it was asked by Pav.




