
How can I use Pandas' drop_duplicates method to deduplicate conditionally, based on the length of the Name field?

Conditional Deduplication with Pandas drop_duplicates

Yes, you can do this with drop_duplicates, just not in a single call: drop_duplicates has no option for restricting deduplication to rows that satisfy a condition, so you combine it with boolean filtering. Let's break this down with your sample DataFrame.

First, let’s recap your data and requirement:

Your DataFrame df:

Name     State
Down     NY
Down     NY
Down     NY
Next In  NJ
Next In  NJ
Next In  NJ

Requirement: Only deduplicate rows where the Name field’s length exceeds 5 characters (so keep all "Down" rows, but reduce "Next In" to one unique row).
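
To make the rest of the answer reproducible, here is a minimal sketch that rebuilds the sample data. It assumes both columns hold plain strings; the variable name name_len is only illustrative and does not come from the question.

```python
import pandas as pd

# Sample data from the question: three "Down" rows and three "Next In" rows.
df = pd.DataFrame({
    "Name": ["Down", "Down", "Down", "Next In", "Next In", "Next In"],
    "State": ["NY", "NY", "NY", "NJ", "NJ", "NJ"],
})

# Character count of each Name. The space in "Next In" counts as a character,
# so "Down" has length 4 and "Next In" has length 7.
name_len = df["Name"].str.len()
print(name_len.tolist())  # [4, 4, 4, 7, 7, 7]
```

Only the "Next In" rows exceed 5 characters, so only they are candidates for deduplication.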

Solution 1: Split, Deduplicate, and Combine

The most straightforward approach is to split your DataFrame into two subsets, process each, then recombine:

  1. Create a boolean mask that is True for rows whose Name is longer than 5 characters:
    mask = df['Name'].str.len() > 5
    
  2. Deduplicate only the rows that match the mask (using drop_duplicates on the Name column):
    deduplicated_rows = df[mask].drop_duplicates(subset='Name', keep='first')
    
  3. Keep all rows that don’t match the mask (no deduplication needed):
    keep_all_rows = df[~mask]
    
  4. Combine the two subsets. The optional sort_index() call restores the original row order, provided the original index was in ascending order to begin with:
    result_df = pd.concat([keep_all_rows, deduplicated_rows]).sort_index()
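
Putting the four steps together, a runnable sketch might look like the following. The helper name dedupe_long_names and its min_len parameter are hypothetical, introduced here only for illustration, and the sketch assumes the Name column has no missing values.

```python
import pandas as pd

def dedupe_long_names(df: pd.DataFrame, min_len: int = 5) -> pd.DataFrame:
    """Drop repeated Names only where len(Name) > min_len (hypothetical helper)."""
    # Step 1: True for rows whose Name is longer than min_len characters.
    mask = df["Name"].str.len() > min_len
    # Step 2: among the long-named rows, keep the first row per Name.
    deduplicated_rows = df[mask].drop_duplicates(subset="Name", keep="first")
    # Step 3: short-named rows are kept untouched.
    keep_all_rows = df[~mask]
    # Step 4: recombine and sort by the original index labels.
    return pd.concat([keep_all_rows, deduplicated_rows]).sort_index()

df = pd.DataFrame({
    "Name": ["Down", "Down", "Down", "Next In", "Next In", "Next In"],
    "State": ["NY", "NY", "NY", "NJ", "NJ", "NJ"],
})

result_df = dedupe_long_names(df)
print(result_df)
```

Because the rows keep their original index labels, the surviving "Next In" row is the one at index 3, the first of its group.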
    

Solution 2: Use groupby with Conditional Logic

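If you prefer a single expression, you can group by Name and deduplicate only the groups whose name meets the length condition. This differs from Solution 1 in three ways: groupby sorts the groups by Name by default, so rows may come back in a different order; drop_duplicates() without subset compares every column, so rows that share a long Name but differ in State are all kept; and reset_index(drop=True) discards the original index. Recent pandas releases (2.2 and later) may also emit a DeprecationWarning because the lambda reads the grouping column.

# Group rows by Name; group_keys=False keeps Name out of the result index.
result_df = df.groupby('Name', group_keys=False).apply(
    # Every row in a group shares one Name, so checking the first row is enough.
    lambda group: group.drop_duplicates() if group['Name'].str.len().iloc[0] > 5 else group
).reset_index(drop=True)  # renumber rows from 0, discarding the original index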
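
If a one-step filter is the goal, another option is to skip groupby entirely and combine two boolean masks. This is a swapped-in technique based on DataFrame.duplicated rather than the groupby approach above, shown as a sketch; the names is_long and is_repeat are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Down", "Down", "Down", "Next In", "Next In", "Next In"],
    "State": ["NY", "NY", "NY", "NJ", "NJ", "NJ"],
})

# True for rows whose Name is longer than 5 characters.
is_long = df["Name"].str.len() > 5
# True for every row whose Name already appeared earlier in the frame.
is_repeat = df.duplicated(subset="Name", keep="first")

# Drop a row only when it is both long-named and a repeat.
result_df = df[~(is_long & is_repeat)]
print(result_df)
```

This keeps the original row order and index without any sorting, and it compares on Name only, like Solution 1.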

What You’ll Get

For the sample data, both methods produce this final DataFrame:

Name     State
Down     NY
Down     NY
Down     NY
Next In  NJ

This matches your requirement: every short-named row is preserved, while the long-named duplicates collapse to a single row.


This question was sourced from Stack Exchange; asked by Pav.
