Asking for a Better Way to Balance Per-Class Row Counts in a Multi-Class DataFrame with Python Pandas

阿华AIGC实验室 (Ahua AIGC Lab)

2026-5-7

Handling Class Imbalance in Sentiment Analysis with Scalable Sampling

Hey there! Your manual slicing approach works fine for a small set of classes like 1-5 stars, but I totally get why you’d want a better solution when dealing with 100+ categories: writing repetitive code for every class is tedious and error-prone. Let’s fix that with clean, scalable methods that adapt to any number of classes.

First, let’s recap your dataset’s imbalance for context:

When you run print(df.groupby('overall').count()), you get:

         reviewText
overall
1.0          108725
2.0           82139
3.0          142257
4.0          347041
5.0         1009026

5-star reviews make up about 60% of the data (1,009,026 of 1,689,188 rows), which would skew your model toward positive predictions, so balancing classes is a smart call.
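
To put numbers on the imbalance, a short sketch like the following may help. It rebuilds the per-class counts quoted above as a Series (the names counts, shares and imbalance_ratio are only for illustration) and converts them to shares. On the real DataFrame, df['overall'].value_counts(normalize=True) should give the same proportions.

```python
import pandas as pd

# Per-class review counts copied from the groupby output above
counts = pd.Series(
    {1.0: 108725, 2.0: 82139, 3.0: 142257, 4.0: 347041, 5.0: 1009026},
    name="reviewText",
)

# Share of each class in the full dataset
shares = counts / counts.sum()

# Ratio between the largest and the smallest class
imbalance_ratio = counts.max() / counts.min()

print(shares.round(3))
print(f"largest/smallest class ratio: {imbalance_ratio:.1f}")
```

The ratio gives a quick feel for how much of the largest class any cap-based balancing will have to discard.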

Scalable Solution 1: Slice Top N Samples per Class (Matches Your Original Logic)

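Instead of manually slicing each class, group by the label column and call head() on the groups. GroupBy.head(n) keeps the first n rows of every group, so it works for any number of classes with no extra code. Wrapping group.head() in groupby().apply() gives the same rows on older pandas, but pandas 2.2 and later warn about apply() operating on the grouping column, so the built-in method is the safer choice:

import pandas as pd

# Define your target sample size per class
max_samples_per_class = 80000

# Keep the first max_samples_per_class rows of each class (or all rows if the class has fewer).
# Rows stay in their original relative order; reset_index just renumbers them.
balanced_df = df.groupby('overall').head(max_samples_per_class).reset_index(drop=True)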
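
As a quick sanity check, the per-class cap can be tried on a tiny made-up DataFrame before running it on 1.6 million rows. The sketch below uses GroupBy.head(n), which takes the first n rows of each group; toy_df, cap and capped are hypothetical names, and the data is invented purely for illustration.

```python
import pandas as pd

# Toy stand-in for the review data: class 5.0 is heavily over-represented
toy_df = pd.DataFrame({
    "overall": [1.0] * 3 + [2.0] * 2 + [5.0] * 10,
    "reviewText": [f"review {i}" for i in range(15)],
})

cap = 4  # illustrative per-class limit

# First `cap` rows of each class; classes smaller than the cap are kept whole
capped = toy_df.groupby("overall").head(cap).reset_index(drop=True)

# Rows per class after capping
print(capped["overall"].value_counts().sort_index())
```

Only the over-represented class is trimmed; classes below the cap keep all their rows, and the label column is still present in the result.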

Scalable Solution 2: Random Sampling per Class (Avoids Order Bias)

If your dataset is ordered (e.g., reviews sorted by date), taking the first N samples might introduce bias. Instead, randomly sample up to your target size from each class:

# Shuffle all rows once, then keep at most max_samples_per_class rows per class.
# The first rows of each class in a random permutation form a random sample of that class,
# and classes smaller than the cap keep every row.
balanced_df = (
    df.sample(frac=1, random_state=42)
      .groupby('overall')
      .head(max_samples_per_class)
      .reset_index(drop=True)
)

The random_state=42 makes the shuffle, and therefore the sample, reproducible. Feel free to remove it if you don’t need consistent results across runs.
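
One way to package per-class random sampling as a reusable helper is to shuffle the rows once and then keep the first rows of each class, which avoids calling sample() with an n larger than a small class. This is a sketch under that assumption rather than the only option; balance_classes, its parameters, and toy_df are hypothetical names introduced here.

```python
import pandas as pd


def balance_classes(df, label_col, max_per_class, seed=None):
    """Return a new DataFrame with at most max_per_class randomly chosen rows per class."""
    # Random permutation of all rows (reproducible when seed is set)
    shuffled = df.sample(frac=1, random_state=seed)
    # First max_per_class rows of each class, taken in shuffled order
    capped = shuffled.groupby(label_col).head(max_per_class)
    return capped.reset_index(drop=True)


# Invented data purely for illustration
toy_df = pd.DataFrame({
    "overall": [1.0] * 3 + [2.0] * 2 + [5.0] * 10,
    "reviewText": [f"review {i}" for i in range(15)],
})

first = balance_classes(toy_df, "overall", max_per_class=4, seed=42)
second = balance_classes(toy_df, "overall", max_per_class=4, seed=42)

print(first["overall"].value_counts().sort_index())
print("same result with same seed:", first.equals(second))
```

Because the helper takes the label column as an argument, the same call works unchanged for 5 classes or for several hundred.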

Bonus: Alternative to Resampling

If modifying your dataset feels restrictive (like discarding valuable data from large classes), you can adjust your model to account for imbalance directly. Many scikit-learn classifiers accept a class_weight='balanced' parameter, which weights each class inversely to its frequency so that underrepresented classes count for more during training. For example:

from sklearn.linear_model import LogisticRegression

model = LogisticRegression(class_weight='balanced')

This is great if you want to keep all your data while mitigating bias.
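
To see what 'balanced' means numerically: the scikit-learn documentation describes the weights as n_samples / (n_classes * np.bincount(y)). The sketch below applies that formula by hand to the class counts from this dataset, in plain Python. balanced_weights is an illustrative helper written for this answer, not a scikit-learn function.

```python
def balanced_weights(class_counts):
    """Per-class weights following n_samples / (n_classes * count_of_class)."""
    n_samples = sum(class_counts.values())
    n_classes = len(class_counts)
    return {
        label: n_samples / (n_classes * count)
        for label, count in class_counts.items()
    }


# Per-class review counts copied from the groupby output earlier in this answer
counts = {1.0: 108725, 2.0: 82139, 3.0: 142257, 4.0: 347041, 5.0: 1009026}

weights = balanced_weights(counts)
for label, weight in weights.items():
    print(f"{label}: {weight:.3f}")
```

The rarest class (2.0) ends up with the largest weight and the 5-star class with the smallest, which is how the model is nudged away from always predicting the majority label without dropping any rows.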

The question in this post comes from Stack Exchange; it was asked by Brice Frisco.
