
Question: optimizing how to balance per-class row counts in a multi-class DataFrame with Python pandas

Handling Class Imbalance in Sentiment Analysis with Scalable Sampling

Hey there! Your manual slicing approach works fine for a small set of classes like 1-5 stars, but I totally get why you’d want a better solution when dealing with 100+ categories: writing repetitive code for every class is tedious and error-prone. Let’s fix that with clean, scalable methods that adapt to any number of classes.

First, let’s recap your dataset’s imbalance for context:

When you run print(df.groupby('overall').count()), you get:

overall  reviewText
1.0      108725
2.0      82139
3.0      142257
4.0      347041
5.0      1009026

5-star reviews make up about 60% of the data (1,009,026 of 1,689,188 rows), which can skew your model toward positive predictions, so balancing classes is a smart call.
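To make the imbalance concrete, here is a small sketch that recomputes each class's share from the counts printed above. The counts are copied by hand into a Series for illustration; on the real data you would derive them from df itself.

```python
import pandas as pd

# Class counts copied from the groupby output above.
counts = pd.Series(
    {1.0: 108725, 2.0: 82139, 3.0: 142257, 4.0: 347041, 5.0: 1009026},
    name="reviewText",
)

# Share of each class as a percentage of all reviews.
shares = (counts / counts.sum() * 100).round(1)
print(shares)

# How many times larger the biggest class is than the smallest one.
imbalance_ratio = counts.max() / counts.min()
print(round(imbalance_ratio, 1))
```

The largest class (5 stars) is roughly twelve times the size of the smallest (2 stars), which is why a fixed per-class cap is a reasonable starting point.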

Scalable Solution 1: Slice Top N Samples per Class (Matches Your Original Logic)

Instead of manually slicing each class, use groupby() combined with head() to automate the process. This works for any number of classes, with no extra code per class:

import pandas as pd

# Define your target sample size per class
max_samples_per_class = 80000

# Keep the first max_samples_per_class rows of each class (or all rows if the class is smaller).
# GroupBy.head does this directly; groupby().apply() is avoided because pandas 2.2+ deprecates
# passing the grouping column ('overall') into the applied function.
balanced_df = df.groupby('overall').head(max_samples_per_class).reset_index(drop=True)
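If you want to sanity-check the per-class cap before running it on the full dataset, a toy sketch like the following may help. The data and the cap of 3 are made up for illustration, and the cap is applied with pandas' built-in GroupBy.head.

```python
import pandas as pd

# Hypothetical toy data: class 5.0 dominates, class 1.0 is rare.
toy_df = pd.DataFrame({
    "overall": [5.0] * 6 + [4.0] * 3 + [1.0] * 2,
    "reviewText": [f"review {i}" for i in range(11)],
})

max_samples_per_class = 3  # small cap so the effect is easy to see

# Keep the first N rows of every class; classes smaller than N are kept whole.
capped_df = (
    toy_df.groupby("overall")
    .head(max_samples_per_class)
    .reset_index(drop=True)
)

print(capped_df["overall"].value_counts().sort_index())
```

The two large classes are cut down to 3 rows each, while the 2-row class is left untouched.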

Scalable Solution 2: Random Sampling per Class (Avoids Order Bias)

If your dataset is ordered (e.g., reviews sorted by date), taking the first N samples might introduce bias. Instead, randomly sample up to your target size from each class:

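# Randomly sample up to max_samples_per_class rows from each class (keeps all rows if the class is smaller).
# Shuffling first and then taking head() per class avoids groupby().apply(), which pandas 2.2+
# deprecates for functions that rely on the grouping column.
balanced_df = (
    df.sample(frac=1, random_state=42)
      .groupby('overall')
      .head(max_samples_per_class)
      .reset_index(drop=True)
)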

Setting random_state=42 makes the sampling reproducible; feel free to remove it if you don’t need consistent results across runs.
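As a quick check of the reproducibility point, here is a toy sketch using a shuffle-then-head variant of per-class random sampling. The helper name sample_per_class and the toy data are hypothetical, and this is just one way to do it.

```python
import pandas as pd

# Hypothetical toy data, deliberately sorted by class to mimic an ordered dataset.
toy_df = pd.DataFrame({
    "overall": [5.0] * 6 + [4.0] * 3 + [1.0] * 2,
    "reviewText": [f"review {i}" for i in range(11)],
})

max_samples_per_class = 3


def sample_per_class(df, label_col, cap, seed):
    """Shuffle all rows, then keep at most `cap` rows of each class."""
    shuffled = df.sample(frac=1, random_state=seed)
    return shuffled.groupby(label_col).head(cap).reset_index(drop=True)


run_a = sample_per_class(toy_df, "overall", max_samples_per_class, seed=42)
run_b = sample_per_class(toy_df, "overall", max_samples_per_class, seed=42)

# Same seed gives the same rows; classes below the cap are kept whole.
print(run_a.equals(run_b))
print(run_a["overall"].value_counts().sort_index())
```

Because the whole frame is shuffled before the per-class cap is applied, each class ends up with a uniform random subset rather than its first rows.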

Bonus: Alternative to Resampling

If resampling feels too restrictive (for example, because it discards valuable data from the large classes), you can instead have the model account for the imbalance directly. Many scikit-learn classifiers, including LogisticRegression, accept class_weight='balanced', which automatically assigns higher weights to underrepresented classes during training. For example:

from sklearn.linear_model import LogisticRegression

model = LogisticRegression(class_weight='balanced')

This is great if you want to keep all your data while mitigating bias.
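To see what 'balanced' does under the hood, the sketch below reproduces the heuristic scikit-learn documents for it, n_samples / (n_classes * class_count), using the class counts from this dataset. It is plain Python for illustration and does not call scikit-learn itself.

```python
# Class counts copied from the groupby output earlier in this answer.
class_counts = {1.0: 108725, 2.0: 82139, 3.0: 142257, 4.0: 347041, 5.0: 1009026}

n_samples = sum(class_counts.values())
n_classes = len(class_counts)

# 'balanced' heuristic: weight = n_samples / (n_classes * class_count)
balanced_weights = {
    label: n_samples / (n_classes * count)
    for label, count in class_counts.items()
}

for label, weight in balanced_weights.items():
    print(label, round(weight, 2))
```

The rarest class (2 stars) gets the largest weight and the 5-star class the smallest, so every class contributes the same total weight to the loss. If you need finer control, class_weight also accepts a dict of label-to-weight pairs like this one.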

The question in this post comes from Stack Exchange; question author: Brice Frisco
