Python beginner question: using Pandas to compute averages over 2 months of data at a 10-minute frequency
Hey there! Since you’ve already got the basics down—setting a datetime index, grouping transactions into 10-minute windows, and retaining the Respuesta, OperationId, and SucursalId columns—let’s build out that average baseline using your 2 months of daily appended data. Here’s a step-by-step breakdown tailored to your needs:
Step 1: Combine Your Daily Appended Data
First, we need to bring all your daily datasets into a single DataFrame. If your data is stored in separate files (e.g., CSV files named like transactions_2024-01-01.csv), use this code to concatenate them:
    import pandas as pd
    import glob

    # Grab all daily data files (adjust the file pattern to match your naming)
    daily_files = glob.glob('transactions_*.csv')

    # Combine all files into one DataFrame
    full_dataset = pd.concat([pd.read_csv(file) for file in daily_files], ignore_index=True)

    # Ensure your timestamp column is converted to datetime and set as the index
    # Replace 'timestamp_column' with your actual time column name
    full_dataset['timestamp_column'] = pd.to_datetime(full_dataset['timestamp_column'])
    full_dataset.set_index('timestamp_column', inplace=True)
Step 2: Choose Your Baseline Type
There are two common types of 10-minute baselines—pick the one that fits your use case:
Option 1: Global 10-Minute Window Averages
This calculates the average for every unique 10-minute interval across your entire 2-month dataset (e.g., average values for 2024-01-01 09:00-09:10, 2024-01-01 09:10-09:20, etc.):
    # Resample to 10-minute intervals and compute the mean of your target columns
    global_baseline = full_dataset[['Respuesta', 'OperationId', 'SucursalId']].resample('10min').mean()

- resample('10min') groups your data into 10-minute chunks (min stands for minutes; the older alias T is deprecated in recent pandas releases)
- mean() computes the average for each chunk
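To see how the binning behaves, here is a minimal sketch on made-up data. The toy frame and its values are hypothetical; only the resample call mirrors the step above.

```python
import pandas as pd

# Hypothetical toy data: five Respuesta readings at irregular times
toy = pd.DataFrame(
    {"Respuesta": [1.0, 3.0, 10.0, 20.0, 5.0]},
    index=pd.to_datetime([
        "2024-01-01 09:01", "2024-01-01 09:07",   # both fall in the 09:00 window
        "2024-01-01 09:12", "2024-01-01 09:19",   # both fall in the 09:10 window
        "2024-01-01 09:35",                       # falls in the 09:30 window
    ]),
)

# Each window is labelled by its start time; the empty 09:20 window becomes NaN
toy_baseline = toy.resample("10min").mean()
print(toy_baseline)
```

The 09:00 row averages 1.0 and 3.0, the 09:10 row averages 10.0 and 20.0, and the 09:20 row is NaN because no reading landed there.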
Option 2: Time-of-Day 10-Minute Baselines
If you want a baseline that represents the average value for the same 10-minute window every day (e.g., average of all 09:00-09:10 transactions across all 2 months), use this approach:
    # Floor each timestamp to the start of its 10-minute window, then keep only the time of day
    full_dataset['time_window'] = full_dataset.index.floor('10min').strftime('%H:%M')

    # Group by the 10-minute time window and compute the average
    time_of_day_baseline = full_dataset.groupby('time_window')[['Respuesta', 'OperationId', 'SucursalId']].mean()

    # Optional: sort the baseline by time for readability ('HH:MM' strings sort chronologically)
    time_of_day_baseline.sort_index(inplace=True)
This gives you a reusable baseline that you can compare against new daily data (e.g., check if today’s 09:00-09:10 Respuesta values are above or below the 2-month average for that window).
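One thing worth checking is how the days are weighted. The sketch below, on hypothetical two-day data, compares averaging every transaction in a window directly with averaging each day's window mean first. Which one you want depends on whether busy days should count for more.

```python
import pandas as pd

# Hypothetical data: the 09:00 window has two rows on day 1 and one row on day 2
df = pd.DataFrame(
    {"Respuesta": [2.0, 4.0, 6.0, 10.0]},
    index=pd.to_datetime([
        "2024-01-01 09:03", "2024-01-01 09:08",
        "2024-01-02 09:05", "2024-01-02 09:14",
    ]),
)

# Transaction-weighted: every row counts equally within its time-of-day window
df["time_window"] = df.index.floor("10min").strftime("%H:%M")
per_transaction = df.groupby("time_window")["Respuesta"].mean()

# Day-weighted: average within each calendar window first, then across days
per_day = df["Respuesta"].resample("10min").mean().dropna()
day_weighted = per_day.groupby(per_day.index.strftime("%H:%M")).mean()

print(per_transaction)
print(day_weighted)
```

For the 09:00 window the two approaches give 4.0 and 4.5 respectively, because day 1 contributes two rows to the first calculation but only one daily mean to the second.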
Step 3: Handle Edge Cases
- Missing Values: If your dataset has gaps, clean them up before computing averages to avoid skewed results:

    full_dataset = full_dataset.dropna(subset=['Respuesta', 'OperationId', 'SucursalId'])

- Irregular Timestamps: If your data doesn’t align perfectly with 10-minute windows, resample() still assigns every row to its window, and any window that contains no rows comes out as NaN after mean(). You can fill those with ffill() or fillna(0) if needed, depending on your data.
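A small sketch of the empty-window case, using a hypothetical two-row series, shows what each fill strategy does. Counting rows per window is added here as an assumption of mine: it is one way to tell "no transactions" apart from "transactions that averaged zero".

```python
import pandas as pd

# Hypothetical series with nothing recorded between 09:10 and 09:20
s = pd.Series(
    [4.0, 8.0],
    index=pd.to_datetime(["2024-01-01 09:02", "2024-01-01 09:27"]),
    name="Respuesta",
)

means = s.resample("10min").mean()    # the 09:10 window is NaN
carried = means.ffill()               # 09:10 reuses the previous window's value
zeroed = means.fillna(0)              # 09:10 is treated as zero
counts = s.resample("10min").count()  # number of rows per window

print(pd.DataFrame({"mean": means, "ffill": carried, "zero": zeroed, "n": counts}))
```

Forward-filling suits slowly changing measurements, while filling with zero suits quantities where "nothing happened" really means zero.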
Step 4: Use the Baseline with New Data
To compare new daily transactions against your baseline (using the time-of-day example):
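    # Load and prepare new daily data
    new_daily_data = pd.read_csv('new_transactions.csv')
    new_daily_data['timestamp_column'] = pd.to_datetime(new_daily_data['timestamp_column'])
    new_daily_data.set_index('timestamp_column', inplace=True)

    # Label each row with the start of its 10-minute window as 'HH:MM'
    new_daily_data['time_window'] = new_daily_data.index.floor('10min').strftime('%H:%M')

    # Expose the baseline's window labels as an 'HH:MM' string column to merge on
    baseline = time_of_day_baseline.rename_axis('time_window').reset_index()
    baseline['time_window'] = baseline['time_window'].astype(str).str[:5]

    # Merge the baseline with new data, keeping every new row and its timestamp
    new_data_with_baseline = new_daily_data.reset_index().merge(
        baseline,
        on='time_window',
        how='left',
        suffixes=('_actual', '_baseline')
    )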
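Once the actual and baseline columns sit side by side, the comparison itself is a subtraction. The sketch below uses a hypothetical stand-in for the merged frame (the values are invented; the column names follow the _actual and _baseline suffixes used in the merge).

```python
import pandas as pd

# Hypothetical stand-in for the merged frame: one row per new transaction
merged = pd.DataFrame({
    "time_window": ["09:00", "09:00", "09:10"],
    "Respuesta_actual": [5.0, 3.0, 12.0],
    "Respuesta_baseline": [4.0, 4.0, 10.0],
})

# Absolute and relative deviation from the 2-month average for that window
merged["deviation"] = merged["Respuesta_actual"] - merged["Respuesta_baseline"]
merged["pct_deviation"] = 100 * merged["deviation"] / merged["Respuesta_baseline"]

# Rows that came in above the baseline
above = merged[merged["deviation"] > 0]

# Average deviation per window, for a one-line-per-window summary
summary = merged.groupby("time_window")["deviation"].mean()
print(summary)
```

Note that pct_deviation divides by the baseline, so windows whose baseline is zero would need separate handling.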
Let me know if you need to adjust this for specific edge cases (like splitting baseline by SucursalId or handling weekend vs weekday differences)!
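As a possible starting point for the per-branch and weekday/weekend variants, here is a minimal sketch on hypothetical data. The is_weekend column is a name I made up for illustration, and treating Saturday and Sunday as the weekend is an assumption you may need to change.

```python
import pandas as pd

# Hypothetical transactions for two branches on a Friday and a Saturday
df = pd.DataFrame(
    {
        "SucursalId": [1, 1, 2, 1],
        "Respuesta": [2.0, 4.0, 10.0, 8.0],
    },
    index=pd.to_datetime([
        "2024-01-05 09:03",  # Friday
        "2024-01-05 09:07",  # Friday
        "2024-01-05 09:04",  # Friday
        "2024-01-06 09:02",  # Saturday
    ]),
)

df["time_window"] = df.index.floor("10min").strftime("%H:%M")
df["is_weekend"] = df.index.dayofweek >= 5  # Monday is 0, so 5 and 6 are Sat/Sun

# One baseline value per branch, day type, and 10-minute window
segmented = (
    df.groupby(["SucursalId", "is_weekend", "time_window"])["Respuesta"]
    .mean()
    .reset_index()
)
print(segmented)
```

Here SucursalId is used as a grouping key rather than averaged, which is usually what you want for an identifier column.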
This question comes from Stack Exchange; it was asked by jufegome.




