How to Avoid for Loops with a Pandas DataFrame: A Technical Question on Optimizing the Loop Efficiency of the moving_window Function

阿华AIGC实验室

2026-4-30

Optimizing Your Moving Window Logic in Pandas

Great question! That iterrows() loop is a classic bottleneck: pandas wasn't designed for row-by-row iteration, and this approach gets slower as your dataset grows. Let's refactor it to use pandas' rolling window support, which removes the manual slicing and most of the per-row overhead.

What's Wrong with the Original Code?

iterrows() has significant overhead compared to vectorized operations.
You're re-slicing comb_df for every row in input_df, which means recalculating overlapping window data hundreds (or thousands) of times.
Building the results Series by assigning to .loc[index] one row at a time causes repeated memory reallocations.

Optimized Implementation

We'll use pandas' built-in rolling() with a time-based window, which handles the window slicing efficiently under the hood:

import pandas as pd

def moving_window(self, input_df: pd.DataFrame, previous_df: pd.DataFrame):
    # Time-based rolling requires a monotonic index, so sort after concatenating
    comb_df = pd.concat([previous_df, input_df]).sort_index()

    # Convert integer timestamp indices to datetime (assuming your timestamps are in milliseconds)
    comb_df.index = pd.to_datetime(comb_df.index, unit='ms')

    # Create a rolling window matching your original logic:
    # window of `self.duration` seconds, closed on both start and end timestamps
    rolling_window = comb_df.rolling(window=f"{self.duration}s", closed='both')

    # Apply your custom function to each window.
    # rolling().apply() works column by column: with raw=False, `another_function`
    # receives one column's window as a Series (not a DataFrame) and must return a scalar
    full_results = rolling_window.apply(self.another_function, raw=False)

    # Filter results to only keep entries corresponding to input_df's original indices
    input_dt_index = pd.to_datetime(input_df.index, unit='ms')
    results = full_results.loc[input_dt_index]

    # Convert the datetime index back to integer milliseconds if needed
    # (this form does not depend on the index's internal resolution)
    results.index = (results.index - pd.Timestamp(0)) // pd.Timedelta(milliseconds=1)

    return results
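To try this on toy data, here is a self-contained sketch. The WindowProcessor class, the 2-second duration, and the body of another_function (a per-column range, max minus min) are hypothetical stand-ins, since the original class is not shown. The method body follows the same steps as above in condensed form.

```python
import pandas as pd


class WindowProcessor:
    """Hypothetical host class; only `duration` and the two methods matter."""

    def __init__(self, duration: int):
        self.duration = duration  # window length in seconds

    def another_function(self, window: pd.Series) -> float:
        # Stand-in aggregation: receives one column's window as a Series.
        return window.max() - window.min()

    def moving_window(self, input_df: pd.DataFrame, previous_df: pd.DataFrame):
        comb_df = pd.concat([previous_df, input_df]).sort_index()
        comb_df.index = pd.to_datetime(comb_df.index, unit="ms")
        rolling_window = comb_df.rolling(f"{self.duration}s", closed="both")
        full_results = rolling_window.apply(self.another_function, raw=False)
        results = full_results.loc[pd.to_datetime(input_df.index, unit="ms")]
        # Back to integer milliseconds since the epoch.
        results.index = (results.index - pd.Timestamp(0)) // pd.Timedelta(milliseconds=1)
        return results


# Integer indices are millisecond timestamps (an assumption of this sketch).
previous_df = pd.DataFrame({"value": [1.0, 4.0]}, index=[0, 1000])
input_df = pd.DataFrame({"value": [2.0, 8.0]}, index=[2000, 3000])

results = WindowProcessor(duration=2).moving_window(input_df, previous_df)
print(results)
# Row 2000 covers timestamps 0..2000 (values 1, 4, 2), so the range is 3.0.
# Row 3000 covers timestamps 1000..3000 (values 4, 2, 8), so the range is 6.0.
```

Only the rows that belong to input_df come back, but their windows still reach into previous_df, which is why the two frames are concatenated first.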

Why This Works Better

Efficient Window Bounds: rolling() computes the start and end of every window in compiled code, so comb_df is no longer re-sliced from scratch for each row of input_df.
No iterrows() Loop: The rolling().apply() call replaces your manual iterrows() loop and its per-row overhead. Your function is still called once per window (and per column), so that call is now the main remaining cost.
Memory Efficiency: Building the results in one go instead of row-by-row avoids repeated memory reallocations.

Key Notes

Timestamp Unit: Adjust the unit parameter in pd.to_datetime() if your timestamps aren't in milliseconds (e.g., unit='s' for seconds).
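another_function Compatibility: rolling().apply() calls the function separately for each column and passes that column's window as a Series (with raw=False) or a NumPy array (with raw=True), never the whole window as a DataFrame. Ensure another_function accepts that input and returns a scalar value. If it needs several columns at once, it has to be restructured before this approach applies. If you can rewrite another_function to use pandas' built-in vectorized functions instead of custom logic, you'll get even more speed.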
Window Closure: The closed='both' parameter matches your original logic where both the start and end timestamps are included in the window.
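The notes on apply() input and window closure can be checked on a toy frame. This is a small sketch, not part of the original code: the probe function and the column names a and b are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {"a": [1.0, 2.0, 3.0], "b": [10.0, 20.0, 30.0]},
    index=pd.to_datetime([0, 1000, 2000], unit="ms"),
)

seen = []  # records what the callback actually receives


def probe(window):
    # Hypothetical callback: log the argument's type and length, then sum it.
    seen.append((type(window).__name__, len(window)))
    return window.sum()


# closed="both" includes the row exactly one second back.
both = df.rolling("1s", closed="both").apply(probe, raw=False)

# The default for time-based windows is closed="right", which excludes it.
right = df.rolling("1s").sum()

print({kind for kind, _ in seen})  # the callback only ever sees Series objects
print(both["a"].tolist())          # [1.0, 3.0, 5.0]
print(right["a"].tolist())         # [1.0, 2.0, 3.0]
```

With closed='both', each row's window also contains the row exactly one window length earlier, which is why the sums differ from the default.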

Bonus: Even More Speed (If Possible)

If another_function can be rewritten to use pandas' built-in aggregation functions (like sum(), mean(), or custom vectorized operations), skip apply() entirely and use direct rolling aggregations. For example:

# Replace this if your logic allows it
full_results = rolling_window.mean()  # Or any built-in aggregation

This is typically much faster than apply(), often by an order of magnitude or more, since it avoids calling a Python function for each window.
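As a rough check of that claim, the sketch below computes the same rolling mean both ways on synthetic data and times each. The data, the 2-second window, and the use of np.mean as the custom function are assumptions for illustration; actual timings depend on your machine and data size.

```python
import timeit

import numpy as np
import pandas as pd

# Synthetic series: 1,000 points spaced 100 ms apart.
rng = np.random.default_rng(0)
index = pd.to_datetime(np.arange(1_000, dtype="int64") * 100, unit="ms")
df = pd.DataFrame({"value": rng.normal(size=1_000)}, index=index)

rolling_window = df.rolling("2s", closed="both")

# Same aggregation, once through a Python callback and once built in.
via_apply = rolling_window.apply(np.mean, raw=True)
via_builtin = rolling_window.mean()
pd.testing.assert_frame_equal(via_apply, via_builtin)

t_apply = timeit.timeit(lambda: rolling_window.apply(np.mean, raw=True), number=5)
t_builtin = timeit.timeit(lambda: rolling_window.mean(), number=5)
print(f"apply: {t_apply:.4f}s  built-in: {t_builtin:.4f}s")
```

The equality check confirms the two paths agree before comparing their speed, so the timing compares like with like.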

The question in this post comes from Stack Exchange; it was asked by gammawind.
