How can I avoid for loops with a Pandas DataFrame? A technical question on optimizing the loop efficiency of the moving_window function
Great question! That iterrows() loop is a classic bottleneck: pandas wasn't designed for row-by-row iteration, and this approach gets slower as your dataset grows. Let's refactor it to use pandas' time-based rolling windows, which removes the per-row slicing and should give a noticeable speedup.
What's Wrong with the Original Code?
- iterrows() has significant overhead compared to vectorized operations.
- You're re-slicing comb_df for every row in input_df, which means recalculating overlapping window data hundreds (or thousands) of times.
- Building the results Series by assigning to .loc[index] one row at a time causes repeated memory reallocations.
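The question's loop isn't reproduced here, so the following is only a hypothetical reconstruction of the pattern described above, useful as a baseline to compare against. The function name moving_window_loop, the duration_ms parameter, and the toy data are all assumptions, not the asker's code.

```python
import pandas as pd

def moving_window_loop(input_df, previous_df, duration_ms, func):
    """Hypothetical row-by-row version: one slice and one assignment per row."""
    comb_df = pd.concat([previous_df, input_df]).sort_index()
    results = pd.Series(dtype="float64")
    for index, _row in input_df.iterrows():
        # Label-based slice on a sorted integer index; both endpoints are included
        window = comb_df.loc[index - duration_ms:index]
        # Assigning to a new label grows the Series one entry at a time
        results.loc[index] = func(window)
    return results

# Toy data with integer millisecond timestamps (made up for illustration)
previous_df = pd.DataFrame({"value": [1.0, 2.0]}, index=[0, 1000])
input_df = pd.DataFrame({"value": [3.0, 4.0]}, index=[2000, 3000])

loop_result = moving_window_loop(
    input_df, previous_df, duration_ms=2000, func=lambda w: w["value"].sum()
)
```

Every iteration pays for a Series construction in iterrows(), a fresh slice of comb_df, and an enlargement of results, which is where the cost listed above comes from.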
Optimized Implementation
We'll use pandas' built-in rolling() with a time-based window, which handles the window slicing efficiently under the hood:
    import pandas as pd

    def moving_window(self, input_df: pd.DataFrame, previous_df: pd.DataFrame):
        # Time-based rolling windows require a sorted (monotonic) index
        comb_df = pd.concat([previous_df, input_df]).sort_index()
        # Convert integer timestamp indices to datetime (assuming your timestamps are in milliseconds)
        comb_df.index = pd.to_datetime(comb_df.index, unit='ms')
        # Create a rolling window matching your original logic:
        # window of `self.duration` seconds, closed on both start and end timestamps
        rolling_window = comb_df.rolling(window=f"{self.duration}s", closed='both')
        # Apply your custom function to each window.
        # rolling().apply() works column by column: with raw=False, `another_function`
        # receives each column's window as a Series (not a DataFrame) and must return a scalar
        full_results = rolling_window.apply(self.another_function, raw=False)
        # Filter results to only keep entries corresponding to input_df's original indices
        input_dt_index = pd.to_datetime(input_df.index, unit='ms')
        results = full_results.loc[input_dt_index]
        # Convert the datetime index back to integer milliseconds if needed
        # (written so it does not depend on the index's internal resolution)
        results.index = (results.index - pd.Timestamp(0)) // pd.Timedelta(milliseconds=1)
        return results
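To see what the time-based rolling() call does on its own, here is a minimal sketch on made-up data. The column names, the 2-second window, and the helper window_sum are assumptions for illustration; the helper also records the type of object pandas passes in, which is worth checking before plugging in your own function.

```python
import pandas as pd

# Toy data: integer millisecond timestamps, one reading per second
df = pd.DataFrame(
    {"a": [1.0, 2.0, 3.0, 4.0], "b": [10.0, 20.0, 30.0, 40.0]},
    index=[0, 1000, 2000, 3000],
)
df.index = pd.to_datetime(df.index, unit="ms")

seen_types = set()

def window_sum(window):
    # Record what pandas actually hands to the function
    seen_types.add(type(window).__name__)
    return window.sum()

# 2-second window, including both the start and the end timestamp
rolled = df.rolling(window="2s", closed="both").apply(window_sum, raw=False)
```

With closed="both", the window ending at 2 s covers the rows at 0 s, 1 s, and 2 s, so column a comes out as 1, 3, 6, 9. The function is called once per window for each column and receives a Series each time.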
Why This Works Better
- Efficient Window Calculation: rolling() works out each window's start and end positions on the sorted datetime index in compiled code, instead of re-slicing comb_df from scratch for every row.
- No iterrows(): The rolling().apply() call replaces your manual iterrows() loop, so no row Series is built for every row. Note that apply() still calls another_function once per window (and per column), so the function itself remains the main cost.
- Memory Efficiency: Building the results in one go instead of row-by-row avoids repeated memory reallocations.
Key Notes
- Timestamp Unit: Adjust the unit parameter in pd.to_datetime() if your timestamps aren't in milliseconds (e.g., unit='s' for seconds).
- another_function Compatibility: rolling().apply() runs column by column, so another_function receives each column's window as a Series (or as a NumPy array with raw=True), not a DataFrame, and must return a scalar value. If your function needs several columns at once, rolling().apply() is not a drop-in replacement. If you can rewrite another_function to use pandas' built-in vectorized functions instead of custom logic, you'll get even more speed.
- Window Closure: The closed='both' parameter matches your original logic where both the start and end timestamps are included in the window.
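If another_function really needs the whole window as a DataFrame (several columns together), rolling().apply() won't provide that, since it passes one column at a time. One possible alternative, sketched below under the assumption of a sorted integer-millisecond index, is to compute all window boundaries at once with np.searchsorted and slice by position. The helper window_apply_frames and the volume-weighted price function are hypothetical, and this still calls the function once per window.

```python
import numpy as np
import pandas as pd

def window_apply_frames(comb_df, target_index, duration_ms, func):
    """Call func on each [t - duration_ms, t] slice of comb_df (sorted integer index)."""
    times = comb_df.index.to_numpy()
    targets = np.asarray(target_index)
    # Position of the first row with time >= t - duration_ms
    starts = np.searchsorted(times, targets - duration_ms, side="left")
    # One past the position of the last row with time <= t
    stops = np.searchsorted(times, targets, side="right")
    values = [func(comb_df.iloc[lo:hi]) for lo, hi in zip(starts, stops)]
    return pd.Series(values, index=target_index)

# Made-up data for illustration
comb_df = pd.DataFrame(
    {"price": [1.0, 2.0, 3.0, 4.0], "volume": [10.0, 10.0, 20.0, 20.0]},
    index=[0, 1000, 2000, 3000],
)

# Hypothetical multi-column function: volume-weighted average price per window
vwap = window_apply_frames(
    comb_df,
    comb_df.index[2:],
    2000,
    lambda w: (w["price"] * w["volume"]).sum() / w["volume"].sum(),
)
```

The boundary search is vectorized and positional slicing is cheap, so the remaining cost is mostly the function itself.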
Bonus: Even More Speed (If Possible)
If another_function can be rewritten to use pandas' built-in aggregation functions (like sum(), mean(), or custom vectorized operations), skip apply() entirely and use direct rolling aggregations. For example:
    # Replace this if your logic allows it
    full_results = rolling_window.mean()  # Or any built-in aggregation
This is typically much faster than apply(), because the aggregation runs in compiled code instead of calling a Python function for each window.
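A quick way to convince yourself that the built-in aggregation is a safe substitute is to compare it against the apply() version on synthetic data. The sketch below assumes a mean over a 60-second window. The data, the column name, and the window length are made up.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# 5000 rows, one second apart, built from integer millisecond timestamps
index = pd.to_datetime(np.arange(0, 5_000_000, 1000), unit="ms")
df = pd.DataFrame({"value": rng.normal(size=len(index))}, index=index)

rolling_window = df.rolling(window="60s", closed="both")
fast = rolling_window.mean()                    # built-in aggregation
slow = rolling_window.apply(np.mean, raw=True)  # one Python call per window

# Both approaches should agree up to floating-point rounding
same = np.allclose(fast["value"], slow["value"])
```

Wrapping the two calls in timeit on your own data shows how large the gap is in practice. It depends on the window size and on how expensive the custom function is.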
The question comes from Stack Exchange; it was asked by gammawind.




