如何在Pandas内置rolling外实现高效自定义滚动窗口操作？

阿华AIGC实验室

2026-5-8

Fast Vectorized Solution for Rolling Rank Correlation Between Two DataFrames

Hey there! I totally get how frustrating those nested loops can be—they’re slow, clunky, and feel like they defeat the purpose of using Pandas in the first place. It’s true that Pandas doesn’t have a single built-in function to handle this exact workflow (rolling rank + cross-correlation between two DataFrames), but we can leverage vectorized operations with NumPy to make this run way faster.

The Core Idea

Instead of looping through every row and column (which forces Python to handle each tiny operation individually), we’ll use NumPy’s sliding window tools to batch-process all windows at once. This lets us take advantage of optimized C-level operations, which are orders of magnitude faster than Python loops.

Step-by-Step Implementation

Let’s walk through the code with explanations:

1. Import Required Libraries

import pandas as pd
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view
from scipy.stats import rankdata  # For accurate ranking matching Pandas' default

2. Convert DataFrames to NumPy Arrays

First, we’ll convert df1 and df2 to NumPy arrays for faster low-level operations:

arr1 = df1.to_numpy()
arr2 = df2.to_numpy()
window_size = 5

3. Generate Sliding Windows

Using sliding_window_view, we can create a view of all rolling windows for every column—this doesn’t copy data, so it’s memory-efficient:

# Shape after sliding: (number_of_columns, number_of_windows, window_size)
windows1 = sliding_window_view(arr1.T, window_size, axis=1)
windows2 = sliding_window_view(arr2.T, window_size, axis=1)

arr1.T transposes our data to (columns, rows) so we can slide along the row axis.
Each entry in windows1 is a 5-element window from one column of df1.

4. Compute Rolling Ranks

We’ll calculate ranks for every window, matching Pandas’ default method='average' using scipy.stats.rankdata:

# Rank each window along the window axis (axis=2)
rank1 = np.array([rankdata(window, axis=1) for window in windows1])
rank2 = np.array([rankdata(window, axis=1) for window in windows2])

If you don’t need to handle tied values (and want even faster speed), you can use double argsort instead (this gives min-ranking):

rank1 = windows1.argsort(axis=2).argsort(axis=2) + 1  # +1 to start ranks at 1
rank2 = windows2.argsort(axis=2).argsort(axis=2) + 1

5. Calculate Window-Wise Correlation

Now we’ll compute the Pearson correlation between the ranks of each corresponding window in df1 and df2:

# Compute correlation for each column's windows
corrs = np.array([np.corrcoef(r1, r2)[0, 1] for r1, r2 in zip(rank1, rank2)])

np.corrcoef returns a 2x2 matrix; we take the off-diagonal value ([0,1]) which is the correlation between the two rank sequences.

6. Convert Results Back to DataFrame

Finally, we’ll create df3 with the correlation results, padding the first 4 rows with NaN (since we need 5 rows to form the first valid window):

# Create result DataFrame with matching columns/index to original data
df3 = pd.DataFrame(
    data=np.vstack([np.full((window_size-1, arr1.shape[1]), np.nan), corrs.T]),
    columns=df1.columns,
    index=df1.index
)

Why This Is Faster

No Python loops: All heavy lifting is done in optimized NumPy/SciPy C code.
Memory-efficient sliding windows: sliding_window_view creates a view of the original data instead of copying it.
Batch processing: We handle all windows for all columns in parallel, rather than one at a time.

Alternative Pandas-Only Approach (Slower but More Readable)

If you prefer to stick with Pandas syntax (and don’t mind a small performance hit), you can use rolling.apply with a custom function. This avoids row loops but still loops over columns:

def rolling_rank_corr(x, df2_col):
    # x is a window from df1's column; get matching window from df2's column
    df2_window = df2_col.loc[x.index]
    # Compute ranks and correlation
    return np.corrcoef(x.rank(), df2_window.rank())[0, 1]

window_size = 5
df3 = pd.DataFrame()
for col in df1.columns:
    df3[col] = df1[col].rolling(window=window_size).apply(lambda x: rolling_rank_corr(x, df2[col]))

This is easier to read but will be slower than the NumPy approach, especially with many columns.

内容的提问来源于stack exchange，提问作者rollback