How to count, for each visit, a user's visits within the previous two years based on visit dates?

阿华AIGC实验室

2026-5-20

Got it, let's tackle this problem step by step. First, let's finish up that sample data code you started, then walk through two solid approaches to calculate the 2-year rolling visit count for each record.

Step 1: Complete the Sample Data

Your initial code was almost there—let's polish it to generate a clean, sorted dataset (sorting is critical for accurate rolling calculations):

import pandas as pd
import numpy as np

# Generate 8 unique random dates between 2010-01-01 and 2016-01-01
date_range = pd.date_range(pd.to_datetime('2010-01-01'), pd.to_datetime('2016-01-01'), freq='D')
date_samples = np.random.choice(date_range, 8, replace=False)

# Build the DataFrame with 2 users, 4 visits each
visits = {'user': list(np.repeat(1, 4)) + list(np.repeat(2, 4)), 
          'time': list(date_samples)}
df = pd.DataFrame(visits)

# Sort by user and visit time (mandatory for consistent results)
df = df.sort_values(by=['user', 'time']).reset_index(drop=True)
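Because the dates above are drawn at random, every run produces a different table, which makes the later counts hard to compare between runs. A minimal sketch of a reproducible variant follows, assuming NumPy's seeded `default_rng` generator is acceptable here; the seed value 42 and the name `sample` are arbitrary choices made for illustration.

```python
import numpy as np
import pandas as pd

# Seeded generator so repeated runs draw the same dates (the seed value is arbitrary)
rng = np.random.default_rng(42)

# Same date range as above, converted to a plain NumPy array before sampling
date_range = pd.date_range('2010-01-01', '2016-01-01', freq='D')
date_samples = rng.choice(date_range.to_numpy(), 8, replace=False)

# 2 users, 4 visits each, sorted by user and visit time
sample = pd.DataFrame({'user': np.repeat([1, 2], 4), 'time': date_samples})
sample = sample.sort_values(['user', 'time']).reset_index(drop=True)
print(sample)
```

Rerunning this block should print the same eight rows each time, so results from the two approaches can be compared across sessions.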

Step 2: Calculate the 2-Year Rolling Visit Count

We have two approaches here—one intuitive for small datasets, and one optimized for large datasets.

Approach 1: Intuitive (Small Datasets)

Use groupby with a per-row comparison to count visits within the 2-year window for each record. This is easy to read but slow on large data, because every visit is compared against all of that user's visits:

def calculate_2yr_visits(times):
    # For each visit, count the same user's visits that fall in [current_time - 730 days, current_time]
    window = pd.Timedelta(days=730)
    return times.apply(lambda t: ((times >= t - window) & (times <= t)).sum())

# Run the function on each user's visit times; transform keeps the result aligned with df's rows
df['2yr_visit_count'] = df.groupby('user')['time'].transform(calculate_2yr_visits)
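With random dates it is hard to tell whether the counts are right. The sketch below runs the same per-row comparison on a few fixed dates that can be checked by hand. The dates, the `demo` frame, and the helper name `count_visits_in_window` are made up for illustration.

```python
import pandas as pd

# Hypothetical fixed visit dates, chosen so the counts are easy to verify by hand
demo = pd.DataFrame({
    'user': [1, 1, 1, 2, 2],
    'time': pd.to_datetime(['2010-01-01', '2011-06-01', '2012-03-01',
                            '2010-01-01', '2013-01-01']),
})

def count_visits_in_window(times, days=730):
    # For each visit t, count visits in [t - days, t], both ends included
    window = pd.Timedelta(days=days)
    return times.apply(lambda t: ((times >= t - window) & (times <= t)).sum())

demo['2yr_visit_count'] = demo.groupby('user')['time'].transform(count_visits_in_window)
print(demo)
# Counts per row: 1, 2, 2 for user 1 and 1, 1 for user 2
```

For user 1, the 2012-03-01 visit reaches back 730 days to 2010-03-02, so the 2010-01-01 visit drops out and only two visits remain in the window. For user 2, the visits are three years apart, so each one counts only itself.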

Approach 2: Optimized (Large Datasets)

Use pandas' built-in rolling with a time window—this is vectorized and way faster for big datasets:

# Set the 'time' column as the index for time-based rolling operations
df = df.set_index('time')

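# Rolling count per user over a 730-day window (≈2 years); closed='both' includes the current record and a visit exactly 730 days earlier
rolling_counts = df.groupby('user')['user'].rolling(window='730D', closed='both').count()

# df is sorted by user and time, so the grouped result comes back in the same row order.
# Assign by position so rows that share a timestamp cannot be misaligned, and cast the float counts to int.
df['2yr_visit_count'] = rolling_counts.to_numpy().astype(int)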

# Reset the index to get back the original DataFrame structure
df = df.reset_index()
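One way to gain confidence in the rolling version is to compare it with a brute-force count on data small enough to check by hand. The following is a sketch under a few assumptions: the dates are made up, and the helper column `visit` plus the column names `rolling_count` and `brute_count` are introduced only for this example.

```python
import pandas as pd

# Hypothetical fixed visit dates, already sorted by user and time
demo = pd.DataFrame({
    'user': [1, 1, 1, 2, 2],
    'time': pd.to_datetime(['2010-01-01', '2011-06-01', '2012-03-01',
                            '2010-01-01', '2013-01-01']),
}).sort_values(['user', 'time']).reset_index(drop=True)

# Rolling count per user: index by time and count a helper column of ones
indexed = demo.set_index('time').assign(visit=1)
rolled = indexed.groupby('user')['visit'].rolling('730D', closed='both').count()

# The grouped result follows the (user, time) sort order, so assign by position
demo['rolling_count'] = rolled.to_numpy().astype(int)

# Brute-force reference: compare every visit against all visits of the same user
window = pd.Timedelta(days=730)
demo['brute_count'] = [
    int(((demo['user'] == u) & (demo['time'] >= t - window) & (demo['time'] <= t)).sum())
    for u, t in zip(demo['user'], demo['time'])
]

print(demo)
```

If the two count columns ever disagree on your own data, the usual suspects are unsorted input or a different `closed` setting.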

Key Notes

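Sorting First: Always sort by user and time before any rolling calculation. Time-based rolling needs each user's timestamps in increasing order; on unsorted data pandas will either raise an error or return incorrect counts.
Time Window Choice: '730D' is a fixed 730-day (365*2) window, which is one day short of two calendar years whenever the span contains February 29. Rolling offsets must have a fixed length, so a calendar frequency such as '2Y' is not accepted as a window. If you need calendar-based years (e.g., 2024-05-01 → 2022-05-01), compute the lower bound with pd.DateOffset(years=2) in the row-by-row comparison from Approach 1.
Closed Window: The closed='both' argument includes both the current visit and a visit exactly 730 days earlier. If you omit it, a time-based window defaults to closed='right', which still counts the current visit but excludes a visit exactly 730 days earlier.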
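To make the difference between a fixed 730-day window and a calendar two-year window concrete, here is a small sketch that computes both with a per-row comparison. It assumes `pd.DateOffset(years=2)` is the calendar rule you want. The dates and the helper names `count_calendar_window` and `count_fixed_window` are hypothetical.

```python
import pandas as pd

# Hypothetical visits for one user; the last window spans February 29, 2012
demo = pd.DataFrame({
    'user': [1, 1, 1],
    'time': pd.to_datetime(['2010-03-01', '2011-06-01', '2012-03-01']),
})

def count_calendar_window(times, years=2):
    # Count visits in [t - N calendar years, t], both ends included
    offset = pd.DateOffset(years=years)
    return times.apply(lambda t: ((times >= t - offset) & (times <= t)).sum())

def count_fixed_window(times, days=730):
    # Count visits in [t - N days, t], both ends included
    window = pd.Timedelta(days=days)
    return times.apply(lambda t: ((times >= t - window) & (times <= t)).sum())

by_user = demo.groupby('user')['time']
demo['calendar_count'] = by_user.transform(count_calendar_window)
demo['fixed_count'] = by_user.transform(count_fixed_window)
print(demo)
# calendar_count: 1, 2, 3   fixed_count: 1, 2, 2
```

The last row differs because 2012-03-01 minus two calendar years is 2010-03-01, while 2012-03-01 minus 730 days is 2010-03-02, so only the calendar window still contains the first visit.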

The question comes from Stack Exchange and was asked by Felix.