How can I compute, for each visit, the number of visits in the preceding two years, based on users' visit dates?
Got it, let's tackle this problem step by step. First, let's finish up that sample data code you started, then walk through two solid approaches to calculate the 2-year rolling visit count for each record.
Step 1: Complete the Sample Data
Your initial code was almost there—let's polish it to generate a clean, sorted dataset (sorting is critical for accurate rolling calculations):
```python
import pandas as pd
import numpy as np

# Generate 8 unique random dates between 2010-01-01 and 2016-01-01
date_range = pd.date_range(pd.to_datetime('2010-01-01'), pd.to_datetime('2016-01-01'), freq='D')
date_samples = np.random.choice(date_range, 8, replace=False)

# Build the DataFrame with 2 users, 4 visits each
visits = {'user': list(np.repeat(1, 4)) + list(np.repeat(2, 4)),
          'time': list(date_samples)}
df = pd.DataFrame(visits)

# Sort by user and visit time (mandatory for consistent results)
df = df.sort_values(by=['user', 'time']).reset_index(drop=True)
```
Step 2: Calculate the 2-Year Rolling Visit Count
We have two approaches here—one intuitive for small datasets, and one optimized for large datasets.
Approach 1: Intuitive (Small Datasets)
Use groupby + apply to count visits within the 2-year window for each record. This is easy to read but slower on big data:
```python
def calculate_2yr_visits(times, window=pd.Timedelta(days=730)):
    # For each visit time t, count the same user's visits that fall in [t - window, t]
    return times.apply(lambda t: ((times >= t - window) & (times <= t)).sum())

# Apply the function to each user's visit times. Selecting only the 'time' column
# avoids mutating the group inside apply, and group_keys=False keeps the original
# row index so the result aligns with df on assignment.
df['2yr_visit_count'] = df.groupby('user', group_keys=False)['time'].apply(calculate_2yr_visits)
```
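To sanity-check the per-row window count, it can help to run it on a tiny dataset whose answers are easy to verify by hand. The sketch below uses hypothetical, hand-picked dates (they are not from the original question) and a helper named `count_in_window`, which is an illustrative name only.

```python
import pandas as pd

# Hypothetical, hand-picked visits chosen so the counts can be checked by hand.
demo = pd.DataFrame({
    "user": [1, 1, 1, 1, 2, 2],
    "time": pd.to_datetime([
        "2010-01-01", "2011-06-01", "2012-03-01", "2014-01-01",
        "2013-05-01", "2013-05-02",
    ]),
})

WINDOW = pd.Timedelta(days=730)

def count_in_window(times: pd.Series) -> pd.Series:
    """For each timestamp t, count the timestamps that fall in [t - WINDOW, t]."""
    return times.apply(lambda t: int(((times >= t - WINDOW) & (times <= t)).sum()))

# group_keys=False keeps the original row index so the result aligns with demo.
demo["2yr_visit_count"] = (
    demo.groupby("user", group_keys=False)["time"].apply(count_in_window)
)
print(demo)
# User 1 counts: 1, 2, 2, 2 (the 2010-01-01 visit is more than 730 days before 2012-03-01)
# User 2 counts: 1, 2
```

Each row is compared against every other row of the same user, so the cost grows quadratically with the number of visits per user, which is why this approach suits small datasets.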
Approach 2: Optimized (Large Datasets)
Use pandas' built-in rolling with a time window—this is vectorized and way faster for big datasets:
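```python
# Set the 'time' column as the index for time-based rolling operations
df = df.set_index('time')

# Rolling count per user: the window is 730 days (≈2 years); closed='both'
# includes the current record and a visit exactly 730 days earlier
counts = df.groupby('user')['user'].rolling(window='730D', closed='both').count()

# The result is ordered by user, then time, which matches df because it was sorted
# that way in Step 1. Assigning by position avoids index-alignment problems when two
# users share a date; the cast turns the float counts into integers.
df['2yr_visit_count'] = counts.to_numpy().astype(int)

# Reset the index to get back the original DataFrame structure
df = df.reset_index()
```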
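A small sketch of how the `closed` argument changes the time-based rolling count at the window boundary. The dates are hypothetical and chosen so that two of user 1's visits are exactly 730 days apart; the helper `rolling_counts` and the `visit` column are illustrative names, and the code assumes the frame is already sorted by `user` and `time`.

```python
import pandas as pd

# Hypothetical visits; 2013-01-01 and 2015-01-01 are exactly 730 days apart.
demo = pd.DataFrame({
    "user": [1, 1, 1, 2, 2],
    "time": pd.to_datetime([
        "2013-01-01", "2014-06-01", "2015-01-01",
        "2013-05-01", "2016-01-01",
    ]),
}).sort_values(["user", "time"]).reset_index(drop=True)

def rolling_counts(frame: pd.DataFrame, closed: str):
    """Count visits per user in a 730-day window ending at each visit."""
    # A helper column of ones gives rolling() something to count.
    indexed = frame.assign(visit=1).set_index("time")
    counts = indexed.groupby("user")["visit"].rolling("730D", closed=closed).count()
    # Results come back ordered by user, then time, matching the sorted frame.
    return counts.to_numpy().astype(int)

both = rolling_counts(demo, "both")    # window [t - 730D, t]
right = rolling_counts(demo, "right")  # window (t - 730D, t]
print(both)   # [1 2 3 1 1]
print(right)  # [1 2 2 1 1]
```

The only difference is the 2015-01-01 row: with `closed='both'` the visit exactly 730 days earlier is counted, and with `closed='right'` it is not.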
Key Notes
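- Sorting First: Always sort by `user` and `time` before any rolling calculations. Time-based rolling expects each user's timestamps to be in ascending order, so unsorted data can raise an error or give misleading counts.
- Time Window Choice: `'730D'` is a fixed 730-day (365*2) window, which falls one day short of two calendar years whenever the span contains a leap day. `rolling` only accepts fixed-width offsets, so a calendar alias such as `'2Y'` will not work as a window. For calendar-based years (e.g., 2024-05-01 → 2022-05-01), subtract `pd.DateOffset(years=2)` in the Approach 1 comparison instead of `pd.Timedelta(days=730)`.
- Closed Window: `closed='both'` includes both the current visit and a visit exactly 730 days earlier. If you omit it, a time-based window defaults to `closed='right'`, which still counts the current visit but excludes the one exactly 730 days earlier.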
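To illustrate the difference between a fixed 730-day window and a calendar two-year window, here is a minimal sketch using `pd.DateOffset(years=2)` with a per-row comparison. The two dates are hypothetical and span the 2012 leap day; `count_with_offset` is an illustrative helper name, not part of pandas.

```python
import pandas as pd

# Hypothetical visits exactly two calendar years (731 days) apart.
visits = pd.Series(pd.to_datetime(["2012-01-01", "2014-01-01"]))

def count_with_offset(times: pd.Series, offset) -> pd.Series:
    """For each timestamp t, count the timestamps in [t - offset, t]."""
    return times.apply(lambda t: int(((times >= t - offset) & (times <= t)).sum()))

fixed = count_with_offset(visits, pd.Timedelta(days=730))    # fixed-width window
calendar = count_with_offset(visits, pd.DateOffset(years=2)) # calendar-aware window

print(fixed.tolist())     # [1, 1]  2014-01-01 minus 730 days is 2012-01-02
print(calendar.tolist())  # [1, 2]  2014-01-01 minus 2 years is 2012-01-01
```

Which of the two is right depends on how "two years" is defined for the analysis; the calendar version has to use the per-row comparison, since it cannot be expressed as a fixed-width rolling window.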
This question originates from Stack Exchange; it was asked by Felix.




