How do I calculate, for each visit, the number of visits in the preceding two years based on user visit dates?

Got it, let's tackle this problem step by step. First, let's finish up that sample data code you started, then walk through two solid approaches to calculate the 2-year rolling visit count for each record.

Step 1: Complete the Sample Data

Your initial code was almost there—let's polish it to generate a clean, sorted dataset (sorting is critical for accurate rolling calculations):

import pandas as pd
import numpy as np

# Generate 8 unique random dates between 2010-01-01 and 2016-01-01
date_range = pd.date_range('2010-01-01', '2016-01-01', freq='D')
date_samples = np.random.choice(date_range, 8, replace=False)

# Build the DataFrame with 2 users, 4 visits each
df = pd.DataFrame({'user': np.repeat([1, 2], 4), 'time': date_samples})

# Sort by user and visit time (mandatory for consistent results)
df = df.sort_values(by=['user', 'time']).reset_index(drop=True)

Step 2: Calculate the 2-Year Rolling Visit Count

We have two approaches here—one intuitive for small datasets, and one optimized for large datasets.

Approach 1: Intuitive (Small Datasets)

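Use groupby + apply to count, for each record, the same user's visits inside the 2-year window. This is easy to read, but each visit is compared against all of that user's visits, so the cost grows quadratically with the number of visits per user: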

def count_visits_in_window(times):
    # times: one user's visit timestamps.
    # For each visit, count that user's visits in [visit - 730 days, visit] (730 days ≈ 2 years)
    window = pd.Timedelta(days=730)
    return times.apply(lambda t: ((times >= t - window) & (times <= t)).sum())

# Apply the function to each user's 'time' column.
# group_keys=False keeps the original row index, so the result aligns with df.
# Working on the 'time' column only also avoids passing the grouping column into apply,
# which newer pandas versions deprecate.
df['2yr_visit_count'] = df.groupby('user', group_keys=False)['time'].apply(count_visits_in_window)
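
To sanity-check the window logic by hand, here is a minimal sketch on a small fixed dataset. The dates are made up for illustration (they are not from the original question), and the helper name count_in_window is hypothetical. It assumes the same inclusive window of [visit - 730 days, visit].

```python
import pandas as pd

# Hypothetical, hand-checkable data (already sorted by user and time)
df = pd.DataFrame({
    'user': [1, 1, 1, 1, 2, 2],
    'time': pd.to_datetime(['2010-01-01', '2011-06-01', '2012-03-01', '2014-01-01',
                            '2013-05-01', '2015-06-01']),
})

window = pd.Timedelta(days=730)

def count_in_window(times):
    # For each timestamp t, count this user's visits in [t - 730 days, t]
    return times.apply(lambda t: int(((times >= t - window) & (times <= t)).sum()))

# group_keys=False keeps the original row index so the counts align with df
df['2yr_visit_count'] = df.groupby('user', group_keys=False)['time'].apply(count_in_window)
print(df)
```

For user 1, the visit on 2012-03-01 gets a count of 2: 2011-06-01 is inside the window, while 2010-01-01 is 790 days earlier and falls outside it. For user 2, the two visits are 761 days apart, so each one only counts itself.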

Approach 2: Optimized (Large Datasets)

Use pandas' built-in rolling with a time window—this is vectorized and way faster for big datasets:

# Set the 'time' column as the index for time-based rolling operations
df = df.set_index('time')

# Rolling count per user: the window is 730 days (≈2 years);
# closed='both' includes the current visit and a visit exactly 730 days earlier
counts = df.groupby('user')['user'].rolling(window='730D', closed='both').count()

# The result is ordered by user, then time, which matches df after the sort in Step 1,
# so assign by position (this also avoids alignment errors if two users share a timestamp)
df['2yr_visit_count'] = counts.to_numpy().astype(int)

# Reset the index to get back the original DataFrame structure
df = df.reset_index()
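
If you want some reassurance that the rolling version matches the intuitive one, a quick cross-check like the sketch below can help. The dataset is hypothetical, and the helper column visit is an assumption added only so there is a plain column to count. It compares a brute-force count with the time-based rolling count on the same sorted rows.

```python
import pandas as pd

# Hypothetical data, deliberately unsorted; user 2's visits are exactly 730 days apart
df = pd.DataFrame({
    'user': [2, 1, 1, 2, 1, 1],
    'time': pd.to_datetime(['2015-06-01', '2012-03-01', '2010-01-01',
                            '2013-06-01', '2014-01-01', '2011-06-01']),
}).sort_values(['user', 'time']).reset_index(drop=True)

# Brute force: for each row, count the same user's visits in [t - 730 days, t]
window = pd.Timedelta(days=730)
brute = [
    int(((df['user'] == u) & (df['time'] >= t - window) & (df['time'] <= t)).sum())
    for u, t in zip(df['user'], df['time'])
]

# Rolling: time-based window per user on a helper column of ones
rolled = (
    df.assign(visit=1)
      .set_index('time')
      .groupby('user')['visit']
      .rolling('730D', closed='both')
      .count()
)
rolling_counts = rolled.to_numpy().astype(int).tolist()

print(brute)
print(rolling_counts)
```

Both lists should be identical. The pair of visits exactly 730 days apart is included on purpose, since that boundary is where closed='both' matters.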

Key Notes

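  • Sorting First: Always sort by user and time before any rolling calculation. Time-based rolling needs each user's timestamps in chronological order, and recent pandas versions raise a ValueError when they are not monotonic. Approach 1 compares timestamps directly, so it does not depend on row order.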
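  • Time Window Choice: '730D' is a fixed 730-day window, which only approximates 2 years (a two-year span that contains Feb 29 is 731 days long). Time-based rolling windows must be fixed-length offsets, so a calendar offset such as '2Y' is not accepted by rolling. If you want calendar-based years (e.g., 2024-05-01 → 2022-05-01), use Approach 1 and subtract pd.DateOffset(years=2) instead of pd.Timedelta(days=730).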
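  • Closed Window: closed='both' makes the window [t - 730 days, t], so it includes both the current visit and a visit exactly 730 days earlier. If you omit it, time-based windows default to closed='right', i.e. (t - 730 days, t], which still includes the current visit but excludes one exactly 730 days earlier.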
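
The window notes above are easiest to see on boundary cases. The sketch below uses made-up dates: one pair exactly 730 days apart to compare closed='both' with the default, and one pair spanning a leap day to compare a fixed 730-day window with a calendar 2-year window via pd.DateOffset.

```python
import pandas as pd

# Two visits exactly 730 days apart (no Feb 29 in between)
times = pd.to_datetime(['2013-06-01', '2015-06-01'])
s = pd.Series(1, index=times)

# closed='both' keeps the visit sitting exactly on the left edge of the window
both = s.rolling('730D', closed='both').count().astype(int).tolist()
# Default for time-based windows is closed='right': the left edge is excluded
right = s.rolling('730D').count().astype(int).tolist()

# Two visits exactly 2 calendar years apart, but 731 days apart (2012 is a leap year)
leap = pd.Series(pd.to_datetime(['2012-01-01', '2014-01-01']))
t = leap.iloc[1]
fixed_count = int(((leap >= t - pd.Timedelta(days=730)) & (leap <= t)).sum())
calendar_count = int(((leap >= t - pd.DateOffset(years=2)) & (leap <= t)).sum())

print(both, right)                  # [1, 2] [1, 1]
print(fixed_count, calendar_count)  # 1 2
```

Which behavior is right depends on how you define "the past two years"; the point is only that the two choices can disagree at the edges.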

This question originated on Stack Exchange; it was asked by Felix.
