在Pandas DataFrame中按日期生成客户商品历史列(同日期互斥)
Solution for Adding Item History Excluding Same-Day Purchases
Got it, let's fix this problem properly! The key issue with your original approach is that it uses row indices to build the history, which includes same-day items since they're consecutive rows. We need to group by date first within each customer to ensure same-day purchases don't appear in each other's history.
Step-by-Step Solution
First, let's make sure our date column is in a format we can compare, then we'll build the history based on dates instead of row positions:
import pandas as pd import numpy as np # Your sample data df = pd.DataFrame({ 'Customer':['Bert', 'Bert', 'Bert', 'Bert', 'Bert', 'Ernie', 'Ernie', 'Ernie', 'Ernie', 'Steven', 'Steven'], 'Date':['01/01/2019', '15/01/2019', '20/01/2019', '20/01/2019', '22/01/2019', '01/01/2019', '15/01/2019', '20/01/2019', '22/01/2019', '01/01/2019' ,'15/01/2019'], 'Item':['Bread', 'Cheese', 'Apples', 'Pears', 'Toothbrush', 'Toys', 'Shellfish', 'Dog', 'Yoghurt', 'Toilet', 'Dominos'] }) # 1. Convert date strings to datetime objects (critical for proper ordering) df['Date'] = pd.to_datetime(df['Date'], format='%d/%m/%Y') # 2. Sort by Customer and Date to ensure chronological order df = df.sort_values(['Customer', 'Date']).reset_index(drop=True) # 3. Define a function to build item history per customer def build_customer_history(group): # Group the customer's purchases by date to get all items per day daily_items = group.groupby('Date')['Item'].apply(list).reset_index() # Cumulate all previous days' items, shifting by 1 to exclude the current day daily_items['Item History'] = daily_items['Item'].cumsum().shift(1) # Replace empty histories with NaN (instead of empty lists) daily_items['Item History'] = daily_items['Item History'].apply( lambda x: x if isinstance(x, list) and len(x) > 0 else np.nan ) # Merge the daily history back to the original group rows return group.merge(daily_items[['Date', 'Item History']], on='Date', how='left') # 4. Apply the function to each customer group df = df.groupby('Customer').apply(build_customer_history).reset_index(drop=True) # Let's check the result print(df[['Customer', 'Date', 'Item', 'Item History']])
What This Does
- Date Conversion: Ensures we can correctly compare dates to order purchases chronologically.
- Sorting: Makes sure each customer's records are in the right time order before building history.
- Daily Grouping: By first grouping purchases per date, we ensure all same-day rows share the same history (which doesn't include their own day's items).
- Cumulative Shift: Using
cumsum()accumulates all previous days' items, andshift(1)drops the current day's items from the history. - NaN Handling: Replaces empty histories (for a customer's first purchase) with NaN instead of empty lists, matching your original approach.
Sample Output Excerpt
For Bert's 20/01/2019 purchases, you'll see:
| Customer | Date | Item | Item History |
|---|---|---|---|
| Bert | 2019-01-20 | Apples | ['Bread', 'Cheese'] |
| Bert | 2019-01-20 | Pears | ['Bread', 'Cheese'] |
| Bert | 2019-01-22 | Toothbrush | ['Bread', 'Cheese', 'Apples', 'Pears'] |
Perfect—same-day items don't appear in each other's history, and later dates include all prior purchases.
内容的提问来源于stack exchange,提问作者Dan Scally




