基于分组的累积最大值(cumulative max):SAS/PROC SQL实现咨询
Great question! Calculating cumulative maxima across multiple groups efficiently for datasets with 10M to 100M rows is a common pain point, but native SAS tools are well-suited for this task—especially if you’re open to row reordering. Below are the most performant approaches, prioritized by speed for large-scale data.
Core Approach: Data Step with BY Groups (Best for Big Data)
SAS’s Data Step is optimized for sequential row processing, making it far faster than PROC SQL for window functions on large datasets. The key is leveraging sorted BY groups and the RETAIN statement to track cumulative values without expensive computations.
Step 1: Sort Your Data First
BY group processing requires your data to be sorted by the grouping columns plus the column that defines the order of the cumulative calculation (e.g., a date, sequence ID). SAS’s PROC SORT is highly optimized for large data—use the SORTSIZE option if you need to allocate more memory for faster sorting.
/* Sort by grouping columns (group1, group2) and order column (event_date) */ proc sort data=your_large_dataset out=sorted_data sortsize=10G; /* Adjust memory as needed */ by group1 group2 event_date; run;
Step 2: Calculate Cumulative Max with RETAIN
Use RETAIN to keep track of the maximum value across rows, and reset it whenever a new group starts (detected via first.<last_group_column>). You can compute cumulative maxima for multiple variables in a single pass.
data cumulative_max_results; set sorted_data; by group1 group2; /* Retain cumulative max values across rows */ retain cum_max_var1 cum_max_var2; /* Add all variables you need to compute */ /* Reset cumulative max at the start of a new group */ if first.group2 then do; /* Use the final grouping column from your BY statement */ cum_max_var1 = var1; cum_max_var2 = var2; end; else do; /* Update cumulative max: take the higher of current value or retained max */ cum_max_var1 = max(cum_max_var1, var1); cum_max_var2 = max(cum_max_var2, var2); end; /* Optional: Drop unused variables to reduce output size */ /* drop temp_vars; */ run;
Why This Works
RETAINpreserves the value of your cumulative max variables across row iterations (instead of resetting to missing each time).- BY group processing triggers
first.group2to reset the max whenever a new group begins, ensuring accurate per-group calculations. - This method processes each row exactly once, giving it an O(n) time complexity—perfect for ultra-large datasets.
Alternative: PROC SQL with Window Functions (For Smaller Data)
If you prefer SQL syntax, you can use MAX() OVER() window functions. However, note that this is significantly slower for 10M+ rows compared to the Data Step approach. It’s better suited for ad-hoc queries or smaller datasets.
proc sql; create table cumulative_max_sql as select group1, group2, event_date, var1, var2, max(var1) over(partition by group1, group2 order by event_date) as cum_max_var1, max(var2) over(partition by group1, group2 order by event_date) as cum_max_var2 from your_large_dataset; quit;
Optimizations for 100M+ Rows
- Increase buffer size: Use
bufsizeandbufnoin the Data Step to reduce I/O operations:data cumulative_max_results; set sorted_data bufsize=64000 bufno=10; /* ... rest of code ... */ run; - Trim unnecessary variables: Use
KEEPorDROPto reduce the amount of data loaded into memory. - Leverage SAS Viya: If available, distributed processing can handle even larger datasets by splitting the work across nodes.
Example Verification
Let’s test with a small sample to confirm the output matches expectations:
Sample Input
| group1 | group2 | event_date | var1 | var2 |
|---|---|---|---|---|
| A | X | 2023-01-01 | 5 | 10 |
| A | X | 2023-01-02 | 3 | 15 |
| A | Y | 2023-01-01 | 7 | 8 |
| B | X | 2023-01-01 | 2 | 20 |
| B | X | 2023-01-02 | 6 | 18 |
Expected Output
| group1 | group2 | event_date | var1 | var2 | cum_max_var1 | cum_max_var2 |
|---|---|---|---|---|---|---|
| A | X | 2023-01-01 | 5 | 10 | 5 | 10 |
| A | X | 2023-01-02 | 3 | 15 | 5 | 15 |
| A | Y | 2023-01-01 | 7 | 8 | 7 | 8 |
| B | X | 2023-01-01 | 2 | 20 | 2 | 20 |
| B | X | 2023-01-02 | 6 | 18 | 6 | 20 |
The Data Step code above will produce exactly this output.
内容的提问来源于stack exchange,提问作者moodymudskipper




