How to Recreate MLB-Style Baseball Splits Tables from Game Logs in R
Hey there! You're already on the right track using purrr and structured data to build MLB-style splits tables. Here's an efficient workflow, step by step:
First: Structure Your Game Log Data Properly
The "hidden" split dimensions in game logs are just categorical variables you need to explicitly extract or create. Start by cleaning your raw game log data to add all the split columns you care about:
    library(tidyverse)
    library(lubridate)

    # Assume `game_log` is your imported raw game log data frame
    clean_game_log <- game_log %>%
      mutate(
        # Extract month as a readable label
        game_month = month(game_date, label = TRUE, abbr = FALSE),
        # Flag home/away. This assumes `team` holds each game's home team;
        # replace "WSH" with your player's team
        home_away = ifelse(team == "WSH", "Home", "Away"),
        # Classify day/night from the start time. Assumes `game_time` looks
        # like "1:05 PM" and treats starts before 5 PM as day games
        start_hour = strptime(game_time, "%I:%M %p")$hour,
        day_night = case_when(
          is.na(start_hour) ~ NA_character_,
          start_hour < 17 ~ "Day",
          TRUE ~ "Night"
        )
      ) %>%
      # Drop games with no plate appearance (optional). Filtering on AB alone
      # would also drop walk-only games and understate OBP
      filter(AB + BB + HBP + SF > 0)
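Time parsing is the step most likely to go wrong quietly, so it can help to test the day/night rule on its own before running it on the full log. The sketch below is base R only. The helper names `start_hour_24()` and `classify_day_night()`, the "1:05 PM" time format, and the 5 PM cutoff are all assumptions, and `%p` parsing depends on an English-style locale.

```r
# Hypothetical helper: turn clock strings such as "1:05 PM" into a 24-hour start hour.
# Strings that do not match the assumed format come back as NA.
start_hour_24 <- function(game_time) {
  strptime(game_time, format = "%I:%M %p")$hour
}

# Hypothetical helper: label a game "Day" if it starts before `cutoff_hour`
# (assumed to be 17, i.e. 5 PM), "Night" otherwise, and NA if the time is unparseable.
classify_day_night <- function(game_time, cutoff_hour = 17) {
  hour <- start_hour_24(game_time)
  ifelse(is.na(hour), NA_character_,
         ifelse(hour < cutoff_hour, "Day", "Night"))
}

# Made-up start times, including one that cannot be parsed
times <- c("1:05 PM", "12:35 PM", "7:10 PM", "TBD")
day_night <- classify_day_night(times)
print(day_night)
```

Returning NA for unparseable times, rather than defaulting them to "Night", makes bad inputs visible in the final table.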
Second: Use purrr for Batch Split Calculations
This is where purrr shines: you can calculate splits across all your dimensions without repeating code. Here's how:
1. Define Your Split Dimensions as a Named List
List every split you want to generate, mapping a display name to each column you created above:
    split_definitions <- list(
      "Home vs. Away" = "home_away",
      "Day vs. Night" = "day_night",
      "By Month" = "game_month"
      # Add more splits here (e.g., "Left-Handed Pitchers" = "opp_pitch_hand")
    )
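Since the list refers to columns by name, a typo only surfaces later as a grouping error. A minimal sketch of an up-front check, in base R, might look like this. `check_split_columns()` is a hypothetical helper and the toy data frame is made up for illustration.

```r
split_definitions <- list(
  "Home vs. Away" = "home_away",
  "Day vs. Night" = "day_night",
  "By Month"      = "game_month"
)

# Hypothetical helper: stop early if any split column is absent from the data
check_split_columns <- function(data, splits) {
  missing_cols <- setdiff(unlist(splits, use.names = FALSE), names(data))
  if (length(missing_cols) > 0) {
    stop("Missing split columns: ", paste(missing_cols, collapse = ", "))
  }
  invisible(TRUE)
}

# Made-up data that deliberately lacks `game_month`
toy_log <- data.frame(home_away = c("Home", "Away"),
                      day_night = c("Day", "Night"))

# Capture the error message instead of halting the script
result <- tryCatch(check_split_columns(toy_log, split_definitions),
                   error = function(e) conditionMessage(e))
print(result)
```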
2. Write a Reusable Split Calculation Function
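Create a function that takes your cleaned data, a split column, and a display label, then computes the counting stats plus AVG/OBP/SLG for each level of that split:

    compute_split_stats <- function(data, split_col, split_label) {
      data %>%
        group_by(!!sym(split_col)) %>%
        summarize(
          Games = n(),
          # Sum the counting stats first so the rate stats use split totals
          AB   = sum(AB, na.rm = TRUE),
          Hits = sum(H, na.rm = TRUE),
          BB   = sum(BB, na.rm = TRUE),
          HBP  = sum(HBP, na.rm = TRUE),
          SF   = sum(SF, na.rm = TRUE),
          XBH  = sum(X2B + X3B + HR, na.rm = TRUE),
          # Total bases: every hit counts once, plus extra bases for 2B/3B/HR
          TB   = sum(H + X2B + 2 * X3B + 3 * HR, na.rm = TRUE),
          # Calculate rate stats with rounding
          AVG = round(Hits / AB, 3),
          OBP = round((Hits + BB + HBP) / (AB + BB + HBP + SF), 3),
          SLG = round(TB / AB, 3),
          .groups = "drop"
        ) %>%
        # Give the grouping column the same name in every split so rows stack cleanly
        rename(Split = all_of(split_col)) %>%
        # Label which split these rows belong to
        mutate(Split = as.character(Split), Split_Category = split_label)
    }

3. Batch Process All Splits & Combine Results
Use purrr::imap_dfr() to run the function across every split in your list. imap passes each list value (the column name) as .x and its name (the display label) as .y, and imap_dfr() binds the results into one table:

    final_splits_table <- imap_dfr(
      split_definitions,
      ~ compute_split_stats(clean_game_log, .x, .y)
    ) %>%
      # Reorder columns to match MLB's layout
      select(Split_Category, Split, everything())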
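To see how the imap family pairs each list value with its name, here is a small sketch on made-up data. It also shows `list_rbind()`, which purrr 1.0.0 introduced as the replacement for the superseded `*_dfr()` functions. The toy data frame and the simplified hits-only summary are illustrative assumptions, not the full stats function.

```r
library(purrr)

split_definitions <- list(
  "Home vs. Away" = "home_away",
  "Day vs. Night" = "day_night"
)

# In a formula, .x is the list value (column name) and .y is its name (label)
labels <- imap_chr(split_definitions, ~ paste0(.y, " -> ", .x))
print(labels)

# Made-up three-game log with only the columns this sketch needs
toy_log <- data.frame(
  home_away = c("Home", "Away", "Home"),
  day_night = c("Day", "Night", "Night"),
  H         = c(2, 0, 1)
)

# Simplified per-split summary: total hits for each level of the split column
hits_per_split <- imap(split_definitions, function(col, label) {
  totals <- tapply(toy_log$H, toy_log[[col]], sum)
  data.frame(Split = names(totals), Hits = as.numeric(totals))
})

# Stack the pieces; the list names become the Split_Category column
hits_by_split <- list_rbind(hits_per_split, names_to = "Split_Category")
print(hits_by_split)
```

With `names_to`, the label comes from the list names, so the summary function does not need a label argument at all.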
Third: Polish & Extend
- Handle Edge Cases: Add checks for division by zero (e.g., if a player has 0 AB in a split, set AVG/OBP/SLG to NA instead of NaN)
- Visualize or Format: Use packages like gt or kableExtra to turn the dataframe into a polished, MLB-style table with formatting (e.g., highlight top stats)
- Add Custom Splits: Want to split by opponent division or pitch type? Just add the relevant column to your cleaned data and update the split_definitions list. No extra code needed!
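The division-by-zero check from the list above could be sketched as a small helper that the rate-stat lines call instead of dividing directly. `safe_rate()` is a hypothetical name, and the numbers are made up.

```r
# Hypothetical helper: rounded numerator / denominator,
# returning NA (not NaN or Inf) when the denominator is zero
safe_rate <- function(numerator, denominator, digits = 3) {
  ifelse(denominator > 0,
         round(numerator / denominator, digits),
         NA_real_)
}

# Second split has no at-bats, so its average is NA
hits    <- c(48, 0)
at_bats <- c(160, 0)
avg <- safe_rate(hits, at_bats)
print(avg)
```

Inside the summary step this would replace an expression such as `round(Hits / AB, 3)` with `safe_rate(Hits, AB)`.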
Here’s a quick preview of what your final table might look like:
| Split_Category | Split | Games | AB | Hits | AVG | OBP | SLG |
|---|---|---|---|---|---|---|---|
| Home vs. Away | Home | 45 | 160 | 48 | 0.300 | 0.410 | 0.520 |
| Home vs. Away | Away | 42 | 152 | 42 | 0.276 | 0.385 | 0.490 |
| Day vs. Night | Day | 22 | 78 | 21 | 0.269 | 0.370 | 0.487 |
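A table like this preview can be rendered with gt. The sketch below is one possible approach under stated assumptions: the values are copied from the three preview rows, the column name `Split` and the title text are placeholders, and the styling is kept to a title, row groups, and three-decimal rates.

```r
library(gt)

# Values copied from the preview rows; `Split` is an assumed column name
splits <- data.frame(
  Split_Category = c("Home vs. Away", "Home vs. Away", "Day vs. Night"),
  Split = c("Home", "Away", "Day"),
  Games = c(45, 42, 22),
  AB    = c(160, 152, 78),
  Hits  = c(48, 42, 21),
  AVG   = c(0.300, 0.276, 0.269),
  OBP   = c(0.410, 0.385, 0.370),
  SLG   = c(0.520, 0.490, 0.487)
)

# Group rows under their split category, add a title,
# and show the rate stats with three decimals
splits_gt <- gt(splits, groupname_col = "Split_Category")
splits_gt <- tab_header(splits_gt, title = "Batting splits (example data)")
splits_gt <- fmt_number(splits_gt, columns = c(AVG, OBP, SLG), decimals = 3)

# Printing `splits_gt` in an interactive session renders the table
```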
This question originally appeared on Stack Exchange, asked by Mutuelinvestor.




