How to Create Custom Bin Intervals in R for Two Kinds of Variables: Sales and PCT Change

Ahua AIGC Lab (阿华AIGC实验室)

2026-5-27

Hey there! Let's work through your binning requirements in R step by step. I'll cover how to bin both Sales and PCT Change within each Product group, support custom bins, and keep the total number of bins at 10 or fewer.

1. Handling Binning for `Sales` ($2,000-$19,000 range)

Your current code uses seq() for breaks, but we need to adapt it to the 2000-19000 range and allow custom bins (like 4500-5000). Here's a flexible approach:

First, generate automatic breaks that split the Sales range into a maximum of 10 bins. We'll use quantile() so that each bin holds roughly the same number of rows, which suits skewed data better than fixed-width intervals.
Then insert your custom break points, remove duplicates, and sort the breaks so the intervals are valid and non-overlapping.

Example code snippet for Sales:

library(dplyr)

# Custom break points for Sales (adds a dedicated 4500-5000 bin)
custom_sales_breaks <- c(4500, 5000)

df <- df %>%
  group_by(Product) %>%
  mutate(
    # 10 bins = 11 break points. Reserve slots for the custom breaks up front so
    # the merged set stays within that limit (trimming the sorted vector afterwards
    # would drop the upper breaks and leave the largest values as NA)
    auto_sales_breaks = list(quantile(
      Sales,
      probs = seq(0, 1, length.out = 11 - length(custom_sales_breaks)),
      na.rm = TRUE
    )),
    # Merge auto + custom breaks, remove duplicates, sort
    combined_sales_breaks = list(sort(unique(c(auto_sales_breaks[[1]], custom_sales_breaks)))),
    # Create Sales bins; dig.lab = 5 keeps labels such as (4500,5000] out of
    # scientific notation
    Sales_Bin = cut(Sales,
                    breaks = combined_sales_breaks[[1]],
                    include.lowest = TRUE,
                    dig.lab = 5)
  ) %>%
  ungroup() %>%
  select(-auto_sales_breaks, -combined_sales_breaks) # Clean up helper columns
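To see the merge-and-cap logic in isolation, here is a minimal sketch on a single made-up vector, without grouping. The toy data, the variable names, and the choice to shrink the number of quantile breaks by the number of custom breaks are assumptions for illustration, not something taken from your original code.

```r
# Toy data (made up for illustration): 200 sales values between 2000 and 19000
set.seed(42)
sales <- round(runif(200, min = 2000, max = 19000))

custom_breaks <- c(4500, 5000)  # the custom 4500-5000 bin from the text
max_bins <- 10

# Reserve room for the custom breaks so the merged set has at most
# max_bins + 1 break points
n_auto <- max_bins + 1 - length(custom_breaks)
auto_breaks <- quantile(sales, probs = seq(0, 1, length.out = n_auto), na.rm = TRUE)

# Merge, de-duplicate, and sort the break points
breaks <- sort(unique(c(unname(auto_breaks), custom_breaks)))

# dig.lab = 5 avoids labels in scientific notation for values in the thousands
sales_bin <- cut(sales, breaks = breaks, include.lowest = TRUE, dig.lab = 5)

nlevels(sales_bin)  # number of bins actually created
anyNA(sales_bin)    # FALSE when every value fell inside some bin
table(sales_bin)
```

If a custom break happens to coincide with a quantile, unique() removes the duplicate and you simply end up with one bin fewer.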

2. Binning `PCT Change` (supports positive/negative values)

PCT Change takes both positive and negative values. The same logic applies, though you may want symmetric bins or an explicit split between the positive and negative ranges. Here's how to add custom bins (e.g., -0.1 to 0 and 0 to 0.1) while keeping the total at 10 bins or fewer:

# Custom break points for PCT Change (adds -0.1 to 0 and 0 to 0.1 bins)
# Assumes the column is named PCT_Change; if it is literally "PCT Change",
# refer to it with backticks as `PCT Change`
custom_pct_breaks <- c(-0.1, 0, 0.1)

df <- df %>%
  group_by(Product) %>%
  mutate(
    # 10 bins = 11 break points, minus the slots reserved for the custom breaks
    # (trimming the sorted vector afterwards would drop the upper breaks and
    # leave the largest values as NA)
    auto_pct_breaks = list(quantile(
      PCT_Change,
      probs = seq(0, 1, length.out = 11 - length(custom_pct_breaks)),
      na.rm = TRUE
    )),
    # Merge auto + custom breaks, remove duplicates, sort
    combined_pct_breaks = list(sort(unique(c(auto_pct_breaks[[1]], custom_pct_breaks)))),
    # Create PCT Change bins
    PCT_Change_Bin = cut(PCT_Change,
                         breaks = combined_pct_breaks[[1]],
                         include.lowest = TRUE, # with right = FALSE, keeps the largest value in the last bin
                         right = FALSE)         # [a,b) intervals instead of (a,b]
  ) %>%
  ungroup() %>%
  select(-auto_pct_breaks, -combined_pct_breaks)
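If you'd rather have the symmetric bins mentioned at the start of this section, one option is to mirror evenly spaced breaks around zero. This is a sketch under assumptions: symmetric_breaks() is a hypothetical helper, the data is made up, and it uses fixed-width breaks rather than the quantile-based breaks above, so the custom -0.1 and 0.1 points are not guaranteed to appear.

```r
# Hypothetical helper: evenly spaced breaks mirrored around zero
symmetric_breaks <- function(x, n_bins = 10) {
  stopifnot(n_bins %% 2 == 0)              # an even count keeps 0 as a break point
  limit <- max(abs(x), na.rm = TRUE)       # largest move in either direction
  positive <- seq(0, limit, length.out = n_bins / 2 + 1)
  c(-rev(positive[-1]), positive)          # mirror the positive side onto the negative side
}

# Toy data (made up): percentage changes between -30% and +30%
set.seed(1)
pct_change <- round(runif(100, min = -0.3, max = 0.3), 3)

breaks <- symmetric_breaks(pct_change, n_bins = 6)
pct_bin <- cut(pct_change, breaks = breaks, include.lowest = TRUE, right = FALSE)

breaks
table(pct_bin)
```

Because zero is always a break point, no bin mixes gains with losses. Inside a grouped mutate() the helper would be evaluated once per Product, so each product gets its own range.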

3. Simplify with `dplyr::across()` for Both Variables

To make the code cleaner and avoid repetition, we can use across() to apply the binning logic to both variables at once. Here's a consolidated version:

library(dplyr)

# Define custom breaks for each variable in a list
custom_breaks_list <- list(
  Sales = c(4500, 5000),
  PCT_Change = c(-0.1, 0, 0.1)
)

# Reusable function to create bins with custom breaks and at most max_bins bins
create_bins <- function(x, custom_breaks, max_bins = 10, right = TRUE) {
  # max_bins bins = max_bins + 1 break points. Reserve slots for the custom
  # breaks up front (trimming the sorted vector afterwards would drop the
  # upper breaks and leave the largest values as NA)
  n_auto <- max_bins + 1 - length(custom_breaks)
  auto_breaks <- quantile(x, probs = seq(0, 1, length.out = n_auto), na.rm = TRUE)
  combined_breaks <- sort(unique(c(auto_breaks, custom_breaks)))
  # dig.lab = 5 keeps labels such as (4500,5000] out of scientific notation
  cut(x, breaks = combined_breaks, include.lowest = TRUE, right = right, dig.lab = 5)
}

# Apply grouping and binning to both variables
df <- df %>%
  group_by(Product) %>%
  mutate(
    across(c(Sales, PCT_Change),
           # Keep [a,b) intervals for PCT_Change, as in section 2
           ~create_bins(.x, custom_breaks_list[[cur_column()]],
                        right = cur_column() != "PCT_Change"),
           .names = "{.col}_Bin")
  ) %>%
  ungroup()
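Once the bins exist, it's worth checking per Product that the 10-bin limit holds and that nothing ended up as NA. The sketch below runs on its own, so it builds a made-up data frame and uses bin_values(), a hypothetical compact stand-in for the binning helper. Both are assumptions for illustration.

```r
library(dplyr)

# Toy data (made up for illustration): two products with different sales ranges
set.seed(123)
df <- data.frame(
  Product    = rep(c("A", "B"), each = 50),
  Sales      = c(runif(50, 2000, 9000), runif(50, 4000, 19000)),
  PCT_Change = runif(100, -0.3, 0.3)
)

custom_breaks_list <- list(Sales = c(4500, 5000), PCT_Change = c(-0.1, 0, 0.1))

# Hypothetical stand-in for the binning helper: quantile breaks plus custom
# breaks, sized so the merged set never exceeds max_bins + 1 break points
bin_values <- function(x, custom_breaks, max_bins = 10) {
  n_auto <- max_bins + 1 - length(custom_breaks)
  auto_breaks <- quantile(x, probs = seq(0, 1, length.out = n_auto), na.rm = TRUE)
  breaks <- sort(unique(c(unname(auto_breaks), custom_breaks)))
  cut(x, breaks = breaks, include.lowest = TRUE, dig.lab = 5)
}

binned <- df %>%
  group_by(Product) %>%
  mutate(across(c(Sales, PCT_Change),
                ~ bin_values(.x, custom_breaks_list[[cur_column()]]),
                .names = "{.col}_Bin")) %>%
  ungroup()

# Per-product sanity check: populated bins per variable and rows left unbinned
check <- binned %>%
  group_by(Product) %>%
  summarise(
    sales_bins = n_distinct(Sales_Bin),
    pct_bins   = n_distinct(PCT_Change_Bin),
    unbinned   = sum(is.na(Sales_Bin) | is.na(PCT_Change_Bin))
  )
check
```

Since the breaks are computed per group, the same label can cover different ranges for different products, so compare bins within a product rather than across products.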

Key Notes

Using quantile() ensures bins are distributed based on the actual data in each Product group, which is better than fixed intervals if sales performance varies widely between products.
The include.lowest = TRUE parameter closes the outermost bin so the extreme value is not dropped: with the default right = TRUE it keeps each group's smallest value in the first bin, and with right = FALSE it keeps the largest value in the last bin (matches your original code's behavior).
For PCT_Change, the right = FALSE argument makes intervals like [-0.2, -0.1) instead of (-0.2, -0.1]; adjust this based on how you want to handle bin boundaries.
If you prefer fixed-width bins instead of quantile-based, replace quantile() with seq(min(x, na.rm=TRUE), max(x, na.rm=TRUE), length.out=11) (note: this may create empty bins if your data is skewed).
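To make the trade-off in the last note concrete, here is a small comparison of fixed-width and quantile-based breaks on the same vector. The right-skewed toy data is made up for illustration, and the variable names are assumptions.

```r
# Toy, right-skewed data (made up): most values are small, a few are large
set.seed(7)
x <- round(rexp(200, rate = 1 / 3000) + 2000)

# Fixed-width: 11 evenly spaced break points between min and max
fixed_breaks <- seq(min(x, na.rm = TRUE), max(x, na.rm = TRUE), length.out = 11)

# Quantile-based: 11 break points at the deciles (unique() guards against ties)
quantile_breaks <- unique(quantile(x, probs = seq(0, 1, length.out = 11), na.rm = TRUE))

fixed_bin    <- cut(x, breaks = fixed_breaks, include.lowest = TRUE, dig.lab = 5)
quantile_bin <- cut(x, breaks = quantile_breaks, include.lowest = TRUE, dig.lab = 5)

table(fixed_bin)     # counts pile up in the lowest bins
table(quantile_bin)  # counts are roughly equal across bins
```

Fixed-width bins are easier to explain and compare across products. Quantile bins adapt to each group's distribution, but their boundaries shift whenever the data changes.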