How do I implement a SUMIFS-like sum with two dynamic conditions in R?
Hey there! Let's break down what's going wrong with your current code and fix it step by step.
What's Wrong with Your Existing Code?
Your script has a few critical issues that are stopping it from producing the right results:
- group_by(listing_id) is counterproductive: Since each listing_id is unique (per your sample data), grouping by it limits your sum to only the current row instead of scanning the entire dataset for matching coordinates.
- Logical operator error: You used | (OR) instead of & (AND) for the latitude checks. You need both latitude AND longitude to fall within the ±500 range, not either one. Also, comparing a row's latitude to itself (e.g., calc.latitude >= calc.latitude - 500) will always be true, so it doesn't filter out any rows.
- Missing longitude condition: You didn't include the ±500 range check for longitude at all, which is required for your problem.
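To see why the operator and the self-comparison matter, here is a minimal sketch on three made-up latitude values. The numbers and variable names are illustrative only (not from your data) and assume the coordinates are in metres.

```r
# Toy latitudes, assumed to be in metres (hypothetical values)
lat <- c(0, 300, 2000)
current <- lat[1]  # the row we are computing the sum for

# Comparing a column with itself is TRUE for every row, so nothing is filtered
self_check <- lat >= lat - 500
print(self_check)

# OR keeps any row that passes at least one bound, which here is every row
or_check <- lat >= current - 500 | lat <= current + 500
print(or_check)

# AND keeps only rows inside the window [current - 500, current + 500]
and_check <- lat >= current - 500 & lat <= current + 500
print(and_check)
```

Only the last check drops the row at 2000, which is the behaviour you want for a range filter.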
Fix 1: Basic dplyr Implementation (Easy to Understand)
If you want to stick with dplyr, use rowwise() to process each row individually, and filter the full dataset for rows that meet both coordinate range criteria:

    library(dplyr)

    # Calculate reviewsper1000 and add it to your original data
    REVIEWPERAREA <- TOTALLISTINGS %>%
      rowwise() %>%
      mutate(
        reviewsper1000 = sum(
          TOTALLISTINGS$reviews_last30days[
            # Check latitude is within ±500 of the current row's value
            between(TOTALLISTINGS$calc.latitude, calc.latitude - 500, calc.latitude + 500) &
              # Check longitude is within ±500 of the current row's value
              between(TOTALLISTINGS$calc.longitude, calc.longitude - 500, calc.longitude + 500)
          ]
        )
      ) %>%
      ungroup() # Exit rowwise mode to avoid unexpected behavior later

This will add a new reviewsper1000 column to your data frame with the sum you need for each row.
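If you want to sanity-check the rowwise approach before running it on the full dataset, here is a minimal sketch on a four-row toy data frame. The values are made up and assume the coordinates are in metres; toy and toy_result are hypothetical names used only for this illustration.

```r
library(dplyr)

# Hypothetical toy data; coordinates assumed to be in metres
toy <- tibble(
  listing_id = 1:4,
  calc.latitude = c(0, 300, 400, 5000),
  calc.longitude = c(0, 200, 900, 5000),
  reviews_last30days = c(1, 2, 4, 8)
)

toy_result <- toy %>%
  rowwise() %>%
  mutate(
    reviewsper1000 = sum(
      toy$reviews_last30days[
        # Full-column latitude vs. the current row's ±500 window
        between(toy$calc.latitude, calc.latitude - 500, calc.latitude + 500) &
          # Full-column longitude vs. the current row's ±500 window
          between(toy$calc.longitude, calc.longitude - 500, calc.longitude + 500)
      ]
    )
  ) %>%
  ungroup()

print(toy_result)
```

Listings 1 and 2 fall inside each other's ±500 box, so both get 1 + 2 = 3. Listing 3 is within 500 of them in latitude but not in longitude, so it only counts itself (4), and listing 4 is isolated (8).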
Fix 2: Efficient data.table Implementation (Better for Large Data)
Since your dataset is 500MB, the dplyr rowwise() approach might be slow (it scans the entire dataset once per row). For better performance, use data.table's non-equi join, which is optimized for range-based queries:
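
    library(data.table)

    # Convert your data frame to a data.table (by reference) for faster operations
    setDT(TOTALLISTINGS)

    # `on` only accepts column names, not expressions, so build each row's
    # ±500 window as real columns first (bounds is a helper table)
    bounds <- TOTALLISTINGS[, .(
      listing_id,
      lat_lo = calc.latitude - 500, lat_hi = calc.latitude + 500,
      lon_lo = calc.longitude - 500, lon_hi = calc.longitude + 500
    )]

    # Non-equi join: for each window in bounds, sum the reviews of all listings inside it
    sums <- TOTALLISTINGS[
      bounds,
      on = .(calc.latitude >= lat_lo, calc.latitude <= lat_hi,
             calc.longitude >= lon_lo, calc.longitude <= lon_hi),
      .(listing_id = i.listing_id, reviewsper1000 = sum(reviews_last30days)),
      by = .EACHI
    ][, .(listing_id, reviewsper1000)]

    # Merge back with the original data if you need all columns
    REVIEWPERAREA <- merge(TOTALLISTINGS, sums, by = "listing_id")
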
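Non-equi joins are easy to get subtly wrong, so it can help to compare the join against a brute-force loop on a tiny table first. The sketch below does that on made-up data, assuming coordinates in metres; toy_dt, toy_bounds, dt_sums, and brute are hypothetical names for this illustration.

```r
library(data.table)

# Hypothetical toy data; coordinates assumed to be in metres
toy_dt <- data.table(
  listing_id = 1:4,
  calc.latitude = c(0, 300, 400, 5000),
  calc.longitude = c(0, 200, 900, 5000),
  reviews_last30days = c(1, 2, 4, 8)
)

# One ±500 window per listing, stored as plain columns for use in `on`
toy_bounds <- toy_dt[, .(
  listing_id,
  lat_lo = calc.latitude - 500, lat_hi = calc.latitude + 500,
  lon_lo = calc.longitude - 500, lon_hi = calc.longitude + 500
)]

# Non-equi join: one output row per window, in the order of toy_bounds
dt_sums <- toy_dt[
  toy_bounds,
  on = .(calc.latitude >= lat_lo, calc.latitude <= lat_hi,
         calc.longitude >= lon_lo, calc.longitude <= lon_hi),
  .(listing_id = i.listing_id, reviewsper1000 = sum(reviews_last30days)),
  by = .EACHI
][, .(listing_id, reviewsper1000)]

# Brute-force reference: scan every row for every listing
brute <- vapply(seq_len(nrow(toy_dt)), function(k) {
  with(toy_dt, sum(reviews_last30days[
    abs(calc.latitude - calc.latitude[k]) <= 500 &
      abs(calc.longitude - calc.longitude[k]) <= 500
  ]))
}, numeric(1))

# The two approaches should agree row by row
stopifnot(isTRUE(all.equal(dt_sums$reviewsper1000, brute)))
print(dt_sums)
```

If the check passes on a small sample of your real data as well, you can run the join on the full table with more confidence.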
Key Notes
- Coordinate Units: Double-check that calc.latitude and calc.longitude are in units where ±500 makes sense (e.g., meters). If they're in decimal degrees, a ±500 window would cover the entire globe, so every listing would match, which isn't what you want.
- Extreme Performance: If even data.table feels slow, consider using spatial libraries like sf to create spatial buffers around each point and perform spatial joins. These can take advantage of spatial indexing, which tends to be much faster for queries on large datasets.
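As a rough idea of what the sf route could look like, here is a minimal sketch on made-up data. It assumes planar coordinates in metres with no CRS set, and it swaps the ±500 square window for a 500-unit radius via st_is_within_distance(), so results can differ from the box-based sums near the corners. The names toy_sf, pts, and neighbours are hypothetical.

```r
library(sf)

# Hypothetical toy data; coordinates assumed to be planar metres
toy_sf <- data.frame(
  listing_id = 1:4,
  calc.latitude = c(0, 300, 400, 5000),
  calc.longitude = c(0, 200, 900, 5000),
  reviews_last30days = c(1, 2, 4, 8)
)

# Build point geometries (x = longitude column, y = latitude column, no CRS)
pts <- st_as_sf(toy_sf, coords = c("calc.longitude", "calc.latitude"))

# For each point, the indices of all points within a 500-unit radius (itself included)
neighbours <- st_is_within_distance(pts, pts, dist = 500)

# Sum the reviews over each point's neighbours
toy_sf$reviews_within_500 <- vapply(
  neighbours,
  function(idx) sum(toy_sf$reviews_last30days[idx]),
  numeric(1)
)

print(toy_sf)
```

For real latitude/longitude data you would first need to set the correct CRS and project to a metric one, which this sketch deliberately leaves out.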
This question comes from Stack Exchange; it was asked by giannis.mvp.




