如何用R对空气质量数据集聚类,识别不同站点臭氧与PM高值时段?
Hey there! Let's walk through how to cluster your 630k-row air quality dataset to pinpoint periods where ozone (rollingo3) and PM2.5 (rollingpm2.5) are high across different sites. We'll break this down into practical, actionable steps tailored to your data structure.
1. Load & Preprocess the Data
First, let's get your data loaded efficiently—since it's a large dataset, using data.table or readr will be faster than base R's read.csv. We'll also clean up the timestamp and focus on the variables we care about.
# Install/load required packages install.packages(c("data.table", "dplyr", "ggplot2", "factoextra", "scales")) library(data.table) library(dplyr) library(ggplot2) library(factoextra) library(scales) # Load data (adjust file path as needed) air_data <- fread("your_dataset.csv") # Convert date to proper datetime format air_data <- air_data %>% mutate(date = as.POSIXct(date, format = "%Y-%m-%d %H:%M:%S")) # Handle missing values (if any) - simple mean imputation for key variables air_data <- air_data %>% mutate( rollingo3 = ifelse(is.na(rollingo3), mean(rollingo3, na.rm = TRUE), rollingo3), rollingpm2.5 = ifelse(is.na(rollingpm2.5), mean(rollingpm2.5, na.rm = TRUE), rollingpm2.5) )
2. Standardize Variables
Ozone and PM2.5 have different units and scales, which will skew clustering results. We'll standardize them to have a mean of 0 and standard deviation of 1:
# Select clustering variables and standardize cluster_vars <- air_data %>% select(rollingo3, rollingpm2.5) %>% scale() %>% as.data.frame()
3. Choose the Optimal Number of Clusters
For large datasets, the elbow method is efficient (skip silhouette scores—they're too slow for 600k+ rows). We'll use factoextra to visualize the elbow point:
# Run elbow method (test k from 2 to 8) set.seed(123) # For reproducibility elbow_plot <- fviz_nbclust(cluster_vars, kmeans, method = "wss") + geom_vline(xintercept = 4, linetype = 2, color = "red") # Adjust based on your elbow print(elbow_plot)
Look for the point where the within-cluster sum of squares stops dropping sharply—this is your optimal k. For this use case, k=4 often works (separating low, medium, high ozone, high PM, etc.).
4. Run K-Means Clustering
K-means is fast and scalable for large datasets. We'll use the optimal k we found:
# Run k-means with multiple starts for stable results set.seed(123) kmeans_result <- kmeans(cluster_vars, centers = 4, nstart = 25) # Add cluster labels back to the original dataset air_data$cluster <- as.factor(kmeans_result$cluster)
5. Identify High-Value Clusters
Let's summarize each cluster's characteristics to find which groups correspond to high ozone/PM:
# Summarize cluster stats cluster_summary <- air_data %>% group_by(cluster) %>% summarise( avg_o3 = mean(rollingo3), avg_pm = mean(rollingpm2.5), total_periods = n(), unique_sites = n_distinct(site) ) %>% arrange(desc(avg_o3, avg_pm)) print(cluster_summary)
Look for clusters with the highest avg_o3 and avg_pm—these are your high-value groups. Let's assume Cluster 4 is our target for this example.
6. Analyze High-Value Periods & Sites
Now we can dig into when and where these high values occur:
a. Time-Based Analysis
# Extract time components for analysis air_data <- air_data %>% mutate( hour = hour(date), date_only = as.Date(date), month = month(date, label = TRUE) ) # Plot hourly distribution of high-value periods air_data %>% filter(cluster == "4") %>% # Replace with your high cluster ggplot(aes(x = hour)) + geom_histogram(binwidth = 1, fill = "darkred", alpha = 0.7) + labs(title = "Hourly Distribution of High Ozone/PM Periods", x = "Hour of Day", y = "Count") + theme_minimal() # Plot monthly distribution air_data %>% filter(cluster == "4") %>% ggplot(aes(x = month)) + geom_bar(fill = "darkred", alpha = 0.7) + labs(title = "Monthly Distribution of High Ozone/PM Periods", x = "Month", y = "Count") + theme_minimal()
b. Site-Based Analysis
# Top 10 sites with the most high-value periods top_sites <- air_data %>% filter(cluster == "4") %>% count(site, sort = TRUE) %>% head(10) print(top_sites) # Visualize top sites top_sites %>% ggplot(aes(x = reorder(site, n), y = n)) + geom_bar(stat = "identity", fill = "darkblue", alpha = 0.7) + coord_flip() + labs(title = "Top 10 Sites with Most High Ozone/PM Periods", x = "Site", y = "Count") + theme_minimal()
7. Bonus: Optimize for Extreme Large Datasets
If 630k rows is pushing memory limits:
- Use
h2ofor distributed clustering (optimized for big data) - Test logic on a smaller sample first, then scale to full data
- Stick with
data.tablefor all data manipulations (faster than dplyr for very large datasets)
内容的提问来源于stack exchange,提问作者MFarhan




