如何用R对空气质量数据集聚类，识别不同站点臭氧与PM高值时段？

阿华AIGC实验室

2026-5-19

Clustering Large Air Quality Datasets in R to Identify High Ozone/PM Periods

Hey there! Let's walk through how to cluster your 630k-row air quality dataset to pinpoint periods where ozone (rollingo3) and PM2.5 (rollingpm2.5) are high across different sites. We'll break this down into practical, actionable steps tailored to your data structure.

1. Load & Preprocess the Data

First, let's get your data loaded efficiently—since it's a large dataset, using data.table or readr will be faster than base R's read.csv. We'll also clean up the timestamp and focus on the variables we care about.

# Install/load required packages
install.packages(c("data.table", "dplyr", "ggplot2", "factoextra", "scales"))
library(data.table)
library(dplyr)
library(ggplot2)
library(factoextra)
library(scales)

# Load data (adjust file path as needed)
air_data <- fread("your_dataset.csv")

# Convert date to proper datetime format
air_data <- air_data %>%
  mutate(date = as.POSIXct(date, format = "%Y-%m-%d %H:%M:%S"))

# Handle missing values (if any) - simple mean imputation for key variables
air_data <- air_data %>%
  mutate(
    rollingo3 = ifelse(is.na(rollingo3), mean(rollingo3, na.rm = TRUE), rollingo3),
    rollingpm2.5 = ifelse(is.na(rollingpm2.5), mean(rollingpm2.5, na.rm = TRUE), rollingpm2.5)
  )

2. Standardize Variables

Ozone and PM2.5 have different units and scales, which will skew clustering results. We'll standardize them to have a mean of 0 and standard deviation of 1:

# Select clustering variables and standardize
cluster_vars <- air_data %>%
  select(rollingo3, rollingpm2.5) %>%
  scale() %>%
  as.data.frame()

3. Choose the Optimal Number of Clusters

For large datasets, the elbow method is efficient (skip silhouette scores—they're too slow for 600k+ rows). We'll use factoextra to visualize the elbow point:

# Run elbow method (test k from 2 to 8)
set.seed(123) # For reproducibility
elbow_plot <- fviz_nbclust(cluster_vars, kmeans, method = "wss") +
  geom_vline(xintercept = 4, linetype = 2, color = "red") # Adjust based on your elbow
print(elbow_plot)

Look for the point where the within-cluster sum of squares stops dropping sharply—this is your optimal k. For this use case, k=4 often works (separating low, medium, high ozone, high PM, etc.).

4. Run K-Means Clustering

K-means is fast and scalable for large datasets. We'll use the optimal k we found:

# Run k-means with multiple starts for stable results
set.seed(123)
kmeans_result <- kmeans(cluster_vars, centers = 4, nstart = 25)

# Add cluster labels back to the original dataset
air_data$cluster <- as.factor(kmeans_result$cluster)

5. Identify High-Value Clusters

Let's summarize each cluster's characteristics to find which groups correspond to high ozone/PM:

# Summarize cluster stats
cluster_summary <- air_data %>%
  group_by(cluster) %>%
  summarise(
    avg_o3 = mean(rollingo3),
    avg_pm = mean(rollingpm2.5),
    total_periods = n(),
    unique_sites = n_distinct(site)
  ) %>%
  arrange(desc(avg_o3, avg_pm))

print(cluster_summary)

Look for clusters with the highest avg_o3 and avg_pm—these are your high-value groups. Let's assume Cluster 4 is our target for this example.

6. Analyze High-Value Periods & Sites

Now we can dig into when and where these high values occur:

a. Time-Based Analysis

# Extract time components for analysis
air_data <- air_data %>%
  mutate(
    hour = hour(date),
    date_only = as.Date(date),
    month = month(date, label = TRUE)
  )

# Plot hourly distribution of high-value periods
air_data %>%
  filter(cluster == "4") %>% # Replace with your high cluster
  ggplot(aes(x = hour)) +
  geom_histogram(binwidth = 1, fill = "darkred", alpha = 0.7) +
  labs(title = "Hourly Distribution of High Ozone/PM Periods", x = "Hour of Day", y = "Count") +
  theme_minimal()

# Plot monthly distribution
air_data %>%
  filter(cluster == "4") %>%
  ggplot(aes(x = month)) +
  geom_bar(fill = "darkred", alpha = 0.7) +
  labs(title = "Monthly Distribution of High Ozone/PM Periods", x = "Month", y = "Count") +
  theme_minimal()

b. Site-Based Analysis

# Top 10 sites with the most high-value periods
top_sites <- air_data %>%
  filter(cluster == "4") %>%
  count(site, sort = TRUE) %>%
  head(10)

print(top_sites)

# Visualize top sites
top_sites %>%
  ggplot(aes(x = reorder(site, n), y = n)) +
  geom_bar(stat = "identity", fill = "darkblue", alpha = 0.7) +
  coord_flip() +
  labs(title = "Top 10 Sites with Most High Ozone/PM Periods", x = "Site", y = "Count") +
  theme_minimal()

7. Bonus: Optimize for Extreme Large Datasets

If 630k rows is pushing memory limits:

Use h2o for distributed clustering (optimized for big data)
Test logic on a smaller sample first, then scale to full data
Stick with data.table for all data manipulations (faster than dplyr for very large datasets)

内容的提问来源于stack exchange，提问作者MFarhan