Choosing a Method for Computing the Standardized Infection Ratio (SIR) from Predicted Probabilities

Ahua AIGC Lab (阿华AIGC实验室)

2026-5-19

Got it, let's break down how to calculate the Standardized Infection Ratio (SIR) under the CDC's guidelines versus common machine learning practice. This is a very common point of confusion, so I'm glad you're asking!

Standardized Infection Ratio (SIR): CDC Guidelines vs. Machine Learning Practice

1. Predicted Case Count in the CDC Guidelines

Per the CDC's approach, the predicted number of cases is the sum of the infection probabilities generated by the model for each individual observation.

For example, if you have 100 patients and the model outputs infection probabilities such as 0.02, 0.05, and 0.01 (one per patient), you add all 100 probabilities together. That total is your predicted case count. SIR is then simply:
SIR = observed infections / sum of predicted probabilities

Core logic: the CDC's method uses the sum of individual risk probabilities as the expected number of cases in the population. Because each patient contributes their own probability, the total reflects variation in individual risk; it is a population-level expected value.
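To make the expected-value reading explicit, here is a short derivation. The symbols are introduced here for illustration and do not come from the CDC text: Y_i is the 0/1 infection indicator for patient i, p_i is the model's predicted probability for that patient, n is the number of patients, E is the expected count, and O is the observed number of infections. It assumes the model is calibrated, so that the expected value of Y_i equals p_i.

```latex
\begin{align*}
E &= \mathbb{E}\!\left[\sum_{i=1}^{n} Y_i\right]
   = \sum_{i=1}^{n} \mathbb{E}[Y_i]
   = \sum_{i=1}^{n} p_i, \\
\mathrm{SIR} &= \frac{O}{E} = \frac{O}{\sum_{i=1}^{n} p_i}.
\end{align*}
```

Read this way, an SIR above 1 means more infections were observed than the model predicted, and an SIR below 1 means fewer.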

2. Common Practices in Machine Learning Research & Applications

In ML contexts, two approaches are widely used:

Threshold-based counting: Set a probability threshold (e.g., 0.5), count how many observations have a predicted probability ≥ this threshold, and use that count as your predicted case number. SIR then becomes observed infections / number of predicted positives. Note that this denominator is a count of flagged observations, not an expected number of cases, so the resulting ratio changes with the threshold you pick.
Calibrated weighted summation: Similar to the CDC method, but often with calibrated probabilities (to correct model miscalibration) or restricted to a high-risk subset (e.g., summing probabilities only for the top 20% highest-risk individuals) to target specific risk groups.
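The second approach can be sketched as follows. This is a minimal illustration under stated assumptions, not a standard implementation: `subset_sir` and `shift_intercept` are hypothetical helper names, the per-patient outcomes are made up, and the calibration step shown is only the simplest kind (an intercept shift on the log-odds scale, sometimes called calibration-in-the-large). The target rate of 0.08 is an invented number standing in for a rate taken from a separate reference population; calibrating to the same data you then score would force the SIR toward 1.

```python
import numpy as np

def subset_sir(observed, probs, top_fraction=0.2):
    """Observed / expected cases among the highest-risk fraction of observations."""
    observed = np.asarray(observed)
    probs = np.asarray(probs, dtype=float)
    k = max(1, int(round(top_fraction * probs.size)))
    top = np.argsort(probs)[-k:]  # indices of the k largest probabilities
    return observed[top].sum() / probs[top].sum()

def shift_intercept(probs, target_rate, tol=1e-10):
    """Shift every log-odds by one constant so the mean probability equals target_rate.

    Requires probabilities strictly between 0 and 1.
    """
    probs = np.asarray(probs, dtype=float)
    logits = np.log(probs / (1.0 - probs))
    lo, hi = -20.0, 20.0
    # Bisection: the mean probability increases monotonically with the shift.
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        mean_p = (1.0 / (1.0 + np.exp(-(logits + mid)))).mean()
        if mean_p < target_rate:
            lo = mid
        else:
            hi = mid
    return 1.0 / (1.0 + np.exp(-(logits + (lo + hi) / 2.0)))

# Hypothetical data: one predicted probability and one 0/1 outcome per patient
probs = np.array([0.03, 0.07, 0.01, 0.1, 0.02, 0.05, 0.08, 0.04, 0.06, 0.09])
observed = np.array([0, 0, 0, 1, 0, 0, 0, 0, 0, 1])

top_sir = subset_sir(observed, probs, top_fraction=0.2)
calibrated = shift_intercept(probs, target_rate=0.08)  # 0.08 is an assumed reference rate
calibrated_sir = observed.sum() / calibrated.sum()

print(f"Top-20% subset SIR: {top_sir:.2f}")
print(f"SIR with recalibrated probabilities: {calibrated_sir:.2f}")
```

Both variants still divide observed cases by a sum of probabilities, so they stay close in spirit to the CDC calculation; what changes is which probabilities enter the sum.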

3. Which Approach Should You Use?

If your analysis needs to align with official public health monitoring or reporting, strictly follow the CDC's probability summation method, so that your results are comparable to standardized surveillance data.
If you're evaluating ML model performance or doing internal risk stratification, you can use the ML-specific methods, but make sure to explicitly document your approach in any reports to avoid ambiguity.
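To see why documenting the method matters, the sketch below reuses the simulated values from the Quick Code Example and recomputes the threshold-based ratio at several cutoffs. The helper name `threshold_sir` and the particular thresholds are illustrative choices, not part of any guideline.

```python
import numpy as np

# Same simulated values as the Quick Code Example in this post
actual_cases = 8
predicted_probs = np.array([0.03, 0.07, 0.01, 0.1, 0.02, 0.05, 0.08, 0.04, 0.06, 0.09])

def threshold_sir(actual, probs, threshold):
    """Observed cases divided by the number of predictions at or above threshold."""
    predicted_positive = int((probs >= threshold).sum())
    if predicted_positive == 0:
        return float("nan")  # the ratio is undefined when nothing is flagged
    return actual / predicted_positive

# Sum-based (CDC-style) ratio: one value, no tuning parameter
sum_sir = actual_cases / predicted_probs.sum()
print(f"Sum-based SIR: {sum_sir:.2f}")

# Threshold-based ratio: a different value for each cutoff
results = {t: threshold_sir(actual_cases, predicted_probs, t) for t in (0.03, 0.05, 0.08, 0.5)}
for t, sir in results.items():
    print(f"Threshold {t}: SIR = {sir:.2f}")
```

With these numbers the threshold-based ratio moves as the cutoff changes and becomes undefined once no prediction reaches the cutoff, so a reported value is only interpretable if the threshold is stated alongside it.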

Quick Code Example (Python)

Here’s how to implement both methods:

import numpy as np

# Simulated data: observed infection count and one predicted probability per patient
actual_cases = 8
predicted_probs = np.array([0.03, 0.07, 0.01, 0.1, 0.02, 0.05, 0.08, 0.04, 0.06, 0.09])

# CDC-compliant predicted cases & SIR
cdc_predicted = predicted_probs.sum()
cdc_sir = actual_cases / cdc_predicted
print(f"CDC Standard SIR: {cdc_sir:.2f}")

# ML threshold-based predicted cases & SIR (threshold = 0.05)
ml_predicted = (predicted_probs >= 0.05).sum()
ml_sir = actual_cases / ml_predicted
print(f"ML Threshold-Based SIR: {ml_sir:.2f}")

The question originates from Stack Exchange; asked by Bakaburg.