Single-Feature Anomaly Detection: Options Beyond K-Means and 3 Sigma

Ahua AIGC Lab (阿华AIGC实验室)

2026-5-6

Great question! Since you're focusing on univariate (single-feature) outlier detection and already know 3 Sigma and K-Means, here are some solid alternatives tailored to this use case, along with quick breakdowns of when to use each:

Recommended Univariate Outlier Detection Methods

1. IQR (Interquartile Range) Method

This is one of the most widely used univariate approaches. It is intuitive and, unlike 3 Sigma, does not assume the data are normally distributed.

How it works: Calculate the 25th percentile (Q1) and 75th percentile (Q3) of your column. The IQR is Q3 - Q1. Any value below Q1 - 1.5*IQR or above Q3 + 1.5*IQR is flagged as an outlier.
Best for: Skewed datasets (like user transaction amounts) or when you don't want to make distribution assumptions.

Quick Python example:

import pandas as pd

df = pd.read_csv('your_data.csv')
target_col = df['your_target_column']

q1 = target_col.quantile(0.25)
q3 = target_col.quantile(0.75)
iqr = q3 - q1

lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

outliers = target_col[(target_col < lower_bound) | (target_col > upper_bound)]
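
To make the fences concrete, here is a minimal worked sketch on made-up numbers. The helper name iqr_outliers and the toy data are illustrative assumptions, not part of the original question; it uses NumPy's default linear interpolation for the percentiles.

```python
import numpy as np

def iqr_outliers(values, k=1.5):
    """Return (lower_fence, upper_fence, outliers) using the k * IQR rule."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    mask = (values < lower) | (values > upper)
    return lower, upper, values[mask]

# Toy data (hypothetical): nine ordinary values and one extreme one.
data = [12, 13, 13, 14, 15, 15, 16, 17, 18, 95]
lower, upper, outliers = iqr_outliers(data)
print(lower, upper, outliers)
```

Here Q1 = 13.25 and Q3 = 16.75, so the IQR is 3.5 and the fences sit at 8.0 and 22.0; only 95 falls outside them.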

2. Robust Z-Score

Fixes the biggest flaw of the standard 3 Sigma method: it uses the median and the Median Absolute Deviation (MAD) instead of the mean and standard deviation, so a few extreme values have far less influence on the calculation.

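How it works: The formula is Z = 0.6745 * (x - median) / MAD, where MAD is the median of |x - median| and the constant 0.6745 puts MAD on the same scale as the standard deviation for normally distributed data. Values with an absolute Z-score > 3.5 are typically considered outliers.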
Best for: Datasets with obvious extreme values or non-normal distributions.

Quick Python example:

import pandas as pd
import numpy as np

df = pd.read_csv('your_data.csv')
target_col = df['your_target_column']

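median_val = target_col.median()
# MAD: median of the absolute deviations from the median (pandas skips NaN here)
mad = (target_col - median_val).abs().median()
# Caveat: if more than half of the values are identical, mad is 0 and the score is undefined
z_scores = 0.6745 * (target_col - median_val) / mad

outliers = target_col[z_scores.abs() > 3.5]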
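
The difference from 3 Sigma is easiest to see side by side. The sketch below uses made-up readings (an assumption for illustration) in which one gross error inflates the mean and standard deviation so much that the classic z-score never reaches 3, while the robust score flags it clearly.

```python
import numpy as np

# Toy data (hypothetical): seven ordinary readings and one gross error.
x = np.array([10, 11, 12, 10, 11, 12, 11, 1000], dtype=float)

# Classic z-score: the extreme value inflates both the mean and the std.
z_classic = (x - x.mean()) / x.std()

# Robust z-score: median and MAD barely move when one value is extreme.
median = np.median(x)
mad = np.median(np.abs(x - median))
if mad == 0:
    raise ValueError("MAD is zero; the robust z-score is undefined for this data")
z_robust = 0.6745 * (x - median) / mad

flag_classic = np.abs(z_classic) > 3    # 3 Sigma rule
flag_robust = np.abs(z_robust) > 3.5    # threshold quoted above

print("3 Sigma flags:", x[flag_classic])
print("Robust Z flags:", x[flag_robust])
```

With only eight points, the classic score of the value 1000 stays below 3, so 3 Sigma flags nothing; the robust score flags only that value.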

3. Isolation Forest

An algorithm built specifically for outlier detection. It is usually applied to multi-feature data but also works on a single feature, and it scales well to large datasets. It works by randomly splitting the feature space; outliers get isolated with far fewer splits than normal points.

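How it works: Train the forest on your single feature; it then assigns an anomaly score to each point based on how few random splits it takes to isolate it, and the most anomalous points are flagged as outliers. Note that scikit-learn's score_samples uses the opposite sign convention: the lower the score, the more abnormal the point.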
Best for: Large datasets where you want to avoid manual parameter tuning (default settings often work well).

Quick Python example:

from sklearn.ensemble import IsolationForest
import pandas as pd

df = pd.read_csv('your_data.csv')
X = df['your_target_column'].values.reshape(-1, 1)  # Reshape for scikit-learn

iso_forest = IsolationForest(contamination=0.05, random_state=42)  # Assume 5% of data are outliers; fixed seed for reproducible results
labels = iso_forest.fit_predict(X)

outliers = df[labels == -1]  # -1 indicates outlier labels
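
If you want to rank points rather than only label them, the raw scores can be inspected directly. This is a minimal sketch on synthetic data (the generated values and the planted point at 100 are assumptions for illustration); it relies on scikit-learn's convention that lower score_samples values mean more abnormal points.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic data (hypothetical): 200 ordinary values plus one planted extreme.
rng = np.random.default_rng(0)
values = np.append(rng.normal(loc=0.0, scale=1.0, size=200), 100.0)
X = values.reshape(-1, 1)  # scikit-learn expects a 2D array

model = IsolationForest(contamination=0.05, random_state=0).fit(X)

scores = model.score_samples(X)  # lower = more abnormal
labels = model.predict(X)        # -1 = outlier, 1 = inlier

# The point that is easiest to isolate has the lowest score.
most_anomalous = values[np.argmin(scores)]
print("Most anomalous value:", most_anomalous)
print("Number flagged:", int((labels == -1).sum()))
```

Because contamination=0.05 tells the model to flag roughly 5% of the training points, some ordinary values are flagged alongside the planted one; sorting by score lets you review the most extreme candidates first.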

4. DBSCAN (Density-Based Clustering)

While often used for multi-feature data, it also works for single-feature tasks. It identifies outliers as points that lie in low-density regions of the feature space.

How it works: Define a radius (eps) and minimum number of samples (min_samples) required to form a dense cluster. Points not part of any cluster are marked as outliers.
Best for: Datasets with natural clustering patterns (e.g., most values cluster around two peaks, with outliers scattered far away).

Quick Python example:

from sklearn.cluster import DBSCAN
import pandas as pd

df = pd.read_csv('your_data.csv')
X = df['your_target_column'].values.reshape(-1, 1)

dbscan = DBSCAN(eps=0.5, min_samples=5)  # Tune eps based on your data's scale
labels = dbscan.fit_predict(X)

outliers = df[labels == -1]
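
As a small illustration of the two-peak case, the sketch below runs DBSCAN on hand-picked values (an assumption for illustration): two tight groups near 10 and 50, plus two isolated points. The eps and min_samples values match the example above and only make sense at this data scale.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Toy data (hypothetical): two dense groups and two isolated points.
values = np.array([9.8, 9.9, 10.0, 10.1, 10.2, 10.3,
                   49.8, 49.9, 50.0, 50.1, 50.2, 50.3,
                   30.0, 100.0])

# Each point needs at least 5 points (itself included) within 0.5 to be a core point.
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(values.reshape(-1, 1))

noise = values[labels == -1]                      # label -1 marks noise points
n_clusters = len(set(labels.tolist()) - {-1})     # count clusters, ignoring noise

print("Clusters found:", n_clusters)
print("Noise points:", noise)
```

The point at 30 sits between the two peaks, so a global rule such as IQR would not flag it; DBSCAN marks it as noise because nothing else lies within eps of it.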

5. Percentile Thresholding

The simplest approach for when you have domain knowledge about what constitutes an "extreme" value.

How it works: Set hard thresholds using percentiles (e.g., flag values below the 1st percentile or above the 99th percentile as outliers).
Best for: Business use cases where you have clear rules (e.g., "any transaction larger than 99% of all user purchases is suspicious").

Quick Python example:

import pandas as pd

df = pd.read_csv('your_data.csv')
target_col = df['your_target_column']

lower_thresh = target_col.quantile(0.01)
upper_thresh = target_col.quantile(0.99)

outliers = target_col[(target_col < lower_thresh) | (target_col > upper_thresh)]
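
One property worth keeping in mind is that percentile cutoffs flag a fixed share of the data whether or not anything unusual is present. The sketch below shows this on a deliberately clean, evenly spaced series (made-up data, assumed for illustration).

```python
import pandas as pd

# Clean data (hypothetical): 1000 evenly spaced values with no outliers at all.
clean = pd.Series(range(1000), dtype=float)

lower_thresh = clean.quantile(0.01)
upper_thresh = clean.quantile(0.99)

flagged = clean[(clean < lower_thresh) | (clean > upper_thresh)]
share = len(flagged) / len(clean)

print(f"Flagged {len(flagged)} of {len(clean)} values ({share:.1%})")
```

The 1st/99th percentile rule still flags 20 of the 1000 values here, so it is better read as "the most extreme 2%" than as evidence that outliers exist.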

Quick Tip for Validation

Since you've manually added outliers to test performance, I'd suggest running a few of these methods and comparing their precision/recall against your labeled outliers. As a rule of thumb, IQR and Robust Z-Score suit non-normal data, while Isolation Forest scales best to large datasets.
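
A minimal sketch of that comparison, assuming synthetic data with five planted outliers and using the IQR rule as the detector under test. All values here are made up for illustration; in practice you would substitute your own column and your own labels.

```python
import numpy as np

# Synthetic data (hypothetical): 500 ordinary values plus 5 planted outliers.
rng = np.random.default_rng(42)
normal = rng.normal(loc=50.0, scale=5.0, size=500)
planted = np.array([-60.0, -40.0, 120.0, 130.0, 150.0])

values = np.concatenate([normal, planted])
is_planted = np.concatenate([np.zeros(normal.size, dtype=bool),
                             np.ones(planted.size, dtype=bool)])

# Detector under test: the 1.5 * IQR rule.
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
flagged = (values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)

true_positives = np.sum(flagged & is_planted)
precision = true_positives / flagged.sum()   # share of flagged points that were planted
recall = true_positives / is_planted.sum()   # share of planted points that were flagged

print(f"precision={precision:.2f}, recall={recall:.2f}")
```

Swapping in another detector only changes how flagged is computed, so the same precision and recall lines can be reused to compare methods on equal terms.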

The question comes from Stack Exchange and was asked by user E199504.