How to Exclude Historical Latency Data from Redeployed Pods in Grafana with PromQL

阿华AIGC实验室

2026-5-14

Solution: Filter Out Stale Pod Data in PromQL Max Latency Query

Core Approach: Use the `up` Metric to Filter Active Instances

The simplest and most maintainable way to exclude data from terminated pods is to use Prometheus' built-in up metric, which Prometheus records for every scrape target: up=1 means the last scrape succeeded, up=0 means it failed, and no up series at all means the target is no longer being scraped (for example, because the pod was terminated).

Working PromQL Query

Replace your-job-name and your target quantile with your actual values:

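max(latency{quantile="0.95"} * on(exported_instance) group_left() (up{job="your-job-name"} == 1))

How This Works

on(exported_instance): Matches each latency series with the up series that has the same exported_instance value. The label must exist on both sides of the join. The up metric normally carries only target labels such as job and instance, so adjust this to whatever label identifies your pods on both metrics (for example instance, or pod if that's available).
group_left(): Allows several latency series (for example, one per quantile) to match a single up series, and keeps the labels of the latency side in the result.
up == 1: Keeps only targets whose most recent scrape succeeded. Without this filter, a target with up=0 would still match and turn its latency into 0 instead of removing it.
The multiplication operation: For active pods, up=1 so the latency value remains unchanged. For terminated pods, there is no matching up series with value 1, so those latency entries drop out of the query results.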
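
To make the join semantics concrete, here is a small Python sketch that mimics the intent described above on a single instant vector: keep only latency series whose target has up equal to 1, then take the maximum. It is a toy model, not how Prometheus is implemented, and every label value and number in it is made up for illustration.

```python
# Toy model of: max(latency * on(instance) group_left() (up == 1))
# All label values and numbers below are hypothetical.

latency = [
    # (labels, value) pairs standing in for the latency instant vector
    ({"instance": "pod-a", "quantile": "0.95"}, 0.120),
    ({"instance": "pod-b", "quantile": "0.95"}, 0.340),
    ({"instance": "pod-old", "quantile": "0.95"}, 2.500),  # redeployed pod
]

up = [
    ({"instance": "pod-a", "job": "your-job-name"}, 1),
    ({"instance": "pod-b", "job": "your-job-name"}, 0),  # scrape failing
    # no entry at all for pod-old: the target no longer exists
]


def join_on(left, right, label):
    """Keep left samples whose `label` matches a right sample with value 1."""
    healthy = {labels[label] for labels, value in right if value == 1}
    return [
        (labels, value * 1)  # multiplying by up == 1 leaves the value unchanged
        for labels, value in left
        if labels[label] in healthy
    ]


filtered = join_on(latency, up, "instance")
worst = max(value for _, value in filtered)
print(filtered)
print(worst)
```

In this sketch pod-old has the highest raw latency, but it never reaches max because nothing on the up side matches it. pod-b is dropped as well, because its scrape is failing.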

Why Staleness Settings Didn't Work for You

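Staleness handling depends on Prometheus noticing that a series has stopped. When a target disappears, or a scrape no longer returns a series, Prometheus writes a staleness marker for it. Otherwise an instant query keeps returning the last sample for up to the lookback delta (the --query.lookback-delta flag, default 5m). If the terminated pod's series never received a marker (for example, because its samples carry explicit timestamps or are still exposed by an intermediary), or if the period in effect is as long as the 15 days you mentioned, this mechanism won't filter out the old data in a timely way. Using the up metric sidesteps this by explicitly keeping only active targets.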
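
The following Python sketch illustrates why an old sample can keep showing up. It models an instant lookup that returns the newest sample inside a lookback window unless that sample is a staleness marker. The STALE sentinel, the function name, and the timestamps are assumptions made for the example. Only the 5-minute default is taken from Prometheus.

```python
LOOKBACK_DELTA = 5 * 60  # seconds; mirrors the default --query.lookback-delta
STALE = object()         # stand-in for the staleness marker Prometheus writes


def instant_value(samples, eval_time, lookback=LOOKBACK_DELTA):
    """Return the newest value inside the lookback window, or None.

    `samples` is a list of (timestamp_seconds, value) sorted by timestamp.
    """
    candidates = [(t, v) for t, v in samples if eval_time - lookback < t <= eval_time]
    if not candidates:
        return None  # nothing recent enough: the series drops out
    _, value = candidates[-1]
    return None if value is STALE else value


# Hypothetical series: last real scrape at t=100, then the pod went away.
with_marker = [(40, 1.9), (100, 2.5), (160, STALE)]
without_marker = [(40, 1.9), (100, 2.5)]  # no marker was ever written

print(instant_value(with_marker, eval_time=200))     # marker hides the series
print(instant_value(without_marker, eval_time=200))  # old value still returned
print(instant_value(without_marker, eval_time=500))  # gone once the window passes
```

With a marker the series disappears at the next evaluation. Without one, the old value lingers until the lookback window has passed, and a longer window means a longer linger.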

Alternative: Filter with `last_over_time` (If `up` Isn't Available)

If your setup doesn't expose the up metric, you can use last_over_time to check if a pod has reported metrics recently:

max(last_over_time(latency{quantile="0.95"}[10m]))

For each latency series this returns the most recent sample from the last 10 minutes. Series with no sample in that window are left out before max is applied. Adjust the window to a small multiple of your scrape interval: the shorter the window, the sooner a redeployed pod's old series drops out.
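
The windowing idea can be sketched in Python as well. This is a simplified model with hypothetical pod names and samples. It only shows that a series with nothing inside the window never reaches the max step.

```python
def last_over_time(samples, eval_time, window):
    """Newest value with a timestamp in the `window` seconds before eval_time, else None."""
    recent = [v for t, v in samples if eval_time - window < t <= eval_time]
    return recent[-1] if recent else None


# Hypothetical per-pod samples as (timestamp_seconds, latency_seconds).
series = {
    "pod-new": [(880, 0.21), (940, 0.25), (1000, 0.23)],
    "pod-old": [(100, 2.4), (160, 2.5)],  # stopped reporting long ago
}

now, window = 1000, 10 * 60  # evaluate at t=1000 with a 10m window
current = {}
for pod, samples in series.items():
    value = last_over_time(samples, now, window)
    if value is not None:  # pods with nothing in the window are skipped
        current[pod] = value

print(current)
print(max(current.values()))
```

Here pod-old holds the largest value overall, but its last sample is older than the window, so only pod-new contributes to the result.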

Why Your Previous Solutions Didn't Scale

Modifying global staleness: Affects every query on the server, not just this panel, and can make infrequently scraped series drop out of results unexpectedly.
Restarting Prometheus: Manual, unsustainable for frequent pod redeploys.
Hardcoding instance IDs: Breaks immediately when pods redeploy and IDs change.

The question comes from Stack Exchange; it was asked by Pisush.