How to exclude historical latency data from redeployed pods in Grafana using PromQL
Core Approach: Use the up Metric to Filter Active Instances
The simplest and most maintainable way to exclude data from terminated pods is to leverage Prometheus' built-in up metric, which records whether a target (your pod) was scraped successfully: up=1 means the last scrape succeeded, up=0 means it failed, and a target that has been removed eventually has no up series at all.
Working PromQL Query
Replace your-job-name and your target quantile with your actual values:
max(latency{quantile="0.95"} * on(exported_instance) group_left() (up{job="your-job-name"} == 1))
How This Works
- on(exported_instance): Matches each latency time series with the corresponding up time series using the exported_instance label (adjust this to whatever label uniquely identifies your pods, such as pod, if that is available in your metrics).
- group_left(): Ensures we retain all labels from the latency metric while joining with the up metric.
- The multiplication operation: For active pods, up=1, so the latency value remains unchanged. For terminated pods, there is no matching up=1 time series, so those latency entries are automatically excluded from the query results.
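To make the matching concrete, here is a toy Python model of the join. It is only a sketch of the semantics, not how Prometheus implements them, and the pod names and values are made up for illustration.

```python
# Toy model of `latency * on(exported_instance) group_left() up`.
# An instant vector is a list of (labels, value) pairs; all data here is invented.

def join_on(left, right, label):
    """Multiply each left sample by the right sample sharing `label`; drop unmatched ones."""
    right_by_key = {labels[label]: value for labels, value in right}
    result = []
    for labels, value in left:
        key = labels.get(label)
        if key in right_by_key:
            # group_left(): the result keeps the left-hand (latency) labels.
            result.append((labels, value * right_by_key[key]))
    return result

latency = [
    ({"exported_instance": "pod-a", "quantile": "0.95"}, 0.25),
    ({"exported_instance": "pod-old", "quantile": "0.95"}, 4.0),  # redeployed pod
]
up = [({"exported_instance": "pod-a", "job": "your-job-name"}, 1.0)]

joined = join_on(latency, up, "exported_instance")
print(max(value for _, value in joined))  # 0.25
```

The pod-old series has no partner on the up side, so it never reaches max(). Note that a partner with value 0 would not be dropped by multiplication alone; it would produce a latency of 0, which is worth keeping in mind if your targets can report up=0.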
Why Staleness Settings Didn't Work for You
Staleness handling depends on Prometheus noticing that a series has stopped. When a target disappears or a series drops out of a scrape, Prometheus writes a staleness marker; otherwise an instant query keeps returning the last sample for up to the lookback delta (--query.lookback-delta, default 5m; this was -query.staleness-delta in Prometheus 1.x). If your terminated pod's series kept being reported, or Prometheus had not yet marked it as stale (you mentioned your staleness period is 15 days), this mechanism won't filter out the old data in a timely way. Using the up metric bypasses this entirely by explicitly keeping only active targets.
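A simplified Python model may help show why staleness alone can miss this case. It assumes the 5-minute default window mentioned above and uses a hypothetical STALE object in place of Prometheus' real staleness marker; boundary handling is deliberately simplified.

```python
LOOKBACK_SECONDS = 300  # assumes the default 5m window
STALE = object()        # hypothetical stand-in for a staleness marker

def instant_value(samples, now, lookback=LOOKBACK_SECONDS):
    """Return the value an instant query would see at `now`, or None."""
    in_window = [(ts, v) for ts, v in samples if now - lookback < ts <= now]
    if not in_window:
        return None  # the last sample is older than the lookback window
    _, latest = max(in_window, key=lambda sample: sample[0])
    return None if latest is STALE else latest

now = 1_000
still_reported = [(now - 120, 0.2), (now - 60, 0.3)]  # old pod's series keeps arriving
marked_stale = [(now - 120, 0.2), (now - 60, STALE)]  # Prometheus saw it disappear
silent = [(now - 900, 0.2)]                           # no samples for 15 minutes

print(instant_value(still_reported, now))  # 0.3
print(instant_value(marked_stale, now))    # None
print(instant_value(silent, now))          # None
```

Only the first case is a problem: as long as samples for the old pod keep arriving, no staleness rule will hide them, which is why an explicit filter is needed.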
Alternative: Filter with last_over_time (If up Isn't Available)
If your setup doesn't expose the up metric, you can use last_over_time to check if a pod has reported metrics recently:
max(last_over_time(latency{quantile="0.95"}[10m]))
This keeps only latency time series that have reported at least one sample in the last 10 minutes and takes the maximum of their most recent values (set the window to a few multiples of your scrape interval so a single missed scrape does not drop a live pod).
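The intended effect of the 10-minute window can be sketched in Python as follows. This is an illustrative model with invented pod names and timestamps, not Prometheus code, and the exact window boundary behavior is simplified.

```python
def max_of_recent(series_by_pod, now, window_seconds):
    """Max over the latest in-window value of each series; None if none qualify."""
    latest_values = []
    for samples in series_by_pod.values():
        in_window = [(ts, v) for ts, v in samples if now - window_seconds < ts <= now]
        if in_window:
            # Keep the most recent sample of this series, as last_over_time would.
            latest_values.append(max(in_window, key=lambda s: s[0])[1])
    return max(latest_values) if latest_values else None

now = 10_000
series_by_pod = {
    "pod-new": [(now - 90, 0.21), (now - 30, 0.25)],
    "pod-old": [(now - 3_600, 4.0)],  # redeployed an hour ago
}
print(max_of_recent(series_by_pod, now, 600))  # 0.25
```

The old pod's 4.0 reading falls outside the window and is ignored, so the result reflects only pods that are still reporting.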
Why Your Previous Solutions Didn't Scale
- Modifying global staleness: Impacts all queries and can lead to unexpected data loss for long-running targets.
- Restarting Prometheus: Manual, unsustainable for frequent pod redeploys.
- Hardcoding instance IDs: Breaks immediately when pods redeploy and IDs change.
This question originates from Stack Exchange; it was asked by Pisush.




