Last updated: 2023.12.14 15:02:39
First published: 2023.11.17 15:31:30
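The managed Prometheus console provides preset dashboards for common VKE cluster monitoring scenarios. This topic describes the network service monitoring dashboards.
vke-core-dns-dashboard is the core-dns monitoring dashboard. It shows monitoring information for all or specified core-dns instances in the cluster, including DNS requests, DNS error rate, DNS forwarding, DNS cache hit rate, request response latency (P90), and forwarded request response latency (P90).
The following table lists the metrics on the core-dns monitoring dashboard.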
Dashboard category | Panel name | PromQL statement |
---|---|---|
core-dns monitoring | DNS requests | `sum(increase(coredns_dns_requests_total{cluster=~"$ClusterId",pod=~"$Pod",job="core-dns"}[$__range])) or vector(0)` |
core-dns monitoring | DNS error rate | `100*sum(increase(coredns_dns_responses_total{cluster=~"$ClusterId",pod=~"$Pod",job="core-dns",rcode=~"SERVFAIL\|REFUSED"}[$__range]))/sum(increase(coredns_dns_requests_total{cluster=~"$ClusterId",pod=~"$Pod",job="core-dns"}[$__range])) or vector(0)` |
core-dns monitoring | DNS forwarding | `sum(increase(coredns_forward_requests_total{cluster=~"$ClusterId",pod=~"$Pod",job="core-dns"}[$__range])) or vector(0)` |
core-dns monitoring | DNS cache hit rate | `100*sum(increase(coredns_cache_hits_total{cluster=~"$ClusterId",pod=~"$Pod",job="core-dns"}[$__range]))/sum(increase(coredns_dns_requests_total{cluster=~"$ClusterId",pod=~"$Pod",job="core-dns"}[$__range])) or vector(0)` |
core-dns monitoring | DNS requests (by protocol) | `sum(rate(coredns_dns_requests_total{cluster=~"$ClusterId",pod=~"$Pod",job="core-dns"}[5m])) by (proto)` |
core-dns monitoring | DNS requests (by type) | `sum(rate(coredns_dns_requests_total{cluster=~"$ClusterId",pod=~"$Pod",job="core-dns"}[5m])) by (type)` |
core-dns monitoring | DNS forwarding | `sum(rate(coredns_forward_requests_total{cluster=~"$ClusterId",pod=~"$Pod",job="core-dns"}[5m])) or vector(0)` |
core-dns monitoring | DNS error rate | `100*sum(rate(coredns_dns_responses_total{cluster=~"$ClusterId",pod=~"$Pod",job="core-dns",rcode=~"SERVFAIL\|REFUSED"}[5m]))/sum(rate(coredns_dns_requests_total{cluster=~"$ClusterId",pod=~"$Pod",job="core-dns"}[5m])) or vector(0)` |
core-dns monitoring | Cache hits/misses | `sum(rate(coredns_cache_hits_total{cluster=~"$ClusterId",pod=~"$Pod",job="core-dns"}[5m]))`<br>`sum(rate(coredns_cache_misses_total{cluster=~"$ClusterId",pod=~"$Pod",job="core-dns"}[5m]))` |
core-dns monitoring | Request response latency (P90) | `1000*histogram_quantile(0.9, sum(rate(coredns_dns_request_duration_seconds_bucket{cluster=~"$ClusterId",pod=~"$Pod",job="core-dns"}[5m]))by (le))` |
core-dns monitoring | Forwarded request response latency (P90) | `1000*histogram_quantile(0.9, sum(rate(coredns_forward_request_duration_seconds_bucket{cluster=~"$ClusterId",pod=~"$Pod",job="core-dns"}[5m]))by (le))` |
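Note
To use the PromQL statements above in the Explore feature or the alert center of managed Prometheus, whether to inspect specific metrics or to configure alerts, modify or delete the cluster, node, and pod variables in the statements. For example, replace the `$ClusterId` variable in the `cluster=~"$ClusterId"` matcher with a specific cluster ID, or delete the matcher entirely.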
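The substitution described in the note can be scripted when many panel queries need converting. The following is a minimal sketch, not part of the product: `fill_dashboard_variables` is a hypothetical helper, `my-cluster-id` is a placeholder cluster ID, and the fixed `1h` window assumes that the built-in `$__range` variable is not resolved in the place where the query is reused.

```python
import re


def fill_dashboard_variables(query: str, values: dict[str, str]) -> str:
    """Replace $Name placeholders in a dashboard query with concrete values."""

    def replace(match: re.Match) -> str:
        name = match.group(1)
        # Leave unknown placeholders untouched so they are easy to spot.
        return values.get(name, match.group(0))

    return re.sub(r"\$(\w+)", replace, query)


# "DNS requests" panel query from the core-dns table.
panel_query = (
    "sum(increase(coredns_dns_requests_total"
    '{cluster=~"$ClusterId",pod=~"$Pod",job="core-dns"}[$__range])) or vector(0)'
)

explore_query = fill_dashboard_variables(
    panel_query,
    {
        "ClusterId": "my-cluster-id",  # hypothetical cluster ID
        "Pod": ".*",                   # regex that matches every pod
        "__range": "1h",               # assumed fixed window
    },
)
print(explore_query)
```

Setting `Pod` to `.*` keeps the matcher but lets it match every pod, which has the same effect as selecting all instances on the dashboard.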
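vke-local-dns-dashboard is the node-local-dns monitoring dashboard. It shows monitoring information for all or specified node-local-dns instances, including DNS requests, DNS error rate, DNS forwarding, DNS cache hit rate, request response latency (P90), and forwarded request response latency (P90).
The following table lists the metrics on the node-local-dns monitoring dashboard.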
Dashboard category | Panel name | PromQL statement |
---|---|---|
node-local-dns monitoring | DNS requests | `sum(increase(coredns_dns_requests_total{cluster=~"$ClusterId",pod=~"$Pod",job="core-dns"}[$__range])) or vector(0)` |
node-local-dns monitoring | DNS error rate | `100*sum(increase(coredns_dns_responses_total{cluster=~"$ClusterId",pod=~"$Pod",job="core-dns",rcode=~"SERVFAIL\|REFUSED"}[$__range]))/sum(increase(coredns_dns_requests_total{cluster=~"$ClusterId",pod=~"$Pod",job="core-dns"}[$__range])) or vector(0)` |
node-local-dns monitoring | DNS forwarding | `sum(increase(coredns_forward_requests_total{cluster=~"$ClusterId",pod=~"$Pod",job="core-dns"}[$__range])) or vector(0)` |
node-local-dns monitoring | DNS cache hit rate | `100*sum(increase(coredns_cache_hits_total{cluster=~"$ClusterId",pod=~"$Pod",job="core-dns"}[$__range]))/sum(increase(coredns_dns_requests_total{cluster=~"$ClusterId",pod=~"$Pod",job="core-dns"}[$__range])) or vector(0)` |
node-local-dns monitoring | DNS requests (by protocol) | `sum(rate(coredns_dns_requests_total{cluster=~"$ClusterId",pod=~"$Pod",job="core-dns"}[5m])) by (proto)` |
node-local-dns monitoring | DNS requests (by type) | `sum(rate(coredns_dns_requests_total{cluster=~"$ClusterId",pod=~"$Pod",job="core-dns"}[5m])) by (type)` |
node-local-dns monitoring | DNS forwarding | `sum(rate(coredns_forward_requests_total{cluster=~"$ClusterId",pod=~"$Pod",job="core-dns"}[5m])) or vector(0)` |
node-local-dns monitoring | DNS error rate | `100*sum(rate(coredns_dns_responses_total{cluster=~"$ClusterId",pod=~"$Pod",job="core-dns",rcode=~"SERVFAIL\|REFUSED"}[5m]))/sum(rate(coredns_dns_requests_total{cluster=~"$ClusterId",pod=~"$Pod",job="core-dns"}[5m])) or vector(0)` |
node-local-dns monitoring | Cache hits/misses | `sum(rate(coredns_cache_hits_total{cluster=~"$ClusterId",pod=~"$Pod",job="core-dns"}[5m]))`<br>`sum(rate(coredns_cache_misses_total{cluster=~"$ClusterId",pod=~"$Pod",job="core-dns"}[5m]))` |
node-local-dns monitoring | Request response latency (P90) | `1000*histogram_quantile(0.9, sum(rate(coredns_dns_request_duration_seconds_bucket{cluster=~"$ClusterId",pod=~"$Pod",job="core-dns"}[5m]))by (le))` |
node-local-dns monitoring | Forwarded request response latency (P90) | `1000*histogram_quantile(0.9, sum(rate(coredns_forward_request_duration_seconds_bucket{cluster=~"$ClusterId",pod=~"$Pod",job="core-dns"}[5m]))by (le))` |
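Note
To use the PromQL statements above in the Explore feature or the alert center of managed Prometheus, whether to inspect specific metrics or to configure alerts, modify or delete the cluster, node, and pod variables in the statements. For example, replace the `$ClusterId` variable in the `cluster=~"$ClusterId"` matcher with a specific cluster ID, or delete the matcher entirely.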
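The other option in the note, deleting a matcher outright, can be sketched in a similar way. `drop_label_matcher` below is a hypothetical helper written for illustration. It assumes label values contain no escaped double quotes, which holds for the queries in the table above.

```python
import re


def drop_label_matcher(query: str, label: str) -> str:
    """Remove every matcher on `label` (for example cluster=~"...") from `query`."""
    # (?<!\w) keeps "pod" from matching inside a longer name such as "controller_pod".
    matcher = rf'(?<!\w){re.escape(label)}\s*(?:=~|!~|!=|=)\s*"[^"]*"'
    # Remove the matcher together with the comma that follows it.
    query = re.sub(rf"{matcher}\s*,\s*", "", query)
    # Remove the matcher together with the comma that precedes it.
    query = re.sub(rf",\s*{matcher}", "", query)
    # Remove the matcher when it is the only one in its selector.
    return re.sub(matcher, "", query)


# Cache-miss query from the node-local-dns table.
panel_query = (
    "sum(rate(coredns_cache_misses_total"
    '{cluster=~"$ClusterId",pod=~"$Pod",job="core-dns"}[5m]))'
)

# Drop both dashboard variables and keep only the job matcher.
explore_query = drop_label_matcher(drop_label_matcher(panel_query, "cluster"), "pod")
print(explore_query)
```

With no `cluster` or `pod` matcher left, the resulting query aggregates over every series that the `job` matcher selects.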
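vke-ingress-dashboard is the ingress monitoring dashboard. It shows monitoring information for the ingress instances in the cluster.
The following table lists the metrics on the ingress-nginx monitoring dashboard.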
Dashboard category | Panel name | PromQL statement |
---|---|---|
ingress-nginx monitoring | Total requests | `sum(nginx_ingress_controller_nginx_process_requests_total{cluster="$clusterId"})` |
ingress-nginx monitoring | Connections | `sum(avg_over_time(nginx_ingress_controller_nginx_process_connections{cluster="$clusterId",state="active"}[5m]))` |
ingress-nginx monitoring | Request success rate | `sum(rate(nginx_ingress_controller_requests{cluster="$clusterId",status!~"[4-5].*"}[2m]))/sum(rate(nginx_ingress_controller_requests{cluster="$clusterId"}[2m]))` |
ingress-nginx monitoring | Configuration reloads | `avg(nginx_ingress_controller_success{cluster="$clusterId"})` |
ingress-nginx monitoring | CPU trend | `avg (rate (nginx_ingress_controller_nginx_process_cpu_seconds_total{cluster="$clusterId"}[2m]))by(controller_pod)` |
ingress-nginx monitoring | Memory trend | `avg(nginx_ingress_controller_nginx_process_resident_memory_bytes{cluster="$clusterId"})by(controller_pod)` |
ingress-nginx monitoring | Traffic rate trend | `sum (irate (nginx_ingress_controller_request_size_sum{cluster="$clusterId"}[2m]))by(controller_pod)` |
ingress-nginx monitoring | Request processing latency | `histogram_quantile(0.9, sum(rate(nginx_ingress_controller_response_duration_seconds_bucket{cluster="$clusterId"}[2m]))by(le,path))` |
ingress-nginx monitoring | Request trend | `sum(increase(nginx_ingress_controller_requests{cluster="$clusterId"}[15s]))by(controller_pod)` |
ingress-nginx monitoring | Request count trend by path | `sum(increase(nginx_ingress_controller_requests{cluster="$clusterId"}[15s]))by(path)` |
ingress-nginx monitoring | Error count trend by path | `sum(increase(nginx_ingress_controller_requests{cluster="$clusterId",status!~"2.."}[15s]))by(path)` |
ingress-nginx monitoring | Error count trend by HTTP code | `sum(increase(nginx_ingress_controller_requests{cluster="$clusterId",status!~"2.."}[15s]))by(status)` |
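Note
To use the PromQL statements above in the Explore feature or the alert center of managed Prometheus, whether to inspect specific metrics or to configure alerts, modify or delete the cluster, node, and pod variables in the statements. For example, replace the `$clusterId` variable in the `cluster="$clusterId"` matcher with a specific cluster ID, or delete the matcher entirely.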
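The ingress-nginx queries above use two different status matchers, `status!~"[4-5].*"` for the success rate and `status!~"2.."` for the error trends. Prometheus regex matchers must match the whole label value, so Python's `re.fullmatch` is a reasonable stand-in for a quick check of which status codes each panel counts. The sample status list is made up for illustration.

```python
import re


def matches(pattern: str, value: str) -> bool:
    """Mimic a Prometheus regex matcher, which must match the whole label value."""
    return re.fullmatch(pattern, value) is not None


# Hypothetical sample of status label values.
statuses = ["101", "200", "204", "301", "404", "502"]

# status!~"[4-5].*" keeps these series in the success-rate numerator.
counted_as_success = [s for s in statuses if not matches(r"[4-5].*", s)]

# status!~"2.." keeps these series in the error trend panels.
counted_as_error = [s for s in statuses if not matches(r"2..", s)]

print(counted_as_success)  # ['101', '200', '204', '301']
print(counted_as_error)    # ['101', '301', '404', '502']
```

Under these matchers a 3xx response counts toward the success rate and also appears in the error trend panels, which is worth keeping in mind when reusing either expression in an alert rule.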