Network Service Monitoring

Last updated: 2023.12.14 15:02:39

First published: 2023.11.17 15:31:30

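The managed Prometheus console provides preset dashboards for common VKE cluster monitoring scenarios. This topic describes the network service monitoring dashboards.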

vke-core-dns-dashboard

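vke-core-dns-dashboard is the core-dns monitoring dashboard. It shows monitoring information for all or specified core-dns instances in the cluster, including DNS requests, DNS error rate, DNS forwarding, DNS cache hit rate, request response latency (P90), and forwarded request response latency (P90).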

The metrics of the core-dns monitoring dashboard are listed below.

Dashboard category: core-dns monitoring. Each entry gives the panel name, followed by its PromQL statement.

  • DNS requests: sum(increase(coredns_dns_requests_total{cluster=~"$ClusterId",pod=~"$Pod",job="core-dns"}[$__range])) or vector(0)
  • DNS error rate: 100*sum(increase(coredns_dns_responses_total{cluster=~"$ClusterId",pod=~"$Pod",job="core-dns",rcode=~"SERVFAIL|REFUSED"}[$__range]))/sum(increase(coredns_dns_requests_total{cluster=~"$ClusterId",pod=~"$Pod",job="core-dns"}[$__range])) or vector(0)
  • DNS forwarding: sum(increase(coredns_forward_requests_total{cluster=~"$ClusterId",pod=~"$Pod",job="core-dns"}[$__range])) or vector(0)
  • DNS cache hit rate: 100*sum(increase(coredns_cache_hits_total{cluster=~"$ClusterId",pod=~"$Pod",job="core-dns"}[$__range]))/sum(increase(coredns_dns_requests_total{cluster=~"$ClusterId",pod=~"$Pod",job="core-dns"}[$__range])) or vector(0)
  • DNS requests (by protocol): sum(rate(coredns_dns_requests_total{cluster=~"$ClusterId",pod=~"$Pod",job="core-dns"}[5m])) by (proto)
  • DNS requests (by type): sum(rate(coredns_dns_requests_total{cluster=~"$ClusterId",pod=~"$Pod",job="core-dns"}[5m])) by (type)
  • DNS forwarding: sum(rate(coredns_forward_requests_total{cluster=~"$ClusterId",pod=~"$Pod",job="core-dns"}[5m])) or vector(0)
  • DNS error rate: 100*sum(rate(coredns_dns_responses_total{cluster=~"$ClusterId",pod=~"$Pod",job="core-dns",rcode=~"SERVFAIL|REFUSED"}[5m]))/sum(rate(coredns_dns_requests_total{cluster=~"$ClusterId",pod=~"$Pod",job="core-dns"}[5m])) or vector(0)
  • Cache hits/misses (two statements):
      sum(rate(coredns_cache_hits_total{cluster=~"$ClusterId",pod=~"$Pod",job="core-dns"}[5m]))
      sum(rate(coredns_cache_misses_total{cluster=~"$ClusterId",pod=~"$Pod",job="core-dns"}[5m]))
  • Request response latency (P90): 1000*histogram_quantile(0.9, sum(rate(coredns_dns_request_duration_seconds_bucket{cluster=~"$ClusterId",pod=~"$Pod",job="core-dns"}[5m]))by (le))
  • Forwarded request response latency (P90): 1000*histogram_quantile(0.9, sum(rate(coredns_forward_request_duration_seconds_bucket{cluster=~"$ClusterId",pod=~"$Pod",job="core-dns"}[5m]))by (le))
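
The two P90 statements multiply histogram_quantile by 1000, which converts the seconds-based duration histograms into milliseconds. As a rough illustration of what the function computes, the following Python sketch re-implements the linear interpolation over cumulative buckets. It is a simplified approximation rather than Prometheus's actual code, and the bucket bounds and counts are made-up values, not measurements from a real cluster.

```python
import math

def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate a quantile from cumulative (upper_bound, count) buckets.

    Simplified sketch: buckets must be sorted by upper bound and end with +Inf,
    mirroring the `le` label of a Prometheus histogram.
    """
    total = buckets[-1][1]
    if total == 0 or not 0 < q <= 1:
        return math.nan
    rank = q * total
    lower_bound, count_below = 0.0, 0.0
    for upper_bound, cumulative in buckets:
        if cumulative >= rank:
            if math.isinf(upper_bound):
                # Quantile falls into the +Inf bucket: report the highest finite bound.
                return lower_bound
            in_bucket = cumulative - count_below
            # Assume observations are spread evenly inside the bucket.
            return lower_bound + (upper_bound - lower_bound) * (rank - count_below) / in_bucket
        lower_bound, count_below = upper_bound, cumulative
    return math.nan

# Hypothetical per-second bucket rates, as sum(rate(..._bucket[5m])) by (le) would return them.
example_buckets = [(0.001, 50.0), (0.002, 80.0), (0.004, 95.0), (math.inf, 100.0)]

p90_ms = 1000 * histogram_quantile(0.9, example_buckets)
print(f"P90 latency: {p90_ms:.2f} ms")
```

With these sample buckets the 90th percentile falls inside the 0.002 to 0.004 second bucket, so the estimate lies between 2 ms and 4 ms. The precision of the result depends on how finely the bucket boundaries are spaced.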

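Note

To use the PromQL statements above in the Explore feature or the alert center of managed Prometheus, for example to view specific metrics or configure alerts, modify or delete the variables for cluster, node, and pod in the statements. For example, change the $ClusterId variable in the cluster=~"$ClusterId" parameter to a specific cluster ID, or delete the parameter altogether.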
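
As a minimal sketch of the first option (replacing variables with concrete values), the Python helper below fills the placeholders of one core-dns statement. The helper name fill_dashboard_variables and the cluster ID my-cluster-id are hypothetical. Replacing $__range with a fixed window such as 1h is also an assumption, made because that variable is resolved by the dashboard rather than by PromQL itself.

```python
def fill_dashboard_variables(promql: str, values: dict[str, str]) -> str:
    """Replace $Variable placeholders in a dashboard statement with concrete values."""
    # Substitute longer names first so one variable name cannot clobber
    # another that merely starts with the same characters.
    for name in sorted(values, key=len, reverse=True):
        promql = promql.replace("$" + name, values[name])
    return promql

template = (
    'sum(increase(coredns_dns_requests_total'
    '{cluster=~"$ClusterId",pod=~"$Pod",job="core-dns"}[$__range])) or vector(0)'
)

query = fill_dashboard_variables(
    template,
    # "my-cluster-id" is a placeholder; ".*" keeps all pods; "1h" is an example window.
    {"ClusterId": "my-cluster-id", "Pod": ".*", "__range": "1h"},
)
print(query)
# sum(increase(coredns_dns_requests_total{cluster=~"my-cluster-id",pod=~".*",job="core-dns"}[1h])) or vector(0)
```

The resulting string contains no dashboard variables, so it can be pasted into Explore or an alert rule as an ordinary PromQL expression.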

vke-local-dns-dashboard

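vke-local-dns-dashboard is the node-local-dns monitoring dashboard. It shows monitoring information for all or specified node-local-dns instances, including DNS requests, DNS error rate, DNS forwarding, DNS cache hit rate, request response latency (P90), and forwarded request response latency (P90).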

The metrics of the node-local-dns monitoring dashboard are listed below.

Dashboard category: node-local-dns monitoring. Each entry gives the panel name, followed by its PromQL statement.

  • DNS requests: sum(increase(coredns_dns_requests_total{cluster=~"$ClusterId",pod=~"$Pod",job="core-dns"}[$__range])) or vector(0)
  • DNS error rate: 100*sum(increase(coredns_dns_responses_total{cluster=~"$ClusterId",pod=~"$Pod",job="core-dns",rcode=~"SERVFAIL|REFUSED"}[$__range]))/sum(increase(coredns_dns_requests_total{cluster=~"$ClusterId",pod=~"$Pod",job="core-dns"}[$__range])) or vector(0)
  • DNS forwarding: sum(increase(coredns_forward_requests_total{cluster=~"$ClusterId",pod=~"$Pod",job="core-dns"}[$__range])) or vector(0)
  • DNS cache hit rate: 100*sum(increase(coredns_cache_hits_total{cluster=~"$ClusterId",pod=~"$Pod",job="core-dns"}[$__range]))/sum(increase(coredns_dns_requests_total{cluster=~"$ClusterId",pod=~"$Pod",job="core-dns"}[$__range])) or vector(0)
  • DNS requests (by protocol): sum(rate(coredns_dns_requests_total{cluster=~"$ClusterId",pod=~"$Pod",job="core-dns"}[5m])) by (proto)
  • DNS requests (by type): sum(rate(coredns_dns_requests_total{cluster=~"$ClusterId",pod=~"$Pod",job="core-dns"}[5m])) by (type)
  • DNS forwarding: sum(rate(coredns_forward_requests_total{cluster=~"$ClusterId",pod=~"$Pod",job="core-dns"}[5m])) or vector(0)
  • DNS error rate: 100*sum(rate(coredns_dns_responses_total{cluster=~"$ClusterId",pod=~"$Pod",job="core-dns",rcode=~"SERVFAIL|REFUSED"}[5m]))/sum(rate(coredns_dns_requests_total{cluster=~"$ClusterId",pod=~"$Pod",job="core-dns"}[5m])) or vector(0)
  • Cache hits/misses (two statements):
      sum(rate(coredns_cache_hits_total{cluster=~"$ClusterId",pod=~"$Pod",job="core-dns"}[5m]))
      sum(rate(coredns_cache_misses_total{cluster=~"$ClusterId",pod=~"$Pod",job="core-dns"}[5m]))
  • Request response latency (P90): 1000*histogram_quantile(0.9, sum(rate(coredns_dns_request_duration_seconds_bucket{cluster=~"$ClusterId",pod=~"$Pod",job="core-dns"}[5m]))by (le))
  • Forwarded request response latency (P90): 1000*histogram_quantile(0.9, sum(rate(coredns_forward_request_duration_seconds_bucket{cluster=~"$ClusterId",pod=~"$Pod",job="core-dns"}[5m]))by (le))

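Note

To use the PromQL statements above in the Explore feature or the alert center of managed Prometheus, for example to view specific metrics or configure alerts, modify or delete the variables for cluster, node, and pod in the statements. For example, change the $ClusterId variable in the cluster=~"$ClusterId" parameter to a specific cluster ID, or delete the parameter altogether.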
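
The second option, deleting the parameter, can be sketched as a small Python helper that strips every label matcher whose value is a dashboard variable. The function name drop_variable_matchers is hypothetical, and the regular expression only covers the simple matcher shape used by the statements on this page. Removing the cluster matcher makes the statement aggregate over every cluster that reports to the same workspace, which may or may not be what you want.

```python
import re

def drop_variable_matchers(promql: str) -> str:
    """Remove label matchers whose value is a single $Variable, e.g. cluster=~"$ClusterId"."""
    # A matcher here is: label name, an operator (=~, !~, != or =), then "$Name" in quotes.
    matcher = r'\w+\s*(?:=~|!~|!=|=)\s*"\$\w+"'
    # Remove the matcher together with one adjacent comma so the selector stays valid.
    promql = re.sub(matcher + r'\s*,\s*', '', promql)
    promql = re.sub(r',\s*' + matcher, '', promql)
    # A matcher that was alone inside the braces leaves "{}" behind.
    promql = re.sub(matcher, '', promql)
    return promql

template = (
    'sum(rate(coredns_cache_misses_total'
    '{cluster=~"$ClusterId",pod=~"$Pod",job="core-dns"}[5m]))'
)

query = drop_variable_matchers(template)
print(query)
# sum(rate(coredns_cache_misses_total{job="core-dns"}[5m]))
```

Range variables such as [$__range] are not label matchers, so this helper leaves them untouched and they still need a concrete duration.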

vke-ingress-dashboard

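vke-ingress-dashboard is the ingress monitoring dashboard. It shows monitoring information for the ingress instances in the cluster, including:

  • Network information: total requests, connections, request success rate, configuration reloads, request processing latency, request trends, error trends, and so on.
  • Resource information: CPU trend, memory trend, traffic trend, and so on.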

The metrics of the ingress-nginx monitoring dashboard are listed below.

Dashboard category: ingress-nginx monitoring. Each entry gives the panel name, followed by its PromQL statement.

  • Total requests: sum(nginx_ingress_controller_nginx_process_requests_total{cluster="$clusterId"})
  • Connections: sum(avg_over_time(nginx_ingress_controller_nginx_process_connections{cluster="$clusterId",state="active"}[5m]))
  • Request success rate: sum(rate(nginx_ingress_controller_requests{cluster="$clusterId",status!~"[4-5].*"}[2m]))/sum(rate(nginx_ingress_controller_requests{cluster="$clusterId"}[2m]))
  • Configuration reloads: avg(nginx_ingress_controller_success{cluster="$clusterId"})
  • CPU trend: avg (rate (nginx_ingress_controller_nginx_process_cpu_seconds_total{cluster="$clusterId"}[2m]))by(controller_pod)
  • Memory trend: avg(nginx_ingress_controller_nginx_process_resident_memory_bytes{cluster="$clusterId"})by(controller_pod)
  • Traffic rate trend: sum (irate (nginx_ingress_controller_request_size_sum{cluster="$clusterId"}[2m]))by(controller_pod)
  • Request processing latency: histogram_quantile(0.9, sum(rate(nginx_ingress_controller_response_duration_seconds_bucket{cluster="$clusterId"}[2m]))by(le,path))
  • Request trend: sum(increase(nginx_ingress_controller_requests{cluster="$clusterId"}[15s]))by(controller_pod)
  • Request count trend by path: sum(increase(nginx_ingress_controller_requests{cluster="$clusterId"}[15s]))by(path)
  • Error count trend by path: sum(increase(nginx_ingress_controller_requests{cluster="$clusterId",status!~"2.."}[15s]))by(path)
  • Error count trend by HTTP code: sum(increase(nginx_ingress_controller_requests{cluster="$clusterId",status!~"2.."}[15s]))by(status)
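
The success-rate and error-trend statements filter on the status label with different regular expressions, so the same response can be classified differently by the two kinds of panel. The Python sketch below illustrates this. It relies on the fact that PromQL regex matchers are fully anchored, which re.fullmatch reproduces for these simple patterns. The helper names are hypothetical.

```python
import re

def matches(pattern: str, status: str) -> bool:
    """PromQL regex matchers are fully anchored, so use fullmatch rather than search."""
    return re.fullmatch(pattern, status) is not None

def counted_as_success(status: str) -> bool:
    # The request success rate keeps series with status!~"[4-5].*"
    return not matches(r"[4-5].*", status)

def counted_as_error(status: str) -> bool:
    # The error trend panels keep series with status!~"2.."
    return not matches(r"2..", status)

for code in ["200", "301", "404", "503"]:
    print(code, counted_as_success(code), counted_as_error(code))
# 200 True False
# 301 True True
# 404 False True
# 503 False True
```

A 3xx response therefore raises the success rate and also appears in the error trend panels, which is worth keeping in mind when reusing these statements in alert rules.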

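Note

To use the PromQL statements above in the Explore feature or the alert center of managed Prometheus, for example to view specific metrics or configure alerts, modify or delete the variables for cluster, node, and pod in the statements. For example, change the $clusterId variable in the cluster="$clusterId" parameter to a specific cluster ID, or delete the parameter altogether.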