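Core component monitoring

Last updated: 2023.12.14 15:02:39

First published: 2023.11.17 15:31:30

The Managed Prometheus console comes with preset monitoring dashboards for common VKE cluster scenarios. This document describes the dashboards for core components.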

vke-apiserver-dashboard

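vke-apiserver-dashboard is the monitoring dashboard for the kube-apiserver component. It shows monitoring information for this control plane component, including APIServer QPS, read request success rate, write request success rate, and request latency.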

The panels of the kube-apiserver monitoring dashboard and their PromQL statements are listed below.

Panel category: ApiServer

Panel name: PromQL statement
APIServer QPS: sum(irate(apiserver_request_total{cluster="$clusterId"}[1m]))
Read request success rate: sum(irate(apiserver_request_total{cluster="$clusterId",code=~"20.*",verb=~"GET|LIST"}[1m]))/sum(irate(apiserver_request_total{cluster="$clusterId",verb=~"GET|LIST"}[1m]))*100
Write request success rate: sum(irate(apiserver_request_total{cluster="$clusterId",code=~"20.*",verb!~"GET|LIST|WATCH|CONNECT"}[5m]))/sum(irate(apiserver_request_total{cluster="$clusterId",verb!~"GET|LIST|WATCH|CONNECT"}[5m]))*100
In-flight read requests (current value): sum(apiserver_current_inflight_requests{cluster="$clusterId",request_kind="readOnly"})
In-flight write requests (current value): sum(apiserver_current_inflight_requests{cluster="$clusterId",request_kind="mutating"})
Requests to deprecated APIs: sum(apiserver_requested_deprecated_apis{cluster="$clusterId"})
GET request latency [P90]: histogram_quantile(0.90, sum(irate(apiserver_request_duration_seconds_bucket{cluster="$clusterId",verb="GET",resource!="",subresource!~"log|proxy"}[1m])) by ( verb, instance,resource, le))
LIST request latency [P90]: histogram_quantile(0.90, sum(irate(apiserver_request_duration_seconds_bucket{cluster="$clusterId",verb="LIST",resource!="",subresource!~"log|proxy"}[1m])) by ( verb, instance,resource, le))
Write request latency [P90]: histogram_quantile(0.90, sum(irate(apiserver_request_duration_seconds_bucket{cluster="$clusterId",verb!~"GET|WATCH|LIST|CONNECT",resource!="",subresource!~"log|proxy"}[1m])) by ( verb, instance,resource, le))
APIServer read requests with non-2XX responses: sum by (instance, resource)(increase(apiserver_request_total{cluster="$clusterId",verb="GET", resource!="", code!~"2.."}[1m]) )
APIServer write requests with non-2XX responses: sum by (instance, resource)(increase(apiserver_request_total{cluster="$clusterId",verb=~"^(POST|PUT|PATCH|DELETE)$", resource!="", code!~"2.."}[1m]) )
APIServer in-flight read requests (per instance): sum(apiserver_current_inflight_requests{cluster="$clusterId",request_kind="readOnly"})by(instance)
APIServer in-flight write requests (per instance): sum(apiserver_current_inflight_requests{cluster="$clusterId",request_kind="mutating"})by(instance)
Admit admission controller latency [P90]: histogram_quantile(0.9, sum by(operation, le, type, rejected) (irate(apiserver_admission_controller_admission_duration_seconds_bucket{cluster="$clusterId",type="admit"}[1m])))
Validate admission controller latency [P90]: histogram_quantile(0.9, sum by(operation, le, type, rejected) (irate(apiserver_admission_controller_admission_duration_seconds_bucket{cluster="$clusterId",type="validate"}[5m])))
Admit admission webhook latency [P90]: histogram_quantile(0.90, sum by(operation, type,le, rejected) (rate(apiserver_admission_webhook_admission_duration_seconds_bucket{cluster="$clusterId",type="admit"}[5m])))
Validate admission webhook latency [P90]: histogram_quantile(0.90, sum by(operation,le, type, rejected) (rate(apiserver_admission_webhook_admission_duration_seconds_bucket{cluster="$clusterId",type="validating"}[5m])))
UserAgent request QPS Top 10: topk(10,sum(rate(apiserver_request_useragent_total{cluster="$ClusterId"}[5m])) by(useragent,verb,resource))
UserAgent LIST request QPS Top 10: topk(10,sum(rate(apiserver_request_useragent_total{cluster="$ClusterId",verb="LIST"}[5m])) by(useragent,resource))
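
Several latency panels above use histogram_quantile over cumulative le buckets. As a rough illustration of what a P90 computed this way means, the following Python sketch interpolates linearly inside the bucket that contains the target rank. It is a simplified approximation, not Prometheus's implementation, and the bucket bounds and counts are made-up values.

```python
import math

def histogram_quantile(q, buckets):
    """Approximate a quantile from cumulative histogram buckets.

    buckets: (upper_bound, cumulative_count) pairs sorted by upper_bound,
    ending with the +Inf bucket (math.inf).
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Rank falls in the +Inf bucket: report the highest finite bound.
                return prev_bound
            # Linear interpolation inside the matching bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan

# Hypothetical bucket rates for one (verb, instance, resource) series.
sample = [(0.1, 50.0), (0.5, 80.0), (1.0, 95.0), (math.inf, 100.0)]
p90 = histogram_quantile(0.9, sample)
p99 = histogram_quantile(0.99, sample)
```

With these sample values the P90 lands between the 0.5 s and 1.0 s bounds, while the P99 falls into the +Inf bucket and is capped at the highest finite bound. This is one reason the reported latency can never exceed the largest configured bucket.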

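Note

To use the PromQL statements above in the Explore feature or the alert center of Managed Prometheus, for example to view specific metrics or configure alerts, modify or delete the variables for clusters, nodes, and Pods in the statements. For example, change the $clusterId variable in cluster="$clusterId" (written as $ClusterId in some panels) to a specific cluster ID, or delete that parameter.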
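
The substitution described in the note above can be sketched in Python as follows. The helper name fill_cluster and the cluster ID cc-example-id are hypothetical placeholders, and the helper only handles the cluster="$clusterId" / cluster="$ClusterId" matcher that appears in the dashboard queries.

```python
import re

def fill_cluster(promql, cluster_id=None):
    """Replace the dashboard cluster variable, or drop the matcher entirely."""
    if cluster_id is not None:
        # Covers both spellings used by the dashboards: $clusterId and $ClusterId.
        return re.sub(r"\$[cC]lusterId", lambda _m: cluster_id, promql)
    # No ID given: remove the cluster matcher together with its trailing comma.
    without = re.sub(r'cluster="\$[cC]lusterId",?', "", promql)
    # Tidy up a dangling comma or an empty selector left behind.
    return without.replace(",}", "}").replace("{}", "")

qps = 'sum(irate(apiserver_request_total{cluster="$clusterId"}[1m]))'
pinned = fill_cluster(qps, "cc-example-id")   # hypothetical cluster ID
unscoped = fill_cluster(qps)                  # matcher deleted
```

Pinning the ID keeps the query scoped to one cluster. Deleting the matcher makes the query aggregate over every cluster visible to the workspace, which may or may not be what an alert rule should do.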

vke-etcd-dashboard

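vke-etcd-dashboard is the monitoring dashboard for the ETCD component. It shows monitoring information for this control plane component, including has leader, Backend commit average latency [P90], Proposal failed, and Proposal pending.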

The panels of the ETCD monitoring dashboard and their PromQL statements are listed below.

Panel category: Etcd

Panel name: PromQL statement
has leader: etcd_server_has_leader{cluster="$clusterId"}
Backend commit average latency [P90]: histogram_quantile(0.9, avg(rate(etcd_disk_backend_commit_duration_seconds_bucket{cluster="$clusterId"}[5m]))by (instance, le))
Proposal failed: sum(etcd_server_proposals_failed_total{cluster="$clusterId"})
Proposal pending: sum(etcd_server_proposals_pending{cluster="$clusterId"})
Backend commit success rate: sum(etcd_server_proposals_applied_total{cluster="$clusterId"})by(instance)/sum(etcd_server_proposals_committed_total{cluster="$clusterId"})by(instance)*100
Backend commit latency [P90]: histogram_quantile(0.9, sum(rate(etcd_disk_backend_commit_duration_seconds_bucket{cluster="$clusterId"}[5m]))by (instance, le))
Usage trend: etcd_mvcc_db_total_size_in_use_in_bytes{cluster="$clusterId"}
WAL fsync duration: histogram_quantile(0.90, rate(etcd_disk_wal_fsync_duration_seconds_bucket{cluster="$clusterId"}[5m]))
Proposal failed count trend: etcd_server_proposals_failed_total{cluster="$clusterId"}
Proposal pending count trend: etcd_server_proposals_pending{cluster="$clusterId"}
Proposal commit - apply count trend: abs(etcd_server_proposals_committed_total{cluster="$clusterId"}-etcd_server_proposals_applied_total{cluster="$clusterId"})
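
To make the arithmetic behind the success rate and commit - apply panels concrete, the Python sketch below mirrors the two expressions for a pair of etcd instances. The instance names and counter values are invented for illustration and are not real measurements.

```python
def apply_ratio_percent(applied, committed):
    """Mirror applied / committed * 100 for one etcd instance."""
    return applied / committed * 100

def commit_apply_gap(committed, applied):
    """Mirror abs(committed - applied): proposals committed but not yet applied."""
    return abs(committed - applied)

# Hypothetical counter values per etcd instance.
samples = {
    "etcd-0": {"committed": 120_000, "applied": 120_000},
    "etcd-1": {"committed": 120_000, "applied": 119_940},
}
report = {
    name: (
        apply_ratio_percent(v["applied"], v["committed"]),
        commit_apply_gap(v["committed"], v["applied"]),
    )
    for name, v in samples.items()
}
```

In this made-up data, etcd-1 trails by 60 proposals, so its ratio drops slightly below 100. A gap that keeps growing would suggest that the member is applying proposals more slowly than they are committed.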

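Note

To use the PromQL statements above in the Explore feature or the alert center of Managed Prometheus, for example to view specific metrics or configure alerts, modify or delete the variables for clusters, nodes, and Pods in the statements. For example, change the $clusterId variable in cluster="$clusterId" to a specific cluster ID, or delete that parameter.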

vke-scheduler-dashboard

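vke-scheduler-dashboard is the monitoring dashboard for the kube-scheduler component. It shows monitoring information for this control plane component, including live scheduler instances, the number of Pods in the Pending phase, and the P90 latency of requests to APIServer.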

The panels of the kube-scheduler monitoring dashboard and their PromQL statements are listed below.

Panel category: Scheduler

Panel name: PromQL statement
Live scheduler instances: count(sum(scheduler_pod_scheduling_attempts_sum{cluster="$clusterId"})by(instance))
Pods in Pending phase: sum(scheduler_pending_pods{cluster="$clusterId",instance=~"$instances"})
P90 latency of requests to APIServer: sum(histogram_quantile(0.9, sum(rate(rest_client_request_duration_seconds_bucket{cluster="$clusterId"}[5m])) by (verb,url,le)))
Preemption attempts: sum(scheduler_preemption_attempts_total{cluster="$clusterId",instance=~"$instances"})
Number of resources in cache: sum(scheduler_scheduler_cache_size{cluster="$clusterId",type="assumed_pods",instance=~"$instances"})by(type)
Pending Pods count: sum(scheduler_pending_pods{cluster="$clusterId",instance=~"$instances"})by(queue)
Pod scheduling count by stage: sum(increase(scheduler_scheduling_algorithm_duration_seconds_count{cluster="$clusterId",instance=~"$instances"}[1m]))
Pod scheduling duration by stage: histogram_quantile(0.99, sum(rate(scheduler_e2e_scheduling_duration_seconds_bucket{cluster="$clusterId",instance=~"$instances"}[5m])) by (le))
QPS of requests to APIServer: sum(rate(rest_client_requests_total{cluster="$clusterId"}[1m])) by (method,code)
Latency trend of requests to APIServer [P90]: histogram_quantile(0.9, sum(rate(rest_client_request_duration_seconds_bucket{cluster="$clusterId"}[1m])) by (verb,url,le))
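
The live-instances panel nests two aggregations: sum ... by(instance) collapses every series to one per instance, and count then counts those series. A minimal Python sketch of that pattern, using made-up samples and a hypothetical extra label named shard, could look like this:

```python
from collections import defaultdict

def count_instances(samples):
    """Mirror count(sum(metric) by (instance)): number of distinct instances."""
    per_instance = defaultdict(float)
    for labels, value in samples:
        per_instance[labels["instance"]] += value  # sum(...) by (instance)
    return len(per_instance)                       # count(...)

# Hypothetical series: two scheduler instances, several series each.
series = [
    ({"instance": "10.0.0.1:10259", "shard": "a"}, 42.0),
    ({"instance": "10.0.0.1:10259", "shard": "b"}, 3.0),
    ({"instance": "10.0.0.2:10259", "shard": "a"}, 0.0),
]
alive = count_instances(series)
```

An instance is counted as long as it reports the metric at all, even with a value of 0, so the result reflects which instances are being scraped rather than how busy they are.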

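Note

To use the PromQL statements above in the Explore feature or the alert center of Managed Prometheus, for example to view specific metrics or configure alerts, modify or delete the variables for clusters, nodes, and Pods in the statements. For example, change the $clusterId variable in cluster="$clusterId" to a specific cluster ID, or delete that parameter. Handle the $instances variable in instance=~"$instances" the same way.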

vke-ca-dashboard

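vke-ca-dashboard is the monitoring dashboard for the cluster-autoscaler component. It shows monitoring information for this control plane component, including cluster scaling status, whether scale-down is in cooldown, the last scale-up check time, and autoscaling duration [P99].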

The panels of the cluster-autoscaler monitoring dashboard and their PromQL statements are listed below. Panels that use two queries list both on the lines after the panel name.

Panel category: ClusterAutoscaler

Panel name: PromQL statement
Cluster scaling status: max(cluster_autoscaler_cluster_safe_to_autoscale{cluster="$ClusterId"}) or vector(0)
In scale-down cooldown: sum(cluster_autoscaler_scale_down_in_cooldown{cluster="$ClusterId"}) or vector(0)
Last scale-up check time: 1000*max(cluster_autoscaler_last_activity{cluster="$ClusterId",activity="scaleUp"})
Last scale-down check time: 1000*max(cluster_autoscaler_last_activity{cluster="$ClusterId",activity="scaleDown"})
Unschedulable Pod:
sum(cluster_autoscaler_unschedulable_pods_count{cluster="$ClusterId"})by(cluster)
count(kube_pod_status_phase{cluster="$ClusterId",phase="Running"})
Autoscaling duration [P99]: histogram_quantile(0.99, sum(rate(cluster_autoscaler_function_duration_seconds_bucket{cluster="$ClusterId",function="main"}[1m])) by (le))
Nodes added by CA:
sum(increase(cluster_autoscaler_scaled_up_nodes_total{cluster="$ClusterId"}[15s]))
sum(increase(cluster_autoscaler_scaled_up_gpu_nodes_total{cluster="$ClusterId"}[15s]))
Nodes removed by CA:
sum(increase(cluster_autoscaler_scaled_down_nodes_total{cluster="$ClusterId"}[15s])) by (reason)
sum(increase(cluster_autoscaler_scaled_down_gpu_nodes_total{cluster="$ClusterId"}[15s]))by(reason)
Pods evicted by CA: sum(increase(cluster_autoscaler_evicted_pods_total{cluster="$ClusterId"}[15s]))
CA failed scale-ups: sum(increase(cluster_autoscaler_failed_scale_ups_total{cluster="$ClusterId"}[15s]))by(reason)
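
The two "last check time" panels multiply the metric by 1000. Assuming cluster_autoscaler_last_activity reports a Unix timestamp in seconds and the panel expects milliseconds (an assumption about the dashboard configuration, not something stated in this document), the conversion can be sketched in Python as below. The sample value is made up.

```python
from datetime import datetime, timezone

def seconds_to_millis(unix_seconds):
    """Mirror the 1000* factor: seconds since the epoch to milliseconds."""
    return unix_seconds * 1000

# Hypothetical metric value (Unix seconds), not a real measurement.
last_scale_up = 1_700_000_000
millis = seconds_to_millis(last_scale_up)
as_utc = datetime.fromtimestamp(last_scale_up, tz=timezone.utc).isoformat()
```

When the same expression is used in Explore or an alert rule, the 1000* factor is usually unnecessary. Comparing the raw value with time() yields the seconds elapsed since the last check.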

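Note

To use the PromQL statements above in the Explore feature or the alert center of Managed Prometheus, for example to view specific metrics or configure alerts, modify or delete the variables for clusters, nodes, and Pods in the statements. For example, change the $ClusterId variable in cluster="$ClusterId" to a specific cluster ID, or delete that parameter.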