
GPU Monitoring

Last updated: 2023.12.14 15:02:39

First published: 2023.11.17 15:31:30

The Managed Prometheus console comes with preset dashboards for common VKE cluster monitoring scenarios. This topic describes the cluster GPU monitoring dashboards.

vke-pod-gpu-dashboard

vke-pod-gpu-dashboard is the pod GPU monitoring dashboard. It shows GPU monitoring information for pods, including GPU utilization, GPU memory utilization, and GPU memory usage.


The metrics on the pod GPU monitoring dashboard are listed in the following table.

| Category | Panel | Unit | PromQL statement |
| --- | --- | --- | --- |
| Pod GPU monitoring | GPU utilization | % | `DCGM_FI_DEV_GPU_UTIL{cluster="$clusterId",namespace="$namespace",pod="$pod",job=~"dcgm\|dcgm-vci"}` |
| | GPU memory utilization | % | `(DCGM_FI_DEV_FB_USED{cluster="$clusterId",namespace="$namespace",pod="$pod",job=~"dcgm\|dcgm-vci"}/(DCGM_FI_DEV_FB_USED{cluster="$clusterId",namespace="$namespace",pod="$pod",job=~"dcgm\|dcgm-vci"} + DCGM_FI_DEV_FB_FREE{cluster="$clusterId",namespace="$namespace",pod="$pod",job=~"dcgm\|dcgm-vci"}))*100` |
| | GPU memory used | Byte | `DCGM_FI_DEV_FB_USED{cluster="$clusterId",namespace="$namespace",pod="$pod",job=~"dcgm\|dcgm-vci"}` |
| | GPU memory free | Byte | `DCGM_FI_DEV_FB_FREE{cluster="$clusterId",namespace="$namespace",pod="$pod",job=~"dcgm\|dcgm-vci"}` |
| | Shared GPU memory used | Byte | `nvml_container_mem_usage{cluster="$clusterId",namespace="$namespace",pod="$pod"}` |
| | Shared GPU memory free | Byte | `nvml_container_mem_request{cluster="$clusterId",namespace="$namespace",pod="$pod"}-nvml_container_mem_usage{cluster="$clusterId",namespace="$namespace",pod="$pod"}` |
| | Shared GPU memory utilization | % | `nvml_container_mem_utilization{cluster="$clusterId",namespace="$namespace",pod="$pod"}` |

Note

If you want to use the PromQL statements above in the Explore feature of Managed Prometheus or in the alert center to query a specific metric or configure an alert, modify or remove the cluster, node, and pod variables in the statements. For example, change the `$clusterId` variable in the `cluster="$clusterId"` matcher to a specific cluster ID, or remove that matcher entirely.
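
As an illustration, a minimal sketch of the pod GPU utilization query with the variables replaced by concrete values might look like the following; the cluster ID, namespace, and pod name are placeholders chosen for this example, not values from this document:

```promql
# Pod GPU utilization with the template variables replaced by concrete values.
# "cc2d8xxxxxxxxxxxxxxxx", "default", and "gpu-demo-0" are placeholder values; substitute your own.
DCGM_FI_DEV_GPU_UTIL{cluster="cc2d8xxxxxxxxxxxxxxxx",namespace="default",pod="gpu-demo-0",job=~"dcgm|dcgm-vci"}
```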

vke-instance-gpu-dashboard

vke-instance-gpu-dashboard is the GPU instance monitoring dashboard. It shows GPU monitoring information for individual GPU cards, including GPU utilization, GPU memory used, GPU temperature, GPU power usage, GPU decoder utilization, and GPU encoder utilization. You can use this dashboard to view the monitoring information of a specific GPU card.


The metrics on the GPU instance monitoring dashboard are listed in the following table.

| Category | Panel | Unit | PromQL statement |
| --- | --- | --- | --- |
| GPU instance monitoring | GPU utilization | % | `DCGM_FI_DEV_GPU_UTIL{cluster="$clusterId",nodename=~"$node",gpu_id=~"$instance",job=~"dcgm\|dcgm-vci"}` |
| | GPU memory used | Byte | `DCGM_FI_DEV_FB_USED{cluster="$clusterId",nodename=~"$node",gpu_id=~"$instance",job=~"dcgm\|dcgm-vci"}` |
| | Pod GPU memory utilization | % | `100*DCGM_FI_DEV_FB_USED{cluster="$clusterId",nodename=~"$node",gpu_id=~"$instance",job=~"dcgm\|dcgm-vci"}/(DCGM_FI_DEV_FB_USED{cluster="$clusterId",nodename=~"$node",gpu_id=~"$instance",job=~"dcgm\|dcgm-vci"}+DCGM_FI_DEV_FB_FREE{cluster="$clusterId",nodename=~"$node",gpu_id=~"$instance",job=~"dcgm\|dcgm-vci"})` |
| | GPU temperature | °C | `DCGM_FI_DEV_GPU_TEMP{cluster="$clusterId",nodename=~"$node",gpu_id=~"$instance",job=~"dcgm\|dcgm-vci"}` |
| | GPU power usage | W | `DCGM_FI_DEV_POWER_USAGE{cluster="$clusterId",nodename=~"$node",gpu_id=~"$instance",job=~"dcgm\|dcgm-vci"}` |
| | GPU clock frequency | Hz | `DCGM_FI_DEV_SM_CLOCK{cluster="$clusterId",nodename=~"$node",gpu_id=~"$instance",job=~"dcgm\|dcgm-vci"}` |
| | GPU decoder utilization | % | `DCGM_FI_DEV_DEC_UTIL{cluster="$clusterId",nodename=~"$node",gpu_id=~"$instance",job=~"dcgm\|dcgm-vci"}` |
| | GPU encoder utilization | % | `DCGM_FI_DEV_ENC_UTIL{cluster="$clusterId",nodename=~"$node",gpu_id=~"$instance",job=~"dcgm\|dcgm-vci"}` |
| | GPU Xid (by instance) | Count | `sum(DCGM_CUSTOM_XID_ERRORS_TOTAL_COUNTER{cluster="$clusterId",nodename=~"$node",gpu_id=~"$instance",job=~"dcgm\|dcgm-vci"})by(gpu_id)` |
| | GPU Xid (by xid) | Count | `sum(DCGM_CUSTOM_XID_ERRORS_TOTAL_COUNTER{cluster="$clusterId",nodename=~"$node",gpu_id=~"$instance",job=~"dcgm\|dcgm-vci"})by(xid)` |

Note

If you want to use the PromQL statements above in the Explore feature of Managed Prometheus or in the alert center to query a specific metric or configure an alert, modify or remove the cluster, node, and pod variables in the statements. For example, change the `$clusterId` variable in the `cluster="$clusterId"` matcher to a specific cluster ID, or remove that matcher entirely.
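
As a hedged sketch of how these statements can feed an alert, a threshold expression built from the GPU temperature metric above could look like the following; the cluster ID and the 85 °C threshold are assumptions made for this example only:

```promql
# Returns every GPU card in the placeholder cluster whose temperature exceeds 85 °C;
# an alert rule based on this expression fires while it has results.
DCGM_FI_DEV_GPU_TEMP{cluster="cc2d8xxxxxxxxxxxxxxxx",job=~"dcgm|dcgm-vci"} > 85
```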

vke-cluster-gpu-dashboard

vke-cluster-gpu-dashboard is the cluster GPU monitoring dashboard. It shows cluster-level GPU monitoring information, including the total number of GPUs in the cluster, the number of allocated GPUs, the number of GPUs in use, and the cluster GPU allocation rate. You can use this dashboard to view GPU monitoring information for the entire cluster.

The metrics on the cluster GPU monitoring dashboard are listed in the following table.

| Category | Panel | Unit | PromQL statement |
| --- | --- | --- | --- |
| Cluster GPU monitoring | Total GPUs in cluster | Count | `sum(kube_node_status_capacity{resource="nvidia_com_gpu", cluster="$clusterId",node!~"vci.*"})` |
| | Allocated GPUs | Count | `sum(kube_pod_container_resource_limits{resource="nvidia_com_gpu", cluster="$clusterId",node!~"vci.*"})` |
| | GPUs in use | Count | `count(DCGM_FI_DEV_GPU_UTIL{cluster="$clusterId",job=~"dcgm"}>0)` |
| | Faulty GPUs | Count | `count(DCGM_FI_DEV_XID_ERRORS{cluster="$clusterId",job=~"dcgm"}>0) or on() vector(0)` |
| | GPU nodes | Count | `count(kube_node_status_capacity{resource="nvidia_com_gpu", cluster="$clusterId",node!~"vci.*"}>0)` |
| | Faulty GPU nodes | Count | `count(count(DCGM_FI_DEV_XID_ERRORS{cluster="$clusterId",job=~"dcgm"}>0)by(nodename)) or on() vector(0)` |
| | Nodes with 8 available GPU cards | Count | `count(count(DCGM_FI_DEV_GPU_UTIL{cluster="$clusterId",job=~"dcgm",pod=""})by(nodename) ==8) or on() vector(0)` |
| | Fragmented nodes | Count | `(count(count(DCGM_FI_DEV_GPU_UTIL{cluster="$clusterId",job=~"dcgm",pod=""})by(nodename) ==1 and (count(DCGM_FI_DEV_GPU_UTIL{cluster="$clusterId",job=~"dcgm"})by(nodename) >3)) or on() vector(0)) + (count(count(DCGM_FI_DEV_GPU_UTIL{cluster="$clusterId",job=~"dcgm",pod=""})by(nodename) ==2 and (count(DCGM_FI_DEV_GPU_UTIL{cluster="$clusterId",job=~"dcgm"})by(nodename) >3)) or on() vector(0)) + (count(count(DCGM_FI_DEV_GPU_UTIL{cluster="$clusterId",job=~"dcgm",pod=""})by(nodename) ==1 and (count(DCGM_FI_DEV_GPU_UTIL{cluster="$clusterId",job=~"dcgm"})by(nodename) ==2)) or on() vector(0))` |
| | Cluster GPU allocation rate | % | `sum(kube_pod_container_resource_requests{resource="nvidia_com_gpu",cluster="$clusterId",node!~"vci.*"})/sum(kube_node_status_capacity{resource="nvidia_com_gpu", cluster="$clusterId",node!~"vci.*"})by(cluster)*100` |
| | Cluster GPU utilization | % | `count(DCGM_FI_DEV_GPU_UTIL{cluster="$clusterId",job=~"dcgm"}>0)by(cluster)/sum(kube_pod_container_resource_requests{resource="nvidia_com_gpu",cluster="$clusterId",node!~"vci.*"})*100` |
| | Cluster GPU fault rate | % | `100*count(DCGM_FI_DEV_XID_ERRORS{cluster="$clusterId",job=~"dcgm"}>0)by(cluster)/sum(kube_node_status_capacity{resource="nvidia_com_gpu", cluster="$clusterId",node!~"vci.*"}) or on() vector(0)` |
| | GPU node fault rate | % | `100*count(count(DCGM_FI_DEV_XID_ERRORS{cluster="$clusterId",job=~"dcgm"}>0)by(nodename))/count(kube_node_status_capacity{resource="nvidia_com_gpu", cluster="$clusterId",node!~"vci.*"}>0) or on() vector(0)` |
| | Node idle rate | % | `(count(kube_node_status_capacity{resource="nvidia_com_gpu", cluster="$clusterId",node!~"vci.*"}>0)-count(sum(kube_pod_container_resource_requests{resource="nvidia_com_gpu",cluster="$clusterId",node!~"vci.*"}) by(node) >0))/count(kube_node_status_capacity{resource="nvidia_com_gpu",cluster="$clusterId",node!~"vci.*"}>0) *100` |
| | Node fragmentation rate | % | `((count(count(DCGM_FI_DEV_GPU_UTIL{cluster="$clusterId",job=~"dcgm",pod=""}) by(nodename) ==1 and (count(DCGM_FI_DEV_GPU_UTIL{cluster="$clusterId",job=~"dcgm"})by(nodename) >3)) or on() vector(0)) + (count(count(DCGM_FI_DEV_GPU_UTIL{cluster="$clusterId",job=~"dcgm",pod=""}) by(nodename) ==2 and (count(DCGM_FI_DEV_GPU_UTIL{cluster="$clusterId",job=~"dcgm"})by(nodename) >3)) or on() vector(0)) + (count(count(DCGM_FI_DEV_GPU_UTIL{cluster="$clusterId",job=~"dcgm",pod=""}) by(nodename) ==1 and (count(DCGM_FI_DEV_GPU_UTIL{cluster="$clusterId",job=~"dcgm"})by(nodename) ==2)) or on() vector(0))/count(kube_node_status_capacity{resource="nvidia_com_gpu",cluster="$clusterId",node!~"vci.*"}>0)*100) or on() vector(0)` |

Note

If you want to use the PromQL statements above in the Explore feature of Managed Prometheus or in the alert center to query a specific metric or configure an alert, modify or remove the cluster, node, and pod variables in the statements. For example, change the `$clusterId` variable in the `cluster="$clusterId"` matcher to a specific cluster ID, or remove that matcher entirely.
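
If the Prometheus workspace collects data from only a single cluster, the cluster matcher can also be dropped entirely, as described above. A sketch of the cluster GPU allocation rate query under that assumption:

```promql
# Cluster GPU allocation rate with the cluster variable removed.
# Assumes the workspace scrapes only one cluster; otherwise data from multiple clusters is mixed together.
sum(kube_pod_container_resource_requests{resource="nvidia_com_gpu",node!~"vci.*"})
/ sum(kube_node_status_capacity{resource="nvidia_com_gpu",node!~"vci.*"})
* 100
```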