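The Machine Learning Platform provides several preset dashboards on VMP. The variables and PromQL queries for each dashboard are listed below so that you can reuse them quickly.
Custom task dashboard mlp-customtask-dashboard
Variables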
Variable | Display name | Query | Regex |
---|---|---|---|
ResourceGroup | Resource group | mlp_resourcegroup_info | /.*mlp_resource_group=\"(.*?)\".*/ |
ResourceQueue | Queue | mlp_resourcequeue_info | /.*mlp_resource_queue=\"(.*?)\".*/ |
CustomTask | Custom task | mlp_customtask_info{mlp_resource_queue=~"$ResourceQueue"} | /.*mlp_customtask=\"(.*?)\"/ |
Instance | Instance | mlp_customtask_instance_info{mlp_customtask=~"$CustomTask"} | /.*pod=\"(.*?)\".*/ |
TopK | Number of items for TopK/BottomK filtering | - | - |
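The regex column captures one label value from each series returned by the variable query. The following is a minimal Python sketch of that extraction, assuming the series is rendered as text in the usual `metric{label="value"}` form. The sample series, its label values, and the helper name `extract_label_value` are hypothetical and only serve as illustration.

```python
import re
from typing import Optional

# Python equivalent of the regex /.*mlp_resource_group=\"(.*?)\".*/ from the table.
LABEL_PATTERN = re.compile(r'.*mlp_resource_group="(.*?)".*')

def extract_label_value(series: str) -> Optional[str]:
    """Return the captured label value, or None if the label is absent."""
    match = LABEL_PATTERN.match(series)
    return match.group(1) if match else None

# Hypothetical series text; the label values are made up for illustration.
sample = 'mlp_resourcegroup_info{mlp_resource_group="rg-demo",region="example"} 1'
print(extract_label_value(sample))
```

The lazy group `(.*?)` stops at the first closing quote, so only the value of the named label is captured.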
PromQL queries
Total GPU resources
sum(mlp_customtask_instance_info{mlp_customtask=~"$CustomTask"} * ON(pod) group_right(mlp_customtask) mlp_instance_resource_requests{resource="GPU"})
Total CPU resources (core)
sum(mlp_customtask_instance_info{mlp_customtask=~"$CustomTask"} * ON(pod) group_right(mlp_customtask) mlp_instance_resource_requests{resource="VCPU"})
Total memory resources
sum(mlp_customtask_instance_info{mlp_customtask=~"$CustomTask"} * ON(pod) group_right(mlp_customtask) mlp_instance_resource_requests{resource="MEMORY"})
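The three resource totals above share one pattern: the info series selects the pods of the chosen task, `* ON(pod) group_right(...)` joins them to the per-pod request series, and `sum` adds up the result. Below is a rough Python sketch of that join. It assumes, as is conventional for `_info` metrics, that every info series has the value 1. All pod names and request values are hypothetical.

```python
# Info series: one entry per pod, value 1 (assumed), carrying the task label.
info = {
    "pod-a": {"mlp_customtask": "task-1", "value": 1},
    "pod-b": {"mlp_customtask": "task-1", "value": 1},
}

# Request series: one entry per pod and resource type (hypothetical values).
requests = [
    {"pod": "pod-a", "resource": "GPU", "value": 8.0},
    {"pod": "pod-b", "resource": "GPU", "value": 8.0},
    {"pod": "pod-c", "resource": "GPU", "value": 4.0},  # not part of task-1
    {"pod": "pod-a", "resource": "VCPU", "value": 96.0},
]

def total_requests(info, requests, resource):
    """Sum the requests of pods that have a matching info series."""
    total = 0.0
    for sample in requests:
        if sample["resource"] != resource:
            continue  # corresponds to the {resource="..."} selector
        meta = info.get(sample["pod"])
        if meta is None:
            continue  # no info series for this pod, so the join drops it
        total += meta["value"] * sample["value"]
    return total

print(total_requests(info, requests, "GPU"))
```

Because the info value is 1, the multiplication leaves the request value unchanged and acts only as a filter on `pod`.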
Total instances
count(mlp_customtask_instance_info{mlp_customtask=~"$CustomTask"})
Total running instances
sum(mlp_customtask_instance_info{mlp_customtask=~"$CustomTask"} * ON(pod) mlp_instance_pod_status_phase{phase="Running"})
Total Ready instances
sum(mlp_customtask_instance_info{mlp_customtask=~"$CustomTask"} * ON(pod) mlp_instance_pod_status_ready{ready="True"})
GPU utilization (SM_ACTIVE)
Avg: avg(DCGM_FI_PROF_SM_ACTIVE AND ON(pod) mlp_customtask_instance_info{mlp_customtask=~"$CustomTask"})
Max: max(DCGM_FI_PROF_SM_ACTIVE AND ON(pod) mlp_customtask_instance_info{mlp_customtask=~"$CustomTask"})
Min: min(DCGM_FI_PROF_SM_ACTIVE AND ON(pod) mlp_customtask_instance_info{mlp_customtask=~"$CustomTask"})
GPU memory utilization
Avg: avg((DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_FREE + DCGM_FI_DEV_FB_USED)) AND ON(pod) mlp_customtask_instance_info{mlp_customtask=~"$CustomTask"})
Max: max((DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_FREE + DCGM_FI_DEV_FB_USED)) AND ON(pod) mlp_customtask_instance_info{mlp_customtask=~"$CustomTask"})
Min: min((DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_FREE + DCGM_FI_DEV_FB_USED)) AND ON(pod) mlp_customtask_instance_info{mlp_customtask=~"$CustomTask"})
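The GPU memory utilization queries divide used frame buffer by the sum of free and used frame buffer for each GPU, then aggregate across GPUs. A small Python sketch of the same arithmetic follows. The sample readings are hypothetical, and the units cancel in the ratio.

```python
from statistics import mean

def fb_utilization(used, free):
    """Used frame buffer as a fraction of free + used, per GPU."""
    return used / (free + used)

# Hypothetical (used, free) readings for three GPUs.
samples = [(60000.0, 20000.0), (20000.0, 60000.0), (40000.0, 40000.0)]

ratios = [fb_utilization(used, free) for used, free in samples]
summary = {"avg": mean(ratios), "max": max(ratios), "min": min(ratios)}
print(summary)
```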
NVLink inbound bandwidth
Avg: avg(DCGM_FI_PROF_NVLINK_RX_BYTES AND ON(pod) mlp_customtask_instance_info{mlp_customtask=~"$CustomTask"})
Max: max(DCGM_FI_PROF_NVLINK_RX_BYTES AND ON(pod) mlp_customtask_instance_info{mlp_customtask=~"$CustomTask"})
Min: min(DCGM_FI_PROF_NVLINK_RX_BYTES AND ON(pod) mlp_customtask_instance_info{mlp_customtask=~"$CustomTask"})
NVLink outbound bandwidth
Avg: avg(DCGM_FI_PROF_NVLINK_TX_BYTES AND ON(pod) mlp_customtask_instance_info{mlp_customtask=~"$CustomTask"})
Max: max(DCGM_FI_PROF_NVLINK_TX_BYTES AND ON(pod) mlp_customtask_instance_info{mlp_customtask=~"$CustomTask"})
Min: min(DCGM_FI_PROF_NVLINK_TX_BYTES AND ON(pod) mlp_customtask_instance_info{mlp_customtask=~"$CustomTask"})
GPU PCIe inbound bandwidth
Avg: avg(DCGM_FI_PROF_PCIE_RX_BYTES AND ON(pod) mlp_customtask_instance_info{mlp_customtask=~"$CustomTask"})
Max: max(DCGM_FI_PROF_PCIE_RX_BYTES AND ON(pod) mlp_customtask_instance_info{mlp_customtask=~"$CustomTask"})
Min: min(DCGM_FI_PROF_PCIE_RX_BYTES AND ON(pod) mlp_customtask_instance_info{mlp_customtask=~"$CustomTask"})
GPU PCIe outbound bandwidth
Avg: avg(DCGM_FI_PROF_PCIE_TX_BYTES AND ON(pod) mlp_customtask_instance_info{mlp_customtask=~"$CustomTask"})
Max: max(DCGM_FI_PROF_PCIE_TX_BYTES AND ON(pod) mlp_customtask_instance_info{mlp_customtask=~"$CustomTask"})
Min: min(DCGM_FI_PROF_PCIE_TX_BYTES AND ON(pod) mlp_customtask_instance_info{mlp_customtask=~"$CustomTask"})
Number of GPUs with XID errors
count(DCGM_FI_DEV_XID_ERRORS AND ON(pod) mlp_customtask_instance_info{mlp_customtask=~"$CustomTask"}) - count((DCGM_FI_DEV_XID_ERRORS == 0) AND ON(pod) mlp_customtask_instance_info{mlp_customtask=~"$CustomTask"})
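The XID query counts all GPUs of the task and subtracts the GPUs whose XID value is 0, which leaves the GPUs reporting a non-zero XID. A minimal Python sketch of that subtraction is shown here. The (pod, gpu) keys and XID values are hypothetical.

```python
def gpus_with_xid_errors(xid_by_gpu):
    """Number of GPUs whose reported XID value is non-zero."""
    total = len(xid_by_gpu)  # count(DCGM_FI_DEV_XID_ERRORS ...)
    healthy = sum(1 for value in xid_by_gpu.values() if value == 0)  # count(... == 0)
    return total - healthy

# Hypothetical readings keyed by (pod, gpu index).
xid = {("pod-a", "0"): 0, ("pod-a", "1"): 79, ("pod-b", "0"): 0}
print(gpus_with_xid_errors(xid))
```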
GPU memory retired page count
SBE: sum(DCGM_FI_DEV_RETIRED_SBE AND ON(pod) mlp_customtask_instance_info{mlp_customtask=~"$CustomTask"})
DBE: sum(DCGM_FI_DEV_RETIRED_DBE AND ON(pod) mlp_customtask_instance_info{mlp_customtask=~"$CustomTask"})
CPU utilization
Avg: avg((rate(container_cpu_usage_seconds_total[30s]) / ON(pod) container_spec_cpu_shares * 1000) AND ON(pod) mlp_customtask_instance_info{mlp_customtask=~"$CustomTask"})
Max: max((rate(container_cpu_usage_seconds_total[30s]) / ON(pod) container_spec_cpu_shares * 1000) AND ON(pod) mlp_customtask_instance_info{mlp_customtask=~"$CustomTask"})
Min: min((rate(container_cpu_usage_seconds_total[30s]) / ON(pod) container_spec_cpu_shares * 1000) AND ON(pod) mlp_customtask_instance_info{mlp_customtask=~"$CustomTask"})
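The CPU utilization queries divide the per-second CPU usage rate by the container CPU shares and multiply by 1000. The sketch below reproduces that arithmetic in Python under two assumptions: the rate is simplified to the slope between the first and last sample (Prometheus `rate()` also handles counter resets and extrapolation), and the shares follow the common Kubernetes convention of 1024 shares per requested core. Under that convention the result approximates the fraction of requested CPU in use. The sample values are hypothetical.

```python
def simple_rate(samples):
    """Per-second increase between the first and last (timestamp, value) sample."""
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    return (v1 - v0) / (t1 - t0)

def cpu_utilization(cpu_seconds_samples, cpu_shares):
    # Evaluated left to right, like: rate(...) / container_spec_cpu_shares * 1000
    return simple_rate(cpu_seconds_samples) / cpu_shares * 1000

# Hypothetical counter samples: 60 CPU-seconds consumed over 30 seconds (2 cores busy).
samples = [(0.0, 100.0), (30.0, 160.0)]
cpu_shares = 4096  # assumed to correspond to 4 requested cores

print(cpu_utilization(samples, cpu_shares))
```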
Memory utilization
Avg: avg((container_memory_working_set_bytes / container_spec_memory_limit_bytes) AND ON(pod) mlp_customtask_instance_info{mlp_customtask=~"$CustomTask"})
Max: max((container_memory_working_set_bytes / container_spec_memory_limit_bytes) AND ON(pod) mlp_customtask_instance_info{mlp_customtask=~"$CustomTask"})
Min: min((container_memory_working_set_bytes / container_spec_memory_limit_bytes) AND ON(pod) mlp_customtask_instance_info{mlp_customtask=~"$CustomTask"})
TopK instances by average GPU utilization (SM_ACTIVE)
{{pod}} : topk($TopK, avg(DCGM_FI_PROF_SM_ACTIVE AND ON(pod) mlp_customtask_instance_info{mlp_customtask=~"$CustomTask"}) by (pod))
TopK instances by average GPU memory utilization
{{pod}} : topk($TopK, avg((DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE)) AND ON(pod) mlp_customtask_instance_info{mlp_customtask=~"$CustomTask"}) by (pod))
BottomK instances by average GPU utilization (SM_ACTIVE)
{{pod}} : bottomk($TopK, avg(DCGM_FI_PROF_SM_ACTIVE AND ON(pod) mlp_customtask_instance_info{mlp_customtask=~"$CustomTask"}) by (pod))
BottomK instances by average GPU memory utilization
{{pod}} : bottomk($TopK, avg((DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE)) AND ON(pod) mlp_customtask_instance_info{mlp_customtask=~"$CustomTask"}) by (pod))
TopK instances by average NVLink inbound bandwidth
{{pod}} : topk($TopK, avg(DCGM_FI_PROF_NVLINK_RX_BYTES AND ON(pod) mlp_customtask_instance_info{mlp_customtask=~"$CustomTask"} ) by (pod))
TopK instances by average NVLink outbound bandwidth
{{pod}} : topk($TopK, avg(DCGM_FI_PROF_NVLINK_TX_BYTES AND ON(pod) mlp_customtask_instance_info{mlp_customtask=~"$CustomTask"} ) by (pod))
TopK instances by average PCIe inbound bandwidth
{{pod}} : topk($TopK, avg(DCGM_FI_PROF_PCIE_RX_BYTES AND ON(pod) mlp_customtask_instance_info{mlp_customtask=~"$CustomTask"} ) by (pod))
TopK instances by average PCIe outbound bandwidth
{{pod}} : topk($TopK, avg(DCGM_FI_PROF_PCIE_TX_BYTES AND ON(pod) mlp_customtask_instance_info{mlp_customtask=~"$CustomTask"} ) by (pod))
BottomK instances by average NVLink inbound bandwidth
{{pod}} : bottomk($TopK, avg(DCGM_FI_PROF_NVLINK_RX_BYTES AND ON(pod) mlp_customtask_instance_info{mlp_customtask=~"$CustomTask"} ) by (pod))
BottomK instances by average NVLink outbound bandwidth
{{pod}} : bottomk($TopK, avg(DCGM_FI_PROF_NVLINK_TX_BYTES AND ON(pod) mlp_customtask_instance_info{mlp_customtask=~"$CustomTask"} ) by (pod))
BottomK instances by average PCIe inbound bandwidth
{{pod}} : bottomk($TopK, avg(DCGM_FI_PROF_PCIE_RX_BYTES AND ON(pod) mlp_customtask_instance_info{mlp_customtask=~"$CustomTask"} ) by (pod))
BottomK instances by average PCIe outbound bandwidth
{{pod}} : bottomk($TopK, avg(DCGM_FI_PROF_PCIE_TX_BYTES AND ON(pod) mlp_customtask_instance_info{mlp_customtask=~"$CustomTask"} ) by (pod))
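The TopK and BottomK panels above first average the per-GPU samples within each pod (`avg(...) by (pod)`) and then keep the k largest or smallest averages. This can be sketched in Python as follows. The helper names and sample values are hypothetical.

```python
import heapq
from collections import defaultdict
from statistics import mean

def avg_by_pod(samples):
    """Group (pod, value) samples by pod and average each group."""
    grouped = defaultdict(list)
    for pod, value in samples:
        grouped[pod].append(value)
    return {pod: mean(values) for pod, values in grouped.items()}

def topk(k, per_pod):
    """The k pods with the largest averages, largest first."""
    return heapq.nlargest(k, per_pod.items(), key=lambda item: item[1])

def bottomk(k, per_pod):
    """The k pods with the smallest averages, smallest first."""
    return heapq.nsmallest(k, per_pod.items(), key=lambda item: item[1])

# Hypothetical per-GPU utilization samples as (pod, value) pairs.
samples = [("pod-a", 1.0), ("pod-a", 0.5), ("pod-b", 0.25),
           ("pod-b", 0.25), ("pod-c", 0.5)]
per_pod = avg_by_pod(samples)
print(topk(2, per_pod))
print(bottomk(2, per_pod))
```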
Instance GPU utilization (SM_ACTIVE)
Avg: avg(DCGM_FI_PROF_SM_ACTIVE{pod=~"$Instance"})
GPU{{gpu}}: DCGM_FI_PROF_SM_ACTIVE{pod=~"$Instance"}
Instance GPU memory utilization
Avg: avg(DCGM_FI_DEV_FB_USED{pod=~"$Instance"} / (DCGM_FI_DEV_FB_FREE + DCGM_FI_DEV_FB_USED))
GPU{{gpu}}: DCGM_FI_DEV_FB_USED{pod=~"$Instance"} / (DCGM_FI_DEV_FB_FREE + DCGM_FI_DEV_FB_USED)
Per-GPU NVLink inbound bandwidth within an instance
Avg: avg(DCGM_FI_PROF_NVLINK_RX_BYTES{pod=~"$Instance"})
GPU{{gpu}}: DCGM_FI_PROF_NVLINK_RX_BYTES{pod=~"$Instance"}
Per-GPU NVLink outbound bandwidth within an instance
Avg: avg(DCGM_FI_PROF_NVLINK_TX_BYTES{pod=~"$Instance"})
GPU{{gpu}}: DCGM_FI_PROF_NVLINK_TX_BYTES{pod=~"$Instance"}
Per-GPU PCIe inbound bandwidth within an instance
Avg: avg(DCGM_FI_PROF_PCIE_RX_BYTES{pod=~"$Instance"})
GPU{{gpu}}: DCGM_FI_PROF_PCIE_RX_BYTES{pod=~"$Instance"}
Per-GPU PCIe outbound bandwidth within an instance
Avg: avg(DCGM_FI_PROF_PCIE_TX_BYTES{pod=~"$Instance"})
GPU{{gpu}}: DCGM_FI_PROF_PCIE_TX_BYTES{pod=~"$Instance"}
Instance XID error details
GPU{{gpu}}: DCGM_FI_DEV_XID_ERRORS{pod=~"$Instance"}
Per-GPU retired page count for instance GPU memory
GPU{{gpu}}: DBE : DCGM_FI_DEV_RETIRED_DBE{pod=~"$Instance"}
GPU{{gpu}}: SBE : DCGM_FI_DEV_RETIRED_SBE{pod=~"$Instance"}
Instance CPU utilization
(rate(container_cpu_usage_seconds_total{pod=~"$Instance"}[30s]) / ON(pod) container_spec_cpu_shares{pod=~"$Instance"} * 1000)
Instance memory saturation
container_memory_working_set_bytes{pod=~"$Instance"} / container_spec_memory_limit_bytes
Queue resource dashboard mlp-gpu-resource-dashboard
Variables
Variable | Display name | Query | Regex |
---|---|---|---|
ResourceGroup | Resource group | mlp_resourcegroup_info | /.*mlp_resource_group=\"(.*?)\".*/ |
ResourceQueue | Queue | mlp_resourcequeue_info{mlp_resource_group=~"$ResourceGroup"} | /.*mlp_resource_queue=\"(.*?)\".*/ |
Topk | Top/Bottom k value | - | - |
PromQL queries
Total GPU resources
sum(mlp_resourcequeue_quota_capacity{mlp_resource_queue=~"$ResourceQueue", resource="GPU"})
GPU resources requested
sum(mlp_resourcequeue_quota_allocated{mlp_resource_queue=~"$ResourceQueue", resource="GPU"})
Number of custom tasks
sum(mlp_customtask_info{mlp_resource_queue=~"$ResourceQueue"}) or vector(0)
Number of custom task instances
sum(mlp_customtask_instance_info{mlp_resource_queue=~"$ResourceQueue"}) or vector(0)
Number of online services
sum(mlp_deployment_info{mlp_resource_queue=~"$ResourceQueue"}) or vector(0)
Number of online service instances
sum(mlp_deployment_instance_info{mlp_resource_queue=~"$ResourceQueue"}) or vector(0)
Number of development machines
sum(mlp_devinstance_info{mlp_resource_queue=~"$ResourceQueue"}) or vector(0)
Number of development machine instances
sum(mlp_devinstance_info{mlp_resource_queue=~"$ResourceQueue"}) or vector(0)
Queue GPU request rate
Overall request rate
sum(mlp_resourcequeue_quota_allocated{resource="GPU", mlp_resource_queue=~"$ResourceQueue"}) by (mlp_resource_queue) / sum(mlp_resourcequeue_quota_capacity{resource="GPU"}) by (mlp_resource_queue)
Custom tasks
sum(mlp_customtask_instance_info{mlp_resource_queue=~"$ResourceQueue"} * ON(pod) group_right(mlp_customtask) mlp_instance_resource_requests{resource="GPU"}) / sum(mlp_resourcequeue_quota_capacity{resource="GPU", mlp_resource_queue=~"$ResourceQueue"})
Development machines
sum(mlp_devinstance_info{mlp_resource_queue=~"$ResourceQueue"} * ON(pod) group_right(mlp_devinstance) mlp_instance_resource_requests{resource="GPU"}) / sum(mlp_resourcequeue_quota_capacity{resource="GPU", mlp_resource_queue=~"$ResourceQueue"})
Online services
sum(mlp_deployment_instance_info{mlp_resource_queue=~"$ResourceQueue"} * ON(pod) group_right(mlp_deployment) mlp_instance_resource_requests{resource="GPU"}) / sum(mlp_resourcequeue_quota_capacity{resource="GPU", mlp_resource_queue=~"$ResourceQueue"})
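The per-workload request rates above divide the GPUs requested by each workload type by the total GPU capacity of the selected queues. A short Python sketch of that ratio follows. The capacity and request values are hypothetical.

```python
def request_rate_by_type(requested, capacity):
    """Requested GPUs per workload type as a fraction of queue capacity."""
    return {kind: sum(values) / capacity for kind, values in requested.items()}

# Hypothetical queue capacity and per-instance GPU requests by workload type.
capacity_gpu = 64.0
requested = {
    "custom_task": [8.0, 8.0, 16.0],
    "dev_machine": [8.0],
    "online_service": [8.0, 8.0],
}

rates = request_rate_by_type(requested, capacity_gpu)
print(rates)
```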
Average GPU utilization (SM_ACTIVE) by workload type
Average
avg ( mlp_instance_info{mlp_resource_queue=~"$ResourceQueue"} * ON(pod) group_right(mlp_resource_queue) DCGM_FI_PROF_SM_ACTIVE ) by (mlp_resource_queue)
Custom tasks
avg ( mlp_customtask_instance_info{mlp_resource_queue=~"$ResourceQueue"} * ON(pod) group_right(mlp_resource_queue) DCGM_FI_PROF_SM_ACTIVE ) by (mlp_resource_queue)
Online services
avg ( mlp_deployment_instance_info{mlp_resource_queue=~"$ResourceQueue"} * ON(pod) group_right(mlp_resource_queue) DCGM_FI_PROF_SM_ACTIVE ) by (mlp_resource_queue)
Development machines
avg ( mlp_devinstance_info{mlp_resource_queue=~"$ResourceQueue"} * ON(pod) group_right(mlp_resource_queue) DCGM_FI_PROF_SM_ACTIVE ) by (mlp_resource_queue)
Average GPU memory utilization by workload type
Average
avg(mlp_instance_info{mlp_resource_queue=~"$ResourceQueue"} * ON(pod) group_right(mlp_resource_queue) (DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE))) by (mlp_resource_queue)
Custom tasks
avg(mlp_customtask_instance_info{mlp_resource_queue=~"$ResourceQueue"} * ON(pod) group_right(mlp_resource_queue) (DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE))) by (mlp_resource_queue)
Online services
avg(mlp_deployment_instance_info{mlp_resource_queue=~"$ResourceQueue"} * ON(pod) group_right(mlp_resource_queue) (DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE))) by (mlp_resource_queue)
Development machines
avg(mlp_devinstance_info{mlp_resource_queue=~"$ResourceQueue"} * ON(pod) group_right(mlp_resource_queue) (DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE))) by (mlp_resource_queue)
Top k custom tasks in the queue by GPU resources requested
{{mlp_customtask}}: topk($Topk, sum(mlp_customtask_instance_info{mlp_resource_queue=~"$ResourceQueue"} * ON(pod) group_right(mlp_customtask) mlp_instance_resource_requests{resource="GPU"}) by (mlp_customtask))
Top k online services in the queue by GPU resources requested
{{mlp_deployment}}: topk($Topk, sum(mlp_deployment_instance_info{mlp_resource_queue=~"$ResourceQueue"} * ON(pod) group_right(mlp_deployment) mlp_instance_resource_requests{resource="GPU"}) by (mlp_deployment))
Top k development machines in the queue by GPU resources requested
{{mlp_devinstance}}: topk($Topk, sum(mlp_devinstance_info{mlp_resource_queue=~"$ResourceQueue"} * ON(pod) group_right(mlp_devinstance) mlp_instance_resource_requests{resource="GPU"}) by (mlp_devinstance))
Top k custom tasks by GPU utilization (SM_ACTIVE)
{{mlp_customtask}}: topk($Topk, avg(mlp_customtask_instance_info{mlp_resource_queue=~"$ResourceQueue"} * ON(pod) group_right(mlp_customtask) DCGM_FI_PROF_SM_ACTIVE) by (mlp_customtask))
Top k online services by GPU utilization (SM_ACTIVE)
{{mlp_deployment}}: topk($Topk, avg(mlp_deployment_instance_info{mlp_resource_queue=~"$ResourceQueue"} * ON(pod) group_right(mlp_deployment) DCGM_FI_PROF_SM_ACTIVE) by (mlp_deployment))
Top k development machines by GPU utilization (SM_ACTIVE)
{{mlp_devinstance}}: topk($Topk, avg(mlp_devinstance_info{mlp_resource_queue=~"$ResourceQueue"} * ON(pod) group_right(mlp_devinstance) DCGM_FI_PROF_SM_ACTIVE) by (mlp_devinstance))
Bottom k custom tasks by GPU utilization (SM_ACTIVE)
{{mlp_customtask}}: bottomk($Topk, avg(mlp_customtask_instance_info{mlp_resource_queue=~"$ResourceQueue"} * ON(pod) group_right(mlp_customtask) DCGM_FI_PROF_SM_ACTIVE) by (mlp_customtask))
Bottom k online services by GPU utilization (SM_ACTIVE)
{{mlp_deployment}}: bottomk($Topk, avg(mlp_deployment_instance_info{mlp_resource_queue=~"$ResourceQueue"} * ON(pod) group_right(mlp_deployment) DCGM_FI_PROF_SM_ACTIVE) by (mlp_deployment))
Bottom k development machines by GPU utilization (SM_ACTIVE)
{{mlp_devinstance}}: bottomk($Topk, avg(mlp_devinstance_info{mlp_resource_queue=~"$ResourceQueue"} * ON(pod) group_right(mlp_devinstance) DCGM_FI_PROF_SM_ACTIVE) by (mlp_devinstance))
Top k custom tasks by GPU memory utilization
{{mlp_customtask}}: topk($Topk, avg(mlp_customtask_instance_info{mlp_resource_queue=~"$ResourceQueue"} * ON(pod) group_right(mlp_customtask) (DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE))) by (mlp_customtask))
Top k online services by GPU memory utilization
{{mlp_deployment}}: topk($Topk, avg(mlp_deployment_instance_info{mlp_resource_queue=~"$ResourceQueue"} * ON(pod) group_right(mlp_deployment) (DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE))) by (mlp_deployment))
Top k development machines by GPU memory utilization
{{mlp_devinstance}}: topk($Topk, avg(mlp_devinstance_info{mlp_resource_queue=~"$ResourceQueue"} * ON(pod) group_right(mlp_devinstance) (DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE))) by (mlp_devinstance))
Bottom k custom tasks by GPU memory utilization
{{mlp_customtask}}: bottomk($Topk, avg(mlp_customtask_instance_info{mlp_resource_queue=~"$ResourceQueue"} * ON(pod) group_right(mlp_customtask) (DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE))) by (mlp_customtask))
Bottom k online services by GPU memory utilization
{{mlp_deployment}}: bottomk($Topk, avg(mlp_deployment_instance_info{mlp_resource_queue=~"$ResourceQueue"} * ON(pod) group_right(mlp_deployment) (DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE))) by (mlp_deployment))
Bottom k development machines by GPU memory utilization
{{mlp_devinstance}}: bottomk($Topk, avg(mlp_devinstance_info{mlp_resource_queue=~"$ResourceQueue"} * ON(pod) group_right(mlp_devinstance) (DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE))) by (mlp_devinstance))
Total CPU resources (core)
mlp_resourcequeue_quota_capacity{resource="VCPU", mlp_resource_queue=~"$ResourceQueue"}
CPU resources requested (core)
mlp_resourcequeue_quota_allocated{mlp_resource_queue=~"$ResourceQueue", resource="VCPU"}
Total memory resources
mlp_resourcequeue_quota_capacity{resource="Memory", mlp_resource_queue=~"$ResourceQueue", unit="GiB"}
Memory resources requested
mlp_resourcequeue_quota_allocated{mlp_resource_queue=~"$ResourceQueue", resource="Memory"}
Total cloud disk capacity
sum(mlp_resourcequeue_volume_capacity{mlp_resource_queue=~"$ResourceQueue"})
Cloud disk capacity requested
sum(mlp_resourcequeue_volume_allocated{mlp_resource_queue=~"$ResourceQueue"})
Queue CPU request rate
Overall request rate
mlp_resourcequeue_quota_allocated{resource="VCPU", mlp_resource_queue=~"$ResourceQueue"} / mlp_resourcequeue_quota_capacity
Custom task request rate
sum(mlp_customtask_instance_info{mlp_resource_queue=~"$ResourceQueue"} * ON(pod) group_right(mlp_customtask) mlp_instance_resource_requests{resource="VCPU"}) / sum(mlp_resourcequeue_quota_capacity{resource="VCPU", mlp_resource_queue=~"$ResourceQueue"})
Development machine request rate
sum(mlp_devinstance_info{mlp_resource_queue=~"$ResourceQueue"} * ON(pod) group_right(mlp_devinstance) mlp_instance_resource_requests{resource="VCPU"}) / sum(mlp_resourcequeue_quota_capacity{resource="VCPU", mlp_resource_queue=~"$ResourceQueue"})
Online service request rate
sum(mlp_deployment_instance_info{mlp_resource_queue=~"$ResourceQueue"} * ON(pod) group_right(mlp_deployment) mlp_instance_resource_requests{resource="VCPU"}) / sum(mlp_resourcequeue_quota_capacity{resource="VCPU", mlp_resource_queue=~"$ResourceQueue"})
Queue memory request rate
Overall request rate
mlp_resourcequeue_quota_allocated{resource="Memory", mlp_resource_queue=~"$ResourceQueue"} / mlp_resourcequeue_quota_capacity
Development machine request rate
sum(mlp_devinstance_info{mlp_resource_queue=~"$ResourceQueue"} * ON(pod) group_right(mlp_devinstance) mlp_instance_resource_requests{resource="MEMORY"}) / sum(mlp_resourcequeue_quota_capacity{resource="Memory", mlp_resource_queue=~"$ResourceQueue"})
Custom task request rate
sum(mlp_customtask_instance_info{mlp_resource_queue=~"$ResourceQueue"} * ON(pod) group_right(mlp_customtask) mlp_instance_resource_requests{resource="MEMORY"}) / sum(mlp_resourcequeue_quota_capacity{resource="Memory", mlp_resource_queue=~"$ResourceQueue"})
Online service request rate
sum(mlp_deployment_instance_info{mlp_resource_queue=~"$ResourceQueue"} * ON(pod) group_right(mlp_deployment) mlp_instance_resource_requests{resource="MEMORY"}) / sum(mlp_resourcequeue_quota_capacity{resource="Memory", mlp_resource_queue=~"$ResourceQueue"})
Queue cloud disk capacity request rate
{{flavor}}: mlp_resourcequeue_volume_allocated / mlp_resourcequeue_volume_capacity{mlp_resource_queue=~"$ResourceQueue"}
Top k custom tasks in the queue by CPU resources requested (core)
{{mlp_customtask}}: topk($Topk, sum(mlp_customtask_instance_info{mlp_resource_queue=~"$ResourceQueue"} * ON(pod) group_right(mlp_customtask) mlp_instance_resource_requests{resource="VCPU"}) by (mlp_customtask))
Top k online services in the queue by CPU resources requested (core)
{{mlp_deployment}}: topk($Topk, sum(mlp_deployment_instance_info{mlp_resource_queue=~"$ResourceQueue"} * ON(pod) group_right(mlp_deployment) mlp_instance_resource_requests{resource="VCPU"}) by (mlp_deployment))
Top k development machines in the queue by CPU resources requested (core)
{{mlp_devinstance}}: topk($Topk, sum(mlp_devinstance_info{mlp_resource_queue=~"$ResourceQueue"} * ON(pod) group_right(mlp_devinstance) mlp_instance_resource_requests{resource="VCPU"}) by (mlp_devinstance))
Top k custom tasks in the queue by memory resources requested
{{mlp_customtask}}: topk($Topk, sum(mlp_customtask_instance_info{mlp_resource_queue=~"$ResourceQueue"} * ON(pod) group_right(mlp_customtask) mlp_instance_resource_requests{resource="MEMORY"}) by (mlp_customtask))
Top k online services in the queue by memory resources requested
{{mlp_deployment}}: topk($Topk, sum(mlp_deployment_instance_info{mlp_resource_queue=~"$ResourceQueue"} * ON(pod) group_right(mlp_deployment) mlp_instance_resource_requests{resource="MEMORY"}) by (mlp_deployment))
Top k development machines in the queue by memory resources requested
{{mlp_devinstance}}: topk($Topk, sum(mlp_devinstance_info{mlp_resource_queue=~"$ResourceQueue"} * ON(pod) group_right(mlp_devinstance) mlp_instance_resource_requests{resource="MEMORY"}) by (mlp_devinstance))