MLP preset dashboards in VMP
Last updated: 2024.09.13 17:31:13. First published: 2024.09.13 17:31:13.

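The Machine Learning Platform (MLP) ships several preset dashboards on VMP. The variables and PromQL queries behind each dashboard are listed below so that you can start using them quickly.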

Custom task dashboard (mlp-customtask-dashboard)

Variables

| Variable name | Display name | Query | Regex |
| --- | --- | --- | --- |
| ResourceGroup | Resource group | mlp_resourcegroup_info | /.*mlp_resource_group=\"(.*?)\".*/ |
| ResourceQueue | Queue | mlp_resourcequeue_info | /.*mlp_resource_queue=\"(.*?)\".*/ |
| CustomTask | Custom task | mlp_customtask_info{mlp_resource_queue=~"$ResourceQueue"} | /.*mlp_customtask=\"(.*?)\"/ |
| Instance | Instance | mlp_customtask_instance_info{mlp_customtask=~"$CustomTask"} | /.*pod=\"(.*?)\".*/ |
| TopK | Number of series kept by TopK/BottomK filtering | - | - |
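
Each regex in the table keeps only the value of one label from the series returned by the variable query. The following is a minimal Python sketch of that extraction, assuming the series are rendered as `metric{label="value",...}` strings. The sample series, the `region` label, and the helper name `extract_label_values` are hypothetical and used only for illustration.

```python
import re

# Capture-group regex from the ResourceGroup variable. The \" escapes used in the
# dashboard definition are not needed inside a Python raw string.
RESOURCE_GROUP_RE = re.compile(r'.*mlp_resource_group="(.*?)".*')


def extract_label_values(series, pattern):
    """Return the sorted, de-duplicated capture-group values found in series strings."""
    values = set()
    for text in series:
        match = pattern.match(text)
        if match:
            values.add(match.group(1))
    return sorted(values)


# Hypothetical series strings; the label sets in a real workspace may differ.
sample_series = [
    'mlp_resourcegroup_info{mlp_resource_group="rg-train",region="example"}',
    'mlp_resourcegroup_info{mlp_resource_group="rg-infer",region="example"}',
    'mlp_resourcegroup_info{mlp_resource_group="rg-train",region="other"}',
]

resource_groups = extract_label_values(sample_series, RESOURCE_GROUP_RE)
print(resource_groups)
```

The lazy group `(.*?)` stops at the first closing quote, so only the label value is captured even when other labels follow it.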

PromQL queries

Total GPU resources

sum(mlp_customtask_instance_info{mlp_customtask=~"$CustomTask"} * ON(pod) group_right(mlp_customtask) mlp_instance_resource_requests{resource="GPU"})

Total CPU resources (cores)

sum(mlp_customtask_instance_info{mlp_customtask=~"$CustomTask"} * ON(pod) group_right(mlp_customtask) mlp_instance_resource_requests{resource="VCPU"})

Total memory resources

sum(mlp_customtask_instance_info{mlp_customtask=~"$CustomTask"} * ON(pod) group_right(mlp_customtask) mlp_instance_resource_requests{resource="MEMORY"})
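
The three totals above share one pattern: an info series per pod is joined to the per-pod requests with `* ON(pod) group_right(mlp_customtask)`, and the products are summed. The sketch below mimics that join in Python, assuming each info series has the value 1 and that there is one info series per pod. The pod names and sample values are hypothetical.

```python
# Hypothetical info series: value 1, one per pod, identifying the owning task.
instance_info = [
    {"pod": "task-a-worker-0", "mlp_customtask": "task-a", "value": 1},
    {"pod": "task-a-worker-1", "mlp_customtask": "task-a", "value": 1},
]

# Hypothetical request series: one per pod and resource type.
resource_requests = [
    {"pod": "task-a-worker-0", "resource": "GPU", "value": 8},
    {"pod": "task-a-worker-1", "resource": "GPU", "value": 8},
    {"pod": "other-pod", "resource": "GPU", "value": 4},
    {"pod": "task-a-worker-0", "resource": "VCPU", "value": 96},
]


def total_requests(info, requests, resource):
    """Mimic sum(info * ON(pod) group_right(mlp_customtask) requests{resource=...})."""
    info_by_pod = {series["pod"]: series for series in info}  # the "one" side of the join
    total = 0
    for request in requests:
        if request["resource"] != resource:
            continue
        match = info_by_pod.get(request["pod"])
        if match is None:
            continue  # pods that do not belong to the selected tasks are dropped
        total += match["value"] * request["value"]
    return total


gpu_total = total_requests(instance_info, resource_requests, "GPU")
vcpu_total = total_requests(instance_info, resource_requests, "VCPU")
print(gpu_total, vcpu_total)
```

Because the info series is 1, the multiplication leaves the request value unchanged and only acts as a filter on the pods of the selected tasks.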

Total instances

count(mlp_customtask_instance_info{mlp_customtask=~"$CustomTask"})
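
To run one of these queries outside the dashboard, the `$Variable` placeholders have to be replaced with concrete values first. The sketch below does this with Python's `string.Template` and builds a URL for the standard Prometheus HTTP API path `/api/v1/query`. The base URL is a placeholder, authentication is omitted, and joining multiple selections with `|` is an assumption about how the dashboard renders multi-value variables in a regex matcher.

```python
from string import Template
from urllib.parse import urlencode

# Query copied from this page; $CustomTask is the dashboard variable.
QUERY_TEMPLATE = 'count(mlp_customtask_instance_info{mlp_customtask=~"$CustomTask"})'


def render_query(template, **variables):
    """Replace $Name placeholders with concrete values."""
    return Template(template).substitute(**variables)


def build_query_url(base_url, promql):
    """Build an instant-query URL for a Prometheus-compatible endpoint."""
    return f"{base_url}/api/v1/query?{urlencode({'query': promql})}"


# Hypothetical task names, joined with | so that the regex matcher accepts either one.
promql = render_query(QUERY_TEMPLATE, CustomTask="task-a|task-b")
# Placeholder host; substitute the query endpoint of your own workspace.
url = build_query_url("https://vmp.example.invalid", promql)
print(promql)
print(url)
```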

Running instances

sum(mlp_customtask_instance_info{mlp_customtask=~"$CustomTask"} * ON(pod) mlp_instance_pod_status_phase{phase="Running"})

Ready instances

sum(mlp_customtask_instance_info{mlp_customtask=~"$CustomTask"} * ON(pod) mlp_instance_pod_status_ready{ready="True"})

GPU utilization (SM_ACTIVE)

Avg: avg(DCGM_FI_PROF_SM_ACTIVE AND ON(pod) mlp_customtask_instance_info{mlp_customtask=~"$CustomTask"})
Max: max(DCGM_FI_PROF_SM_ACTIVE AND ON(pod) mlp_customtask_instance_info{mlp_customtask=~"$CustomTask"})
Min: min(DCGM_FI_PROF_SM_ACTIVE AND ON(pod) mlp_customtask_instance_info{mlp_customtask=~"$CustomTask"} )

GPU memory utilization

Avg: avg((DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_FREE + DCGM_FI_DEV_FB_USED)) AND ON(pod) mlp_customtask_instance_info{mlp_customtask=~"$CustomTask"})
Max: max((DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_FREE + DCGM_FI_DEV_FB_USED)) AND ON(pod) mlp_customtask_instance_info{mlp_customtask=~"$CustomTask"})
Min: min((DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_FREE + DCGM_FI_DEV_FB_USED)) AND ON(pod) mlp_customtask_instance_info{mlp_customtask=~"$CustomTask"})
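
These queries compute `used / (free + used)` for every GPU, keep only the GPUs whose pod belongs to a selected task (`AND ON(pod)`), and then aggregate. A minimal Python sketch of the same steps follows, with hypothetical framebuffer samples in MiB.

```python
# Hypothetical per-GPU framebuffer samples in MiB, keyed by (pod, gpu).
fb_used = {
    ("task-a-worker-0", "0"): 60000,
    ("task-a-worker-0", "1"): 20000,
    ("other-pod", "0"): 1000,
}
fb_free = {
    ("task-a-worker-0", "0"): 20000,
    ("task-a-worker-0", "1"): 60000,
    ("other-pod", "0"): 79000,
}

# Pods that have a matching mlp_customtask_instance_info series.
selected_pods = {"task-a-worker-0"}


def memory_utilization(used, free, pods):
    """Return used / (free + used) per GPU, kept only for GPUs on selected pods."""
    return {
        key: used[key] / (free[key] + used[key])
        for key in used
        if key in free and key[0] in pods
    }


utilization = memory_utilization(fb_used, fb_free, selected_pods)
values = list(utilization.values())
avg_util = sum(values) / len(values)
max_util = max(values)
min_util = min(values)
print(avg_util, max_util, min_util)
```

The result is a ratio between 0 and 1. Multiply by 100 if a percentage is preferred.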

NVLink inbound bandwidth

Avg: avg(DCGM_FI_PROF_NVLINK_RX_BYTES AND ON(pod) mlp_customtask_instance_info{mlp_customtask=~"$CustomTask"})
Max: max(DCGM_FI_PROF_NVLINK_RX_BYTES AND ON(pod) mlp_customtask_instance_info{mlp_customtask=~"$CustomTask"})
Min: min(DCGM_FI_PROF_NVLINK_RX_BYTES AND ON(pod) mlp_customtask_instance_info{mlp_customtask=~"$CustomTask"})

NVLink outbound bandwidth

Avg: avg(DCGM_FI_PROF_NVLINK_TX_BYTES AND ON(pod) mlp_customtask_instance_info{mlp_customtask=~"$CustomTask"} )
Max: max(DCGM_FI_PROF_NVLINK_TX_BYTES AND ON(pod) mlp_customtask_instance_info{mlp_customtask=~"$CustomTask"} )
Min: min(DCGM_FI_PROF_NVLINK_TX_BYTES AND ON(pod) mlp_customtask_instance_info{mlp_customtask=~"$CustomTask"} )

GPU PCIe inbound bandwidth

Avg: avg(DCGM_FI_PROF_PCIE_RX_BYTES AND ON(pod) mlp_customtask_instance_info{mlp_customtask=~"$CustomTask"})
Max: max(DCGM_FI_PROF_PCIE_RX_BYTES AND ON(pod) mlp_customtask_instance_info{mlp_customtask=~"$CustomTask"})
Min: min(DCGM_FI_PROF_PCIE_RX_BYTES AND ON(pod) mlp_customtask_instance_info{mlp_customtask=~"$CustomTask"})

GPU PCIe outbound bandwidth

Avg: avg(DCGM_FI_PROF_PCIE_TX_BYTES AND ON(pod) mlp_customtask_instance_info{mlp_customtask=~"$CustomTask"})
Max: max(DCGM_FI_PROF_PCIE_TX_BYTES AND ON(pod) mlp_customtask_instance_info{mlp_customtask=~"$CustomTask"})
Min: min(DCGM_FI_PROF_PCIE_TX_BYTES AND ON(pod) mlp_customtask_instance_info{mlp_customtask=~"$CustomTask"})

Number of GPUs with XID errors

count(DCGM_FI_DEV_XID_ERRORS AND ON(pod) mlp_customtask_instance_info{mlp_customtask=~"$CustomTask"}) - count((DCGM_FI_DEV_XID_ERRORS == 0) AND ON(pod) mlp_customtask_instance_info{mlp_customtask=~"$CustomTask"})
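
The query above counts all GPUs of the selected tasks and subtracts the GPUs whose XID value is 0. The sketch below reproduces that arithmetic in Python, assuming that a non-zero `DCGM_FI_DEV_XID_ERRORS` value means an XID error has been recorded on that GPU. The sample values are hypothetical.

```python
# Hypothetical XID samples keyed by (pod, gpu); 0 is taken to mean "no XID error recorded".
xid_errors = {
    ("task-a-worker-0", "0"): 0,
    ("task-a-worker-0", "1"): 79,
    ("task-a-worker-1", "0"): 0,
    ("task-a-worker-1", "1"): 0,
}


def gpus_with_xid_errors(samples):
    """Total GPU count minus the count of GPUs reporting 0."""
    total = len(samples)
    healthy = sum(1 for value in samples.values() if value == 0)
    return total - healthy


faulty_gpus = gpus_with_xid_errors(xid_errors)
print(faulty_gpus)
```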

Retired GPU memory page count

SBE: sum(DCGM_FI_DEV_RETIRED_SBE AND ON(pod) mlp_customtask_instance_info{mlp_customtask=~"$CustomTask"})
DBE: sum(DCGM_FI_DEV_RETIRED_DBE AND ON(pod) mlp_customtask_instance_info{mlp_customtask=~"$CustomTask"})

CPU utilization

Avg: avg((rate(container_cpu_usage_seconds_total[30s]) / ON(pod) container_spec_cpu_shares * 1000) AND ON(pod) mlp_customtask_instance_info{mlp_customtask=~"$CustomTask"})
Max: max((rate(container_cpu_usage_seconds_total[30s]) / ON(pod) container_spec_cpu_shares * 1000) AND ON(pod) mlp_customtask_instance_info{mlp_customtask=~"$CustomTask"})
Min: min((rate(container_cpu_usage_seconds_total[30s]) / ON(pod) container_spec_cpu_shares * 1000) AND ON(pod) mlp_customtask_instance_info{mlp_customtask=~"$CustomTask"})
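
The CPU queries divide the per-second CPU time by `container_spec_cpu_shares` and multiply by 1000. The worked example below assumes the Kubernetes convention of 1024 shares per requested core, so the result is a close approximation of "cores used / cores requested" rather than an exact ratio. The input numbers are hypothetical.

```python
def cpu_utilization(cpu_seconds_per_second, cpu_shares):
    """Mirror rate(container_cpu_usage_seconds_total[30s]) / container_spec_cpu_shares * 1000."""
    return cpu_seconds_per_second / cpu_shares * 1000


# Assumption: a container that requests 4 cores is given 4 * 1024 CPU shares.
shares_for_4_cores = 4 * 1024

# Hypothetical rate: the container burns 2 CPU-seconds per second, i.e. 2 busy cores.
approx_ratio = cpu_utilization(2.0, shares_for_4_cores)
exact_ratio = 2.0 / 4

print(approx_ratio, exact_ratio)
```

Under this assumption the dashboard value reads slightly lower than the exact ratio, because 1000 is used where 1024 would cancel the shares exactly.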

Memory utilization

Avg: avg((container_memory_working_set_bytes / container_spec_memory_limit_bytes) AND ON(pod) mlp_customtask_instance_info{mlp_customtask=~"$CustomTask"})
Max: max((container_memory_working_set_bytes / container_spec_memory_limit_bytes) AND ON(pod) mlp_customtask_instance_info{mlp_customtask=~"$CustomTask"})
Min: min((container_memory_working_set_bytes / container_spec_memory_limit_bytes) AND ON(pod) mlp_customtask_instance_info{mlp_customtask=~"$CustomTask"})

TopK instances by average GPU utilization (SM_ACTIVE)

{{pod}} : topk($TopK, avg(DCGM_FI_PROF_SM_ACTIVE AND ON(pod) mlp_customtask_instance_info{mlp_customtask=~"$CustomTask"}) by (pod))
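
The TopK and BottomK panels first average the per-GPU samples of each pod (`avg(...) by (pod)`) and then keep the `$TopK` highest or lowest pods. A small Python sketch of that two-step logic follows. The pod names and sample values are hypothetical.

```python
import heapq
from collections import defaultdict

# Hypothetical SM_ACTIVE samples as (pod, gpu, value).
sm_active = [
    ("pod-a", "0", 0.75),
    ("pod-a", "1", 0.25),
    ("pod-b", "0", 0.125),
    ("pod-b", "1", 0.375),
    ("pod-c", "0", 0.875),
]


def avg_by_pod(samples):
    """Average the per-GPU values of each pod, like avg(...) by (pod)."""
    grouped = defaultdict(list)
    for pod, _gpu, value in samples:
        grouped[pod].append(value)
    return {pod: sum(vals) / len(vals) for pod, vals in grouped.items()}


def topk(k, averages):
    """Keep the k pods with the highest average."""
    return heapq.nlargest(k, averages.items(), key=lambda item: item[1])


def bottomk(k, averages):
    """Keep the k pods with the lowest average."""
    return heapq.nsmallest(k, averages.items(), key=lambda item: item[1])


averages = avg_by_pod(sm_active)
top2 = topk(2, averages)
bottom2 = bottomk(2, averages)
print(top2)
print(bottom2)
```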

TopK instances by average GPU memory utilization

{{pod}} : topk($TopK, avg((DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE)) AND ON(pod) mlp_customtask_instance_info{mlp_customtask=~"$CustomTask"}) by (pod))

BottomK instances by average GPU utilization (SM_ACTIVE)

{{pod}} : bottomk($TopK, avg(DCGM_FI_PROF_SM_ACTIVE AND ON(pod) mlp_customtask_instance_info{mlp_customtask=~"$CustomTask"}) by (pod))

BottomK instances by average GPU memory utilization

{{pod}} : bottomk($TopK, avg((DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE)) AND ON(pod) mlp_customtask_instance_info{mlp_customtask=~"$CustomTask"}) by (pod))

TopK instances by average NVLink inbound bandwidth

{{pod}} : topk($TopK, avg(DCGM_FI_PROF_NVLINK_RX_BYTES AND ON(pod) mlp_customtask_instance_info{mlp_customtask=~"$CustomTask"} ) by (pod))

TopK instances by average NVLink outbound bandwidth

{{pod}} : topk($TopK, avg(DCGM_FI_PROF_NVLINK_TX_BYTES AND ON(pod) mlp_customtask_instance_info{mlp_customtask=~"$CustomTask"} ) by (pod))

TopK instances by average PCIe inbound bandwidth

{{pod}} : topk($TopK, avg(DCGM_FI_PROF_PCIE_RX_BYTES AND ON(pod) mlp_customtask_instance_info{mlp_customtask=~"$CustomTask"} ) by (pod))

TopK instances by average PCIe outbound bandwidth

{{pod}} : topk($TopK, avg(DCGM_FI_PROF_PCIE_TX_BYTES AND ON(pod) mlp_customtask_instance_info{mlp_customtask=~"$CustomTask"} ) by (pod))

BottomK instances by average NVLink inbound bandwidth

{{pod}} : bottomk($TopK, avg(DCGM_FI_PROF_NVLINK_RX_BYTES AND ON(pod) mlp_customtask_instance_info{mlp_customtask=~"$CustomTask"} ) by (pod))

BottomK instances by average NVLink outbound bandwidth

{{pod}} : bottomk($TopK, avg(DCGM_FI_PROF_NVLINK_TX_BYTES AND ON(pod) mlp_customtask_instance_info{mlp_customtask=~"$CustomTask"} ) by (pod))

BottomK instances by average PCIe inbound bandwidth

{{pod}} : bottomk($TopK, avg(DCGM_FI_PROF_PCIE_RX_BYTES AND ON(pod) mlp_customtask_instance_info{mlp_customtask=~"$CustomTask"} ) by (pod))

BottomK instances by average PCIe outbound bandwidth

{{pod}} : bottomk($TopK, avg(DCGM_FI_PROF_PCIE_TX_BYTES AND ON(pod) mlp_customtask_instance_info{mlp_customtask=~"$CustomTask"} ) by (pod))

Instance GPU utilization (SM_ACTIVE)

Avg: avg(DCGM_FI_PROF_SM_ACTIVE{pod=~"$Instance"})
GPU{{gpu}}: DCGM_FI_PROF_SM_ACTIVE{pod=~"$Instance"}

Instance GPU memory utilization

Avg: avg(DCGM_FI_DEV_FB_USED{pod=~"$Instance"} / (DCGM_FI_DEV_FB_FREE + DCGM_FI_DEV_FB_USED))
GPU{{gpu}}: DCGM_FI_DEV_FB_USED{pod=~"$Instance"} / (DCGM_FI_DEV_FB_FREE + DCGM_FI_DEV_FB_USED)

Per-GPU NVLink inbound bandwidth within an instance

Avg: avg(DCGM_FI_PROF_NVLINK_RX_BYTES{pod=~"$Instance"})
GPU{{gpu}}: DCGM_FI_PROF_NVLINK_RX_BYTES{pod=~"$Instance"}

Per-GPU NVLink outbound bandwidth within an instance

Avg: avg(DCGM_FI_PROF_NVLINK_TX_BYTES{pod=~"$Instance"})
GPU{{gpu}}: DCGM_FI_PROF_NVLINK_TX_BYTES{pod=~"$Instance"}

Per-GPU PCIe inbound bandwidth within an instance

Avg: avg(DCGM_FI_PROF_PCIE_RX_BYTES{pod=~"$Instance"})
GPU{{gpu}}: DCGM_FI_PROF_PCIE_RX_BYTES{pod=~"$Instance"}

Per-GPU PCIe outbound bandwidth within an instance

Avg: avg(DCGM_FI_PROF_PCIE_TX_BYTES{pod=~"$Instance"})
GPU{{gpu}}: DCGM_FI_PROF_PCIE_TX_BYTES{pod=~"$Instance"}

Instance XID error details

GPU{{gpu}}: DCGM_FI_DEV_XID_ERRORS{pod=~"$Instance"}

Per-GPU retired GPU memory page count for an instance

GPU{{gpu}}: DBE: DCGM_FI_DEV_RETIRED_DBE{pod=~"$Instance"}
GPU{{gpu}}: SBE: DCGM_FI_DEV_RETIRED_SBE{pod=~"$Instance"}

Instance CPU utilization

(rate(container_cpu_usage_seconds_total{pod=~"$Instance"}[30s]) / ON(pod) container_spec_cpu_shares{pod=~"$Instance"} * 1000)

Instance memory saturation

container_memory_working_set_bytes{pod=~"$Instance"} / container_spec_memory_limit_bytes

Queue resource dashboard (mlp-gpu-resource-dashboard)

Variables

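| Variable name | Display name | Query | Regex |
| --- | --- | --- | --- |
| ResourceGroup | Resource group | mlp_resourcegroup_info | /.*mlp_resource_group=\"(.*?)\".*/ |
| ResourceQueue | Queue | mlp_resourcequeue_info{mlp_resource_group=~"$ResourceGroup"} | /.*mlp_resource_queue=\"(.*?)\".*/ |
| TopK | Top/Bottom k value | - | - |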

PromQL queries

Total GPU resources

sum(mlp_resourcequeue_quota_capacity{mlp_resource_queue=~"$ResourceQueue", resource="GPU"})

Requested GPUs

sum(mlp_resourcequeue_quota_allocated{mlp_resource_queue=~"$ResourceQueue", resource="GPU"})

Number of custom tasks

sum(mlp_customtask_info{mlp_resource_queue=~"$ResourceQueue"}) or vector(0)

Number of custom task instances

sum(mlp_customtask_instance_info{mlp_resource_queue=~"$ResourceQueue"}) or vector(0)

Number of online services

sum(mlp_deployment_info{mlp_resource_queue=~"$ResourceQueue"}) or vector(0)

Number of online service instances

sum(mlp_deployment_instance_info{mlp_resource_queue=~"$ResourceQueue"}) or vector(0)

Number of development machines

sum(mlp_devinstance_info{mlp_resource_queue=~"$ResourceQueue"}) or vector(0)

Number of development machine instances

sum(mlp_devinstance_info{mlp_resource_queue=~"$ResourceQueue"}) or vector(0)
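
The count queries above end with `or vector(0)`. In PromQL, `sum(...)` over an empty selection returns no series at all, so the fallback makes the panel show 0 instead of "No data". The sketch below illustrates the idea in Python. The function name is hypothetical.

```python
def sum_or_zero(samples):
    """Mimic `sum(...) or vector(0)`: an empty selection yields 0 instead of no result."""
    if not samples:
        return 0  # the vector(0) fallback
    return sum(samples)


# Hypothetical info series values (each info series is assumed to carry the value 1).
with_tasks = sum_or_zero([1, 1, 1])
without_tasks = sum_or_zero([])
print(with_tasks, without_tasks)
```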

Queue GPU request ratio

Overall request ratio

sum(mlp_resourcequeue_quota_allocated{resource="GPU", mlp_resource_queue=~"$ResourceQueue"}) by (mlp_resource_queue) / sum(mlp_resourcequeue_quota_capacity{resource="GPU"}) by (mlp_resource_queue)
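
The query above sums the allocated and the total GPU quota per queue and divides the two results, pairing them on the `mlp_resource_queue` label. A minimal Python sketch of that per-queue division follows. The queue names, the extra `zone` label, and the sample values are hypothetical.

```python
from collections import defaultdict


def sum_by_queue(samples):
    """Sum sample values per queue, like sum(...) by (mlp_resource_queue)."""
    totals = defaultdict(float)
    for sample in samples:
        totals[sample["mlp_resource_queue"]] += sample["value"]
    return dict(totals)


def request_ratio(allocated, capacity, selected_queues):
    """Divide allocated by capacity for every queue present on both sides."""
    alloc = sum_by_queue(s for s in allocated if s["mlp_resource_queue"] in selected_queues)
    cap = sum_by_queue(capacity)
    # PromQL would return Inf or NaN for a zero capacity; this sketch skips such queues.
    return {q: alloc[q] / cap[q] for q in alloc if q in cap and cap[q] != 0}


# Hypothetical GPU quota samples.
allocated = [
    {"mlp_resource_queue": "q1", "zone": "a", "value": 6},
    {"mlp_resource_queue": "q1", "zone": "b", "value": 2},
    {"mlp_resource_queue": "q2", "zone": "a", "value": 2},
]
capacity = [
    {"mlp_resource_queue": "q1", "value": 16},
    {"mlp_resource_queue": "q2", "value": 8},
    {"mlp_resource_queue": "q3", "value": 32},
]

ratios = request_ratio(allocated, capacity, {"q1", "q2"})
print(ratios)
```

Queue `q3` has capacity but is not selected, so it has no counterpart on the left-hand side and does not appear in the result.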

Custom tasks

sum(mlp_customtask_instance_info{mlp_resource_queue=~"$ResourceQueue"} * ON(pod) group_right(mlp_customtask) mlp_instance_resource_requests{resource="GPU"}) / sum(mlp_resourcequeue_quota_capacity{resource="GPU", mlp_resource_queue=~"$ResourceQueue"})

Development machines

sum(mlp_devinstance_info{mlp_resource_queue=~"$ResourceQueue"} * ON(pod) group_right(mlp_devinstance) mlp_instance_resource_requests{resource="GPU"}) / sum(mlp_resourcequeue_quota_capacity{resource="GPU", mlp_resource_queue=~"$ResourceQueue"}) 

Online services

sum(mlp_deployment_instance_info{mlp_resource_queue=~"$ResourceQueue"} * ON(pod) group_right(mlp_deployment) mlp_instance_resource_requests{resource="GPU"}) / sum(mlp_resourcequeue_quota_capacity{resource="GPU", mlp_resource_queue=~"$ResourceQueue"})

Average GPU utilization (SM_ACTIVE) by workload type

Overall average

avg ( mlp_instance_info{mlp_resource_queue=~"$ResourceQueue"} *  ON(pod) group_right(mlp_resource_queue) DCGM_FI_PROF_SM_ACTIVE ) by (mlp_resource_queue)

Custom tasks

avg ( mlp_customtask_instance_info{mlp_resource_queue=~"$ResourceQueue"} *  ON(pod) group_right(mlp_resource_queue) DCGM_FI_PROF_SM_ACTIVE ) by (mlp_resource_queue)

Online services

avg ( mlp_deployment_instance_info{mlp_resource_queue=~"$ResourceQueue"} *  ON(pod) group_right(mlp_resource_queue) DCGM_FI_PROF_SM_ACTIVE ) by (mlp_resource_queue)

Development machines

avg ( mlp_devinstance_info{mlp_resource_queue=~"$ResourceQueue"} *  ON(pod) group_right(mlp_resource_queue) DCGM_FI_PROF_SM_ACTIVE ) by (mlp_resource_queue)

Average GPU memory utilization by workload type

Overall average

avg(mlp_instance_info{mlp_resource_queue=~"$ResourceQueue"} * ON(pod) group_right(mlp_resource_queue) (DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE))) by (mlp_resource_queue)

Custom tasks

avg(mlp_customtask_instance_info{mlp_resource_queue=~"$ResourceQueue"} * ON(pod) group_right(mlp_resource_queue) (DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE))) by (mlp_resource_queue)

Online services

avg(mlp_deployment_instance_info{mlp_resource_queue=~"$ResourceQueue"} * ON(pod) group_right(mlp_resource_queue) (DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE))) by (mlp_resource_queue)

Development machines

avg(mlp_devinstance_info{mlp_resource_queue=~"$ResourceQueue"} * ON(pod) group_right(mlp_resource_queue) (DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE))) by (mlp_resource_queue)

Top k custom tasks in the queue by GPU requests

{{mlp_customtask}}: topk($Topk, sum(mlp_customtask_instance_info{mlp_resource_queue=~"$ResourceQueue"} * ON(pod) group_right(mlp_customtask) mlp_instance_resource_requests{resource="GPU"}) by (mlp_customtask))

Top k online services in the queue by GPU requests

{{mlp_deployment}}: topk($Topk, sum(mlp_deployment_instance_info{mlp_resource_queue=~"$ResourceQueue"} * ON(pod) group_right(mlp_deployment) mlp_instance_resource_requests{resource="GPU"}) by (mlp_deployment))

Top k development machines in the queue by GPU requests

{{mlp_devinstance}}: topk($Topk, sum(mlp_devinstance_info{mlp_resource_queue=~"$ResourceQueue"} * ON(pod) group_right(mlp_devinstance) mlp_instance_resource_requests{resource="GPU"}) by (mlp_devinstance))

Top k custom tasks by GPU utilization (SM_ACTIVE)

{{mlp_customtask}}: topk($Topk, avg(mlp_customtask_instance_info{mlp_resource_queue=~"$ResourceQueue"} * ON(pod) group_right(mlp_customtask) DCGM_FI_PROF_SM_ACTIVE) by (mlp_customtask))

Top k online services by GPU utilization (SM_ACTIVE)

{{mlp_deployment}}: topk($Topk, avg(mlp_deployment_instance_info{mlp_resource_queue=~"$ResourceQueue"} * ON(pod) group_right(mlp_deployment) DCGM_FI_PROF_SM_ACTIVE) by (mlp_deployment))

Top k development machines by GPU utilization (SM_ACTIVE)

{{mlp_devinstance}}: topk($Topk, avg(mlp_devinstance_info{mlp_resource_queue=~"$ResourceQueue"} * ON(pod) group_right(mlp_devinstance) DCGM_FI_PROF_SM_ACTIVE) by (mlp_devinstance))

Bottom k custom tasks by GPU utilization (SM_ACTIVE)

{{mlp_customtask}}: bottomk($Topk, avg(mlp_customtask_instance_info{mlp_resource_queue=~"$ResourceQueue"} * ON(pod) group_right(mlp_customtask) DCGM_FI_PROF_SM_ACTIVE) by (mlp_customtask))

Bottom k online services by GPU utilization (SM_ACTIVE)

{{mlp_deployment}}: bottomk($Topk, avg(mlp_deployment_instance_info{mlp_resource_queue=~"$ResourceQueue"} * ON(pod) group_right(mlp_deployment) DCGM_FI_PROF_SM_ACTIVE) by (mlp_deployment))

Bottom k development machines by GPU utilization (SM_ACTIVE)

{{mlp_devinstance}}: bottomk($Topk, avg(mlp_devinstance_info{mlp_resource_queue=~"$ResourceQueue"} * ON(pod) group_right(mlp_devinstance) DCGM_FI_PROF_SM_ACTIVE) by (mlp_devinstance))

Top k custom tasks by GPU memory utilization

{{mlp_customtask}}: topk($Topk, avg(mlp_customtask_instance_info{mlp_resource_queue=~"$ResourceQueue"} * ON(pod) group_right(mlp_customtask) (DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE))) by (mlp_customtask))

Top k online services by GPU memory utilization

{{mlp_deployment}}: topk($Topk, avg(mlp_deployment_instance_info{mlp_resource_queue=~"$ResourceQueue"} * ON(pod) group_right(mlp_deployment) (DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE))) by (mlp_deployment))

Top k development machines by GPU memory utilization

{{mlp_devinstance}}: topk($Topk, avg(mlp_devinstance_info{mlp_resource_queue=~"$ResourceQueue"} * ON(pod) group_right(mlp_devinstance) (DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE))) by (mlp_devinstance))

Bottom k custom tasks by GPU memory utilization

{{mlp_customtask}}: bottomk($Topk, avg(mlp_customtask_instance_info{mlp_resource_queue=~"$ResourceQueue"} * ON(pod) group_right(mlp_customtask) (DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE))) by (mlp_customtask))

Bottom k online services by GPU memory utilization

{{mlp_deployment}}: bottomk($Topk, avg(mlp_deployment_instance_info{mlp_resource_queue=~"$ResourceQueue"} * ON(pod) group_right(mlp_deployment) (DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE))) by (mlp_deployment))

Bottom k development machines by GPU memory utilization

{{mlp_devinstance}}: bottomk($Topk, avg(mlp_devinstance_info{mlp_resource_queue=~"$ResourceQueue"} * ON(pod) group_right(mlp_devinstance) (DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE))) by (mlp_devinstance))

Total CPU resources (cores)

mlp_resourcequeue_quota_capacity{resource="VCPU", mlp_resource_queue=~"$ResourceQueue"}

Requested CPU (cores)

mlp_resourcequeue_quota_allocated{mlp_resource_queue=~"$ResourceQueue", resource="VCPU"}

Total memory resources

mlp_resourcequeue_quota_capacity{resource="Memory", mlp_resource_queue=~"$ResourceQueue", unit="GiB"}

Requested memory

mlp_resourcequeue_quota_allocated{mlp_resource_queue=~"$ResourceQueue", resource="Memory"}

Total cloud disk capacity

sum(mlp_resourcequeue_volume_capacity{mlp_resource_queue=~"$ResourceQueue"})

Requested cloud disk capacity

sum(mlp_resourcequeue_volume_allocated{mlp_resource_queue=~"$ResourceQueue"})

Queue CPU request ratio

Overall request ratio

mlp_resourcequeue_quota_allocated{resource="VCPU", mlp_resource_queue=~"$ResourceQueue"} / mlp_resourcequeue_quota_capacity

Custom task request ratio

sum(mlp_customtask_instance_info{mlp_resource_queue=~"$ResourceQueue"} * ON(pod) group_right(mlp_customtask) mlp_instance_resource_requests{resource="VCPU"}) / sum(mlp_resourcequeue_quota_capacity{resource="VCPU", mlp_resource_queue=~"$ResourceQueue"}) 

Development machine request ratio

sum(mlp_devinstance_info{mlp_resource_queue=~"$ResourceQueue"} * ON(pod) group_right(mlp_devinstance) mlp_instance_resource_requests{resource="VCPU"}) / sum(mlp_resourcequeue_quota_capacity{resource="VCPU", mlp_resource_queue=~"$ResourceQueue"})

Online service request ratio

sum(mlp_deployment_instance_info{mlp_resource_queue=~"$ResourceQueue"} * ON(pod) group_right(mlp_deployment) mlp_instance_resource_requests{resource="VCPU"}) / sum(mlp_resourcequeue_quota_capacity{resource="VCPU", mlp_resource_queue=~"$ResourceQueue"})

Queue memory request ratio

Overall request ratio

mlp_resourcequeue_quota_allocated{resource="Memory", mlp_resource_queue=~"$ResourceQueue"} / mlp_resourcequeue_quota_capacity

Development machine request ratio

sum(mlp_devinstance_info{mlp_resource_queue=~"$ResourceQueue"} * ON(pod) group_right(mlp_devinstance) mlp_instance_resource_requests{resource="MEMORY"}) / sum(mlp_resourcequeue_quota_capacity{resource="Memory", mlp_resource_queue=~"$ResourceQueue"})

Custom task request ratio

sum(mlp_customtask_instance_info{mlp_resource_queue=~"$ResourceQueue"} * ON(pod) group_right(mlp_customtask) mlp_instance_resource_requests{resource="MEMORY"}) / sum(mlp_resourcequeue_quota_capacity{resource="Memory", mlp_resource_queue=~"$ResourceQueue"})

Online service request ratio

sum(mlp_deployment_instance_info{mlp_resource_queue=~"$ResourceQueue"} * ON(pod) group_right(mlp_deployment) mlp_instance_resource_requests{resource="MEMORY"}) / sum(mlp_resourcequeue_quota_capacity{resource="Memory", mlp_resource_queue=~"$ResourceQueue"})

Queue cloud disk capacity request ratio

{{flavor}}: mlp_resourcequeue_volume_allocated / mlp_resourcequeue_volume_capacity{mlp_resource_queue=~"$ResourceQueue"}

Top k custom tasks in the queue by CPU requests (cores)

{{mlp_customtask}}: topk($Topk, sum(mlp_customtask_instance_info{mlp_resource_queue=~"$ResourceQueue"} * ON(pod) group_right(mlp_customtask) mlp_instance_resource_requests{resource="VCPU"}) by (mlp_customtask))

Top k online services in the queue by CPU requests (cores)

{{mlp_deployment}}: topk($Topk, sum(mlp_deployment_instance_info{mlp_resource_queue=~"$ResourceQueue"} * ON(pod) group_right(mlp_deployment) mlp_instance_resource_requests{resource="VCPU"}) by (mlp_deployment))

Top k development machines in the queue by CPU requests (cores)

{{mlp_devinstance}}: topk($Topk, sum(mlp_devinstance_info{mlp_resource_queue=~"$ResourceQueue"} * ON(pod) group_right(mlp_devinstance) mlp_instance_resource_requests{resource="VCPU"}) by (mlp_devinstance))

Top k custom tasks in the queue by memory requests

{{mlp_customtask}}: topk($Topk, sum(mlp_customtask_instance_info{mlp_resource_queue=~"$ResourceQueue"} * ON(pod) group_right(mlp_customtask) mlp_instance_resource_requests{resource="MEMORY"}) by (mlp_customtask))

Top k online services in the queue by memory requests

{{mlp_deployment}}: topk($Topk, sum(mlp_deployment_instance_info{mlp_resource_queue=~"$ResourceQueue"} * ON(pod) group_right(mlp_deployment) mlp_instance_resource_requests{resource="MEMORY"}) by (mlp_deployment))

Top k development machines in the queue by memory requests

{{mlp_devinstance}}: topk($Topk, sum(mlp_devinstance_info{mlp_resource_queue=~"$ResourceQueue"} * ON(pod) group_right(mlp_devinstance) mlp_instance_resource_requests{resource="MEMORY"}) by (mlp_devinstance))