Last updated: 2023.10.17 17:41:44
First published: 2023.04.23 14:32:55
The Machine Learning Platform provides monitoring dashboards and a large set of preset metrics for common workloads, but these may still fall short of some users' customization needs; the most common case is deriving new metrics by aggregating existing base metrics. To address this, the platform lets users archive monitoring data to the managed Prometheus service (VMP) under their own Volcano Engine account, where the archived data can be queried and used freely.
The current user has the MLPlatformAdminAccess IAM policy (see Permission Management for how to configure policies).
The current account has at least one VMP workspace; a user with the VMPFullAccess IAM policy can create one on the VMP console page.
Go to the [Global Configuration] - [Monitoring] module of the ML platform and click [Authorize] to open the configuration page.
Enable monitoring archival and set an appropriate VMP workspace as the archive destination.
After the form is submitted, monitoring data from newly created workloads is pushed to the chosen VMP workspace, where you can process it freely.
Once monitoring data has been archived to your VMP, every metric can be viewed in full, and the metrics can also be used to configure alerts in VMP. For the complete metric and label lists, see the metric and label reference below.
Go to the [Explore] module of VMP and select the workspace used for monitoring archival.
Enter a metric name or a PromQL query in the query box.
Labels and their values allow even finer-grained filtering.
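The same queries can also be issued programmatically against the workspace's Prometheus-compatible HTTP API. The sketch below is a minimal illustration, not part of the platform: the base URL is a placeholder (use the Query URL from your workspace details page, with Basic Auth), and it only builds the request URL and parses a standard `/api/v1/query` JSON response.

```python
import json
from urllib.parse import urlencode

def build_query_url(base_url, promql):
    """Build a Prometheus-compatible instant-query URL (/api/v1/query)."""
    return f"{base_url}/api/v1/query?{urlencode({'query': promql})}"

def extract_series(response_body):
    """Return (labels, value) pairs from a Prometheus query response."""
    data = json.loads(response_body)
    if data.get("status") != "success":
        raise RuntimeError(f"query failed: {data}")
    return [(r["metric"], r["value"][1]) for r in data["data"]["result"]]

# Placeholder endpoint; in practice fetch the URL with Basic Auth,
# e.g. requests.get(url, auth=(user, password)).
url = build_query_url(
    "https://vmp.example.com",  # placeholder for the workspace Query URL
    'avg by (task_id) (DCGM_FI_DEV_GPU_UTIL{task_type="mljob"})',
)

# A sample response body in the standard Prometheus vector format.
sample = ('{"status":"success","data":{"resultType":"vector","result":'
          '[{"metric":{"task_id":"t-20230519123538-vr2nl"},'
          '"value":[1697527304,"42.5"]}]}}')
print(extract_series(sample))
```

The parsed `(labels, value)` pairs can then be aggregated or exported however you like.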
Grafana is a cross-platform, open-source metrics analysis and visualization tool that queries collected data and presents it visually. You can deploy Grafana from the image provided by Volcano Engine or from an open-source image. Two deployment options are available:
Option 1: Deploy Grafana on Volcano Engine Container Service (VKE); see VMP - Deploy Grafana.
Option 2: Deploy Grafana on a Volcano Engine Cloud Server (ECS) instance.
Create an instance in ECS and install Grafana on it.
Bind a public IP to the ECS instance and configure a security group inbound rule that allows the Grafana port (3000 by default).
In a browser, visit `http://<public IP>:<Grafana port>`; the initial username and password are both `admin`.
In Grafana's settings, add a data source: choose Prometheus and fill in the VMP Basic Auth credentials and Query URL (both can be found on the VMP workspace details page).
Because all monitoring data is pushed to your VMP, you can write PromQL queries to configure alert rules flexibly; when a workload metric on the ML platform becomes abnormal, an alert message is sent to the corresponding alert contact group in time. For detailed steps, see Create Alert Rules.
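A PromQL alert rule ultimately evaluates a comparison such as `metric > bound` for each series. As a purely illustrative sketch (not part of VMP; the labels and threshold below are made up), the helper performs the same check in Python against parsed query results:

```python
def breached(series, threshold):
    """Return the label sets whose value exceeds the threshold,
    mirroring a PromQL alert expression like `metric > threshold`."""
    return [labels for labels, value in series if float(value) > threshold]

# Hypothetical query result: per-task GPU memory utilization in percent.
series = [
    ({"task_id": "t-a"}, "97.3"),
    ({"task_id": "t-b"}, "41.0"),
]
print(breached(series, 95))  # [{'task_id': 't-a'}]
```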
CPU usage.
irate(container_cpu_usage_seconds_total{container!=""}[5m])
CPU utilization.
irate(container_cpu_usage_seconds_total{container!=""}[5m])/on (pod,container) (container_spec_cpu_quota/1000/100)*100
GPU memory utilization.
avg by(gpu, pod)(DCGM_FI_DEV_FB_USED{pod="%s"} / (DCGM_FI_DEV_FB_FREE{pod="%s"} + DCGM_FI_DEV_FB_USED{pod="%s"}) * 100)
Average GPU utilization of each custom task (a custom task may have multiple pods, so aggregate by task ID).
avg by (task_id) (DCGM_FI_DEV_GPU_UTIL{task_type="mljob"})
Average GPU utilization of each custom task in a given queue (q-20211224114108-78qvl) over the last 10 days (i.e., each task's average utilization over its lifetime).
avg_over_time(avg by (task_id, user_id) ( DCGM_FI_DEV_GPU_UTIL{task_type="mljob",resource_queue_id="q-20211224114108-78qvl"} )[10d:30s])
Task ID | User | GPU utilization |
---|---|---|
t-202209xxx | user1 | 50% |
t-202209xxx | user2 | 100% |
... | ... | ... |
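Rows like the table above can be produced directly from the query result. A minimal sketch (the series data below is illustrative, shaped like the label/value pairs of a Prometheus API response):

```python
def to_table(series):
    """Render (labels, value) pairs as 'Task ID | User | GPU utilization' rows."""
    rows = ["Task ID | User | GPU utilization"]
    for labels, value in series:
        rows.append(f"{labels['task_id']} | {labels['user_id']} | {float(value):.0f}%")
    return "\n".join(rows)

# Illustrative result of the avg_over_time query above.
series = [
    ({"task_id": "t-202209xxx", "user_id": "user1"}, "50"),
    ({"task_id": "t-202209yyy", "user_id": "user2"}, "100"),
]
print(to_table(series))
```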
GPU usage time of a given task (t-20230519123538-vr2nl) over the last 10 days.
count_over_time(DCGM_FI_DEV_GPU_UTIL{task_id="t-20230519123538-vr2nl",task_type="mljob"} [10d:1m])
Total card time used by each GPU model over the last 5 days, broken down by user and task.
`count_over_time()`: counts the number of samples within the specified time range; in this example it counts the per-minute samples over the past 5 days.
`DCGM_FI_DEV_GPU_UTIL{task_type="mljob"}`: the DCGM_FI_DEV_GPU_UTIL metric is used to measure usage time, restricted to series whose task_type label is mljob (custom tasks).
`sum by (user_id, task_id, modelName)`: data points sharing the same user_id, task_id, and modelName label values have their counts summed into one total. modelName is a built-in DCGM label identifying the GPU model.
sum by (user_id, task_id, modelName) (count_over_time(DCGM_FI_DEV_GPU_UTIL{task_type="mljob"} [5d:1m]))
Total card time used by each GPU model over the last 5 days, broken down by user.
sum by (user_id, modelName) (count_over_time(DCGM_FI_DEV_GPU_UTIL{task_type="mljob"} [5d:1m]))
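Note that `count_over_time(...[5d:1m])` counts one-minute samples, so the raw result is card-minutes; dividing by 60 converts it to card-hours. A quick sketch of the conversion (the sample counts below are made up):

```python
def samples_to_card_hours(sample_count, step_seconds=60):
    """Convert a count_over_time result (samples at a fixed subquery step,
    1 minute by default) into GPU card-hours."""
    return sample_count * step_seconds / 3600

# A GPU that reported 7200 one-minute samples over the 5-day window:
print(samples_to_card_hours(7200))  # 120.0 card-hours
```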
Monitoring metrics carry a set of common labels used for filtering and aggregation; these fall into user labels and system labels.
Adding labels: any information that can be expressed as a pod label or pod annotation can be added as a common label on the monitoring metrics.
Removing or renaming labels: every label except cloud_account_id and apig_workspace_id can be removed or renamed.
key | Meaning | Example |
---|---|---|
flavor_id | Compute spec used by the Pod | ml.g1.xlarge |
pod | Pod name | |
k8s_pod_start_time | Time the Pod started running | 2023-04-12 07:18:57 +0000 UTC |
resource_group_id | Resource group ID | r-20230329213646-d5rpk |
resource_queue_id | Resource queue ID | q-20230329213721-qn9fp |
task_id | Task ID (dev machine ID, custom task ID, or online service ID) | |
task_type | Task type | |
user_id | Sub-user ID | |
mljob_task_name | Custom tasks (task_type="mljob") only; name of the sub-task (role) the pod plays within the custom task | "worker", "ps" |
mljob_replica_index | Custom tasks (task_type="mljob") only; index of the pod within its sub-task (role) | "0", "1" |
key | Meaning | Example |
---|---|---|
cloud_account_id | Primary account ID | 2100000050 |
job | Metric source: cadvisor or dcgm (GPU metrics) | |
apig_workspace_id | ID of the monitoring workspace the data is delivered to | b98624b3-a809-4da7-b619-eea6c745dea2 |
k8s_namespace_name | Kubernetes cluster namespace | mlplatform-customtask |
k8s_node_name | Name of the node the pod runs on | 192.168.0.105 |
k8s_pod_name_namespace | Pod name + namespace, a unique identifier | t-20230412151840-j7mn9-worker-0.mlplatform-customtask |
k8s_pod_name | Pod name | |
Container runtime statistics exposed by kubelet, mainly covering non-GPU metrics such as CPU, memory, and network.
The following metrics are reported per container. Online services, dev machines, and custom tasks all use single-container Pods, so each metric yields three series per Pod:
The Pod as a whole
The infra container: pause
The user's main business container: this is the only container whose monitoring data users need to care about; its name varies by workload type.
Dev machine main container name: main.
Custom task main container name: mljob.
Online service main container name: {task_id}, where task_id is a user label.
The main container's data can be filtered with `{container!=""}`, e.g. `container_cpu_usage_seconds_total{container!=""}`.
Metric | Type | Unit | Description |
---|---|---|---|
container_cpu_cfs_periods_total | Counter | - | Number of elapsed enforcement period intervals |
container_cpu_cfs_throttled_periods_total | Counter | - | Number of throttled period intervals |
container_cpu_cfs_throttled_seconds_total | Counter | seconds | Total time duration the container has been throttled |
container_cpu_load_average_10s | Gauge | - | Value of container cpu load average over the last 10 seconds |
container_cpu_system_seconds_total | Counter | seconds | Cumulative system cpu time consumed |
container_cpu_usage_seconds_total | Counter | seconds | Cumulative cpu time consumed |
container_cpu_user_seconds_total | Counter | seconds | Cumulative user cpu time consumed |
container_file_descriptors | Gauge | - | Number of open file descriptors for the container |
container_fs_reads_bytes_total | Counter | bytes | Cumulative count of bytes read |
container_fs_reads_total | Counter | - | Cumulative count of reads completed |
container_fs_writes_bytes_total | Counter | bytes | Cumulative count of bytes written |
container_fs_writes_total | Counter | - | Cumulative count of writes completed |
container_last_seen | Gauge | timestamp | Last time a container was seen by the exporter |
container_memory_cache | Gauge | bytes | Total page cache memory |
container_memory_failcnt | Counter | - | Number of memory usage hits limits |
container_memory_failures_total | Counter | - | Cumulative count of memory allocation failures |
container_memory_mapped_file | Gauge | bytes | Size of memory mapped files |
container_memory_max_usage_bytes | Gauge | bytes | Maximum memory usage recorded |
container_memory_rss | Gauge | bytes | Size of RSS |
container_memory_swap | Gauge | bytes | Container swap usage |
container_memory_usage_bytes | Gauge | bytes | Current memory usage, including all memory regardless of when it was accessed |
container_memory_working_set_bytes | Gauge | bytes | Current working set |
container_network_receive_bytes_total | Counter | bytes | Cumulative count of bytes received |
container_network_receive_errors_total | Counter | - | Cumulative count of errors encountered while receiving |
container_network_receive_packets_dropped_total | Counter | - | Cumulative count of packets dropped while receiving |
container_network_receive_packets_total | Counter | - | Cumulative count of packets received |
container_network_transmit_bytes_total | Counter | bytes | Cumulative count of bytes transmitted |
container_network_transmit_errors_total | Counter | - | Cumulative count of errors encountered while transmitting |
container_network_transmit_packets_dropped_total | Counter | - | Cumulative count of packets dropped while transmitting |
container_network_transmit_packets_total | Counter | - | Cumulative count of packets transmitted |
container_processes | Gauge | - | Number of processes running inside the container |
container_sockets | Gauge | - | Number of open sockets for the container |
container_spec_cpu_period | Gauge | - | CPU period of the container |
container_spec_cpu_quota | Gauge | - | CPU quota of the container |
container_spec_cpu_shares | Gauge | - | CPU share of the container |
container_spec_memory_limit_bytes | Gauge | bytes | Memory limit for the container |
container_spec_memory_reservation_limit_bytes | Gauge | bytes | Memory reservation limit for the container |
container_spec_memory_swap_limit_bytes | Gauge | bytes | Memory swap limit for the container |
container_start_time_seconds | Gauge | seconds | Start time of the container since unix epoch |
container_tasks_state | Gauge | - | Number of tasks in given state (sleeping, running, stopped, uninterruptible, or ioawaiting) |
container_threads | Gauge | - | Number of threads running inside the container |
container_threads_max | Gauge | - | Maximum number of threads allowed inside the container |
container_ulimits_soft | Gauge | - | Soft ulimit values for the container root process. Unlimited if -1, except priority and nice |
GPU-related metrics collected by DCGM (NVIDIA Data Center GPU Manager), such as utilization, SM activity, and PCIe/NVLink traffic rates. For more details on these metrics, see Profiling Metrics and Field Identifiers.
Metric | Type | Unit | Description |
---|---|---|---|
DCGM_FI_DEV_DEC_UTIL | Gauge | % | Decoder Utilization |
DCGM_FI_DEV_ENC_UTIL | Gauge | % | Encoder Utilization |
DCGM_FI_DEV_GPU_UTIL | Gauge | % | GPU Utilization |
DCGM_FI_DEV_MEM_COPY_UTIL | Gauge | % | Memory Utilization |
Metric | Type | Unit | Description |
---|---|---|---|
DCGM_FI_DEV_FB_FREE | Gauge | MiB | Framebuffer memory free |
DCGM_FI_DEV_FB_USED | Gauge | MiB | Framebuffer memory used |
Metric | Type | Unit | Description |
---|---|---|---|
DCGM_FI_PROF_DRAM_ACTIVE | Gauge | % | The fraction of cycles where data was sent to or received from device memory. The value represents an average over a time interval and is not an instantaneous value. Higher values indicate higher utilization of device memory. An activity of 1 (100%) is equivalent to a DRAM instruction every cycle over the entire time interval (in practice a peak of ~0.8 (80%) is the maximum achievable). An activity of 0.2 (20%) indicates that 20% of the cycles are reading from or writing to device memory over the time interval. |
DCGM_FI_PROF_GR_ENGINE_ACTIVE | Gauge | % | The fraction of time any portion of the graphics or compute engines were active. The graphics engine is active if a graphics/compute context is bound and the graphics/compute pipe is busy. The value represents an average over a time interval and is not an instantaneous value. |
DCGM_FI_PROF_PCIE_TX_BYTES DCGM_FI_PROF_PCIE_RX_BYTES | Counter | B/s | The rate of data transmitted / received over the PCIe bus, including both protocol headers and data payloads, in bytes per second. The value represents an average over a time interval and is not an instantaneous value. The rate is averaged over the time interval. For example, if 1 GB of data is transferred over 1 second, the rate is 1 GB/s regardless of the data transferred at a constant rate or in bursts. The theoretical maximum PCIe Gen3 bandwidth is 985 MB/s per lane. |
DCGM_FI_PROF_PIPE_FP16_ACTIVE (Note: not available on A10 GPUs) | Gauge | % | The fraction of cycles the FP16 (half precision) pipe was active. The value represents an average over a time interval and is not an instantaneous value. Higher values indicate higher utilization of the FP16 cores. An activity of 1 (100%) is equivalent to a FP16 instruction every other cycle over the entire time interval. An activity of 0.2 (20%) could indicate 20% of the SMs are at 100% utilization over the entire time period, 100% of the SMs are at 20% utilization over the entire time period, 100% of the SMs are at 100% utilization for 20% of the time period, or any combination in between (see DCGM_FI_PROF_SM_ACTIVE to help disambiguate these possibilities). |
DCGM_FI_PROF_PIPE_FP32_ACTIVE (Note: not available on A10 GPUs) | Gauge | % | The fraction of cycles the FMA (FP32 (single precision), and integer) pipe was active. The value represents an average over a time interval and is not an instantaneous value. Higher values indicate higher utilization of the FP32 cores. An activity of 1 (100%) is equivalent to a FP32 instruction every other cycle over the entire time interval. An activity of 0.2 (20%) could indicate 20% of the SMs are at 100% utilization over the entire time period, 100% of the SMs are at 20% utilization over the entire time period, 100% of the SMs are at 100% utilization for 20% of the time period, or any combination in between (see DCGM_FI_PROF_SM_ACTIVE to help disambiguate these possibilities). |
DCGM_FI_PROF_PIPE_FP64_ACTIVE (Note: not available on A10 GPUs) | Gauge | % | The fraction of cycles the FP64 (double precision) pipe was active. The value represents an average over a time interval and is not an instantaneous value. Higher values indicate higher utilization of the FP64 cores. An activity of 1 (100%) is equivalent to a FP64 instruction on every SM every fourth cycle on Volta over the entire time interval. An activity of 0.2 (20%) could indicate 20% of the SMs are at 100% utilization over the entire time period, 100% of the SMs are at 20% utilization over the entire time period, 100% of the SMs are at 100% utilization for 20% of the time period, or any combination in between (see DCGM_FI_PROF_SM_ACTIVE to help disambiguate these possibilities). |
DCGM_FI_PROF_PIPE_TENSOR_ACTIVE | Gauge | % | The fraction of cycles the tensor (HMMA / IMMA) pipe was active. The value represents an average over a time interval and is not an instantaneous value. Higher values indicate higher utilization of the Tensor Cores. An activity of 1 (100%) is equivalent to issuing a tensor instruction every other cycle for the entire time interval. An activity of 0.2 (20%) could indicate 20% of the SMs are at 100% utilization over the entire time period, 100% of the SMs are at 20% utilization over the entire time period, 100% of the SMs are at 100% utilization for 20% of the time period, or any combination in between (see DCGM_FI_PROF_SM_ACTIVE to help disambiguate these possibilities). |
DCGM_FI_PROF_SM_ACTIVE | Gauge | % | The fraction of time at least one warp was active on a multiprocessor, averaged over all multiprocessors. Note that “active” does not necessarily mean a warp is actively computing. For instance, warps waiting on memory requests are considered active. The value represents an average over a time interval and is not an instantaneous value. A value of 0.8 or greater is necessary, but not sufficient, for effective use of the GPU. A value less than 0.5 likely indicates ineffective GPU usage. Given a simplified GPU architectural view, if a GPU has N SMs then a kernel using N blocks that runs over the entire time interval will correspond to an activity of 1 (100%). A kernel using N/5 blocks that runs over the entire time interval will correspond to an activity of 0.2 (20%). A kernel using N blocks that runs over one fifth of the time interval, with the SMs otherwise idle, will also have an activity of 0.2 (20%). The value is insensitive to the number of threads per block (see DCGM_FI_PROF_SM_OCCUPANCY). |
Metric | Type | Unit | Description |
---|---|---|---|
DCGM_FI_DEV_SM_CLOCK | Gauge | MHz | SM clock for the device |
DCGM_FI_DEV_MEM_CLOCK | Gauge | MHz | Memory clock for the device |
Metric | Type | Unit | Description |
---|---|---|---|
DCGM_FI_DEV_XID_ERRORS | Gauge | - | XID errors. The value is the specific XID error |
Metric | Type | Unit | Description |
---|---|---|---|
DCGM_FI_DEV_MEMORY_TEMP | Gauge | C | Memory temperature for the device |
DCGM_FI_DEV_GPU_TEMP | Gauge | C | Current temperature readings for the device, in degrees C |
DCGM_FI_DEV_POWER_USAGE | Gauge | W | Power usage for the device in Watts |
DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION | Counter | mJ | Total energy consumption for the GPU in mJ since the driver was last reloaded |
Metric | Type | Unit | Description |
---|---|---|---|
DCGM_FI_DEV_UNCORRECTABLE_REMAPPED_ROWS | Counter | - | Number of remapped rows for uncorrectable errors |
DCGM_FI_DEV_CORRECTABLE_REMAPPED_ROWS | Counter | - | Number of remapped rows for correctable errors |
DCGM_FI_DEV_ROW_REMAP_FAILURE | Gauge | - | Whether remapping of rows has failed |
Metric | Type | Unit | Description |
---|---|---|---|
DCGM_FI_DEV_PCIE_REPLAY_COUNTER | Counter | - | Total number of PCIe retries |
Metric | Type | Unit | Description |
---|---|---|---|
DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL | Counter | - | Total number of NVLink bandwidth counters for all lanes |