
Monitoring

Last updated: 2023.10.17 17:41:44

First published: 2023.04.23 14:32:55

The Machine Learning Platform provides monitoring dashboards for common workloads and ships with a large set of built-in metrics, but these may still fall short of some users' customization needs; the most common case is deriving new metrics by aggregating existing base metrics. To address this, the platform lets users archive monitoring data to the managed Prometheus service (VMP) under their own Volcengine account. Once archived, the data can be queried and used freely.

Related concepts
Prerequisites
  1. The current user has the MLPlatformAdminAccess IAM policy (see Permission Management for how to configure policies).

  2. The current account has at least one VMP workspace.

    • If no workspace has been created under the account, contact a user with the relevant permissions (e.g. the VMPFullAccess IAM policy) to create one on the VMP console page.
Configuring monitoring archival
  1. Go to the [Global Configuration] - [Monitoring] module of the Machine Learning Platform and click [Authorize] to enter the configuration page.

  2. Enable the monitoring archival feature and set an appropriate VMP workspace as the archive destination.

  3. After the form is submitted, monitoring data of newly created workloads will be pushed to the selected VMP workspace, where it can be processed freely.

Notes

  1. Monitoring data of custom tasks and online services created before archival was enabled will not be pushed to VMP; a dev instance, however, starts pushing after it is stopped and started again.
  2. After the archive destination is changed, existing custom tasks and online services keep pushing to the old destination; a dev instance pushes to the new destination after it is stopped and started again.
How to use the archived data

Once the monitoring data has been archived to your VMP, you can inspect every metric in full and also use these metrics to configure alerts in VMP. For the full list of metrics and labels, see the Metrics and labels section below.

Browsing metrics with VMP Explore

  1. Go to the [Explore] module of VMP and select the workspace used for monitoring archival.

  2. Enter a metric name or a PromQL query in the query box.

  3. Labels and values allow even finer-grained filtering.
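Besides the Explore UI, the archived data can also be queried programmatically through the standard Prometheus HTTP API. The sketch below only builds the request; the Query URL and credentials are placeholders and must be taken from the VMP workspace details page.

```python
# Sketch: querying archived metrics through the standard Prometheus HTTP API
# (/api/v1/query). The Query URL below is a placeholder; use the real Query URL
# and Basic Auth credentials from your VMP workspace details page.

def build_instant_query(query_url: str, promql: str) -> tuple[str, dict]:
    """Return the endpoint and parameters for an instant PromQL query."""
    return f"{query_url.rstrip('/')}/api/v1/query", {"query": promql}

url, params = build_instant_query(
    "https://query.example-vmp.volces.com",  # placeholder Query URL
    'container_cpu_usage_seconds_total{container!=""}',
)
# An actual call would look like:
#   import requests
#   resp = requests.get(url, params=params, auth=("<user>", "<password>"))
#   series = resp.json()["data"]["result"]
```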

Building your own Grafana dashboards

Grafana is a cross-platform, open-source metrics analysis and visualization tool that queries collected data and presents it visually. You can deploy Grafana either from the image Volcengine provides or from the open-source image. The deployment options are:

  • Method 1: deploy Grafana in Volcengine Container Service (VKE); see VMP - Deploying Grafana.

  • Method 2: deploy Grafana on a Volcengine Elastic Compute Service (ECS) instance.

    • Create an ECS instance and install Grafana on it.

    • Bind a public IP to the ECS instance and configure security group rules: in the inbound rules, allow the Grafana port (3000 by default).

    • Open http://<public IP>:<Grafana port> in a browser; the initial username and password are both admin.

    • In the Grafana settings, add a data source, choose Prometheus, and fill in the VMP Basic Auth credentials and Query URL (both available on the VMP workspace details page).
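The data source step can also be automated through Grafana's HTTP API (POST /api/datasources). The sketch below only builds the request body; the URL and credentials are placeholders for the values from your VMP workspace details page.

```python
# Sketch: registering the VMP workspace as a Prometheus data source via
# Grafana's HTTP API. All URLs and credentials here are placeholders.

def vmp_datasource_payload(query_url: str, user: str, password: str) -> dict:
    """Build the JSON body Grafana expects for a Basic Auth-protected
    Prometheus data source."""
    return {
        "name": "vmp-archive",    # any display name
        "type": "prometheus",
        "url": query_url,         # the VMP Query URL
        "access": "proxy",        # the Grafana server proxies the queries
        "basicAuth": True,
        "basicAuthUser": user,
        "secureJsonData": {"basicAuthPassword": password},
    }

payload = vmp_datasource_payload(
    "https://query.example-vmp.volces.com", "<user>", "<password>")
# An actual call would look like:
#   import requests
#   requests.post("http://<public IP>:3000/api/datasources",
#                 json=payload, auth=("admin", "admin"))
```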

Configuring alerts through VMP

Because all monitoring data is pushed to your VMP, you can configure alert rules flexibly by writing PromQL queries; when a metric of a Machine Learning Platform workload becomes abnormal, an alert message can be sent promptly to the corresponding alert contact group. For the detailed steps, see Creating Alert Rules.
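As one hedged illustration, an alert expression can be assembled from the labels documented below; the threshold and time window here are arbitrary examples, not recommended values.

```python
# Sketch: building a PromQL expression for a VMP alert rule that fires when a
# custom task's average GPU utilization stays below a threshold. The label
# names (task_id, task_type) come from the label tables in this document; the
# threshold and window are illustrative placeholders.

def low_gpu_util_expr(threshold: int, window: str = "30m") -> str:
    return (
        f"avg by (task_id) "
        f'(avg_over_time(DCGM_FI_DEV_GPU_UTIL{{task_type="mljob"}}[{window}]))'
        f" < {threshold}"
    )

print(low_gpu_util_expr(10))
```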

PromQL query examples

  • CPU usage.

    irate(container_cpu_usage_seconds_total{container!=""}[5m])
    
  • CPU utilization.

    irate(container_cpu_usage_seconds_total{container!=""}[5m])/on (pod,container) (container_spec_cpu_quota/1000/100)*100
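The denominator above converts the CFS quota into a core count: container_spec_cpu_quota is expressed in microseconds per scheduling period, and with the default 100 ms (100000 us) period, quota / 100000 equals the number of allocated cores, which the query writes as quota / 1000 / 100.

```python
# Sketch: why container_spec_cpu_quota / 1000 / 100 equals the allocated core
# count, assuming the default 100 ms (100000 us) CFS period.

def allocated_cores(cpu_quota_us: float, cpu_period_us: float = 100_000) -> float:
    return cpu_quota_us / cpu_period_us

# A pod limited to 4 cores has quota 400000 with the default period:
print(allocated_cores(400_000))  # 4.0
```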
    
  • GPU memory utilization.

    avg by(gpu, pod)(DCGM_FI_DEV_FB_USED{pod="%s"} / (DCGM_FI_DEV_FB_FREE{pod="%s"} + DCGM_FI_DEV_FB_USED{pod="%s"}) * 100)
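The query above keeps Go-style %s placeholders for the pod name. A small sketch of substituting the same pod id into all three occurrences before sending the query (the pod id shown is a placeholder):

```python
# Sketch: filling the "%s" pod-name placeholders in the GPU memory utilization
# query template. The pod id used here is a placeholder example.

TEMPLATE = (
    'avg by(gpu, pod)(DCGM_FI_DEV_FB_USED{pod="%s"} / '
    '(DCGM_FI_DEV_FB_FREE{pod="%s"} + DCGM_FI_DEV_FB_USED{pod="%s"}) * 100)'
)

def fill_pod(template: str, pod: str) -> str:
    # Substitute the same pod name into every placeholder.
    return template % ((pod,) * template.count("%s"))

query = fill_pod(TEMPLATE, "t-20230412151840-j7mn9-worker-0")
```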
    
  • Average GPU utilization of each custom task (a custom task has multiple pods, so aggregate by task id).

    avg by (task_id) (DCGM_FI_DEV_GPU_UTIL{task_type="mljob"})
    
  • Average utilization of each custom task in a queue (q-20211224114108-78qvl) over the past 10 days (i.e. each task's average utilization over its lifetime).

    avg_over_time(avg by (task_id, user_id) (
                    DCGM_FI_DEV_GPU_UTIL{task_type="mljob",resource_queue_id="q-20211224114108-78qvl"}
                )[10d:30s])
    
    • The result is a report like the one below: because the avg is taken by (task_id, user_id), only the task_id and user_id labels remain.

      | Task ID | User | GPU utilization |
      | --- | --- | --- |
      | t-202209xxx | user1 | 50% |
      | t-202209xxx | user2 | 100% |
      | ... | ... | ... |
  • GPU usage time of a task (t-20230519123538-vr2nl) over the past 10 days.

    • The monitoring chart shows one data curve per GPU the task used.
    count_over_time(DCGM_FI_DEV_GPU_UTIL{task_id="t-20230519123538-vr2nl",task_type="mljob"} [10d:1m])
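With a 1-minute subquery step, count_over_time yields one sample per minute per GPU, so the sample count converts directly into usage time:

```python
# Sketch: converting a count_over_time result (one sample per minute per GPU)
# into hours of GPU usage.

def samples_to_hours(sample_count: int, step_minutes: int = 1) -> float:
    return sample_count * step_minutes / 60

# 14400 one-minute samples over 10 days correspond to 240 hours of GPU time:
print(samples_to_hours(14_400))  # 240.0
```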
    
  • Total card time per GPU model over the past 5 days, grouped by user and task.

    • Query breakdown:
      • count_over_time(): counts the samples within the given time range; here it counts the one-per-minute samples over the past 5 days.
      • DCGM_FI_DEV_GPU_UTIL{task_type="mljob"}: the DCGM_FI_DEV_GPU_UTIL metric is used to measure usage time, restricted to data whose task_type label is mljob (custom tasks).
      • sum by (user_id, task_id, modelName): data points sharing the same user_id, task_id, and modelName label values have their counts summed into one total.
        • modelName is a label provided by DCGM itself and denotes the GPU model.
    sum by (user_id, task_id, modelName) (count_over_time(DCGM_FI_DEV_GPU_UTIL{task_type="mljob"} [5d:1m]))
    
  • Total card time per GPU model over the past 5 days, grouped by user.

    sum by (user_id, modelName) (count_over_time(DCGM_FI_DEV_GPU_UTIL{task_type="mljob"} [5d:1m]))
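The same sum by (user_id, modelName) aggregation can be reproduced client-side on query results; a sketch with made-up series, where per-minute sample counts divide by 60 to give card hours:

```python
# Sketch: client-side equivalent of sum by (user_id, modelName). The series
# below are made-up illustrations; sample counts are per-minute, so dividing
# by 60 converts them to card hours.
from collections import defaultdict

series = [  # (user_id, modelName, minute_sample_count)
    ("448569", "A100-SXM4-80GB", 6_000),
    ("448569", "A100-SXM4-80GB", 1_200),
    ("448570", "A30", 600),
]

card_hours: dict = defaultdict(float)
for user_id, model, minutes in series:
    card_hours[(user_id, model)] += minutes / 60

print(dict(card_hours))
# {('448569', 'A100-SXM4-80GB'): 120.0, ('448570', 'A30'): 10.0}
```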
    

Metrics and labels

The monitoring metrics carry a set of common labels for filtering and aggregating data. Labels fall into two groups: user labels and system labels.

  • Adding labels: any information that can be attached as a pod label or pod annotation can also be added as a common label on the monitoring metrics.

  • Removing / renaming labels: every label except cloud_account_id and apig_workspace_id can be removed or renamed.

User labels

| key | Meaning | Example |
| --- | --- | --- |
| flavor_id | Compute spec used by the pod | ml.g1.xlarge |
| pod | Pod name | Custom task: t-20230412151840-j7mn9-worker-0; dev instance: di-20230309102205-snxfm; online service: s-20230308115910-8z4lh-67b9b5886-njrgc |
| k8s_pod_start_time | Pod start time | 2023-04-12 07:18:57 +0000 UTC |
| resource_group_id | Resource group id | r-20230329213646-d5rpk |
| resource_queue_id | Resource queue id | q-20230329213721-qn9fp |
| task_id | Task id (dev instance id, custom task id, or online service id) | Custom task: t-20230412151840-j7mn9; dev instance: di-20230309102205-snxfm; online service: s-20230308115910-8z4lh |
| task_type | Task type | Custom task: mljob; dev instance: devinstance; online service: inference |
| user_id | Sub-user id | Root account user: 0 (fixed value); sub-account user: e.g. 448569 |
| mljob_task_name | Custom tasks (task_type="mljob") only; name of the subtask (role) the pod belongs to within the task | "worker", "ps" |
| mljob_replica_index | Custom tasks (task_type="mljob") only; index of the pod within its subtask (role) | "0", "1" |

System labels

| key | Meaning | Example |
| --- | --- | --- |
| cloud_account_id | Root account id | 2100000050 |
| job / service_name | Metric source: cadvisor or dcgm (GPU metrics) | kubernetes-cadvisor (fixed value); dcgm-exporter (fixed value) |
| apig_workspace_id | Id of the workspace the monitoring data is archived to | b98624b3-a809-4da7-b619-eea6c745dea2 |
| k8s_namespace_name / namespace | K8s cluster namespace | mlplatform-customtask |
| k8s_node_name / net_host_name / instance | Name of the node the pod runs on | 192.168.0.105 |
| k8s_pod_name_namespace | Pod name + namespace, a unique identifier | t-20230412151840-j7mn9-worker-0.mlplatform-customtask |
| k8s_pod_name | Pod name | |

Cadvisor metrics

Container runtime statistics exposed by the kubelet, mainly covering non-GPU metrics such as CPU, memory, and network.
The metrics below are reported per container. Online services, dev instances, and custom tasks all run single-container pods, so each metric yields three series per pod:

  • the pod as a whole

  • the infra container: pause

  • the user's main business container: this is the only series users need to care about; the main container name differs by workload type.

    • Dev instance main container name: main.

    • Custom task main container name: mljob.

    • Online service main container name: {task_id}, where task_id is the user label described above.

  • The filter {container!=""} selects the main container's data, e.g. container_cpu_usage_seconds_total{container!=""}.

| Metric | Type | Unit | Meaning |
| --- | --- | --- | --- |
| container_cpu_cfs_periods_total | Counter | | Number of elapsed enforcement period intervals |
| container_cpu_cfs_throttled_periods_total | Counter | | Number of throttled period intervals |
| container_cpu_cfs_throttled_seconds_total | Counter | seconds | Total time duration the container has been throttled |
| container_cpu_load_average_10s | Gauge | | Value of container cpu load average over the last 10 seconds |
| container_cpu_system_seconds_total | Counter | seconds | Cumulative system cpu time consumed |
| container_cpu_usage_seconds_total | Counter | seconds | Cumulative cpu time consumed |
| container_cpu_user_seconds_total | Counter | seconds | Cumulative user cpu time consumed |
| container_file_descriptors | Gauge | | Number of open file descriptors for the container |
| container_fs_reads_bytes_total | Counter | bytes | Cumulative count of bytes read |
| container_fs_reads_total | Counter | | Cumulative count of reads completed |
| container_fs_writes_bytes_total | Counter | bytes | Cumulative count of bytes written |
| container_fs_writes_total | Counter | | Cumulative count of writes completed |
| container_last_seen | Gauge | timestamp | Last time a container was seen by the exporter |
| container_memory_cache | Gauge | bytes | Total page cache memory |
| container_memory_failcnt | Counter | | Number of memory usage hits limits |
| container_memory_failures_total | Counter | | Cumulative count of memory allocation failures |
| container_memory_mapped_file | Gauge | bytes | Size of memory mapped files |
| container_memory_max_usage_bytes | Gauge | bytes | Maximum memory usage recorded |
| container_memory_rss | Gauge | bytes | Size of RSS |
| container_memory_swap | Gauge | bytes | Container swap usage |
| container_memory_usage_bytes | Gauge | bytes | Current memory usage, including all memory regardless of when it was accessed |
| container_memory_working_set_bytes | Gauge | bytes | Current working set |
| container_network_receive_bytes_total | Counter | bytes | Cumulative count of bytes received |
| container_network_receive_errors_total | Counter | | Cumulative count of errors encountered while receiving |
| container_network_receive_packets_dropped_total | Counter | | Cumulative count of packets dropped while receiving |
| container_network_receive_packets_total | Counter | | Cumulative count of packets received |
| container_network_transmit_bytes_total | Counter | bytes | Cumulative count of bytes transmitted |
| container_network_transmit_errors_total | Counter | | Cumulative count of errors encountered while transmitting |
| container_network_transmit_packets_dropped_total | Counter | | Cumulative count of packets dropped while transmitting |
| container_network_transmit_packets_total | Counter | | Cumulative count of packets transmitted |
| container_processes | Gauge | | Number of processes running inside the container |
| container_sockets | Gauge | | Number of open sockets for the container |
| container_spec_cpu_period | Gauge | | CPU period of the container |
| container_spec_cpu_quota | Gauge | | CPU quota of the container |
| container_spec_cpu_shares | Gauge | | CPU share of the container |
| container_spec_memory_limit_bytes | Gauge | bytes | Memory limit for the container |
| container_spec_memory_reservation_limit_bytes | Gauge | bytes | Memory reservation limit for the container |
| container_spec_memory_swap_limit_bytes | Gauge | bytes | Memory swap limit for the container |
| container_start_time_seconds | Gauge | seconds | Start time of the container since unix epoch |
| container_tasks_state | Gauge | | Number of tasks in given state (sleeping, running, stopped, uninterruptible, or ioawaiting) |
| container_threads | Gauge | | Number of threads running inside the container |
| container_threads_max | Gauge | | Maximum number of threads allowed inside the container |
| container_ulimits_soft | Gauge | | Soft ulimit values for the container root process. Unlimited if -1, except priority and nice |

DCGM metrics

GPU metrics collected by DCGM (NVIDIA Data Center GPU Manager), such as utilization, SM utilization, and PCIe / NVLink traffic rates. For more details on these metrics, see Profiling Metrics and Field Identifiers.

Utilization

| Metric | Type | Unit | Description |
| --- | --- | --- | --- |
| DCGM_FI_DEV_DEC_UTIL | Gauge | % | Decoder Utilization |
| DCGM_FI_DEV_ENC_UTIL | Gauge | % | Encoder Utilization |
| DCGM_FI_DEV_GPU_UTIL | Gauge | % | GPU Utilization |
| DCGM_FI_DEV_MEM_COPY_UTIL | Gauge | % | Memory Utilization |

Memory

| Metric | Type | Unit | Description |
| --- | --- | --- | --- |
| DCGM_FI_DEV_FB_FREE | Gauge | MiB | Free Frame Buffer in MB |
| DCGM_FI_DEV_FB_USED | Gauge | MiB | Used Frame Buffer in MB |

Profiling

| Metric | Type | Unit | Description |
| --- | --- | --- | --- |
| DCGM_FI_PROF_DRAM_ACTIVE | Gauge | % | The fraction of cycles where data was sent to or received from device memory. The value represents an average over a time interval and is not an instantaneous value. Higher values indicate higher utilization of device memory. An activity of 1 (100%) is equivalent to a DRAM instruction every cycle over the entire time interval (in practice a peak of ~0.8 (80%) is the maximum achievable). An activity of 0.2 (20%) indicates that 20% of the cycles are reading from or writing to device memory over the time interval. |
| DCGM_FI_PROF_GR_ENGINE_ACTIVE | Gauge | % | The fraction of time any portion of the graphics or compute engines were active. The graphics engine is active if a graphics/compute context is bound and the graphics/compute pipe is busy. The value represents an average over a time interval and is not an instantaneous value. |
| DCGM_FI_PROF_PCIE_TX_BYTES / DCGM_FI_PROF_PCIE_RX_BYTES | Counter | B/s | The rate of data transmitted / received over the PCIe bus, including both protocol headers and data payloads, in bytes per second. The value represents an average over a time interval and is not an instantaneous value. The rate is averaged over the time interval. For example, if 1 GB of data is transferred over 1 second, the rate is 1 GB/s regardless of the data transferred at a constant rate or in bursts. The theoretical maximum PCIe Gen3 bandwidth is 985 MB/s per lane. |
| DCGM_FI_PROF_PIPE_FP16_ACTIVE (Note: not available on A10 GPUs) | Gauge | % | The fraction of cycles the FP16 (half precision) pipe was active. The value represents an average over a time interval and is not an instantaneous value. Higher values indicate higher utilization of the FP16 cores. An activity of 1 (100%) is equivalent to a FP16 instruction every other cycle over the entire time interval. An activity of 0.2 (20%) could indicate 20% of the SMs are at 100% utilization over the entire time period, 100% of the SMs are at 20% utilization over the entire time period, 100% of the SMs are at 100% utilization for 20% of the time period, or any combination in between (see DCGM_FI_PROF_SM_ACTIVE to help disambiguate these possibilities). |
| DCGM_FI_PROF_PIPE_FP32_ACTIVE (Note: not available on A10 GPUs) | Gauge | % | The fraction of cycles the FMA (FP32 (single precision), and integer) pipe was active. The value represents an average over a time interval and is not an instantaneous value. Higher values indicate higher utilization of the FP32 cores. An activity of 1 (100%) is equivalent to a FP32 instruction every other cycle over the entire time interval. An activity of 0.2 (20%) could indicate 20% of the SMs are at 100% utilization over the entire time period, 100% of the SMs are at 20% utilization over the entire time period, 100% of the SMs are at 100% utilization for 20% of the time period, or any combination in between (see DCGM_FI_PROF_SM_ACTIVE to help disambiguate these possibilities). |
| DCGM_FI_PROF_PIPE_FP64_ACTIVE (Note: not available on A10 GPUs) | Gauge | % | The fraction of cycles the FP64 (double precision) pipe was active. The value represents an average over a time interval and is not an instantaneous value. Higher values indicate higher utilization of the FP64 cores. An activity of 1 (100%) is equivalent to a FP64 instruction on every SM every fourth cycle on Volta over the entire time interval. An activity of 0.2 (20%) could indicate 20% of the SMs are at 100% utilization over the entire time period, 100% of the SMs are at 20% utilization over the entire time period, 100% of the SMs are at 100% utilization for 20% of the time period, or any combination in between (see DCGM_FI_PROF_SM_ACTIVE to help disambiguate these possibilities). |
| DCGM_FI_PROF_PIPE_TENSOR_ACTIVE | Gauge | % | The fraction of cycles the tensor (HMMA / IMMA) pipe was active. The value represents an average over a time interval and is not an instantaneous value. Higher values indicate higher utilization of the Tensor Cores. An activity of 1 (100%) is equivalent to issuing a tensor instruction every other cycle for the entire time interval. An activity of 0.2 (20%) could indicate 20% of the SMs are at 100% utilization over the entire time period, 100% of the SMs are at 20% utilization over the entire time period, 100% of the SMs are at 100% utilization for 20% of the time period, or any combination in between (see DCGM_FI_PROF_SM_ACTIVE to help disambiguate these possibilities). |
| DCGM_FI_PROF_SM_ACTIVE | Gauge | % | The fraction of time at least one warp was active on a multiprocessor, averaged over all multiprocessors. Note that "active" does not necessarily mean a warp is actively computing. For instance, warps waiting on memory requests are considered active. The value represents an average over a time interval and is not an instantaneous value. A value of 0.8 or greater is necessary, but not sufficient, for effective use of the GPU. A value less than 0.5 likely indicates ineffective GPU usage. Given a simplified GPU architectural view, if a GPU has N SMs then a kernel using N blocks that runs over the entire time interval will correspond to an activity of 1 (100%). A kernel using N/5 blocks that runs over the entire time interval will correspond to an activity of 0.2 (20%). A kernel using N blocks that runs over one fifth of the time interval, with the SMs otherwise idle, will also have an activity of 0.2 (20%). The value is insensitive to the number of threads per block (see DCGM_FI_PROF_SM_OCCUPANCY). |

Clock

| Metric | Type | Unit | Description |
| --- | --- | --- | --- |
| DCGM_FI_DEV_SM_CLOCK | Gauge | MHz | SM clock for the device |
| DCGM_FI_DEV_MEM_CLOCK | Gauge | MHz | Memory clock for the device |

XidError & Violations

| Metric | Type | Unit | Description |
| --- | --- | --- | --- |
| DCGM_FI_DEV_XID_ERRORS | Gauge | - | XID errors. The value is the specific XID error |

Temperature & Power

| Metric | Type | Unit | Description |
| --- | --- | --- | --- |
| DCGM_FI_DEV_MEMORY_TEMP | Gauge | C | Memory temperature for the device |
| DCGM_FI_DEV_GPU_TEMP | Gauge | C | Current temperature readings for the device, in degrees C |
| DCGM_FI_DEV_POWER_USAGE | Gauge | W | Power usage for the device in Watts |
| DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION | Counter | mJ | Total energy consumption for the GPU in mJ since the driver was last reloaded |

Remapped Rows

| Metric | Type | Unit | Description |
| --- | --- | --- | --- |
| DCGM_FI_DEV_UNCORRECTABLE_REMAPPED_ROWS | Counter | - | Number of remapped rows for uncorrectable errors |
| DCGM_FI_DEV_CORRECTABLE_REMAPPED_ROWS | Counter | - | Number of remapped rows for correctable errors |
| DCGM_FI_DEV_ROW_REMAP_FAILURE | Gauge | - | Whether remapping of rows has failed |

PCIe

| Metric | Type | Unit | Description |
| --- | --- | --- | --- |
| DCGM_FI_DEV_PCIE_REPLAY_COUNTER | Counter | - | Total number of PCIe retries |

NVLink

| Metric | Type | Unit | Description |
| --- | --- | --- | --- |
| DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL | Counter | - | Total number of NVLink bandwidth counters for all lanes |

VGPU License Status

| Metric | Type | Unit | Description |
| --- | --- | --- | --- |
| DCGM_FI_DEV_VGPU_LICENSE_STATUS | Gauge | - | vGPU license status |