This document explains how to verify whether the current image supports RDMA. Following the steps below, you can check on both the A800 and H3C machine types whether a custom image is capable of RDMA communication.
Verifying RDMA capability ensures that a custom image can make full use of the hardware's RDMA features and avoids the communication-performance degradation caused by misconfiguration, keeping distributed training and similar workloads running efficiently.
Because the RDMA NIC hardware and virtualization implementations differ between the A800 and H3C machine types, each machine type has specific requirements for the driver versions and dependency libraries inside the image. Configure and verify the environment for the machine type you actually use, following the steps below.
Note
Mainstream training container images are mostly built on Ubuntu (this document uses Ubuntu 22.04 as an example). Before starting the configuration, confirm the OS release of the current environment.
Run the following command inside the container:
cat /etc/os-release
Example output:
# cat /etc/os-release
PRETTY_NAME="Ubuntu 22.04.5 LTS"
NAME="Ubuntu"
VERSION_ID="22.04"
VERSION="22.04.5 LTS (Jammy Jellyfish)"
VERSION_CODENAME=jammy
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=jammy
Record VERSION or VERSION_CODENAME so that you can pick the correct configuration in the later steps.
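If you prefer to capture these fields in a script rather than reading them by eye, a minimal sketch that sources the same file (assuming a standard /etc/os-release as shown above):
. /etc/os-release && echo "VERSION=${VERSION} VERSION_CODENAME=${VERSION_CODENAME}"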
Choose the configuration procedure that matches the machine type of the current instance.
Run the following command to install the InfiniBand diagnostic tools:
apt update && apt install -y infiniband-diags
Use the ibstatus command to check the physical state and rate of the NICs:
ibstatus
Expected result:
# ibstatus
Infiniband device 'mlx5_1' port 1 status:
        default gid:     0000:0000:0000:0000:0000:0000:0000:0000
        base lid:        0x0
        sm lid:          0x0
        state:           4: ACTIVE
        phys state:      5: LinkUp
        rate:            200 Gb/sec (4X HDR)
        link_layer:      Ethernet

Infiniband device 'mlx5_2' port 1 status:
        default gid:     0000:0000:0000:0000:0000:0000:0000:0000
        base lid:        0x0
        sm lid:          0x0
        state:           4: ACTIVE
        phys state:      5: LinkUp
        rate:            200 Gb/sec (4X HDR)
        link_layer:      Ethernet

Infiniband device 'mlx5_3' port 1 status:
        default gid:     0000:0000:0000:0000:0000:0000:0000:0000
        base lid:        0x0
        sm lid:          0x0
        state:           4: ACTIVE
        phys state:      5: LinkUp
        rate:            200 Gb/sec (4X HDR)
        link_layer:      Ethernet

Infiniband device 'mlx5_4' port 1 status:
        default gid:     0000:0000:0000:0000:0000:0000:0000:0000
        base lid:        0x0
        sm lid:          0x0
        state:           4: ACTIVE
        phys state:      5: LinkUp
        rate:            200 Gb/sec (4X HDR)
        link_layer:      Ethernet
As shown above, this example has four NICs (mlx5_1 - mlx5_4). The fields are the same for every NIC, so the explanation below uses mlx5_1 as an example.
default gid: the default Global ID; in this example it is shown as all zeros (an invalid identifier) for demonstration purposes.
base lid / sm lid: the Local ID and the Subnet Manager LID.
state: logical link state; ACTIVE means healthy.
phys state: physical link state; LinkUp means healthy.
rate: NIC rate in Gb/sec; the 4X HDR in parentheses means the HDR standard rate is reached over 4 lanes.
link_layer: the NIC's link-layer protocol; Ethernet indicates an Ethernet link.
In this example the NIC (mlx5_1) rate is 200 Gb/sec, which is as expected for the A800 machine type.
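To check the state and rate of all NICs at once instead of reading each block manually, a small loop like the following may help (a sketch, assuming the RDMA devices are exposed under /sys/class/infiniband as in this example):
for dev in /sys/class/infiniband/*; do
  echo "== $(basename "$dev") =="                       # device name, e.g. mlx5_1
  ibstatus "$(basename "$dev")" | grep -E 'state|rate'   # logical/physical state and rate
done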
Note
If the check below reports dpkg-query: no packages found matching for a package, that package is missing and needs to be installed; otherwise the existing packages can be used as-is, and the version numbers do not need to match this example.
Run the following command to check whether the RDMA-related dependency packages are installed:
dpkg -l perftest ibverbs-providers libibumad3 libibverbs1 libnl-3-200 libnl-route-3-200 librdmacm1 ibverbs-utils
# dpkg -l perftest ibverbs-providers libibumad3 libibverbs1 libnl-3-200 libnl-route-3-200 librdmacm1 ibverbs-utils
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name                    Version      Architecture Description
+++-=======================-============-============-===========================================================
ii  ibverbs-providers:amd64 39.0-1       amd64        User space provider drivers for libibverbs
ii  ibverbs-utils           39.0-1       amd64        Examples for the libibverbs library
ii  libibumad3:amd64        39.0-1       amd64        InfiniBand Userspace Management Datagram (uMAD) library
ii  libibverbs1:amd64       39.0-1       amd64        Library for direct userspace use of RDMA (InfiniBand/iWARP)
ii  libnl-3-200:amd64       3.5.0-0.1    amd64        library for dealing with netlink sockets
ii  libnl-route-3-200:amd64 3.5.0-0.1    amd64        library for dealing with netlink sockets - route interface
ii  librdmacm1:amd64        39.0-1       amd64        Library for managing RDMA connections
ii  perftest                4.4+0.37-1   amd64        Infiniband verbs performance tests
The output above lists the packages that are already installed (marked ii, e.g. ibverbs-providers:amd64 and libibverbs1:amd64).
If any of the packages are missing, install them with:
apt update && apt install -y perftest ibverbs-providers libibumad3 libibverbs1 libnl-3-200 libnl-route-3-200 librdmacm1
After the installation finishes, run the dependency check command again to make sure everything is installed.
dpkg -l perftest ibverbs-providers libibumad3 libibverbs1 libnl-3-200 libnl-route-3-200 librdmacm1 ibverbs-utils
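To list only the packages that are still missing instead of reading the full dpkg table, a minimal sketch:
for p in perftest ibverbs-providers libibumad3 libibverbs1 libnl-3-200 libnl-route-3-200 librdmacm1 ibverbs-utils; do
  dpkg -s "$p" >/dev/null 2>&1 || echo "missing: $p"   # dpkg -s fails for packages that are not installed
done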
How to enable GPUDirect RDMA (GDR) depends on the NCCL version:
NCCL versions earlier than 2.12: install the Sharp plugin to enable GDR (otherwise a performance loss of roughly 10% may occur). Run:
apt install automake autoconf libtool libibverbs-dev=28.0-1ubuntu1 libibverbs1=28.0-1ubuntu1
cd /tmp \
  && git clone https://github.com/Mellanox/nccl-rdma-sharp-plugins.git \
  && cd nccl-rdma-sharp-plugins \
  && ./autogen.sh \
  && ./configure --prefix=/usr/local/nccl-rdma-sharp-plugins --with-cuda=/usr/local/cuda \
  && make && make install \
  && rm -rf /tmp/nccl-rdma-sharp-plugins
export LD_LIBRARY_PATH="/usr/local/nccl-rdma-sharp-plugins/lib:${LD_LIBRARY_PATH}"
NCCL 2.12 and later: GDR is supported natively. If the Sharp plugin is preinstalled in the image, it is recommended to disable it for better stability: export NCCL_NET_PLUGIN=none
If you are not sure whether the Sharp plugin is preinstalled in the image, run the following command to check:
ldconfig -p | grep sharp
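To decide which of the two NCCL branches above applies, check the NCCL version actually used by your training framework. A minimal sketch, assuming PyTorch is installed in the image (query your own framework's NCCL build if you use something else):
python3 -c "import torch; print(torch.cuda.nccl.version())"   # prints the NCCL version, e.g. (2, 18, 3)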
The steps for checking the NIC rate are the same as described for the A800 above. For the H3C machine type, the NIC rate should be 400 Gb/sec.
Example output for one of the NICs:
# ibstatus
Infiniband device 'mlx5_1' port 1 status:
        default gid:     0000:0000:0000:0000:0000:0000:0000:0000
        base lid:        0x0
        sm lid:          0x0
        state:           4: ACTIVE
        phys state:      5: LinkUp
        rate:            400 Gb/sec (4X NDR)
        link_layer:      Ethernet
Run the following command to get the Ubuntu release codename:
lsb_release -c
Example output:
# lsb_release -c
Codename:       focal
Confirming the release codename helps you select or install dependency packages that match the current system version in later steps, ensuring software compatibility.
The mapping between codename and Ubuntu version is as follows:
| Codename | Version |
|---|---|
| focal | 20.04 |
| jammy | 22.04 |
| noble | 24.04 |
Run the following command to check whether the RDMA-related libraries are installed:
dpkg -l perftest ibverbs-providers libibumad3 libibverbs1 libnl-3-200 libnl-route-3-200 librdmacm1 ibverbs-utils
Example output:
# dpkg -l perftest ibverbs-providers libibumad3 libibverbs1 libnl-3-200 libnl-route-3-200 librdmacm1 ibverbs-utils
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name                    Version      Architecture Description
+++-=======================-============-============-===========================================================
ii  ibverbs-providers:amd64 39.0-1       amd64        User space provider drivers for libibverbs
ii  ibverbs-utils           39.0-1       amd64        Examples for the libibverbs library
ii  libibumad3:amd64        39.0-1       amd64        InfiniBand Userspace Management Datagram (uMAD) library
ii  libibverbs1:amd64       39.0-1       amd64        Library for direct userspace use of RDMA (InfiniBand/iWARP)
ii  libnl-3-200:amd64       3.5.0-0.1    amd64        library for dealing with netlink sockets
ii  libnl-route-3-200:amd64 3.5.0-0.1    amd64        library for dealing with netlink sockets - route interface
ii  librdmacm1:amd64        39.0-1       amd64        Library for managing RDMA connections
ii  perftest                4.4+0.37-1   amd64        Infiniband verbs performance tests
Version requirement: the leading digits of the version numbers of ibverbs-providers and libibverbs1 (the part before the first dot) must be >= 23 (for example, 28.0).
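To print just the installed versions of these two packages for comparison against the requirement, a quick sketch (the versions in your image will differ from the example):
dpkg-query -W -f='${Package} ${Version}\n' ibverbs-providers libibverbs1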
If any packages are missing or too old, install them with:
apt update && apt install -y perftest ibverbs-providers libibumad3 libibverbs1 libnl-3-200 libnl-route-3-200 librdmacm1
Once the environment is configured, run a bandwidth test with the ib_write_bw tool to verify that RDMA works correctly. The verification steps are the same for the A800 and H3C machine types.
Note
The value of **$NCCL_IB_GID_INDEX** is usually injected automatically when the system initializes; its default value is 2.
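If the variable is not set in your current shell (for example in a fresh docker exec session), you can fall back to the default mentioned above before running the tests; the value 2 here is only the documented default and may need adjusting for your network:
export NCCL_IB_GID_INDEX=${NCCL_IB_GID_INDEX:-2}   # keep an existing value, otherwise use 2
echo "Using GID index: $NCCL_IB_GID_INDEX"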
To verify the network link properly, run the test across two different NICs so that both the physical link and the switch configuration are exercised. The server and client run on the same instance; the A800 and H3C examples below show the procedure.
Start the server side on the first NIC (the & keeps it running in the background):
ib_write_bw -d mlx5_1 -x $NCCL_IB_GID_INDEX &
Example output:
# ib_write_bw -d mlx5_1 -x $NCCL_IB_GID_INDEX &
[1] 104777
root@iv-ybrf933mwd8rx7gs2na5:/workspace#
************************************
* Waiting for client to connect... *
************************************
Then start the client on a different NIC, pointing it at the local server:
ib_write_bw -d mlx5_2 -x $NCCL_IB_GID_INDEX 127.0.0.1 --report_gbits
Example output:
---------------------------------------------------------------------------------------
                    RDMA_Write BW Test
 Dual-port       : OFF          Device         : mlx5_1
 Number of qps   : 1            Transport type : IB
 Connection type : RC           Using SRQ      : OFF
 TX depth        : 128
 CQ Moderation   : 100
 Mtu             : 4096[B]
 Link type       : Ethernet
 GID index       : 7
 Max inline data : 0[B]
 rdma_cm QPs     : OFF
 Data ex. method : Ethernet
---------------------------------------------------------------------------------------
 local address: LID 0000 QPN 0xf181 PSN 0xd0b8e RKey 0x180ccb VAddr 0x007fbc10090000
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:33:204:225:158
 remote address: LID 0000 QPN 0x45ab PSN 0xf53f0e RKey 0x17fbba VAddr 0x007fa705a0b000
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:33:204:225:190
---------------------------------------------------------------------------------------
 #bytes     #iterations    BW peak[Gb/sec]    BW average[Gb/sec]   MsgRate[Mpps]
 65536      5000             195.11             195.10             0.372126
---------------------------------------------------------------------------------------
The output verifies that, for the A800 machine type, the bandwidth values (BW peak, BW average) should be close to **200 Gb/sec**. If they are, the configuration is correct; if there is no output or an error appears, go back to the machine-type-specific configuration section and check for missed configuration items.
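The server started with & keeps running in the background after a failed or interrupted run; before repeating the test you can stop it (a sketch for the interactive shell in which the job was started):
jobs      # confirm the ib_write_bw server job is still listed
kill %1   # stop it; outside that shell, use the PID printed earlier (e.g. 104777)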
Note
On H3C machines with ultra-high network rates (400 Gb/sec), a single flow can hardly saturate the physical limit, so the bandwidth limit is measured here with concurrent connections to show the throughput of the high-bandwidth network more directly.
Tips: the single-flow command from the A800 example still works here, but the measured bandwidth is less stable (it fluctuates up and down).
Start the server side with a larger message size and multiple queue pairs:
ib_write_bw -d mlx5_1 -s 64K --report_g -q 64 -x $NCCL_IB_GID_INDEX
-s: Size, the message size per transfer (64 KB here).
-q: QPs (Queue Pairs), the number of queue pairs, i.e. concurrent connections (64 here).
Example output:
# ib_write_bw -d mlx5_1 -s 64K --report_g -q 64 -x $NCCL_IB_GID_INDEX
[1] 104777
root@iv-ybrf933mwd8rx7gs2na5:/workspace#
************************************
* Waiting for client to connect... *
************************************
Then start the client on a different NIC:
ib_write_bw -d mlx5_2 -s 64K --report_g -q 64 127.0.0.1 -x $NCCL_IB_GID_INDEX
Example output:
---------------------------------------------------------------------------------------
                    RDMA_Write BW Test
 Dual-port       : OFF          Device         : mlx5_2
 Number of qps   : 64           Transport type : IB
 Connection type : RC           Using SRQ      : OFF
 PCIe relax order: ON           Lock-free      : OFF
---------------------------------------------------------------------------------------
                    RDMA_Write BW Test
 Dual-port       : OFF          Device         : mlx5_1
 Number of qps   : 64           Transport type : IB
 Connection type : RC           Using SRQ      : OFF
 PCIe relax order: ON           Lock-free      : OFF
 ibv_wr* API     : ON           Using Enhanced Reorder : OFF
 TX depth        : 128
 CQ Moderation   : 1
 CQE Poll Batch  : Dynamic
 Mtu             : 4096[B]
 Link type       : Ethernet
 GID index       : 7
 Max inline data : 0[B]
 rdma_cm QPs     : OFF
 Data ex. method : Ethernet
---------------------------------------------------------------------------------------
 ibv_wr* API     : ON           Using Enhanced Reorder : OFF
 CQ Moderation   : 1
 CQE Poll Batch  : Dynamic
 Mtu             : 4096[B]
 Link type       : Ethernet
 GID index       : 7
 Max inline data : 0[B]
 rdma_cm QPs     : OFF
 Data ex. method : Ethernet
---------------------------------------------------------------------------------------
 local address: LID 0000 QPN 0x1e00b PSN 0xd2b8af RKey 0x1fff00 VAddr 0x007f7406e3b000
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:26:48:147:22
 local address: LID 0000 QPN 0x1e00c PSN 0xc1466d RKey 0x1fff00 VAddr 0x007f7406e4b000
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:26:48:147:22
 ···························································
 remote address: LID 0000 QPN 0x1e042 PSN 0xa27115 RKey 0x1fff00 VAddr 0x007fb4ca6f0000
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:26:48:151:22
 remote address: LID 0000 QPN 0x1e048 PSN 0x76be2c RKey 0x1fff00 VAddr 0x007f740720b000
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:26:48:147:22
 remote address: LID 0000 QPN 0x1e043 PSN 0x18f548 RKey 0x1fff00 VAddr 0x007fb4ca700000
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:26:48:151:22
 remote address: LID 0000 QPN 0x1e049 PSN 0xf47213 RKey 0x1fff00 VAddr 0x007f740721b000
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:26:48:147:22
 remote address: LID 0000 QPN 0x1e044 PSN 0x451320 RKey 0x1fff00 VAddr 0x007fb4ca710000
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:26:48:151:22
---------------------------------------------------------------------------------------
 #bytes     #iterations    BW peak[Gb/sec]    BW average[Gb/sec]   MsgRate[Mpps]
 65536      320000           390.70             390.61             0.745028
---------------------------------------------------------------------------------------
The output verifies that, for the H3C machine type, the bandwidth should be close to 400 Gb/sec. If it is, the configuration is correct; if there is no output or an error appears, go back to the machine-type-specific configuration section and check for missed configuration items.
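If the measured bandwidth is noticeably low, a single bad link may be the cause. A rough sketch that repeats the pair test against each of the remaining NICs, assuming the four devices mlx5_1 - mlx5_4 from the A800 example (adjust device names and parameters to your machine):
for dev in mlx5_2 mlx5_3 mlx5_4; do
  ib_write_bw -d mlx5_1 -s 64K --report_gbits -q 64 -x $NCCL_IB_GID_INDEX &   # server on mlx5_1
  sleep 2                                                                     # give the server time to listen
  ib_write_bw -d "$dev" -s 64K --report_gbits -q 64 -x $NCCL_IB_GID_INDEX 127.0.0.1
  wait                                                                        # the server exits after one test
done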
On machine A, start the server side:
ib_write_bw -d mlx5_1 -x $NCCL_IB_GID_INDEX
Example output:
# ib_write_bw -d mlx5_1 -x $NCCL_IB_GID_INDEX
************************************
* Waiting for client to connect... *
************************************
On the other machine, replace <MACHINE_A_HOST> with the IP address of machine A's RDMA interface and start the client:
ib_write_bw -d mlx5_1 -x $NCCL_IB_GID_INDEX <MACHINE_A_HOST> --report_gbits
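If you are unsure which IP belongs to machine A's RDMA interface, one way to look it up on machine A is via the sysfs mapping from the RDMA device to its network interface; a sketch assuming the mlx5_1 device used above (the interface name will differ per machine):
netdev=$(ls /sys/class/infiniband/mlx5_1/device/net/ | head -n1)   # network interface backing mlx5_1
ip -4 addr show dev "$netdev"                                       # its IPv4 address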