Help needed: Ethernet connection drops when restarting the openibd (InfiniBand) service
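Hi all, I've run into a frustrating problem and would appreciate help analyzing it.
I have several servers, each with an onboard Ethernet controller and an InfiniBand controller installed in a PCI slot. In principle the two networks should be completely independent, but that is not what I see: when I restart openibd.service, which should only manage the InfiniBand adapter, my Ethernet network restarts as well. Worse, when I stop openibd, the Ethernet connection goes down and stays down until openibd is running again.
What I need: stopping or restarting openibd.service must not affect the Ethernet connection at all.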
System environment
- Operating system: AlmaLinux 8.7
- Ethernet port currently in use: eno2np1 (1 Gb)
- OFED driver version: MLNX_OFED_LINUX-5.9-0.5.6.0
- NIC firmware status: firmware on all adapters is up to date. Query output:
    ./mlxfwmanager_LeSI_23B_OFED-23.04-1_build4_fw_update_aug_2023 --query
    Querying Mellanox devices firmware ...

    Device #1:
    ----------
      Device Type:      ConnectX4LX
      Part Number:      Lenovo_Ultron_CX4Lx_2P_25GbE_1G-BaseT_Ax
      Description:      Lenovo Ultron ConnectX-4 Lx LOM 25GbE and 1G-BaseT
      PSID:             LNV0000000028
      PCI Device Name:  0000:65:00.0
      Base MAC:         088fc3a3cb9e
      Versions:         Current        Available
         FW             14.32.1010     14.32.1010
         PXE            3.6.0502       3.6.0502
         UEFI           14.25.0017     14.25.0017
      Status:           Up to date

    Device #2:
    ----------
      Device Type:      ConnectX6
      Part Number:      SC57A40943_Ax
      Description:      ThinkSystem Mellanox ConnectX-6 HDR100/100GbE QSFP56 1-port VPI Adapter
      PSID:             LNV0000000016
      PCI Device Name:  0000:17:00.0
      Base GUID:        946dae030049bd14
      Versions:         Current        Available
         FW             20.37.1014     20.37.1014
         PXE            3.7.0102       3.7.0102
         UEFI           14.30.0013     14.30.0013
      Status:           Up to date
Hardware and driver details
lspci -nnn output
    17:00.0 Infiniband controller [0207]: Mellanox Technologies MT28908 Family [ConnectX-6] [15b3:101b]
    65:00.0 Ethernet controller [0200]: Mellanox Technologies MT27710 Family [ConnectX-4 Lx] [15b3:1015]
    65:00.1 Ethernet controller [0200]: Mellanox Technologies MT27710 Family [ConnectX-4 Lx] [15b3:1015]
lshw -C network output
    *-network
         description: interface
         product: MT28908 Family [ConnectX-6]
         vendor: Mellanox Technologies
         physical id: 0
         bus info: pci@0000:17:00.0
         logical name: ib0
         version: 00
         serial: 00:00:0a:81:fe:80:00:00:00:00:00:00:94:6d:00:00:00:00:00:00
         width: 64 bits
         clock: 33MHz
         capabilities: pciexpress vpd msix pm bus_master cap_list rom physical
         configuration: autonegotiation=off broadcast=yes driver=mlx5_core[ib_ipoib] driverversion=5.9-0.5.5 duplex=full firmware=20.37.1014 (LNV0000000016) ip=192.168.0.3 latency=0 link=yes multicast=yes
         resources: iomemory:21f0-21ef irq:18 memory:21ffc000000-21ffdffffff memory:d4200000-d42fffff
    *-network:0
         description: Ethernet interface
         product: MT27710 Family [ConnectX-4 Lx]
         vendor: Mellanox Technologies
         physical id: 0
         bus info: pci@0000:65:00.0
         logical name: eno1np0
         version: 00
         serial: 08:8f:c3:a3:cb:9e
         width: 64 bits
         clock: 33MHz
         capabilities: pciexpress vpd msix pm bus_master cap_list rom ethernet physical autonegotiation
         configuration: autonegotiation=on broadcast=yes driver=mlx5_core driverversion=5.9-0.5.5 firmware=14.32.1010 (LNV0000000028) latency=0 link=no multicast=yes
         resources: iomemory:24f0-24ef irq:18 memory:24ffc000000-24ffdffffff memory:e3500000-e35fffff memory:24ffe800000-24ffeffffff
    *-network:1
         description: Ethernet interface
         product: MT27710 Family [ConnectX-4 Lx]
         vendor: Mellanox Technologies
         physical id: 0.1
         bus info: pci@0000:65:00.1
         logical name: eno2np1
         version: 00
         serial: 08:8f:c3:a3:cb:9f
         size: 1Gbit/s
         width: 64 bits
         clock: 33MHz
         capabilities: pciexpress vpd msix pm bus_master cap_list rom ethernet physical autonegotiation
         configuration: autonegotiation=on broadcast=yes driver=mlx5_core driverversion=5.9-0.5.5 duplex=full firmware=14.32.1010 (LNV0000000028) ip=10.0.26.3 latency=0 link=yes multicast=yes speed=1Gbit/s
         resources: iomemory:24f0-24ef irq:19 memory:24ffa000000-24ffbffffff memory:e3400000-e34fffff memory:24ffe000000-24ffe7fffff
ethtool eno2np1 output
    Settings for eno2np1:
        Supported ports: [ ]
        Supported link modes:   1000baseKX/Full
        Supported pause frame use: Symmetric
        Supports auto-negotiation: Yes
        Supported FEC modes: None RS BASER
        Advertised link modes:  1000baseKX/Full
        Advertised pause frame use: Symmetric
        Advertised auto-negotiation: Yes
        Advertised FEC modes: None RS BASER
        Speed: 1000Mb/s
        Duplex: Full
        Auto-negotiation: on
        Port: None
        PHYAD: 0
        Transceiver: internal
        Supports Wake-on: g
        Wake-on: g
        Current message level: 0x00000004 (4)
                               link
        Link detected: yes
Information for IB port ib0
    Settings for ib0:
        Supported ports: [ ]
        Supported link modes:   Not reported
        Supported pause frame use: No
        Supports auto-negotiation: No
        Supported FEC modes: Not reported
        Advertised link modes:  Not reported
        Advertised pause frame use: No
        Advertised auto-negotiation: No
        Advertised FEC modes: Not reported
        Speed: 100000Mb/s
        Duplex: Full
        Auto-negotiation: off
        Port: Other
        PHYAD: 0
        Transceiver: internal
        Link detected: yes
System log while restarting openibd (/var/log/messages)
    systemd[1]: Stopping openibd - configure Mellanox devices...
    root[8303]: openibd: running in manual mode
    systemd[1]: /usr/lib/systemd/system/ibacm.service:22: Unknown lvalue 'ProtectHostname' in section 'Service'
    systemd[1]: /usr/lib/systemd/system/ibacm.service:23: Unknown lvalue 'ProtectKernelLogs' in section 'Service'
    NetworkManager[1345]: <info> [1692350943.3204] device (ib0): state change: activated -> unmanaged (reason 'removed', sys-iface-state: 'removed')
    dbus-daemon[1341]: [system] Activating via systemd: service name='org.freedesktop.nm_dispatcher' unit='dbus-org.freedesktop.nm-dispatcher.service' requested by ':1.1' (uid=0 pid=1345 comm="/usr/sbin/NetworkManager --no-daemon ")
    systemd[1]: Starting Network Manager Script Dispatcher Service...
    dbus-daemon[1341]: [system] Successfully activated service 'org.freedesktop.nm_dispatcher'
    systemd[1]: Started Network Manager Script Dispatcher Service.
    systemd[1]: Stopping RDMA Node Description Daemon...
    systemd[1]: rdma-ndd.service: Succeeded.
    systemd[1]: Stopped RDMA Node Description Daemon.
    NetworkManager[1345]: <info> [1692350945.4769] device (eno2np1): state change: activated -> unmanaged (reason 'removed', sys-iface-state: 'removed')
    NetworkManager[1345]: <info> [1692350945.4912] dhcp4 (eno2np1): canceled DHCP transaction
    NetworkManager[1345]: <info> [1692350945.4913] dhcp4 (eno2np1): activation: beginning transaction (timeout in 45 seconds)
    NetworkManager[1345]: <info> [1692350945.4913] dhcp4 (eno2np1): state changed no lease
    NetworkManager[1345]: <info> [1692350945.4926] manager: NetworkManager state is now DISCONNECTED
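To see at a glance which interfaces NetworkManager reports as removed during the restart, the log can be filtered with a small script. This is only a rough sketch: the names `STATE_CHANGE`, `removed_devices`, and `SAMPLE` are made up for illustration, and the regular expression assumes the NetworkManager message format seen in the two "state change" lines of the log above.

```python
import re

# Matches NetworkManager lines of the form seen in the log above, e.g.
#   device (eno2np1): state change: activated -> unmanaged (reason 'removed', ...)
STATE_CHANGE = re.compile(
    r"device \((?P<dev>[^)]+)\): state change: "
    r"(?P<old>[\w-]+) -> (?P<new>[\w-]+) \(reason '(?P<reason>[^']+)'"
)


def removed_devices(log_text):
    """Return device names whose state change has reason 'removed', in log order."""
    names = []
    for match in STATE_CHANGE.finditer(log_text):
        if match.group("reason") == "removed" and match.group("dev") not in names:
            names.append(match.group("dev"))
    return names


# Two lines copied from the /var/log/messages excerpt in this post.
SAMPLE = """\
NetworkManager[1345]: <info> [1692350943.3204] device (ib0): state change: activated -> unmanaged (reason 'removed', sys-iface-state: 'removed')
NetworkManager[1345]: <info> [1692350945.4769] device (eno2np1): state change: activated -> unmanaged (reason 'removed', sys-iface-state: 'removed')
"""

if __name__ == "__main__":
    print(removed_devices(SAMPLE))
```

In the excerpt both ib0 and eno2np1 are reported with reason 'removed', which suggests the Ethernet interface itself disappears from the system rather than merely losing its link.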
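What I have tried so far
- Reinstalling a clean operating system
- Updating the server UEFI firmware
- Updating the Mellanox NIC firmware and the OFED driver
I suspect the cause is that the two adapters share the same driver, but I'm not sure how to investigate further or how to fix it. Any pointers would be much appreciated!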
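One way to check the shared-driver suspicion is to group the network interfaces by the kernel driver they are bound to. The sketch below is a minimal illustration, not a fix: `interfaces_by_driver` is a hypothetical helper, and it assumes the usual Linux sysfs layout in which `/sys/class/net/<iface>/device/driver` is a symlink whose last path component is the driver name.

```python
import os


def interfaces_by_driver(net_dir="/sys/class/net"):
    """Map each kernel driver name to the sorted list of interfaces bound to it."""
    groups = {}
    for iface in sorted(os.listdir(net_dir)):
        link = os.path.join(net_dir, iface, "device", "driver")
        if not os.path.islink(link):
            continue  # virtual interfaces such as lo have no backing device driver
        driver = os.path.basename(os.readlink(link))
        groups.setdefault(driver, []).append(iface)
    return groups


if __name__ == "__main__":
    for driver, ifaces in interfaces_by_driver().items():
        print(driver, ifaces)
```

Since lshw reports driver=mlx5_core for ib0, eno1np0, and eno2np1 alike, I would expect all three to appear under the same driver here, which would be consistent with the suspicion above.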
Note: this content comes from Stack Exchange; the question was asked by Oren.




