Help needed: Ethernet connection drops when restarting the openibd (InfiniBand) service

Hi everyone, I've run into a frustrating problem and would appreciate help analyzing it:

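I have several servers, each with an onboard Ethernet controller and an InfiniBand controller installed in a PCI slot. In principle these two networks should be completely independent, but that is not what I see: when I restart openibd.service, which should manage only the InfiniBand adapter, my Ethernet network restarts as well. Worse, when I stop openibd, the Ethernet connection drops immediately and only comes back once openibd is running again.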

What I need: stopping or restarting openibd.service must not affect the Ethernet connection at all.

System environment

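  • Operating system: AlmaLinux 8.7
  • Ethernet port currently in use: eno2np1 (1 Gb/s)
  • OFED driver version: MLNX_OFED_LINUX-5.9-0.5.6.0
  • NIC firmware status: the firmware on all adapters is up to date; the firmware query output is below: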
./mlxfwmanager_LeSI_23B_OFED-23.04-1_build4_fw_update_aug_2023 --query :

Querying Mellanox devices firmware ...

Device #1:

----------

Device Type:      ConnectX4LX
Part Number:      Lenovo_Ultron_CX4Lx_2P_25GbE_1G-BaseT_Ax
Description:      Lenovo Ultron ConnectX-4 Lx LOM 25GbE and 1G-BaseT
PSID:             LNV0000000028
PCI Device Name:  0000:65:00.0
Base MAC:         088fc3a3cb9e
Versions:         Current        Available
   FW             14.32.1010     14.32.1010
   PXE            3.6.0502       3.6.0502
   UEFI           14.25.0017     14.25.0017
Status:           Up to date

Device #2:

----------

Device Type:      ConnectX6
Part Number:      SC57A40943_Ax
Description:      ThinkSystem Mellanox ConnectX-6 HDR100/100GbE QSFP56 1-port VPI Adapter
PSID:             LNV0000000016
PCI Device Name:  0000:17:00.0
Base GUID:        946dae030049bd14
Versions:         Current        Available
   FW             20.37.1014     20.37.1014
   PXE            3.7.0102       3.7.0102
   UEFI           14.30.0013     14.30.0013
Status:           Up to date

Hardware and driver details

lspci -nnn output

17:00.0 Infiniband controller [0207]: Mellanox Technologies MT28908 Family [ConnectX-6] [15b3:101b]
65:00.0 Ethernet controller [0200]: Mellanox Technologies MT27710 Family [ConnectX-4 Lx] [15b3:1015]
65:00.1 Ethernet controller [0200]: Mellanox Technologies MT27710 Family [ConnectX-4 Lx] [15b3:1015]

lshw -C network output

*-network
description: interface
product: MT28908 Family [ConnectX-6]
vendor: Mellanox Technologies
physical id: 0
bus info: pci@0000:17:00.0
logical name: ib0
version: 00
serial: 00:00:0a:81:fe:80:00:00:00:00:00:00:94:6d:00:00:00:00:00:00
width: 64 bits
clock: 33MHz
capabilities: pciexpress vpd msix pm bus_master cap_list rom physical
configuration: autonegotiation=off broadcast=yes driver=mlx5_core[ib_ipoib] driverversion=5.9-0.5.5 duplex=full firmware=20.37.1014 (LNV0000000016) ip=192.168.0.3 latency=0 link=yes multicast=yes
resources: iomemory:21f0-21ef irq:18 memory:21ffc000000-21ffdffffff memory:d4200000-d42fffff
*-network:0
description: Ethernet interface
product: MT27710 Family [ConnectX-4 Lx]
vendor: Mellanox Technologies
physical id: 0
bus info: pci@0000:65:00.0
logical name: eno1np0
version: 00
serial: 08:8f:c3:a3:cb:9e
width: 64 bits
clock: 33MHz
capabilities: pciexpress vpd msix pm bus_master cap_list rom ethernet physical autonegotiation
configuration: autonegotiation=on broadcast=yes driver=mlx5_core driverversion=5.9-0.5.5 firmware=14.32.1010 (LNV0000000028) latency=0 link=no multicast=yes
resources: iomemory:24f0-24ef irq:18 memory:24ffc000000-24ffdffffff memory:e3500000-e35fffff memory:24ffe800000-24ffeffffff
*-network:1
description: Ethernet interface
product: MT27710 Family [ConnectX-4 Lx]
vendor: Mellanox Technologies
physical id: 0.1
bus info: pci@0000:65:00.1
logical name: eno2np1
version: 00
serial: 08:8f:c3:a3:cb:9f
size: 1Gbit/s
width: 64 bits
clock: 33MHz
capabilities: pciexpress vpd msix pm bus_master cap_list rom ethernet physical autonegotiation
configuration: autonegotiation=on broadcast=yes driver=mlx5_core driverversion=5.9-0.5.5 duplex=full firmware=14.32.1010 (LNV0000000028) ip=10.0.26.3 latency=0 link=yes multicast=yes speed=1Gbit/s
resources: iomemory:24f0-24ef irq:19 memory:24ffa000000-24ffbffffff memory:e3400000-e34fffff memory:24ffe000000-24ffe7fffff

ethtool eno2np1 output

Settings for eno2np1:
Supported ports: [  ]
Supported link modes:   1000baseKX/Full
Supported pause frame use: Symmetric
Supports auto-negotiation: Yes
Supported FEC modes: None        RS      BASER
Advertised link modes:  1000baseKX/Full
Advertised pause frame use: Symmetric
Advertised auto-negotiation: Yes
Advertised FEC modes: None       RS      BASER
Speed: 1000Mb/s
Duplex: Full
Auto-negotiation: on
Port: None
PHYAD: 0
Transceiver: internal
Supports Wake-on: g
Wake-on: g
Current message level: 0x00000004 (4)
                       link
Link detected: yes

IB port ib0 information

Settings for ib0:
Supported ports: [  ]
Supported link modes:   Not reported
Supported pause frame use: No
Supports auto-negotiation: No
Supported FEC modes: Not reported
Advertised link modes:  Not reported
Advertised pause frame use: No
Advertised auto-negotiation: No
Advertised FEC modes: Not reported
Speed: 100000Mb/s
Duplex: Full
Auto-negotiation: off
Port: Other
PHYAD: 0
Transceiver: internal
Link detected: yes

System log while restarting openibd (/var/log/messages)

systemd[1]: Stopping openibd - configure Mellanox devices...
root[8303]: openibd: running in manual mode
systemd[1]: /usr/lib/systemd/system/ibacm.service:22: Unknown lvalue 'ProtectHostname' in section 'Service'
systemd[1]: /usr/lib/systemd/system/ibacm.service:23: Unknown lvalue 'ProtectKernelLogs' in section 'Service'
NetworkManager[1345]: <info>  [1692350943.3204] device (ib0): state change: activated -> unmanaged (reason 'removed', sys-iface-state: 'removed')
dbus-daemon[1341]: [system] Activating via systemd: service name='org.freedesktop.nm_dispatcher' unit='dbus-org.freedesktop.nm-dispatcher.service' requested by ':1.1' (uid=0 pid=1345 comm="/usr/sbin/NetworkManager --no-daemon ")
systemd[1]: Starting Network Manager Script Dispatcher Service...
dbus-daemon[1341]: [system] Successfully activated service 'org.freedesktop.nm_dispatcher'
systemd[1]: Started Network Manager Script Dispatcher Service.
systemd[1]: Stopping RDMA Node Description Daemon...
systemd[1]: rdma-ndd.service: Succeeded.
systemd[1]: Stopped RDMA Node Description Daemon.
NetworkManager[1345]: <info>  [1692350945.4769] device (eno2np1): state change: activated -> unmanaged (reason 'removed', sys-iface-state: 'removed')
NetworkManager[1345]: <info>  [1692350945.4912] dhcp4 (eno2np1): canceled DHCP transaction
NetworkManager[1345]: <info>  [1692350945.4913] dhcp4 (eno2np1): activation: beginning transaction (timeout in 45 seconds)
NetworkManager[1345]: <info>  [1692350945.4913] dhcp4 (eno2np1): state changed no lease
NetworkManager[1345]: <info>  [1692350945.4926] manager: NetworkManager state is now DISCONNECTED
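
To make the order of events in the log easier to see, the NetworkManager state-change lines can be pulled out with a small script. This is a minimal Python sketch and not something from my actual setup: the function name `device_state_changes` and the regular expression are assumptions based only on the two "state change" lines quoted above, and other NetworkManager versions may format these lines differently.

```python
import re

# Matches NetworkManager "state change" lines in the format quoted above.
# The pattern is an assumption derived from those two lines only.
STATE_CHANGE = re.compile(
    r"device \((?P<dev>[^)]+)\): state change: "
    r"(?P<old>\S+) -> (?P<new>\S+) \(reason '(?P<reason>[^']+)'"
)


def device_state_changes(log_text):
    """Return (device, old_state, new_state, reason) tuples in log order."""
    return [
        (m["dev"], m["old"], m["new"], m["reason"])
        for m in STATE_CHANGE.finditer(log_text)
    ]


# Two lines copied from the log excerpt in this post.
SAMPLE = """\
NetworkManager[1345]: <info>  [1692350943.3204] device (ib0): state change: activated -> unmanaged (reason 'removed', sys-iface-state: 'removed')
NetworkManager[1345]: <info>  [1692350945.4769] device (eno2np1): state change: activated -> unmanaged (reason 'removed', sys-iface-state: 'removed')
"""

if __name__ == "__main__":
    for dev, old, new, reason in device_state_changes(SAMPLE):
        print(f"{dev}: {old} -> {new} ({reason})")
```

Run against the excerpt, this reports ib0 and then, about two seconds later in the log, eno2np1, both with reason 'removed'. In other words, NetworkManager sees the Ethernet interface itself disappear rather than just lose its link.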

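Troubleshooting steps already tried

  • Reinstalled a clean copy of the operating system
  • Updated the server UEFI firmware
  • Updated the Mellanox NIC firmware and the OFED driver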

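I suspect the cause is that the two adapters share the same driver, but I'm not sure how to investigate further or how to fix it. Any pointers would be much appreciated!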
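
One way to check the shared-driver suspicion is to read the driver each interface is bound to from sysfs. The lshw output above already reports driver=mlx5_core for ib0, eno1np0, and eno2np1, so I would expect all three to show up under that driver. This is a minimal Python sketch: the function name `interfaces_by_driver` is hypothetical, and it assumes the usual Linux layout in which /sys/class/net/<iface>/device/driver is a symlink to the bound driver.

```python
import os
from collections import defaultdict


def interfaces_by_driver(sysfs_net="/sys/class/net"):
    """Map kernel driver name -> sorted list of interfaces bound to it."""
    groups = defaultdict(list)
    for iface in sorted(os.listdir(sysfs_net)):
        driver_link = os.path.join(sysfs_net, iface, "device", "driver")
        if not os.path.islink(driver_link):
            # Virtual interfaces such as lo have no backing device/driver.
            continue
        # The symlink target's last path component is the driver name.
        driver = os.path.basename(os.readlink(driver_link))
        groups[driver].append(iface)
    return dict(groups)


if __name__ == "__main__":
    for driver, ifaces in interfaces_by_driver().items():
        print(f"{driver}: {', '.join(ifaces)}")
```

If the InfiniBand port and both Ethernet ports are listed under the same driver, anything that unloads that kernel module would take the Ethernet ports down together with ib0. Whether openibd actually unloads the module when it stops is an assumption I have not confirmed.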

Note: this content comes from Stack Exchange; the question was asked by Oren.
