XCP-ng 8.1 host loses network when running gateway/firewall VMs
-
We are in the process of migrating our VMs from a XenServer 6.5 pool to a new pool running XCP-ng 8.1. After we migrated some VMs that act as gateways / firewalls for internal networks, the hosts those VMs run on lose network connectivity within a few minutes, at times within seconds, of the VM booting. (The host is still running, and if I log in on the console everything except networking works.)
The new pool is running on HP BL460c Gen8 blades, with 10Gb Flexfabric NICs, using the bnx2x driver.
When the host loses network these messages appear in kern.log:
[09:34 oslo5pool3h03 log]$ sudo grep bnx2x kern.log | grep timeout
Nov 9 10:24:52 oslo5pool3h03 kernel: [ 1537.425714] bnx2x: [bnx2x_stats_comp:211(eth0)]timeout waiting for stats finished
Nov 9 10:24:54 oslo5pool3h03 kernel: [ 1538.584055] bnx2x: [bnx2x_stats_comp:211(eth0)]timeout waiting for stats finished
Nov 9 10:25:24 oslo5pool3h03 kernel: [ 1568.785236] bnx2x: [bnx2x_stats_comp:211(eth0)]timeout waiting for stats finished
Nov 9 10:25:25 oslo5pool3h03 kernel: [ 1569.940934] bnx2x: [bnx2x_stats_comp:211(eth0)]timeout waiting for stats finished
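To see at a glance which interfaces are affected and how often, the grep above can be extended to tally the timeouts per interface. This is only a minimal sketch: the function name count_bnx2x_timeouts is made up for illustration, and it assumes the log lines have the same shape as the excerpt above.

```shell
#!/bin/sh
# Hypothetical helper: tally bnx2x "stats" timeouts per interface.
# Assumes lines shaped like the kern.log excerpt above, e.g.
#   bnx2x: [bnx2x_stats_comp:211(eth0)]timeout waiting for stats finished
# The sed step keeps only the interface name found between the parentheses;
# the awk step turns the "count name" output of uniq -c into "name: count".
count_bnx2x_timeouts() {
    log_file="$1"
    grep 'bnx2x_stats_comp' "$log_file" | grep 'timeout' |
        sed -n 's/.*bnx2x_stats_comp:[0-9]*(\([^)]*\)).*/\1/p' |
        sort | uniq -c |
        awk '{ print $2 ": " $1 }'
}
```

Run from the log directory, it would be called as count_bnx2x_timeouts kern.log (with sudo if the file is not readable otherwise).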
Some pages I found through Google hint at the IOMMU being the problem. I tried disabling the IOMMU through GRUB kernel parameters; with it disabled, the host rebooted immediately when the test VM triggered the problem.
The NICs seem to be on the XenServer HCL, and since these are blade servers I can't swap the NICs for a different chipset, since all HPE NICs for this blade generation use that same chipset.
"Regular" VMs are working fine, but VMs with multiple virtual NICs where there's traffic going from one interface to another seem to reliably crash the host.
I've applied all available updates to XCP-ng, but this didn't make any difference.
Since we're not using FibreChannel, I tried disabling that module, and I also tried disabling some offloading:
[09:40 oslo5pool3h03 log]$ cat /etc/modprobe.d/qlogic-netxtreme2.conf
options bnx2x num_vfs=0
options bnx2x disable_tpa=1
[09:40 oslo5pool3h03 log]$ cat /etc/modprobe.d/blacklist-fc.conf
blacklist bnx2fc
Neither of these made any difference.
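Options in /etc/modprobe.d are only applied when the module is loaded, so it can be worth confirming what the running driver actually sees. The following is a minimal sketch, assuming the driver exposes its parameters under /sys/module/bnx2x/parameters (not every parameter is necessarily exported there); show_module_params is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: print every readable parameter file in a directory
# as "name = value". The default path is an assumption; parameters the
# driver does not export to sysfs will simply not appear.
show_module_params() {
    param_dir="${1:-/sys/module/bnx2x/parameters}"
    if [ ! -d "$param_dir" ]; then
        echo "no parameter directory at $param_dir" >&2
        return 1
    fi
    for param_file in "$param_dir"/*; do
        # Skip the unexpanded glob (empty directory) and unreadable files.
        [ -r "$param_file" ] || continue
        printf '%s = %s\n' "$(basename "$param_file")" "$(cat "$param_file")"
    done
}
```

Called without arguments it reads the assumed bnx2x path. For the offload side, ethtool -k eth0 lists the offload features currently active on an interface.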
-
Ping @r1 and/or @fohdeesha
-
@vegarnilsen can you share the output of
# dmesg
and
# lsmod
? We may have to try a different version of the driver to fix this. Also share
# rpm -qa | grep bnx
-
@r1 Sure, take a look here:
https://gist.github.com/vegarnilsen/65409692fae1430efd5422860e489ef2
-
@vegarnilsen Ok, that was helpful.
Can you try installing broadcom-bnxt-en-alt.x86_64 and report the observations? You would need a reboot.
-
Indeed, this seems like yet another Broadcom driver/firmware issue (not uncommon).
-
@r1 We're not using the bnxt_en driver, we're using the bnx2x driver. But given your request I looked for and installed the alternate qlogic driver:
[10:29 oslo5pool3h03 etc]$ rpm -qa | grep qlogic
qlogic-qla2xxx-firmware-8.03.02-1.xcpng8.1.x86_64
qlogic-netxtreme2-4.19.0+1-modules-7.14.53-1.1.xcpng8.1.x86_64
qlogic-qla2xxx-10.01.00.54.80.0_k-1.xcpng8.1.x86_64
qlogic-fastlinq-8.37.30.0-3.xcpng8.1.x86_64
qlogic-netxtreme2-7.14.53-1.1.xcpng8.1.x86_64
[10:29 oslo5pool3h03 etc]$ rpm -qil qlogic-netxtreme2-4.19.0+1-modules-7.14.53-1.1.xcpng8.1.x86_64
Name        : qlogic-netxtreme2-4.19.0+1-modules
Version     : 7.14.53
Release     : 1.1.xcpng8.1
Architecture: x86_64
Install Date: Tue 22 Sep 2020 06:04:01 PM CEST
Group       : System Environment/Kernel
Size        : 3048296
License     : GPL
Signature   : RSA/SHA1, Wed 12 Feb 2020 01:27:25 PM CET, Key ID cd75783a3fd3ac9e
Source RPM  : qlogic-netxtreme2-7.14.53-1.1.xcpng8.1.src.rpm
Build Date  : Wed 12 Feb 2020 01:13:59 PM CET
Build Host  : koji.xcp-ng.org
Relocations : (not relocatable)
Packager    : XCP-ng
Vendor      : XCP-ng
Summary     : Qlogic netxtreme2 device drivers
Description :
Qlogic netxtreme2 device drivers for the Linux Kernel version 4.19.0+1.
/etc/modprobe.d/qlogic-netxtreme2.conf
/lib/modules/4.19.0+1/updates/bnx2.ko
/lib/modules/4.19.0+1/updates/bnx2fc.ko
/lib/modules/4.19.0+1/updates/bnx2i.ko
/lib/modules/4.19.0+1/updates/bnx2x.ko
/lib/modules/4.19.0+1/updates/cnic.ko
[10:29 oslo5pool3h03 etc]$ yum search qlogic
Loaded plugins: fastestmirror
Loading mirror speeds from cached hostfile
Excluding mirror: updates.xcp-ng.org
 * xcp-ng-base: mirrors.xcp-ng.org
Excluding mirror: updates.xcp-ng.org
 * xcp-ng-updates: mirrors.xcp-ng.org
====================================================== N/S matched: qlogic =======================================================
qlogic-fastlinq.x86_64 : Qlogic fastlinq device drivers
qlogic-fastlinq-debuginfo.x86_64 : Debug information for package qlogic-fastlinq
qlogic-netxtreme2.x86_64 : Qlogic NetXtreme II iSCSI, 1-Gigabit and 10-Gigabit ethernet drivers
qlogic-netxtreme2-4.19.0+1-modules.x86_64 : Qlogic netxtreme2 device drivers
qlogic-netxtreme2-alt.x86_64 : Qlogic NetXtreme II iSCSI, 1-Gigabit and 10-Gigabit ethernet drivers
qlogic-netxtreme2-alt-4.19.0+1-modules.x86_64 : Qlogic netxtreme2 device drivers
qlogic-netxtreme2-alt-debuginfo.x86_64 : Debug information for package qlogic-netxtreme2-alt
qlogic-netxtreme2-debuginfo.x86_64 : Debug information for package qlogic-netxtreme2
qlogic-qla2xxx.x86_64 : Qlogic qla2xxx device drivers
qlogic-qla2xxx-debuginfo.x86_64 : Debug information for package qlogic-qla2xxx
qlogic-qla2xxx-firmware.x86_64 : Qlogic qla2xxx firmware
qlogic-qla2xxx-firmware-debuginfo.x86_64 : Debug information for package qlogic-qla2xxx-firmware

  Name and summary matches only, use "search all" for everything.
[10:30 oslo5pool3h03 etc]$ yum info qlogic-netxtreme2-alt-4.19.0+1-modules.x86_64
Loaded plugins: fastestmirror
Loading mirror speeds from cached hostfile
Excluding mirror: updates.xcp-ng.org
 * xcp-ng-base: mirrors.xcp-ng.org
Excluding mirror: updates.xcp-ng.org
 * xcp-ng-updates: mirrors.xcp-ng.org
Available Packages
Name        : qlogic-netxtreme2-alt-4.19.0+1-modules
Arch        : x86_64
Version     : 7.14.63
Release     : 2.xcpng8.1
Size        : 1.2 M
Repo        : xcp-ng-base
Summary     : Qlogic netxtreme2 device drivers
License     : GPL
Description : Qlogic netxtreme2 device drivers for the Linux Kernel
            : version 4.19.0+1.
[10:30 oslo5pool3h03 etc]$ sudo yum install qlogic-netxtreme2-alt-4.19.0+1-modules.x86_64
Loaded plugins: fastestmirror
Loading mirror speeds from cached hostfile
Excluding mirror: updates.xcp-ng.org
 * xcp-ng-base: mirrors.xcp-ng.org
Excluding mirror: updates.xcp-ng.org
 * xcp-ng-updates: mirrors.xcp-ng.org
Resolving Dependencies
--> Running transaction check
---> Package qlogic-netxtreme2-alt-4.19.0+1-modules.x86_64 0:7.14.63-2.xcpng8.1 will be installed
--> Finished Dependency Resolution

Dependencies Resolved

==================================================================================================================================
 Package                                     Arch      Version                 Repository       Size
==================================================================================================================================
Installing:
 qlogic-netxtreme2-alt-4.19.0+1-modules      x86_64    7.14.63-2.xcpng8.1      xcp-ng-base      1.2 M

Transaction Summary
==================================================================================================================================
Install  1 Package

Total download size: 1.2 M
Installed size: 2.9 M
Is this ok [y/d/N]: y
Downloading packages:
qlogic-netxtreme2-alt-4.19.0+1-modules-7.14.63-2.xcpng8.1.x86_64.rpm | 1.2 MB 00:00:00
Running transaction check
Running transaction test
Transaction test succeeded
Running transaction
  Installing : qlogic-netxtreme2-alt-4.19.0+1-modules-7.14.63-2.xcpng8.1.x86_64 1/1
  Verifying  : qlogic-netxtreme2-alt-4.19.0+1-modules-7.14.63-2.xcpng8.1.x86_64 1/1

Installed:
  qlogic-netxtreme2-alt-4.19.0+1-modules.x86_64 0:7.14.63-2.xcpng8.1

Complete!
[10:32 oslo5pool3h03 etc]$
I rebooted the server, booted up a couple of the VMs I'm having issues with, and then ran ping from one of the internal servers to an external site:
64 bytes from www.vg.no (195.88.54.16): icmp_seq=1338 ttl=248 time=2.88 ms
64 bytes from www.vg.no (195.88.54.16): icmp_seq=1339 ttl=248 time=3.04 ms
64 bytes from www.vg.no (195.88.54.16): icmp_seq=1340 ttl=248 time=3.17 ms
64 bytes from www.vg.no (195.88.54.16): icmp_seq=1341 ttl=248 time=2.91 ms
client_loop: send disconnect: Broken pipe
client_loop: send disconnect: Broken pipe
However, as you can see, this crashed the host after a while and resulted in a host with no network.
-
@vegarnilsen Thanks, you got the correct one.
Can you share the output of
# modinfo bnx2x
? We have:
/lib/modules/4.19.0+1/kernel/drivers/net/ethernet/broadcom/bnx2x/bnx2x.ko with 1.712.30-0
/lib/modules/4.19.0+1/updates/bnx2x.ko with 1.714.24
/lib/modules/4.19.0+1/override/bnx2x.ko with 1.715.0
One of them will be loaded, in the above order, depending on which are present.
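To check which of these candidate files are actually present on a host, something like the following could be used. It is only a sketch: list_bnx2x_candidates is a made-up name, the three relative paths are the ones quoted above, and it reports presence only, without asserting which file the kernel ends up loading.

```shell
#!/bin/sh
# Hypothetical helper: report which of the bnx2x.ko locations mentioned
# above exist under a module root (default: the running kernel's tree).
list_bnx2x_candidates() {
    module_root="${1:-/lib/modules/$(uname -r)}"
    for rel_path in \
        kernel/drivers/net/ethernet/broadcom/bnx2x/bnx2x.ko \
        updates/bnx2x.ko \
        override/bnx2x.ko
    do
        if [ -e "$module_root/$rel_path" ]; then
            echo "present: $rel_path"
        else
            echo "missing: $rel_path"
        fi
    done
}
```

To find out which file modprobe would resolve, modinfo -n bnx2x prints the filename and modinfo -F version bnx2x prints the version field of that same module.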
-
Could the fcoe driver be causing the issue?
dmesg:
[ 42.363389] bnx2fc: QLogic FCoE Driver bnx2fc v2.12.5 (November 16, 2018)
[ 42.371336] bnx2fc: FCoE initialized for eth1.
[ 42.371641] bnx2fc: [04]: FCOE_INIT passed
[ 42.387017] bnx2fc: FCoE initialized for eth0.
[ 42.387305] bnx2fc: [04]: FCOE_INIT passed
lsmod:
fcoe 32768 0
libfcoe 77824 2 fcoe,bnx2fc
libfc 147456 3 fcoe,bnx2fc,libfcoe
scsi_transport_fc 69632 3 fcoe,libfc,bnx2fc
-
@r1 Yup, see https://gist.github.com/vegarnilsen/dce2b5c17cf188f1fa2c7615dc6fefc4 for the modinfo and lsmod output.
@tuxen Since we're not using FibreChannel, I disabled fcoe before the latest test, see the gist above for info.