
    XCP-ng 8.1 host loses network when running gateway/firewall VMs

      vegarnilsen

      We are migrating our VMs from a XenServer 6.5 pool to a new pool running XCP-ng 8.1. After we migrated some VMs that act as gateways/firewalls for internal networks, the hosts those VMs run on lose network connectivity within a few minutes, at times within seconds, of the VM booting. (The host itself keeps running, and if I log in on the console everything except networking works.)

      The new pool is running on HP BL460c Gen8 blades, with 10Gb Flexfabric NICs, using the bnx2x driver.

      When the host loses network these messages appear in kern.log:

      [09:34 oslo5pool3h03 log]$ sudo grep bnx2x kern.log | grep timeout
      Nov  9 10:24:52 oslo5pool3h03 kernel: [ 1537.425714] bnx2x: [bnx2x_stats_comp:211(eth0)]timeout waiting for stats finished
      Nov  9 10:24:54 oslo5pool3h03 kernel: [ 1538.584055] bnx2x: [bnx2x_stats_comp:211(eth0)]timeout waiting for stats finished
      Nov  9 10:25:24 oslo5pool3h03 kernel: [ 1568.785236] bnx2x: [bnx2x_stats_comp:211(eth0)]timeout waiting for stats finished
      Nov  9 10:25:25 oslo5pool3h03 kernel: [ 1569.940934] bnx2x: [bnx2x_stats_comp:211(eth0)]timeout waiting for stats finished
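
To see how often each interface hits this timeout, the grep above can be extended into a per-interface count. This is only a sketch: `count_bnx2x_timeouts` is a hypothetical helper name, and it assumes the log lines have exactly the format shown above.

```shell
#!/bin/sh
# Hypothetical helper: count bnx2x stats timeouts per interface.
# Assumes lines shaped like:
#   bnx2x: [bnx2x_stats_comp:211(eth0)]timeout waiting for stats finished
# $1 = path to a kernel log file
count_bnx2x_timeouts() {
    grep 'bnx2x_stats_comp' "$1" \
        | grep 'timeout waiting for stats' \
        | sed -n 's/.*bnx2x_stats_comp:[0-9]*(\([^)]*\)).*/\1/p' \
        | sort | uniq -c | awk '{ print $2, $1 }'
}
```

Calling `count_bnx2x_timeouts kern.log` prints one line per interface in the form `<interface> <count>`, which makes it easier to tell whether one port or both are affected.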
      

      Some pages I found through Google hint at the IOMMU being the problem. I tried disabling the IOMMU through kernel parameters in GRUB; with that change, the host rebooted immediately when the test VM triggered the problem.
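
When testing IOMMU settings, it helps to confirm which IOMMU-related options the running system actually booted with. The following is a minimal sketch; `iommu_args` is a hypothetical helper that simply filters a command-line string, and it does not know which options a given kernel or hypervisor honours.

```shell
#!/bin/sh
# Hypothetical helper: print the IOMMU-related tokens of a boot command line.
# $1 = the command line as a single string
iommu_args() {
    # Split on whitespace, keep any token that mentions "iommu".
    printf '%s\n' "$1" | tr -s ' \t' '\n' | grep -i 'iommu' || true
}
```

On the host this could be run as `iommu_args "$(cat /proc/cmdline)"` for the dom0 kernel. On a Xen-based host the hypervisor has its own command line (shown in the `xen_commandline` field of `xl info`), so it may be worth checking both; whether the option belongs on the kernel or the hypervisor line in this setup is an assumption to verify.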

      The NICs appear to be on the XenServer HCL. Since these are blade servers, I can't swap the NICs for a different chipset: all HPE NICs for this blade generation use the same chipset.

      "Regular" VMs are working fine, but VMs with multiple virtual NICs where there's traffic going from one interface to another seem to reliably crash the host.

      I've applied all available updates to XCP-ng, but this didn't make any difference.

      Since we're not using Fibre Channel, I tried disabling that module. I also tried disabling some offloading:

      [09:40 oslo5pool3h03 log]$ cat /etc/modprobe.d/qlogic-netxtreme2.conf 
      options bnx2x num_vfs=0
      options bnx2x disable_tpa=1
      
      [09:40 oslo5pool3h03 log]$ cat /etc/modprobe.d/blacklist-fc.conf 
      blacklist bnx2fc
      

      Neither of these made any difference.
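
To double-check which options a modprobe.d file really sets for a module, the file can be parsed directly. This is a sketch with a hypothetical helper name, `module_options`; it only reads `options` lines and ignores everything else.

```shell
#!/bin/sh
# Hypothetical helper: list the options configured for one module.
# $1 = module name, remaining arguments = modprobe.d files to scan
module_options() {
    mod=$1
    shift
    awk -v mod="$mod" '
        $1 == "options" && $2 == mod {
            for (i = 3; i <= NF; i++) print $i
        }' "$@"
}
```

For example, `module_options bnx2x /etc/modprobe.d/qlogic-netxtreme2.conf` would print each `name=value` pair on its own line. If the driver exposes its parameters under `/sys/module/bnx2x/parameters/`, the values there can be compared with the file to confirm they took effect. Note also that a `blacklist` line only prevents alias-based autoloading, so `lsmod` is the more reliable check for whether bnx2fc is still loaded.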

        olivierlambert Vates 🪐 Co-Founder CEO

        Ping @r1 and/or @fohdeesha

          r1 XCP-ng Team @olivierlambert

          @vegarnilsen Can you share the output of # dmesg and # lsmod? We may have to try a different version of the driver to fix this.

          Please also share the output of # rpm -qa | grep bnx.

            vegarnilsen @r1

            @r1 Sure, take a look here:

            https://gist.github.com/vegarnilsen/65409692fae1430efd5422860e489ef2

              r1 XCP-ng Team @vegarnilsen

              @vegarnilsen Ok, that was helpful.

              Can you try installing broadcom-bnxt-en-alt.x86_64 and report what you observe? You will need to reboot afterwards.

                fohdeesha Vates 🪐 Pro Support Team @r1

                Indeed, this looks like yet another Broadcom driver/firmware issue (not uncommon).

                  vegarnilsen @r1

                  @r1 We're not using the bnxt_en driver; we're using the bnx2x driver. Given your request, though, I looked for and installed the alternate QLogic driver:

                  [10:29 oslo5pool3h03 etc]$ rpm -qa | grep qlogic
                  qlogic-qla2xxx-firmware-8.03.02-1.xcpng8.1.x86_64
                  qlogic-netxtreme2-4.19.0+1-modules-7.14.53-1.1.xcpng8.1.x86_64
                  qlogic-qla2xxx-10.01.00.54.80.0_k-1.xcpng8.1.x86_64
                  qlogic-fastlinq-8.37.30.0-3.xcpng8.1.x86_64
                  qlogic-netxtreme2-7.14.53-1.1.xcpng8.1.x86_64
                  [10:29 oslo5pool3h03 etc]$ rpm -qil qlogic-netxtreme2-4.19.0+1-modules-7.14.53-1.1.xcpng8.1.x86_64
                  Name        : qlogic-netxtreme2-4.19.0+1-modules
                  Version     : 7.14.53
                  Release     : 1.1.xcpng8.1
                  Architecture: x86_64
                  Install Date: Tue 22 Sep 2020 06:04:01 PM CEST
                  Group       : System Environment/Kernel
                  Size        : 3048296
                  License     : GPL
                  Signature   : RSA/SHA1, Wed 12 Feb 2020 01:27:25 PM CET, Key ID cd75783a3fd3ac9e
                  Source RPM  : qlogic-netxtreme2-7.14.53-1.1.xcpng8.1.src.rpm
                  Build Date  : Wed 12 Feb 2020 01:13:59 PM CET
                  Build Host  : koji.xcp-ng.org
                  Relocations : (not relocatable)
                  Packager    : XCP-ng
                  Vendor      : XCP-ng
                  Summary     : Qlogic netxtreme2 device drivers
                  Description :
                  Qlogic netxtreme2 device drivers for the Linux Kernel
                  version 4.19.0+1.
                  /etc/modprobe.d/qlogic-netxtreme2.conf
                  /lib/modules/4.19.0+1/updates/bnx2.ko
                  /lib/modules/4.19.0+1/updates/bnx2fc.ko
                  /lib/modules/4.19.0+1/updates/bnx2i.ko
                  /lib/modules/4.19.0+1/updates/bnx2x.ko
                  /lib/modules/4.19.0+1/updates/cnic.ko
                  [10:29 oslo5pool3h03 etc]$ yum search qlogic
                  Loaded plugins: fastestmirror
                  Loading mirror speeds from cached hostfile
                  Excluding mirror: updates.xcp-ng.org
                   * xcp-ng-base: mirrors.xcp-ng.org
                  Excluding mirror: updates.xcp-ng.org
                   * xcp-ng-updates: mirrors.xcp-ng.org
                  ====================================================== N/S matched: qlogic =======================================================
                  qlogic-fastlinq.x86_64 : Qlogic fastlinq device drivers
                  qlogic-fastlinq-debuginfo.x86_64 : Debug information for package qlogic-fastlinq
                  qlogic-netxtreme2.x86_64 : Qlogic NetXtreme II iSCSI, 1-Gigabit and 10-Gigabit ethernet drivers
                  qlogic-netxtreme2-4.19.0+1-modules.x86_64 : Qlogic netxtreme2 device drivers
                  qlogic-netxtreme2-alt.x86_64 : Qlogic NetXtreme II iSCSI, 1-Gigabit and 10-Gigabit ethernet drivers
                  qlogic-netxtreme2-alt-4.19.0+1-modules.x86_64 : Qlogic netxtreme2 device drivers
                  qlogic-netxtreme2-alt-debuginfo.x86_64 : Debug information for package qlogic-netxtreme2-alt
                  qlogic-netxtreme2-debuginfo.x86_64 : Debug information for package qlogic-netxtreme2
                  qlogic-qla2xxx.x86_64 : Qlogic qla2xxx device drivers
                  qlogic-qla2xxx-debuginfo.x86_64 : Debug information for package qlogic-qla2xxx
                  qlogic-qla2xxx-firmware.x86_64 : Qlogic qla2xxx firmware
                  qlogic-qla2xxx-firmware-debuginfo.x86_64 : Debug information for package qlogic-qla2xxx-firmware
                  
                    Name and summary matches only, use "search all" for everything.
                  [10:30 oslo5pool3h03 etc]$ yum info qlogic-netxtreme2-alt-4.19.0+1-modules.x86_64
                  Loaded plugins: fastestmirror
                  Loading mirror speeds from cached hostfile
                  Excluding mirror: updates.xcp-ng.org
                   * xcp-ng-base: mirrors.xcp-ng.org
                  Excluding mirror: updates.xcp-ng.org
                   * xcp-ng-updates: mirrors.xcp-ng.org
                  Available Packages
                  Name        : qlogic-netxtreme2-alt-4.19.0+1-modules
                  Arch        : x86_64
                  Version     : 7.14.63
                  Release     : 2.xcpng8.1
                  Size        : 1.2 M
                  Repo        : xcp-ng-base
                  Summary     : Qlogic netxtreme2 device drivers
                  License     : GPL
                  Description : Qlogic netxtreme2 device drivers for the Linux Kernel
                              : version 4.19.0+1.
                  
                  [10:30 oslo5pool3h03 etc]$ sudo yum install qlogic-netxtreme2-alt-4.19.0+1-modules.x86_64
                  Loaded plugins: fastestmirror
                  Loading mirror speeds from cached hostfile
                  Excluding mirror: updates.xcp-ng.org
                   * xcp-ng-base: mirrors.xcp-ng.org
                  Excluding mirror: updates.xcp-ng.org
                   * xcp-ng-updates: mirrors.xcp-ng.org
                  Resolving Dependencies
                  --> Running transaction check
                  ---> Package qlogic-netxtreme2-alt-4.19.0+1-modules.x86_64 0:7.14.63-2.xcpng8.1 will be installed
                  --> Finished Dependency Resolution
                  
                  Dependencies Resolved
                  
                  ==================================================================================================================================
                   Package                                           Arch              Version                         Repository              Size
                  ==================================================================================================================================
                  Installing:
                   qlogic-netxtreme2-alt-4.19.0+1-modules            x86_64            7.14.63-2.xcpng8.1              xcp-ng-base            1.2 M
                  
                  Transaction Summary
                  ==================================================================================================================================
                  Install  1 Package
                  
                  Total download size: 1.2 M
                  Installed size: 2.9 M
                  Is this ok [y/d/N]: y
                  Downloading packages:
                  qlogic-netxtreme2-alt-4.19.0+1-modules-7.14.63-2.xcpng8.1.x86_64.rpm                                       | 1.2 MB  00:00:00     
                  Running transaction check
                  Running transaction test
                  Transaction test succeeded
                  Running transaction
                    Installing : qlogic-netxtreme2-alt-4.19.0+1-modules-7.14.63-2.xcpng8.1.x86_64                                               1/1 
                    Verifying  : qlogic-netxtreme2-alt-4.19.0+1-modules-7.14.63-2.xcpng8.1.x86_64                                               1/1 
                  
                  Installed:
                    qlogic-netxtreme2-alt-4.19.0+1-modules.x86_64 0:7.14.63-2.xcpng8.1                                                              
                  
                  Complete!
                  [10:32 oslo5pool3h03 etc]$ 
                  

                  I rebooted the server, booted a couple of the VMs I'm having issues with, and then ran ping from one of the internal servers to an external site:

                  64 bytes from www.vg.no (195.88.54.16): icmp_seq=1338 ttl=248 time=2.88 ms
                  64 bytes from www.vg.no (195.88.54.16): icmp_seq=1339 ttl=248 time=3.04 ms
                  64 bytes from www.vg.no (195.88.54.16): icmp_seq=1340 ttl=248 time=3.17 ms
                  64 bytes from www.vg.no (195.88.54.16): icmp_seq=1341 ttl=248 time=2.91 ms
                  client_loop: send disconnect: Broken pipe
                  client_loop: send disconnect: Broken pipe
                  

                  However, as you can see, this still took the host down after a while, leaving it with no network.
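
The last reply before the disconnect gives a rough idea of how long the host survived. As a sketch (the helper name `last_icmp_seq` is hypothetical), the highest sequence number can be pulled out of saved ping output like this:

```shell
#!/bin/sh
# Hypothetical helper: print the highest icmp_seq seen in ping output.
# Reads ping output on stdin.
last_icmp_seq() {
    sed -n 's/.*icmp_seq=\([0-9][0-9]*\).*/\1/p' | sort -n | tail -n 1
}
```

Usage would be something like `last_icmp_seq < ping.log`. Assuming ping ran with its default one-second interval, a last sequence number of 1341 would correspond to roughly 22 minutes before the network dropped; with a different interval the estimate changes accordingly.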

                    r1 XCP-ng Team @vegarnilsen

                    @vegarnilsen Thanks, you got the correct one.

                    Can you share the output of # modinfo bnx2x?

                    We have three candidate modules:
                    /lib/modules/4.19.0+1/kernel/drivers/net/ethernet/broadcom/bnx2x/bnx2x.ko with 1.712.30-0
                    /lib/modules/4.19.0+1/updates/bnx2x.ko with 1.714.24
                    /lib/modules/4.19.0+1/override/bnx2x.ko with 1.715.0

                    One of them will be loaded, following the order above, depending on which of them are present.
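
To see which of these copies exist on a given host, the three directories can be scanned directly. This is a sketch only: `list_module_candidates` is a hypothetical helper, the directory names are taken from the paths listed above, and it reports which files are present without deciding which one takes precedence.

```shell
#!/bin/sh
# Hypothetical helper: list copies of a module under a module tree.
# $1 = module tree, e.g. /lib/modules/4.19.0+1
# $2 = module name without the .ko suffix, e.g. bnx2x
list_module_candidates() {
    for sub in kernel updates override; do
        # Skip subdirectories that do not exist on this host.
        [ -d "$1/$sub" ] || continue
        find "$1/$sub" -name "$2.ko" -type f
    done
}
```

For example, `list_module_candidates "/lib/modules/$(uname -r)" bnx2x` prints every matching file. To confirm what is actually in use, `modinfo -n bnx2x` shows the file modprobe would pick, `modinfo -F version <file>` shows the version of a specific file, and `/sys/module/bnx2x/version` shows the version of the module currently loaded.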

                      tuxen Top contributor @r1

                      Could the FCoE driver be causing the issue?

                      dmesg:

                      [   42.363389] bnx2fc: QLogic FCoE Driver bnx2fc v2.12.5 (November 16, 2018)
                      [   42.371336] bnx2fc: FCoE initialized for eth1.
                      [   42.371641] bnx2fc: [04]: FCOE_INIT passed
                      [   42.387017] bnx2fc: FCoE initialized for eth0.
                      [   42.387305] bnx2fc: [04]: FCOE_INIT passed
                      

                      lsmod:

                      fcoe                   32768  0 
                      libfcoe                77824  2 fcoe,bnx2fc
                      libfc                 147456  3 fcoe,bnx2fc,libfcoe
                      scsi_transport_fc      69632  3 fcoe,libfc,bnx2fc
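
A quick way to check whether any of these Fibre Channel modules are still loaded after a change is to filter the lsmod output. This is a sketch: `fc_modules` is a hypothetical helper, and the module list is just the set of names that appear in the dmesg and lsmod excerpts above.

```shell
#!/bin/sh
# Hypothetical helper: print FC/FCoE-related modules found in lsmod output.
# Reads lsmod-style text on stdin; only the first column (module name) is matched.
fc_modules() {
    awk '$1 ~ /^(bnx2fc|fcoe|libfcoe|libfc|scsi_transport_fc)$/ { print $1 }'
}
```

Running `lsmod | fc_modules` on the host prints nothing when none of them are loaded. In the excerpt above, bnx2fc shows up in the "used by" column of the other modules, which suggests it was loaded when that output was captured.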
                      
                        vegarnilsen @r1

                        @r1 Yup, see https://gist.github.com/vegarnilsen/dce2b5c17cf188f1fa2c7615dc6fefc4 for the modinfo and lsmod output.

                        @tuxen Since we're not using Fibre Channel, I disabled FCoE before the latest test; see the gist above for details.
