    Mellanox ConnectX-3 - Card not working

    • Pyroteq

      Hi all,

      Recently bought a Mellanox ConnectX-3 CX311A from eBay.

      I plugged it into my hypervisor (XCP-ng 8.3) but can't get the card to work. As far as I know this card is Ethernet-only, so it shouldn't require any flashing or anything like that.

      I'm using a new DAC from FS.com between the card and the 10Gb SFP+ port on my Juniper switch.

      Card is detected:

      lspci
      04:00.0 Ethernet controller: Mellanox Technologies MT27500 Family [ConnectX-3]
      
      lspci -v | grep Mellanox
      04:00.0 Ethernet controller: Mellanox Technologies MT27500 Family [ConnectX-3]
              Subsystem: Mellanox Technologies Device 0055
      

      I deleted and re-created my PCI passthrough config:

      xl pci-assignable-list
      0000:09:00.0
      0000:05:00.1
      0000:01:00.0
      0000:0a:00.0
      0000:05:00.0
      0000:01:00.1
      

      I tried a PIF scan in both the GUI and the console, but when I check the PIF list I only see the onboard NIC:

      xe pif-scan host-uuid=ff0c8a58-0feb-4fe1-8cc1-556aad1f8c75
      xe pif-list
      uuid ( RO)                  : a902c9a0-77f2-ad66-9f69-814e9bdd6413
                      device ( RO): eth0
                         MAC ( RO): d8:5e:d3:2b:23:8b
          currently-attached ( RO): true
                        VLAN ( RO): -1
                network-uuid ( RO): 5e0c47e8-abbd-bc5b-02eb-83b838560f90
                   host-uuid ( RO): ff0c8a58-0feb-4fe1-8cc1-556aad1f8c75
      
      ip a
      1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
          link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
          inet 127.0.0.1/8 scope host lo
             valid_lft forever preferred_lft forever
      2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master ovs-system state UP group default qlen 1000
          link/ether d8:5e:d3:2b:23:8b brd ff:ff:ff:ff:ff:ff
      3: ovs-system: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
          link/ether 92:b5:0c:1e:3e:cc brd ff:ff:ff:ff:ff:ff
      4: xenbr0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000
          link/ether d8:5e:d3:2b:23:8b brd ff:ff:ff:ff:ff:ff
          inet 192.168.1.13/24 brd 192.168.1.255 scope global dynamic xenbr0
             valid_lft 81730sec preferred_lft 81730sec
      5: vif1.0: <BROADCAST,MULTICAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc mq master ovs-system state UP group default qlen 1000
          link/ether fe:ff:ff:ff:ff:ff brd ff:ff:ff:ff:ff:ff
      7: vif2.0: <BROADCAST,MULTICAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc mq master ovs-system state UP group default qlen 1000
          link/ether fe:ff:ff:ff:ff:ff brd ff:ff:ff:ff:ff:ff
      9: vif3.0: <BROADCAST,MULTICAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc mq master ovs-system state UP group default qlen 1000
          link/ether fe:ff:ff:ff:ff:ff brd ff:ff:ff:ff:ff:ff
      11: vif4.0: <BROADCAST,MULTICAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc mq master ovs-system state UP group default qlen 1000
          link/ether fe:ff:ff:ff:ff:ff brd ff:ff:ff:ff:ff:ff
      
      ifconfig
      eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
              ether d8:5e:d3:2b:23:8b  txqueuelen 1000  (Ethernet)
              RX packets 53756  bytes 20202528 (19.2 MiB)
              RX errors 0  dropped 0  overruns 0  frame 0
              TX packets 27516  bytes 6309909 (6.0 MiB)
              TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
      
      lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
              inet 127.0.0.1  netmask 255.0.0.0
              loop  txqueuelen 1000  (Local Loopback)
              RX packets 9556  bytes 5978566 (5.7 MiB)
              RX errors 0  dropped 0  overruns 0  frame 0
              TX packets 9556  bytes 5978566 (5.7 MiB)
              TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
      
      vif1.0: flags=4291<UP,BROADCAST,RUNNING,NOARP,MULTICAST>  mtu 1500
              ether fe:ff:ff:ff:ff:ff  txqueuelen 1000  (Ethernet)
              RX packets 3859  bytes 336927 (329.0 KiB)
              RX errors 0  dropped 0  overruns 0  frame 0
              TX packets 29344  bytes 9261592 (8.8 MiB)
              TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
      
      vif2.0: flags=4291<UP,BROADCAST,RUNNING,NOARP,MULTICAST>  mtu 1500
              ether fe:ff:ff:ff:ff:ff  txqueuelen 1000  (Ethernet)
              RX packets 4605  bytes 937140 (915.1 KiB)
              RX errors 0  dropped 0  overruns 0  frame 0
              TX packets 28965  bytes 8527803 (8.1 MiB)
              TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
      
      vif3.0: flags=4291<UP,BROADCAST,RUNNING,NOARP,MULTICAST>  mtu 1500
              ether fe:ff:ff:ff:ff:ff  txqueuelen 1000  (Ethernet)
              RX packets 9936  bytes 1319123 (1.2 MiB)
              RX errors 0  dropped 0  overruns 0  frame 0
              TX packets 33662  bytes 7927207 (7.5 MiB)
              TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
      
      vif4.0: flags=4291<UP,BROADCAST,RUNNING,NOARP,MULTICAST>  mtu 1500
              ether fe:ff:ff:ff:ff:ff  txqueuelen 1000  (Ethernet)
              RX packets 7851  bytes 1274556 (1.2 MiB)
              RX errors 0  dropped 0  overruns 0  frame 0
              TX packets 31981  bytes 8681713 (8.2 MiB)
              TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
      
      xenbr0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
              inet 192.168.1.13  netmask 255.255.255.0  broadcast 192.168.1.255
              ether d8:5e:d3:2b:23:8b  txqueuelen 1000  (Ethernet)
              RX packets 33215  bytes 5737003 (5.4 MiB)
              RX errors 0  dropped 0  overruns 0  frame 0
              TX packets 6334  bytes 5412414 (5.1 MiB)
              TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
      

      Any ideas?
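
The checks above can be condensed into a small helper. This is only a sketch: `pif_devices` is a hypothetical name, and it assumes the `device ( RO): ethN` layout shown in the `xe pif-list` output above. If the new card's interface is missing both from this list and from `ip a`, the kernel never created a network device for it, so the problem sits below XAPI (at the driver level) and a PIF scan alone cannot fix it.

```shell
# Print the device names found in `xe pif-list` output read from stdin.
# Assumes lines of the form "   device ( RO): eth0" as in the listing above.
pif_devices() {
    sed -n 's/^[[:space:]]*device ( RO): *//p'
}
```

Typical use would be `xe pif-list | pif_devices`, which for the listing in this post prints only `eth0`.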

      • olivierlambert (Vates Co-Founder, CEO)

        Maybe your card is in FC mode only; you would need to flash it to put it in Ethernet mode (@fohdeesha could confirm).

        • Andrew (Top contributor) @Pyroteq

          @Pyroteq If your goal is to get 10G Ethernet then buy a newer, better-supported card... (like the Intel X540)

          The ConnectX-3 is rather old and odd. You can try a firmware update/reflash, as old firmware on Mellanox cards is a known problem. There are people using them with XCP-ng.

          Also, there are several versions of the Mellanox card, including dedicated FCoE.

          If you MUST use Mellanox then use the newer (but still old) ConnectX-4 cards (10/25G Ethernet).

          • Pyroteq

            I think I'm getting a bit closer...

            lspci -k
            04:00.0 Ethernet controller: Mellanox Technologies MT27500 Family [ConnectX-3]
                    Subsystem: Mellanox Technologies Device 0055
                    Kernel modules: mlx4_core
            
            lsmod | grep mlx
            mlx4_core             352256  0
            devlink                77824  1 mlx4_core
            
            dmesg | grep mlx
            [    7.304630] mlx4_core: Mellanox ConnectX core driver v4.0-0
            [    7.304646] mlx4_core: Initializing 0000:04:00.0
            [    7.304799] mlx4_core 0000:04:00.0: Missing UAR, aborting
            

            It seems the driver is failing to initialise the card. Google suggests SR-IOV could be the cause (I'm pretty sure all my virtualisation options are enabled in the BIOS), or that I might be able to get it working by adjusting the GRUB configuration.
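
As a rough way to tell this failure apart from a healthy probe, the driver messages can be classified with a small helper. This is a sketch, not an official tool: `mlx4_probe_status` and its result labels are made up for illustration, and it only looks for the strings that appear in the `dmesg` excerpts in this thread.

```shell
# Classify mlx4 driver messages read from stdin (e.g. `dmesg | grep mlx4`).
# Prints one of: missing-uar, en-loaded, core-only, not-found.
mlx4_probe_status() {
    log=$(cat)
    if printf '%s\n' "$log" | grep -q 'Missing UAR'; then
        echo "missing-uar"   # mlx4_core aborted while probing the card
    elif printf '%s\n' "$log" | grep -q 'mlx4_en'; then
        echo "en-loaded"     # the Ethernet sub-driver produced messages
    elif printf '%s\n' "$log" | grep -q 'mlx4_core'; then
        echo "core-only"     # core driver seen, but no Ethernet sub-driver
    else
        echo "not-found"     # no mlx4 messages at all
    fi
}
```

Run as `dmesg | mlx4_probe_status`; for the output quoted above it reports `missing-uar`.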

            The reason I went with this card in particular was its form factor. In this machine I've already got 2 graphics cards and a TV tuner card. I have an x1 slot (which basically nothing will fit in) and the bottom x16 slot only runs at x4 (which suits the ConnectX-3, a PCIe 3.0 x4 card).

            Supposedly another possible fix is to edit the firmware and flash a version that limits or removes the card's SR-IOV functions.

            EDIT - I took the card out of the hypervisor and installed it in my gaming rig. Different motherboard, but same chipset (AMD X570). Windows 11 picked up the card instantly without needing a driver installation. I plugged the DAC into a 10Gbit SFP+ port on the Juniper switch and Windows reports a 10Gbit link.

            So the card is definitely working properly, but XCP-ng 8.3 doesn't like it.

            • Anonabhar @Pyroteq

              @Pyroteq Just as a point of reference, I use Mellanox ConnectX-3 cards all the time in my home lab for 40G Ethernet connections to my SAN. No problems with them at all. But as everyone has mentioned before, be careful of the firmware and model. Some of the cards will only work in IB mode and others can work in IB or ETH mode.

              [09:43 xcp-ng-GHF4 ~]# lsmod | grep mlx
              mlx4_en               135168  0 
              mlx4_core             352256  1 mlx4_en
              devlink                77824  2 mlx4_core,mlx4_en
              [09:44 xcp-ng-GHF4 ~]#
              
              [09:44 xcp-ng-GHF4 ~]# dmesg | grep mlx
              [   16.566265] mlx4_core: Mellanox ConnectX core driver v4.0-0
              [   16.566299] mlx4_core: Initializing 0000:03:00.0
              [   22.903721] mlx4_core 0000:03:00.0: DMFS high rate steer mode is: disabled performance optimized steering
              [   22.910404] mlx4_core 0000:03:00.0: 63.008 Gb/s available PCIe bandwidth (8 GT/s x8 link)
              [   23.122348] mlx4_en: Mellanox ConnectX HCA Ethernet driver v4.0-0
              [   23.122566] mlx4_en 0000:03:00.0: Activating port:1
              [   23.127040] mlx4_en: 0000:03:00.0: Port 1: Using 16 TX rings
              [   23.127042] mlx4_en: 0000:03:00.0: Port 1: Using 16 RX rings
              [   23.127454] mlx4_en: 0000:03:00.0: Port 1: Initializing port
              [   23.128547] mlx4_en 0000:03:00.0: registered PHC clock
              [   23.128902] mlx4_en 0000:03:00.0: Activating port:2
              [   23.131517] mlx4_en: 0000:03:00.0: Port 2: Using 16 TX rings
              [   23.131518] mlx4_en: 0000:03:00.0: Port 2: Using 16 RX rings
              [   23.131678] mlx4_en: 0000:03:00.0: Port 2: Initializing port
              [   27.210288] mlx4_core 0000:03:00.0 side-9894-eth2: renamed from eth2
              [   27.244783] mlx4_core 0000:03:00.0 side-701-eth3: renamed from eth3
              [   35.418878] mlx4_core 0000:03:00.0 eth0: renamed from side-9894-eth2
              [   35.582138] mlx4_core 0000:03:00.0 eth4: renamed from side-701-eth3
              [   38.171360] mlx4_en: eth4: Steering Mode 1
              [   38.189737] mlx4_en: eth4: Link Up
              [   38.726934] mlx4_en: eth0: Steering Mode 1
              [   38.744587] mlx4_en: eth0: Link Up
              [   41.586228] mlx4_en: eth4: Steering Mode 1
              [   41.756191] mlx4_en: eth0: Steering Mode 1
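
One visible difference between this working host and the failing one earlier in the thread is that `mlx4_en` shows up in `lsmod` here, while the failing host lists only `mlx4_core`. A minimal sketch of that check follows; `module_loaded` is a hypothetical helper name, and it matches only the first column so that a module listed as a dependency of another one does not count.

```shell
# Succeed if the named module appears in the first column of `lsmod`
# output read from stdin; fail otherwise.
module_loaded() {
    awk -v mod="$1" '$1 == mod { found = 1 } END { exit !found }'
}
```

For example: `lsmod | module_loaded mlx4_en && echo "mlx4_en is loaded"`.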
              
              • Pyroteq @Anonabhar

                @Anonabhar

                I edited my post just as you posted. I tried the card on Windows 11 and it worked instantly. This particular card doesn't support InfiniBand; it's Ethernet-only, so I don't believe that should be relevant in this case. It's the CX311A EN model.

                The problem seems to lie in the drivers on XCP-ng or in the firmware of the card.

                I'm not sure if I should be upgrading or downgrading the firmware or, as some people suggested, editing the firmware and limiting some aspects of the card to try to get it working.

                I saw some posts on the Red Hat community support page about setting pci=realloc=off in GRUB, but wouldn't I want this on for a hypervisor?

                • olivierlambert (Vates Co-Founder, CEO)

                  I'm not aware of any past issues with those cards (except IB vs Ethernet) 🤔 So that's weird.

                  • Anonabhar @Pyroteq

                    @Pyroteq I have seen others mention that it's a lack of physical resources being allocated to the card.

                    As for the firmware, I think fw-ConnectX3-rel-2_42 is the latest for that series of card.
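
One way to look into the resource-allocation theory is to check whether any of the card's PCI memory regions (BARs) were left without an address. The sketch below assumes that `lspci -vv` marks such regions with the text `<unassigned>`, and that the "Missing UAR" message relates to one of these regions not being usable; both points are assumptions to verify on the affected host. `unassigned_bars` is a hypothetical helper name.

```shell
# Print the "Region N:" lines that lspci reports as unassigned, reading
# `lspci -vv -s <slot>` output from stdin. Prints nothing if every region
# received an address.
unassigned_bars() {
    grep -E '^[[:space:]]*Region [0-9]+:.*<unassigned>' | sed 's/^[[:space:]]*//'
}
```

For the card in this thread that would be `lspci -vv -s 04:00.0 | unassigned_bars`, run as root so the full device details are visible.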

                    • Pyroteq @Anonabhar

                      @Anonabhar It seems to be an issue with the machine itself.

                      I tested the card in both my gaming rig and my server.

                      Both use the AMD X570 chipset.

                      I booted an Ubuntu 22.04 live USB on each system. On my gaming rig it worked straight away without issue: I connected the DAC to the switch and it reported a 10Gbit connection. On the server I got the same "Missing UAR" error in dmesg, and the card didn't show up as a network interface but appeared in lspci just as before.

                      Either I'm missing a setting hidden away in the BIOS or there's some very weird hardware conflict or incompatibility.

                      I'm installing Ubuntu to a portable SSD now and then I'll try out the firmware tools and see if I have any luck.

                      • olivierlambert (Vates Co-Founder, CEO)

                        Good idea, I have the impression there's something weird about this setup 🤔

                        • Pyroteq @olivierlambert

                          @olivierlambert

                          In the end the solution was so simple... I was up until 2am trying to fix this: updating the firmware, messing up GRUB on my XCP-ng installation and having to recover from an XCP-ng 8.2 backup, and wasting hours troubleshooting the Mellanox firmware tools. (During installation you have to enter your password multiple times and there's no prompt or anything. No idea how someone discovered that workaround.)

                          I saw lots of people mentioning adding pci=realloc=off to kernel boot parameters and I tried that without success.

                          Just for the lolz I figured what if I turned it on instead?

                          pci=realloc=on
                          
                          dmesg | grep UAR
                          

                          The "Missing UAR" error from before is gone... Seems promising...
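
Before drawing conclusions from `dmesg`, it can help to confirm that the running kernel actually booted with the parameter. A minimal sketch, with `has_kernel_param` as a hypothetical helper name; it matches whole words only, so `pci=realloc=off` does not count as `pci=realloc=on`.

```shell
# Succeed if the exact parameter (e.g. pci=realloc=on) is one of the
# whitespace-separated words of a kernel command line read from stdin.
has_kernel_param() {
    tr -s ' \t' '\n\n' | grep -qxF -- "$1"
}
```

Usage would be `has_kernel_param pci=realloc=on < /proc/cmdline && echo "active"`. The post does not say how the parameter was added. On XCP-ng, dom0 kernel parameters are, as far as I know, usually managed with the `/opt/xensource/libexec/xen-cmdline` helper rather than by editing the GRUB configuration by hand, but treat that as an assumption and check the XCP-ng documentation first.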

                          lspci -k | grep -A 2 Mellanox
                          04:00.0 Ethernet controller: Mellanox Technologies MT27500 Family [ConnectX-3]
                                Subsystem: Mellanox Technologies Device 0055
                                Kernel driver in use: mlx4_core
                                Kernel modules: mlx4_core
                          

                          The kernel driver is in use now; before, it wasn't.
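
The before/after difference is the presence of the "Kernel driver in use:" line under the device. A small sketch that extracts it is shown here; `driver_in_use` is a hypothetical helper name, and it assumes the `lspci -k` layout shown in this thread (an unindented device line followed by indented detail lines).

```shell
# Print the kernel driver bound to the first device whose description
# matches the given pattern, reading `lspci -k` output from stdin.
# Prints "none" if the device has no "Kernel driver in use:" line and
# fails (exit status 1) if no device matches.
driver_in_use() {
    awk -v pat="$1" '
        /^[0-9a-f]/ {                       # unindented line: a new device starts
            if (seen) exit                  # stop after the matched device
            if ($0 ~ pat) seen = 1
            next
        }
        seen && /Kernel driver in use:/ {
            sub(/.*Kernel driver in use: */, "")
            drv = $0
        }
        END {
            if (!seen) exit 1
            if (drv == "") drv = "none"
            print drv
        }
    '
}
```

For example, `lspci -k | driver_in_use Mellanox` prints `mlx4_core` for the output above, and would have printed `none` for the earlier `lspci -k` output in this thread.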

                          I rescanned the PIFs in XOA and I can see it! It shows as connected at 10Gbit, gets an IP address via DHCP, etc.

                          Everything seems to be working...

                          Now I've just gotta test that all my other PCI-e devices are passing through ok with this option enabled.

                          • olivierlambert (Vates Co-Founder, CEO)

                            That's weird, I've never heard of this option being needed to make it work. Does it ring a bell, @fohdeesha?
