XCP-ng

    PCIe USB card (and PCIe bridge) disappear after host reboot

      nvs

      Hi,

      I have an Asus X570-Pro with a Ryzen 9 5950X CPU. After a reboot of the machine, one of my PCIe devices (always one of the USB PCIe cards) is no longer detected. Once gone, it stays gone across reboots. To get it back, I usually have to remove a card, power up the server, power it down again and plug the card back in. Even then, the card is gone again after the next reboot.

      I have the following cards installed in its 6 PCIe slots:

      • 4 port USB card (running x1)
      • 7 port USB card (running x1)
      • 10Gbit network card (should be running x8)
      • 7 port USB card (running x1)
      • 7 port USB card (running x1)
      • Nvidia Quadro K2200 GPU (should be running x8)
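Whether each card actually negotiated the link width listed above can be checked from the running system rather than assumed. The following is a minimal sketch, assuming a Linux kernel that exposes the `current_link_width` and `max_link_width` attributes under `/sys/bus/pci/devices` (older kernels may not). The function names are made up for illustration. Note that `max_link_width` reflects what the device itself supports, so an x8 card in a slot that is only wired for fewer lanes will also show up here.

```python
from pathlib import Path


def read_link_widths(sysfs_root="/sys/bus/pci/devices"):
    """Return {pci_address: (current_width, max_width)} for devices that report both.

    Assumes the Linux sysfs PCI layout; devices without the link width
    attributes (or with unreadable values) are skipped.
    """
    root = Path(sysfs_root)
    results = {}
    if not root.is_dir():
        return results
    for dev in sorted(root.iterdir()):
        try:
            current = int((dev / "current_link_width").read_text().strip())
            maximum = int((dev / "max_link_width").read_text().strip())
        except (OSError, ValueError):
            continue
        results[dev.name] = (current, maximum)
    return results


def degraded_links(widths):
    """Keep only devices whose negotiated width is below what the device supports."""
    return {addr: cm for addr, cm in widths.items() if 0 < cm[0] < cm[1]}


if __name__ == "__main__":
    for address, (current, maximum) in degraded_links(read_link_widths()).items():
        print(f"{address}: running x{current}, device supports x{maximum}")
```

A card that trains at a lower width than expected is not necessarily faulty, but a width that changes between boots would be worth noting.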

      This is the lspci output when all cards are detected correctly:

      [23:10 localhost ~]# lspci
      00:00.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse Root Complex
      00:00.2 IOMMU: Advanced Micro Devices, Inc. [AMD] Starship/Matisse IOMMU
      00:01.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge
      00:01.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse GPP Bridge
      00:01.2 PCI bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse GPP Bridge
      00:02.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge
      00:03.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge
      00:03.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse GPP Bridge
      00:03.2 PCI bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse GPP Bridge
      00:04.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge
      00:05.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge
      00:07.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge
      00:07.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse Internal PCIe GPP Bridge 0 to bus[E:B]
      00:08.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge
      00:08.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse Internal PCIe GPP Bridge 0 to bus[E:B]
      00:14.0 SMBus: Advanced Micro Devices, Inc. [AMD] FCH SMBus Controller (rev 61)
      00:14.3 ISA bridge: Advanced Micro Devices, Inc. [AMD] FCH LPC Bridge (rev 51)
      00:18.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Matisse Device 24: Function 0
      00:18.1 Host bridge: Advanced Micro Devices, Inc. [AMD] Matisse Device 24: Function 1
      00:18.2 Host bridge: Advanced Micro Devices, Inc. [AMD] Matisse Device 24: Function 2
      00:18.3 Host bridge: Advanced Micro Devices, Inc. [AMD] Matisse Device 24: Function 3
      00:18.4 Host bridge: Advanced Micro Devices, Inc. [AMD] Matisse Device 24: Function 4
      00:18.5 Host bridge: Advanced Micro Devices, Inc. [AMD] Matisse Device 24: Function 5
      00:18.6 Host bridge: Advanced Micro Devices, Inc. [AMD] Matisse Device 24: Function 6
      00:18.7 Host bridge: Advanced Micro Devices, Inc. [AMD] Matisse Device 24: Function 7
      01:00.0 Non-Volatile memory controller: Kingston Technology Company, Inc. Device 5013 (rev 01)
      02:00.0 PCI bridge: Advanced Micro Devices, Inc. [AMD] Matisse Switch Upstream
      03:01.0 PCI bridge: Advanced Micro Devices, Inc. [AMD] Matisse PCIe GPP Bridge
      03:02.0 PCI bridge: Advanced Micro Devices, Inc. [AMD] Matisse PCIe GPP Bridge
      03:03.0 PCI bridge: Advanced Micro Devices, Inc. [AMD] Matisse PCIe GPP Bridge
      03:04.0 PCI bridge: Advanced Micro Devices, Inc. [AMD] Matisse PCIe GPP Bridge
      03:05.0 PCI bridge: Advanced Micro Devices, Inc. [AMD] Matisse PCIe GPP Bridge
      03:06.0 PCI bridge: Advanced Micro Devices, Inc. [AMD] Matisse PCIe GPP Bridge
      03:08.0 PCI bridge: Advanced Micro Devices, Inc. [AMD] Matisse PCIe GPP Bridge
      03:09.0 PCI bridge: Advanced Micro Devices, Inc. [AMD] Matisse PCIe GPP Bridge
      03:0a.0 PCI bridge: Advanced Micro Devices, Inc. [AMD] Matisse PCIe GPP Bridge
      04:00.0 Non-Volatile memory controller: Kingston Technology Company, Inc. Device 5013 (rev 01)
      05:00.0 USB controller: Renesas Technology Corp. uPD720201 USB 3.0 Host Controller (rev 03)
      06:00.0 USB controller: Renesas Technology Corp. uPD720201 USB 3.0 Host Controller (rev 03)
      07:00.0 USB controller: Renesas Technology Corp. uPD720201 USB 3.0 Host Controller (rev 03)
      08:00.0 Ethernet controller: Intel Corporation I211 Gigabit Network Connection (rev 03)
      09:00.0 USB controller: Renesas Technology Corp. uPD720201 USB 3.0 Host Controller (rev 03)
      0a:00.0 Non-Essential Instrumentation [1300]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse Reserved SPP
      0a:00.1 USB controller: Advanced Micro Devices, Inc. [AMD] Matisse USB 3.0 Host Controller
      0a:00.3 USB controller: Advanced Micro Devices, Inc. [AMD] Matisse USB 3.0 Host Controller
      0b:00.0 SATA controller: Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode] (rev 51)
      0c:00.0 SATA controller: Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode] (rev 51)
      0d:00.0 VGA compatible controller: NVIDIA Corporation GM107GL [Quadro K2200] (rev a2)
      0d:00.1 Audio device: NVIDIA Corporation GM107 High Definition Audio Controller [GeForce 940MX] (rev a1)
      0e:00.0 Ethernet controller: Intel Corporation 82599 10 Gigabit Network Connection (rev 01)
      10:00.0 Non-Essential Instrumentation [1300]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Function
      11:00.0 Non-Essential Instrumentation [1300]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse Reserved SPP
      11:00.1 Encryption controller: Advanced Micro Devices, Inc. [AMD] Starship/Matisse Cryptographic Coprocessor PSPCPP
      11:00.3 USB controller: Advanced Micro Devices, Inc. [AMD] Matisse USB 3.0 Host Controller
      11:00.4 Audio device: Advanced Micro Devices, Inc. [AMD] Starship/Matisse HD Audio Controller
      

      And this is the lspci output after the reboot when one of the PCIe USB cards disappears.

      [23:17 localhost ~]# lspci
      00:00.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse Root Complex
      00:00.2 IOMMU: Advanced Micro Devices, Inc. [AMD] Starship/Matisse IOMMU
      00:01.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge
      00:01.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse GPP Bridge
      00:01.2 PCI bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse GPP Bridge
      00:02.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge
      00:03.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge
      00:03.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse GPP Bridge
      00:03.2 PCI bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse GPP Bridge
      00:04.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge
      00:05.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge
      00:07.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge
      00:07.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse Internal PCIe GPP Bridge 0 to bus[E:B]
      00:08.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge
      00:08.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse Internal PCIe GPP Bridge 0 to bus[E:B]
      00:14.0 SMBus: Advanced Micro Devices, Inc. [AMD] FCH SMBus Controller (rev 61)
      00:14.3 ISA bridge: Advanced Micro Devices, Inc. [AMD] FCH LPC Bridge (rev 51)
      00:18.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Matisse Device 24: Function 0
      00:18.1 Host bridge: Advanced Micro Devices, Inc. [AMD] Matisse Device 24: Function 1
      00:18.2 Host bridge: Advanced Micro Devices, Inc. [AMD] Matisse Device 24: Function 2
      00:18.3 Host bridge: Advanced Micro Devices, Inc. [AMD] Matisse Device 24: Function 3
      00:18.4 Host bridge: Advanced Micro Devices, Inc. [AMD] Matisse Device 24: Function 4
      00:18.5 Host bridge: Advanced Micro Devices, Inc. [AMD] Matisse Device 24: Function 5
      00:18.6 Host bridge: Advanced Micro Devices, Inc. [AMD] Matisse Device 24: Function 6
      00:18.7 Host bridge: Advanced Micro Devices, Inc. [AMD] Matisse Device 24: Function 7
      01:00.0 Non-Volatile memory controller: Kingston Technology Company, Inc. Device 5013 (rev 01)
      02:00.0 PCI bridge: Advanced Micro Devices, Inc. [AMD] Matisse Switch Upstream
      03:01.0 PCI bridge: Advanced Micro Devices, Inc. [AMD] Matisse PCIe GPP Bridge
      03:02.0 PCI bridge: Advanced Micro Devices, Inc. [AMD] Matisse PCIe GPP Bridge
      03:03.0 PCI bridge: Advanced Micro Devices, Inc. [AMD] Matisse PCIe GPP Bridge
      03:05.0 PCI bridge: Advanced Micro Devices, Inc. [AMD] Matisse PCIe GPP Bridge
      03:06.0 PCI bridge: Advanced Micro Devices, Inc. [AMD] Matisse PCIe GPP Bridge
      03:08.0 PCI bridge: Advanced Micro Devices, Inc. [AMD] Matisse PCIe GPP Bridge
      03:09.0 PCI bridge: Advanced Micro Devices, Inc. [AMD] Matisse PCIe GPP Bridge
      03:0a.0 PCI bridge: Advanced Micro Devices, Inc. [AMD] Matisse PCIe GPP Bridge
      04:00.0 Non-Volatile memory controller: Kingston Technology Company, Inc. Device 5013 (rev 01)
      05:00.0 USB controller: Renesas Technology Corp. uPD720201 USB 3.0 Host Controller (rev 03)
      06:00.0 USB controller: Renesas Technology Corp. uPD720201 USB 3.0 Host Controller (rev 03)
      07:00.0 Ethernet controller: Intel Corporation I211 Gigabit Network Connection (rev 03)
      08:00.0 USB controller: Renesas Technology Corp. uPD720201 USB 3.0 Host Controller (rev 03)
      09:00.0 Non-Essential Instrumentation [1300]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse Reserved SPP
      09:00.1 USB controller: Advanced Micro Devices, Inc. [AMD] Matisse USB 3.0 Host Controller
      09:00.3 USB controller: Advanced Micro Devices, Inc. [AMD] Matisse USB 3.0 Host Controller
      0a:00.0 SATA controller: Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode] (rev 51)
      0b:00.0 SATA controller: Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode] (rev 51)
      0c:00.0 VGA compatible controller: NVIDIA Corporation GM107GL [Quadro K2200] (rev a2)
      0c:00.1 Audio device: NVIDIA Corporation GM107 High Definition Audio Controller [GeForce 940MX] (rev a1)
      0d:00.0 Ethernet controller: Intel Corporation 82599 10 Gigabit Network Connection (rev 01)
      0f:00.0 Non-Essential Instrumentation [1300]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Function
      10:00.0 Non-Essential Instrumentation [1300]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse Reserved SPP
      10:00.1 Encryption controller: Advanced Micro Devices, Inc. [AMD] Starship/Matisse Cryptographic Coprocessor PSPCPP
      10:00.3 USB controller: Advanced Micro Devices, Inc. [AMD] Matisse USB 3.0 Host Controller
      10:00.4 Audio device: Advanced Micro Devices, Inc. [AMD] Starship/Matisse HD Audio Controller
      

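      Comparing the two listings, two devices are missing in the broken state: not just the PCIe USB card but also one of the "PCIe GPP Bridge" devices. With that bridge gone, the bus numbers behind it shift down by one, so the addresses below are the ones from the working listing: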
      03:04.0 PCI bridge: Advanced Micro Devices, Inc. [AMD] Matisse PCIe GPP Bridge
      07:00.0 USB controller: Renesas Technology Corp. uPD720201 USB 3.0 Host Controller (rev 03)
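Because the bus numbers shift once a bridge disappears, comparing the two listings line by line is misleading: most addresses differ even though the devices are the same. A minimal sketch of a comparison that ignores the address and only counts device descriptions is shown below. The function names are hypothetical, and the input is assumed to be plain `lspci` output saved as text, with or without the shell prompt line.

```python
from collections import Counter


def device_descriptions(lspci_text):
    """Count device descriptions in lspci output, ignoring the bus address."""
    counts = Counter()
    for line in lspci_text.splitlines():
        line = line.strip()
        # Skip blank lines and shell prompt lines such as "[23:10 localhost ~]# lspci".
        if not line or line.startswith("["):
            continue
        _address, _, description = line.partition(" ")
        if description:
            counts[description] += 1
    return counts


def missing_devices(before_text, after_text):
    """Return {description: shortfall} for devices seen more often before than after."""
    # Counter subtraction keeps only positive differences.
    return dict(device_descriptions(before_text) - device_descriptions(after_text))


if __name__ == "__main__":
    import sys

    # Usage (file names are placeholders): python compare_lspci.py good.txt broken.txt
    with open(sys.argv[1]) as good, open(sys.argv[2]) as broken:
        for description, shortfall in missing_devices(good.read(), broken.read()).items():
            print(f"missing {shortfall}x {description}")
```

Counting descriptions cannot tell which of several identical cards dropped out, only that one did; that still has to be established by physically swapping or testing cards.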

      I am no expert, but the PCIe bridge disappearing at the same time as the USB PCIe card may be a useful clue. I've spent the whole day trying all kinds of slot combinations and removing, adding and reseating cards, but unfortunately to no avail.

      Any clues/help would be much appreciated!

          olivierlambert Vates 🪐 Co-Founder CEO

          Hi!

          It feels like a power-saving setting somewhere, or a bad reset of the device on reboot. Is your BIOS fully up to date? Have you checked the BIOS options?

            nvs @olivierlambert

            @olivierlambert Hi, yes, I'm running the latest BIOS version, 5003 (released just 2023/10/31).
            Any suggestions which BIOS settings to look at in particular? I have a nearly identical second system (same motherboard, CPU, RAM, SSDs, PCIe USB cards and PCIe NIC) and that one has not shown this behaviour so far. I checked the BIOS version and settings on both systems and they should be pretty much identical, I think.

              olivierlambert Vates 🪐 Co-Founder CEO

              Ah that's interesting! So you can't reproduce the issue on a similar box 🤔 That's helpful to understand what could be the issue.

                nvs

                Spent some more time troubleshooting. And made some interesting discoveries!

                First I tried changing various BIOS settings, such as:

                • Advanced->AMD CBS->Global c-state control: was set to "auto", tried "enabled" and "disabled"
                • Advanced->AMD CBS->CPU common options->Local APIC mode: was "x2APIC", tried "compatibility", "xAPIC" and "auto"

                None of these changes helped; the issue remained.

                I also installed a fresh XCP-ng on the 2nd M.2 in that server so I could compare my "normal" XCP-ng install against a fresh one. -> Same issue
                Set power supply from "multi rail" to "single rail" -> Same issue

                Today I made a discovery. Plugging the DVD drive into a SATA port on the mainboard causes one of the PCIe USB cards to no longer be recognized, as described in the original post above. Interestingly, if I unplug the DVD drive from the mainboard after it has caused the issue, that PCIe USB card still remains unlisted. If I leave the server powered down for about 1h and then start it up (with the DVD drive not connected to any SATA port), the system appears to recognize all PCIe cards again and keeps doing so. Very interesting..

                So the working config in this server is:

                • 6 PCIe cards (as mentioned in first post above)
                • 2x M.2 SSDs (1x 2 TB and 1x 4 TB) on the motherboard
                • 4x SATA devices attached (3x HDDs 18TB 18TB and 20TB, 1x SSD 500GB)

                If I plug in the DVD drive (which would be the 5th SATA device) it breaks things.

                Any more ideas what may be going on/be behind this?

                  nvs

                  Okay, story continues:

                  • Had the DVD drive's SATA cable unplugged from the mainboard.
                  • System booted fine and the PCIe devices were listed fine across reboots.
                  • Plugged the DVD drive in to verify that it breaks things (as explained above).
                  • Disconnected the DVD drive from the mainboard again, and the PCIe card indeed stayed unlisted across reboots (as explained before).
                  • Shut down the machine and left it without power for 21 minutes.
                  • Powered up the machine, but it didn't boot. The screen stayed black and the "CPU" error LED on the mainboard lit up, indicating some boot issue with the CPU.. Interesting.
                  • Cut the power and started the machine again. It started up in "safe mode" and forced me into the BIOS by pressing F1. I exited the BIOS with no changes.
                  • Machine restarted and then booted fine into XCP-ng as usual. All PCIe cards were detected normally again, also across reboots.

                  And another find: it's not specific to the DVD drive. Going from 0 plugged-in SATA devices to one plugged-in HDD triggers the same issue. So it seems more like the issue occurs whenever a system device is added or changed..

                  Curious if anyone here can make anything of this? I am happy to replace the motherboard if that fixes the issue, but can we be sure which component is really faulty here (motherboard, CPU, RAM or PCIe card)?

                    olivierlambert Vates 🪐 Co-Founder CEO

                    That's a tricky one. Any idea @fohdeesha ?

                      nvs

                      Tried some more things but nothing resolved the issue:

                      • Changed RAM speed from DDR4-3200 to AUTO -> Same issue

                      • Installed a different GPU (removed the Nvidia K2200 GPU), but it still breaks when, e.g., going from 0 plugged-in SATA devices to plugging in the 1st SATA HDD -> Same issue

                      • Reseated the CPU, checked for bent pins (all looked OK) and re-pasted it -> Same issue

                      • Tried a different output on the K2200 GPU (output 2 (DP) instead of the usual output 3 (DP)) -> Same issue

                      • Tried without any GPU at all (no onboard GPU either, as this CPU doesn't have integrated graphics) -> Same issue

                      • Took out the PCIe USB cards one by one (with no GPU installed at all, the 10gig card in the top PCIe slot for a change, and 1x HDD attached via SATA):
                        Every time I remove one and boot, the correct number of PCIe USB cards shows up the first time. After a reboot there is always one PCIe USB card fewer, and that number then seems to stay the same across further reboots. However, when only one PCIe USB card is left, that card stays recognized and does not disappear after a reboot!

                      • Reset the BIOS settings (still using the latest BIOS version) by removing the battery and shorting the RTC reset pins. Left the BIOS at untouched defaults and booted into XCP-ng -> Same issue

                      • Removed all RAM modules and installed just one RAM stick -> Same issue

                      • Downgraded the BIOS to version 4408 and left it at BIOS defaults -> Same issue

                      It looks like the system likes eating the PCIe USB cards. I will try ASUS customer support tomorrow, but I am not expecting much from that..

                      Could this be an IRQ conflict? What still baffles me is that the issue isn't resolved if the machine is shut off for, say, 30 seconds, but is resolved after it has been off for 10 minutes. It would then usually boot up with all cards recognized again.. In the back of my mind I am imagining some hardware failure that depends on something capacitively charged, which could explain such time-delayed behaviour.. Any thoughts/other ideas?

                        nvs

                        After another full day of troubleshooting it looks like I found the issue..

                        I installed Ubuntu Server and tested the USB cards that were still detected, to figure out which one was dropping out. It turns out that whenever that one card is in any of the PCIe slots, it causes the issues seen. If it's not installed in the server, no cards disappear.

                        I've taken an identical, known-working PCIe USB card from my 2nd machine and replaced the faulty one with it. It seems everything is working fine again. Quite interesting how a faulty card resulted in this rollercoaster of symptoms.. at least some nice lessons learned for the future 🙂
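For telling identical USB cards apart from the software side, one approach is to map each USB bus to the PCI address of its host controller; plugging a stick into a card and seeing which bus it lands on then identifies that card's PCI device. The sketch below assumes the usual Linux sysfs layout, where `/sys/bus/usb/devices/usbN` resolves to a path containing the controller's PCI address. The function names are made up for illustration, and this is not necessarily how the cards were tested in the post above.

```python
import os
import re

# Matches a full PCI address such as "0000:07:00.0" (domain:bus:device.function).
PCI_ADDR = re.compile(r"^[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]$")


def pci_address_of(usb_root_hub_path):
    """Return the PCI address of the controller behind a USB root hub path, or None."""
    real = os.path.realpath(usb_root_hub_path)
    pci_parts = [part for part in real.split(os.sep) if PCI_ADDR.match(part)]
    # The last PCI address in the path is the controller itself;
    # earlier ones are the bridges above it.
    return pci_parts[-1] if pci_parts else None


def usb_bus_to_pci(usb_devices_dir="/sys/bus/usb/devices"):
    """Map each USB root hub (usb1, usb2, ...) to its controller's PCI address."""
    mapping = {}
    if not os.path.isdir(usb_devices_dir):
        return mapping
    for name in sorted(os.listdir(usb_devices_dir)):
        if re.match(r"^usb\d+$", name):
            mapping[name] = pci_address_of(os.path.join(usb_devices_dir, name))
    return mapping


if __name__ == "__main__":
    for bus, address in usb_bus_to_pci().items():
        print(f"{bus} -> {address}")
```

Since the bus numbers of the PCIe cards shift when a bridge disappears, the mapping should be read fresh on each boot rather than reused.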

                          olivierlambert Vates 🪐 Co-Founder CEO

                          Oh gosh, what a nightmare 😞

                          Congrats on finding the issue, it was a tricky one!

                            nvs

                            Yeah.. this definitely was a nightmare, I am taking a few days off after this 😃

                              olivierlambert Vates 🪐 Co-Founder CEO

                              haha sure you should, well deserved!

                              olivierlambert marked this topic as solved.