
    NVIDIA GPU passthrough on XCP-ng 8.3 fails after reboot — UUID/PCI ID changes

      samuelolavo

      I’m trying to use passthrough for an NVIDIA GPU on XCP-ng 8.3.
      The host detects the GPU (lspci shows VGA + Audio), and IOMMU is enabled.

      However, whenever I apply xen-pciback.hide and reboot the host, XCP-ng generates a new internal UUID and new PCI ID for the GPU.
      As a result, I cannot even assign the GPU to a VM, because the device “changes” its reference after each reboot.

      Additional important context:

      I previously had a different GPU doing passthrough in a VM.

      That GPU was removed.

      The new GPU was installed in a different slot.

      I suspect the issues may be related to leftover metadata from the previous GPU, which prevents the new GPU from being recognized consistently for passthrough.

      Question:
      Has anyone encountered this problem? Is there a way to completely clear the old passthrough state on the host or ensure that the new GPU is recognized consistently for passthrough?
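One way to document the change described above is to snapshot the mapping from PCI address to UUID before a reboot and diff it afterwards. This is a minimal sketch, assuming the `xe pci-list` output format pasted later in this thread; `pci_uuid_map` and the file path in the comments are illustrative names, not part of XCP-ng.

```shell
# pci_uuid_map: hypothetical helper. Reads `xe pci-list` output on stdin and
# prints one "pci-id uuid" pair per line, sorted by PCI address.
pci_uuid_map() {
    awk -F': ' '
        /^[[:space:]]*uuid/   { uuid = $2 }        # remember the record UUID
        /^[[:space:]]*pci-id/ { print $2, uuid }   # emit it with its address
    ' | sort
}

# Possible usage on dom0 (the path is an assumption):
#   before reboot:  xe pci-list | pci_uuid_map > /root/pci-before.txt
#   after reboot:   xe pci-list | pci_uuid_map | diff /root/pci-before.txt -
```

An empty diff would mean the UUIDs and addresses were stable across the reboot; any changed line shows exactly which device was re-registered.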

        olivierlambert Vates 🪐 Co-Founder CEO

        Hi,

        Clear the boot parameter entirely with /opt/xensource/libexec/xen-cmdline --delete-dom0 xen-pciback.hide, reboot, and re-assign the device.
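A rough sketch of that sequence follows. The `--delete-dom0` call is the one quoted above; `--set-dom0` and `--get-dom0` are the companion options described in the XCP-ng PCI passthrough documentation. The helper `build_hide_arg` is a hypothetical name, and the PCI addresses are placeholders to be replaced with the values `lspci` reports on the host. Listing both functions of the card is an assumption for illustration, not something this thread confirms is required.

```shell
# build_hide_arg: hypothetical helper. Turns PCI addresses (domain:bus:dev.fn)
# into the value expected by the xen-pciback.hide boot parameter.
build_hide_arg() (
    arg="xen-pciback.hide="
    for bdf in "$@"; do
        arg="${arg}(${bdf})"
    done
    printf '%s\n' "$arg"
)

# Intended order on dom0 (run manually, with a host reboot between steps):
#   1. /opt/xensource/libexec/xen-cmdline --delete-dom0 xen-pciback.hide
#   2. reboot
#   3. /opt/xensource/libexec/xen-cmdline --set-dom0 "$(build_hide_arg 0000:03:00.0 0000:03:00.1)"
#   4. reboot
#   5. /opt/xensource/libexec/xen-cmdline --get-dom0 xen-pciback.hide
```

Building the argument in one place keeps the parenthesised list consistent if more than one function has to be hidden.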

          samuelolavo @olivierlambert

          @olivierlambert

          Thank you for your reply.

          I followed your instructions; however, when I pass the NVIDIA GPU through in Xen Orchestra, the same error occurs after I reboot.

          The PCI ID and UUID of the graphics card are different from what they were before the reboot.

              olivierlambert Vates 🪐 Co-Founder CEO

              Can you tell us which platform you are doing this on? It looks like a buggy PCI reset.

                samuelolavo @olivierlambert

                @olivierlambert

                I have XCP-ng 8.3 installed.
                Xen Orchestra, commit bcee5.

                When I add a device manually (the audio function, for example), it asks for a reboot, but after I restart the server the device has a different PCI ID and a different UUID.

                aba3b57a-fb08-4af6-9f05-3c07b4e494e8-image.png

                  olivierlambert Vates 🪐 Co-Founder CEO

                  Hmm, I'm not sure I follow. But let me add @Team-Hypervisor-Kernel in case that rings a bell.

                    yannsionneau Vates 🪐 XCP-ng Team

                    Hi,

                    @samuelolavo can you run the following command on the dom0 and paste its output please:

                    grep -A5 "'XCP-ng'" /etc/grub.cfg
                    

                    Then also run

                    lspci -vvv
                    xe pci-list
                    xl pci-assignable-list
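To make the pasted output easier to read, the same four commands can be wrapped so that each section is labelled. This is only a sketch; `run_labeled` and `collect_pci_report` are hypothetical helper names, and the output path in the comment is an assumption.

```shell
# run_labeled: hypothetical helper. Prints a header naming the command, then
# the command's output (stdout and stderr), and notes a non-zero exit status.
run_labeled() {
    printf '===== %s =====\n' "$*"
    "$@" 2>&1 || printf '(command failed with status %s)\n' "$?"
}

# collect_pci_report: runs the diagnostics requested above, in order.
collect_pci_report() {
    run_labeled grep -A5 "'XCP-ng'" /etc/grub.cfg
    run_labeled lspci -vvv
    run_labeled xe pci-list
    run_labeled xl pci-assignable-list
}

# Possible usage on dom0:
#   collect_pci_report > /tmp/pci-report.txt
```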
                    
                      samuelolavo @yannsionneau

                      @yannsionneau Hi,
                      Sorry for the delay...

                      menuentry 'XCP-ng' {
                              search --label --set root root-zrxcsq
                              multiboot2 /boot/xen.gz dom0_mem=8192M,max:8192M watchdog ucode=scan dom0_max_vcpus=1-16 crashkernel=256M,below=4G console=vga vga=mode-0x0311
                              module2 /boot/vmlinuz-4.19-xen root=LABEL=root-zrxcsq ro nolvm hpet=disable console=hvc0 console=tty0 quiet vga=785 splash plymouth.ignore-serial-consoles xen-pciback.hide=(0000:03:00.0)
                              module2 /boot/initrd-4.19-xen.img
                      }
                      
                      
                      03:00.0 VGA compatible controller: NVIDIA Corporation GB202GL [RTX PRO 6000 Blackwell Max-Q Workstation Edition] (rev a1) (prog-if 00 [VGA controller])
                              Subsystem: NVIDIA Corporation Device 204c
                              Physical Slot: 10
                              Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
                              Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
                              Interrupt: pin A routed to IRQ 89
                              Region 0: Memory at f4000000 (32-bit, non-prefetchable) [disabled] [size=64M]
                              Region 1: Memory at 70060000000 (64-bit, prefetchable) [disabled] [size=256M]
                              Region 3: Memory at 70070000000 (64-bit, prefetchable) [disabled] [size=32M]
                              Region 5: I/O ports at 1000 [disabled] [size=128]
                              Expansion ROM at f8000000 [disabled] [size=512K]
                              Capabilities: [40] Power Management version 3
                                      Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold-)
                                      Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
                              Capabilities: [48] MSI: Enable- Count=1/16 Maskable+ 64bit+
                                      Address: 0000000000000000  Data: 0000
                                      Masking: 00000000  Pending: 00000000
                              Capabilities: [60] Express (v2) Legacy Endpoint, MSI 00
                                      DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s <64ns, L1 unlimited
                                              ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset+
                                      DevCtl: Report errors: Correctable+ Non-Fatal+ Fatal+ Unsupported-
                                              RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+ FLReset-
                                              MaxPayload 256 bytes, MaxReadReq 512 bytes
                                      DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
                                      LnkCap: Port #0, Speed unknown, Width x16, ASPM L1, Exit Latency L0s unlimited, L1 unlimited
                                              ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+
                                      LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk+
                                              ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
                                      LnkSta: Speed unknown, Width x4, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
                                      DevCap2: Completion Timeout: Range AB, TimeoutDis+, LTR+, OBFF Via message
                                      DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR+, OBFF Via WAKE#
                                      LnkCtl2: Target Link Speed: Unknown, EnterCompliance- SpeedDis-
                                               Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
                                               Compliance De-emphasis: -6dB
                                      LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+, EqualizationPhase1+
                                               EqualizationPhase2+, EqualizationPhase3+, LinkEqualizationRequest-
                              Capabilities: [9c] Vendor Specific Information: Len=14 <?>
                              Capabilities: [100 v1] #19
                              Capabilities: [12c v1] Latency Tolerance Reporting
                                      Max snoop latency: 1048576ns
                                      Max no snoop latency: 1048576ns
                              Capabilities: [134 v1] #15
                              Capabilities: [14c v1] #25
                              Capabilities: [158 v1] #26
                              Capabilities: [188 v1] #2a
                              Capabilities: [1b8 v2] Advanced Error Reporting
                                      UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                                      UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq+ ACSViol-
                                      UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO+ CmpltAbrt- UnxCmplt+ RxOF+ MalfTLP+ ECRC+ UnsupReq- ACSViol-
                                      CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
                                      CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
                                      AERCap: First Error Pointer: 00, GenCap+ CGenEn- ChkCap+ ChkEn-
                              Capabilities: [200 v1] #27
                              Capabilities: [248 v1] Alternative Routing-ID Interpretation (ARI)
                                      ARICap: MFVC- ACS-, Next Function: 1
                                      ARICtl: MFVC- ACS-, Function Group: 0
                              Capabilities: [2a4 v1] Vendor Specific Information: ID=0001 Rev=1 Len=014 <?>
                              Capabilities: [2bc v1] Power Budgeting <?>
                              Capabilities: [2f4 v1] Device Serial Number 18-a6-fe-7f-8f-2d-b0-48
                              Kernel driver in use: pciback
                      
                      03:00.1 Audio device: NVIDIA Corporation Device 22e8 (rev a1)
                              Subsystem: NVIDIA Corporation Device 0000
                              Physical Slot: 10
                              Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
                              Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
                              Latency: 0, Cache Line Size: 64 bytes
                              Interrupt: pin B routed to IRQ 10
                              Region 0: Memory at f8080000 (32-bit, non-prefetchable) [size=16K]
                              Capabilities: [40] Power Management version 3
                                      Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
                                      Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
                              Capabilities: [48] MSI: Enable- Count=1/1 Maskable+ 64bit+
                                      Address: 0000000000000000  Data: 0000
                                      Masking: 00000000  Pending: 00000000
                              Capabilities: [60] Express (v2) Endpoint, MSI 00
                                      DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s <64ns, L1 unlimited
                                              ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset- SlotPowerLimit 75.000W
                                      DevCtl: Report errors: Correctable+ Non-Fatal+ Fatal+ Unsupported-
                                              RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+
                                              MaxPayload 256 bytes, MaxReadReq 512 bytes
                                      DevSta: CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr- TransPend-
                                      LnkCap: Port #0, Speed unknown, Width x16, ASPM L1, Exit Latency L0s unlimited, L1 unlimited
                                              ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+
                                      LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk+
                                              ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
                                      LnkSta: Speed unknown, Width x4, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
                                      DevCap2: Completion Timeout: Range AB, TimeoutDis+, LTR+, OBFF Via message
                                      DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
                                      LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete-, EqualizationPhase1-
                                               EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
                              Capabilities: [9c] Vendor Specific Information: Len=14 <?>
                              Capabilities: [100 v1] #25
                              Capabilities: [10c v2] Advanced Error Reporting
                                      UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq+ ACSViol-
                                      UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq+ ACSViol-
                                      UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO+ CmpltAbrt- UnxCmplt+ RxOF+ MalfTLP+ ECRC+ UnsupReq- ACSViol-
                                      CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
                                      CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
                                      AERCap: First Error Pointer: 00, GenCap+ CGenEn- ChkCap+ ChkEn-
                              Capabilities: [154 v1] Alternative Routing-ID Interpretation (ARI)
                                      ARICap: MFVC- ACS-, Next Function: 0
                                      ARICtl: MFVC- ACS-, Function Group: 0
                      
                      09:00.0 Non-Essential Instrumentation [1300]: Advanced Micro Devices, Inc. [AMD] Genoa/Bergamo Dummy Function (rev 01)
                              Subsystem: Advanced Micro Devices, Inc. [AMD] Genoa/Bergamo Dummy Function
                              Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
                              Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
                              Capabilities: [48] Vendor Specific Information: Len=08 <?>
                              Capabilities: [50] Power Management version 3
                                      Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
                                      Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
                              Capabilities: [64] Express (v2) Endpoint, MSI 00
                                      DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s <4us, L1 unlimited
                                              ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 0.000W
                                      DevCtl: Report errors: Correctable+ Non-Fatal+ Fatal+ Unsupported-
                                              RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+ FLReset-
                                              MaxPayload 128 bytes, MaxReadReq 512 bytes
                                      DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
                                      LnkCap: Port #0, Speed unknown, Width x16, ASPM L0s L1, Exit Latency L0s <64ns, L1 <1us
                                              ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
                                      LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk+
                                              ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
                                      LnkSta: Speed unknown, Width x16, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
                                      DevCap2: Completion Timeout: Range ABCD, TimeoutDis+, LTR-, OBFF Not Supported
                                      DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
                                      LnkCtl2: Target Link Speed: Unknown, EnterCompliance- SpeedDis-
                                               Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
                                               Compliance De-emphasis: -6dB
                                      LnkSta2: Current De-emphasis Level: -3.5dB, EqualizationComplete-, EqualizationPhase1-
                                               EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
                              Capabilities: [100 v1] Vendor Specific Information: ID=0001 Rev=1 Len=010 <?>
                              Capabilities: [270 v1] #19
                              Capabilities: [328 v1] Alternative Routing-ID Interpretation (ARI)
                                      ARICap: MFVC- ACS-, Next Function: 1
                                      ARICtl: MFVC- ACS-, Function Group: 0
                              Capabilities: [410 v1] #26
                              Capabilities: [450 v1] #27
                              Capabilities: [500 v1] #2a
                      
                      
                      
                       xe pci-list
                      uuid ( RO)           : 73708288-55ec-b17f-ba73-6d2c116b3bbc
                          vendor-name ( RO): NVIDIA Corporation
                          device-name ( RO): GB202GL [RTX PRO 6000 Blackwell Max-Q Workstation Edition]
                               pci-id ( RO): 0000:03:00.0
                      
                      
                      uuid ( RO)           : c94f0327-8c86-3aa8-dd7c-9389ae1123f5
                          vendor-name ( RO): Intel Corporation
                          device-name ( RO): Ethernet Controller X550
                               pci-id ( RO): 0000:81:00.1
                      
                      
                      uuid ( RO)           : 09e0f3b1-18bb-8a6e-97d8-0209a8e4a97c
                          vendor-name ( RO): Advanced Micro Devices, Inc. [AMD]
                          device-name ( RO): FCH SATA Controller [AHCI mode]
                               pci-id ( RO): 0000:0a:00.1
                      
                      
                      uuid ( RO)           : cc5feb5c-b8aa-a975-de76-513d309f8e73
                          vendor-name ( RO): Intel Corporation
                          device-name ( RO): Ethernet Controller X550
                               pci-id ( RO): 0000:41:00.0
                      
                      
                      uuid ( RO)           : 72baa4c2-13b3-22fb-cc87-b10033ccb025
                          vendor-name ( RO): Broadcom / LSI
                          device-name ( RO): MegaRAID 12GSAS/PCIe Secure SAS39xx
                               pci-id ( RO): 0000:c1:00.0
                      
                      
                      uuid ( RO)           : d2d40d25-4b69-7f6a-a8e2-3101ad80fcb6
                          vendor-name ( RO): Intel Corporation
                          device-name ( RO): Ethernet Controller X550
                               pci-id ( RO): 0000:41:00.1
                      
                      
                      uuid ( RO)           : 630bdaee-0e03-b8a1-c726-4f34230e89f7
                          vendor-name ( RO): Intel Corporation
                          device-name ( RO): Ethernet Controller X550
                               pci-id ( RO): 0000:81:00.0
                      
                      
                      uuid ( RO)           : d4023077-83fe-a0b7-5f3f-516204c2c1d1
                          vendor-name ( RO): Advanced Micro Devices, Inc. [AMD]
                          device-name ( RO): FCH SATA Controller [AHCI mode]
                               pci-id ( RO): 0000:ce:00.1
                      
                      
                      uuid ( RO)           : c67855cd-4908-0711-7424-a6db2eb011f0
                          vendor-name ( RO): Advanced Micro Devices, Inc. [AMD]
                          device-name ( RO): FCH SATA Controller [AHCI mode]
                               pci-id ( RO): 0000:0a:00.0
                      
                      
                      uuid ( RO)           : ef1d93e4-e82b-c33c-a488-8f7a9129eb8a
                          vendor-name ( RO): NVIDIA Corporation
                          device-name ( RO): Device 22e8
                               pci-id ( RO): 0000:03:00.1
                      
                      
                      uuid ( RO)           : b6235c56-4070-dc4d-9db2-5e361f38d2b2
                          vendor-name ( RO): Advanced Micro Devices, Inc. [AMD]
                          device-name ( RO): FCH SATA Controller [AHCI mode]
                               pci-id ( RO): 0000:ce:00.0
                      
                      
                      uuid ( RO)           : 5c7258ae-504b-9ac4-8b16-7129b8d8455d
                          vendor-name ( RO): ASPEED Technology, Inc.
                          device-name ( RO): ASPEED Graphics Family
                               pci-id ( RO): 0000:cc:00.0
                      
                      
                      xl pci-assignable-list
                      0000:03:00.0
                      
                      

                      Model: Supermicro AS-2015CS-TNR
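In the output above, both `xen-pciback.hide` and `xl pci-assignable-list` contain only 0000:03:00.0, while `lspci` and `xe pci-list` also show the audio function 0000:03:00.1. Whether that is related to the problem is not established in this thread. The sketch below shows one way to spot such a mismatch automatically; `hidden_bdfs` and `unhidden_siblings` are hypothetical helper names.

```shell
# hidden_bdfs: hypothetical helper. Reads boot-config text on stdin and prints
# every PCI address listed in a xen-pciback.hide=(...) parameter.
hidden_bdfs() {
    grep -o 'xen-pciback\.hide=[^ ]*' \
        | grep -oE '[0-9a-fA-F]{4}:[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-7]' \
        | sort -u
}

# unhidden_siblings: hypothetical helper. Arguments are the hidden addresses;
# stdin is a list of candidate addresses, one per line. Prints candidates that
# share domain:bus:device with a hidden address but are not hidden themselves.
unhidden_siblings() (
    while IFS= read -r bdf; do
        slot=${bdf%.*}               # drop the function number
        is_hidden=no
        shares_slot=no
        for h in "$@"; do
            if [ "$h" = "$bdf" ]; then is_hidden=yes; fi
            if [ "${h%.*}" = "$slot" ]; then shares_slot=yes; fi
        done
        if [ "$shares_slot" = yes ] && [ "$is_hidden" = no ]; then
            printf '%s\n' "$bdf"
        fi
    done
)

# Possible usage on dom0 (lspci -D prints addresses with the PCI domain):
#   lspci -D | cut -d' ' -f1 \
#       | unhidden_siblings $(grep -A5 "'XCP-ng'" /etc/grub.cfg | hidden_bdfs)
```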
