    NVIDIA GPU passthrough on XCP-ng 8.3 fails after reboot — UUID/PCI ID changes

    • samuelolavo

      I’m trying to use passthrough for an NVIDIA GPU on XCP-ng 8.3.
      The host detects the GPU (lspci shows VGA + Audio), and IOMMU is enabled.

      However, whenever I apply xen-pciback.hide and reboot the host, XCP-ng generates a new internal UUID and new PCI ID for the GPU.
      As a result, I cannot even assign the GPU to a VM, because the device “changes” its reference after each reboot.

      Additional important context:

      I previously had a different GPU doing passthrough in a VM.

      That GPU was removed.

      The new GPU was installed in a different slot.

      I suspect the issues may be related to leftover metadata from the previous GPU, which prevents the new GPU from being recognized consistently for passthrough.

      Question:
      Has anyone encountered this problem? Is there a way to completely clear the old passthrough state on the host or ensure that the new GPU is recognized consistently for passthrough?

      • olivierlambert Vates 🪐 Co-Founder CEO

        Hi,

        Clear the boot parameter entirely with /opt/xensource/libexec/xen-cmdline --delete-dom0 xen-pciback.hide, reboot, and re-assign the device.
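
As a rough sketch of that sequence from the command line: the --delete-dom0 call is the one quoted in this reply, while --set-dom0 and --get-dom0 are assumed here to be the sibling options of the same tool, as described in the XCP-ng passthrough documentation. The helper name build_hide_arg is hypothetical, and the address 0000:03:00.0 is the one that appears later in this thread. The host commands are left as comments, so only the helper itself runs.

```shell
# Build the value passed to xen-cmdline from one or more PCI addresses.
# build_hide_arg is a hypothetical helper, not part of XCP-ng.
build_hide_arg() {
  arg="xen-pciback.hide="
  for bdf in "$@"; do
    arg="${arg}(${bdf})"
  done
  printf '%s\n' "$arg"
}

# On the host (dom0) the sequence would look roughly like this
# (not executed here; --set-dom0/--get-dom0 are assumptions, see above):
#   /opt/xensource/libexec/xen-cmdline --delete-dom0 xen-pciback.hide
#   reboot
#   /opt/xensource/libexec/xen-cmdline --set-dom0 "$(build_hide_arg 0000:03:00.0)"
#   /opt/xensource/libexec/xen-cmdline --get-dom0 xen-pciback.hide
#   reboot
build_hide_arg 0000:03:00.0
```

Passing several addresses to the helper concatenates them in the parenthesised form that the boot parameter uses.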

        • samuelolavo @olivierlambert

          @olivierlambert

          Thank you for your reply.

          I followed your instructions; however, when I pass the NVIDIA GPU through in Xen Orchestra, the same error occurs after I reboot.

          The PCI ID and UUID of the graphics card are no longer the ones it had before rebooting.

            • olivierlambert Vates 🪐 Co-Founder CEO

              Can you tell us on what platform you are doing this? It looks like a buggy PCI reset.

              • samuelolavo @olivierlambert

                @olivierlambert

                I have XCP-ng 8.3 installed.
                Xen Orchestra, commit bcee5.

                When I add a device manually (the audio function, for example), it asks for a reboot; but after I restart the server, the device has a different PCI ID and a different UUID.

                aba3b57a-fb08-4af6-9f05-3c07b4e494e8-image.png

                • olivierlambert Vates 🪐 Co-Founder CEO

                  I'm not sure I follow, but let me add @Team-Hypervisor-Kernel in case that rings a bell.

                  • yannsionneau Vates 🪐 XCP-ng Team

                    Hi,

                    @samuelolavo can you run the following command on the dom0 and paste its output please:

                    grep -A5 "'XCP-ng'" /etc/grub.cfg
                    

                    Then also run

                    lspci -vvv
                    xe pci-list
                    xl pci-assignable-list
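
To check whether a UUID or PCI ID really differs after a reboot, one option is to reduce the xe pci-list output to "pci-id uuid" pairs and diff a snapshot taken before the reboot against one taken after. The sketch below is only that, a sketch: pci_pairs is a hypothetical helper, the snapshot path is an arbitrary choice, and the parsing assumes each record has a "uuid" line followed by a "pci-id" line.

```shell
# pci_pairs: reduce `xe pci-list` output (read from stdin) to sorted
# "pci-id uuid" lines. Hypothetical helper; assumes each record has a
# "uuid ( RO) : ..." line followed by a "pci-id ( RO): ..." line.
pci_pairs() {
  awk -F': ' '
    /^ *uuid \(/   { uuid = $2 }
    /^ *pci-id \(/ { print $2, uuid }
  ' | sort
}

# Intended use on dom0 (not executed here; the file path is arbitrary):
#   xe pci-list | pci_pairs > /root/pci-before.txt          # before the reboot
#   xe pci-list | pci_pairs | diff /root/pci-before.txt -   # after the reboot
```

An empty diff would mean that every device kept both its address and its UUID across the reboot.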
                    
                    • samuelolavo @yannsionneau

                      @yannsionneau Hi,
                      Sorry for the delay...

                      grep -A5 "'XCP-ng'" /etc/grub.cfg
                      menuentry 'XCP-ng' {
                              search --label --set root root-zrxcsq
                              multiboot2 /boot/xen.gz dom0_mem=8192M,max:8192M watchdog ucode=scan dom0_max_vcpus=1-16 crashkernel=256M,below=4G console=vga vga=mode-0x0311
                              module2 /boot/vmlinuz-4.19-xen root=LABEL=root-zrxcsq ro nolvm hpet=disable console=hvc0 console=tty0 quiet vga=785 splash plymouth.ignore-serial-consoles xen-pciback.hide=(0000:03:00.0)
                              module2 /boot/initrd-4.19-xen.img
                      }
                      
                      
                      lspci -vvv
                      03:00.0 VGA compatible controller: NVIDIA Corporation GB202GL [RTX PRO 6000 Blackwell Max-Q Workstation Edition] (rev a1) (prog-if 00 [VGA controller])
                              Subsystem: NVIDIA Corporation Device 204c
                              Physical Slot: 10
                              Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
                              Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
                              Interrupt: pin A routed to IRQ 89
                              Region 0: Memory at f4000000 (32-bit, non-prefetchable) [disabled] [size=64M]
                              Region 1: Memory at 70060000000 (64-bit, prefetchable) [disabled] [size=256M]
                              Region 3: Memory at 70070000000 (64-bit, prefetchable) [disabled] [size=32M]
                              Region 5: I/O ports at 1000 [disabled] [size=128]
                              Expansion ROM at f8000000 [disabled] [size=512K]
                              Capabilities: [40] Power Management version 3
                                      Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold-)
                                      Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
                              Capabilities: [48] MSI: Enable- Count=1/16 Maskable+ 64bit+
                                      Address: 0000000000000000  Data: 0000
                                      Masking: 00000000  Pending: 00000000
                              Capabilities: [60] Express (v2) Legacy Endpoint, MSI 00
                                      DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s <64ns, L1 unlimited
                                              ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset+
                                      DevCtl: Report errors: Correctable+ Non-Fatal+ Fatal+ Unsupported-
                                              RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+ FLReset-
                                              MaxPayload 256 bytes, MaxReadReq 512 bytes
                                      DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
                                      LnkCap: Port #0, Speed unknown, Width x16, ASPM L1, Exit Latency L0s unlimited, L1 unlimited
                                              ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+
                                      LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk+
                                              ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
                                      LnkSta: Speed unknown, Width x4, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
                                      DevCap2: Completion Timeout: Range AB, TimeoutDis+, LTR+, OBFF Via message
                                      DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR+, OBFF Via WAKE#
                                      LnkCtl2: Target Link Speed: Unknown, EnterCompliance- SpeedDis-
                                               Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
                                               Compliance De-emphasis: -6dB
                                      LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+, EqualizationPhase1+
                                               EqualizationPhase2+, EqualizationPhase3+, LinkEqualizationRequest-
                              Capabilities: [9c] Vendor Specific Information: Len=14 <?>
                              Capabilities: [100 v1] #19
                              Capabilities: [12c v1] Latency Tolerance Reporting
                                      Max snoop latency: 1048576ns
                                      Max no snoop latency: 1048576ns
                              Capabilities: [134 v1] #15
                              Capabilities: [14c v1] #25
                              Capabilities: [158 v1] #26
                              Capabilities: [188 v1] #2a
                              Capabilities: [1b8 v2] Advanced Error Reporting
                                      UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                                      UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq+ ACSViol-
                                      UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO+ CmpltAbrt- UnxCmplt+ RxOF+ MalfTLP+ ECRC+ UnsupReq- ACSViol-
                                      CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
                                      CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
                                      AERCap: First Error Pointer: 00, GenCap+ CGenEn- ChkCap+ ChkEn-
                              Capabilities: [200 v1] #27
                              Capabilities: [248 v1] Alternative Routing-ID Interpretation (ARI)
                                      ARICap: MFVC- ACS-, Next Function: 1
                                      ARICtl: MFVC- ACS-, Function Group: 0
                              Capabilities: [2a4 v1] Vendor Specific Information: ID=0001 Rev=1 Len=014 <?>
                              Capabilities: [2bc v1] Power Budgeting <?>
                              Capabilities: [2f4 v1] Device Serial Number 18-a6-fe-7f-8f-2d-b0-48
                              Kernel driver in use: pciback
                      
                      03:00.1 Audio device: NVIDIA Corporation Device 22e8 (rev a1)
                              Subsystem: NVIDIA Corporation Device 0000
                              Physical Slot: 10
                              Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
                              Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
                              Latency: 0, Cache Line Size: 64 bytes
                              Interrupt: pin B routed to IRQ 10
                              Region 0: Memory at f8080000 (32-bit, non-prefetchable) [size=16K]
                              Capabilities: [40] Power Management version 3
                                      Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
                                      Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
                              Capabilities: [48] MSI: Enable- Count=1/1 Maskable+ 64bit+
                                      Address: 0000000000000000  Data: 0000
                                      Masking: 00000000  Pending: 00000000
                              Capabilities: [60] Express (v2) Endpoint, MSI 00
                                      DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s <64ns, L1 unlimited
                                              ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset- SlotPowerLimit 75.000W
                                      DevCtl: Report errors: Correctable+ Non-Fatal+ Fatal+ Unsupported-
                                              RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+
                                              MaxPayload 256 bytes, MaxReadReq 512 bytes
                                      DevSta: CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr- TransPend-
                                      LnkCap: Port #0, Speed unknown, Width x16, ASPM L1, Exit Latency L0s unlimited, L1 unlimited
                                              ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+
                                      LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk+
                                              ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
                                      LnkSta: Speed unknown, Width x4, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
                                      DevCap2: Completion Timeout: Range AB, TimeoutDis+, LTR+, OBFF Via message
                                      DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
                                      LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete-, EqualizationPhase1-
                                               EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
                              Capabilities: [9c] Vendor Specific Information: Len=14 <?>
                              Capabilities: [100 v1] #25
                              Capabilities: [10c v2] Advanced Error Reporting
                                      UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq+ ACSViol-
                                      UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq+ ACSViol-
                                      UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO+ CmpltAbrt- UnxCmplt+ RxOF+ MalfTLP+ ECRC+ UnsupReq- ACSViol-
                                      CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
                                      CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
                                      AERCap: First Error Pointer: 00, GenCap+ CGenEn- ChkCap+ ChkEn-
                              Capabilities: [154 v1] Alternative Routing-ID Interpretation (ARI)
                                      ARICap: MFVC- ACS-, Next Function: 0
                                      ARICtl: MFVC- ACS-, Function Group: 0
                      
                      09:00.0 Non-Essential Instrumentation [1300]: Advanced Micro Devices, Inc. [AMD] Genoa/Bergamo Dummy Function (rev 01)
                              Subsystem: Advanced Micro Devices, Inc. [AMD] Genoa/Bergamo Dummy Function
                              Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
                              Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
                              Capabilities: [48] Vendor Specific Information: Len=08 <?>
                              Capabilities: [50] Power Management version 3
                                      Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
                                      Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
                              Capabilities: [64] Express (v2) Endpoint, MSI 00
                                      DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s <4us, L1 unlimited
                                              ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 0.000W
                                      DevCtl: Report errors: Correctable+ Non-Fatal+ Fatal+ Unsupported-
                                              RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+ FLReset-
                                              MaxPayload 128 bytes, MaxReadReq 512 bytes
                                      DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
                                      LnkCap: Port #0, Speed unknown, Width x16, ASPM L0s L1, Exit Latency L0s <64ns, L1 <1us
                                              ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
                                      LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk+
                                              ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
                                      LnkSta: Speed unknown, Width x16, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
                                      DevCap2: Completion Timeout: Range ABCD, TimeoutDis+, LTR-, OBFF Not Supported
                                      DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
                                      LnkCtl2: Target Link Speed: Unknown, EnterCompliance- SpeedDis-
                                               Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
                                               Compliance De-emphasis: -6dB
                                      LnkSta2: Current De-emphasis Level: -3.5dB, EqualizationComplete-, EqualizationPhase1-
                                               EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
                              Capabilities: [100 v1] Vendor Specific Information: ID=0001 Rev=1 Len=010 <?>
                              Capabilities: [270 v1] #19
                              Capabilities: [328 v1] Alternative Routing-ID Interpretation (ARI)
                                      ARICap: MFVC- ACS-, Next Function: 1
                                      ARICtl: MFVC- ACS-, Function Group: 0
                              Capabilities: [410 v1] #26
                              Capabilities: [450 v1] #27
                              Capabilities: [500 v1] #2a
                      
                      
                      
                      xe pci-list
                      uuid ( RO)           : 73708288-55ec-b17f-ba73-6d2c116b3bbc
                          vendor-name ( RO): NVIDIA Corporation
                          device-name ( RO): GB202GL [RTX PRO 6000 Blackwell Max-Q Workstation Edition]
                               pci-id ( RO): 0000:03:00.0
                      
                      
                      uuid ( RO)           : c94f0327-8c86-3aa8-dd7c-9389ae1123f5
                          vendor-name ( RO): Intel Corporation
                          device-name ( RO): Ethernet Controller X550
                               pci-id ( RO): 0000:81:00.1
                      
                      
                      uuid ( RO)           : 09e0f3b1-18bb-8a6e-97d8-0209a8e4a97c
                          vendor-name ( RO): Advanced Micro Devices, Inc. [AMD]
                          device-name ( RO): FCH SATA Controller [AHCI mode]
                               pci-id ( RO): 0000:0a:00.1
                      
                      
                      uuid ( RO)           : cc5feb5c-b8aa-a975-de76-513d309f8e73
                          vendor-name ( RO): Intel Corporation
                          device-name ( RO): Ethernet Controller X550
                               pci-id ( RO): 0000:41:00.0
                      
                      
                      uuid ( RO)           : 72baa4c2-13b3-22fb-cc87-b10033ccb025
                          vendor-name ( RO): Broadcom / LSI
                          device-name ( RO): MegaRAID 12GSAS/PCIe Secure SAS39xx
                               pci-id ( RO): 0000:c1:00.0
                      
                      
                      uuid ( RO)           : d2d40d25-4b69-7f6a-a8e2-3101ad80fcb6
                          vendor-name ( RO): Intel Corporation
                          device-name ( RO): Ethernet Controller X550
                               pci-id ( RO): 0000:41:00.1
                      
                      
                      uuid ( RO)           : 630bdaee-0e03-b8a1-c726-4f34230e89f7
                          vendor-name ( RO): Intel Corporation
                          device-name ( RO): Ethernet Controller X550
                               pci-id ( RO): 0000:81:00.0
                      
                      
                      uuid ( RO)           : d4023077-83fe-a0b7-5f3f-516204c2c1d1
                          vendor-name ( RO): Advanced Micro Devices, Inc. [AMD]
                          device-name ( RO): FCH SATA Controller [AHCI mode]
                               pci-id ( RO): 0000:ce:00.1
                      
                      
                      uuid ( RO)           : c67855cd-4908-0711-7424-a6db2eb011f0
                          vendor-name ( RO): Advanced Micro Devices, Inc. [AMD]
                          device-name ( RO): FCH SATA Controller [AHCI mode]
                               pci-id ( RO): 0000:0a:00.0
                      
                      
                      uuid ( RO)           : ef1d93e4-e82b-c33c-a488-8f7a9129eb8a
                          vendor-name ( RO): NVIDIA Corporation
                          device-name ( RO): Device 22e8
                               pci-id ( RO): 0000:03:00.1
                      
                      
                      uuid ( RO)           : b6235c56-4070-dc4d-9db2-5e361f38d2b2
                          vendor-name ( RO): Advanced Micro Devices, Inc. [AMD]
                          device-name ( RO): FCH SATA Controller [AHCI mode]
                               pci-id ( RO): 0000:ce:00.0
                      
                      
                      uuid ( RO)           : 5c7258ae-504b-9ac4-8b16-7129b8d8455d
                          vendor-name ( RO): ASPEED Technology, Inc.
                          device-name ( RO): ASPEED Graphics Family
                               pci-id ( RO): 0000:cc:00.0
                      
                      
                      xl pci-assignable-list
                      0000:03:00.0
                      
                      

                      Model: Supermicro AS-2015CS-TNR

                      • yannsionneau Vates 🪐 XCP-ng Team @samuelolavo

                        @samuelolavo Thanks for your answer.

                        That's very odd: judging by the command outputs you pasted, everything looks like it is behaving as it should.
                        Even the PCI ID (segment:bus:device.function) seems to stay correct (0000:03:00.0).

                        I'll ask others internally.
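
The consistency between the boot parameter and the assignable list can also be checked mechanically. The sketch below assumes the dom0 kernel command line is readable from /proc/cmdline; hidden_bdfs is a hypothetical helper and the temporary file names are arbitrary. In the output pasted earlier in the thread, both lists contain only 0000:03:00.0.

```shell
# hidden_bdfs: print each PCI address listed in a xen-pciback.hide=
# parameter found on stdin, one per line. Hypothetical helper.
hidden_bdfs() {
  grep -o 'xen-pciback\.hide=[^ ]*' |
    grep -oE '[0-9a-fA-F]{4}:[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-7]'
}

# Intended use on dom0 (not executed here): every address hidden at boot
# should also show up in the assignable list.
#   hidden_bdfs < /proc/cmdline | sort > /tmp/hidden.txt
#   xl pci-assignable-list | sort > /tmp/assignable.txt
#   diff /tmp/hidden.txt /tmp/assignable.txt && echo "lists match"
```

A non-empty diff would point at a device that was requested at boot but not claimed by pciback, or the other way round.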
