XCP-ng

    PCI Nvidia GPU Passthrough boot delay

    • T Offline
      tomg
      last edited by tomg

      Hi all,

      I've had great success passing through Nvidia GPUs, from Quadros to Teslas and now Ampere, and I've had no problems with multiple GPUs either. However, I cannot get rid of a delay on VM start, which appears to be triggered by a call to the PCI device that hangs. The timeout seems to vary with GPU model. It appears to be consistently 20 - 30 seconds per RTX Ampere GPU, about 20 - 25 seconds on Quadros and ~90 seconds on an A100.

      What's worse on the A100, it seems the calls are made serially, so if I pass through four A100s the wait time to boot will be 4x90s, not optimal.

      Here is what this looks like from qemu's perspective on an A4000. For some reason the calls appear staggered on this device, so both devices complete within 20s.

      Apr  5 21:09:40 qemu-dm-2[9040]: Moving to cgroup slice ''
      Apr  5 21:09:40 qemu-dm-2[9040]: core dump limit: 67108864
      Apr  5 21:09:40 qemu-dm-2[9040]: qemu-dm-2: Machine type 'pc-0.10' is deprecated: use a newer machine type instead
      Apr  5 21:09:40 qemu-dm-2[9040]: char device redirected to /dev/pts/3 (label serial0)
      Apr  5 21:10:00 qemu-dm-2[9040]: [00:05.0] Write-back to unknown field 0xc4 (partially) inhibited (0x00000000)
      Apr  5 21:10:00 qemu-dm-2[9040]: [00:05.0] If the device doesn't work, try enabling permissive mode
      Apr  5 21:10:00 qemu-dm-2[9040]: [00:05.0] (unsafe) and if it helps report the problem to xen-devel
      Apr  5 21:10:00 qemu-dm-2[9040]: [00:06.0] Write-back to unknown field 0xc4 (partially) inhibited (0x00000000)
      Apr  5 21:10:00 qemu-dm-2[9040]: [00:06.0] If the device doesn't work, try enabling permissive mode
      Apr  5 21:10:00 qemu-dm-2[9040]: [00:06.0] (unsafe) and if it helps report the problem to xen-devel
      

      Here is the same for a single A100. As can be seen, we just sit there from 06:23:42 to 06:25:17 (95 seconds).

      Apr  5 06:23:42 qemu-dm-4[3823]: Moving to cgroup slice ''
      Apr  5 06:23:42 qemu-dm-4[3823]: core dump limit: 67108864
      Apr  5 06:23:42 qemu-dm-4[3823]: qemu-dm-4: Machine type 'pc-0.10' is deprecated: use a newer machine type instead
      Apr  5 06:23:42 qemu-dm-4[3823]: char device redirected to /dev/pts/2 (label serial0)
      Apr  5 06:25:17 qemu-dm-4[3823]: [00:05.0] Write-back to unknown field 0xc4 (partially) inhibited (0x00000000)
      Apr  5 06:25:17 qemu-dm-4[3823]: [00:05.0] If the device doesn't work, try enabling permissive mode
      Apr  5 06:25:17 qemu-dm-4[3823]: [00:05.0] (unsafe) and if it helps report the problem to xen-devel
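
The stall can be measured directly from the syslog timestamps rather than read off by eye. The following is a minimal sketch, not part of the original post: it uses a subset of the A100 log lines above, and the helper names `parse_time` and `largest_gap` are hypothetical. Syslog lines carry no year, so 2022 is assumed from the kernel log later in this post.

```python
from datetime import datetime

# A subset of the A100 log lines quoted in the post.
LOG = """\
Apr  5 06:23:42 qemu-dm-4[3823]: Moving to cgroup slice ''
Apr  5 06:23:42 qemu-dm-4[3823]: core dump limit: 67108864
Apr  5 06:23:42 qemu-dm-4[3823]: char device redirected to /dev/pts/2 (label serial0)
Apr  5 06:25:17 qemu-dm-4[3823]: [00:05.0] Write-back to unknown field 0xc4 (partially) inhibited (0x00000000)
"""


def parse_time(line, year=2022):
    # Syslog timestamps omit the year; the default is an assumption.
    month, day, clock = line.split()[:3]
    return datetime.strptime(f"{year} {month} {day} {clock}", "%Y %b %d %H:%M:%S")


def largest_gap(text):
    """Return (seconds, line_before, line_after) for the longest silent interval."""
    lines = [line for line in text.splitlines() if line.strip()]
    times = [parse_time(line) for line in lines]
    gaps = [
        ((later - earlier).total_seconds(), lines[i], lines[i + 1])
        for i, (earlier, later) in enumerate(zip(times, times[1:]))
    ]
    return max(gaps, key=lambda gap: gap[0])


seconds, before, after = largest_gap(LOG)
print(f"{seconds:.0f} s stall before: {after.split(': ', 1)[1]}")
```

Run against a full qemu-dm log, the same function would point at whichever message follows the longest silence, which makes it easier to compare GPU models.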
      

      This brings me to PCI permissive mode. I thought, why not try it and see whether this call, whatever it is, still gets made in permissive mode. But try as I might, I cannot get PCI permissive mode to enable on the actual domU.

      I've booted dom0 w/ xen-pciback.permissive and even tried pci=resource_alignment=, which I believe is deprecated. Xen's pci-back tells me permissive mode is on:

      cat /sys/module/xen_pciback/parameters/permissive
      Y
      

      I even set the mode by hand and verified the device(s) in question show up here

      cat /sys/bus/pci/drivers/pciback/permissive
      0000:ca:00.0
      0000:98:00.0
      0000:4b:00.0
      0000:31:00.0
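
For more than a handful of devices, this check can be scripted. The sketch below is only an illustration: it reads the same sysfs file shown above and reports which of the wanted addresses are missing from it. The function name `not_permissive` is hypothetical, and the fallback listing is just the output quoted in this post, used when the file is unavailable (for example when not running in dom0).

```python
from pathlib import Path

# Same sysfs attribute as the `cat` command in the post.
PERMISSIVE_LIST = Path("/sys/bus/pci/drivers/pciback/permissive")


def not_permissive(wanted, listing):
    """Return the wanted PCI addresses that are absent from pciback's permissive list."""
    enabled = {entry.strip().lower() for entry in listing.splitlines() if entry.strip()}
    return [bdf for bdf in wanted if bdf.lower() not in enabled]


if __name__ == "__main__":
    wanted = ["0000:ca:00.0", "0000:98:00.0", "0000:4b:00.0", "0000:31:00.0"]
    try:
        listing = PERMISSIVE_LIST.read_text()
    except OSError:
        # Not on a dom0 with pciback loaded: fall back to the output quoted in the post.
        listing = "0000:ca:00.0\n0000:98:00.0\n0000:4b:00.0\n0000:31:00.0\n"
    print(not_permissive(wanted, listing))
```

An empty list means pciback considers every wanted device permissive. That only covers the dom0 side and says nothing about what the toolstack passes to qemu.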
      

      Even the kernel says, hey I'm enabling PCI permissive on these devices

      [Tue Apr  5 21:05:00 2022] pciback 0000:31:00.0: enabling permissive mode configuration space accesses!
      [Tue Apr  5 21:05:00 2022] pciback 0000:31:00.0: permissive mode is potentially unsafe!
      [Tue Apr  5 21:05:00 2022] pciback 0000:4b:00.0: enabling permissive mode configuration space accesses!
      [Tue Apr  5 21:05:00 2022] pciback 0000:4b:00.0: permissive mode is potentially unsafe!
      [Tue Apr  5 21:05:00 2022] pciback 0000:ca:00.0: enabling permissive mode configuration space accesses!
      [Tue Apr  5 21:05:00 2022] pciback 0000:ca:00.0: permissive mode is potentially unsafe!
      [Tue Apr  5 21:05:00 2022] pciback 0000:98:00.0: enabling permissive mode configuration space accesses!
      [Tue Apr  5 21:05:00 2022] pciback 0000:98:00.0: permissive mode is potentially unsafe!
      

      I also added pci_permissive=1 to the VM's other-config (along with the PCI addresses of course). I even tried something like other-config:pci=0/0000:ca:00.0,permissive=1 as a wild shot.

      After all this, the domU still boots with permissive mode disabled

      xenopsd-xc: [debug||32 ||xenops] QMP command for domid 2: {"execute":"device_add","id":"qmp-000007-2","arguments":{"driver":"xen-pci-passthrough","id":"pci-pt-ca_00.0","hostaddr":"0000:ca:00.0","permissive":false}}
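
To check this for every passed-through device without reading the log by hand, the QMP payload can be parsed as JSON. This is a minimal sketch that assumes the log line format shown above. The function name `passthrough_permissive` is hypothetical, and the only fields it relies on are the ones visible in that line.

```python
import json

# The xenopsd log line quoted in the post.
LINE = (
    'xenopsd-xc: [debug||32 ||xenops] QMP command for domid 2: '
    '{"execute":"device_add","id":"qmp-000007-2","arguments":'
    '{"driver":"xen-pci-passthrough","id":"pci-pt-ca_00.0",'
    '"hostaddr":"0000:ca:00.0","permissive":false}}'
)


def passthrough_permissive(line):
    """Return (hostaddr, permissive) for a xen-pci-passthrough device_add, else None."""
    start = line.find("{")
    if start < 0:
        return None
    try:
        cmd = json.loads(line[start:])
    except json.JSONDecodeError:
        return None
    args = cmd.get("arguments", {})
    if cmd.get("execute") != "device_add" or args.get("driver") != "xen-pci-passthrough":
        return None
    return args.get("hostaddr"), args.get("permissive")


print(passthrough_permissive(LINE))
```

For the line above this yields the host address `0000:ca:00.0` with `permissive` set to `False`, matching what the post describes.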
      

      Has anyone had success enabling permissive mode? Am I missing something? Or, speaking to the larger problem, has anyone encountered this weird delay on domU start w/ GPU passthrough?

        • olivierlambertO Offline
          olivierlambert Vates 🪐 Co-Founder CEO
          last edited by

          If I understood correctly what I've been told, the "hang time" is in fact the time spent mapping (and remapping) the PCIe BAR.

          This will stay slow until we (i.e. the Xen community) get some patches merged in Xen that enable IOMMU superpages, which should turn all of this into a fraction of a second. I don't know the current status of this.

          Regarding the permissive question, I don't know; I have to ask around too.

          • A Offline
            andyhhp Xen Guru @tomg
            last edited by andyhhp

            @tomg said in PCI Nvidia GPU Passthrough enable permissive?:

            It appears to be consistently 20 - 30 seconds per RTX Ampere GPU, about 20 - 25 seconds on Quadros and ~90 seconds on an A100.
            What's worse on the A100, it seems the calls are made serially, so if I pass through four A100s the wait time to boot will be 4x90s, not optimal.

            These are known, and yeah - they are not great. It's an issue in Xen where the IOMMU logic doesn't (yet) support superpage mappings, so the time delay you're observing is the time taken to map, unmap, and remap the GPU's massive BAR using 4k pages. (It's Qemu taking action in response to the actions of the guest.)

            The good news is that IOMMU superpage support is in progress upstream, and should turn this delay into milliseconds.
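
To get a feel for why page size matters here, the number of IOMMU entries needed to cover a BAR can be worked out directly. This is a rough sketch, not a model of Xen's implementation: the 64 GiB BAR size is a hypothetical value chosen for illustration (the real size can be read with `lspci -vv`), and the page sizes are the ones VT-d hardware can support (4 KiB, 2 MiB, 1 GiB).

```python
KIB, MIB, GIB = 1 << 10, 1 << 20, 1 << 30

# Page sizes that VT-d hardware can support.
PAGE_SIZES = {"4 KiB": 4 * KIB, "2 MiB": 2 * MIB, "1 GiB": 1 * GIB}


def mappings_needed(bar_bytes, page_bytes):
    """Number of IOMMU entries needed to cover a BAR, rounding up to whole pages."""
    return -(-bar_bytes // page_bytes)


# Hypothetical BAR size for illustration only; not taken from the thread.
bar = 64 * GIB
for name, size in PAGE_SIZES.items():
    print(f"{name:>5} pages: {mappings_needed(bar, size):>10,} mappings")
```

Under that assumption, one pass over the BAR takes about 16.8 million entries with 4 KiB pages, 32,768 with 2 MiB pages, and 64 with 1 GiB pages. The map, unmap, and remap sequence described above repeats that work several times.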

            • T Offline
              tomg @andyhhp
              last edited by

              Thank you both for the reply.

              What's interesting is that Xen reports larger IOMMU page size support, not just 4k. So it's just lying to me that these are supported? :]

              (XEN) [    6.844448] Intel VT-d iommu 8 supported page sizes: 4kB, 2MB, 1GB
              (XEN) [    6.852537] Intel VT-d iommu 7 supported page sizes: 4kB, 2MB, 1GB
              (XEN) [    6.860614] Intel VT-d iommu 6 supported page sizes: 4kB, 2MB, 1GB
              (XEN) [    6.868703] Intel VT-d iommu 5 supported page sizes: 4kB, 2MB, 1GB
              (XEN) [    6.876790] Intel VT-d iommu 4 supported page sizes: 4kB, 2MB, 1GB
              (XEN) [    6.884877] Intel VT-d iommu 3 supported page sizes: 4kB, 2MB, 1GB
              (XEN) [    6.892963] Intel VT-d iommu 2 supported page sizes: 4kB, 2MB, 1GB
              (XEN) [    6.901040] Intel VT-d iommu 1 supported page sizes: 4kB, 2MB, 1GB
              (XEN) [    6.909131] Intel VT-d iommu 0 supported page sizes: 4kB, 2MB, 1GB
              (XEN) [    6.917221] Intel VT-d iommu 9 supported page sizes: 4kB, 2MB, 1GB
              
              (XEN) [    7.495448] HVM: HAP page sizes: 4kB, 2MB, 1GB
              

              I gather from your explanation that getting permissive mode to work won't make a difference. I still wanted to try it out of curiosity.

              • T Offline
                tomg
                last edited by

                Trying this on a guest without GPU drivers yields the same delay but no Write-back warning, so that's complete proof that the delay is entirely due to the slow mapping of the GPU's BAR. Not that I doubted anyone ;]

                It already made complete sense, since the A100 has such a large BAR in comparison to the others.

                @andyhhp can you point me to the work being done upstream around IOMMU superpage support?

                • T Offline
                  tomg @andyhhp
                  last edited by

                  @andyhhp said in PCI Nvidia GPU Passthrough boot delay:

                  The good news is that IOMMU superpage support is in progress upstream, and should turn this delay into milliseconds.

                  Assuming this is the work you’re referring to: https://lists.xenproject.org/archives/html/xen-devel/2022-01/msg00277.html

                  Any idea when this will be committed?

                  • A Offline
                    andyhhp Xen Guru @tomg
                    last edited by

                    @tomg That is the work, but it needs rebasing over the XSA-400 work, so a v4 series is going to be needed at a minimum.

                    HAP is Xen's vendor-neutral name for Intel EPT or AMD NPT hardware support. We have had superpage support for many years here.

                    IOMMU pagetables can either be shared with EPT/NPT (reduces the memory overhead of running the VM), or split (required for AMD due to hardware incompatibilities, and also required to support migration of a VM with IO devices).

                    When pagetables are shared, the HAP superpage support gives the IOMMU superpages too (because they're literally the same set of pagetables in memory). When pagetables are split, HAP gets superpages while the IOMMU logic currently uses small pages.

                    • T tomg has marked this topic as solved on
                    • T Offline
                      tomg @olivierlambert
                      last edited by

                      @olivierlambert said in PCI Nvidia GPU Passthrough boot delay:

                      If I understood correctly what I've been told, the "hang time" is in fact the time spent mapping (and remapping) the PCIe BAR.

                      This will stay slow until we (i.e. the Xen community) get some patches merged in Xen that enable IOMMU superpages, which should turn all of this into a fraction of a second. I don't know the current status of this.

                      Regarding the permissive question, I don't know; I have to ask around too.

                      Looks like this finally got committed last week!

                      https://github.com/xen-project/xen/commits/master/xen/drivers/passthrough

                      @olivierlambert - any idea when this will make it into XCP-ng? :]

                      • olivierlambertO Offline
                        olivierlambert Vates 🪐 Co-Founder CEO
                        last edited by

                        You are right, but the code isn't in the right shape right now.

                        However, it is indeed going in the right direction! As soon as it's possible to backport it (or to pick it up in a more recent release), we'll try to make it real 🙂

                        • T Offline
                          tomg @olivierlambert
                          last edited by

                          @olivierlambert any idea if Xen 4.17 will be coming to XCP-ng? :]

                          • olivierlambertO Offline
                            olivierlambert Vates 🪐 Co-Founder CEO
                            last edited by

                            Probably (ideally) before the end of 2024, but it's really hard to make any promises (by that point it might be an even more recent version).
