Hi all,
I've had great success passing through Nvidia GPUs, from Quadros to Teslas and now Ampere, and I've had no problems with multiple GPUs either. However, I cannot get rid of a delay on VM start that appears to be triggered by a call to the PCI device that hangs. The delay seems to vary by GPU model: consistently 20 to 30 seconds per RTX Ampere GPU, about 20 to 25 seconds on Quadros, and ~90 seconds on an A100.
What's worse, on the A100 the calls seem to be made serially, so if I pass through four A100s the wait before boot will be 4 x 90 s, which is not optimal.
Here is what this looks like from qemu's perspective on an A4000. For some reason the calls appear to run concurrently on this device, so both devices complete in 20 s:
Apr 5 21:09:40 qemu-dm-2[9040]: Moving to cgroup slice ''
Apr 5 21:09:40 qemu-dm-2[9040]: core dump limit: 67108864
Apr 5 21:09:40 qemu-dm-2[9040]: qemu-dm-2: Machine type 'pc-0.10' is deprecated: use a newer machine type instead
Apr 5 21:09:40 qemu-dm-2[9040]: char device redirected to /dev/pts/3 (label serial0)
Apr 5 21:10:00 qemu-dm-2[9040]: [00:05.0] Write-back to unknown field 0xc4 (partially) inhibited (0x00000000)
Apr 5 21:10:00 qemu-dm-2[9040]: [00:05.0] If the device doesn't work, try enabling permissive mode
Apr 5 21:10:00 qemu-dm-2[9040]: [00:05.0] (unsafe) and if it helps report the problem to xen-devel
Apr 5 21:10:00 qemu-dm-2[9040]: [00:06.0] Write-back to unknown field 0xc4 (partially) inhibited (0x00000000)
Apr 5 21:10:00 qemu-dm-2[9040]: [00:06.0] If the device doesn't work, try enabling permissive mode
Apr 5 21:10:00 qemu-dm-2[9040]: [00:06.0] (unsafe) and if it helps report the problem to xen-devel
Here is the same for a single A100. As can be seen, we just sit there from 06:23:42 to 06:25:17:
Apr 5 06:23:42 qemu-dm-4[3823]: Moving to cgroup slice ''
Apr 5 06:23:42 qemu-dm-4[3823]: core dump limit: 67108864
Apr 5 06:23:42 qemu-dm-4[3823]: qemu-dm-4: Machine type 'pc-0.10' is deprecated: use a newer machine type instead
Apr 5 06:23:42 qemu-dm-4[3823]: char device redirected to /dev/pts/2 (label serial0)
Apr 5 06:25:17 qemu-dm-4[3823]: [00:05.0] Write-back to unknown field 0xc4 (partially) inhibited (0x00000000)
Apr 5 06:25:17 qemu-dm-4[3823]: [00:05.0] If the device doesn't work, try enabling permissive mode
Apr 5 06:25:17 qemu-dm-4[3823]: [00:05.0] (unsafe) and if it helps report the problem to xen-devel
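For anyone who wants to measure the stall on their own host rather than eyeball it, here is a minimal sketch that finds the largest pause between consecutive lines of a qemu-dm log. It assumes the syslog layout shown above (month, day, HH:MM:SS as the first three fields); the function name `largest_gap` and the default year are my own choices, since syslog lines carry no year.

```python
from datetime import datetime

# The A100 lines from the log above that bracket the stall.
A100_LOG = [
    "Apr 5 06:23:42 qemu-dm-4[3823]: core dump limit: 67108864",
    "Apr 5 06:23:42 qemu-dm-4[3823]: char device redirected to /dev/pts/2 (label serial0)",
    "Apr 5 06:25:17 qemu-dm-4[3823]: [00:05.0] Write-back to unknown field 0xc4 (partially) inhibited (0x00000000)",
]

def largest_gap(lines, year=2022):
    """Return (seconds, line_before, line_after) for the longest pause.

    `year` is an assumption: it is only needed to build a full datetime.
    """
    stamps = []
    for line in lines:
        month, day, clock = line.split()[:3]
        ts = datetime.strptime(f"{year} {month} {day} {clock}", "%Y %b %d %H:%M:%S")
        stamps.append((ts, line))

    best = (0.0, None, None)
    for (t0, l0), (t1, l1) in zip(stamps, stamps[1:]):
        gap = (t1 - t0).total_seconds()
        if gap > best[0]:
            best = (gap, l0, l1)
    return best

seconds, before, after = largest_gap(A100_LOG)
print(seconds)  # 95.0
```

The line reported as `before` is the last thing qemu logged ahead of the stall, which is a reasonable place to start looking for whatever is blocking.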
This brings me to PCI permissive mode. I thought, why not try it to see whether this call, whatever it is, still gets made in permissive mode? But try as I might, I cannot get PCI permissive mode enabled on the actual domU.
I've booted dom0 with xen-pciback.permissive and even tried pci=resource_alignment=, which I believe is deprecated. Xen's pciback tells me permissive mode is on:
cat /sys/module/xen_pciback/parameters/permissive
Y
I even set the mode by hand and verified that the devices in question show up here:
cat /sys/bus/pci/drivers/pciback/permissive
0000:ca:00.0
0000:98:00.0
0000:4b:00.0
0000:31:00.0
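A small sketch for double-checking the sysfs list against the devices you intend to pass through. It only parses the text shown above. The helper names (`parse_permissive`, `missing_permissive`, `read_permissive`) are made up for illustration, and the path is the one from the `cat` command in this post.

```python
from pathlib import Path

# Path taken from the cat command above; adjust if your driver directory differs.
PERMISSIVE_PATH = Path("/sys/bus/pci/drivers/pciback/permissive")

def parse_permissive(text):
    """Turn the sysfs file contents into a set of lower-cased PCI addresses."""
    return {line.strip().lower() for line in text.splitlines() if line.strip()}

def missing_permissive(wanted, text):
    """Return the addresses in `wanted` that pciback does not list as permissive."""
    have = parse_permissive(text)
    return sorted(bdf.lower() for bdf in wanted if bdf.lower() not in have)

def read_permissive(path=PERMISSIVE_PATH):
    """Read the live list on dom0 (needs read access to sysfs)."""
    return path.read_text()

# Contents as printed in this post.
SAMPLE = "0000:ca:00.0\n0000:98:00.0\n0000:4b:00.0\n0000:31:00.0\n"

print(missing_permissive(["0000:ca:00.0", "0000:31:00.0"], SAMPLE))  # []
```

On dom0 you would pass `read_permissive()` instead of `SAMPLE`. An empty list means pciback itself considers every listed device permissive, which matches what I see here.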
Even the kernel says it is enabling PCI permissive mode on these devices:
[Tue Apr 5 21:05:00 2022] pciback 0000:31:00.0: enabling permissive mode configuration space accesses!
[Tue Apr 5 21:05:00 2022] pciback 0000:31:00.0: permissive mode is potentially unsafe!
[Tue Apr 5 21:05:00 2022] pciback 0000:4b:00.0: enabling permissive mode configuration space accesses!
[Tue Apr 5 21:05:00 2022] pciback 0000:4b:00.0: permissive mode is potentially unsafe!
[Tue Apr 5 21:05:00 2022] pciback 0000:ca:00.0: enabling permissive mode configuration space accesses!
[Tue Apr 5 21:05:00 2022] pciback 0000:ca:00.0: permissive mode is potentially unsafe!
[Tue Apr 5 21:05:00 2022] pciback 0000:98:00.0: enabling permissive mode configuration space accesses!
[Tue Apr 5 21:05:00 2022] pciback 0000:98:00.0: permissive mode is potentially unsafe!
I also added pci_permissive=1 to the VM's other-config (along with the PCI addresses, of course). I even tried something like other-config:pci=0/0000:ca:00.0,permissive=1 as a wild shot.
After all this, the domU still boots with permissive mode disabled:
xenopsd-xc: [debug||32 ||xenops] QMP command for domid 2: {"execute":"device_add","id":"qmp-000007-2","arguments":{"driver":"xen-pci-passthrough","id":"pci-pt-ca_00.0","hostaddr":"0000:ca:00.0","permissive":false}}
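To check this across several devices or boots without reading the log by eye, a rough sketch like the following pulls the JSON payload out of a xenopsd debug line and reports what was actually sent to qemu. It assumes the line format shown above, where the QMP command starts at the first `{`; the function name `qmp_passthrough_flags` is hypothetical.

```python
import json

# The xenopsd line from this post.
LOG_LINE = (
    'xenopsd-xc: [debug||32 ||xenops] QMP command for domid 2: '
    '{"execute":"device_add","id":"qmp-000007-2","arguments":'
    '{"driver":"xen-pci-passthrough","id":"pci-pt-ca_00.0",'
    '"hostaddr":"0000:ca:00.0","permissive":false}}'
)

def qmp_passthrough_flags(log_line):
    """Return (hostaddr, permissive) for a xen-pci-passthrough device_add line.

    Returns None when the line holds no such command.
    """
    start = log_line.find("{")
    if start < 0:
        return None
    # raw_decode tolerates any trailing text after the JSON object.
    cmd, _ = json.JSONDecoder().raw_decode(log_line, start)
    args = cmd.get("arguments", {})
    if cmd.get("execute") != "device_add" or args.get("driver") != "xen-pci-passthrough":
        return None
    return args.get("hostaddr"), args.get("permissive")

print(qmp_passthrough_flags(LOG_LINE))  # ('0000:ca:00.0', False)
```

If this keeps returning `False` while the sysfs list says the device is permissive, the flag is being dropped somewhere between the VM config and the QMP call, not in pciback.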
Has anyone had success enabling permissive mode? Am I missing something? Or, speaking of the larger problem, has anyone encountered this weird delay on domU start with GPU passthrough?