XCP-ng and NVIDIA GPUs

MajorTom

apayne

olivierlambert I did a bit of light digging.

General consensus is that Dell's servers are not ready for this kind of stuff, but then again I've seen crowds get things wrong before:
https://www.reddit.com/r/homelab/comments/6mafcg/can_i_install_a_gpu_in_a_dell_power_edge_r810/

This is the method described for KVM:
http://mathiashueber.com/fighting-error-43-how-to-use-nvidia-gpu-in-a-virtual-machine/

Additional KVM docs (plus a small description of the vendor ID problem):
https://wiki.archlinux.org/index.php/PCI_passthrough_via_OVMF#"Error_43:_Driver_failed_to_load"_on_Nvidia_GPUs_passed_to_Windows_VMs

An updated methodology for Ryzen on-chip GPU:
http://mathiashueber.com/ryzen-based-virtual-machine-passthrough-setup-ubuntu-18-04/

This is the method described for VMware:
http://codefromabove.com/2019/02/the-hyperconverged-homelab-windows-vm-gaming/

Hyper-V documentation is a bit sparser, but this hints that Microsoft may have simply worked around the issue (à la vendor license agreements), at least when using RemoteFX:
http://techgenix.com/enabling-physical-gpus-hyper/

(Optional) Get CUDA working for cheap-o cards:
https://medium.com/@samnco/using-the-nvidia-gt-1030-for-cuda-workloads-on-ubuntu-16-04-4eee72d56791

So it looks like the common factors are:

The GPU device must be isolated on the host with the vfio kernel driver. To ensure this, the vfio driver must load first, before any vendor or open-source driver can claim the card.
The GPU must be attached to the guest VM via PCI passthrough. No surprise.
The CPU must not be identified as a virtual one; it must report some other identity when probed. This appears to be the key to preventing the dreaded NVIDIA Error 43. It suggests the driver is just examining the CPU assigned to the guest, although some documentation mentions a "vendor" setting instead. The workaround is to set that value to a string the driver doesn't match against, and it just works. Even a setting of "unknown" is shown to work. I don't know if there is a way to tell an XCP-ng guest "please don't identify yourself as virtual".
For cards that are CUDA-capable but "unsupported" by NVIDIA, you install the software in a different sequence (CUDA first, then the driver).
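
To make the first and third points a bit more concrete, here is a rough Python sketch based on the KVM/libvirt write-ups linked above. It is not an XCP-ng recipe, and like everything else in this post it has not been run against real hardware. The function names (bound_driver, hide_hypervisor_features) and the PCI address used in the note below are made up for illustration. The first function reads the standard Linux sysfs "driver" symlink to see which kernel driver has claimed a device. The second builds the libvirt features fragment (hyperv vendor_id plus kvm hidden) that the KVM guides use to mask the hypervisor signature.

```python
import os
import xml.etree.ElementTree as ET


def bound_driver(pci_addr, sysfs_root="/sys/bus/pci/devices"):
    """Return the name of the kernel driver bound to a PCI device, or None.

    pci_addr is a full address such as "0000:01:00.0" (hypothetical example).
    On Linux, <sysfs_root>/<pci_addr>/driver is a symlink to the driver's
    directory when a driver has claimed the device.
    """
    link = os.path.join(sysfs_root, pci_addr, "driver")
    if not os.path.islink(link):
        return None  # no driver bound, or no such device
    return os.path.basename(os.readlink(link))


def hide_hypervisor_features(vendor_id="unknown"):
    """Build a libvirt <features> fragment that masks the KVM signature.

    This is the KVM/libvirt approach from the linked guides; whether an
    equivalent knob exists for XCP-ng guests is exactly the open question.
    """
    # libvirt documents the vendor_id value as a string of up to 12 characters.
    if not 1 <= len(vendor_id) <= 12:
        raise ValueError("vendor_id must be 1 to 12 characters")
    features = ET.Element("features")
    hyperv = ET.SubElement(features, "hyperv")
    ET.SubElement(hyperv, "vendor_id", state="on", value=vendor_id)
    kvm = ET.SubElement(features, "kvm")
    ET.SubElement(kvm, "hidden", state="on")
    return ET.tostring(features, encoding="unicode")
```

On a KVM host you would expect bound_driver("0000:01:00.0") to come back as "vfio-pci" before starting the guest, and the string from hide_hypervisor_features() would be merged into the existing features section of the domain XML (for example via virsh edit).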

Disclaimer: I'm just compiling a list to get an idea about what to do; I haven't done the actual install, nor do I have the hardware. Hopefully this helps.