XCP-ng and NVIDIA GPUs

jcpt928

@apayne Yep. I ran into this issue both at work and at home - we got a really sweet deal on some 1060 Tis - tried to use them in some VMs, came to the realization that NVIDIA had locked them out in the drivers. We IT guys at least got a "nice" GPU out of it in the end - I use mine alongside my 1070 as a dedicated PhysX GPU, that also drives a couple secondary monitors for social stuff and hardware monitoring. I'll see if I can find that thread.

A side note - if you get yourself something like a Dell R720 (or most of their other 2U servers), then you will have ports for external GPU power. You'll still be limited wattage-wise; but, to a lesser extent.

apayne

@jcpt928 I plead insanity: I have a Dell R815. Honestly, the "disable Intel hyperthreading" thing is what pushed me to this AMD unit. https://github.com/speed47/spectre-meltdown-checker claims that the XCP-ng 8.0 beta is properly patched up, "green" all the way across.
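For a quick cross-check of what that script reports, Linux kernels from 4.15 onward expose a per-vulnerability status under /sys/devices/system/cpu/vulnerabilities. Below is a minimal Python sketch that just reads that directory. The function name is made up for illustration, and on a Xen-based host I would expect it to show only the dom0 kernel's view rather than the hypervisor's own mitigations, so treat it as a rough sanity check and not a replacement for the checker script.

```python
from pathlib import Path


def read_vulnerabilities(root="/sys/devices/system/cpu/vulnerabilities"):
    """Return {vulnerability_name: status_string} from the kernel's sysfs view.

    Returns an empty dict if the directory does not exist (older kernel,
    non-Linux system, or a restricted container).
    """
    base = Path(root)
    if not base.is_dir():
        return {}
    status = {}
    for entry in sorted(base.iterdir()):
        if entry.is_file():
            # Each file holds one line such as "Not affected" or "Mitigation: ..."
            status[entry.name] = entry.read_text().strip()
    return status


if __name__ == "__main__":
    for name, state in read_vulnerabilities().items():
        print(f"{name:24} {state}")
```

Any entry whose status starts with "Vulnerable" is the one to look at more closely.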

It came with 24 cores, two drive trays, iDRAC Enterprise, and an H710. Bought 12 4GB ECC RDIMMs for $70 and shoved them in, and mirrored a boot drive with two 60GB SAS drives I had. It's been stable so far. I have an old HP 332B SAS (aka LSI 1068E with re-badge) attached to a creaky-old Promise VTrak s610e with 16x 500GB drives; it was "retired" from my work about a year ago. I'm sorting out the hardware right now. Two of the drives in the array went south, one of which won't even register anymore. SMART says that the remaining drives are as healthy as can be (for drives with 6+ years of spin time on them). Still making a decision on Linux RAID+LVM vs. ZFS, but I might go ZFS with the new 8.0 release. I have yet to do a memory burn-in with MemTest.

Suggestions? I'm all ears.

jcpt928

@apayne On the storage side, I've done time with FreeNAS, Nexenta, OpenFiler, etc. OpenFiler continues to be my favorite; but, it has not been updated in years. (not to mention actual SANs that I've worked with at work - DotHill, Dell\EMC, Quantum, etc.)

I am currently running a single "true-Synology" device (4x 3.5" 3TB WD Reds in SHR) for archive\backup, and, my main storage array is a home-built XPEnology appliance - 24x 2.5" 1TB WD Red drives in a RAID 6, with a 512GB SSD cache. I built this on a SuperMicro 24+2 disk array chassis (I can get the exact model if needed.). I didn't spend more than a couple hundred bucks on the chassis, and acquired almost all the disks for "free". I have been happy with both Synology and XPEnology from a capability and performance perspective - I can pull nearly 300 MB/s over my storage fabric, which isn't too bad for a home array running on RAID 6 with 30 active VMs.

I am exporting iSCSI LUNs over multiple targets (with multi-pathing); but, it also provides NFS shares (among all the other Synology capabilities). This runs over redundant storage fabric (a couple of Brocades) for 4x 1GbE uplinks at the storage side, and 4x 1GbE uplinks at my XCP-ng host. I have a couple servers for backup; but, typically run only the single main server for most workloads, and a cluster of 3x laptops running Sophos nodes on XCP-ng for my edge.
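To put the "nearly 300 MB/s" figure in context, here is a back-of-the-envelope sketch in Python. It assumes all four 1 Gb/s links are usable at once via multipathing and ignores Ethernet, TCP, and iSCSI overhead, so the real ceiling is somewhat lower than what it prints. The function name is only illustrative.

```python
def aggregate_ceiling_mb_s(links, gbit_per_link=1.0):
    """Raw line-rate ceiling in MB/s for `links` multipathed links.

    Uses decimal units (1 Gb/s = 1000 Mb/s) and 8 bits per byte;
    protocol overhead is deliberately ignored.
    """
    return links * gbit_per_link * 1000 / 8


ceiling = aggregate_ceiling_mb_s(4)  # 4x 1GbE uplinks
observed = 300.0                     # "nearly 300 MB/s" from the post

print(f"ceiling {ceiling:.0f} MB/s, observed about {observed / ceiling:.0%} of it")
# prints: ceiling 500 MB/s, observed about 60% of it
```

So roughly 60% of raw line rate, with RAID 6 parity work and 30 active VMs contending for the same spindles.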

Are you sure that R815 doesn't have some external GPU power connectors hidden along the PCIe backplanes?

jcpt928

@apayne I actually haven't taken any active approaches at the hardware level to the Spectre\Meltdown bugs beyond firmware\microcode updates. The scenarios in which those can be taken advantage of aren't nearly as critical as a lot of the fuss made them out to be. Don't get me wrong, they are absolutely something to be aware of, and mitigate where possible; but, I have taken the approach of ensuring my VMs, my network, and my edge are secure - if someone can't get into something and run something that takes advantage of the bug in the first place, that's all that really matters. I think those disabling hyper-threading are going to the extreme in believing they have something that vulnerable to attack (or that worth protecting) unless they're in government, military, or research where there may actually be a valid threat vector there.

apayne

@jcpt928 said in XCP-ng and NVIDIA GPUs:

Are you sure that R815 doesn't have some external GPU power connectors hidden along the PCIe backplanes?

I'll try to slide it out of the rack and take a peek soon. I'm using generic rack trays to hold the unit, so I can't slide it out on arms and just pop the lid. When I last looked, I didn't see anything, nor is there any mention in the Dell docs. Here's a shot I took when I was checking for damage after shipping:

R815 Power & Backplane.jpg

I believe the riser on the left of the shot is where the card goes (I could be wrong, it could be on the right); and I don't see any spare plugs.

jcpt928

@apayne Yep - those risers look very different from the ones used in the 720s. The 720s have power jacks near the top on the inside end - with some splitters, you can even plug in dual-jack video cards as long as you stay under the wattage limits. The ones in your R815 actually look very similar to the ones in the 2950 IIIs.

MajorTom

@jcpt928 said in XCP-ng and NVIDIA GPUs:

I have taken the approach of ensuring my VMs, my network, and my edge are secure - if someone can't get into something and run something that takes advantage of the bug in the first place, that's all that really matters.

I seem to remember this question being asked on this forum, but I can't find it...

Do you use browsers?

jcpt928

@MajorTom I don't on any of the VMs providing services, no. I use a browser - one that is always up-to-date and has other security protections in place - on a specific VM designed solely for management of that environment. I would also consider myself to be a very savvy browser user. I have maybe only once or twice in 20 years come across something truly malicious, unexpectedly, while looking for something else - all other times were when I was intentionally looking for something malicious, and had taken appropriate steps otherwise. Either way, I certainly wasn't counting on just one security control at any time.

MajorTom

@jcpt928 said in XCP-ng and NVIDIA GPUs:

@MajorTom [...] I use a browser - one that is always up-to-date

0-day vulnerabilities happen.

I would also consider myself to be a very savvy browser user.

I believe you. But others may not be so careful.
And these Intel bugs add some attack vectors.
As for "I'm not a bank, nor military, nor do I have state secrets" - I hear it from time to time. But many criminals aren't after those. Many try to exploit resources owned by a victim: CPU, bandwidth, IP addresses... for spambots, mining cryptocurrency, command & control, IP cloaking...

jcpt928

@MajorTom Oh, I'm fully aware. I'm an MCSE, a not-currently-active CISSP, and hold a handful of other certifications. I've been doing this for more than 25 years...makes me feel old. x.x

My home environment is pretty complex and extensive compared to most IT guys'; but, my work environment, while impressive in its own right, is not usually something a lot of IT guys gawk at these days with the massive datacenters we're all used to. I've been lucky to end up at a business that, while under the same security requirements as those many times its size, has given me a lot of freedom to be directly involved in, and\or in charge of, pretty much everything from A to Z.

I have a lot of "unorthodox" IT experience as well - the doing-a-lot-with-little kind of thing - hence my sometimes creative suggestions or recommendations; and, I only wish I could sell or give away half of what I have sitting on shelves in my computer lab downstairs so others can learn as much as I have.

apayne

@olivierlambert I did a bit of light digging.

General consensus is that Dell's servers are not ready for this kind of stuff, but then again I've seen crowds get things wrong before:
https://www.reddit.com/r/homelab/comments/6mafcg/can_i_install_a_gpu_in_a_dell_power_edge_r810/

This is the method described for KVM:
http://mathiashueber.com/fighting-error-43-how-to-use-nvidia-gpu-in-a-virtual-machine/

Additional KVM docs (plus a small description of the vendor ID problem):
https://wiki.archlinux.org/index.php/PCI_passthrough_via_OVMF#"Error_43:_Driver_failed_to_load"_on_Nvidia_GPUs_passed_to_Windows_VMs

An updated methodology for Ryzen on-chip GPU:
http://mathiashueber.com/ryzen-based-virtual-machine-passthrough-setup-ubuntu-18-04/

This is the method described for VMware:
http://codefromabove.com/2019/02/the-hyperconverged-homelab-windows-vm-gaming/

Hyper-V documentation is a bit more sparse, but this hints that Microsoft may have simply worked around the issue (à la vendor license agreements), at least when using RemoteFX:
http://techgenix.com/enabling-physical-gpus-hyper/

(Optional) Get CUDA working for cheap-o cards:
https://medium.com/@samnco/using-the-nvidia-gt-1030-for-cuda-workloads-on-ubuntu-16-04-4eee72d56791

So it looks like the common factors are:

The GPU device must be isolated on the host with the vfio kernel driver. To ensure this, the vfio driver must load first, prior to any vendor or open source driver.
The GPU must be connected to the guest VM via PCI pass-through. No surprise.
The CPU must not be identified as a virtual one; it must report some other identity when probed. This appears to be the key to preventing the dreaded NVIDIA Error 43. It suggests the driver is just examining the CPU assigned to it, although some documentation mentions a "vendor" setting. The work-around is to set that identity to a string the driver doesn't match against, and it just works. Even a setting of "unknown" is shown to work. I don't know if there is a way to tell an XCP-ng guest "please don't identify yourself as virtual".
For cards that are CUDA capable but "unsupported" by NVIDIA, you install the software in a different sequence (CUDA first, then driver).
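
The first and third points above can be sketched in Python for the KVM/libvirt case that the linked guides cover. This is an untested illustration, not an XCP-ng recipe: the PCI address is a made-up placeholder, both function names are hypothetical, and the XML fragment is the libvirt `<features>` form (a `vendor_id` of up to 12 characters under `<hyperv>`, plus `<hidden state='on'/>` under `<kvm>`). Whether Xen offers an equivalent knob is exactly the open question.

```python
import os
import xml.etree.ElementTree as ET


def bound_driver(pci_addr, sysfs_root="/sys/bus/pci/devices"):
    """Return the kernel driver currently bound to a PCI device, or None.

    Linux exposes the binding as a symlink named "driver" in the device's
    sysfs directory; for an isolated GPU it should point at "vfio-pci".
    """
    link = os.path.join(sysfs_root, pci_addr, "driver")
    if not os.path.islink(link):
        return None
    return os.path.basename(os.readlink(link))


def hidden_hypervisor_features(vendor_id="unknown"):
    """Build a libvirt <features> fragment that hides KVM from the guest
    and reports a custom hypervisor vendor ID."""
    if not 1 <= len(vendor_id) <= 12:
        raise ValueError("vendor_id must be 1 to 12 characters")
    features = ET.Element("features")
    hyperv = ET.SubElement(features, "hyperv")
    ET.SubElement(hyperv, "vendor_id", state="on", value=vendor_id)
    kvm = ET.SubElement(features, "kvm")
    ET.SubElement(kvm, "hidden", state="on")
    return ET.tostring(features, encoding="unicode")


if __name__ == "__main__":
    gpu = "0000:01:00.0"  # hypothetical address; find the real one with lspci
    print("driver bound to GPU:", bound_driver(gpu))
    print(hidden_hypervisor_features("unknown"))
```

If `bound_driver` reports anything other than `vfio-pci` for the GPU, a vendor or open source driver got to the card first, and the load order needs fixing before the guest settings matter.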

Disclaimer: I'm just compiling a list to get an idea about what to do; I haven't done the actual install, nor do I have the hardware. Hopefully this helps.