XCP-ng and NVIDIA GPUs

apayne

@jtbw911 @abdullah This is a known issue with NVidia GPU cards. NVidia considers this to be a licensing problem, because (near as anyone is able to guess) they do not want their consumer-grade graphics cards to be used in a VM. So, their drivers detect if you are in a virtual environment, and refuse to activate if it is a VM and not a physical machine.
See: https://gridforums.nvidia.com/default/topic/9108/

You can certainly activate the pass-through function in XCP and the hardware will probably pass through, but if you are using NVidia's software, it will probably reject no matter what you do. I can only speculate on their motives, but I suspect it comes down to two of them: they don't want a terminal server to use a cheaper (and less profitable) card, and they don't want a low-end GPU card to be used for the lucrative "deep learning" market, where high-end cards are sold with a hefty profit margin. This is not something I can confirm, it's just me guessing about why they did this.

There is a slim chance that open source drivers will not observe this license issue and allow the card to be used, but this directly implies you will be running some kind of Linux or FreeBSD installation, not Windows. And since that installation would not be a "pure" terminal service, or support CUDA, it would probably adhere to their license terms "in spirit" but not "to the letter".

jcpt928

@apayne I am fully aware that this is a known issue. That portion of my question was more in alignment with "does the XCP team perhaps have any plans to implement a similar feature in XCP, that KVM has, that can "hide" the virtualization state from the VM itself?". I believe certain versions of VMware also have this feature now. I know this plays into HVM vs. PV status of a VM; but, with KVM being able to implement something, perhaps the possibility for XCP-ng to come up with a solution is there as well. For the record, this is not for any AI or Terminal Services use - I'm looking more into the virtualized gaming\media space.

olivierlambert

@jcpt928 Please provide more details on how it's done, so we can understand what's needed.

apayne

@jcpt928 re: gaming/media, just curious what the guest OS will be?

jcpt928

@olivierlambert There are a number of different ways it can be done, evidently; however, there are some inherent variables involved that are way beyond my scope of expertise. I've linked a couple articles\posts below that go into more detail, and, also provide additional resources discussing this topic. I find it very intriguing; but, it certainly can become the deep end of the pool very quickly.

https://forum.proxmox.com/threads/hide-vm-from-guest.34905/
https://stackoverflow.com/questions/154163/detect-virtualized-os-from-an-application
https://kb.vmware.com/s/article/1009458

I believe the second link provides the most additional resources discussing "No Pill\Red Pill\Blue Pill" scenarios. The third link is VMware's official resource on detection; the first one is a thread discussing the possibilities of hiding the hypervisor from a VM. It seems, according to one thread response on the first link, that KVM may mask the hypervisor-present bit as its way of hiding the VM (at least it starts with that).
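As a rough way to see what that bit looks like from inside a Linux guest: the kernel reports CPUID leaf 1, ECX bit 31 as the "hypervisor" flag in /proc/cpuinfo. The sketch below only checks for that flag; the function names are made up for illustration, and whether masking this one bit is enough to satisfy the NVIDIA driver is not something this check can tell you.

```python
def has_hypervisor_flag(cpuinfo_text: str) -> bool:
    """Return True if a 'flags' line in /proc/cpuinfo-style text lists 'hypervisor'."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        # Only look at the CPU feature list, not at model names or other fields.
        if sep and key.strip() == "flags":
            if "hypervisor" in value.split():
                return True
    return False


def read_cpuinfo(path: str = "/proc/cpuinfo") -> str:
    """Read the kernel's CPU description (Linux on x86 assumed)."""
    with open(path, encoding="utf-8") as fh:
        return fh.read()


if __name__ == "__main__":
    try:
        text = read_cpuinfo()
    except OSError:
        print("no readable /proc/cpuinfo here")
    else:
        print("hypervisor bit visible to this OS:", has_hypervisor_flag(text))
```

On bare metal the flag should be absent; in a guest where the hypervisor hides itself this way, it should be absent as well, which is the whole point of the masking.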

@apayne Honestly, it could be Linux or Windows for the gaming side - doesn't really matter (although, with Steam, Windows is preferred), as the concept is to use it with something like a Steam box or mobile client akin to Steam's streaming client. The ideal scenario is still to have some good "oomph" in the GPU on the back-end. NVIDIA's driver implementation from a VM essentially bars you from using anything modeled higher than a 1030; and, I've read a few places that finding stable older drivers for even their 900 series can be difficult from a VM perspective. AMD is an option, of course; but, the idle [and under load] power consumption and cooling requirements of an AMD card vs. an NVIDIA are quite different - especially in a server setting (even more so in a cost-sensitive home environment - which is part of my use case). My server runs 24x7 regardless, and is battery and generator-backed. My gaming rig consumes as much electricity on its own as my entire server, switching, and storage infrastructure in a half-rack. If I can utilize the already-on server infrastructure to provide even half of my gaming needs, the power savings are well worth it.

For the media side, it could still be Linux or Windows; but, my use case revolves around time\latency-sensitive media encoding\decoding (think Plex and\or DVR for reference) of very high quality video - a GPU would do wonders for this.

apayne

@jcpt928 said in XCP-ng and NVIDIA GPUs:

essentially bars you from using anything modeled higher than a 1030;

That's good news to me in a roundabout way. I'm doing some reading/research on the GT 1030 DDR4, which has the right mix of low power and recent-enough CUDA cores to make it viable for my setup. Short story, the most power I can put into a video card is about 25 watts total; the 1030 DDR4 comes in at 20.

jcpt928

@apayne Oh, no. That includes the 1030. You can't get non-virtualization-detecting drivers for anything 1030 and above. I guess the way I typed that the first time could come across as "only those above the 1030". I've tried the 1030, the 1060 [Ti], and the 1070 to no avail. I'm quite certain the 1080 [Ti] and 1050 [Ti] aren't going to be any different than the others.

I'm currently using a couple of spare Radeon 5600 series in my server for testing - the power consumption vs. performance is not going to be worth it at all.

apayne

@jcpt928 Drat. All the others in the sub-25-watt range have much older chipsets and the AMD units are even older still (HD 4000 series). I don’t have a 1030 but I’m thinking about it.

There is another thread about hardware hiding you might want to search for.

jcpt928

@apayne Yep. I ran into this issue both at work and at home - we got a really sweet deal on some 1060 Tis - tried to use them in some VMs, came to the realization that NVIDIA had locked them out in the drivers. We IT guys at least got a "nice" GPU out of it in the end - I use mine alongside my 1070 as a dedicated PhysX GPU, that also drives a couple secondary monitors for social stuff and hardware monitoring. I'll see if I can find that thread.

A side note - if you get yourself something like a Dell R720 (or most of their other 2U servers), then you will have ports for external GPU power. You'll still be limited wattage-wise; but, to a lesser extent.

apayne

@jcpt928 I plead insanity, I have a Dell R815. Honestly, the "disable Intel hyperthreading" thing is what pushed me to this AMD unit. https://github.com/speed47/spectre-meltdown-checker claims that the XCP 8.0 beta is properly patched up, "green" all the way across.
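For a quick look without the full checker script: recent Linux kernels publish their own view of these bugs as small text files under /sys/devices/system/cpu/vulnerabilities. The sketch below reads them and applies a crude colour mapping; the mapping and the function names are my own illustration, not how spectre-meltdown-checker works, and a dom0 kernel's report does not necessarily reflect what the Xen hypervisor itself mitigates.

```python
from pathlib import Path

# Standard sysfs location on kernels that report CPU vulnerability status.
VULN_DIR = Path("/sys/devices/system/cpu/vulnerabilities")


def classify(status: str) -> str:
    """Map a kernel status string to a coarse colour (illustrative mapping only)."""
    s = status.strip()
    if s.startswith("Not affected") or s.startswith("Mitigation"):
        return "green"
    if s.startswith("Vulnerable"):
        return "red"
    return "unknown"


def read_vulnerabilities(directory: Path = VULN_DIR) -> dict:
    """Return {vulnerability name: status string}; empty if the directory is missing."""
    if not directory.is_dir():
        return {}
    return {
        p.name: p.read_text().strip()
        for p in sorted(directory.iterdir())
        if p.is_file()
    }


if __name__ == "__main__":
    for name, status in read_vulnerabilities().items():
        print(f"{name:24s} {classify(status):8s} {status}")
```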

It came with 24 cores, two drive trays, iDRAC Enterprise, and an H710. Bought 12 4GB ECC RDIMMs for $70 and shoved them in, and mirrored a boot drive with two 60GB SAS drives I had. It's been stable so far. I have an old HP 332B SAS (aka LSI 1068E with re-badge) attached to a creaky-old Promise VTrak s610e with 16x 500GB drives; it was "retired" from my work about a year ago. I'm sorting out the hardware right now. Two of the drives in the array went south, one of which won't even register anymore. SMART says that the remaining drives are as healthy as can be (for drives with 6+ years of spin time on them). Still making a decision on Linux RAID+LVM vs. ZFS, but I might go ZFS with the new 8.0 release. I have yet to do a memory burn-in with MemTest.

Suggestions? I'm all ears.

jcpt928

@apayne On the storage side, I've done time with FreeNAS, Nexenta, OpenFiler, etc. OpenFiler continues to be my favorite; but, it has not been updated in years. (not to mention actual SANs that I've worked with at work - DotHill, Dell\EMC, Quantum, etc.)

I am currently running a single "true-Synology" device (4x 3.5" 3TB WD Reds in SHR) for archive\backup, and, my main storage array is a home-built XPEnology appliance - 24x 2.5" 1TB WD Red drives in a RAID 6, with a 512GB SSD cache. I built this on a SuperMicro 24+2 disk array chassis (I can get the exact model if needed.). I didn't spend more than a couple hundred bucks on the chassis, and acquired almost all the disks for "free". I have been happy with both Synology and XPEnology from a capability and performance perspective - I can pull nearly 300 MB/s over my storage fabric, which isn't too bad for a home array running on RAID 6 with 30 active VMs.

I am exporting iSCSI LUNs over multiple targets (with multi-pathing); but, it also provides NFS shares (among all the other Synology capabilities). This runs over redundant storage fabric (a couple of Brocades) for 4x 1GbE uplinks at the storage side, and 4x 1GbE uplinks at my XCP-ng host. I have a couple servers for backup; but, typically run only the single main server for most workloads, and a cluster of 3x laptops running Sophos nodes on XCP-ng for my edge.
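To put the "nearly 300 MB/s" figure in context against 4x 1GbE, here is a back-of-the-envelope calculation. It assumes multi-pathing spreads traffic across all four links and ignores Ethernet, TCP, and iSCSI overhead, so the real ceiling is somewhat lower than the raw number; the function name is just for illustration.

```python
def link_ceiling_mb_per_s(links: int, gbit_per_link: float) -> float:
    """Raw aggregate line rate in decimal MB/s, ignoring all protocol overhead."""
    return links * gbit_per_link * 1000 / 8


if __name__ == "__main__":
    ceiling = link_ceiling_mb_per_s(4, 1.0)   # four 1 Gbit/s uplinks
    observed = 300.0                          # approximate figure quoted above
    print(f"raw ceiling: {ceiling:.0f} MB/s")
    print(f"observed share of ceiling: {observed / ceiling:.0%}")
```

So the array is using roughly three fifths of the raw fabric capacity, before accounting for overhead.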

Are you sure that R815 doesn't have some external GPU power connectors hidden along the PCIe backplanes?

jcpt928

@apayne I actually haven't taken any active approaches at the hardware level to the Spectre\Meltdown bugs beyond firmware\microcode updates. The scenarios in which those can be taken advantage of aren't nearly as critical as a lot of the fuss made it out to be. Don't get me wrong, they are absolutely something to be aware of, and mitigate where possible; but, I have taken the approach of ensuring my VMs, my network, and my edge is secure - if someone can't get into something and run something that takes advantage of the bug in the first place, that's all that really matters. I think those disabling hyper-threading are going to the extreme in believing they have something that vulnerable to attack (or that worth protecting) unless they're in government, military, or research where there may actually be a valid threat vector there.

apayne

@jcpt928 said in XCP-ng and NVIDIA GPUs:

Are you sure that R815 doesn't have some external GPU power connectors hidden along the PCIe backplanes?

I'll try to slide it out of the rack and take a peek soon. I'm using generic rack trays to hold the unit, so I can't slide it out on arms and just pop the lid. When I last looked, I didn't see anything, nor is there any mention in the Dell docs. Here's a shot I took when I was checking for damage after shipping:

[Image: R815 Power & Backplane.jpg]

I believe the riser on the left of the shot is where the card goes (I could be wrong, it could be on the right); and I don't see any spare plugs.

jcpt928

@apayne Yep - those risers look very different from the ones used in the 720s. The 720s have power jacks near the top on the inside end - with some splitters, you can even plug in dual-jack video cards as long as you stay under the wattage limits. The ones in your R815 actually look very similar to the ones in the 2950 IIIs.

MajorTom

@jcpt928 said in XCP-ng and NVIDIA GPUs:

I have taken the approach of ensuring my VMs, my network, and my edge is secure - if someone can't get into something and run something that takes advantage of the bug in the first place, that's all that really matters.

I seem to remember this question being asked on this forum, but I can't find it...

Do you use browsers?

jcpt928

@MajorTom I don't on any of the VMs providing services, no. I use a browser - one that is always up-to-date and has other security protections in place - on a specific VM designed solely for management of that environment. I would also consider myself to be a very savvy browser user. I have maybe only once or twice in 20 years come across something truly malicious, unexpectedly, while looking for something else - all other times were when I was intentionally looking for something malicious, and had taken appropriate steps otherwise. Either way, I certainly wasn't counting on just one security control at any time.

MajorTom

@jcpt928 said in XCP-ng and NVIDIA GPUs:

@MajorTom [...] I use a browser - one that is always up-to-date

0-day vulnerabilities happen.

I would also consider myself to be a very savvy browser user.

I believe you. But others may not be so careful.
And these Intel bugs add some vectors of attack.
As for "I'm not a bank, nor military, nor do I have state secrets" - I hear it from time to time. But many criminals don't seek those. Many try to exploit resources owned by a victim: CPU, bandwidth, IP addresses... for spambots, mining cryptocurrency, command & control, IP cloaking...

jcpt928

@MajorTom Oh, I'm fully aware. I'm an MCSE, a not-currently-active CISSP, and hold a handful of other certifications. I've been doing this for more than 25 years...makes me feel old. x.x

My home environment is pretty complex and extensive compared to most IT guys; but, my work environment, while impressive in its own right, is not usually something a lot of IT guys gawk at these days with the massive datacenters we're all used to. I've been lucky to end up at a business that, while under the same security requirements as those many times its size, has given me a lot of freedom to be directly involved in, and\or in charge of, pretty much everything from A to Z.

I have a lot of "unorthodox" IT experience as well - doing a lot with little kind of thing - hence my sometimes creative suggestions or recommendations; and, I only wish I could sell or give away half of what I have sitting on shelves in my computer lab downstairs so others can learn as much as I have.

apayne

@olivierlambert I did a bit of light digging.

General consensus is that Dell's servers are not ready for this kind of stuff, but then again I've seen crowds get things wrong before:
https://www.reddit.com/r/homelab/comments/6mafcg/can_i_install_a_gpu_in_a_dell_power_edge_r810/

This is the method described for KVM:
http://mathiashueber.com/fighting-error-43-how-to-use-nvidia-gpu-in-a-virtual-machine/

Additional KVM docs (plus a small description of the vendor ID problem):
https://wiki.archlinux.org/index.php/PCI_passthrough_via_OVMF#"Error_43:_Driver_failed_to_load"_on_Nvidia_GPUs_passed_to_Windows_VMs

An updated methodology for Ryzen on-chip GPU:
http://mathiashueber.com/ryzen-based-virtual-machine-passthrough-setup-ubuntu-18-04/

This is the method described for VMware:
http://codefromabove.com/2019/02/the-hyperconverged-homelab-windows-vm-gaming/

Hyper-V documentation is a bit more sparse, but this hints that Microsoft may have simply worked around the issue (à la vendor license agreements), at least when using RemoteFX:
http://techgenix.com/enabling-physical-gpus-hyper/

(Optional) Get CUDA working for cheap-o cards:
https://medium.com/@samnco/using-the-nvidia-gt-1030-for-cuda-workloads-on-ubuntu-16-04-4eee72d56791

So it looks like the common factors are:

The GPU device must be isolated on the host with the vfio kernel driver. To ensure this, the vfio driver must load first, prior to any vendor or open source driver.
GPU must be connected to the guest VM via PCI pass-through. No surprise.
The CPU must not be identified as a virtual one; it must have some other identity when probed. This appears to be the key to preventing the dreaded NVidia Error 43; it suggests the driver is just examining the CPU assigned to it, although some documentation mentions a "vendor" setting. The work-around is to set the identity to a string the driver doesn't match against, and it just works. Even a setting of "unknown" is shown to work. I don't know if there is a way to specify in an XCP guest "please don't identify yourself as virtual".
For cards that are CUDA capable but "unsupported" by NVidia, you install the software in a different sequence (CUDA first, then driver).
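
For reference, on the KVM side the "don't identify yourself as virtual" part is usually expressed in the libvirt domain XML as a `<kvm><hidden state='on'/></kvm>` element plus a Hyper-V `vendor_id` override (libvirt limits the value to 12 characters). The sketch below just builds that `<features>` fragment so the shape is visible; the helper name `build_features` is made up, and this is libvirt/KVM only. I'm not aware of an equivalent XCP-ng setting, and nothing here implies one exists.

```python
import xml.etree.ElementTree as ET


def build_features(vendor_id: str = "unknown", hide_kvm: bool = True) -> ET.Element:
    """Build a libvirt <features> fragment that masks the hypervisor identity.

    vendor_id is an arbitrary string the guest driver will not recognise;
    libvirt accepts at most 12 characters for it.
    """
    if not 1 <= len(vendor_id) <= 12:
        raise ValueError("vendor_id must be 1 to 12 characters")
    features = ET.Element("features")
    hyperv = ET.SubElement(features, "hyperv")
    ET.SubElement(hyperv, "vendor_id", {"state": "on", "value": vendor_id})
    if hide_kvm:
        # Hides the KVM signature from the guest.
        kvm = ET.SubElement(features, "kvm")
        ET.SubElement(kvm, "hidden", {"state": "on"})
    return features


if __name__ == "__main__":
    print(ET.tostring(build_features(), encoding="unicode"))
```

The resulting fragment would be merged into the `<features>` section of an existing domain definition (for example via `virsh edit`), alongside the PCI pass-through device entry.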

Disclaimer: I'm just compiling a list to get an idea about what to do; I haven't done the actual install, nor do I have the hardware. Hopefully this helps.