@olivierlambert I think this is probably the most sensible of the schemes so far.
The dark background reduces glare, text is high-contrast, and the red actually draws the eye as a highlight or accent color. Looks sharp!
Best posts made by apayne
-
RE: New XCP-ng "theme" in shell
-
RE: XCP-ng and NVIDIA GPUs
@olivierlambert I did a bit of light digging.
General consensus is that Dell's servers are not ready for this kind of stuff, but then again I've seen crowds get things wrong before:
https://www.reddit.com/r/homelab/comments/6mafcg/can_i_install_a_gpu_in_a_dell_power_edge_r810/
This is the method described for KVM:
http://mathiashueber.com/fighting-error-43-how-to-use-nvidia-gpu-in-a-virtual-machine/
Additional KVM docs (plus a short description of the vendor ID problem):
https://wiki.archlinux.org/index.php/PCI_passthrough_via_OVMF#"Error_43:_Driver_failed_to_load"_on_Nvidia_GPUs_passed_to_Windows_VMs
An updated methodology for the Ryzen on-chip GPU:
http://mathiashueber.com/ryzen-based-virtual-machine-passthrough-setup-ubuntu-18-04/
This is the method described for VMware:
http://codefromabove.com/2019/02/the-hyperconverged-homelab-windows-vm-gaming/
Hyper-V documentation is a bit more sparse, but this hints that Microsoft may have simply worked around the issue (a la vendor license agreements), at least when using RemoteFX:
http://techgenix.com/enabling-physical-gpus-hyper/
(Optional) Get CUDA working on cheap-o cards:
https://medium.com/@samnco/using-the-nvidia-gt-1030-for-cuda-workloads-on-ubuntu-16-04-4eee72d56791
So it looks like the common factors are:
- The GPU device must be isolated on the host with the vfio kernel driver. To ensure this, the vfio driver must load first, prior to any vendor or open source driver.
- GPU must be connected to the guest VM via PCI pass-through. No surprise.
- The CPU must not identify itself as virtual; it must present some other identity when probed. This appears to be the key to preventing the dreaded NVIDIA Error 43. It suggests the driver is simply examining the CPU assigned to it, although some documentation mentions a "vendor" setting. The workaround is to change that identity to a string the driver doesn't match against, and it just works; even a value of "unknown" is shown to work. I don't know if there is a way to tell an XCP guest "please don't identify yourself as virtual".
- For cards that are CUDA capable but "unsupported" by NVIDIA, you install the software in a different sequence (CUDA first, then driver).
Disclaimer: I'm just compiling a list to get an idea about what to do; I haven't done the actual install, nor do I have the hardware. Hopefully this helps.
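For the KVM/libvirt route those links describe, the host-side pieces boil down to something like the sketch below. Untested on my end; the PCI IDs are placeholders for a hypothetical GT 1030, and the paths/XML are the ones the Arch wiki and Mathias Hueber guides use, so double-check everything against your own system:

```python
#!/usr/bin/env python3
"""Sketch: host-side bits for NVIDIA passthrough on KVM/libvirt.

Assumptions (placeholders, not from a real system):
  - the GPU's video and audio functions show up in `lspci -nn` as
    10de:1d01 and 10de:0fb8 (a GT 1030; substitute your own IDs),
  - the distro reads module options from /etc/modprobe.d/.
"""

GPU_IDS = ["10de:1d01", "10de:0fb8"]  # placeholder vendor:device IDs

# 1. Claim the GPU with vfio-pci before nouveau/nvidia can grab it.
VFIO_CONF = (
    f"options vfio-pci ids={','.join(GPU_IDS)}\n"
    "softdep nouveau pre: vfio-pci\n"
    "softdep nvidia pre: vfio-pci\n"
)

# 2. Hide KVM from the guest and give the Hyper-V enlightenments a
#    non-standard vendor id, the usual Error 43 workaround for Windows guests.
LIBVIRT_FEATURES = """\
<features>
  <hyperv>
    <vendor_id state='on' value='whatever'/>
  </hyperv>
  <kvm>
    <hidden state='on'/>
  </kvm>
</features>
"""

if __name__ == "__main__":
    # Print instead of writing files, so nothing changes until you review it.
    print("# contents for /etc/modprobe.d/vfio.conf:")
    print(VFIO_CONF)
    print("# merge into the <domain> element via `virsh edit <vm-name>`:")
    print(LIBVIRT_FEATURES)
```

Whether XCP-ng/Xen exposes an equivalent of the `hidden`/`vendor_id` knobs is exactly the open question in the third bullet above.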
-
RE: XCP-ng 8.0.0 Beta now available!
@cg re: old CPU support, the note was already taken; it just strikes me as ironic. I understand the need for less power-hungry processing, but in my case I'm using a hefty 260 watts idling for the server alone. I am guessing the old SAS/SATA enclosure I have rigged to it is another 300 or so at idle. That's a lot of draw for "newer", but the capacity is excellent, so I won't be needing more hardware. Plenty of room to spin up new VMs.
Re: Hyper-V, I already use 2nd-gen VMs at work via old 2012 R2 installs. It's OK and gets the job done. However, I've been tasked with shuffling and consolidating some older installations onto newer hosts/hardware, and the move process is a bit clunky for that version. Device pass-through seems a bit limited.
The last time I saw VMware was a demo lab we did at work, and it too was "just OK"; that was with the vSphere(?) web interface, etc. However, last I heard they are slowly tightening the list of supported hardware drivers and, by extension, supported hardware. That was a few years back, so maybe they have added new drivers.
XCP benefits from decades of Linux device driver development. It simply boots, and that removes a lot of "officially supported hardware" headache for me. And the price is right, too.
Latest posts made by apayne
-
RE: Windows Server 2016 & 2019 freezing on multiple hosts
@michael this may sound silly, but perhaps this isn't a software issue? Maybe you have a faulty stick of RAM hiding in the machine? Bad RAM will make all kinds of strange and flaky things happen. This is just a guess; I'm putting the idea out there because I've seen similar weird behavior on machines where a RAM failure isn't easy to spot (no front panel on the server with a fault light).
-
RE: XCP-ng and NVIDIA GPUs
@olivierlambert I did a bit of light digging.
General consensus is that Dell's servers are not ready for this kind of stuff, but then again I've seen crowds get things wrong before:
https://www.reddit.com/r/homelab/comments/6mafcg/can_i_install_a_gpu_in_a_dell_power_edge_r810/
This is the method described for KVM:
http://mathiashueber.com/fighting-error-43-how-to-use-nvidia-gpu-in-a-virtual-machine/
Additional KVM docs (plus a short description of the vendor ID problem):
https://wiki.archlinux.org/index.php/PCI_passthrough_via_OVMF#"Error_43:_Driver_failed_to_load"_on_Nvidia_GPUs_passed_to_Windows_VMs
An updated methodology for the Ryzen on-chip GPU:
http://mathiashueber.com/ryzen-based-virtual-machine-passthrough-setup-ubuntu-18-04/
This is the method described for VMware:
http://codefromabove.com/2019/02/the-hyperconverged-homelab-windows-vm-gaming/
Hyper-V documentation is a bit more sparse, but this hints that Microsoft may have simply worked around the issue (a la vendor license agreements), at least when using RemoteFX:
http://techgenix.com/enabling-physical-gpus-hyper/
(Optional) Get CUDA working on cheap-o cards:
https://medium.com/@samnco/using-the-nvidia-gt-1030-for-cuda-workloads-on-ubuntu-16-04-4eee72d56791
So it looks like the common factors are:
- The GPU device must be isolated on the host with the vfio kernel driver. To ensure this, the vfio driver must load first, prior to any vendor or open source driver.
- GPU must be connected to the guest VM via PCI pass-through. No surprise.
- The CPU must not identify itself as virtual; it must present some other identity when probed. This appears to be the key to preventing the dreaded NVIDIA Error 43. It suggests the driver is simply examining the CPU assigned to it, although some documentation mentions a "vendor" setting. The workaround is to change that identity to a string the driver doesn't match against, and it just works; even a value of "unknown" is shown to work. I don't know if there is a way to tell an XCP guest "please don't identify yourself as virtual".
- For cards that are CUDA capable but "unsupported" by NVIDIA, you install the software in a different sequence (CUDA first, then driver).
Disclaimer: I'm just compiling a list to get an idea about what to do; I haven't done the actual install, nor do I have the hardware. Hopefully this helps.
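Roughly, the XCP-ng side of the pass-through bullet would look something like the sketch below, using the stock xe/xen-cmdline tools. Again untested by me; the PCI address and VM UUID are placeholders:

```python
#!/usr/bin/env python3
"""Sketch: pass a GPU through to a guest on XCP-ng with the stock `xe` tools.

Placeholders: the PCI address 0000:0b:00.0 and the VM UUID are made up;
get the real ones from `lspci -D` and `xe vm-list`. Run as root in dom0,
and reboot the host after the first step so the hide takes effect.
"""
import subprocess

GPU_PCI_ADDR = "0000:0b:00.0"                     # placeholder PCI address
VM_UUID = "12345678-abcd-ef01-2345-6789abcdef01"  # placeholder VM UUID


def run(cmd):
    """Echo a command, then run it and raise if it fails."""
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # 1. Keep dom0's own drivers off the card (needs a host reboot).
    run([
        "/opt/xensource/libexec/xen-cmdline",
        "--set-dom0",
        f"xen-pciback.hide=({GPU_PCI_ADDR})",
    ])

    # 2. Attach the hidden device to the guest by PCI address.
    run([
        "xe", "vm-param-set",
        f"other-config:pci=0/{GPU_PCI_ADDR}",
        f"uuid={VM_UUID}",
    ])
```

After the reboot, `xe vm-param-list uuid=<vm-uuid>` should show the pci entry under other-config. This only covers the pass-through itself, not the Error 43 identity problem.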
-
RE: XCP-ng Center 8.0.0 released
I just tried the help feature from the program menu because I wanted to see the port numbers it uses. It appears the help links are not connected to anything?
-
RE: XCP-ng and NVIDIA GPUs
@jcpt928 said in XCP-ng and NVIDIA GPUs:
Are you sure that R815 doesn't have some external GPU power connectors hidden along the PCIe backplanes?
I'll try to slide it out of the rack and take a peek soon. I'm using generic rack trays to hold the unit, so I can't slide it out on arms and just pop the lid. When I last looked, I didn't see anything, nor is there any mention in the Dell docs. Here's a shot I took when I was checking for damage after shipping:
I believe the riser on the left of the shot is where the card goes (I could be wrong; it could be on the right), and I don't see any spare plugs.
-
RE: XCP-ng and NVIDIA GPUs
@jcpt928 I plead insanity: I have a Dell R815. Honestly, the "disable Intel Hyper-Threading" thing is what pushed me to this AMD unit. https://github.com/speed47/spectre-meltdown-checker claims that the XCP 8.0 beta is properly patched up, "green" all the way across.
It came with 24 cores, two drive trays, iDRAC Enterprise, and an H710. I bought 12 x 4 GB ECC RDIMMs for $70 and shoved them in, and mirrored a boot drive with two 60 GB SAS drives I had. It's been stable so far. I have an old HP 332B SAS controller (a re-badged LSI 1068E) attached to a creaky old Promise VTrak s610e with 16 x 500 GB drives; it was "retired" from my work about a year ago. I'm sorting out the hardware right now. Two of the drives in the array went south, one of which won't even register anymore. SMART says the remaining drives are as healthy as can be (for drives with 6+ years of spin time on them). I'm still deciding between Linux RAID+LVM and ZFS, but I might go ZFS with the new 8.0 release. I have yet to do a memory burn-in with MemTest.
Suggestions? I'm all ears.
-
RE: XCP-ng and NVIDIA GPUs
@jcpt928 Drat. All the others in the sub-25-watt range have much older chipsets and the AMD units are even older still (HD 4000 series). I don't have a 1030 but I'm thinking about it.
There is another thread about hardware hiding you might want to search for.
-
RE: XCP-ng and NVIDIA GPUs
@jcpt928 said in XCP-ng and NVIDIA GPUs:
essentially bars you from using anything modeled higher than a 1030;
That's good news to me in a roundabout way. I'm doing some reading/research on the GT 1030 DDR4, which has the right mix of low power and recent-enough CUDA cores to make it viable for my setup. Short story, the most power I can put into a video card is about 25 watts total; the 1030 DDR4 comes in at 20.
-
RE: XCP-ng 8.0.0 Beta now available!
@cg re: old CPU support, the note was already taken; it just strikes me as ironic. I understand the need for less power-hungry processing, but in my case I'm using a hefty 260 watts idling for the server alone. I am guessing the old SAS/SATA enclosure I have rigged to it is another 300 or so at idle. That's a lot of draw for "newer", but the capacity is excellent, so I won't be needing more hardware. Plenty of room to spin up new VMs.
Re: Hyper-V, I already use 2nd-gen VMs at work via old 2012 R2 installs. It's OK and gets the job done. However, I've been tasked with shuffling and consolidating some older installations onto newer hosts/hardware, and the move process is a bit clunky for that version. Device pass-through seems a bit limited.
The last time I saw VMware was a demo lab we did at work, and it too was "just OK"; that was with the vSphere(?) web interface, etc. However, last I heard they are slowly tightening the list of supported hardware drivers and, by extension, supported hardware. That was a few years back, so maybe they have added new drivers.
XCP benefits from decades of Linux device driver development. It simply boots, and that removes a lot of "officially supported hardware" headache for me. And the price is right, too.
-
RE: XCP-ng and NVIDIA GPUs
@jcpt928 re: gaming/media, just curious what the guest OS will be?
-
RE: XCP-ng and NVIDIA GPUs
@jtbw911 @abdullah This is a known issue with NVIDIA GPU cards. NVIDIA treats this as a licensing matter because (as near as anyone can guess) they do not want their consumer-grade graphics cards used in a VM. So their drivers detect whether they are running in a virtual environment and refuse to activate if it is a VM and not a physical machine.
See: https://gridforums.nvidia.com/default/topic/9108/
You can certainly activate the pass-through function in XCP, and the hardware will probably pass through, but if you are using NVIDIA's software, it will probably refuse no matter what you do. I can only speculate on their motives, but I suspect it comes down to two things: they don't want a terminal server to use a cheaper (and less profitable) card, and they don't want a low-end GPU to be used for the lucrative "deep learning" market, where high-end cards are sold with a hefty profit margin. This is not something I can confirm; it's just my guess about why they did this.
There is a slim chance that open-source drivers will not observe this license restriction and will allow the card to be used, but that directly implies you will be running some kind of Linux or FreeBSD installation, not Windows. And since such an installation would not be a "pure" terminal service, nor support CUDA, it would probably adhere to their license terms "in spirit" but not "to the letter".
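For what it's worth, the "am I in a VM?" signals are not exotic; a guest advertises them in several places, and the driver only has to look. NVIDIA's exact checks aren't public, so the sketch below is just an illustration of the kind of breadcrumbs visible from inside a Linux guest:

```python
#!/usr/bin/env python3
"""Sketch: two well-known "I am a VM" breadcrumbs, read from a Linux guest.

NVIDIA's Windows driver obviously does its own (undocumented) checks; this
only illustrates the kind of signals any guest exposes.
"""
from pathlib import Path


def cpuid_hypervisor_flag() -> bool:
    """True if the kernel reports the CPUID 'hypervisor' feature flag."""
    return " hypervisor" in Path("/proc/cpuinfo").read_text()


def dmi_system_vendor() -> str:
    """DMI vendor string, e.g. 'Xen', 'QEMU', or a real motherboard maker."""
    vendor = Path("/sys/class/dmi/id/sys_vendor")
    return vendor.read_text().strip() if vendor.exists() else "unknown"


if __name__ == "__main__":
    print("CPUID hypervisor flag set:", cpuid_hypervisor_flag())
    print("DMI system vendor:", dmi_system_vendor())
```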