@michael this may sound silly, but perhaps this isn’t a software issue? Maybe you have a faulty stick of RAM hiding in the machine? Bad RAM will make all kinds of strange and flaky things happen. This is just a guess, I’m just putting this idea out there because I’ve seen similar weird behavior in machines where RAM failure isn’t easy to spot (no front panel on the server with a fault light).
-
RE: Windows Server 2016 & 2019 freezing on multiple hosts
-
RE: XCP-ng and NVIDIA GPUs
@olivierlambert I did a bit of light digging.
General consensus is that Dell's servers are not ready for this kind of stuff, but then again I've seen crowds get things wrong before:
https://www.reddit.com/r/homelab/comments/6mafcg/can_i_install_a_gpu_in_a_dell_power_edge_r810/
This is the method described for KVM:
http://mathiashueber.com/fighting-error-43-how-to-use-nvidia-gpu-in-a-virtual-machine/
Additional KVM docs (plus a small description of the vendor ID problem):
https://wiki.archlinux.org/index.php/PCI_passthrough_via_OVMF#"Error_43:_Driver_failed_to_load"_on_Nvidia_GPUs_passed_to_Windows_VMs
An updated methodology for the Ryzen on-chip GPU:
http://mathiashueber.com/ryzen-based-virtual-machine-passthrough-setup-ubuntu-18-04/
This is the method described for VMware:
http://codefromabove.com/2019/02/the-hyperconverged-homelab-windows-vm-gaming/
Hyper-V documentation is a bit more sparse, but this hints that Microsoft may have simply worked around the issue (a la vendor license agreements), at least when using RemoteFX:
http://techgenix.com/enabling-physical-gpus-hyper/
(Optional) Get CUDA working for cheap-o cards:
https://medium.com/@samnco/using-the-nvidia-gt-1030-for-cuda-workloads-on-ubuntu-16-04-4eee72d56791
So it looks like the common factors are:
- The GPU device must be isolated on the host with the vfio kernel driver. To ensure this, the vfio driver must load first, prior to any vendor or open source driver.
- GPU must be connected to the guest VM via PCI pass-through. No surprise.
- The CPU must not be identified as virtual; it must present some other identity when probed. This appears to be the key to preventing the dreaded NVidia Error 43. It suggests the driver is just examining the CPU assigned to it, although some documentation mentions a "vendor" setting. The workaround is to make it a string the driver doesn't match against, and it just works; even a setting of "unknown" is shown to work. I don't know if there is a way to tell an XCP guest "please don't identify yourself as virtual".
- For cards that are CUDA capable but "unsupported" by NVidia, you install the software in a different sequence (CUDA first, then driver).
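For what it's worth, the KVM-side pieces in the list above can be sketched roughly as follows. This is untested on my part, and the PCI IDs are examples only (substitute your card's IDs from `lspci -nn`); it's just the pattern the linked Arch wiki and mathiashueber.com guides describe:

```shell
# 1. Bind the GPU to vfio-pci before any vendor driver can claim it.
#    Goes in /etc/modprobe.d/vfio.conf; the IDs shown are placeholders.
options vfio-pci ids=10de:1d01,10de:0fb8
softdep nouveau pre: vfio-pci
softdep nvidia pre: vfio-pci

# 2. Pass the card through while hiding the hypervisor from the guest:
#    kvm=off masks the CPUID hypervisor bit, and hv_vendor_id replaces
#    the vendor string with something the NVidia driver won't match.
qemu-system-x86_64 \
  -enable-kvm \
  -cpu host,kvm=off,hv_vendor_id=whatever \
  -device vfio-pci,host=01:00.0
```

Whether XCP-ng exposes an equivalent of the `kvm=off`/vendor-ID trick is exactly the open question above.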
Disclaimer: I'm just compiling a list to get an idea about what to do; I haven't done the actual install, nor do I have the hardware. Hopefully this helps.
-
RE: XCP-ng Center 8.0.0 released
I just tried the help feature from the program menu, wanting to see the port numbers it uses. It appears those menu items are not connected to anything?
-
RE: XCP-ng and NVIDIA GPUs
@jcpt928 said in XCP-ng and NVIDIA GPUs:
Are you sure that R815 doesn't have some external GPU power connectors hidden along the PCIe backplanes?
I'll try to slide it out of the rack and take a peek soon. I'm using generic rack trays to hold the unit, so I can't slide it out on arms and just pop the lid. When I last looked, I didn't see anything, nor is there any mention in the Dell docs. Here's a shot I took when I was checking for damage after shipping:
I believe the riser on the left of the shot is where the card goes (I could be wrong, it could be on the right); and I don't see any spare plugs.
-
RE: XCP-ng and NVIDIA GPUs
@jcpt928 I plead insanity: I have a Dell R815. Honestly, the "disable Intel hyperthreading" thing is what pushed me to this AMD unit. https://github.com/speed47/spectre-meltdown-checker claims that the XCP 8.0 beta is properly patched up, "green" all the way across.
It came with 24 cores, two drive trays, iDRAC Enterprise, and an H710. I bought 12 4GB ECC RDIMMs for $70 and shoved them in, and mirrored a boot drive with two 60GB SAS drives I had. It's been stable so far. I have an old HP 332B SAS (aka a re-badged LSI 1068E) attached to a creaky old Promise VTrak s610e with 16x 500GB drives; it was "retired" from my work about a year ago. I'm sorting out the hardware right now. Two of the drives in the array went south, one of which won't even register anymore. SMART says the remaining drives are as healthy as can be (for drives with 6+ years of spin time on them). Still deciding between Linux RAID+LVM and ZFS, but I might go ZFS with the new 8.0 release. I have yet to do a memory burn-in with MemTest.
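If ZFS wins out, my rough plan for the enclosure disks is something like the sketch below. Pool and device names are hypothetical placeholders, and I haven't run this yet:

```shell
# Build a simple mirrored pool from two enclosure disks and carve out
# a dataset for VM storage. Check `lsblk` first -- /dev/sdb and
# /dev/sdc here are placeholders, not my real device names.
zpool create tank mirror /dev/sdb /dev/sdc
zpool status tank      # verify both sides of the mirror are ONLINE
zfs create tank/vms    # dataset to hand to the hypervisor for VMs
```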
Suggestions? I'm all ears.
-
RE: XCP-ng and NVIDIA GPUs
@jcpt928 Drat. All the others in the sub-25-watt range have much older chipsets and the AMD units are even older still (HD 4000 series). I don’t have a 1030 but I’m thinking about it.
There is another thread about hardware hiding you might want to search for.
-
RE: XCP-ng and NVIDIA GPUs
@jcpt928 said in XCP-ng and NVIDIA GPUs:
essentially bars you from using anything modeled higher than a 1030;
That's good news to me in a roundabout way. I'm doing some reading/research on the GT 1030 DDR4, which has the right mix of low power and recent-enough CUDA cores to make it viable for my setup. Short story: the most power I can put into a video card is about 25 watts total; the 1030 DDR4 comes in at 20.
-
RE: XCP-ng 8.0.0 Beta now available!
@cg Re: old CPU support, the note was already taken; it just strikes me as ironic. I understand the need for less power-hungry processing, but in my case I'm drawing a hefty 260 watts at idle for the server alone. I'm guessing the old SAS/SATA enclosure I have rigged to it is another 300 or so at idle. That's a lot of draw for "newer", but the capacity is excellent, so I won't be needing more hardware. Plenty of room to spin up new VMs.
Re: Hyper-V, I already use the 2nd Gen at work via old 2012r2 installs. It’s OK and gets the job done. However I’ve been tasked with shuffling and consolidating some older installations to newer hosts/hardware, and the move process is a bit clunky for that version. Device pass through seems a bit limited.
Last time I saw VMware, it was a demo lab we did at work, and it too was "just OK"; that was with the vSphere(?) web interface, etc. However, last I heard they are slowly tightening the list of supported hardware drivers and, by extension, supported hardware. That was a few years back, so maybe they have added new drivers since.
XCP benefits from decades of Linux device driver development. It simply boots, and that removes a lot of "officially supported hardware" headache for me. And the price is right, too.
-
RE: XCP-ng and NVIDIA GPUs
@jcpt928 re: gaming/media, just curious what the guest OS will be?
-
RE: XCP-ng and NVIDIA GPUs
@jtbw911 @abdullah This is a known issue with NVidia GPU cards. NVidia considers this to be a licensing problem, because (near as anyone is able to guess) they do not want their consumer-grade graphics cards to be used in a VM. So, their drivers detect if you are in a virtual environment, and refuse to activate if it is a VM and not a physical machine.
See: https://gridforums.nvidia.com/default/topic/9108/
You can certainly activate the pass-through function in XCP, and the hardware will probably pass through, but if you are using NVidia's software, it will probably refuse to work no matter what you do. I can only speculate on their motives, but I suspect it comes down to two things: they don't want a terminal server to use a cheaper (and less profitable) card, and they don't want a low-end GPU card to be used for the lucrative "deep learning" market, where high-end cards sell with a hefty profit margin. This is not something I can confirm; it's just me guessing about why they did this.
There is a slim chance that open source drivers will not observe this license issue and allow the card to be used, but this directly implies you will be running some kind of Linux or FreeBSD installation, not Windows. And since that installation would not be a "pure" terminal service, or support CUDA, it would probably adhere to their license terms "in spirit" but not "to the letter".
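As an aside, the basic "am I in a VM?" check the driver relies on is trivially visible from inside any Linux guest, since the kernel reports the CPUID hypervisor bit in /proc/cpuinfo. A minimal sketch (the driver's real checks are undocumented and surely more involved):

```shell
# Report whether the kernel saw the CPUID "hypervisor" flag.
# A driver can make the same check directly via the CPUID instruction,
# which is why hiding that bit (e.g. KVM's kvm=off) matters.
if grep -qw hypervisor /proc/cpuinfo; then
  echo "virtual machine detected"
else
  echo "bare metal (or the hypervisor is hidden)"
fi
```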
-
RE: XCP-ng 8.0.0 Beta now available!
@olivierlambert Go figure - just two days after that posting, my creaky old HP DL365 Gen1 up and died. CPU 1 memory bank shows a solid set of lights for RAM sockets, i.e. the memory controller can't access RAM. Bummer.
My wife took pity on me - or got me a father's day gift, depending on your point of view - and sprung for a Dell R815, which is supported. Of course, I gave the 8.0 beta a spin.
Installation was slightly faster, but there have been a few minor quirks along the way:
- The USB key I stuck into the unit wasn't an option for the OS install, but did show up as an option when creating storage for VMs. No big deal, but still kinda strange.
- Peeking at a different console via Alt-F2 and Alt-F3 showed some patches? And the kernel/dracut root/boot installation section took a loooong time to complete; everything else flew by. If it's downloading those patches that would explain the delay - the unit has a slow wifi bridge connection.
Other than those, it seems fine. I'm still poking around in it, getting a feel for what's changed.
-
RE: New XCP-ng "theme" in shell
@olivierlambert I think this is probably the most sensible of the color schemes so far.
The dark background reduces glare, the text is high-contrast, and the red actually draws the eye as a highlight or accent color. Looks sharp!
-
RE: XCP-ng 8.0.0 Beta now available!
@DustinB said in XCP-ng 8.0.0 Beta now available!:
That's a 12 year old CPU so I don't see any issue with dropping it.
As I said, 10+ year old hardware; and I also said, I get that vendors want to draw lines in the sand so they don't end up supporting everything under the sun. It's good business sense to limit expenditures to equipment that is commonly used.
But my point (poorly articulated in my last posting) remains: there isn't a known or posted reason why the software forces me to drop the CPU. Citrix just waved their hands and said "these don't work anymore". Well, I suspect it really does still work, and this is just the side effect of a vendor cost-cutting decision for support that has nothing to do with XCP-ng, but unfortunately impacts it anyway. So it's worth a try, and if it fails, so be it; at least there will be a known reason why, instead of the Citrix response of "nothing to see here, move along..."
XCP-ng is a killer deal, probably THE killer deal when viewed through the lens of a home lab.
That makes it hard to justify shelling out money for Windows Hyper-V or VMWare when there is a family to feed and rent to pay. Maybe that explains why I am so keen on seeing if Citrix really did make changes that prevent it from running. Fail or succeed, either way it'll be more information contributed back to the community here, and something will be learned. That's a positive outcome all the way around.
-
RE: XCP-ng 8.0.0 Beta now available!
It's a little strange to see older but capable CPUs dropped from the list. The old Opteron 2356 I have didn't make the cut, even though it works just fine in 7.6. I'm still going to try out 8.0 anyways.
I understand that vendors don't want to "extend support forever", but it's silly when you have 10+ year old hardware that runs fine and the only limitation really comes down to "hardware feature XYZ is a requirement". So far, I've not seen the actual minimum CPU requirement published, which makes me a bit suspicious of the claim that "it won't run".