Passed Through GPU Crashes Host During Driver Install
Wanted to see if anyone else has seen this. I'm really just testing this for fun, so it's not a big deal if it works or not, but I went through all the instructions to pass through a GPU on XCP-ng and assign it to a Windows VM.
It showed up as it should, and I downloaded the drivers, which detected the right GPU. Then, during the driver install on the VM (repeatable three times), the entire host would crash: no response to pings or anything.
Any idea what would cause this? I dug through the logs some but am not seeing anything that would indicate it.
And I DO have IOMMU enabled in the BIOS (I was getting the typical errors before enabling that).
What's the hardware? A buggy IOMMU or an old BIOS can trigger problems like this.
@olivierlambert This specific system is a Threadripper 1920X on an ASUS Prime X399.
However, I'll admit I got the motherboard used so maybe something is wrong with it. I'll have to do more validation on it to see.
The GPUs are pretty old too (900-series NVIDIA), so maybe one of them is triggering an issue.
What's the best place to check logs for full system hangs like this?
I would start by upgrading all the BIOS/firmware you can find, and running memtest too.
@olivierlambert So wanted to update this here.
I tried this on my other host, which I know is perfectly functional (it has been stress-tested under load).
The same issue occurred: the entire host crashed during the driver install on the VM.
Also, not sure if it helps at all, but the GPU initially shows up in my VM as a secondary Microsoft Basic Display Adapter; is that normal? When I did passthrough on Proxmox it showed up as the right GPU with the right name from the start.
Seems pretty odd that the entire host crashes during the driver install on the VM though, in theory those things should be separate enough to not cause issues.
Well, not completely true. In the end, the whole goal of PCI passthrough is to access the hardware directly.
So there are no "layers" in between. If there's a fault when calling the IOMMU or something like that, I'm less surprised that it could cause this.
Obviously, it could be a Xen bug or a hardware bug, or both (i.e. a buggy IOMMU not handled correctly by Xen).
@olivierlambert Yeah I suppose that makes sense then, interesting.
Anything specific you'd recommend for troubleshooting? This host is already on the latest firmware and whatnot, so I don't think it's an update issue.
In the end it's not a huge deal anyway, just was a fun project to try out.
@planedrop By host crash, do you mean a reboot, or something getting wedged and requiring manual intervention? Any logs in /var/crash?
Judging by the consumer motherboard, I presume you don't have a serial console. Anything show up on the screen at the point of crash?
@andyhhp It requires manual intervention; I have to go force power off and restart the host.
So far I've seen nothing on the display output, but then again I'm only using a single GPU in this system, so in theory it wouldn't show anything there anyway since it's the one being passed through, right? And you're correct, I don't have a serial console or IPMI to check output.
I do have an entry in /var/crash but it's from last year so don't think it's related.
@planedrop Ok, so it's a host lockup rather than a crash. That's a bit more irritating to debug.
First of all, can you switch to the debug hypervisor? Adjust the `/boot/xen.gz` symlink to point at the version of Xen with the `-d.gz` suffix. This is the same hypervisor changeset, but with assertions and extra verbosity enabled.
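That symlink swap can be sketched like this. It's demonstrated in a throwaway temp directory with stand-in files so it's safe to run anywhere; on a real host you'd operate on `/boot` directly, and the version number (4.13.4 here is just an assumed example) will differ:

```shell
# Demo of repointing the xen.gz symlink at the debug ("-d") build.
# Done in a temp dir with stand-in files; adapt paths/versions on a real host.
cd "$(mktemp -d)"
touch xen-4.13.4.gz xen-4.13.4-d.gz   # stand-ins for the installed builds
ln -s xen-4.13.4.gz xen.gz            # the default non-debug symlink
ln -sfn xen-4.13.4-d.gz xen.gz        # repoint at the debug hypervisor
readlink xen.gz                       # prints: xen-4.13.4-d.gz
```

After the next boot the host runs the debug build; switching back is the same `ln -sfn` with the non-debug name.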
Also, can you append the `vga=keep` option to Xen's command line? This should cause Xen to keep writing out onto the screen even after dom0 has started up. Depending on the system, this might be a bit glacial, but dom0 will come up eventually.
Then reproduce the hang. Hopefully there'll be some output from Xen before the system locks up. You might also want to consider adding `noreboot` to Xen's command line too, especially if there's a backtrace and you want to take a photo of it to attach here.
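To make the edit concrete: on a BIOS-booted XCP-ng host the Xen entry lives in `/boot/grub/grub.cfg` (UEFI installs keep it elsewhere), and both options just get appended to the `multiboot2` line. A hedged sketch, run against a throwaway copy so it's safe to execute; the `dom0_mem`/`console` arguments here are made-up placeholders, keep whatever your entry already has:

```shell
# Append vga=keep and noreboot to the Xen boot line (demo on a temp copy).
cfg=$(mktemp)
echo 'multiboot2 /boot/xen.gz dom0_mem=7584M,max:7584M console=vga' > "$cfg"
sed -i '/multiboot2 .*xen\.gz/ s/$/ vga=keep noreboot/' "$cfg"
cat "$cfg"   # the line now ends with: ... console=vga vga=keep noreboot
```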
@andyhhp Just wanted to respond real quick and say that I'll for sure go through all of this; it just might not be until the weekend. It's been a crazy week so far.
I did also want to note that this other host crashed for another, unrelated reason (and produced a crash log) just yesterday. It had a "Panic on CPU 0" code and a reboot.
I don't think it's likely, but maybe I somehow have two sets of defective hardware. I know for sure the host I'm testing on now was 100% stable before it was put in this new case and had XCP-ng installed on it (it was originally a desktop of mine), though that doesn't mean it isn't having issues now.
Am I better off testing GPU passthrough on a system with more than 1 GPU though? I may have an additional one I can slot into this host.
> Had a Panic on CPU 0 code and a reboot.
Ok, let's do things one at a time. Can you start a new thread and provide the logs? (Ignore the vcpu/domain/stack hexdump log files; `xca.log`/`xen.log`/`dom0.log` are the interesting ones.)
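In case it saves some digging, here's a small helper for pulling just those files out of a crash bundle. `show_crash_logs` is my own made-up name, and the timestamped-directory layout under `/var/crash` is an assumption; adjust to whatever your host actually has:

```shell
# Print the tail of the three useful logs from a crash-dump directory,
# skipping the vcpu/domain/stack hexdump files.
show_crash_logs() {
    for f in xca.log xen.log dom0.log; do
        if [ -f "$1/$f" ]; then
            echo "== $f =="
            tail -n 20 "$1/$f"
        fi
    done
}

# On a real host, point it at the newest bundle, e.g.:
#   show_crash_logs "$(ls -1dt /var/crash/*/ | head -n 1)"
```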
@andyhhp Will do, I'll link it here once I post it, probably can get that done today once I'm done with work lol.
Thanks for the willingness to help btw!
@andyhhp well I took way longer than I said I would, but I promise I still wanna work on this lol.
Here is the link to the thread that shows my crash reports. As a reminder, this crash happened on this host randomly and didn't seem directly related to PCI passthrough, or at least not the driver install part which I was having issues with.