Passed Through GPU Crashes Host During Driver Install
@olivierlambert Yeah I suppose that makes sense then, interesting.
Anything specific you'd recommend to troubleshoot? This host is already on the latest firmware and whatnot, so I don't think it's an update issue.
In the end it's not a huge deal anyway, it was just a fun project to try out.
@planedrop By host crash, do you mean a reboot, or something getting wedged and requiring manual intervention? Any logs in
Judging by the consumer motherboard, I presume you don't have a serial console. Anything show up on the screen at the point of crash?
@andyhhp It requires manual intervention, have to go and force kill and restart the host.
So far I've seen nothing on the display output, but then again I'm only using a single GPU in this system so in theory it wouldn't show anything there anyway since it's the one being passed through, right? And you are correct I don't have a serial console or IPMI to check output.
I do have an entry in /var/crash, but it's from last year so I don't think it's related.
@planedrop Ok, so it's a host lockup rather than a crash. That's a bit more irritating to debug.
First of all, can you update to the debug hypervisor. Adjust the /boot/xen.gz -> $foo symlink to use the version of Xen with the -d.gz suffix. This is the same hypervisor changeset but with assertions and extra verbosity enabled.
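A minimal sketch of the symlink step, for anyone who prefers to script it. It assumes the debug builds sit next to the normal ones in /boot and match the pattern `xen-*-d.gz`; the function names are made up for illustration and nothing here is an official XCP-ng tool. It defaults to a dry run so that nothing changes until the target has been checked.

```python
import os
from pathlib import Path


def debug_xen_candidates(boot_dir="/boot"):
    """List Xen images in boot_dir whose file name ends in -d.gz (assumed naming)."""
    return sorted(p.name for p in Path(boot_dir).glob("xen-*-d.gz"))


def repoint_symlink(boot_dir, target_name, link_name="xen.gz", dry_run=True):
    """Point boot_dir/link_name at target_name and return the previous target.

    With dry_run=True (the default) the link is only inspected, not changed.
    """
    link = Path(boot_dir) / link_name
    previous = os.readlink(link) if link.is_symlink() else None
    if not dry_run:
        # Build the new link under a temporary name, then rename it over the
        # old one so there is never a moment without a xen.gz link.
        tmp = link.with_name(link_name + ".tmp")
        if tmp.is_symlink() or tmp.exists():
            tmp.unlink()
        tmp.symlink_to(target_name)
        os.replace(tmp, link)
    return previous
```

Calling `debug_xen_candidates()` first and then `repoint_symlink("/boot", name, dry_run=False)` as root would do the switch; keep the returned previous target so the link can be pointed back afterwards.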
Also, can you append to the vga= option on the command line. This should cause Xen to keep on writing out onto the screen even after dom0 has started up. Depending on the system, this might be a bit glacial, but dom0 will come up eventually.
Then reproduce the hang. Hopefully there'll be some output from Xen before the system locks up. You might also want to consider adding noreboot to Xen's command line too, especially if there's a backtrace and you want to take a photo of it to attach here.
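As a rough illustration of the two command-line changes, here is a small sketch that edits a Xen command-line string. It assumes the value being appended to vga= is the `,keep` suffix, which the Xen command-line documentation describes as keeping the VGA console in use after dom0 starts; the post does not spell that out, so treat it as an assumption. The function name and the sample command line are hypothetical.

```python
def add_debug_options(cmdline):
    """Return cmdline with ',keep' on an existing vga= option and 'noreboot' added.

    Assumption: ',keep' is the suffix meant for vga=. If there is no vga=
    option, none is invented; only 'noreboot' is appended.
    """
    out = []
    for tok in cmdline.split():
        if tok.startswith("vga="):
            values = tok[len("vga="):].split(",")
            if "keep" not in values:
                tok += ",keep"
        out.append(tok)
    if "noreboot" not in out:
        out.append("noreboot")
    return " ".join(out)


# Hypothetical example command line, not taken from the host in this thread.
print(add_debug_options("dom0_mem=4096M vga=mode-0x0311 console=vga"))
```

This only builds the string. On XCP-ng the Xen command line is usually changed with the xen-cmdline helper under /opt/xensource/libexec or by editing the bootloader config, followed by a reboot.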
@andyhhp Just wanted to respond real quick and say that I'll for sure go through all this, just might not be until the weekend, been a crazy week so far.
I did also want to note that this other host crashed for another unrelated reason (and produced a crash log) just yesterday. Had a Panic on CPU 0 code and a reboot.
I don't think it's likely, but maybe I somehow have 2 sets of defective hardware. I know for sure the host I'm testing on now was 100% stable before it was put in this new case and had XCP-ng installed on it (it was originally a desktop of mine); that doesn't mean it's not having issues now, though.
Am I better off testing GPU passthrough on a system with more than 1 GPU though? I may have an additional one I can slot into this host.
Had a Panic on CPU 0 code and a reboot.
Ok - let's do things one at a time. Can you start a new thread and provide the logs? (Ignore the vcpu/domain/stack hexdump log files; xca.log/xen.log/dom0.log are the interesting ones.)
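To pick out just those files from a crash directory, something like the following sketch could help. The three file names are taken from the post above; the assumption is that they sit somewhere under /var/crash, possibly in a dated subdirectory, and the function name is made up for illustration.

```python
from pathlib import Path

# File names as given in the post; everything else in the crash dir is skipped.
INTERESTING = {"xca.log", "xen.log", "dom0.log"}


def interesting_logs(crash_dir="/var/crash"):
    """Return paths under crash_dir whose file name is one of the interesting logs."""
    return sorted(p for p in Path(crash_dir).rglob("*.log") if p.name in INTERESTING)
```

Running `interesting_logs()` on the host lists the files worth attaching to the new thread.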
@andyhhp Will do, I'll link it here once I post it, probably can get that done today once I'm done with work lol.
Thanks for the willingness to help btw!
@andyhhp well I took way longer than I said I would, but I promise I still wanna work on this lol.
Here is the link to the thread that shows my crash reports. As a reminder, this crash happened on this host randomly and didn't seem directly related to PCI passthrough, or at least not the driver install part which I was having issues with.