XCP-ng

    Passed Through GPU Crashes Host During Driver Install

    • planedrop Top contributor

      Wanted to see if anyone else has seen this. I'm really just testing this for fun, so it's not a big deal whether it works or not, but I went through all the instructions to pass a GPU through on XCP-ng and assign it to a Windows VM.

      It showed up as it should. I downloaded the drivers, which detected the right GPU, and then started the driver install. During the driver install on the VM (and this was repeatable 3 times) the entire host would crash, with no response to pings or anything.

      Any idea what would cause this? I dug through the logs some but am not seeing anything that would indicate it.

      And I DO have IOMMU enabled in the BIOS (was getting the typical errors before enabling that).

      • olivierlambert Vates 🪐 Co-Founder CEO

        What's the hardware? Buggy IOMMU or old BIOS can trigger hardware problems.

        • planedrop Top contributor @olivierlambert

          @olivierlambert This system in specific is a Threadripper 1920X on an Asus Prime X399.

          However, I'll admit I got the motherboard used so maybe something is wrong with it. I'll have to do more validation on it to see.

          The GPUs are pretty old too (900 series NVIDIA), so maybe something with one of them is triggering an issue.

          What's the best place to check logs for full system hangs like this?

          • olivierlambert Vates 🪐 Co-Founder CEO

            I would start by upgrading all the BIOS/firmware you can find, and by running memtest too.

            • planedrop Top contributor @olivierlambert

              @olivierlambert So wanted to update this here.

              I tried this on my other host that I know is perfectly functional without any issues (and it has been stress tested under load).

              The same issue occurred: the entire host crashed during the driver install on the VM.

              • planedrop Top contributor

                Also, not sure if it helps at all, but the GPU initially shows up in my VM as a secondary Microsoft Basic Display Adapter. Is that normal? When I did passthrough on Proxmox, it showed up as the right GPU with the right name from the start.

                Seems pretty odd that the entire host crashes during the driver install on the VM though, in theory those things should be separate enough to not cause issues.

                • olivierlambert Vates 🪐 Co-Founder CEO

                  Well, not completely true. In the end, the whole goal of PCI passthrough is to access the hardware directly.

                  So there's no "layers" in between. If there's a fault when calling IOMMU or something like that, I'm less surprised that it could cause this.

                  Obviously, it could be a Xen bug or a hardware bug, or both (i.e. a buggy IOMMU not handled correctly by Xen).

                  • planedrop Top contributor @olivierlambert

                    @olivierlambert Yeah I suppose that makes sense then, interesting.

                    Anything in specific you'd recommend to troubleshoot? This host is already on the latest firmware and whatnot so I don't think it's an update.

                    In the end it's not a huge deal anyway, just was a fun project to try out.

                    • olivierlambert Vates 🪐 Co-Founder CEO

                      I think @andyhhp might be interested to discover how to crash a host from a VM via PCI passthrough 😄

                      Any hint on how to get relevant logs in that case @andyhhp ?

                      • planedrop Top contributor @olivierlambert

                        @olivierlambert Yes. @andyhhp, I would be happy to help with this and provide whatever data is needed. It would be really cool to get it working!

                        • andyhhp Xen Guru @planedrop

                          @planedrop By host crash, do you mean a reboot, or something getting wedged and requiring manual intervention? Any logs in /var/crash/ in dom0?

                          Judging by the consumer motherboard, I presume you don't have a serial console. Anything show up on the screen at the point of crash?

                          • planedrop Top contributor @andyhhp

                            @andyhhp It requires manual intervention, have to go and force kill and restart the host.

                            So far I've seen nothing on the display output, but then again I'm only using a single GPU in this system so in theory it wouldn't show anything there anyway since it's the one being passed through, right? And you are correct I don't have a serial console or IPMI to check output.

                            I do have an entry in /var/crash but it's from last year so don't think it's related.

                            • andyhhp Xen Guru @planedrop

                              @planedrop Ok, so it's a host lockup rather than a crash. That's a bit more irritating to debug.

                              First of all, can you update to the debug hypervisor? Adjust the /boot/xen.gz -> $foo symlink to use the version of Xen with the -d.gz suffix. This is the same hypervisor changeset but with assertions and extra verbosity enabled.
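A rough sketch of the symlink change described above, in Python. The naming rule (swap a trailing `.gz` for `-d.gz` on whatever the link currently points at) is an assumption based on the description in this post, and `point_symlink_at_debug` is a hypothetical helper, not an XCP-ng tool. It defaults to a dry run so it only reports what it would do.

```python
import os
from pathlib import Path


def debug_variant(target: str) -> str:
    """Map an image name such as 'xen-<version>.gz' to 'xen-<version>-d.gz'.

    Assumption: the debug build sits next to the normal one and differs
    only by the '-d.gz' suffix.
    """
    if target.endswith("-d.gz"):
        return target  # already the debug build
    if not target.endswith(".gz"):
        raise ValueError(f"unexpected image name: {target}")
    return target[: -len(".gz")] + "-d.gz"


def point_symlink_at_debug(link: Path, dry_run: bool = True) -> str:
    """Retarget `link` (e.g. /boot/xen.gz) at the debug image.

    Returns the new target name. Nothing is changed when dry_run is True.
    """
    current = os.readlink(link)
    wanted = debug_variant(current)
    if not (link.parent / wanted).exists():
        raise FileNotFoundError(f"no debug image found: {wanted}")
    if not dry_run and wanted != current:
        tmp = link.with_name(link.name + ".tmp")
        if os.path.lexists(tmp):
            os.remove(tmp)
        os.symlink(wanted, tmp)
        os.replace(tmp, link)  # swap the link in one rename
    return wanted


if __name__ == "__main__":
    # Dry run against the path named in the post; needs to run in dom0.
    print(point_symlink_at_debug(Path("/boot/xen.gz"), dry_run=True))
```

Keeping a note of the original target makes it easy to switch back to the normal hypervisor once the debugging is done.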

                              Also, can you append ,keep to Xen's vga= option on the command line? This should cause Xen to keep writing to the screen even after dom0 has started up. Depending on the system, this might be a bit glacial, but dom0 will come up eventually.

                              Then reproduce the hang. Hopefully there'll be some output from Xen before the system locks up. You might also want to consider adding noreboot to Xen's command line too, especially if there's a backtrace and you want to take a photo of it to attach here.
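To make the two command-line changes above concrete, here is a small Python sketch that rewrites a Xen command-line string. The function name and the sample command line are made up for illustration; where the real command line is stored depends on the host's bootloader setup and is not covered here.

```python
def add_debug_options(xen_cmdline: str) -> str:
    """Append ',keep' to an existing vga= option and add 'noreboot'.

    If there is no vga= option, none is invented: the advice in the post
    is to extend the option that is already present.
    """
    tokens = []
    for tok in xen_cmdline.split():
        if tok.startswith("vga="):
            values = tok[len("vga="):].split(",")
            if "keep" not in values:
                tok += ",keep"
        tokens.append(tok)
    if "noreboot" not in tokens:
        tokens.append("noreboot")
    return " ".join(tokens)


if __name__ == "__main__":
    # Hypothetical command line, for illustration only.
    sample = "dom0_mem=4096M vga=mode-0x0311 console=vga"
    print(add_debug_options(sample))
    # prints: dom0_mem=4096M vga=mode-0x0311,keep console=vga noreboot
```

The function is idempotent, so running it on an already adjusted command line leaves it unchanged.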

                              • planedrop Top contributor @andyhhp

                                @andyhhp Just wanted to respond real quick and say that I'll for sure go through all this, just might not be until the weekend, been a crazy week so far.

                                I did also want to note that this other host crashed for another unrelated reason (and produced a crash log) just yesterday. Had a Panic on CPU 0 code and a reboot.

                                I don't think it's likely, but maybe I somehow have 2 sets of defective hardware. I know for sure the host I'm testing on now was 100% stable before it was put in this new case and had XCP-ng installed on it (it was originally a desktop of mine), but that doesn't mean it's not having issues now.

                                Am I better off testing GPU passthrough on a system with more than 1 GPU though? I may have an additional one I can slot into this host.

                                • andyhhp Xen Guru @planedrop

                                  @planedrop said in Passed Through GPU Crashes Host During Driver Install:

                                  Had a Panic on CPU 0 code and a reboot.

                                  Ok - let's do things one at a time. Can you start a new thread and provide the logs? (Ignore the vcpu/domain/stack hexdump log files; xca.log/xen.log/dom0.log are the interesting ones.)
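For anyone collecting the same files, a minimal Python sketch that finds the most recent entry under /var/crash and picks out the logs named above. The file names are copied from this post as written and may differ on a given host, and both helper functions are hypothetical.

```python
from pathlib import Path

# File names as given in the post; treat them as an assumption.
INTERESTING = ("xca.log", "xen.log", "dom0.log")


def newest_crash_dir(root):
    """Return the most recently modified subdirectory of `root`, or None."""
    dirs = [p for p in Path(root).iterdir() if p.is_dir()]
    return max(dirs, key=lambda p: p.stat().st_mtime, default=None)


def interesting_logs(crash_dir):
    """Return the interesting log files that exist in `crash_dir`."""
    crash_dir = Path(crash_dir)
    return [crash_dir / name for name in INTERESTING
            if (crash_dir / name).is_file()]


if __name__ == "__main__":
    root = Path("/var/crash")  # location mentioned earlier in the thread
    latest = newest_crash_dir(root) if root.is_dir() else None
    if latest is None:
        print("no crash directories found")
    else:
        for log in interesting_logs(latest):
            print(log)
```

Checking the directory's modification time also helps tell a fresh crash from an old entry left over from a previous incident.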

                                  • planedrop Top contributor @andyhhp

                                    @andyhhp Will do, I'll link it here once I post it, probably can get that done today once I'm done with work lol.

                                    Thanks for the willingness to help btw!

                                    • planedrop Top contributor @andyhhp

                                      @andyhhp well I took way longer than I said I would, but I promise I still wanna work on this lol.

                                      Here is the link to the thread that shows my crash reports. As a reminder, this crash happened on this host randomly and didn't seem directly related to PCI passthrough, or at least not the driver install part which I was having issues with.

                                      https://xcp-ng.org/forum/topic/5900/host-crash-once-in-a-long-while
