XCP-ng

    Passed Through GPU Crashes Host During Driver Install

    • olivierlambert Vates 🪐 Co-Founder CEO

      What's the hardware? Buggy IOMMU or old BIOS can trigger hardware problems.

      • planedrop Top contributor @olivierlambert

        @olivierlambert This system specifically is a Threadripper 1920X on an Asus Prime X399.

        However, I'll admit I got the motherboard used so maybe something is wrong with it. I'll have to do more validation on it to see.

        The GPUs are pretty old too (NVIDIA 900 series), so maybe one of them is triggering the issue.

        What's the best place to check logs for full system hangs like this?

        • olivierlambert Vates 🪐 Co-Founder CEO

          I would start by upgrading all the BIOS/firmware you can find, and by running memtest too.

          • planedrop Top contributor @olivierlambert

            @olivierlambert So wanted to update this here.

            I tried this on my other host that I know is perfectly functional without any issues (and it has been stress tested under load).

            The same issue occurred: the entire host crashed during the driver install on the VM.

            • planedrop Top contributor

              Also, not sure if it helps at all, but the GPU initially shows up in my VM as a secondary Microsoft Basic Display Adapter. Is that normal? When I did passthrough on Proxmox, it showed up as the right GPU with the right name from the start.

              Seems pretty odd that the entire host crashes during the driver install on the VM though, in theory those things should be separate enough to not cause issues.

              • olivierlambert Vates 🪐 Co-Founder CEO

                Well, not completely true. In the end, the whole goal of PCI passthrough is to access the hardware directly.

                So there are no "layers" in between. If there's a fault when calling the IOMMU or something like that, I'm less surprised that it could cause this.

                Obviously, it could be a Xen bug or a hardware bug, or both (i.e. a buggy IOMMU not handled correctly by Xen).

                • planedrop Top contributor @olivierlambert

                  @olivierlambert Yeah I suppose that makes sense then, interesting.

                  Anything specific you'd recommend for troubleshooting? This host is already on the latest firmware and whatnot, so I don't think a missing update is the cause.

                  In the end it's not a huge deal anyway, just was a fun project to try out.

                  • olivierlambert Vates 🪐 Co-Founder CEO

                    I think @andyhhp might be interested to discover how to crash a host from a VM via PCI passthrough 😄

                    Any hint on how to get relevant logs in that case @andyhhp ?

                    • planedrop Top contributor @olivierlambert

                      @olivierlambert Yes! @andyhhp, I would be happy to help with this and provide whatever data is needed. It would be really cool to get it working!

                      • andyhhp Xen Guru @planedrop

                        @planedrop By host crash, do you mean a reboot, or something getting wedged and requiring manual intervention? Any logs in /var/crash/ in dom0?

                        Judging by the consumer motherboard, I presume you don't have a serial console. Anything show up on the screen at the point of crash?

                        • planedrop Top contributor @andyhhp

                          @andyhhp It requires manual intervention: I have to force the host off and restart it.

                          So far I've seen nothing on the display output, but then again I'm only using a single GPU in this system so in theory it wouldn't show anything there anyway since it's the one being passed through, right? And you are correct I don't have a serial console or IPMI to check output.

                          I do have an entry in /var/crash but it's from last year so don't think it's related.
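For anyone following along, here is a minimal sketch of how one might check whether anything under /var/crash is recent enough to belong to the hang in question. The helper names and the idea of comparing modification times against a cutoff are illustrative assumptions, not something prescribed in this thread.

```python
import os


def crash_entries(root="/var/crash"):
    """Return (mtime, name) for each entry directly under root, newest first."""
    with os.scandir(root) as it:
        entries = [(entry.stat().st_mtime, entry.name) for entry in it]
    return sorted(entries, reverse=True)


def entries_since(entries, cutoff):
    """Keep only the names whose modification time is at or after cutoff (epoch seconds)."""
    return [name for mtime, name in entries if mtime >= cutoff]


# Example (hypothetical): entries touched in the last 24 hours
#   import time
#   entries_since(crash_entries(), time.time() - 24 * 3600)
```

An empty result for the time window around the hang is consistent with a lockup that never got as far as writing a crash dump.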

                          • andyhhp Xen Guru @planedrop

                            @planedrop Ok, so it's a host lockup rather than a crash. That's a bit more irritating to debug.

                            First of all, can you switch to the debug hypervisor? Adjust the /boot/xen.gz -> $foo symlink to use the version of Xen with the -d.gz suffix. This is the same hypervisor changeset but with assertions and extra verbosity enabled.
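The symlink change can be sketched roughly as follows. This assumes the debug build sits next to the regular one and differs only by having `-d.gz` in place of `.gz`; the function names are hypothetical, and the file names in the usage comment are placeholders, so check what is actually in /boot before repointing anything.

```python
import os


def debug_variant(target: str) -> str:
    """Map a Xen image name to its assumed debug sibling ('...-d.gz')."""
    if target.endswith("-d.gz"):
        return target  # already the debug build
    if not target.endswith(".gz"):
        raise ValueError(f"unexpected image name: {target!r}")
    return target[: -len(".gz")] + "-d.gz"


def repoint_symlink(link: str, new_target: str) -> None:
    """Repoint an existing symlink by creating a temporary link and renaming it over the old one."""
    directory = os.path.dirname(link) or "."
    if not os.path.exists(os.path.join(directory, new_target)):
        raise FileNotFoundError(new_target)
    tmp = link + ".tmp"
    if os.path.lexists(tmp):
        os.remove(tmp)
    os.symlink(new_target, tmp)
    os.replace(tmp, link)  # rename over the old link in one step


# Example (run as root in dom0, after checking the files exist):
#   current = os.readlink("/boot/xen.gz")
#   repoint_symlink("/boot/xen.gz", debug_variant(current))
```

Keeping a note of the original target makes it easy to switch back to the release hypervisor afterwards.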

                            Also, can you append ,keep to Xen's vga= option on the command line. This should cause Xen to keep on writing out onto the screen even after dom0 has started up. Depending on the system, this might be a bit glacial, but dom0 will come up eventually.

                            Then reproduce the hang. Hopefully there'll be some output from Xen before the system locks up. You might also want to consider adding noreboot to Xen's command line too, especially if there's a backtrace and you want to take a photo of it to attach here.
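As a rough illustration of the two command-line changes (appending `,keep` to `vga=` and adding `noreboot`), the sketch below edits a Xen command line held as a plain string. The function name and the sample command line are made up for the example; where the real command line lives depends on your bootloader configuration, and if there is no `vga=` option at all this sketch leaves it alone rather than inventing one.

```python
def add_debug_options(xen_cmdline: str) -> str:
    """Append ',keep' to an existing vga= option and add 'noreboot' if missing."""
    tokens = []
    for token in xen_cmdline.split():
        if token.startswith("vga="):
            values = token[len("vga="):].split(",")
            if "keep" not in values:
                token += ",keep"
        tokens.append(token)
    if "noreboot" not in tokens:
        tokens.append("noreboot")
    return " ".join(tokens)


# Hypothetical command line, for illustration only
print(add_debug_options("dom0_mem=4096M vga=mode-0x0311 console=vga"))
# prints: dom0_mem=4096M vga=mode-0x0311,keep console=vga noreboot
```

The function is idempotent, so running it on an already adjusted command line returns it unchanged.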

                            • planedrop Top contributor @andyhhp

                              @andyhhp Just wanted to respond real quick and say that I'll for sure go through all this, just might not be until the weekend, been a crazy week so far.

                              I did also want to note that this other host crashed for another, unrelated reason (and produced a crash log) just yesterday. Had a Panic on CPU 0 code and a reboot.

                              I don't think it's likely, but maybe I somehow have 2 sets of defective hardware. I know for sure that the host I'm testing on now was 100% stable before it was put in this new case and had XCP-ng installed on it (it was originally a desktop of mine). That doesn't mean it's not having issues now, though.

                              Am I better off testing GPU passthrough on a system with more than 1 GPU though? I may have an additional one I can slot into this host.

                              • andyhhp Xen Guru @planedrop

                                @planedrop said in Passed Through GPU Crashes Host During Driver Install:

                                Had a Panic on CPU 0 code and a reboot.

                                Ok - let's do things one at a time. Can you start a new thread and provide the logs? (Ignore the vcpu/domain/stack hexdump log files; xca.log, xen.log and dom0.log are the interesting ones.)
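To pick just those three files out of a crash directory, something like the following would do. It is a small sketch: the function name is hypothetical and the directory path in the usage comment is a placeholder, since crash directory names vary.

```python
from pathlib import Path

# The three log files named above as the interesting ones
INTERESTING = ("xca.log", "xen.log", "dom0.log")


def interesting_logs(crash_dir):
    """Return the paths of the interesting logs that actually exist in crash_dir."""
    directory = Path(crash_dir)
    return [directory / name for name in INTERESTING if (directory / name).is_file()]


# Example (placeholder path):
#   for path in interesting_logs("/var/crash/<crash-directory>"):
#       print(path)
```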

                                • planedrop Top contributor @andyhhp

                                  @andyhhp Will do, I'll link it here once I post it, probably can get that done today once I'm done with work lol.

                                  Thanks for the willingness to help btw!

                                  • planedrop Top contributor @andyhhp

                                    @andyhhp well I took way longer than I said I would, but I promise I still wanna work on this lol.

                                    Here is the link to the thread that shows my crash reports. As a reminder, this crash happened on this host randomly and didn't seem directly related to PCI passthrough, or at least not the driver install part which I was having issues with.

                                    https://xcp-ng.org/forum/topic/5900/host-crash-once-in-a-long-while
