XCP-ng

    Passed Through GPU Crashes Host During Driver Install

    • olivierlambert (Vates 🪐 Co-Founder CEO)

      Well, not completely true. In the end, the whole goal of PCI passthrough is to access the hardware directly.

      So there are no "layers" in between. If there's a fault when calling the IOMMU or something like that, I'm less surprised that it could cause this.

      Obviously, it could be a Xen bug, a hardware bug, or both (i.e., a buggy IOMMU not handled correctly by Xen).

      • planedrop (Top contributor) @olivierlambert

        @olivierlambert Yeah I suppose that makes sense then, interesting.

        Is there anything specific you'd recommend for troubleshooting? This host is already on the latest firmware and whatnot, so I don't think a missing update is the cause.

        In the end it's not a huge deal anyway; it was just a fun project to try out.

        • olivierlambert (Vates 🪐 Co-Founder CEO)

          I think @andyhhp might be interested to discover how to crash a host from a VM via PCI passthrough 😄

          Any hint on how to get relevant logs in that case @andyhhp ?

          • planedrop (Top contributor) @olivierlambert

            @olivierlambert Yes. @andyhhp, I'd be happy to help with this and provide whatever data is needed. It would be really cool to get it working!

            • andyhhp (Xen Guru) @planedrop

              @planedrop By host crash, do you mean a reboot, or something getting wedged and requiring manual intervention? Any logs in /var/crash/ in dom0?

              Judging by the consumer motherboard, I presume you don't have a serial console. Anything show up on the screen at the point of crash?
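To make the /var/crash question concrete, here is a minimal Python sketch (not part of the original post) that lists what is under the crash directory and how old each entry is, so a stale entry can be told apart from one left by the latest incident. The function name and the 7-day cutoff are assumptions chosen for illustration.

```python
import time
from pathlib import Path


def list_crash_entries(crash_dir="/var/crash", max_age_days=7.0, now=None):
    """Return (name, age_in_days, is_recent) for each entry, newest first.

    The 7-day default for "recent" is an arbitrary illustrative cutoff.
    """
    now = time.time() if now is None else now
    root = Path(crash_dir)
    if not root.is_dir():
        return []
    entries = []
    for entry in root.iterdir():
        # Age is derived from the entry's modification time.
        age_days = (now - entry.stat().st_mtime) / 86400.0
        entries.append((entry.name, age_days, age_days <= max_age_days))
    return sorted(entries, key=lambda item: item[1])


if __name__ == "__main__":
    for name, age_days, is_recent in list_crash_entries():
        flag = "recent" if is_recent else "old"
        print(f"{name}: {age_days:.1f} days old ({flag})")
```

Run in dom0, an empty list or only "old" entries would suggest the hang left no crash record behind.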

              • planedrop (Top contributor) @andyhhp

                @andyhhp It requires manual intervention: I have to force the host off and restart it.

                So far I've seen nothing on the display output. Then again, I'm only using a single GPU in this system, so in theory it wouldn't show anything there anyway since that's the GPU being passed through, right? And you are correct, I don't have a serial console or IPMI to check output.

                I do have an entry in /var/crash, but it's from last year, so I don't think it's related.

                • andyhhp (Xen Guru) @planedrop

                  @planedrop Ok, so it's a host lockup rather than a crash. That's a bit more irritating to debug.

                  First of all, can you switch to the debug hypervisor? Adjust the /boot/xen.gz -> $foo symlink to point at the version of Xen with the -d.gz suffix. This is the same hypervisor changeset, but with assertions and extra verbosity enabled.
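Before touching the symlink, it can help to see where it points now and which debug builds are present. The following is a read-only Python sketch under a few assumptions: the function name is hypothetical, and it assumes the hypervisor images in /boot have names starting with "xen". It only inspects and does not change anything.

```python
import os
from pathlib import Path


def describe_xen_symlink(boot_dir="/boot", link_name="xen.gz"):
    """Report the current symlink target and any '-d.gz' debug builds found.

    Read-only: nothing on disk is modified.
    """
    boot = Path(boot_dir)
    link = boot / link_name
    current = os.readlink(link) if link.is_symlink() else None
    # Assumption: hypervisor images are regular files named "xen...".
    debug_builds = sorted(
        p.name
        for p in boot.iterdir()
        if p.name.startswith("xen")
        and p.name.endswith("-d.gz")
        and not p.is_symlink()
    )
    return {
        "current_target": current,
        "debug_candidates": debug_builds,
        "already_debug": bool(current) and current.endswith("-d.gz"),
    }


if __name__ == "__main__":
    print(describe_xen_symlink())
```

If `already_debug` is false and exactly one candidate is listed, that candidate is presumably the file the symlink should be repointed at.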

                  Also, can you append ,keep to Xen's vga= option on the command line? This should cause Xen to keep writing to the screen even after dom0 has started up. Depending on the system, this might be a bit glacial, but dom0 will come up eventually.

                  Then reproduce the hang. Hopefully there'll be some output from Xen before the system locks up. You might also want to consider adding noreboot to Xen's command line too, especially if there's a backtrace and you want to take a photo of it to attach here.
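The two command-line changes described above amount to a small string edit. As a sketch only, the Python function below applies them to a Xen command line passed in as text; the function name and the sample command line are hypothetical, and where that line is stored on a given host is not covered here.

```python
def add_debug_options(xen_cmdline):
    """Append ',keep' to an existing vga= option and add 'noreboot'.

    Options already present are left alone. If there is no vga= option,
    none is invented; only 'noreboot' is added.
    """
    updated = []
    for token in xen_cmdline.split():
        if token.startswith("vga="):
            modes = token[len("vga="):].split(",")
            if "keep" not in modes:
                token += ",keep"
        updated.append(token)
    if "noreboot" not in updated:
        updated.append("noreboot")
    return " ".join(updated)


if __name__ == "__main__":
    # Hypothetical example command line, for illustration only.
    print(add_debug_options("dom0_mem=4096M vga=mode-0x0311 console=vga"))
```

The function is idempotent, so running it on an already edited line returns the line unchanged.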

                  • planedrop (Top contributor) @andyhhp

                    @andyhhp Just wanted to respond real quick and say that I'll for sure go through all this, just might not be until the weekend, been a crazy week so far.

                    I did also want to note that this other host crashed for another, unrelated reason (and produced a crash log) just yesterday. Had a Panic on CPU 0 code and a reboot.

                    I don't think it's likely, but maybe I somehow have 2 sets of defective hardware. I know for sure the host I'm testing on now was 100% stable before it was put in this new case and had XCP-ng installed on it (it was originally a desktop of mine), but that doesn't mean it's not having issues now.

                    Am I better off testing GPU passthrough on a system with more than 1 GPU though? I may have an additional one I can slot into this host.

                    • andyhhp (Xen Guru) @planedrop

                      @planedrop said in Passed Through GPU Crashes Host During Driver Install:

                      Had a Panic on CPU 0 code and a reboot.

                      Ok - let's do things one at a time. Can you start a new thread and provide the logs? Ignore the vcpu/domain/stack hexdump log files; xca.log, xen.log and dom0.log are the interesting ones.
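Picking out just those three files can be sketched in a few lines of Python. The function name is hypothetical, and the sketch assumes the logs sit directly inside a single crash directory whose path is passed in.

```python
from pathlib import Path

# File names taken from the post above.
INTERESTING_LOGS = ("xca.log", "xen.log", "dom0.log")


def pick_interesting_logs(crash_entry_dir):
    """Return paths of the named logs that exist in the given directory."""
    root = Path(crash_entry_dir)
    return [root / name for name in INTERESTING_LOGS if (root / name).is_file()]
```

Anything else in the directory, such as the hexdump files, is simply skipped.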

                      • planedrop (Top contributor) @andyhhp

                        @andyhhp Will do, I'll link it here once I post it, probably can get that done today once I'm done with work lol.

                        Thanks for the willingness to help btw!

                        • planedrop (Top contributor) @andyhhp

                          @andyhhp Well, I took way longer than I said I would, but I promise I still wanna work on this lol.

                          Here is the link to the thread that shows my crash reports. As a reminder, this crash happened on this host randomly and didn't seem directly related to PCI passthrough, or at least not the driver install part which I was having issues with.

                          https://xcp-ng.org/forum/topic/5900/host-crash-once-in-a-long-while
