XCP-ng

    Hailo-8L AI accelerator PCI passthrough causes xcp-ng hypervisor infinite boot-loop

    8 Posts 4 Posters 1.1k Views 3 Watching
    • john

      Hello,

      This is my first post on this forum, so I want to thank you for your work on XCP-ng.

      Failed PCI passthrough attempt:
      I have a problem passing through a PCI device. When I follow the guide at https://docs.xcp-ng.org/compute/, right after hiding the PCI device and rebooting the server, the hypervisor can't boot and gets stuck in an infinite boot loop. I had to boot into safe mode and remove the PCI hide option. Then everything went back to normal.

      Successful PCI passthrough:
      There is another way to pass through a PCI device without rebooting the hypervisor. This method is described on the Xen wiki: https://wiki.xenproject.org/wiki/Xen_PCI_Passthrough. It is called "Dynamic assignment with xl".
      When I followed the Xen documentation, I was able to pass my device through to a VM, and I can confirm that everything works correctly. I successfully connected the AI coprocessor to a Frigate VM.
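For readers who have not used the dynamic method: the Xen wiki page linked above describes it in terms of `xl pci-assignable-add` and `xl pci-attach`. The sketch below is only an illustration of that sequence, not necessarily the exact commands used in this post. The helper names `run` and `xl_dynamic_assign` and the `DRY_RUN` switch are hypothetical, and the guest name in the usage note is a placeholder. By default it only prints the commands so they can be reviewed first.

```shell
# Sketch of "dynamic assignment with xl" (hypothetical helper names).
# DRY_RUN=1 (the default here) prints the commands instead of running them.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

xl_dynamic_assign() {
    # $1: PCI address (BDF) of the device, $2: guest name or domain ID
    # 1. Make the device assignable (binds it to pciback in Dom0).
    run xl pci-assignable-add "$1" &&
    # 2. List assignable devices to confirm it is there.
    run xl pci-assignable-list &&
    # 3. Hot-plug the device into the running guest.
    run xl pci-attach "$2" "$1"
}
```

For example, `xl_dynamic_assign 0000:08:00.0 frigate-vm` prints the three commands for the device address shown later in this thread; setting `DRY_RUN=0` would execute them. An assignment made this way is not stored in the boot configuration, which is why it does not help with autostart after a reboot.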

      It would be great to have PCI passthrough fixed for the case where the device is hidden from Dom0. I would then be able to configure my VM to autostart after a server reset.

      My XCP-ng version is 8.3 with all patches applied as of the time of writing.
      My server is an HP DL380 Gen9.

      • olivierlambert Vates 🪐 Co-Founder CEO

        Hello and welcome here!

        It's weird that just hiding the device from Dom0 causes an issue 🤔 Do you have any logs from the crash that we can check?

        • john @olivierlambert

          @olivierlambert

          No, but I can reproduce the issue and collect the logs. Where can I find them?

          What I can tell you is that this issue was also present on XCP-ng 8.2. I thought that upgrading to 8.3 might fix it.

          • olivierlambert Vates 🪐 Co-Founder CEO

            First, let's collect the exact commands you are using to hide it from the Dom0, in case there's a typo 🙂
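One way to rule out a typo before rebooting is to check the shape of the hide value mechanically. The sketch below is an assumption-laden illustration: the function name `is_valid_hide_value` is hypothetical, and it only accepts the fully qualified `(DDDD:BB:DD.F)` form used in this thread, repeated one or more times. pciback itself may accept other forms that this check would reject.

```shell
# Return 0 if $1 looks like one or more "(DDDD:BB:DD.F)" groups, else non-zero.
# Hypothetical helper; checks only the syntax, not whether the device exists.
is_valid_hide_value() {
    printf '%s\n' "$1" |
        grep -Eq '^(\([0-9a-fA-F]{4}:[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-7]\))+$'
}
```

For example, `is_valid_hide_value "(0000:08:00.0)" && echo ok` prints `ok`, while a value with a missing parenthesis or a non-hex character fails. Pass it only the part after `xen-pciback.hide=`.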

            • john @olivierlambert

              @olivierlambert

              It wasn't my first time doing this. Previously I successfully passed through a Fibre Channel HBA to a VM.
              But I understand your point. This is the output of the history command. I copied only the relevant part:

              lspci | grep hailo
              lspci
              /opt/xensource/libexec/xen-cmdline --set-dom0 "xen-pciback.hide=(0000:08:00.0)"
              /opt/xensource/libexec/xen-cmdline --get-dom0 xen-pciback.hide
              reboot
              

              And this is the output of lspci -vn:

              08:00.0 0b40: 1e60:2864 (rev 01)
              	Subsystem: 1e60:2864
              	Physical Slot: 3
              	Flags: bus master, fast devsel, latency 0, IRQ 16
              	Memory at 39ff0604000 (64-bit, prefetchable) [size=16K]
              	Memory at 39ff0608000 (64-bit, prefetchable) [size=4K]
              	Memory at 39ff0600000 (64-bit, prefetchable) [size=16K]
              	Capabilities: [80] Express Endpoint, MSI 00
              	Capabilities: [e0] MSI: Enable+ Count=1/1 Maskable- 64bit+
              	Capabilities: [f8] Power Management version 3
              	Capabilities: [100] Vendor Specific Information: ID=1556 Rev=1 Len=008 <?>
              	Capabilities: [108] Latency Tolerance Reporting
              	Capabilities: [110] L1 PM Substates
              	Capabilities: [128] Alternative Routing-ID Interpretation (ARI)
              	Capabilities: [200] Advanced Error Reporting
              	Capabilities: [300] #19
              	Kernel driver in use: pciback
              	Kernel modules: hailo_pci
              

              As you can see, there is a hailo_pci kernel module (currently not in use). During my first attempts it was not present, so the boot loop occurred even without this driver. I only compiled it later while debugging.
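The key line in the listing above is "Kernel driver in use: pciback", which shows that Dom0 has handed the device to pciback rather than to hailo_pci. A minimal sketch for pulling that field out of `lspci -vn` output for a single slot follows. The function name `driver_in_use` is hypothetical, and the parsing assumes the usual lspci layout (device header at column 0, detail lines indented).

```shell
# Read `lspci -vn` output on stdin and print the driver bound to slot $1.
# Exits non-zero if the slot is absent or has no driver bound.
driver_in_use() {
    awk -v slot="$1" '
        # Device header lines start at column 0, e.g. "08:00.0 0b40: 1e60:2864".
        /^[0-9a-fA-F]/ { current = $1 }
        current == slot && /Kernel driver in use:/ {
            sub(/^.*Kernel driver in use:[ \t]*/, "")
            print
            found = 1
        }
        END { exit (found ? 0 : 1) }
    '
}
```

Usage: `lspci -vn | driver_in_use 08:00.0`. With the output posted above, this would print `pciback`.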

              • olivierlambert Vates 🪐 Co-Founder CEO

                Hmm, could the module be causing the crash if the device isn't accessible? 🤔

                @TeddyAstie any opinion?

                • redakula @olivierlambert

                  @olivierlambert said in Hailo-8L AI accelerator PCI passthrough causes xcp-ng hypervisor infinite boot-loop:

                  Hmm, could the module be causing the crash if the device isn't accessible? 🤔

                  A quick Google search turned up this thread on the Proxmox forum. According to that user, the card causes a hotplug event when it initializes.
                  https://forum.proxmox.com/threads/hailo-8-ai-m-2-card-crashes-server-when-using-passthrough.166428/

                  It seems these AI cards are a bit of a pain. I remember the ongoing issues with Google Coral PCIe cards.

                  • TeddyAstie Vates 🪐 XCP-ng Team Xen Guru

                    I've seen cases where a hard reset is forced when some devices can't DMA. Maybe it's related.
                    If that's the case, something should show up in the IPMI, and the crash is usually instantaneous; otherwise, there is some delay (~5 seconds) between the Xen/Dom0 crash and the actual reboot.
