XCP-ng

    Hailo-8L AI accelerator PCI passthrough causes XCP-ng hypervisor infinite boot loop

    • john

      Hello,

      This is my first post on this forum, so I want to thank you for your work on XCP-ng.

      Failed PCI passthrough attempt:
      I have a problem passing through a PCI device. When I follow the guide at https://docs.xcp-ng.org/compute/ the hypervisor fails to boot right after I hide the PCI device and reboot the server, and it gets stuck in an infinite boot loop. I had to boot into safe mode and remove the PCI hide option. Then everything went back to normal.

      Successful PCI passthrough:
      There is another way to pass through a PCI device without rebooting the hypervisor. This method, called "Dynamic assignment with xl", is described on the Xen wiki: https://wiki.xenproject.org/wiki/Xen_PCI_Passthrough
      When I followed the Xen documentation, I was able to pass my device through to the VM, and I can confirm that everything works correctly. I successfully connected the AI coprocessor to a Frigate VM.
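
The dynamic assignment described on the Xen wiki can be sketched roughly as follows. This is a minimal sketch under assumptions, not the exact commands used for this post: the address 0000:08:00.0 matches the lspci output in this thread, the domain name `frigate` is a placeholder, and the `run` helper and `DRY_RUN` variable are hypothetical, added only so the commands can be previewed without touching the host.

```shell
#!/bin/sh
# Sketch of Xen "dynamic assignment with xl" (no Dom0 reboot needed).
# Assumptions: BDF and DOMAIN are placeholders; DRY_RUN and run() are
# hypothetical helpers so the commands can be previewed safely.
BDF="${BDF:-0000:08:00.0}"
DOMAIN="${DOMAIN:-frigate}"
DRY_RUN="${DRY_RUN:-1}"   # default: only print the commands

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

assign_device() {
    # Unbind the device from its Dom0 driver and bind it to pciback.
    run xl pci-assignable-add "$BDF"
    # List assignable devices to confirm the device is now available.
    run xl pci-assignable-list
    # Hot-plug the device into the running guest.
    run xl pci-attach "$DOMAIN" "$BDF"
}

assign_device
```

With the default `DRY_RUN=1` the script only prints the three `xl` commands. An assignment made this way is presumably not kept across a host reboot, which is why hiding the device from Dom0 matters for VM autostart.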

      It would be great to fix PCI passthrough when hiding the PCI device from Dom0. Then I would be able to configure my VM to autostart after a server reset.

      My XCP-ng version is 8.3 with all patches applied as of the time of writing this post.
      My server is an HP DL380 Gen9.

      • olivierlambert Vates 🪐 Co-Founder CEO

        Hello and welcome here!

        It's weird that just hiding the device from the Dom0 is causing an issue 🤔 Do you have any logs from the crash that we can check?

        • john @olivierlambert

          @olivierlambert

          No, but I can recreate the issue and collect such logs. Where can I find these logs?

          What I can tell you is that this issue was also present on XCP-ng 8.2. I thought that upgrading to 8.3 might fix it.

          • olivierlambert Vates 🪐 Co-Founder CEO

            First, let's collect the exact commands you are using to hide it from the Dom0, in case there's a typo 🙂

            • john @olivierlambert

              @olivierlambert

              It wasn't my first time doing this. Previously I successfully passed through a Fibre Channel HBA to a VM.
              But I understand your point. This is the output from the history command. I copied only the relevant part:

              lspci | grep hailo
              lspci
              /opt/xensource/libexec/xen-cmdline --set-dom0 "xen-pciback.hide=(0000:08:00.0)"
              /opt/xensource/libexec/xen-cmdline --get-dom0 xen-pciback.hide
              reboot
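
For reference, the value passed to `--set-dom0` is a run of parenthesised full addresses (domain:bus:device.function), one group per hidden device. A small sketch that builds that string from one or more addresses can make a typo easier to spot. `build_hide_arg` is a hypothetical helper, not part of XCP-ng, and its address check is deliberately loose.

```shell
#!/bin/sh
# build_hide_arg: hypothetical helper that turns PCI addresses into the
# xen-pciback.hide=(...)(...) string used with xen-cmdline --set-dom0.
# Short addresses such as 08:00.0 get the default 0000: domain prefix.
# The check only looks at the shape of the address, not at hex validity.
build_hide_arg() {
    arg="xen-pciback.hide="
    for bdf in "$@"; do
        case "$bdf" in
            ????:??:??.?) ;;                 # already domain:bus:dev.fn
            ??:??.?)      bdf="0000:$bdf" ;; # add default PCI domain
            *) echo "invalid PCI address: $bdf" >&2; return 1 ;;
        esac
        arg="$arg($bdf)"
    done
    echo "$arg"
}

# Preview only: prints the string, does not change the boot configuration.
build_hide_arg 08:00.0
```

The printed string can then be compared with what `xen-cmdline --get-dom0 xen-pciback.hide` reports before rebooting.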
              

              and this is the output from lspci -vn:

              08:00.0 0b40: 1e60:2864 (rev 01)
              	Subsystem: 1e60:2864
              	Physical Slot: 3
              	Flags: bus master, fast devsel, latency 0, IRQ 16
              	Memory at 39ff0604000 (64-bit, prefetchable) [size=16K]
              	Memory at 39ff0608000 (64-bit, prefetchable) [size=4K]
              	Memory at 39ff0600000 (64-bit, prefetchable) [size=16K]
              	Capabilities: [80] Express Endpoint, MSI 00
              	Capabilities: [e0] MSI: Enable+ Count=1/1 Maskable- 64bit+
              	Capabilities: [f8] Power Management version 3
              	Capabilities: [100] Vendor Specific Information: ID=1556 Rev=1 Len=008 <?>
              	Capabilities: [108] Latency Tolerance Reporting
              	Capabilities: [110] L1 PM Substates
              	Capabilities: [128] Alternative Routing-ID Interpretation (ARI)
              	Capabilities: [200] Advanced Error Reporting
              	Capabilities: [300] #19
              	Kernel driver in use: pciback
              	Kernel modules: hailo_pci
              

              As you can see, there is a hailo_pci kernel module (currently not in use). But it was not present during my first attempts, so the boot loop occurred even without this driver. I only compiled it later, during my debugging.
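
In the lspci output above, the line that shows which driver currently owns the device is "Kernel driver in use". A minimal sketch for pulling out just that value, for example to compare the state before and after a change, could look like this. `driver_in_use` is a hypothetical helper name, and the sample text is trimmed from the output in this post.

```shell
#!/bin/sh
# driver_in_use: hypothetical helper that reads lspci -v / lspci -k style
# output on stdin and prints the bound kernel driver, if any.
driver_in_use() {
    sed -n 's/^[[:space:]]*Kernel driver in use:[[:space:]]*//p'
}

# Sample input trimmed from the lspci output in this post.
printf '08:00.0 0b40: 1e60:2864 (rev 01)\n\tKernel driver in use: pciback\n\tKernel modules: hailo_pci\n' \
    | driver_in_use
```

On a live host the same helper could be fed from `lspci -k -s 08:00.0`. A device that Dom0 has handed over for passthrough is expected to report pciback here rather than hailo_pci.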

              • olivierlambert Vates 🪐 Co-Founder CEO

                Hmm, could the module be causing the crash if the device isn't accessible? 🤔

                @TeddyAstie any opinion?
