Hailo-8L AI accelerator PCI passthrough causes xcp-ng hypervisor infinite boot loop
-
Hello,
this is my first post on this forum, so I want to thank you for your work on xcp-ng.
Failed PCI passthrough attempt:
In my case, I have a problem passing through a PCI device. When I follow the guide at https://docs.xcp-ng.org/compute/, right after hiding the PCI device and rebooting the server, the hypervisor can't boot and gets stuck in an infinite boot loop. I had to boot it into safe mode and remove the PCI hide option. Then everything went back to normal.
Successful PCI passthrough:
There is another way to pass through a PCI device without rebooting the hypervisor. This method is described on the Xen wiki: https://wiki.xenproject.org/wiki/Xen_PCI_Passthrough. It is called "Dynamic assignment with xl".
When I followed the Xen documentation, I was able to pass my device through to the VM, and I can confirm that everything works correctly. I successfully connected the AI coprocessor to a Frigate VM.
It would be great to fix PCI passthrough when the device is hidden from Dom0. Then I would be able to configure my VM to autostart after a server reset.
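For reference, the dynamic assignment described above can be sketched roughly as follows. This is a minimal sketch, not the exact commands used in this thread: the PCI address is taken from the lspci output posted later, `my-vm` is a hypothetical placeholder for the guest's domain name or ID, and the `run` helper and `DRY_RUN` variable are illustrative additions that only print the commands by default.

```shell
#!/bin/sh
# Sketch of "dynamic assignment with xl" as described on the Xen wiki.
# Assumptions: BDF is the device address from this thread; VM_DOMAIN is a
# hypothetical placeholder; DRY_RUN=1 (default) only prints the commands.
BDF="${BDF:-0000:08:00.0}"
VM_DOMAIN="${VM_DOMAIN:-my-vm}"
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, execute it otherwise.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

passthrough_dynamic() {
    run xl pci-assignable-add "$BDF"        # make the device assignable (binds it to pciback)
    run xl pci-assignable-list              # check that the device is listed as assignable
    run xl pci-attach "$VM_DOMAIN" "$BDF"   # hot-plug the device into the running guest
}

passthrough_dynamic
```

Run it once with the default `DRY_RUN=1` to review the commands, then with `DRY_RUN=0` in Dom0 to execute them. An assignment made this way is not persistent, which is why it does not help with autostart after a server reset.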
My xcp-ng version is 8.3 with all patches applied as of the time of writing this post.
My server is an HP DL380 Gen9.
-
Hello and welcome here!
That's weird that just hiding the device from the Dom0 is causing an issue.
Do you have any logs during the crash we can check?
-
No, but I can recreate the issue and collect the logs. Where can I find them?
What I can tell you is that this issue was also present on xcp-ng 8.2. I thought that upgrading to 8.3 might fix it.
-
First, let's collect the exact commands you are using to hide it from the Dom0, in case there's a typo
-
It wasn't my first time doing this. Previously I successfully passed through a Fibre Channel HBA to a VM.
But I understand your point. This is the output from the history command. I copied only the relevant part:
    18 lspci | grep hailo
    19 lspci
    20 /opt/xensource/libexec/xen-cmdline --set-dom0 "xen-pciback.hide=(0000:08:00.0)"
    21 /opt/xensource/libexec/xen-cmdline --get-dom0 xen-pciback.hide
    22 reboot
And this is the output from lspci -vn:
    08:00.0 0b40: 1e60:2864 (rev 01)
        Subsystem: 1e60:2864
        Physical Slot: 3
        Flags: bus master, fast devsel, latency 0, IRQ 16
        Memory at 39ff0604000 (64-bit, prefetchable) [size=16K]
        Memory at 39ff0608000 (64-bit, prefetchable) [size=4K]
        Memory at 39ff0600000 (64-bit, prefetchable) [size=16K]
        Capabilities: [80] Express Endpoint, MSI 00
        Capabilities: [e0] MSI: Enable+ Count=1/1 Maskable- 64bit+
        Capabilities: [f8] Power Management version 3
        Capabilities: [100] Vendor Specific Information: ID=1556 Rev=1 Len=008 <?>
        Capabilities: [108] Latency Tolerance Reporting
        Capabilities: [110] L1 PM Substates
        Capabilities: [128] Alternative Routing-ID Interpretation (ARI)
        Capabilities: [200] Advanced Error Reporting
        Capabilities: [300] #19
        Kernel driver in use: pciback
        Kernel modules: hailo_pci
As you can see, there is a hailo_pci kernel module (currently not in use). During my first attempts it was not present, so the boot loop also occurred without this driver. I only compiled it later, during my debugging process.
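To help rule out a typo in the hide option, the argument can be built and checked by a small script instead of being typed by hand. This is only a sketch: `build_hide_arg` is a hypothetical helper, not part of xcp-ng, and the script prints the `xen-cmdline` commands from the history above rather than running them.

```shell
#!/bin/sh
# Sketch: build the xen-pciback.hide argument from one or more PCI addresses.
# build_hide_arg is a hypothetical helper; only the xen-cmdline path and its
# --set-dom0 / --get-dom0 options come from the commands posted in this thread.
XEN_CMDLINE=/opt/xensource/libexec/xen-cmdline

build_hide_arg() {
    arg="xen-pciback.hide="
    for bdf in "$@"; do
        # Require the full domain:bus:device.function form, e.g. 0000:08:00.0
        if ! printf '%s\n' "$bdf" | grep -Eq '^[0-9a-fA-F]{4}:[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-7]$'; then
            echo "invalid PCI address: $bdf" >&2
            return 1
        fi
        arg="${arg}(${bdf})"
    done
    printf '%s\n' "$arg"
}

# Print the commands for review instead of executing them.
hide_arg=$(build_hide_arg 0000:08:00.0) || exit 1
echo "$XEN_CMDLINE --set-dom0 \"$hide_arg\""
echo "$XEN_CMDLINE --get-dom0 xen-pciback.hide"
```

The printed `--set-dom0` line should match entry 20 of the history above character for character. If it does, a typo in the hide option is unlikely to be the cause of the boot loop.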
-
Hmm, could the module be causing the crash if the device isn't accessible?
@TeddyAstie any opinion?