Google Coral TPU PCIe Passthrough Woes
-
@andSmv thanks. I completely understand this appears to be more of a hardware issue, but I'm happy to test anything. The host these cards are installed in isn't running critical infrastructure and can be rebooted relatively easily.
@exime your initial post was a top Google result and was still helpful! Thanks.
-
@jjgg Here's the link to xen.gz. You need to put it in your /boot folder (back up your existing file!) and make sure your grub.cfg is pointing to it.
But first: back up everything you want to keep! The patch is totally untested and didn't apply as-is (so I needed to adapt it). Normally that's not such a big deal and it should do no harm, but... you never know.
I'm also not sure that the issue will be fixed. Unfortunately we do not have a Coral TPU device at Vates, so we can't do a deeper analysis on this. The guy who wrote this patch was trying to fix a different device.
@exime - this is the 4.13.5 XCP-ng patched xen, so there's a chance it won't work for you (from what I saw you're running 4.13.4 xen)
Anyway, if we have good news, we'll find a way to fix it for everybody.
-
-
Ok so testing setup.
Downloaded xen.gz, renamed to xenept.gz and put into /boot:
Updated grub.cfg to point to that file:
Rebooted host.
Got to the grub screen, let it load as normal, as soon as it disappeared (so it had made the default selection) the server rebooted.
I ended up just placing xen.gz in /boot, removing the symbolic link to the original, and trying again: no difference.
Of note, messing with kernels / grub is not something I've got experience with. I may have made a mistake, or need things explained in a bit more detail if I've misunderstood some of the instructions above.
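For anyone repeating this test: a reboot straight after the GRUB screen can have several causes, and two cheap things to rule out are a corrupted download and a grub.cfg entry that doesn't point where you expect. Below is a minimal Python sketch for checking both. It rests on assumptions not stated in this thread: that grub.cfg loads the hypervisor with a multiboot or multiboot2 line, and the helper names are made up for illustration.

```python
import gzip
import os
import re
import zlib


def describe_boot_image(path):
    """Return (resolved_path, is_symlink, gzip_ok) for a hypervisor image.

    gzip_ok is True only if the whole file decompresses without error,
    which catches truncated or corrupted downloads.
    """
    is_link = os.path.islink(path)
    resolved = os.path.realpath(path)
    try:
        with gzip.open(resolved, "rb") as fh:
            # Read in 1 MiB chunks so the CRC at the end is verified too.
            while fh.read(1 << 20):
                pass
        gzip_ok = True
    except (OSError, EOFError, zlib.error):
        gzip_ok = False
    return resolved, is_link, gzip_ok


def xen_images_in_grub(cfg_text):
    """List the image path given on every multiboot/multiboot2 line.

    Assumption: the hypervisor is loaded via such a line in grub.cfg.
    """
    return re.findall(r"^\s*multiboot2?\s+(\S+)", cfg_text, flags=re.MULTILINE)
```

Calling describe_boot_image("/boot/xenept.gz") on the host, and xen_images_in_grub() on the contents of your grub.cfg, should show whether the file decompresses cleanly and which image each menu entry loads. It says nothing about whether the patched hypervisor itself is compatible with the host.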
-
Booted into fallback and put things back the way they were. Happy to keep testing if there's additional bits to test.
-
Ah, just found these things exist..... Then just found this issue exists too.
-
If only they could have made PCI hardware that follows the PCI specification.
-
Maybe this one will come to life again https://xcp-ng.org/forum/topic/7066/coral-tpu-pci-passthrough/14
Don't really want to buy one knowing it's not working!!
-
Definitely frustrating and no fault of xcp-ng - I have a lot of spare cpu cycles so it isn't majorly impacting me that I know of. I'm still available to test fixes though.
Looks like most of the Proxmox users have got this working in an LXC container by installing the drivers on the host itself and passing through the actual Apex devices. Not a route that's applicable to us but just a datapoint.
-
@jjgg it would be great if we could get this working. My CPU utilisation is fine too, but when I shut down my Zoneminder VM things go a lot quieter (fans) so I'm sure there would be a benefit CPU and power wise.
-
@jmccoy555 // @jjgg
Did either of you get your Coral USB TPU working and passed through to a VM?
-
@Nornode hey, nope I did not. I ended up moving my infrastructure to Proxmox.
Honestly this is no fault of XCP-ng and XCP-ng suits my hardware / setup a lot better, but it was either that or I had two servers that needed to be bare metal.
-
PCI passthrough might cause problems with this device, but USB could work.
-
@olivierlambert Seems like a reasonable place to ask as any - I am currently using a USB Coral over IP (Virtualhere) but would rather load it into my VM directly - what's the current status of snapshots/backups with a vUSB?
I've been reading that XO can now support disk exclusions with [NOBAK] but this probably doesn't apply to a Coral. Is an offline backup still the best available method?
-
For NOBAK and on 8.3 yes, but I'm not sure it will be related to USB. You should use offline, that should work. Alternatively, we have plans to detect the error, to unplug the USB device, do the snap and replug it just after.
-
@olivierlambert Thanks that's good to know. That functionality would be great down the line!
I do have a spare M.2 E-key on my XCP host running the VM Coral is needed for, but seems like I'd have trouble going by this thread. Might even have trouble with the USB Coral, it hasn't been much better so far in terms of whacky non-standard behavior...
-
So I eventually got round to trying the USB Coral via passthrough, which worked great, but the TPU itself exhibited some behavior that made it nonviable, which sucks. The USB was actually detected by XO as Google Inc. and Frigate actually loaded the TPU, but the inference speed was in excess of 180ms (it should be around 10; with USB over IP it's 40). So it worked, but didn't.
The normal procedure with a Coral is to run a make reset from their utilities, which switches the TPU back to runtime mode. This worked under my current (and now reverted) system of VirtualHere USB over IP, but it didn't work when passed through.
Output of make reset:
dfu-util: Warning: Invalid DFU suffix signature
dfu-util: A valid DFU suffix will be required in a future dfu-util release
dfu-util: No DFU capable USB device available
It should look like this:
Opening DFU capable USB device...
Device ID 1a6e:089a
Device DFU version 0101
Claiming USB DFU Interface...
Setting Alternate Interface #0 ...
Determining device status...
DFU state(2) = dfuIDLE, status(0) = No error condition is present
DFU mode device DFU version 0101
Device returned transfer size 256
Copying data from PC to DFU device
Download [=========================] 100% 10783 bytes
Download done.
DFU state(2) = dfuIDLE, status(0) = No error condition is present
Done!
Resetting USB to switch back to Run-Time mode
Sorry to ping you @olivierlambert but would you happen to know what might cause this in XCP/XO? Is there something going on when the device is made into a vUSB that would cause it to error out/be inaccessible in DFU (I assume this means Device Firmware Update)?
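One thing that may help narrow this down is checking which USB ID the stick shows inside the VM at the moment make reset runs. The rough Python sketch below classifies lsusb output. The ID 1a6e:089a comes from the expected dfu-util output above. The ID 18d1:9302 is an assumption on my part: it is the ID the USB Accelerator is commonly reported under once its firmware is loaded, which would fit XO listing the device as Google Inc. The function name and the sample lines in the usage note are illustrative only.

```python
import re

# ID shown in the expected dfu-util output in this thread (DFU-capable state).
BOOTLOADER_ID = "1a6e:089a"
# Assumption: ID reported after the runtime firmware has been loaded.
RUNTIME_ID = "18d1:9302"


def coral_mode(lsusb_text):
    """Classify a Coral USB stick from the text printed by `lsusb`.

    Returns "runtime", "dfu", or "absent".
    """
    ids = {
        match.lower()
        for match in re.findall(
            r"\bID\s+([0-9a-fA-F]{4}:[0-9a-fA-F]{4})\b", lsusb_text
        )
    }
    if RUNTIME_ID in ids:
        return "runtime"
    if BOOTLOADER_ID in ids:
        return "dfu"
    return "absent"
```

For example, coral_mode("Bus 001 Device 004: ID 1a6e:089a Global Unichip Corp.") gives "dfu". If the VM only ever sees the runtime ID, then "No DFU capable USB device available" would be the expected message rather than a passthrough fault, though that is one possible reading and not a confirmed explanation of the slow inference.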
-
Hi,
I don't know the internal mechanism of the vUSB feature or why it causes this on your device (which really is a special device, with its own quirks).
I don't remember if you already tried to pass through a PCIe USB adapter card, then plug the USB device into it and see if it's better than vUSB?
-
@olivierlambert Thanks, don't worry in that case, was just to see if there was something like "oh yeah XCP does [something] with vUSBs when passing through which could explain it". The server is a mini PC so no PCIe card slots or capability unfortunately.
I'll just live with 40ms via VirtualHere (don't know why that's so high either as others have 15-20 with that method)! It works well enough.
-
hey @andSmv @olivierlambert
I have a PCI Coral TPU and have the same issue described in this thread. It doesn't look like anyone confirmed whether the patch is working.
Anything I can do to help test here? I have just switched away from Proxmox, so I would prefer to get it working in XCP.
I'm currently on 8.3 and the alt kernel, but happy to test with whatever. I have some spare hardware to set up a dedicated test if needed.
uname -a
Linux xcp-long 4.19.316+1 #1 SMP Mon Aug 19 14:31:42 CEST 2024 x86_64 x86_64 x86_64 GNU/Linux
xl dmesg
(XEN) [ 3010.009205] d12v5 EPT violation 0x1aa (-w-/r-x) gpa 0x000000f1846800 mfn 0x90246 type 5
(XEN) [ 3010.009207] d12v5 Walking EPT tables for GFN f1846:
(XEN) [ 3010.009209] d12v5 epte 9c00000cb3924007
(XEN) [ 3010.009210] d12v5 epte 9c0000084c552007
(XEN) [ 3010.009211] d12v5 epte 9c00000847e9d007
(XEN) [ 3010.009212] d12v5 epte 9c50000090246845
(XEN) [ 3010.009214] d12v5 --- GLA 0xffffaea6c0d8d800
(XEN) [ 3010.009219] domain_crash called from vmx_vmexit_handler+0xa8d/0x1ab0
(XEN) [ 3010.009221] Domain 12 (vcpu#5) crashed on cpu#17:
(XEN) [ 3010.009225] ----[ Xen-4.17.5-3 x86_64 debug=n Not tainted ]----
(XEN) [ 3010.009226] CPU: 17
(XEN) [ 3010.009227] RIP: 0010:[<ffffffff8dd86326>]
(XEN) [ 3010.009228] RFLAGS: 0000000000010286 CONTEXT: hvm guest (d12v5)
(XEN) [ 3010.009231] rax: ffffaea6c0d8d800 rbx: ffff88c634a53800 rcx: 0000000000000000
(XEN) [ 3010.009232] rdx: 00000000fee87000 rsi: 0000000000000000 rdi: 0000000000000000
(XEN) [ 3010.009234] rbp: ffffaea6c0b0f448 rsp: ffffaea6c0b0f410 r8: 0000000000000000
(XEN) [ 3010.009235] r9: 0000000000000000 r10: 0000000000000000 r11: 0000000000000000
(XEN) [ 3010.009236] r12: ffffaea6c0b0f464 r13: 0000000000000011 r14: ffff88c6022860c8
(XEN) [ 3010.009238] r15: 0000000000000087 cr0: 0000000080050033 cr4: 00000000001006f0
(XEN) [ 3010.009239] cr3: 0000000105aca000 cr2: 00007b3046869000
(XEN) [ 3010.009240] fsb: 000079ea9326d8c0 gsb: ffff88cb07280000 gss: 0000000000000000
(XEN) [ 3010.009242] ds: 0000 es: 0000 fs: 0000 gs: 0000 ss: 0018 cs: 0010
lspci -vvv -s 86:00.0
86:00.0 System peripheral: Global Unichip Corp. Coral Edge TPU (prog-if ff)
    Subsystem: Global Unichip Corp. Coral Edge TPU
    Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR- FastB2B- DisINTx-
    Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
    Latency: 0, Cache Line Size: 64 bytes
    Interrupt: pin A routed to IRQ 56
    Region 0: Memory at 901fc000 (64-bit, prefetchable) [size=16K]
    Region 2: Memory at 90200000 (64-bit, prefetchable) [size=1M]
    Capabilities: [80] Express (v2) Endpoint, MSI 00
        DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s <64ns, L1 <1us ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset- SlotPowerLimit 0.000W
        DevCtl: Report errors: Correctable- Non-Fatal+ Fatal+ Unsupported- RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+ MaxPayload 256 bytes, MaxReadReq 4096 bytes
        DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
        LnkCap: Port #1, Speed 5GT/s, Width x1, ASPM L0s L1, Exit Latency L0s <64ns, L1 <1us ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+
        LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk- ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
        LnkSta: Speed 5GT/s, Width x1, TrErr- Train- SlotClk- DLActive- BWMgmt- ABWMgmt-
        DevCap2: Completion Timeout: Range ABCD, TimeoutDis+, LTR+, OBFF Not Supported
        DevCtl2: Completion Timeout: 260ms to 900ms, TimeoutDis-, LTR-, OBFF Disabled
        LnkCtl2: Target Link Speed: 5GT/s, EnterCompliance- SpeedDis- Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS- Compliance De-emphasis: -6dB
        LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete-, EqualizationPhase1- EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
    Capabilities: [d0] MSI-X: Enable- Count=128 Masked-
        Vector table: BAR=2 offset=00046800
        PBA: BAR=2 offset=00046068
    Capabilities: [e0] MSI: Enable- Count=1/32 Maskable- 64bit+
        Address: 0000000000000000 Data: 0000
    Capabilities: [f8] Power Management version 3
        Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
        Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
    Capabilities: [100 v1] Vendor Specific Information: ID=1556 Rev=1 Len=008 <?>
    Capabilities: [108 v1] Latency Tolerance Reporting
        Max snoop latency: 0ns
        Max no snoop latency: 0ns
    Capabilities: [110 v1] L1 PM Substates
        L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+ L1_PM_Substates+ PortCommonModeRestoreTime=10us PortTPowerOnTime=10us
    Capabilities: [200 v2] Advanced Error Reporting
        UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
        UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq+ ACSViol-
        UESvrt: DLP+ SDES- TLP+ FCP+ CmpltTO+ CmpltAbrt- UnxCmplt- RxOF- MalfTLP+ ECRC+ UnsupReq- ACSViol-
        CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
        CEMsk: RxErr+ BadTLP+ BadDLLP+ Rollover+ Timeout+ NonFatalErr+
        AERCap: First Error Pointer: 00, GenCap+ CGenEn- ChkCap+ ChkEn-
    Kernel driver in use: pciback
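The two dumps above can be cross-checked with a little arithmetic. This is a rough Python sketch, not a diagnosis: it takes the gpa and mfn from the EPT violation line and the BAR 2 base, BAR 2 size and MSI-X offsets from the lspci output, and works out where inside BAR 2 the faulting access lands. It assumes the mfn in the log is the host frame number and that pages are 4 KiB. The function names are made up for illustration.

```python
import re

PAGE = 4096  # assumed 4 KiB pages

# Values copied from the xl dmesg and lspci output in this post.
EPT_LINE = ("(XEN) [ 3010.009205] d12v5 EPT violation 0x1aa (-w-/r-x) "
            "gpa 0x000000f1846800 mfn 0x90246 type 5")
BAR2_BASE = 0x90200000       # Region 2: Memory at 90200000
BAR2_SIZE = 1 << 20          # [size=1M]
MSIX_TABLE_OFFSET = 0x46800  # Vector table: BAR=2 offset=00046800
MSIX_PBA_OFFSET = 0x46068    # PBA: BAR=2 offset=00046068


def parse_ept_violation(line):
    """Return (gpa, mfn) as integers from an EPT violation line, or None."""
    m = re.search(
        r"EPT violation\s+\S+\s+\(\S+\)\s+gpa\s+(0x[0-9a-fA-F]+)"
        r"\s+mfn\s+(0x[0-9a-fA-F]+)",
        line,
    )
    if not m:
        return None
    return int(m.group(1), 16), int(m.group(2), 16)


def locate_in_bar(gpa, mfn, bar_base, bar_size):
    """Return the offset of the faulting access inside the BAR, or None."""
    host_addr = mfn * PAGE + (gpa % PAGE)
    offset = host_addr - bar_base
    return offset if 0 <= offset < bar_size else None


def same_page(offset_a, offset_b):
    """True if two BAR offsets fall in the same 4 KiB page."""
    return offset_a // PAGE == offset_b // PAGE


if __name__ == "__main__":
    gpa, mfn = parse_ept_violation(EPT_LINE)
    offset = locate_in_bar(gpa, mfn, BAR2_BASE, BAR2_SIZE)
    print(hex(offset), offset == MSIX_TABLE_OFFSET,
          same_page(offset, MSIX_PBA_OFFSET))
```

With the values from this post the offset comes out as 0x46800: the faulting guest write lands on the start of the MSI-X vector table, which sits in the same 4 KiB page as the PBA. That would be consistent with the device keeping its MSI-X structures in a page the hypervisor guards, but it is only a data point for whoever looks at the patch.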