    Google Coral TPU PCIe Passthrough Woes

    • jjgg @jjgg

      Booted into fallback and put things back the way they were. Happy to keep testing if there's additional bits to test.

      • jmccoy555 @jjgg

        Ah, just found these things exist..... Then just found this issue exists too ☹️

        • olivierlambert Vates 🪐 Co-Founder CEO

          If only they had made PCI hardware that follows the PCI specification 😢

          • jmccoy555 @olivierlambert

            Maybe this one will come to life again https://xcp-ng.org/forum/topic/7066/coral-tpu-pci-passthrough/14

            Don't really want to buy one knowing it's not working!!

            • jjgg @olivierlambert

              Definitely frustrating, and no fault of XCP-ng. I have a lot of spare CPU cycles, so it isn't majorly impacting me as far as I know. I'm still available to test fixes though.

              Looks like most of the Proxmox users have got this working in an LXC container by installing the drivers on the host itself and passing through the actual Apex devices. Not a route that's applicable to us but just a datapoint.

              • jmccoy555 @jjgg

                @jjgg it would be great if we could get this working. My CPU utilisation is fine too, but when I shut down my Zoneminder VM things go a lot quieter (fans) so I'm sure there would be a benefit CPU and power wise.

                • Nornode @jmccoy555

                  @jmccoy555 // @jjgg

                  Did either of you get your Coral USB TPU working with passthrough to a VM?

                  • jjgg @Nornode

                    @Nornode hey, nope I did not. I ended up moving my infrastructure to Proxmox.

                    Honestly this is no fault of XCP-ng and XCP-ng suits my hardware / setup a lot better, but it was either that or I had two servers that needed to be bare metal.

                    • olivierlambert Vates 🪐 Co-Founder CEO

                      PCI passthrough might cause problems with this device, but USB could work.

                      • DustyArmstrong @olivierlambert

                        @olivierlambert Seems like as reasonable a place to ask as any. I am currently using a USB Coral over IP (VirtualHere) but would rather load it into my VM directly. What's the current status of snapshots/backups with a vUSB?

                        I've been reading that XO can now support disk exclusions with [NOBAK] but this probably doesn't apply to a Coral. Is an offline backup still the best available method?

                        • olivierlambert Vates 🪐 Co-Founder CEO

                          For [NOBAK] on 8.3, yes, but I'm not sure it will apply to USB. You should use an offline backup; that should work. Alternatively, we have plans to detect the error, unplug the USB device, take the snapshot, and replug it right after.

                          • DustyArmstrong @olivierlambert

                            @olivierlambert Thanks that's good to know. That functionality would be great down the line!

                            I do have a spare M.2 E-key slot on the XCP-ng host running the VM the Coral is needed for, but going by this thread I'd have trouble with it. I might even have trouble with the USB Coral; it hasn't been much better so far in terms of wacky non-standard behavior...

                            • DustyArmstrong @DustyArmstrong

                              So I eventually got round to trying the USB Coral via passthrough. The passthrough itself worked great, but the TPU exhibited some behavior that made it nonviable, which sucks. XO detected the USB device as Google Inc. and Frigate loaded the TPU, but the inference speed was in excess of 180 ms (it should be around 10 ms; with USB over IP it's 40 ms). So it worked, but didn't.

                              The normal procedure with a Coral is to run a make reset from their utilities which switches the TPU back to runtime mode. This worked under my current (and now reverted) system of VirtualHere USB over IP, but it didn't work when passed through.

                              Output of make reset:

                              dfu-util: Warning: Invalid DFU suffix signature
                              dfu-util: A valid DFU suffix will be required in a future dfu-util release
                              dfu-util: No DFU capable USB device available
                              

                              It should look like this:

                              Opening DFU capable USB device...
                              Device ID 1a6e:089a
                              Device DFU version 0101
                              Claiming USB DFU Interface...
                              Setting Alternate Interface #0 ...
                              Determining device status...
                              DFU state(2) = dfuIDLE, status(0) = No error condition is present
                              DFU mode device DFU version 0101
                              Device returned transfer size 256
                              Copying data from PC to DFU device
                              Download	[=========================] 100%        10783 bytes
                              Download done.
                              DFU state(2) = dfuIDLE, status(0) = No error condition is present
                              Done!
                              Resetting USB to switch back to Run-Time mode
                              

                              Sorry to ping you @olivierlambert but would you happen to know what might cause this in XCP/XO? Is there something going on when the device is made into a vUSB that would cause it to error out/be inaccessible in DFU (I assume this means Device Firmware Update)?
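
A rough sketch (not from the thread) for checking which mode the USB Coral is in from inside the VM before running make reset. It classifies the output of lsusb by USB ID. The ID 1a6e:089a is the one shown in the working dfu-util output above; the ID 18d1:9302 for the "Google Inc." runtime mode is an assumption that does not appear in this thread, and coral_usb_state is a hypothetical helper name.

```python
import re

DFU_ID = "1a6e:089a"      # ID reported by dfu-util in the working output above
RUNTIME_ID = "18d1:9302"  # assumption: ID of the "Google Inc." runtime mode

def coral_usb_state(lsusb_text: str) -> str:
    """Classify lsusb output as 'runtime', 'dfu', or 'absent'."""
    # lsusb prints lines such as "Bus 002 Device 003: ID 18d1:9302 Google Inc."
    ids = {
        match.lower()
        for match in re.findall(r"ID ([0-9a-fA-F]{4}:[0-9a-fA-F]{4})", lsusb_text)
    }
    if RUNTIME_ID in ids:
        return "runtime"
    if DFU_ID in ids:
        return "dfu"
    return "absent"

# Sample input written by hand for illustration, not captured from a real host.
sample = "Bus 002 Device 003: ID 18d1:9302 Google Inc.\n"
print(coral_usb_state(sample))
```

To use it on a real system, pass in the text printed by lsusb in the VM. If the device shows up under neither ID, the VM probably cannot see it at all, which would be a different problem from a failed mode switch.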

                              • olivierlambert Vates 🪐 Co-Founder CEO

                                Hi,

                                I don't know the internal mechanism of vUSB or why it causes this on your device (which really is a special device, with its own quirks).

                                I don't remember whether you already tried passing through a PCIe USB adapter card and plugging the USB device into it, to see if that works better than vUSB?

                                • DustyArmstrong @olivierlambert

                                  @olivierlambert Thanks, don't worry in that case, was just to see if there was something like "oh yeah XCP does [something] with vUSBs when passing through which could explain it". The server is a mini PC so no PCIe card slots or capability unfortunately.

                                  I'll just live with 40ms via VirtualHere (don't know why that's so high either as others have 15-20 with that method)! It works well enough.

                                  • slavox

                                    hey @andSmv @olivierlambert
                                    I have a PCI Coral TPU and hit the same issue described in this thread. It doesn't look like anyone confirmed whether the patch works.

                                    Anything I can do to help test here? I have just switched away from Proxmox, so I would prefer to get it working in XCP-ng.
                                    I'm currently on 8.3 with the alt kernel, but happy to test with whatever. I have some spare hardware to set up a dedicated test if needed.

                                    uname -a
                                    Linux xcp-long 4.19.316+1 #1 SMP Mon Aug 19 14:31:42 CEST 2024 x86_64 x86_64 x86_64 GNU/Linux
                                    

                                    xl dmesg

                                    (XEN) [ 3010.009205] d12v5 EPT violation 0x1aa (-w-/r-x) gpa 0x000000f1846800 mfn 0x90246 type 5
                                    (XEN) [ 3010.009207] d12v5 Walking EPT tables for GFN f1846:
                                    (XEN) [ 3010.009209] d12v5  epte 9c00000cb3924007
                                    (XEN) [ 3010.009210] d12v5  epte 9c0000084c552007
                                    (XEN) [ 3010.009211] d12v5  epte 9c00000847e9d007
                                    (XEN) [ 3010.009212] d12v5  epte 9c50000090246845
                                    (XEN) [ 3010.009214] d12v5  --- GLA 0xffffaea6c0d8d800
                                    (XEN) [ 3010.009219] domain_crash called from vmx_vmexit_handler+0xa8d/0x1ab0
                                    (XEN) [ 3010.009221] Domain 12 (vcpu#5) crashed on cpu#17:
                                    (XEN) [ 3010.009225] ----[ Xen-4.17.5-3  x86_64  debug=n  Not tainted ]----
                                    (XEN) [ 3010.009226] CPU:    17
                                    (XEN) [ 3010.009227] RIP:    0010:[<ffffffff8dd86326>]
                                    (XEN) [ 3010.009228] RFLAGS: 0000000000010286   CONTEXT: hvm guest (d12v5)
                                    (XEN) [ 3010.009231] rax: ffffaea6c0d8d800   rbx: ffff88c634a53800   rcx: 0000000000000000
                                    (XEN) [ 3010.009232] rdx: 00000000fee87000   rsi: 0000000000000000   rdi: 0000000000000000
                                    (XEN) [ 3010.009234] rbp: ffffaea6c0b0f448   rsp: ffffaea6c0b0f410   r8:  0000000000000000
                                    (XEN) [ 3010.009235] r9:  0000000000000000   r10: 0000000000000000   r11: 0000000000000000
                                    (XEN) [ 3010.009236] r12: ffffaea6c0b0f464   r13: 0000000000000011   r14: ffff88c6022860c8
                                    (XEN) [ 3010.009238] r15: 0000000000000087   cr0: 0000000080050033   cr4: 00000000001006f0
                                    (XEN) [ 3010.009239] cr3: 0000000105aca000   cr2: 00007b3046869000
                                    (XEN) [ 3010.009240] fsb: 000079ea9326d8c0   gsb: ffff88cb07280000   gss: 0000000000000000
                                    (XEN) [ 3010.009242] ds: 0000   es: 0000   fs: 0000   gs: 0000   ss: 0018   cs: 0010
                                    

                                    lspci -vvv -s

                                    lspci -vvv -s 86:00.0
                                    86:00.0 System peripheral: Global Unichip Corp. Coral Edge TPU (prog-if ff)
                                    	Subsystem: Global Unichip Corp. Coral Edge TPU
                                    	Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR- FastB2B- DisINTx-
                                    	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
                                    	Latency: 0, Cache Line Size: 64 bytes
                                    	Interrupt: pin A routed to IRQ 56
                                    	Region 0: Memory at 901fc000 (64-bit, prefetchable) [size=16K]
                                    	Region 2: Memory at 90200000 (64-bit, prefetchable) [size=1M]
                                    	Capabilities: [80] Express (v2) Endpoint, MSI 00
                                    		DevCap:	MaxPayload 256 bytes, PhantFunc 0, Latency L0s <64ns, L1 <1us
                                    			ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset- SlotPowerLimit 0.000W
                                    		DevCtl:	Report errors: Correctable- Non-Fatal+ Fatal+ Unsupported-
                                    			RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+
                                    			MaxPayload 256 bytes, MaxReadReq 4096 bytes
                                    		DevSta:	CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
                                    		LnkCap:	Port #1, Speed 5GT/s, Width x1, ASPM L0s L1, Exit Latency L0s <64ns, L1 <1us
                                    			ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+
                                    		LnkCtl:	ASPM Disabled; RCB 64 bytes Disabled- CommClk-
                                    			ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
                                    		LnkSta:	Speed 5GT/s, Width x1, TrErr- Train- SlotClk- DLActive- BWMgmt- ABWMgmt-
                                    		DevCap2: Completion Timeout: Range ABCD, TimeoutDis+, LTR+, OBFF Not Supported
                                    		DevCtl2: Completion Timeout: 260ms to 900ms, TimeoutDis-, LTR-, OBFF Disabled
                                    		LnkCtl2: Target Link Speed: 5GT/s, EnterCompliance- SpeedDis-
                                    			 Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
                                    			 Compliance De-emphasis: -6dB
                                    		LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete-, EqualizationPhase1-
                                    			 EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
                                    	Capabilities: [d0] MSI-X: Enable- Count=128 Masked-
                                    		Vector table: BAR=2 offset=00046800
                                    		PBA: BAR=2 offset=00046068
                                    	Capabilities: [e0] MSI: Enable- Count=1/32 Maskable- 64bit+
                                    		Address: 0000000000000000  Data: 0000
                                    	Capabilities: [f8] Power Management version 3
                                    		Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
                                    		Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
                                    	Capabilities: [100 v1] Vendor Specific Information: ID=1556 Rev=1 Len=008 <?>
                                    	Capabilities: [108 v1] Latency Tolerance Reporting
                                    		Max snoop latency: 0ns
                                    		Max no snoop latency: 0ns
                                    	Capabilities: [110 v1] L1 PM Substates
                                    		L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+ L1_PM_Substates+
                                    			  PortCommonModeRestoreTime=10us PortTPowerOnTime=10us
                                    	Capabilities: [200 v2] Advanced Error Reporting
                                    		UESta:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                                    		UEMsk:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq+ ACSViol-
                                    		UESvrt:	DLP+ SDES- TLP+ FCP+ CmpltTO+ CmpltAbrt- UnxCmplt- RxOF- MalfTLP+ ECRC+ UnsupReq- ACSViol-
                                    		CESta:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
                                    		CEMsk:	RxErr+ BadTLP+ BadDLLP+ Rollover+ Timeout+ NonFatalErr+
                                    		AERCap:	First Error Pointer: 00, GenCap+ CGenEn- ChkCap+ ChkEn-
                                    	Kernel driver in use: pciback
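
A small sketch (not part of the original post) for relating the first line of the Xen log to the lspci output. It parses the EPT violation line and masks the faulting guest physical address down to an offset inside a 1 MiB BAR. The assumptions: the fault lies inside the guest's mapping of Region 2 (the guest-side BAR address is not shown in the thread), and the two rwx triplets mean attempted access / permitted access. The function names are hypothetical.

```python
import re

# Matches lines such as:
# "d12v5 EPT violation 0x1aa (-w-/r-x) gpa 0x000000f1846800 mfn 0x90246 type 5"
EPT_RE = re.compile(
    r"EPT violation (?P<qual>0x[0-9a-f]+) "
    r"\((?P<access>[rwx-]{3})/(?P<perms>[rwx-]{3})\) "
    r"gpa (?P<gpa>0x[0-9a-f]+)"
)

def parse_ept_violation(line):
    """Return access, permissions, and gpa from an EPT violation line, or None."""
    m = EPT_RE.search(line)
    if m is None:
        return None
    return {"access": m["access"], "perms": m["perms"], "gpa": int(m["gpa"], 16)}

def offset_in_bar(gpa, bar_size):
    """Offset of gpa inside a BAR, given that BARs are aligned to their size."""
    return gpa & (bar_size - 1)

line = ("(XEN) [ 3010.009205] d12v5 EPT violation 0x1aa (-w-/r-x) "
        "gpa 0x000000f1846800 mfn 0x90246 type 5")
fault = parse_ept_violation(line)
offset = offset_in_bar(fault["gpa"], 1 << 20)  # Region 2 is 1M in the lspci output
print(fault["access"], fault["perms"], hex(offset))  # -w- r-x 0x46800
```

Under those assumptions, the guest attempted a write at offset 0x46800 of Region 2, which is the same offset lspci reports for the MSI-X vector table (BAR=2 offset=00046800).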
                                    
                                    • olivierlambert Vates 🪐 Co-Founder CEO

                                      @Teddy-Astie if you have some bandwidth, can you take a look?

                                      • TeddyAstie Vates 🪐 XCP-ng Team Xen Guru

                                        I think it is the same MSI-X/PBA issues that may be partially fixed with https://gitlab.com/xen-project/xen/-/commit/b2cd07a0447bfa25e96ae13e190225b61a3670cb

                                        However, on this device the MSI-X vector table and the PBA are in the same page (vector table at offset 0x46800, PBA at 0x46068), which is treated a bit differently:

                                        If PBA lives on the same page, discard writes and log a message.
                                        Technically, writes outside of PBA could be allowed, but at this moment
                                        the precise location of PBA isn't saved, and also no known device abuses
                                        the spec in this way (at least yet).
                                        

                                        But according to the DKMS driver, the Coral does appear to abuse the spec in this way, with more than just the MSI-X table and PBA on a single page:
                                        https://github.com/google/gasket-driver/blob/main/src/apex_driver.c#L103-L140
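
To make the page overlap concrete, here is a minimal sketch (not from the thread or from Xen) that works out which 4 KiB pages the MSI-X table and PBA occupy, using the offsets and the vector count (Count=128) from the lspci output. It assumes 4 KiB pages, 16-byte MSI-X table entries, and one pending bit per vector packed into 64-bit words. The helper names are made up for illustration.

```python
PAGE_SIZE = 4096
MSIX_ENTRY_BYTES = 16  # each MSI-X table entry is 16 bytes

def msix_spans(table_offset, pba_offset, vectors):
    """Return (start, end) byte spans of the MSI-X table and PBA, end exclusive."""
    table = (table_offset, table_offset + vectors * MSIX_ENTRY_BYTES)
    pba_bytes = ((vectors + 63) // 64) * 8  # one bit per vector, in 64-bit words
    pba = (pba_offset, pba_offset + pba_bytes)
    return table, pba

def pages(span):
    """Set of page numbers touched by a (start, end) span."""
    start, end = span
    return set(range(start // PAGE_SIZE, (end - 1) // PAGE_SIZE + 1))

# Values from lspci: Vector table BAR=2 offset=00046800, PBA BAR=2 offset=00046068
table, pba = msix_spans(0x46800, 0x46068, 128)
shared = pages(table) & pages(pba)
print([hex(p * PAGE_SIZE) for p in sorted(shared)])  # ['0x46000']

# Both structures fit entirely inside that one page here, so the remainder of
# the page is neither table nor PBA and could hold other device registers.
other_bytes = PAGE_SIZE - (table[1] - table[0]) - (pba[1] - pba[0])
print(other_bytes)  # 2032
```

With these numbers the table covers 0x46800 to 0x46fff and the PBA covers 0x46068 to 0x46077, so both sit in the page starting at 0x46000 and 2032 bytes of that page belong to neither.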

                                        • slavox @TeddyAstie

                                          @Teddy-Astie Is this patch already in the current kernel or do I need to manually apply it?

                                          • TeddyAstie Vates 🪐 XCP-ng Team Xen Guru @slavox

                                            @slavox The patch I linked is not applied to current XCP-ng.
                                            But even if it were, it would still not work, because of the issue I quoted previously: the MSI-X table, the PBA, and other registers share the same page.
                                            It's not a simple issue to tackle, but upstream Xen is aware of it and it may be solved in the future (difficult to give an ETA though).
