    Google Coral TPU PCIe Passthrough Woes

      olivierlambert Vates 🪐 Co-Founder CEO

      Hi,

      I don't know the internal mechanism of the vUSB thing or why it causes this on your device (which is really a special device, with its own quirks).

      I don't remember whether you already tried passing through a PCIe USB adapter card, then plugging the USB device into it to see if it's better than vUSB?

        DustyArmstrong @olivierlambert

        @olivierlambert Thanks, don't worry in that case. I was just checking whether there was something like "oh yeah, XCP does [something] with vUSBs when passing through" which could explain it. The server is a mini PC, so unfortunately it has no PCIe card slots or capability.

        I'll just live with 40ms via VirtualHere (don't know why that's so high either as others have 15-20 with that method)! It works well enough.

          slavox

          hey @andSmv @olivierlambert
          I have a PCIe Coral TPU and hit the same issue described in this thread. It doesn't look like anyone confirmed whether the patch works.

          Anything I can do to help test here? I have just switched away from Proxmox, so I would prefer to get it working in XCP-ng.
          I'm currently on 8.3 with the alt kernel, but I'm happy to test with whatever. I have some spare hardware to set up a dedicated test if needed.

          uname -a
          Linux xcp-long 4.19.316+1 #1 SMP Mon Aug 19 14:31:42 CEST 2024 x86_64 x86_64 x86_64 GNU/Linux
          

          xl dmesg

          (XEN) [ 3010.009205] d12v5 EPT violation 0x1aa (-w-/r-x) gpa 0x000000f1846800 mfn 0x90246 type 5
          (XEN) [ 3010.009207] d12v5 Walking EPT tables for GFN f1846:
          (XEN) [ 3010.009209] d12v5  epte 9c00000cb3924007
          (XEN) [ 3010.009210] d12v5  epte 9c0000084c552007
          (XEN) [ 3010.009211] d12v5  epte 9c00000847e9d007
          (XEN) [ 3010.009212] d12v5  epte 9c50000090246845
          (XEN) [ 3010.009214] d12v5  --- GLA 0xffffaea6c0d8d800
          (XEN) [ 3010.009219] domain_crash called from vmx_vmexit_handler+0xa8d/0x1ab0
          (XEN) [ 3010.009221] Domain 12 (vcpu#5) crashed on cpu#17:
          (XEN) [ 3010.009225] ----[ Xen-4.17.5-3  x86_64  debug=n  Not tainted ]----
          (XEN) [ 3010.009226] CPU:    17
          (XEN) [ 3010.009227] RIP:    0010:[<ffffffff8dd86326>]
          (XEN) [ 3010.009228] RFLAGS: 0000000000010286   CONTEXT: hvm guest (d12v5)
          (XEN) [ 3010.009231] rax: ffffaea6c0d8d800   rbx: ffff88c634a53800   rcx: 0000000000000000
          (XEN) [ 3010.009232] rdx: 00000000fee87000   rsi: 0000000000000000   rdi: 0000000000000000
          (XEN) [ 3010.009234] rbp: ffffaea6c0b0f448   rsp: ffffaea6c0b0f410   r8:  0000000000000000
          (XEN) [ 3010.009235] r9:  0000000000000000   r10: 0000000000000000   r11: 0000000000000000
          (XEN) [ 3010.009236] r12: ffffaea6c0b0f464   r13: 0000000000000011   r14: ffff88c6022860c8
          (XEN) [ 3010.009238] r15: 0000000000000087   cr0: 0000000080050033   cr4: 00000000001006f0
          (XEN) [ 3010.009239] cr3: 0000000105aca000   cr2: 00007b3046869000
          (XEN) [ 3010.009240] fsb: 000079ea9326d8c0   gsb: ffff88cb07280000   gss: 0000000000000000
          (XEN) [ 3010.009242] ds: 0000   es: 0000   fs: 0000   gs: 0000   ss: 0018   cs: 0010
          

          lspci -vvv -s

          lspci -vvv -s 86:00.0
          86:00.0 System peripheral: Global Unichip Corp. Coral Edge TPU (prog-if ff)
          	Subsystem: Global Unichip Corp. Coral Edge TPU
          	Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR- FastB2B- DisINTx-
          	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
          	Latency: 0, Cache Line Size: 64 bytes
          	Interrupt: pin A routed to IRQ 56
          	Region 0: Memory at 901fc000 (64-bit, prefetchable) [size=16K]
          	Region 2: Memory at 90200000 (64-bit, prefetchable) [size=1M]
          	Capabilities: [80] Express (v2) Endpoint, MSI 00
          		DevCap:	MaxPayload 256 bytes, PhantFunc 0, Latency L0s <64ns, L1 <1us
          			ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset- SlotPowerLimit 0.000W
          		DevCtl:	Report errors: Correctable- Non-Fatal+ Fatal+ Unsupported-
          			RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+
          			MaxPayload 256 bytes, MaxReadReq 4096 bytes
          		DevSta:	CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
          		LnkCap:	Port #1, Speed 5GT/s, Width x1, ASPM L0s L1, Exit Latency L0s <64ns, L1 <1us
          			ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+
          		LnkCtl:	ASPM Disabled; RCB 64 bytes Disabled- CommClk-
          			ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
          		LnkSta:	Speed 5GT/s, Width x1, TrErr- Train- SlotClk- DLActive- BWMgmt- ABWMgmt-
          		DevCap2: Completion Timeout: Range ABCD, TimeoutDis+, LTR+, OBFF Not Supported
          		DevCtl2: Completion Timeout: 260ms to 900ms, TimeoutDis-, LTR-, OBFF Disabled
          		LnkCtl2: Target Link Speed: 5GT/s, EnterCompliance- SpeedDis-
          			 Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
          			 Compliance De-emphasis: -6dB
          		LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete-, EqualizationPhase1-
          			 EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
          	Capabilities: [d0] MSI-X: Enable- Count=128 Masked-
          		Vector table: BAR=2 offset=00046800
          		PBA: BAR=2 offset=00046068
          	Capabilities: [e0] MSI: Enable- Count=1/32 Maskable- 64bit+
          		Address: 0000000000000000  Data: 0000
          	Capabilities: [f8] Power Management version 3
          		Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
          		Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
          	Capabilities: [100 v1] Vendor Specific Information: ID=1556 Rev=1 Len=008 <?>
          	Capabilities: [108 v1] Latency Tolerance Reporting
          		Max snoop latency: 0ns
          		Max no snoop latency: 0ns
          	Capabilities: [110 v1] L1 PM Substates
          		L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+ L1_PM_Substates+
          			  PortCommonModeRestoreTime=10us PortTPowerOnTime=10us
          	Capabilities: [200 v2] Advanced Error Reporting
          		UESta:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
          		UEMsk:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq+ ACSViol-
          		UESvrt:	DLP+ SDES- TLP+ FCP+ CmpltTO+ CmpltAbrt- UnxCmplt- RxOF- MalfTLP+ ECRC+ UnsupReq- ACSViol-
          		CESta:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
          		CEMsk:	RxErr+ BadTLP+ BadDLLP+ Rollover+ Timeout+ NonFatalErr+
          		AERCap:	First Error Pointer: 00, GenCap+ CGenEn- ChkCap+ ChkEn-
          	Kernel driver in use: pciback
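
As a rough cross-check of the two outputs above, here is a small Python sketch that decodes the faulting guest physical address from the `EPT violation` line. It assumes 4 KiB pages and, as an unverified assumption, that the fault lands inside the guest's mapping of BAR 2 (1 MiB in size, so presumably 1 MiB-aligned). The helper name `decode_gpa` is made up for illustration.

```python
PAGE_SHIFT = 12        # 4 KiB pages
BAR2_SIZE = 1 << 20    # "Region 2 ... [size=1M]" from the lspci output


def decode_gpa(gpa: int) -> dict:
    """Split a guest physical address into frame number and offsets."""
    return {
        "gfn": gpa >> PAGE_SHIFT,
        "page_offset": gpa & ((1 << PAGE_SHIFT) - 1),
        # Only meaningful if the address really sits in a 1 MiB-aligned
        # guest mapping of BAR 2 (assumption, not confirmed by the logs).
        "bar2_offset": gpa & (BAR2_SIZE - 1),
    }


fault = decode_gpa(0xF1846800)  # gpa taken from the "EPT violation" line
print({name: hex(value) for name, value in fault.items()})
```

The frame number comes out as `0xf1846`, matching the `Walking EPT tables for GFN f1846` line. If the BAR 2 assumption holds, the offset inside the BAR is `0x46800`, which is the same value lspci reports for the MSI-X vector table, so the crashing write would be aimed at the start of that table.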
          
            olivierlambert Vates 🪐 Co-Founder CEO

            @Teddy-Astie if you have some bandwidth, can you take a look?

              TeddyAstie Vates 🪐 XCP-ng Team Xen Guru

              I think it is the same MSI-X/PBA issue, which may be partially fixed by https://gitlab.com/xen-project/xen/-/commit/b2cd07a0447bfa25e96ae13e190225b61a3670cb

              However, on this device the MSI-X vector table and the PBA are in the same page (vector table at offset 46800 and PBA at offset 46068 in BAR 2), which is treated a bit differently. From the commit message:

              If PBA lives on the same page, discard writes and log a message.
              Technically, writes outside of PBA could be allowed, but at this moment
              the precise location of PBA isn't saved, and also no known device abuses
              the spec in this way (at least yet).
              

              But the Coral appears to abuse this: according to the DKMS driver, it has more than just the MSI-X table and PBA on that single page.
              https://github.com/google/gasket-driver/blob/main/src/apex_driver.c#L103-L140
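
To make the layout concrete, here is a minimal Python sketch that checks which 4 KiB pages of BAR 2 the MSI-X vector table and the PBA occupy. The offsets and vector count come from the lspci output earlier in the thread; the sizes assume the standard MSI-X layout (16 bytes per table entry, one pending bit per vector packed into 64-bit words). The helper names are illustrative and this is not how Xen itself tracks these regions.

```python
PAGE_SIZE = 4096
VECTORS = 128            # "MSI-X: Enable- Count=128"
TABLE_OFFSET = 0x46800   # "Vector table: BAR=2 offset=00046800"
PBA_OFFSET = 0x46068     # "PBA: BAR=2 offset=00046068"

# Byte ranges inside BAR 2 (end is exclusive).
table = range(TABLE_OFFSET, TABLE_OFFSET + VECTORS * 16)
pba = range(PBA_OFFSET, PBA_OFFSET + ((VECTORS + 63) // 64) * 8)


def pages(region: range) -> set:
    """Page numbers (relative to the BAR) touched by a byte range."""
    first = region.start // PAGE_SIZE
    last = (region.stop - 1) // PAGE_SIZE
    return set(range(first, last + 1))


shared = pages(table) & pages(pba)
other_bytes = PAGE_SIZE - len(table) - len(pba)

print("table pages:", [hex(p) for p in sorted(pages(table))])
print("PBA pages:  ", [hex(p) for p in sorted(pages(pba))])
print("shared:     ", [hex(p) for p in sorted(shared)])
print("bytes on the shared page outside table and PBA:", other_bytes)
```

Under these assumptions both structures fall on page `0x46` of BAR 2, and the table and PBA together cover only part of that page. Whatever the device keeps in the remaining bytes is what gets caught by the "discard writes" behaviour quoted above.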

                slavox @TeddyAstie

                @Teddy-Astie Is this patch already in the current kernel, or do I need to apply it manually?

                  TeddyAstie Vates 🪐 XCP-ng Team Xen Guru @slavox

                  @slavox The patch I linked is not applied to current XCP-ng.
                  But even if it were, it would still not work because of the issue I quoted previously: the MSI-X table, the PBA and other registers share the same page.
                  It's not a simple issue to tackle, but upstream Xen is aware of it and it may be solved in the future (difficult to give an ETA though).

                    redakula @TeddyAstie

                    @Teddy-Astie

                    I think that is the patch I tested here:
                    https://xcp-ng.org/forum/topic/7066/coral-tpu-pci-passthrough/26?_=1730909872550

                    And no, it made no difference...
                    I don't know if @andSmv has any more info (see the thread linked above)?

                    There seems to be a lot of work in the Xen repo on MSI stuff, but I could not figure out what would be relevant for the Coral.

                      SomeFixItDude

                      Having the same issue. I tried all the different kernels and patches with no luck. I found this thread and have some hope. I have both the M.2 and USB versions of the Coral, and a PCIe USB 3.2 adapter card is arriving tomorrow. I am going to try passing that card through instead of the M.2 and hook up the USB version to see what performance I can get. I'll let you guys know if the speed is acceptable.

                        SomeFixItDude

                        Passing through the USB adapter, I am getting roughly 30 ms response times, which is acceptable. Definitely not as fast as the M.2 version, but good enough to keep me from buying a mini PC or SBC. I would love to switch to the M.2, so if someone gets it working, please post here. Just a strange side note: I couldn't get the adapter to show up in XOA, so I couldn't use the GUI to do the passthrough. I had to manually hide it from dom0 and pass the USB adapter to the VM. I couldn't figure out how to refresh the PCI list, and rebooting didn't make it pick up the new device either. If anyone knows how to refresh that list, I'd be interested.

                        Thanks,
                        SFD
