XCP-ng

    Coral TPU PCI Passthrough

    Compute
    • L
      logical.systems
      last edited by

      Has anyone ever gotten PCI passthrough to work for a Coral TPU?
      After enabling passthrough, the Debian 11 VM won't start and the log shows:
      unix enoent open error.

      [12:48 xcp-lab0 ~]# lspci -vvv -s 04:00.0
      04:00.0 Non-VGA unclassified device: Global Unichip Corp. Coral Edge TPU (prog-if ff)
              Subsystem: Global Unichip Corp. Coral Edge TPU
              Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx-
              Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
              Interrupt: pin A routed to IRQ 255
              Region 0: Memory at 4000100000 (64-bit, prefetchable) [disabled] [size=16K]
              Region 2: Memory at 4000000000 (64-bit, prefetchable) [disabled] [size=1M]
              Capabilities: [80] Express (v2) Endpoint, MSI 00
                      DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s <64ns, L1 <1us
                              ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset- SlotPowerLimit 25.000W
                      DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
                              RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+
                              MaxPayload 256 bytes, MaxReadReq 512 bytes
                      DevSta: CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr- TransPend-
                      LnkCap: Port #1, Speed 5GT/s, Width x1, ASPM L0s L1, Exit Latency L0s <64ns, L1 <1us
                              ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+
                      LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk-
                              ExtSynch- ClockPM+ AutWidDis- BWInt- AutBWInt-
                      LnkSta: Speed 5GT/s, Width x1, TrErr- Train- SlotClk- DLActive- BWMgmt- ABWMgmt-
                      DevCap2: Completion Timeout: Range ABCD, TimeoutDis+, LTR+, OBFF Not Supported
                      DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR+, OBFF Disabled
                      LnkCtl2: Target Link Speed: 5GT/s, EnterCompliance- SpeedDis-
                               Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
                               Compliance De-emphasis: -6dB
                      LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete-, EqualizationPhase1-
                               EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
              Capabilities: [d0] MSI-X: Enable- Count=128 Masked-
                      Vector table: BAR=2 offset=00046800
                      PBA: BAR=2 offset=00046068
              Capabilities: [e0] MSI: Enable- Count=1/32 Maskable- 64bit+
                      Address: 0000000000000000  Data: 0000
              Capabilities: [f8] Power Management version 3
                      Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
                      Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
              Capabilities: [100 v1] Vendor Specific Information: ID=1556 Rev=1 Len=008 <?>
              Capabilities: [108 v1] Latency Tolerance Reporting
                      Max snoop latency: 3145728ns
                      Max no snoop latency: 3145728ns
              Capabilities: [110 v1] L1 PM Substates
                      L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+ L1_PM_Substates+
                                PortCommonModeRestoreTime=10us PortTPowerOnTime=10us
              Capabilities: [200 v2] Advanced Error Reporting
                      UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                      UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                      UESvrt: DLP+ SDES- TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP+ ECRC- UnsupReq- ACSViol-
                      CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
                      CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
                      AERCap: First Error Pointer: 00, GenCap+ CGenEn- ChkCap+ ChkEn-
      
      [12:48 xcp-lab0 ~]#
      
      • olivierlambertO
        olivierlambert Vates 🪐 Co-Founder🦸 CEO 🧑‍💼
        last edited by

        I've read that these TPUs break the PCI specification and therefore have issues with PCI passthrough. Maybe it was on the forum, within the last year or two 🤔

        • olivierlambertO
          olivierlambert Vates 🪐 Co-Founder🦸 CEO 🧑‍💼
          last edited by

          Found it!

          https://xcp-ng.org/forum/topic/6304/google-coral-tpu-pcie-passthrough-woes/

          • L
            logical.systems @olivierlambert
            last edited by

            @olivierlambert I saw that post in my initial search but it doesn't look like the OP replied with the PCI dump. Is there any hope for a workaround?

            • olivierlambertO
              olivierlambert Vates 🪐 Co-Founder🦸 CEO 🧑‍💼
              last edited by

              You can continue here to provide details, maybe we'll see something obvious 🙂

              • L
                logical.systems @olivierlambert
                last edited by

                @olivierlambert Aside from the dump in my original post would you like me to run any additional commands to gather more data?

                • olivierlambertO
                  olivierlambert Vates 🪐 Co-Founder🦸 CEO 🧑‍💼
                  last edited by

                  @andSmv will take a look when he's around 🙂

                  • olivierlambertO
                    olivierlambert Vates 🪐 Co-Founder🦸 CEO 🧑‍💼
                    last edited by

                    What's your exact model of Coral TPU by the way?

                    • olivierlambertO
                      olivierlambert Vates 🪐 Co-Founder🦸 CEO 🧑‍💼
                      last edited by

                      For reference: https://github.com/google-coral/edgetpu/issues/343#issuecomment-1287251821

                      dakota created this issue in google-coral/edgetpu

                      open Apex failing with error -110 (No /dev/apex_0) #343

                      • L
                        logical.systems @olivierlambert
                        last edited by

                        @olivierlambert M.2 Accelerator B+M Key
                        https://coral.ai/products/m2-accelerator-bm#description

                        • olivierlambertO
                          olivierlambert Vates 🪐 Co-Founder🦸 CEO 🧑‍💼
                          last edited by

                          So I've heard the Qubes OS people wrote some patches to work around this device's deviation from the PCI spec. I need to ask around for more details about this.

                          • olivierlambertO
                            olivierlambert Vates 🪐 Co-Founder🦸 CEO 🧑‍💼
                            last edited by

                            @andSmv I discussed this with Marek from Qubes, and he told me this might be relevant (or not): https://lore.kernel.org/xen-devel/20221114192100.1539267-2-marmarek@invisiblethingslab.com/

                            What do you think?

                            • andSmvA
                              andSmv Vates 🪐 XCP-ng Team 🚀 Xen Guru 🧙
                              last edited by andSmv

                              Hello, sorry for the late response (I just discovered the topic) 🙏

                              Regarding Marek's patches, I actually think they may be worth a try (at least the patch seems to address the case where the MSI-X PBA page is shared with other registers of the device), but there are some cons too:

                              • the patches are quite new (they don't seem to be integrated yet)
                              • the patches apply to a more recent Xen (not the XCP-ng Xen), and even though we could probably backport them, that would potentially require significant work
                              • we are not 100% sure this is the issue (or the only issue)

                              So if this is a must-have, we can do some digging to make it work (but still within the scope of an "experimental" platform, not a production platform)
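
A rough way to check the "shared page" concern against the lspci dump from the first post: the sketch below parses the MSI-X capability lines and computes which 4 KiB pages of BAR 2 hold the vector table and the PBA, and how many bytes of those pages are left over for something else. The helper names (`parse_msix`, `msix_layout`) and the 4 KiB page size are assumptions made for illustration. The 16-byte table entry size and the 64-bit PBA granularity come from the PCI MSI-X layout. lspci alone cannot tell whether the leftover bytes are live device registers, so this only shows that sharing is possible, not that it is the cause of the failure.

```python
import re

PAGE_SIZE = 4096        # assumption: 4 KiB pages
MSIX_ENTRY_SIZE = 16    # bytes per MSI-X table entry

# MSI-X capability lines copied from the lspci output in the first post.
LSPCI_MSIX = """\
Capabilities: [d0] MSI-X: Enable- Count=128 Masked-
        Vector table: BAR=2 offset=00046800
        PBA: BAR=2 offset=00046068
"""


def parse_msix(text):
    """Extract the vector count and the table/PBA locations from lspci -vvv text."""
    count = int(re.search(r"MSI-X:.*Count=(\d+)", text).group(1))
    table = re.search(r"Vector table: BAR=(\d+) offset=([0-9a-fA-F]+)", text)
    pba = re.search(r"PBA: BAR=(\d+) offset=([0-9a-fA-F]+)", text)
    return {
        "count": count,
        "table_bar": int(table.group(1)),
        "table_offset": int(table.group(2), 16),
        "pba_bar": int(pba.group(1)),
        "pba_offset": int(pba.group(2), 16),
    }


def _pages(offset, length):
    """Set of page indices touched by the byte range [offset, offset + length)."""
    return set(range(offset // PAGE_SIZE, (offset + length - 1) // PAGE_SIZE + 1))


def _overlap(offset, length, page):
    """Number of bytes of [offset, offset + length) that fall inside one page."""
    lo = max(offset, page * PAGE_SIZE)
    hi = min(offset + length, (page + 1) * PAGE_SIZE)
    return max(0, hi - lo)


def msix_layout(info):
    """Report the pages used by the MSI-X structures and the bytes left over in them."""
    table_len = info["count"] * MSIX_ENTRY_SIZE
    pba_len = ((info["count"] + 63) // 64) * 8  # one pending bit per vector, in 64-bit words
    table_pages = _pages(info["table_offset"], table_len)
    pba_pages = _pages(info["pba_offset"], pba_len)
    other_bytes = {}
    # Only meaningful per BAR; here both structures live in the same BAR.
    for page in sorted(table_pages | pba_pages):
        used = _overlap(info["table_offset"], table_len, page)
        used += _overlap(info["pba_offset"], pba_len, page)
        other_bytes[page] = PAGE_SIZE - used  # bytes in this page that are not MSI-X data
    return {"table_pages": table_pages, "pba_pages": pba_pages, "other_bytes": other_bytes}


if __name__ == "__main__":
    layout = msix_layout(parse_msix(LSPCI_MSIX))
    for page, spare in layout["other_bytes"].items():
        print(f"BAR 2 page {page:#x}: {spare} bytes outside the MSI-X table/PBA")
```

For this device, the table and the PBA both land in the same page of BAR 2, and part of that page is not covered by either structure, which is consistent with the situation the patch description talks about.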

                              • olivierlambertO
                                olivierlambert Vates 🪐 Co-Founder🦸 CEO 🧑‍💼
                                last edited by

                                We could probably try a non-XCP-ng platform with a very recent "vanilla" Xen (+ Marek's patches) and see if that fixes it. If it does, we could consider a backport once 8.3 includes a more recent Xen version 🙂

                                • andSmvA
                                  andSmv Vates 🪐 XCP-ng Team 🚀 Xen Guru 🧙
                                  last edited by andSmv

                                  @logical-systems I will check which Xen version the patches apply to cleanly, and if you want I can give you a hand (if needed) building and installing your own Xen build, so you can test whether this resolves your issue.

                                  Unfortunately, we don't have the relevant hardware (Coral TPU) to test it ourselves.

                                  UPDATE: both patches apply to Xen 4.17 (tag RELEASE-4.17.0)
