XCP-ng

    Google Coral TPU PCIe Passthrough Woes

    • exime

      I recently moved to XCP-ng because I had been unable to get the Google Coral TPU to pass through properly to VMs in ESXi. Unfortunately, passing the Coral through in XCP-ng fails in a different way: the guest VM crashes as soon as I install the Google drivers, which work fine on bare-metal installs of the same Ubuntu 20.04 guest.

      The TPU does show up with lspci in the guest, and I've successfully passed through a GPU and a USB controller to different guests.

      I'm not sure what the most relevant logs are, but this is what I see in hypervisor.log:

      [2022-09-03 15:25:35] (XEN) [ 1373.790117] memory_map: error -22 removing dom11 access to [3fff2100,3fff2103]
      [2022-09-03 15:25:35] (XEN) [ 1373.889785] memory_map: error -22 removing dom11 access to [3fff2100,3fff2103]
      [2022-09-03 15:25:35] (XEN) [ 1373.988658] memory_map: error -22 removing dom11 access to [3fff2100,3fff2103]
      [2022-09-03 15:25:35] (XEN) [ 1374.083144] memory_map: error -22 removing dom11 access to [3fff2100,3fff2103]
      [2022-09-03 15:25:35] (XEN) [ 1374.182721] memory_map: error -22 removing dom11 access to [3fff2100,3fff2103]
      [2022-09-03 15:25:41] (XEN) [ 1380.209079] d11v0 EPT violation 0x1aa (-w-/r-x) gpa 0x000000f184680c mfn 0x3fff2046 type 5
      [2022-09-03 15:25:41] (XEN) [ 1380.209083] d11v0 Walking EPT tables for GFN f1846:
      [2022-09-03 15:25:41] (XEN) [ 1380.209086] d11v0  epte 9c000015d1df8107
      [2022-09-03 15:25:41] (XEN) [ 1380.209089] d11v0  epte 9c00000e34a20107
      [2022-09-03 15:25:41] (XEN) [ 1380.209092] d11v0  epte 9c00000ef3120107
      [2022-09-03 15:25:41] (XEN) [ 1380.209094] d11v0  epte 9c5003fff2046945
      [2022-09-03 15:25:41] (XEN) [ 1380.209097] d11v0  --- GLA 0xffffb91f001a180c
      [2022-09-03 15:25:41] (XEN) [ 1380.209107] domain_crash called from vmx_vmexit_handler+0xf55/0x19c0
      [2022-09-03 15:25:41] (XEN) [ 1380.209110] Domain 11 (vcpu#0) crashed on cpu#39:
      [2022-09-03 15:25:41] (XEN) [ 1380.209116] ----[ Xen-4.13.4-9.24.1  x86_64  debug=n   Not tainted ]----
      [2022-09-03 15:25:41] (XEN) [ 1380.209119] CPU:    39
      [2022-09-03 15:25:41] (XEN) [ 1380.209122] RIP:    0010:[<ffffffff9ff8ccdd>]
      [2022-09-03 15:25:41] (XEN) [ 1380.209124] RFLAGS: 0000000000010246   CONTEXT: hvm guest (d11v0)
      [2022-09-03 15:25:41] (XEN) [ 1380.209129] rax: 0000000000000000   rbx: ffffb91f001a1800   rcx: 0000000000000080
      [2022-09-03 15:25:41] (XEN) [ 1380.209132] rdx: ffffb91f001a180c   rsi: 0000000000000001   rdi: 0000000000000000
      [2022-09-03 15:25:41] (XEN) [ 1380.209135] rbp: ffffb91f0059f968   rsp: ffffb91f0059f8f0   r8:  0000000000000000
      [2022-09-03 15:25:41] (XEN) [ 1380.209139] r9:  ffffb91f0059f7a8   r10: ffffb91f00000000   r11: ffffa059148c2f40
      [2022-09-03 15:25:41] (XEN) [ 1380.209142] r12: 0000000000000000   r13: ffffa05916b35000   r14: ffffa0590c723080
      [2022-09-03 15:25:41] (XEN) [ 1380.209145] r15: 000000000000000d   cr0: 0000000080050033   cr4: 00000000001606f0
      [2022-09-03 15:25:41] (XEN) [ 1380.209147] cr3: 00000001945b2001   cr2: 000055c9503250c0
      [2022-09-03 15:25:41] (XEN) [ 1380.209150] fsb: 00007feabbd8d880   gsb: ffffa05918400000   gss: 0000000000000000
      [2022-09-03 15:25:41] (XEN) [ 1380.209153] ds: 0000   es: 0000   fs: 0000   gs: 0000   ss: 0018   cs: 0010
      
      • olivierlambert Vates 🪐 Co-Founder CEO

        Can you try an older kernel in your VM, just to be sure? (E.g. a Debian 10 guest with the default bundled kernel.)

        • exime @olivierlambert

          olivierlambert will do!

          • exime @olivierlambert

            olivierlambert said in Google Coral TPU PCIe Passthrough Woes:

            Can you try an older kernel in your VM, just to be sure? (E.g. a Debian 10 guest with the default bundled kernel.)

            I'm just now getting back to this.

            Might the problem be related to this issue?

            "Unfortunately the device in question violates PCI specification by mapping PBA, MSI-X vector table, and other registers into same 4KB page (PBA is at 0x46068, VT at 0x46800, but there is a bunch of other registers in 0x46XXX range)."

            https://github.com/google-coral/edgetpu/issues/343#issuecomment-1287251821

            dakota created this issue in google-coral/edgetpu

            open Apex failing with error -110 (No /dev/apex_0) #343
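To make the quoted claim concrete: with 4 KiB pages, two BAR-relative offsets land in the same page when they are equal after dropping the low 12 bits. The following is a minimal sketch that uses only the two offsets from the quote; the helper names are invented for illustration and are not part of any driver or Xen code.

```python
PAGE_SHIFT = 12  # assumption: 4 KiB pages, i.e. the low 12 bits are the in-page offset

# Offsets taken from the quoted GitHub comment (relative to the same BAR).
PBA_OFFSET = 0x46068
VECTOR_TABLE_OFFSET = 0x46800


def page_index(offset: int) -> int:
    """Return the 4 KiB page number that a BAR-relative offset falls in."""
    return offset >> PAGE_SHIFT


def same_page(a: int, b: int) -> bool:
    """True when both offsets fall inside the same 4 KiB page."""
    return page_index(a) == page_index(b)


print(hex(page_index(PBA_OFFSET)), hex(page_index(VECTOR_TABLE_OFFSET)))  # 0x46 0x46
print(same_page(PBA_OFFSET, VECTOR_TABLE_OFFSET))  # True
```

Both structures sit in page 0x46 of the BAR, which is the layout the quoted comment describes.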

            • andSmv Vates 🪐 XCP-ng Team Xen Guru

              Hello exime,

              Here we have an EPT violation (a write access to an r-x page) at mfn 0x3fff2046. This address is tagged as an MMIO address, so it very probably belongs to the device you're trying to pass through.

              Normally I wouldn't expect this to be related to the 0x46xxx range (where the MSI-X capabilities point). But the fact that there's some hacking in there makes me think the Google engineers did some hacking all over the place.

              Could you please, while running a native guest (or in dom0 before the passthrough), give a PCI dump for your device:

              lspci -vvv -s $YOUR_DEV_BDF
              

              YOUR_DEV_BDF is your device's PCI address in bus:device.function form (e.g. 00:01.0)

              • jjgg @andSmv

                andSmv I'm pretty sure I'm having the same problem as exime, except that in my case the VM just powers off abruptly.

                Output from your command on the host:

                [16:09 xenhost04 ~]# lspci -vvv -s 41:00.0
                41:00.0 Non-VGA unclassified device: Global Unichip Corp. Coral Edge TPU (prog-if ff)
                        Subsystem: Global Unichip Corp. Coral Edge TPU
                        Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
                        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
                        Latency: 0, Cache Line Size: 64 bytes
                        Interrupt: pin A routed to IRQ 66
                        Region 0: Memory at d01fc000 (64-bit, prefetchable) [size=16K]
                        Region 2: Memory at d0000000 (64-bit, prefetchable) [size=1M]
                        Capabilities: [80] Express (v2) Endpoint, MSI 00
                                DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s <64ns, L1 <1us
                                        ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset- SlotPowerLimit 25.000W
                                DevCtl: Report errors: Correctable- Non-Fatal+ Fatal+ Unsupported+
                                        RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+
                                        MaxPayload 256 bytes, MaxReadReq 512 bytes
                                DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
                                LnkCap: Port #1, Speed 5GT/s, Width x1, ASPM L0s L1, Exit Latency L0s <64ns, L1 <1us
                                        ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+
                                LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk-
                                        ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
                                LnkSta: Speed 5GT/s, Width x1, TrErr- Train- SlotClk- DLActive- BWMgmt- ABWMgmt-
                                DevCap2: Completion Timeout: Range ABCD, TimeoutDis+, LTR+, OBFF Not Supported
                                DevCtl2: Completion Timeout: 65ms to 210ms, TimeoutDis-, LTR-, OBFF Disabled
                                LnkCtl2: Target Link Speed: 5GT/s, EnterCompliance- SpeedDis-
                                         Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
                                         Compliance De-emphasis: -6dB
                                LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete-, EqualizationPhase1-
                                         EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
                        Capabilities: [d0] MSI-X: Enable- Count=128 Masked-
                                Vector table: BAR=2 offset=00046800
                                PBA: BAR=2 offset=00046068
                        Capabilities: [e0] MSI: Enable- Count=1/32 Maskable- 64bit+
                                Address: 0000000000000000  Data: 0000
                        Capabilities: [f8] Power Management version 3
                                Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
                                Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
                        Capabilities: [100 v1] Vendor Specific Information: ID=1556 Rev=1 Len=008 <?>
                        Capabilities: [108 v1] Latency Tolerance Reporting
                                Max snoop latency: 0ns
                                Max no snoop latency: 0ns
                        Capabilities: [110 v1] L1 PM Substates
                                L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+ L1_PM_Substates+
                                          PortCommonModeRestoreTime=10us PortTPowerOnTime=10us
                        Capabilities: [200 v2] Advanced Error Reporting
                                UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                                UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt+ UnxCmplt+ RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                                UESvrt: DLP+ SDES- TLP+ FCP+ CmpltTO+ CmpltAbrt- UnxCmplt- RxOF- MalfTLP+ ECRC+ UnsupReq- ACSViol-
                                CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
                                CEMsk:  RxErr+ BadTLP+ BadDLLP+ Rollover+ Timeout+ NonFatalErr+
                                AERCap: First Error Pointer: 00, GenCap+ CGenEn- ChkCap+ ChkEn-
                        Kernel driver in use: pciback
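The MSI-X lines in this dump can be turned into byte ranges inside BAR 2. This is a rough sketch, assuming the standard MSI-X layout (16 bytes per vector table entry, one pending bit per vector packed into 64-bit words); the function name and the returned dictionary keys are made up for illustration.

```python
import re

MSIX_ENTRY_BYTES = 16  # standard MSI-X table entry size


def msix_layout(lspci_text: str) -> dict:
    """Extract the MSI-X table and PBA byte ranges from `lspci -vvv` text."""
    count = int(re.search(r"MSI-X: Enable[+-] Count=(\d+)", lspci_text).group(1))
    table = re.search(r"Vector table: BAR=(\d+) offset=([0-9a-fA-F]+)", lspci_text)
    pba = re.search(r"PBA: BAR=(\d+) offset=([0-9a-fA-F]+)", lspci_text)
    table_start = int(table.group(2), 16)
    pba_start = int(pba.group(2), 16)
    pba_bytes = ((count + 63) // 64) * 8  # one bit per vector, rounded up to 64-bit words
    return {
        "count": count,
        "table_bar": int(table.group(1)),
        "table": (table_start, table_start + count * MSIX_ENTRY_BYTES),
        "pba_bar": int(pba.group(1)),
        "pba": (pba_start, pba_start + pba_bytes),
    }


# The three relevant lines from the dump above.
SAMPLE = """\
        Capabilities: [d0] MSI-X: Enable- Count=128 Masked-
                Vector table: BAR=2 offset=00046800
                PBA: BAR=2 offset=00046068
"""

layout = msix_layout(SAMPLE)
for name in ("table", "pba"):
    start, end = layout[name]
    print(f"{name}: BAR {layout[name + '_bar']} [{start:#x}, {end:#x})")
# table: BAR 2 [0x46800, 0x47000)
# pba: BAR 2 [0x46068, 0x46078)
```

Under those assumptions the 128-entry table fills the upper half of page 0x46 of BAR 2, and the PBA occupies 16 bytes in the lower half of the same page.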
                
                • andSmv Vates 🪐 XCP-ng Team Xen Guru @jjgg

                  jjgg Can you please also post the Xen traces from after the VM has stopped?
                  (Either look in hypervisor.log or just run xl dmesg as root in your dom0.)

                  • jjgg @andSmv

                    andSmv

                    xl dmesg from the host in question. Note that the last set, prefixed with 739760, appeared just after I attempted to install the drivers: the depmod process was running, and the VM turns off before dpkg finishes. I've included the whole set, as I suspect it covers several different attempts.

                    Additional note, I actually have two Coral PCI devices installed. For the final test I only passed one through to the VM.

                    (XEN) [523317.173290] d19v4 EPT violation 0x1aa (-w-/r-x) gpa 0x000000f1846800 mfn 0xd0246 type 5
                    (XEN) [523317.173297] d19v4 Walking EPT tables for GFN f1846:
                    (XEN) [523317.173302] d19v4  epte 9c00005519618007
                    (XEN) [523317.173306] d19v4  epte 9c00002af3b9b007
                    (XEN) [523317.173309] d19v4  epte 9c0000244004d007
                    (XEN) [523317.173313] d19v4  epte 9c500000d0246845
                    (XEN) [523317.173318] d19v4  --- GLA 0xffffb4c880209800
                    (XEN) [523317.173333] domain_crash called from vmx_vmexit_handler+0xea6/0x1b00
                    (XEN) [523317.173338] Domain 19 (vcpu#4) crashed on cpu#23:
                    (XEN) [523317.173348] ----[ Xen-4.13.5-9.30  x86_64  debug=n   Not tainted ]----
                    (XEN) [523317.173351] CPU:    23
                    (XEN) [523317.173355] RIP:    0010:[<ffffffffad54987f>]
                    (XEN) [523317.173359] RFLAGS: 0000000000010202   CONTEXT: hvm guest (d19v4)
                    (XEN) [523317.173369] rax: ffffb4c880209800   rbx: ffffb4c8806a3aac   rcx: 0000000000000000
                    (XEN) [523317.173373] rdx: 00000000fee97000   rsi: ffffb4c8806a3aac   rdi: 0000000000000001
                    (XEN) [523317.173381] rbp: ffff904f409e6380   rsp: ffffb4c8806a3a60   r8:  000000000000000d
                    (XEN) [523317.173385] r9:  ffff904f413d52f8   r10: ffff904f4162a700   r11: 00000000000393e0
                    (XEN) [523317.173389] r12: 0000000000000097   r13: 0000000000000011   r14: 000000000000000d
                    (XEN) [523317.173393] r15: ffff904f413d52f8   cr0: 0000000080050033   cr4: 00000000001706e0
                    (XEN) [523317.173398] cr3: 0000000103736006   cr2: 00007f23d938f6b7
                    (XEN) [523317.173402] fsb: 00007f23d9e298c0   gsb: ffff90563e700000   gss: 0000000000000000
                    (XEN) [523317.173407] ds: 0000   es: 0000   fs: 0000   gs: 0000   ss: 0018   cs: 0010
                    (XEN) [524620.959493] d23v1 EPT violation 0x1aa (-w-/r-x) gpa 0x000000f1846800 mfn 0xd0246 type 5
                    (XEN) [524620.959498] d23v1 Walking EPT tables for GFN f1846:
                    (XEN) [524620.959501] d23v1  epte 9c00005730d5f007
                    (XEN) [524620.959504] d23v1  epte 9c0000245bedd007
                    (XEN) [524620.959506] d23v1  epte 9c00002af3b93007
                    (XEN) [524620.959509] d23v1  epte 9c500000d0246845
                    (XEN) [524620.959512] d23v1  --- GLA 0xffffb82103411800
                    (XEN) [524620.959520] domain_crash called from vmx_vmexit_handler+0xea6/0x1b00
                    (XEN) [524620.959523] Domain 23 (vcpu#1) crashed on cpu#10:
                    (XEN) [524620.959529] ----[ Xen-4.13.5-9.30  x86_64  debug=n   Not tainted ]----
                    (XEN) [524620.959531] CPU:    10
                    (XEN) [524620.959534] RIP:    0010:[<ffffffff95027075>]
                    (XEN) [524620.959537] RFLAGS: 0000000000010202   CONTEXT: hvm guest (d23v1)
                    (XEN) [524620.959542] rax: 00000000fee97000   rbx: ffff9213b9164480   rcx: 0000000000000000
                    (XEN) [524620.959545] rdx: ffffb82103411800   rsi: 0000000000000000   rdi: ffffb82103411800
                    (XEN) [524620.959549] rbp: ffffb8210664f9ac   rsp: ffffb8210664f960   r8:  0000000000000001
                    (XEN) [524620.959552] r9:  ffff9213be007600   r10: 0000000000001001   r11: 0000000000001001
                    (XEN) [524620.959556] r12: 0000000000000011   r13: ffff9213ba74f2a8   r14: ffff9213ba74f0a8
                    (XEN) [524620.959558] r15: 0000000000000097   cr0: 0000000080050033   cr4: 00000000001606e0
                    (XEN) [524620.959561] cr3: 00000007c9230004   cr2: 000055d28bf759a8
                    (XEN) [524620.959564] fsb: 00007f9f132264c0   gsb: ffff9213be440000   gss: 0000000000000000
                    (XEN) [524620.959567] ds: 0000   es: 0000   fs: 0000   gs: 0000   ss: 0018   cs: 0010
                    (XEN) [524887.457380] d24v0 EPT violation 0x1aa (-w-/r-x) gpa 0x000000f1846800 mfn 0xd0246 type 5
                    (XEN) [524887.457385] d24v0 Walking EPT tables for GFN f1846:
                    (XEN) [524887.457389] d24v0  epte 9c000057150c0007
                    (XEN) [524887.457392] d24v0  epte 9c000022442f9007
                    (XEN) [524887.457395] d24v0  epte 9c00002af3b93007
                    (XEN) [524887.457398] d24v0  epte 9c500000d0246845
                    (XEN) [524887.457401] d24v0  --- GLA 0xffffa20b43411800
                    (XEN) [524887.457411] domain_crash called from vmx_vmexit_handler+0xea6/0x1b00
                    (XEN) [524887.457415] Domain 24 (vcpu#0) crashed on cpu#27:
                    (XEN) [524887.457422] ----[ Xen-4.13.5-9.30  x86_64  debug=n   Not tainted ]----
                    (XEN) [524887.457424] CPU:    27
                    (XEN) [524887.457427] RIP:    0010:[<ffffffff97027075>]
                    (XEN) [524887.457430] RFLAGS: 0000000000010202   CONTEXT: hvm guest (d24v0)
                    (XEN) [524887.457437] rax: 00000000fee97000   rbx: ffff96aab4ed67e0   rcx: 0000000000000000
                    (XEN) [524887.457440] rdx: ffffa20b43411800   rsi: 0000000000000000   rdi: ffffa20b43411800
                    (XEN) [524887.457445] rbp: ffffa20b4ca4f9ac   rsp: ffffa20b4ca4f960   r8:  0000000000000001
                    (XEN) [524887.457448] r9:  ffff96aabe007600   r10: 0000000000001001   r11: 0000000000001001
                    (XEN) [524887.457452] r12: 0000000000000011   r13: ffff96aaba7482a8   r14: ffff96aaba7480a8
                    (XEN) [524887.457456] r15: 0000000000000097   cr0: 0000000080050033   cr4: 00000000001606f0
                    (XEN) [524887.457459] cr3: 00000007eb0ea003   cr2: 00007ff2ba681fa7
                    (XEN) [524887.457463] fsb: 00007ff2ba6844c0   gsb: ffff96aabe400000   gss: 0000000000000000
                    (XEN) [524887.457466] ds: 0000   es: 0000   fs: 0000   gs: 0000   ss: 0018   cs: 0010
                    (XEN) [525662.181596] d29v1 EPT violation 0x1aa (-w-/r-x) gpa 0x000000f1846800 mfn 0xd0046 type 5
                    (XEN) [525662.181599] d29v1 Walking EPT tables for GFN f1846:
                    (XEN) [525662.181602] d29v1  epte 9c00005519559007
                    (XEN) [525662.181604] d29v1  epte 9c000022442f9007
                    (XEN) [525662.181605] d29v1  epte 9c0000284bed3007
                    (XEN) [525662.181607] d29v1  epte 9c500000d0046845
                    (XEN) [525662.181609] d29v1  --- GLA 0xffffb41dc3409800
                    (XEN) [525662.181615] domain_crash called from vmx_vmexit_handler+0xea6/0x1b00
                    (XEN) [525662.181617] Domain 29 (vcpu#1) crashed on cpu#0:
                    (XEN) [525662.181621] ----[ Xen-4.13.5-9.30  x86_64  debug=n   Not tainted ]----
                    (XEN) [525662.181623] CPU:    0
                    (XEN) [525662.181625] RIP:    0010:[<ffffffffb8c27075>]
                    (XEN) [525662.181626] RFLAGS: 0000000000010202   CONTEXT: hvm guest (d29v1)
                    (XEN) [525662.181630] rax: 00000000fee97000   rbx: ffff8abaf4d72060   rcx: 0000000000000000
                    (XEN) [525662.181632] rdx: ffffb41dc3409800   rsi: 0000000000000000   rdi: ffffb41dc3409800
                    (XEN) [525662.181635] rbp: ffffb41dc67b79ac   rsp: ffffb41dc67b7960   r8:  0000000000000001
                    (XEN) [525662.181637] r9:  ffff8abafe007600   r10: 0000000000001001   r11: 0000000000001001
                    (XEN) [525662.181639] r12: 0000000000000011   r13: ffff8abafa7492a8   r14: ffff8abafa7490a8
                    (XEN) [525662.181641] r15: 0000000000000097   cr0: 0000000080050033   cr4: 00000000001606e0
                    (XEN) [525662.181643] cr3: 00000007f6df2006   cr2: 000055f23e274c78
                    (XEN) [525662.181645] fsb: 00007f3828fce4c0   gsb: ffff8abafe440000   gss: 0000000000000000
                    (XEN) [525662.181647] ds: 0000   es: 0000   fs: 0000   gs: 0000   ss: 0018   cs: 0010
                    (XEN) [739760.414382] d30v3 EPT violation 0x1aa (-w-/r-x) gpa 0x000000f1846800 mfn 0xd0046 type 5
                    (XEN) [739760.414387] d30v3 Walking EPT tables for GFN f1846:
                    (XEN) [739760.414391] d30v3  epte 9c00005730d5f007
                    (XEN) [739760.414394] d30v3  epte 9c000022442f8007
                    (XEN) [739760.414396] d30v3  epte 9c0000224437a007
                    (XEN) [739760.414399] d30v3  epte 9c500000d0046845
                    (XEN) [739760.414402] d30v3  --- GLA 0xffffae9003411800
                    (XEN) [739760.414411] domain_crash called from vmx_vmexit_handler+0xea6/0x1b00
                    (XEN) [739760.414414] Domain 30 (vcpu#3) crashed on cpu#9:
                    (XEN) [739760.414420] ----[ Xen-4.13.5-9.30  x86_64  debug=n   Not tainted ]----
                    (XEN) [739760.414422] CPU:    9
                    (XEN) [739760.414425] RIP:    0010:[<ffffffff97427075>]
                    (XEN) [739760.414428] RFLAGS: 0000000000010202   CONTEXT: hvm guest (d30v3)
                    (XEN) [739760.414433] rax: 00000000fee97000   rbx: ffff96a3f95d7cc0   rcx: 0000000000000000
                    (XEN) [739760.414437] rdx: ffffae9003411800   rsi: 0000000000000000   rdi: ffffae9003411800
                    (XEN) [739760.414441] rbp: ffffae900d0779ac   rsp: ffffae900d077960   r8:  0000000000000001
                    (XEN) [739760.414444] r9:  ffff96a3fe007600   r10: 0000000000001001   r11: 0000000000001001
                    (XEN) [739760.414448] r12: 0000000000000011   r13: ffff96a3fa7482a8   r14: ffff96a3fa7480a8
                    (XEN) [739760.414451] r15: 0000000000000097   cr0: 0000000080050033   cr4: 00000000001606e0
                    (XEN) [739760.414454] cr3: 00000007f7266001   cr2: 00007f22e42b42b0
                    (XEN) [739760.414457] fsb: 00007f22e412f4c0   gsb: ffff96a3fe4c0000   gss: 0000000000000000
                    (XEN) [739760.414460] ds: 0000   es: 0000   fs: 0000   gs: 0000   ss: 0018   cs: 0010
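One way to read these traces is to turn the faulting address back into an offset inside the device's BAR. The sketch below assumes that mfn is a 4 KiB machine frame number, that the low 12 bits of gpa give the offset within that frame, and that BAR 2 of the passed-through card starts at 0xd0000000 as in the lspci dump for 41:00.0. The lines with mfn 0xd0246 presumably belong to the second Coral card, whose BARs were not posted, so only the last trace is decoded here. The function and key names are hypothetical.

```python
import re

# Matches e.g. "d30v3 EPT violation 0x1aa (-w-/r-x) gpa 0x000000f1846800 mfn 0xd0046"
EPT_RE = re.compile(
    r"d(?P<dom>\d+)v(?P<vcpu>\d+) EPT violation (?P<qual>0x[0-9a-f]+) "
    r"\((?P<attempted>[rwx-]{3})/(?P<allowed>[rwx-]{3})\) "
    r"gpa (?P<gpa>0x[0-9a-f]+) mfn (?P<mfn>0x[0-9a-f]+)"
)


def decode_ept_violation(line: str, bar_base: int, page_size: int = 4096):
    """Return a dict describing the fault, or None if the line is not an EPT violation."""
    m = EPT_RE.search(line)
    if m is None:
        return None
    gpa = int(m.group("gpa"), 16)
    mfn = int(m.group("mfn"), 16)
    host_addr = mfn * page_size + gpa % page_size  # frame base plus in-page offset
    return {
        "domain": int(m.group("dom")),
        "vcpu": int(m.group("vcpu")),
        "attempted": m.group("attempted"),  # access the guest tried
        "allowed": m.group("allowed"),      # access the EPT entry permits
        "host_addr": host_addr,
        "bar_offset": host_addr - bar_base,
    }


LINE = ("(XEN) [739760.414382] d30v3 EPT violation 0x1aa (-w-/r-x) "
        "gpa 0x000000f1846800 mfn 0xd0046 type 5")
BAR2_BASE = 0xD0000000  # "Region 2" from the lspci output for 41:00.0

info = decode_ept_violation(LINE, BAR2_BASE)
print(info["attempted"], info["allowed"], hex(info["bar_offset"]))  # -w- r-x 0x46800
```

If those assumptions hold, the faulting write lands at BAR 2 offset 0x46800, which is the MSI-X vector table offset reported by lspci.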
                    
                    • andSmv Vates 🪐 XCP-ng Team Xen Guru @jjgg

                      jjgg Thank you. Yes, it's the same problem: an EPT violation. I'll try to figure out what we can do here. There's a patch from the Qubes OS developers that should normally fix the MSI-X PBA issue (I'm not sure it's the right fix, but it's worth trying). The patch applies to recent Xen and hasn't been accepted yet. I'll check whether it can be easily backported to XCP-ng's Xen and come back to you.

                      • exime @andSmv

                        andSmv thanks!

                        jjgg glad you're providing the info, sorry for abandoning the thread

                        • jjgg @andSmv

                          andSmv thanks. I completely understand that this looks more like a hardware issue, but I'm happy to test anything. The host these cards are installed in isn't running critical infrastructure and can be rebooted relatively easily.

                          exime your initial post was a top Google result, was still helpful! Thanks.

                          • andSmv Vates 🪐 XCP-ng Team Xen Guru

                            jjgg Here's the link to xen.gz.

                            You need to put it in your /boot folder (back up your existing file!) and make sure your grub.cfg points to it.

                            But first: back up everything you want to keep! The patch is totally untested and didn't apply as is (so I needed to adapt it). Normally that's not a big deal and it should do no harm, but... you never know.

                            I'm also not sure that it will fix the issue. Unfortunately we don't have a Coral TPU device at Vates, so we can't do a deeper analysis. The author of this patch was trying to fix a different device.

                            exime - this is a patched XCP-ng Xen 4.13.5, so there's a chance it won't work for you (from what I saw, you're running Xen 4.13.4).

                            Anyway, if we have good news, we'll find the way to fix it for everybody.
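The backup and the grub.cfg check described above can be sketched as follows. This is only an illustration: the helper names are invented, the grub.cfg snippet is a hypothetical example rather than a real XCP-ng configuration, and the demo deliberately works on a temporary file instead of touching /boot.

```python
import re
import shutil
import tempfile
from pathlib import Path


def backup(path: Path) -> Path:
    """Copy `path` to `<name>.bak` next to it (metadata preserved) and return the copy."""
    dest = path.with_name(path.name + ".bak")
    shutil.copy2(path, dest)  # follows symlinks, so the copy holds the real image data
    return dest


def xen_images_in_grub(grub_cfg_text: str) -> list:
    """Return the distinct xen*.gz paths mentioned in grub.cfg text."""
    return sorted(set(re.findall(r"\S*xen\S*\.gz", grub_cfg_text)))


# Hypothetical grub.cfg fragment, for illustration only.
SAMPLE_GRUB = """\
menuentry 'XCP-ng' {
    multiboot2 /boot/xenept.gz dom0_mem=4096M
    module2 /boot/vmlinuz-4.19-xen root=LABEL=root
}
"""

print(xen_images_in_grub(SAMPLE_GRUB))  # ['/boot/xenept.gz']

# Demonstrate the backup helper on a throwaway file rather than the real /boot/xen.gz.
with tempfile.TemporaryDirectory() as tmp:
    fake_image = Path(tmp) / "xen.gz"
    fake_image.write_bytes(b"not a real hypervisor image")
    copy = backup(fake_image)
    print(copy.name, copy.read_bytes() == fake_image.read_bytes())  # xen.gz.bak True
```

On a real host you would pass the actual image path and the contents of your own grub.cfg; its location depends on how the host boots, so the sketch takes the text as input rather than guessing a path.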

                            • exime @andSmv

                              andSmv ack - I'll wait and see if it works out for jjgg since my Xen server is in active use

                              • jjgg @andSmv

                                Ok so testing setup.

                                Downloaded xen.gz, renamed to xenept.gz and put into /boot:

                                df0800ca-42eb-4675-a4cd-f0afc93ab3fd-image.png

                                Updated grub.cfg to point to that file:

                                1d2216bc-7194-4acb-8607-1513b348f6ca-image.png

                                Rebooted host.

                                It got to the grub screen and loaded as normal, but as soon as the menu disappeared (so it had made the default selection) the server rebooted.

                                I ended up just placing xen.gz in /boot, removing the symbolic link to the original, and trying again; no difference.

                                Of note, messing with kernels / grub is not something I have experience with. I may have made a mistake, or may need things explained in a bit more detail if I've misunderstood the instructions above.

                                • jjgg @jjgg

                                  Booted into fallback and put things back the way they were. Happy to keep testing if there's additional bits to test.

                                  • jmccoy555 @jjgg

                                    Ah, just found these things exist..... Then just found this issue exists too ☹️

                                    • olivierlambert Vates 🪐 Co-Founder CEO

                                      If only they could have made PCI hardware that follows the PCI specifications 😢

                                      • jmccoy555 @olivierlambert

                                        Maybe this one will come to life again https://xcp-ng.org/forum/topic/7066/coral-tpu-pci-passthrough/14

                                        Don't really want to buy one knowing it's not working!!

                                        • jjgg @olivierlambert

                                          Definitely frustrating and no fault of xcp-ng - I have a lot of spare cpu cycles so it isn't majorly impacting me that I know of. I'm still available to test fixes though.

                                          Looks like most of the Proxmox users have got this working in an LXC container by installing the drivers on the host itself and passing through the actual Apex devices. Not a route that's applicable to us but just a datapoint.

                                          • jmccoy555 @jjgg

                                            jjgg it would be great if we could get this working. My CPU utilisation is fine too, but when I shut down my Zoneminder VM things go a lot quieter (fans) so I'm sure there would be a benefit CPU and power wise.
