    Google Coral TPU PCIe Passthrough Woes

      andSmv Vates 🪐 XCP-ng Team Xen Guru @jjgg

      @jjgg Thank you. Yes, it's the same problem: an EPT violation. I'll try to figure out what we can do here. There's a patch from the Qubes OS developers that should normally fix the MSI-X PBA issue (I'm not sure it's the right fix, but it's worth trying). The patch applies to recent Xen and hasn't been accepted yet. I'll check whether it can be easily backported to XCP-ng's Xen and come back to you.

        exime @andSmv

        @andSmv thanks!

        @jjgg glad you're providing the info, sorry for abandoning the thread

          jjgg @andSmv

          @andSmv thanks. I completely understand that this appears to be more of a hardware issue, but I'm happy to test anything. The host these cards are installed in isn't running critical infrastructure and can be rebooted relatively easily.

          @exime your initial post was a top Google result and was still helpful! Thanks.

            andSmv Vates 🪐 XCP-ng Team Xen Guru

            @jjgg Here's the link to xen.gz.

            You need to put it in your /boot folder (back up your existing file first!) and make sure your grub.cfg points to it.

            But first: back up everything you want to keep! The patch is totally untested and didn't apply as is (so I had to adapt it). Normally that's not a big deal and it should do no harm, but... you never know.

            I'm also not sure that it will fix the issue. Unfortunately, we don't have a Coral TPU device at Vates, so we can't do a deeper analysis. The person who wrote this patch was trying to fix a different device.

            @exime - this is XCP-ng's Xen 4.13.5, patched, so there's a chance it won't work for you (from what I saw, you're running Xen 4.13.4).

            Anyway, if we have good news, we'll find a way to fix it for everybody.
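
The manual steps described in this post (keep the original image, add the patched one under a separate name, then point grub.cfg at it) could be sketched roughly as follows. This is only an illustration and not a tool provided by XCP-ng: the function name `stage_test_image` and its default file names are hypothetical, the script does not edit grub.cfg, and it has not been tested on an XCP-ng host.

```python
import shutil
from pathlib import Path


def stage_test_image(boot_dir, new_image, test_name="xen-test.gz", current_name="xen.gz"):
    """Copy a patched hypervisor image into boot_dir without touching the original.

    Returns (backup_path, staged_path). grub.cfg is NOT modified here; it still
    has to be pointed at the staged file by hand.
    """
    boot = Path(boot_dir)
    current = boot / current_name
    if not current.exists():
        raise FileNotFoundError(f"{current} not found, refusing to continue")

    # Back up the real file behind current_name (it may be a symlink).
    backup = boot / (current_name + ".bak")
    shutil.copy2(current.resolve(), backup)

    # Stage the patched image under its own name so the stock entry stays bootable.
    staged = boot / test_name
    if staged.exists():
        raise FileExistsError(f"{staged} already exists")
    shutil.copy2(new_image, staged)
    return backup, staged
```

A call such as `stage_test_image("/boot", "/root/xen.gz", test_name="xenept.gz")` (the source path is an assumption) would leave the stock image in place, so the original boot entry remains available as a fallback.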

              exime @andSmv

              @andSmv ack - I'll wait and see if it works out for @jjgg since my Xen server is in active use

                jjgg @andSmv

                Ok so testing setup.

                Downloaded xen.gz, renamed to xenept.gz and put into /boot:

                df0800ca-42eb-4675-a4cd-f0afc93ab3fd-image.png

                Updated grub.cfg to point to that file:

                1d2216bc-7194-4acb-8607-1513b348f6ca-image.png

                Rebooted host.

                Got to the GRUB screen and let it load as normal. As soon as it disappeared (so it had made the default selection), the server rebooted.

                I ended up just placing xen.gz in /boot, removing the symbolic link to the original, and trying again: no difference.

                Of note, messing with kernels / GRUB is not something I've got experience with. I may have made a mistake, or may need things explained in a bit more detail if I've misunderstood some of the instructions above.

                  jjgg @jjgg

                  Booted into fallback and put things back the way they were. Happy to keep testing if there are additional bits to test.

                    jmccoy555 @jjgg

                    Ah, just found these things exist..... Then just found this issue exists too ☹️

                      olivierlambert Vates 🪐 Co-Founder CEO

                      If only they could have made PCI hardware that follows the PCI specifications 😢

                        jmccoy555 @olivierlambert

                        Maybe this one will come to life again https://xcp-ng.org/forum/topic/7066/coral-tpu-pci-passthrough/14

                        Don't really want to buy one knowing it's not working!!

                          jjgg @olivierlambert

                          Definitely frustrating and no fault of xcp-ng - I have a lot of spare cpu cycles so it isn't majorly impacting me that I know of. I'm still available to test fixes though.

                          Looks like most of the Proxmox users have got this working in an LXC container by installing the drivers on the host itself and passing through the actual Apex devices. Not a route that's applicable to us but just a datapoint.

                            jmccoy555 @jjgg

                            @jjgg it would be great if we could get this working. My CPU utilisation is fine too, but when I shut down my Zoneminder VM things go a lot quieter (fans) so I'm sure there would be a benefit CPU and power wise.

                              Nornode @jmccoy555

                              @jmccoy555 // @jjgg

                              Did either of you get your Coral USB TPU working and passed through to a VM?

                                jjgg @Nornode

                                @Nornode hey, nope I did not. I ended up moving my infrastructure to Proxmox.

                                Honestly this is no fault of XCP-ng and XCP-ng suits my hardware / setup a lot better, but it was either that or I had two servers that needed to be bare metal.

                                  olivierlambert Vates 🪐 Co-Founder CEO

                                  PCI passthrough might cause problems with this device, but USB could work.

                                    DustyArmstrong @olivierlambert

                                    @olivierlambert This seems like as reasonable a place to ask as any - I am currently using a USB Coral over IP (VirtualHere) but would rather attach it to my VM directly - what's the current status of snapshots/backups with a vUSB?

                                    I've been reading that XO can now support disk exclusions with [NOBAK] but this probably doesn't apply to a Coral. Is an offline backup still the best available method?

                                      olivierlambert Vates 🪐 Co-Founder CEO

                                      NOBAK is supported on 8.3, yes, but I'm not sure it applies to USB. You should use an offline backup; that should work. Alternatively, we have plans to detect the error, unplug the USB device, take the snapshot, and replug it just after.

                                        DustyArmstrong @olivierlambert

                                        @olivierlambert Thanks that's good to know. That functionality would be great down the line!

                                        I do have a spare M.2 E-key slot on the XCP host running the VM that needs the Coral, but going by this thread it seems I'd have trouble with it. I might even have trouble with the USB Coral; it hasn't been much better so far in terms of wacky non-standard behavior...

                                          DustyArmstrong @DustyArmstrong

                                          So I eventually got round to trying the USB Coral via passthrough. The passthrough itself worked great, but the TPU exhibited some behavior that made it nonviable, which sucks. The USB device was detected by XO as Google Inc. and Frigate loaded the TPU, but the inference speed was in excess of 180 ms (it should be around 10 ms; with USB over IP it's 40 ms). So it worked, but didn't.

                                          The normal procedure with a Coral is to run a make reset from their utilities which switches the TPU back to runtime mode. This worked under my current (and now reverted) system of VirtualHere USB over IP, but it didn't work when passed through.

                                          Output of make reset:

                                          dfu-util: Warning: Invalid DFU suffix signature
                                          dfu-util: A valid DFU suffix will be required in a future dfu-util release
                                          dfu-util: No DFU capable USB device available
                                          

                                          It should look like this:

                                          Opening DFU capable USB device...
                                          Device ID 1a6e:089a
                                          Device DFU version 0101
                                          Claiming USB DFU Interface...
                                          Setting Alternate Interface #0 ...
                                          Determining device status...
                                          DFU state(2) = dfuIDLE, status(0) = No error condition is present
                                          DFU mode device DFU version 0101
                                          Device returned transfer size 256
                                          Copying data from PC to DFU device
                                          Download	[=========================] 100%        10783 bytes
                                          Download done.
                                          DFU state(2) = dfuIDLE, status(0) = No error condition is present
                                          Done!
                                          Resetting USB to switch back to Run-Time mode
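
For anyone scripting this check, the failing and expected outputs quoted in this post can be told apart by the marker strings they contain. A minimal sketch, assuming only the strings quoted here; the function name `classify_reset_output` is made up for illustration and is not part of the Coral utilities:

```python
def classify_reset_output(output):
    """Return "ok", "no_device" or "unknown" for the text printed by `make reset`."""
    # Marker taken from the failing run quoted in this post.
    if "No DFU capable USB device available" in output:
        return "no_device"
    # Markers taken from the expected run quoted in this post.
    if "Download done." in output and "Resetting USB to switch back to Run-Time mode" in output:
        return "ok"
    return "unknown"


failed = """dfu-util: Warning: Invalid DFU suffix signature
dfu-util: A valid DFU suffix will be required in a future dfu-util release
dfu-util: No DFU capable USB device available
"""
print(classify_reset_output(failed))  # prints: no_device
```

The check only reports which case occurred; it says nothing about why the device is not visible in DFU mode when attached as a vUSB.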
                                          

                                          Sorry to ping you @olivierlambert but would you happen to know what might cause this in XCP/XO? Is there something going on when the device is made into a vUSB that would cause it to error out/be inaccessible in DFU (I assume this means Device Firmware Update)?

                                            olivierlambert Vates 🪐 Co-Founder CEO

                                            Hi,

                                            I don't know the internal mechanism of vUSB or why it causes this on your device (which really is a special device, with its own quirks).

                                            I don't remember whether you've already tried passing through a PCIe USB adapter card and plugging the USB device into it, to see if that works better than vUSB?
