Xen Guru

  • RE: Nvidia MiG Support

    Hello, I honestly don't know how the Citrix vGPU stuff works, but here are a couple of thoughts on this topic:

    If I understand correctly, you're saying Nvidia uses the Linux VFIO framework to enable mediated devices which can then be exported to a guest. The VFIO framework isn't supported by XEN, as VFIO needs the presence of an IOMMU device managed by a Linux kernel IOMMU driver, and XEN doesn't provide/virtualize IOMMU access to dom0 (XEN manages the IOMMU by itself, but doesn't offer such access to guests).

    Basically, to export an SR-IOV virtual function to a guest with XEN you don't have to use VFIO: you can just assign the virtual function's PCI "bdf" id to the guest, and normally the guest should see this device.
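
    For example, a rough sketch of what I mean (the BDF and the VM uuid here are placeholders, adapt them to your setup):

        # Find the virtual function's BDF:
        lspci | grep -i nvidia

        # XCP-ng way: attach the VF to a VM by its BDF
        # (the leading "0/" is the device's position in the passthrough list):
        xe vm-param-set uuid=<vm-uuid> other-config:pci=0/0000:3b:00.4

        # Plain xl alternative: mark the VF assignable, then hot-attach it:
        xl pci-assignable-add 0000:3b:00.4
        xl pci-attach <domain> 0000:3b:00.4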

    From what I understand, the Nvidia user-mode toolstack (scripts & binaries) doesn't JUST create SR-IOV virtual functions, but also wants to access the VFIO/MDEV framework, so the whole thing fails.

    So maybe you can check if there's some option in the Nvidia tools to just create the SR-IOV functions, OR try to run VFIO in "no-iommu" mode (no IOMMU presence in the Linux kernel required).
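
    Something like this should enable it (untested on XCP-ng's dom0 kernel; note the mode is flagged "unsafe" upstream because DMA isolation is lost):

        # Load vfio with no-IOMMU mode enabled:
        modprobe vfio enable_unsafe_noiommu_mode=1

        # Or, if vfio is already loaded:
        echo Y > /sys/module/vfio/parameters/enable_unsafe_noiommu_mode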

    BTW, we're working on a project where we intend to use VFIO with dom0, so we're implementing the IOMMU driver in the dom0 kernel. It would be interesting to know in the future if this can help with your case.

    Hope this helps

    posted in Compute
  • RE: VM's with around 24GB+ crashes on migration.

    Obviously, it's not excluded that the issue is related to the memory footprint. Moreover, the first warning "complains" about a failure in memory allocation. (I suppose that the "receiver" node has enough memory to host the VM.)

    Normally XEN has no limitation on live-migrating a 24GB VM, so it's difficult to say what the issue is here. But clearly there's a possibility that this is a bug in XEN/toolstack... Memory fragmentation on the "receiver" node can be an issue too.

    You can probably run some different configurations to try to pinpoint this issue.
    Maybe for a start, try to migrate a VM when no other VMs are running on the "receiver" node. Also try to migrate a VM with no network connections (as the issue seems to be related to network backend status changes); see the sketch below...
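
    Roughly like this (a sketch; the uuids and host name are placeholders):

        # Migrate to an otherwise idle receiver first:
        xe vm-migrate vm=<vm-uuid> host=<receiver-host> live=true

        # Then retry with the VM's virtual NICs unplugged, to take the
        # network backend out of the picture:
        xe vif-list vm-uuid=<vm-uuid> params=uuid
        xe vif-unplug uuid=<vif-uuid>
        xe vm-migrate vm=<vm-uuid> host=<receiver-host> live=true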

    posted in Compute
  • RE: Weird kern.log errors

    Yeah, the HW problem seems to be a good guess.

    The track we can follow here is the xen_mc_flush kernel function, which raises a warning when a multicall (a hypercall wrapper) fails. The interesting thing here would be to take a look at the XEN traces. You can type xl dmesg in dom0 to see if XEN tells you something more (if it's unhappy for some reason).
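
    For example (just a quick way to skim the ring buffer; the grep pattern is only a suggestion):

        # Dump the Xen console ring buffer from dom0:
        xl dmesg | tail -n 100

        # Or look for anything suspicious around the time of the warnings:
        xl dmesg | grep -iE 'error|fail|multicall'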

    posted in Compute
  • RE: VM's with around 24GB+ crashes on migration.

    Hmmm, there are two problems here (a page-alloc failure warning and a NULL-pointer BUG) in the context of the xenwatch kernel thread, and basically both of them happen when configuring the XEN network frontend/backend communications.

    Normally this isn't related to the memory footprint of the VM, but rather to the XEN frontend/backend xenbus communication framework. Do the bugs disappear when you reduce the memory size of the VM while all other params/environment stay the same?
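
    If you want to see what the frontend/backend negotiation looks like, you can peek at xenstore from dom0 (a sketch; <domid> is the id xl list shows for your VM, and state 4 means "connected"):

        # Get the domid of the VM:
        xl list

        # Frontend side of the VIFs, as seen by the guest domain:
        xenstore-ls /local/domain/<domid>/device/vif

        # Backend side, held by dom0:
        xenstore-ls /local/domain/0/backend/vif/<domid>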

    posted in Compute
  • RE: Google Coral TPU PCIe Passthrough Woes

    @jjgg Here's the link to xen.gz.

    You need to put it in your /boot folder (back up your existing file!) and make sure your grub.cfg is pointing to it.
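
    Something like this (the paths assume a stock BIOS-boot XCP-ng layout; on EFI the grub.cfg lives elsewhere, so double-check yours):

        # Keep a copy of the current hypervisor:
        cp /boot/xen.gz /boot/xen.gz.orig

        # Drop in the patched one:
        cp xen.gz /boot/xen.gz

        # Check which xen the bootloader will load:
        grep -n 'xen' /boot/grub/grub.cfg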

    But first: back up everything you want to keep! The patch is totally untested and didn't apply as-is (so I had to adapt it). Normally it's not such a big deal and should do no harm, but... you never know.

    I'm also not sure the issue will be fixed. Unfortunately, we don't have a Coral TPU device at Vates, so we can't do a deeper analysis on this. The guy who wrote this patch was trying to fix a different device.

    @exime - this is the 4.13.5 XCP-ng patched xen, so there's a chance it won't work for you (from what I saw, you're running xen 4.13.4).

    Anyway, if we have good news, we'll find a way to fix it for everybody.

    posted in Compute
  • RE: Google Coral TPU PCIe Passthrough Woes

    @jjgg Thank you. Yes, the same problem - an EPT violation... Look, I'll try to figure out what we can do here. There's a patch that comes from the Qubes OS guys that normally should fix the MSI-X PBA issue (not sure that this is the right fix, but still... worth trying). The patch applies to recent Xen and hasn't been accepted yet. I'll take a look at whether it can be easily backported to XCP-ng Xen and come back to you.

    posted in Compute
  • RE: Google Coral TPU PCIe Passthrough Woes

    @jjgg Can you please also post the XEN traces after the VM is stopped?
    (either from hypervisor.log, or just type xl dmesg under the root account in your dom0)
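
    For example (the log path below is the usual XCP-ng one; adjust it if yours differs):

        # Capture the Xen console ring right after the VM stops:
        xl dmesg > /tmp/xen-traces.txt

        # XCP-ng also logs hypervisor output here:
        tail -n 200 /var/log/hypervisor.log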

    posted in Compute