XCP-ng


    andSmv

    @andSmv

    Vates 🪐 XCP-ng Team 🚀 Xen Guru 🧙


    Best posts made by andSmv

    • RE: XCP-ng 8.2.1 crash

      Hello, both issues seem to be related to memory corruption.

      • The first trace is an #NMI exception (one possible cause is a parity error detected by the HW). Moreover, CPU#12 gets an #MC (machine check) exception. The #MC is raised by the HW to notify the system software that there is an unrecoverable HW issue.
      • The second one is an invalid opcode in the Xen hypervisor context. This means that either the instruction stream or the instruction pointer is corrupted.

      My hypothesis is:

      In the first case, an ECC memory error is detected (and reported by the HW), which makes the hypervisor panic and stop.

      In the second case, the memory error is not detected (but the memory is still corrupted), and at some point this corruption leads to the same result in the Xen hypervisor.

      Can you check with the Hetzner guys whether the memory modules can be replaced?

      Another way to validate this hypothesis is to install different system software (another OS/hypervisor, or another version of the hypervisor) and see whether you experience the same issue.

      You can also add the "ler=true" option to the Xen command line. This can give us more traces (provided by the HW) to check that nothing abnormal happens at the software level. I will probably also need your Xen image with its symbol table (xen-syms-XXX and xen-syms-XXX.map).

      posted in Compute
      andSmv
    • RE: PCI Passthrough of Nvidia GPU and USB add-on card

      Yes. Some of the PCI capabilities live beyond the "standard" PCI configuration space of 256 bytes per BDF (PCI device). Unfortunately, the "enhanced" configuration access method is not yet provided by Xen for HVM guests (it's ongoing work). It would require the Xen-related part of QEMU to emulate a chipset that offers access to this method, such as Q35.

      Very probably, the Windows drivers for these devices are not happy when they cannot access these fields, so this is potentially the reason these devices malfunction.

      A good way to confirm this would be to pass these devices through to Linux guests, where we could add some extended traces. We could also pass these devices through to a PVH Linux guest and see how they are handled (PVH guests do not use QEMU for PCI bus emulation).

      posted in Compute
      andSmv
    • RE: Coral TPU PCI Passthrough

      Hello, sorry for the late response (just discovered the topic) 🙏

      Regarding Marek's patches, I actually think they are worth a try (at least, the patch seems to address the case where the MSI-X PBA page is shared with other registers of the device), but there are some cons too:

      • the patches are quite new (they don't seem to be integrated yet).
      • the patches apply to a more recent Xen (not the XCP-ng Xen), and even if we could probably backport them, this would potentially require significant work
      • we are not 100% sure it's the issue (or the only issue)

      So if this is a must-have, we can go and do some digging to make it work (but still in the scope of an "experimental" platform, not a production platform)
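
      To make the "PBA page shared with other registers" case concrete, here is a minimal sketch in Python. It is not taken from the patches: the function names and the offsets in the usage note are hypothetical, and it assumes the MSI-X table and the PBA live in the same BAR. It only relies on the standard MSI-X layout (16 bytes per table entry, one pending bit per vector rounded up to 64-bit words).

```python
PAGE_SIZE = 4096        # x86 small page
MSIX_ENTRY_SIZE = 16    # bytes per MSI-X table entry


def pages(offset: int, length: int) -> set:
    """Page numbers (within the BAR) touched by [offset, offset + length)."""
    first = offset // PAGE_SIZE
    last = (offset + length - 1) // PAGE_SIZE
    return set(range(first, last + 1))


def pba_size(num_vectors: int) -> int:
    """PBA size in bytes: one pending bit per vector, in 64-bit words."""
    return ((num_vectors + 63) // 64) * 8


def shares_msix_page(reg_offset: int, table_offset: int,
                     pba_offset: int, num_vectors: int) -> bool:
    """True if reg_offset is an ordinary register that sits in the same
    page as the MSI-X table or the PBA (assumption: same BAR for both)."""
    table_len = num_vectors * MSIX_ENTRY_SIZE
    pba_len = pba_size(num_vectors)
    in_table = table_offset <= reg_offset < table_offset + table_len
    in_pba = pba_offset <= reg_offset < pba_offset + pba_len
    if in_table or in_pba:
        return False  # part of the MSI-X structures themselves
    msix_pages = pages(table_offset, table_len) | pages(pba_offset, pba_len)
    return reg_offset // PAGE_SIZE in msix_pages
```

      With a hypothetical device exposing 8 vectors, a table at 0x2000 and a PBA at 0x3000, `shares_msix_page(0x3040, 0x2000, 0x3000, 8)` is true: the register at 0x3040 is not part of the PBA but lives in the same 4 KiB page.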

      posted in Compute
      andSmv
    • RE: Windows Server 2019 sporadic reboot

      Hello @phipra,
      Sorry for the late response (didn't see this earlier) 🙏

      Well, this is a tricky one.

      A triple fault could normally be one of two things:

      • heavy memory corruption (the IDT was corrupted)
      • a normal reboot (if ACPI reboot is not available)

      My question is: could this Windows reboot be a planned one (for example related to the Windows Update mechanism, or something like that)?

      It can obviously be memory corruption (possibly caused by the citrix-vm-tools drivers), but it will be VERY hard to debug this from a memory dump (and actually, AFAIK, Xen doesn't provide a guest memory dump).

      My suggestion would be to try enabling/disabling some Windows services (disable Windows Update?) to see if anything changes.

      Sorry for this very poor insight, but it comes from my rather poor knowledge of the Windows OS.

      posted in Compute
      andSmv
    • RE: Google Coral TPU PCIe Passthrough Woes

      Hello @exime,

      Here we have an EPT violation (write access to an r-x page) at 0x3fff2046. This address is tagged as an MMIO address, so it very probably belongs to the device you're trying to pass through.

      Normally this should have nothing to do with the 0x46xxx range (where the MSI-X caps are pointing). But the fact that there's some hacking in there makes me think that the Google engineers also did some hacking in other places.

      Can you please give a PCI dump for your device, taken while starting a native guest (or in dom0 before the passthrough):

      lspci -vvv -s $YOUR_DEV_BDF
      

      YOUR_DEV_BDF is your device's PCI address in bus:device.function form (e.g. 00:01.0)
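
      If the BDF is going to be handled in a script, a small validator avoids passing a malformed selector around. This is a minimal sketch in Python, not part of any XCP-ng tooling: `Bdf`, `parse_bdf` and `format_bdf` are hypothetical names, and it assumes the usual `[domain:]bus:device.function` hexadecimal notation printed by lspci.

```python
import re
from typing import NamedTuple


class Bdf(NamedTuple):
    domain: int
    bus: int
    device: int
    function: int


# [domain:]bus:device.function, all fields hexadecimal
_BDF_RE = re.compile(
    r"^(?:(?P<domain>[0-9a-fA-F]{1,4}):)?"
    r"(?P<bus>[0-9a-fA-F]{1,2}):"
    r"(?P<device>[0-9a-fA-F]{1,2})\."
    r"(?P<function>[0-7])$"
)


def parse_bdf(text: str) -> Bdf:
    """Parse a PCI address string, raising ValueError if it is malformed."""
    match = _BDF_RE.match(text.strip())
    if match is None:
        raise ValueError(f"not a valid BDF: {text!r}")
    domain = int(match.group("domain") or "0", 16)
    bus = int(match.group("bus"), 16)
    device = int(match.group("device"), 16)
    function = int(match.group("function"), 16)
    if device > 0x1F:  # only 32 device slots per bus
        raise ValueError(f"device number out of range: {text!r}")
    return Bdf(domain, bus, device, function)


def format_bdf(bdf: Bdf) -> str:
    """Canonical form, with the domain always included."""
    return f"{bdf.domain:04x}:{bdf.bus:02x}:{bdf.device:02x}.{bdf.function:x}"
```

      For example, `format_bdf(parse_bdf("00:01.0"))` gives the fully qualified `0000:00:01.0`, while a string that uses `:` instead of `.` before the function number is rejected.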

      posted in Compute
      andSmv
    • RE: XCP-ng 8.2.1 crash

      Hmm, in the Bugzilla thread people talk about adjusting the SoC voltage and updating the BIOS. It still looks like a HW problem to me... I will read through the whole thread and do some research on possible workarounds in newer Linux kernels for the Ryzen 5000 series.

      posted in Compute
      andSmv

    Latest posts made by andSmv

    • RE: PCI Passthrough of Nvidia GPU and USB add-on card

      @jevan223 Well, if you confirm it worked well on i440fx, then the hypothesis is probably wrong. Was it KVM/QEMU virtualization?

      posted in Compute
      andSmv
    • RE: PCI Passthrough of Nvidia GPU and USB add-on card

      @jevan223 This is not about the real hardware. This is about the emulated chipset offered by QEMU to HVM guests (which is the case for a Windows VM).

      QEMU can actually emulate two chipsets for its guests:

      • i440fx: basic PCI bus with CAM access

      • Q35: enhanced PCI bus with ECAM access (and thus access to the PCIe capabilities).

      The problem is that Q35 is not supported by the Xen-dependent parts of the QEMU code, so only i440fx is emulated for Xen HVM guests. We are actually working on enabling Q35 in Xen, but this is a work in progress.

      Well, this is a hypothesis which needs to be confirmed, but judging by the lspci output, there's a good chance that this is the reason.
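
      The difference between the two access methods can be sketched in a few lines of Python. This is only an illustration of the standard address encodings, not Xen or QEMU code, and the function names are hypothetical. Legacy CAM (the 0xCF8/0xCFC port pair) has 8 bits of register offset, so it reaches only the first 256 bytes; ECAM has 12 bits, so it reaches the full 4096 bytes, including the PCIe extended capabilities that start at offset 0x100.

```python
LEGACY_CONFIG_SIZE = 256      # bytes reachable through CAM
EXTENDED_CONFIG_SIZE = 4096   # bytes reachable through ECAM


def _check_bdf(bus: int, device: int, function: int) -> None:
    if not (0 <= bus <= 0xFF and 0 <= device <= 0x1F and 0 <= function <= 7):
        raise ValueError("bus/device/function out of range")


def cam_address(bus: int, device: int, function: int, offset: int) -> int:
    """Value written to I/O port 0xCF8 to select a dword for legacy access."""
    _check_bdf(bus, device, function)
    if not 0 <= offset < LEGACY_CONFIG_SIZE:
        raise ValueError("offset is not reachable through CAM")
    return (0x80000000 | (bus << 16) | (device << 11)
            | (function << 8) | (offset & 0xFC))


def ecam_offset(bus: int, device: int, function: int, offset: int) -> int:
    """Byte offset of a register inside the memory-mapped ECAM window."""
    _check_bdf(bus, device, function)
    if not 0 <= offset < EXTENDED_CONFIG_SIZE:
        raise ValueError("offset is outside the extended configuration space")
    return (bus << 20) | (device << 15) | (function << 12) | offset


def reachable_via_cam(offset: int) -> bool:
    """True if a capability at this offset is visible to an i440fx-style guest."""
    return 0 <= offset < LEGACY_CONFIG_SIZE
```

      For a device at 00:01.0, `ecam_offset(0, 1, 0, 0x100)` is a valid address, while `cam_address(0, 1, 0, 0x100)` raises: a guest limited to CAM simply has no way to name that register.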

      posted in Compute
      andSmv
    • RE: Coral TPU PCI Passthrough

      @logical-systems I will check which Xen version the patches apply to easily and, if you want, I can give you a hand (if needed) to build and install your own Xen build, so you can test whether this resolves your issue.

      Unfortunately we don't have the relevant HW (Coral TPU) to test it ourselves.

      UPDATE: both patches apply to Xen 4.17 (tag RELEASE-4.17.0)

      posted in Compute
      andSmv
    • RE: PCI Passthrough of Nvidia GPU and USB add-on card

      @jevan223 can you please provide the lspci -vvv output (from dom0)?

      posted in Compute
      andSmv
    • RE: PCI Passthrough of Nvidia GPU and USB add-on card

      Hmm, at first glance this looks to me like a real use case for Q35 chipset emulation support in Xen?

      posted in Compute
      andSmv
    • RE: Nehalem cpu power management

      Hello @bogikornel

      The Intel_errata_workarounds and probe_c3_errata routines are executed unconditionally (there is no boot parameter to disable these checks).

      The only way around this would be to recompile your Xen. I'm not sure this is a good idea, because the Intel documentation talks about Unpredictable System Behaviour.

      If that is really what you need, I can help you with this (or do this for you), so let me know...

      posted in Development
      andSmv
    • RE: Windows Server 2019 sporadic reboot

      Hello @phipra

      You're right, this can obviously be a bug in the Citrix drivers.

      To confirm this hypothesis you can uninstall the Citrix Tools. There will be some impact on performance, as you will be running on emulated hardware rather than paravirtualized devices, but normally this should work.

      ⚠ BACKUP all important data (a snapshot would be a good idea) and follow this procedure https://xcp-ng.org/docs/guests.html#upgrade-from-citrix-xenserver-client-tools to uninstall and clean up all Citrix add-on software. You will probably want to stop at step 5 if you're not using the scripts and are proceeding with a manual uninstall. 😄

      Hope this helps

      posted in Compute
      andSmv