Windows Server 2019 sporadic reboot

phipra

Hi,
we have had 3 Windows Server 2019 VMs running on xcp-ng 8.2.1 for a couple of months. We are using local ZFS storage (I think the storage is probably not the problem). While Ubuntu/FreeBSD VMs run without issue, the Windows VMs restart at random approx. every 4-6 weeks; apart from that, they run fine. I have the impression that restarts occur more frequently when there is CPU load (but that could be wrong). There is no error or kernel dump in the Windows event logs. I installed the citrix-vm-tools 9.2.3 package, and Windows Update is enabled. When the error occurs, xl dmesg shows:

(XEN) [2116059.042574] d60v4 Triple fault - invoking HVM shutdown action 3
(XEN) [2116059.042576] *** Dumping Dom60 vcpu#4 state: ***
(XEN) [2116059.042579] ----[ Xen-4.13.4-9.27.1  x86_64  debug=n   Not tainted ]----
(XEN) [2116059.042579] CPU:    42
(XEN) [2116059.042580] RIP:    0033:[<00007ffae7ed9aa0>]
(XEN) [2116059.042581] RFLAGS: 0000000000000287   CONTEXT: hvm guest (d60v4)
(XEN) [2116059.042582] rax: 000001d375b62dd0   rbx: 000001d375b62e00   rcx: 000001d375b62dc0
(XEN) [2116059.042583] rdx: 000001d375b62e00   rsi: 0000000000000000   rdi: 000001d375b64000
(XEN) [2116059.042584] rbp: 0000006e841fedf0   rsp: 0000006e841fece0   r8:  ba7dc7ea4e7a3d2c
(XEN) [2116059.042584] r9:  bbb5ec78748d7131   r10: 41be9084b28bde26   r11: 08c1d1fe828bde7b
(XEN) [2116059.042585] r12: 0000000000000000   r13: 000000000000fdff   r14: 000001d39f1e8000
(XEN) [2116059.042585] r15: 0000000000000001   cr0: 0000000080050033   cr4: 00000000001506f8
(XEN) [2116059.042586] cr3: 0000002f78779000   cr2: 0000000000000000
(XEN) [2116059.042586] fsb: 0000000000000000   gsb: 0000006efbfe8000   gss: ffffd080b289d000
(XEN) [2116059.042587] ds: 002b   es: 002b   fs: 0053   gs: 002b   ss: 002b   cs: 0033

I would be very grateful if anyone could point me in the right direction on how to solve this issue. Could this be a guest-related driver issue? Is there a way to get a memory dump from that VM when the crash occurs, so I can use WinDbg to find the driver?
Thanks to all xcp-ng people for their contributions!
Regards,
phipra
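
For anyone who wants to watch for this kind of event automatically, here is a minimal Python sketch that scans saved `xl dmesg` output for triple-fault lines in the format shown above. The function name `find_triple_faults` is made up for illustration, and the names for the shutdown action codes are an assumption based on the `SHUTDOWN_*` reason codes in Xen's public headers, so verify them against your Xen version.

```python
import re
from typing import List, NamedTuple

# Assumed names for Xen SHUTDOWN_* reason codes; check your Xen headers.
SHUTDOWN_REASONS = {0: "poweroff", 1: "reboot", 2: "suspend", 3: "crash", 4: "watchdog"}

# Matches lines like:
# (XEN) [2116059.042574] d60v4 Triple fault - invoking HVM shutdown action 3
TRIPLE_FAULT_RE = re.compile(
    r"\(XEN\) \[\s*(?P<ts>\d+\.\d+)\] d(?P<dom>\d+)v(?P<vcpu>\d+) "
    r"Triple fault - invoking HVM shutdown action (?P<action>\d+)"
)


class TripleFault(NamedTuple):
    uptime_s: float  # hypervisor timestamp in seconds
    domid: int       # domain ID (the "60" in d60v4)
    vcpu: int        # virtual CPU index (the "4" in d60v4)
    action: str      # decoded shutdown action


def find_triple_faults(log_text: str) -> List[TripleFault]:
    """Return one entry per triple-fault line found in the log text."""
    events = []
    for m in TRIPLE_FAULT_RE.finditer(log_text):
        code = int(m.group("action"))
        events.append(
            TripleFault(
                uptime_s=float(m.group("ts")),
                domid=int(m.group("dom")),
                vcpu=int(m.group("vcpu")),
                action=SHUTDOWN_REASONS.get(code, "unknown({})".format(code)),
            )
        )
    return events


sample = "(XEN) [2116059.042574] d60v4 Triple fault - invoking HVM shutdown action 3"
for event in find_triple_faults(sample):
    print(event)
```

Fed the output of `xl dmesg` on a schedule, something like this could at least record which domain and vCPU were affected each time, which helps to see whether the faults follow one VM or move around.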

olivierlambert

Hmm, first time I've seen a triple fault.

Invoking @andSmv to take a look on Monday if he can

andSmv

Hello @phipra,
Sorry for the late response (I didn't see this earlier).

Well, this is a tricky one.

A triple fault normally means one of two things:

heavy memory corruption (the IDT was corrupted)
the normal reboot (if ACPI reboot is not available)

My question is: could this Windows reboot be a planned one (for example related to the Windows Update mechanism, or something like that)?

It can obviously be memory corruption (possibly caused by the citrix-vm-tools drivers), but it will be VERY hard to debug this from a memory dump (and actually, AFAIK, Xen doesn't provide a guest memory dump).

My suggestion would be to try enabling/disabling some Windows services (disable Windows Update?) to see if anything changes.

Sorry for this very poor insight, but my knowledge of Windows is rather limited.

phipra

Hello @andSmv,

thanks a lot for your insight. I think I can rule out the second option (normal reboot). One of the VMs (the one the crash dump above is from) is doing scheduled reboots every night. Also Windows updates do occur from time to time with regular reboots.

I think you are right with option 1. As far as my limited understanding of x86 goes, a triple fault happens when yet another exception is raised while the CPU is trying to invoke the double-fault handler. The interrupt descriptor table should be in the Windows kernel memory space, so only device drivers should be able to corrupt it (if I am not mistaken).
Does anybody know a way to identify driver problems in a Windows Xen VM? For example, the instruction pointer is present in the register dump above. Would it be possible to save the Windows kernel memory space from time to time and, when a fault occurs, trace it back to the faulty driver? I am not sure about Address Space Layout Randomization, but I think it is possible to turn it off.
The reboots are not very frequent, but they have been a problem for me for the last couple of months, so any help or ideas are greatly appreciated!

Cheers,
phipra
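
On the question of what the register dump can tell us, here is a small Python sketch that pulls `RIP` and `cs` out of the dump in the first post and checks which privilege level the vCPU was running at. The low two bits of the `cs` selector give the privilege level (0 is kernel, 3 is user mode), and on x86-64 an address in the lower canonical half is normally a user-space address. The helper name `describe_fault_context` is hypothetical, and the regular expressions assume the exact `xl dmesg` layout shown in this thread.

```python
import re

# Two lines copied from the dump in the first post.
sample_dump = """\
(XEN) [2116059.042580] RIP:    0033:[<00007ffae7ed9aa0>]
(XEN) [2116059.042587] ds: 002b   es: 002b   fs: 0053   gs: 002b   ss: 002b   cs: 0033
"""

# First address of the non-canonical gap with 48-bit virtual addressing.
LOWER_HALF_END = 0x0000800000000000


def describe_fault_context(dump: str) -> dict:
    """Extract RIP and CS from a Xen vCPU state dump and classify them."""
    rip_match = re.search(r"RIP:\s+[0-9a-f]{4}:\[<([0-9a-f]{16})>\]", dump)
    cs_match = re.search(r"\bcs:\s+([0-9a-f]{4})", dump)
    if rip_match is None or cs_match is None:
        raise ValueError("dump does not contain RIP and cs in the expected format")

    rip = int(rip_match.group(1), 16)
    cs = int(cs_match.group(1), 16)
    rpl = cs & 0x3  # low two bits of the selector: privilege level
    return {
        "rip": rip,
        "cs": cs,
        "rpl": rpl,
        "user_mode": rpl == 3,
        "lower_half": rip < LOWER_HALF_END,
    }


ctx = describe_fault_context(sample_dump)
print("RIP = {:#018x}, CS = {:#06x}, ring {}".format(ctx["rip"], ctx["cs"], ctx["rpl"]))
print("user mode:", ctx["user_mode"], "| lower-half address:", ctx["lower_half"])
```

For the dump in this thread, `cs` is 0033, so the privilege level is 3, and `RIP` lies in the lower half of the address space. In other words, the vCPU was executing ordinary user-mode code at the moment of the fault, which suggests that `RIP` on its own will not point at a faulty driver.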

andSmv

Hello @phipra

You're right, this obviously can be a bug in Citrix drivers.

To confirm this hypothesis you can uninstall the Citrix tools. There will be some impact on performance, as you will run on emulated hardware and not paravirtualized, but normally this should work.

BACKUP all important data (a snapshot would be a good idea) and follow this procedure https://xcp-ng.org/docs/guests.html#upgrade-from-citrix-xenserver-client-tools to uninstall and clean up all Citrix add-on software. You will probably want to stop at step 5 if you're not using the scripts and are proceeding with a manual uninstall.

Hope this helps

phipra

Hi @andSmv

thanks for your tips - I tried for some time and did not have any luck. However, one time the whole xcp-ng host went down and rebooted, and then I got an entry in the Supermicro IPMI log with an uncorrectable ECC memory error. Memtest86 did not report any faults, but after replacing that particular DIMM module the reboots and triple faults just stopped. So I think it is safe to assume that these errors occurred because of a hardware failure and the VM triple faulted because of that memory segment not being available.

I just wanted to report back for anyone encountering this. Thank you for your help.

olivierlambert

Thank you very much @phipra: it's important to know what happened, and this is really helpful for others experiencing similar issues!