XCP-ng host restarts at random intervals

christopher-petzel

I have an XCP-ng installation (8.2.1, all patches except the most recent) that restarts at random intervals. Usually the interval is a couple of months, but it has been as short as a week. This started just over one year ago. The server has been running since 2018 (with XCP-ng upgrades) and is in a single-host pool.

This isn't a normal crash. A kernel panic does not occur. There is no indication of a shutdown. The kernel just stops, then is booting again a couple of seconds later.
Kdump is working but there is no logging from kdump when this happens. I can force a kernel panic and I get logging by kdump when I force it so I know kdump is working.

I would expect this to be a hardware issue; however, the hardware does not restart. The hardware remains running and only the kernel restarts. I know this from monitoring the hardware, tracking kernel uptime, and reviewing log data.

There is no consistency in time of day or day of week. This usually occurs when the one VM on the server is idle.

I'm unable to find any indication in any log that something's gone wrong. I only can find the kernel restarting.

I've tried many hardware configurations and updated the firmware on the system board and RAID controller over the past year, and I continue to get the same results. I have re-installed XCP-ng and have also experienced the same issue through the various patches applied over the past year.

If there is a way that this could be caused by hardware without leaving any trace and not rebooting the hardware, I don't know what that could be but I'd be happy to hear any ideas.

Does anyone have any thoughts on what I could monitor or what I might look into? The one thing I've not done is move the one VM on the host to another host. I don't suspect the VM itself is the cause because there is usually no load on the VM when the restart occurs. There are licensing entanglements which result in about 24 hours of downtime and require a re-install of software through the provider's support if I move the VM, so I've not done this for testing.

olivierlambert

Hi,

So I suppose there's no trace at all in dmesg or even xl dmesg (and their corresponding log files).

edit: at this point, I would run memtest on that machine to be sure.

christopher-petzel

@olivierlambert Sadly, nothing showing but the restart. I'll run Memtest. I think I did this at some point but I don't have a record of doing it and it's not a bad idea anyway.

christopher-petzel

I was able to run Memtest for 86 hours, completed 9 passes and had no memory errors.

Maybe I'm approaching this incorrectly. I've been assuming there is a problem in dom0, since the kernel is starting without any indication as to why. What could be telling the kernel to restart? The hardware never restarts, and since dom0 restarts in a very short period of time (seconds, as best I can tell), the hypervisor seems to keep running.

I have a very limited knowledge of this stack so I know I could be completely wrong.

olivierlambert

Hmm, interesting. Have you taken a look at the Xen side of things in terms of logs?

christopher-petzel

I was wrong about the hypervisor, it is restarting. I confused myself and didn't make the connection.

In /var/log/xen/hypervisor.log... I see a "Logfile Opened" entry with the timestamp of when the log rotates, then another "Logfile Opened" entry at the timestamp when the hypervisor restarts, followed by the Xen log data from boot.
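
A rough sketch of how that check could be repeated: list and count the "Logfile Opened" entries in the log. The marker text and the log path are taken from the description above; the function name `list_log_opens` is made up for illustration.

```shell
# Hypothetical helper: print every "Logfile Opened" line (with its line
# number) from a Xen hypervisor log, followed by the total count.
# Assumption: the marker text is exactly "Logfile Opened", as quoted above.
list_log_opens() {
    logfile="$1"
    # grep exits non-zero when nothing matches, so tolerate that.
    grep -n 'Logfile Opened' "$logfile" || true
    count=$(grep -c 'Logfile Opened' "$logfile") || true
    echo "total: $count"
}

# Usage (run in dom0):
#   list_log_opens /var/log/xen/hypervisor.log
```

More than one entry between two log rotations would suggest the hypervisor itself started again in that window, not just dom0.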

So I guess I need to be thinking about why the hypervisor is restarting. Now I'm questioning whether the hardware is restarting. I have not seen a hardware restart in the IPMI data, and the recovery time seemed too short for a hardware restart. HOWEVER, the lack of evidence is not evidence itself, so I think my next move will be to monitor the hardware in a way that lets me confirm or deny a hardware restart.
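
One possible way to gather that evidence, sketched under assumptions: write a heartbeat timestamp at a fixed interval and measure how long each outage actually lasted. The heartbeat file, its location, and the function name `find_gaps` are all hypothetical. On most server boards a full POST takes noticeably longer than a few seconds, so the length of the gap is a useful hint, though not proof on its own.

```shell
# Hypothetical helper: read Unix timestamps (one per line, ascending) from
# stdin and report every gap longer than the given number of seconds.
# Assumption: something on the host appends `date +%s` to a file at a
# regular interval, so a gap means the host was not running.
find_gaps() {
    max_gap="$1"
    awk -v max="$max_gap" '
        NR > 1 && ($1 - prev) > max {
            printf "gap of %d s between %d and %d\n", $1 - prev, prev, $1
        }
        { prev = $1 }
    '
}

# Usage, with a heartbeat written every 10 seconds:
#   while true; do date +%s >> /root/heartbeat.log; sleep 10; done &
#   find_gaps 30 < /root/heartbeat.log
```

Comparing the reported gaps against the BMC's own event log would then show whether the board itself went through a reset at the same time.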

Thanks for your help @olivierlambert . It may be a couple of months before this happens again but I'll report back what I find once it happens.

christopher-petzel

@olivierlambert I have been able to confirm this is a hardware reboot. Since I've been working on this issue for a year and the restarts were so rare, at some point I convinced myself that the hardware was not restarting, even though my monitoring and logging were telling me otherwise.

Thanks for your help in guiding me to reconsider what I thought I already knew. Thankfully the restarts have become more frequent and I have had 3 reboots in 10 days. That frequency has allowed me to catch what was really happening.

olivierlambert

Ah, "great" news then. Is there anything else we can do to help?

christopher-petzel

@olivierlambert Just tell people to stick with HP hardware. This problem server has a SuperMicro system board, and it's the second board of the same model on which I've had a hardware problem. The other board stopped working completely, so it was a different failure mode. Once I retire this hardware, I will have no more SuperMicro boards in production.

olivierlambert

I'm not entirely surprised (we tell people to use Dell or HPE). Sometimes there's a bit of a lottery with Supermicro, but we also know hosting companies using SM at scale without problems…

christopher-petzel

Since I last posted on this topic, I've found that the random reboots only occur when there are Windows Server VMs on the host (Tested with 2019 and 2022). The issue will not occur when running Linux VMs.

My issue seems very similar to the problem described (and solved) in https://xcp-ng.org/forum/topic/6683/windows-server-2019-sporadic-reboot/7

The difference is that in my case, the host restarted and in the other post, the poster reports that the VMs are restarting. Since the poster also tested RAM and found no problems but was able to solve the issue by replacing a suspected DIMM, that information may be useful in the host reboot scenario that I experience.

FYI, I have not replaced the RAM yet and may not actually do it since the server in question is aging and will likely be replaced (with HP hardware) soon.

olivierlambert

Thanks for keeping us posted

splastunov

Hello!

Do all VMs on this host belong to you, and do you know for certain what processes are running on them?

I had the same issue with a Dell R630.
The solution was to update to the latest BIOS.
I think some clients ran software that triggered a bug, and the host rebooted.

XCP-ng security updates did not help.
In my case, only the BIOS update fixed the sudden crashes.

So the workaround would be to move VMs one by one to another host and check whether that solves the problem.

christopher-petzel

@splastunov Yes, all VMs are for in-house use and all were built by me personally.

I have previously followed the same steps that you followed in your case. I updated the BIOS on the host server and moved VMs one by one.

Moving VMs one by one is how I eventually found that I only had the problem when a Windows Server VM was on the host. When I had this problem occur with a fresh Windows Server 2022 VM which had no applications installed, I started to suspect that it was related to Windows. I was then able to confirm that this only occurred with Windows VMs.

Thanks for the info. I think these are great steps toward finding the problem.

olivierlambert

The /var/crash folder might also be interesting (the Dom0 log and Xen log, to see what is triggering the crash).

christopher-petzel

I believe I have the definitive cause for this 'random host reboot' issue.

After 6 months of problem-free operation, I have experienced the host reboot issue again on this server. The host was running only Linux VMs, so the theory of Windows VMs on the host contributing to the reboot issue has proven false. As with each time before, there are no indications in any relevant log files that the host is going to reboot. I think at this point I can definitively say that the reboot is caused by a faulty SuperMicro motherboard.

I've learned my lesson: use HPE servers! This SuperMicro system will be melted down for scrap.

olivierlambert

Thanks for the feedback

Well, at least keep us posted if you have the same issue with other hardware; we'll be happy to help.

Chmura

Hi @olivierlambert
Now, I have the same problem on 4 servers. Machines reset every few hours!!! Please HELP.

The machines have been running stably since:

reboot system boot 4.19.0+1 Wed Dec 28 12:30 - 05:50 (217+16:19)

Since then, the following patches have been installed, but the machines were not restarted:

May 16 09:07:40 Updated: xen-libs-4.13.5-9.30.3.xcpng8.2.x86_64
May 16 09:07:41 Updated: guest-templates-json-1.9.6-1.2.xcpng8.2.noarch
May 16 09:07:41 Updated: xcp-ng-release-presets-8.2.1-6.x86_64
May 16 09:07:41 Updated: xen-hypervisor-4.13.5-9.30.3.xcpng8.2.x86_64
May 16 09:07:42 Updated: xen-dom0-libs-4.13.5-9.30.3.xcpng8.2.x86_64
May 16 09:07:43 Updated: xen-tools-4.13.5-9.30.3.xcpng8.2.x86_64
May 16 09:07:44 Updated: xen-dom0-tools-4.13.5-9.30.3.xcpng8.2.x86_64
May 16 09:07:48 Updated: xcp-ng-release-config-8.2.1-6.x86_64
May 16 09:07:49 Updated: xcp-ng-release-8.2.1-6.x86_64
May 16 09:07:49 Updated: guest-templates-json-data-other-1.9.6-1.2.xcpng8.2.noarch
May 16 09:07:50 Updated: guest-templates-json-data-linux-1.9.6-1.2.xcpng8.2.noarch
May 16 09:07:50 Updated: guest-templates-json-data-windows-1.9.6-1.2.xcpng8.2.noarch
May 16 09:07:51 Updated: sudo-1.8.23-10.el7_9.3.x86_64
May 16 09:08:01 Updated: linux-firmware-20190314-5.1.xcpng8.2.noarch
May 16 09:08:03 Updated: 2:microcode_ctl-2.1-26.xs23.1.xcpng8.2.x86_64
May 29 06:57:47 Updated: xen-libs-4.13.5-9.31.1.xcpng8.2.x86_64
May 29 06:57:48 Updated: xcp-ng-release-presets-8.2.1-9.x86_64
May 29 06:57:49 Updated: message-switch-1.23.2-4.1.xcpng8.2.x86_64
May 29 06:57:50 Updated: forkexecd-1.18.1-2.1.xcpng8.2.x86_64
May 29 06:57:50 Updated: xen-hypervisor-4.13.5-9.31.1.xcpng8.2.x86_64
May 29 06:57:51 Updated: xen-dom0-libs-4.13.5-9.31.1.xcpng8.2.x86_64
May 29 06:57:56 Updated: 2:qemu-4.2.1-4.6.3.1.xcpng8.2.x86_64
May 29 06:58:00 Updated: xen-tools-4.13.5-9.31.1.xcpng8.2.x86_64
May 29 06:58:01 Updated: xen-dom0-tools-4.13.5-9.31.1.xcpng8.2.x86_64
May 29 06:58:03 Updated: xenopsd-0.150.14-1.1.xcpng8.2.x86_64
May 29 06:58:03 Updated: xenopsd-cli-0.150.14-1.1.xcpng8.2.x86_64
May 29 06:58:05 Updated: xenopsd-xc-0.150.14-1.1.xcpng8.2.x86_64
May 29 06:58:06 Updated: gpumon-0.18.0-4.3.xcpng8.2.x86_64
May 29 06:58:06 Updated: xcp-rrdd-1.33.2-1.1.xcpng8.2.x86_64
May 29 06:58:08 Updated: rrdd-plugins-1.10.8-5.2.xcpng8.2.x86_64
May 29 06:58:09 Updated: xapi-tests-1.249.28-1.2.xcpng8.2.x86_64
May 29 06:58:13 Updated: xapi-core-1.249.28-1.2.xcpng8.2.x86_64
May 29 06:58:16 Updated: sm-2.30.8-2.1.xcpng8.2.x86_64
May 29 06:58:20 Updated: xcp-ng-release-config-8.2.1-9.x86_64
May 29 06:58:21 Updated: xcp-ng-release-8.2.1-9.x86_64
May 29 06:58:22 Updated: 2:microcode_ctl-2.1-26.xs25.1.xcpng8.2.x86_64
May 29 06:58:28 Updated: linux-firmware-20190314-7.1.xcpng8.2.noarch
May 29 06:58:33 Updated: xapi-xe-1.249.28-1.2.xcpng8.2.x86_64
May 29 06:58:34 Updated: varstored-guard-0.6.2-2.xcpng8.2.x86_64
May 29 06:58:35 Updated: xcp-networkd-0.56.2-2.xcpng8.2.x86_64
May 29 06:58:36 Updated: sm-rawhba-2.30.8-2.1.xcpng8.2.x86_64
Jul 28 10:10:40 Updated: xen-libs-4.13.5-9.34.1.xcpng8.2.x86_64
Jul 28 10:10:41 Updated: xen-hypervisor-4.13.5-9.34.1.xcpng8.2.x86_64
Jul 28 10:10:42 Updated: xen-dom0-libs-4.13.5-9.34.1.xcpng8.2.x86_64
Jul 28 10:10:42 Updated: xen-tools-4.13.5-9.34.1.xcpng8.2.x86_64
Jul 28 10:10:44 Updated: xen-dom0-tools-4.13.5-9.34.1.xcpng8.2.x86_64
Jul 28 10:10:54 Updated: linux-firmware-20190314-8.1.xcpng8.2.noarch
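
Since these updates only took effect at the later reboot, it may help to pick out the ones that normally need a reboot to become active. A minimal sketch, assuming the list above comes from the yum log: the function name `reboot_relevant_updates` and the selection of package-name prefixes are illustrative assumptions, not an official list.

```shell
# Hypothetical helper: read yum.log-style lines from stdin and print the
# date and package for updates that normally need a reboot to take effect.
# Assumption: the package-name prefixes below (hypervisor, kernel,
# microcode, firmware) are an illustrative selection, not an official list.
reboot_relevant_updates() {
    awk '$4 == "Updated:" {
        pkg = $5
        sub(/^[0-9]+:/, "", pkg)        # drop an epoch prefix such as "2:"
        if (pkg ~ /^(xen-hypervisor|kernel|microcode_ctl|linux-firmware)-/)
            print $1, $2, pkg
    }'
}

# Usage (run in dom0; /var/log/yum.log is the usual yum log location):
#   reboot_relevant_updates < /var/log/yum.log
```

Applied to the list above, this narrows three rounds of updates down to the hypervisor, microcode, and firmware packages, which are the most likely candidates to compare between a stable and an unstable host.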

Yesterday morning, between 5:30 and 5:50, I rebooted all the servers (Zenbleed patch). Since then I have had random reboots on all 4 servers.

server1: 2x AMD EPYC 7282, ASUS Mainboard

reboot   system boot  4.19.0+1         Thu Aug  3 10:57 - 13:25 (1+02:27)   
reboot   system boot  4.19.0+1         Thu Aug  3 07:33 - 13:25 (1+05:51)
reboot   system boot  4.19.0+1         Thu Aug  3 05:57 - 13:25 (1+07:27)   
reboot   system boot  4.19.0+1         Thu Aug  3 05:36 - 13:25 (1+07:48)

server2: 2x AMD EPYC 7282, ASUS Mainboard

reboot   system boot  4.19.0+1         Fri Aug  4 13:07 - 13:25  (00:18)    
reboot   system boot  4.19.0+1         Fri Aug  4 00:21 - 13:25  (13:04)    
reboot   system boot  4.19.0+1         Thu Aug  3 07:51 - 13:25 (1+05:34)
reboot   system boot  4.19.0+1         Thu Aug  3 05:55 - 13:25 (1+07:30)

server3: 2x AMD EPYC 7282, Supermicro Mainboard

reboot   system boot  4.19.0+1         Fri Aug  4 13:07 - 13:14  (00:06)    
reboot   system boot  4.19.0+1         Fri Aug  4 00:21 - 13:14  (12:53)    
reboot   system boot  4.19.0+1         Thu Aug  3 07:51 - 13:14 (1+05:23)   
reboot   system boot  4.19.0+1         Thu Aug  3 05:55 - 13:14 (1+07:19)

server4: 2x AMD EPYC 7282, Supermicro Mainboard

reboot   system boot  4.19.0+1         Fri Aug  4 00:33 - 13:26  (12:52)    
reboot   system boot  4.19.0+1         Thu Aug  3 05:46 - 13:26 (1+07:40)
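
To compare the four hosts at a glance, the boot records above can be counted per day. A minimal sketch, assuming the same `last reboot` output format as shown here; the function name `boots_per_day` is made up for illustration.

```shell
# Hypothetical helper: read `last reboot` output from stdin and print the
# number of boot records per calendar day, in the order the days appear.
# Assumption: lines look like the ones above, i.e.
#   reboot system boot <kernel> <weekday> <month> <day> <time> ...
boots_per_day() {
    awk '$1 == "reboot" && $2 == "system" && $3 == "boot" {
             day = $6 " " $7
             if (!(day in n)) order[++k] = day
             n[day]++
         }
         END { for (i = 1; i <= k; i++) print order[i] ": " n[order[i]] }'
}

# Usage (run in dom0):
#   last reboot | boots_per_day
```

Each count includes the planned reboot, so only the boots beyond that one are unexpected.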

What can I provide to help you solve the problem?

Hardware issues ruled out, power supply also OK (2 power supplies, 2 independent outlets).

In /var/crash I have just an old file:

ls -al /var/crash/
-rw-r--r--  1 root root 67108864 2022-12-28  .sacrificial-space-for-logs

When one server restarted, I caught it, and it was a full machine restart, including BIOS POST.

Please help

Danp

@Chmura There's a pending fix for a problem with the zenbleed patch. You may want to test it out to see if it resolves your rebooting issue. See here for more details.

Chmura

Thanks for the fast reply.

Now, as a test, I have downgraded this package on server3:

yum downgrade linux-firmware-20190314-5.1.xcpng8.2.noarch

And I will test stability.

On server4 I downgraded all packages to their state as of 27.12.2022:

xen-libs-4.13.4-9.28.1.xcpng8.2.x86_64
message-switch-1.23.2-3.2.xcpng8.2.x86_64
forkexecd-1.18.1-1.1.xcpng8.2.x86_64
vhd-tool-0.43.0-4.1.xcpng8.2.x86_64
1:xs-openssl-libs-1.1.1k-6.1.xcpng8.2.x86_64
xen-hypervisor-4.13.4-9.28.1.xcpng8.2.x86_64
xen-dom0-libs-4.13.4-9.28.1.xcpng8.2.x86_64
2:qemu-4.2.1-4.6.2.1.xcpng8.2.x86_64
xen-tools-4.13.4-9.28.1.xcpng8.2.x86_64
edk2-20180522git4b8552d-1.4.6.xcpng8.2.x86_64
xen-dom0-tools-4.13.4-9.28.1.xcpng8.2.x86_64
xenopsd-0.150.12-1.2.xcpng8.2.x86_64
xenopsd-xc-0.150.12-1.2.xcpng8.2.x86_64
xenopsd-cli-0.150.12-1.2.xcpng8.2.x86_64
xcp-rrdd-1.33.0-6.1.xcpng8.2.x86_64
squeezed-0.27.0-5.xcpng8.2.x86_64
rrdd-plugins-1.10.8-5.1.xcpng8.2.x86_64
gpumon-0.18.0-4.2.xcpng8.2.x86_64
xapi-tests-1.249.26-2.1.xcpng8.2.x86_64
blktap-3.37.4-1.0.1.xcpng8.2.x86_64
xapi-core-1.249.26-2.1.xcpng8.2.x86_64
2:microcode_ctl-2.1-26.xs23.xcpng8.2.x86_64
sm-rawhba-2.30.7-1.3.xcpng8.2.x86_64
rrd2csv-1.2.5-7.1.xcpng8.2.x86_64
kernel-4.19.19-7.0.15.1.xcpng8.2.x86_64
xapi-xe-1.249.26-2.1.xcpng8.2.x86_64
xcp-networkd-0.56.2-1.xcpng8.2.x86_64
openvswitch-2.5.3-2.3.12.1.xcpng8.2.x86_64
xapi-storage-script-0.34.1-2.1.xcpng8.2.x86_64
varstored-guard-0.6.2-1.xcpng8.2.x86_64
sm-2.30.7-1.3.xcpng8.2.x86_64
sm-cli-0.23.0-7.xcpng8.2.x86_64
xcp-ng-xapi-plugins-1.7.2-1.xcpng8.2.noarch
linux-firmware-20190314-5.xcpng8.2.noarch
xapi-nbd-1.11.0-3.2.xcpng8.2.x86_64
xcp-ng-pv-tools-8.2.0-11.xcpng8.2.noarch

Now I will evacuate all VMs from server2 to server3/4 and check the microcode package from the xcp-ng-testing repo.
We'll see what comes out when I use yum update "xen-*" --enablerepo=xcp-ng-testing
Funny weekend

Edit: server3 restarted at 9 PM ;(
server4 and the updated server2 (xen-... 4.13.5-9.35.1.xcpng8.2) are still working.