I have an XCP-ng installation (8.2.1, all patches except the most recent) that restarts at random intervals. The interval is usually a couple of months but has been as short as a week. This started just over a year ago. The server has been running since 2018 (with XCP-ng upgrades along the way) and is in a single-host pool.
This isn't a normal crash. No kernel panic occurs, and there is no indication of a shutdown. The kernel simply stops and is booting again a couple of seconds later.
Kdump is working, but it logs nothing when this happens. When I force a kernel panic, kdump captures it as expected, so I know kdump itself is working.
I would expect this to be a hardware issue; however, the hardware does not restart. The hardware keeps running and only the kernel restarts. I know this from monitoring the hardware and the kernel uptime, and from reviewing log data.
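The kernel-uptime part of that monitoring can be sketched roughly as follows. This is a minimal illustration, not a definitive tool. It assumes a Linux dom0 that exposes /proc/uptime, that it is run periodically (for example from cron), and that the state file path and the function names are hypothetical. Each run stores a sample on disk. If the estimated boot time is later than the previous sample, the kernel must have restarted in between, which narrows down the window in which it stopped.

```python
import json
import os
import time

# Hypothetical location for the previous sample; any persistent path works.
STATE_FILE = "/var/tmp/kernel_uptime_state.json"


def parse_uptime(text):
    """Return kernel uptime in seconds from the contents of /proc/uptime."""
    return float(text.split()[0])


def detect_restart(previous, now, uptime):
    """Return None if the same kernel is still running, else the restart window.

    The boot time is estimated as (now - uptime). If that is later than the
    time of the previous sample, a new kernel booted after that sample.
    """
    booted = now - uptime
    if previous is None or booted <= previous["time"]:
        return None
    return {"last_seen_alive": previous["time"], "new_kernel_booted": booted}


def record_sample(state_path, now, uptime):
    """Load the previous sample, store the current one, report any restart."""
    previous = None
    if os.path.exists(state_path):
        with open(state_path) as handle:
            previous = json.load(handle)
    with open(state_path, "w") as handle:
        json.dump({"time": now, "uptime": uptime}, handle)
    return detect_restart(previous, now, uptime)


if __name__ == "__main__":
    with open("/proc/uptime") as handle:
        current_uptime = parse_uptime(handle.read())
    event = record_sample(STATE_FILE, time.time(), current_uptime)
    if event:
        print("kernel restarted between %s and %s" % (
            time.ctime(event["last_seen_alive"]),
            time.ctime(event["new_kernel_booted"]),
        ))
```

Because the comparison relies on the wall clock, a large clock step between two samples could produce a false report. Running it once a minute keeps the window between "last seen alive" and "new kernel booted" small.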
There is no consistency in time of day or day of week. This usually occurs when the one VM on the server is idle.
I'm unable to find any indication in any log that something has gone wrong. All I can find is the kernel starting up again.
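One way to double-check the logs around each restart is to pull out the last few lines written before every kernel boot banner. The sketch below is only an illustration under stated assumptions: it relies on the "Linux version" banner the kernel prints early in each boot, the function name is hypothetical, and the log file to scan (for example /var/log/kern.log, an assumed path) is passed on the command line.

```python
import sys
from collections import deque

# The kernel prints a "Linux version ..." banner early in every boot.
BOOT_MARKER = "Linux version"


def lines_before_boot(lines, context=3):
    """Return a list of (boot_line, preceding_lines) for each boot banner.

    preceding_lines holds up to `context` lines logged just before the
    banner, i.e. the last things written before the kernel came back up.
    """
    results = []
    recent = deque(maxlen=context)
    for raw in lines:
        line = raw.rstrip("\n")
        if BOOT_MARKER in line:
            results.append((line, list(recent)))
        recent.append(line)
    return results


if __name__ == "__main__":
    # Usage: python last_lines_before_boot.py <logfile>
    with open(sys.argv[1], errors="replace") as handle:
        for boot_line, before in lines_before_boot(handle, context=5):
            print("=== boot: " + boot_line)
            for entry in before:
                print("    " + entry)
```

Rotated log files would need to be scanned separately. A clean shutdown normally leaves shutdown messages just before the banner, so their absence is consistent with the kernel stopping abruptly.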
Over the past year I've tried many hardware configurations and updated the firmware on the system board and the RAID controller, with the same results. I have also re-installed XCP-ng, and the issue has persisted through the various patches applied during that time.
If there is a way this could be caused by hardware without leaving any trace and without rebooting the hardware, I don't know what it could be, but I'd be happy to hear any ideas.
Does anyone have any thoughts on what I could monitor or look into? The one thing I've not done is move the one VM on this host to another host. I don't suspect the VM itself is the cause, because there is usually no load on it when the restart occurs. Moving the VM triggers licensing entanglements that result in about 24 hours of downtime and require a re-install of the software through the provider's support, so I've not done this for testing.