Hello everyone,
I've got a strange issue on my XCP-ng installation at home.
Short story:
In short, a CentOS 8 VM that I have keeps crashing whenever I try to perform updates or give it some IO load (more on this later). I seem to have the same issue with a Windows Server 2019 VM which also crashes sometimes while performing updates (but am unable to reproduce this consistently). A brand new CO8 VM does not have this issue neither other older CO VMs (5 and 6) on the same host.
Symptoms:
When I say crashing, from the XO console for the CO8 VM it simply looks like a reboot. I lose SSH and I see the VM rebooting, no error messages nor anything. There is nothing in the VM's logs either following the crash. I tried finding logs in XCP-ng and either they're empty or I'm not looking in the right place.
The VM is not running any services/workload either. The crash is reproducible whenever I run a yum update or an IO benchmark (such as fio)
Troubleshooting that I did:
At first, I suspected RAM issues. I ran memtest86 overnight with no errors reported. I even changed out the RAM from another working computer, but the same issues happened. So I do not suspect this is bad RAM.
The XCP-ng install is running in Software RAID1 (configured by the XCP-ng installer) over a pair of RAID edition SATA disks. I do not suspect the disks are bad, as other VMs have no issues neither are there any errors reported in logs.
I then thought perhaps this was a bug with XCP-ng 8.1, so I upgraded to 8.2 recently. Same issue persists with no difference.
I also doubt that it's a hardware problem in general as my other CentOS 5 and CentOS 6 VMs run rock solid, no matter how hard I hit the IO. A brand new CO8 VM was also able to complete IO benchmarks without crashing.
Nothing crazy was done on the crashing CO8 VM either, it's running the stock CO kernel. I even uninstalled the xen guest tools in case it was an issue with them.
My question is as follows:
What could I do to to troubleshoot this further? Does XCP-ng have a log that could give me clues?
I'm happy that this is happening in my home testing environment and not production, but I'd like to resolve this issue and gain insight into how I could troubleshoot this if I'm ever faced with such a thing in prod.
Thanks!