New XCP-ng server run-away
Greetings, I am new to XCP-ng but not to virtualization. I have an issue that I hope you can help with:
I recently bought an HP DL-160 Gen8 in the following configuration for a home lab.
HP DL-160 Gen8
and migrated a server to this host machine. After a while, both the host and the migrated guest started to run erratically and stalled the XCP-ng server and my internal LAN, which was disappointing.
Then I created a VM on the server with the following configuration:
Oracle Linux 8
This machine is not busy and was a very simple install.
Installation and configuration of the server went without issue and as expected.
However, I have a problem. After the guest has run for a while (10 minutes, 1 hour, 1 day, who knows how long), CPU and RAM usage on this guest shoot through the roof and the server starts to scream like a jet about to take off, for no apparent reason.
I'm not sure how to resolve this or how to troubleshoot it, other than to turn off the host and start the whole process over again.
Any support appreciated,
And Many Thanks
What does xl dmesg show? Also check your guest logs for anything weird; the problem might be in there too.
And the usual suspect: hardware. Check that you are on the latest BIOS and firmware versions, run a memory check, etc.
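For anyone following along, the log checks suggested above could be sketched roughly like this. It is only a starting point, meant to be run as root in dom0. The function names, the output directory and the keyword list are my own assumptions, not anything XCP-ng ships.

```shell
# Sketch: save hypervisor and dom0 kernel messages, then flag suspicious lines.
# scan_log, collect_xen_logs, the default directory and the keyword pattern
# are illustrative assumptions.

scan_log() {
  # Print matching lines (with line numbers) that mention common trouble keywords.
  grep -inE 'error|fail|mce|machine check|panic|out of memory|thrott' "$1"
}

collect_xen_logs() {
  out_dir="${1:-/tmp/xen-diag}"
  mkdir -p "$out_dir" || return 1
  if command -v xl >/dev/null 2>&1; then
    xl dmesg > "$out_dir/xl-dmesg.txt" 2>&1        # Xen hypervisor message buffer
  else
    echo "xl not found (not running in dom0?)" > "$out_dir/xl-dmesg.txt"
  fi
  dmesg > "$out_dir/dom0-dmesg.txt" 2>&1 || true   # dom0 kernel ring buffer
  for f in "$out_dir"/*.txt; do
    echo "== $f =="
    scan_log "$f" || echo "(no matches)"
  done
}
```

Call it as collect_xen_logs /tmp/xen-diag right after the fans spin up, so the buffers still hold whatever happened. An empty result does not prove the hardware is fine, it only means nothing matched the keywords.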
@kulmacet First make sure the BIOS and iLO are up to date (or for a G8, the latest/last version). There are known issues with some older versions.
Check iLO and the IML to see if there are hardware errors listed. Check the BIOS settings; defaults are NOT always the best choice. Normal fans suddenly running fast points to very high heat/usage or hardware issues, and HP likes to spin the fans fast when the server has hardware issues. Using 4 cores on a 16 core machine should not cause high load. Disable HT, at least as a test, since XCP/Xen does not like HT on some older CPUs. Check the IPMI SEL for additional hardware issues (using ipmitool in dom0). You can also run the HP diagnostics and other system tests to see if they catch any issues.
I have many DL360p G8 systems and they work well. The DL160 is a cheaper hardware design and known to have occasional hardware problems but XCP should work if the system is healthy.
The IML will log hardware errors that the system (dmesg) won't see.
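The IPMI and HT checks above could look something like the sketch below, run from dom0. It assumes ipmitool is installed and the IPMI kernel drivers are loaded. The function names ht_enabled and hardware_report are made up for illustration.

```shell
# Sketch: hardware checks from dom0. ht_enabled and hardware_report are
# illustrative names; ipmitool availability is an assumption.

ht_enabled() {
  # Reads `xl info` output on stdin; succeeds if threads_per_core is above 1.
  awk -F: '/^threads_per_core/ { found = ($2 + 0 > 1) } END { exit !found }'
}

hardware_report() {
  if command -v ipmitool >/dev/null 2>&1; then
    ipmitool sel elist              # System Event Log, decoded entries
    ipmitool sdr type Temperature   # temperature sensor readings
    ipmitool sdr type Fan           # fan sensor readings
  else
    echo "ipmitool not found"
  fi
  if command -v xl >/dev/null 2>&1; then
    if xl info | ht_enabled; then
      echo "Hyper-threading appears enabled"
    else
      echo "Hyper-threading appears disabled"
    fi
  fi
}
```

Running hardware_report once while the server is quiet and once while the fans are screaming gives two sets of sensor readings to compare. The IML itself still has to be read through iLO.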
Everything in the hardware and BIOS looks correct, but I still get the runaway server. Even the logs do not report any issues. It's just not running correctly, and I am surprised at the lack of logging.
Still not correct.
This is really weird, I don't remember seeing this behavior in the past, like ever :thinking:
@fohdeesha any idea?
@kulmacet I've had something like this, and it turned out to be a dying BMC controller in the end.
I had the server motherboard replaced under warranty, and poof, problem stopped (same CPUs, RAM, etc).
Maybe I'm looking at this wrong... It's a feature!
@kulmacet Can you recreate this with other VMs, or just this specific Oracle Linux VM? I would shut the problematic VM off, spin up a new Debian VM for example, and see if the issue happens with that VM as well. Outside of that, it's really looking like a hardware issue. Also, double check that the iLO and BIOS firmware are at the latest versions. I can almost guarantee it shipped with ancient versions, and many issues like this have been patched relatively recently.