XCP-ng v8.3 Host Crashing Upon Console Login and Performing Any Action
-
Good-day Folks,
ENVIRONMENT:
- XCP-ng v8.3 (single node pool)
- XO v5.101 (updated on 12/01/2024 - unable to login and provide the branch ID)
- Server Hardware: HP ProLiant DL360p Gen8
- System ROM/BIOS: vP71 05/24/2019
I'm dealing with a weird situation on a single node pool running XCP-ng v8.3. Over last weekend I upgraded XO to v5.101, which went very smoothly. After it completed and I logged in, I noticed a notification on the host informing me that there were updates to be applied. So I mindlessly applied them and it restarted the xe-toolstack. Considering I'd done this many times before, I didn't think to check on anything, so I walked away.
It wasn't until yesterday, when I attempted to log in to XO, that I noticed it wasn't responding. I then attempted to SSH into XO, but the connection attempt was immediately refused. Getting a bit anxious now, I attempted to SSH into the host to see if perhaps the XO VM had stopped, and that SSH connection attempt was immediately refused as well. So I connected to the iDRAC and, to my horror, saw that no VMs were running. Any attempt to log in at the console and perform any action immediately results in the following error.
Anybody else ever seen this?
One other important factor to point out:
While connected via iDRAC I noticed that the System Health was showing as degraded. Upon inspection, it seems one of the HDDs is about to fail (and may well have already failed quietly without yet being marked as such). As such, I've ordered some replacement drives and have resigned myself to the notion that I may have to rebuild the RAID array and rebuild this host. That's because this is a RAID 0 setup.
I'm not the kind that likes to play the blame game, but I am interested in finding root causes so that knowledge can be shared back to the project to make XCP-ng even better. As such, if anyone has any insight into the above error, please share.
As it stands now, I cannot do anything with the host - so not even sure how to pull logs.
-
Wow, that's not great indeed.
The console being too small is pretty odd, as it should be at least the standard 80x24 for any kind of VGA.
Have you tried, from the iDRAC, to log in on the third terminal (alt+F3)? It would be good to see if you can log in or if the update + failing disk really broke everything. On the 2nd terminal (alt+F2) you should have system messages too, which in the ideal case should be empty…
-
@bleader I know, right! Glad this isn't a major production system - mainly for my church and primarily running infrastructure services (DHCP, ADDS, DNS, etc.). Fortunately, I have a physical DC which has the DNS role as well, so I can do without DHCP while I troubleshoot. If need be, I can assign static addresses to critical systems.
Now to your suggestion: I tried to access alt+F2 and alt+F3 from iDRAC and neither was recognized, so it looks like I may have to make a site visit and do it directly from the console.
However, it's interesting that after leaving the system alone in that error state, it eventually went into emergency mode (see screenshot below).
-
Unfortunately, looks like my issue was ultimately a failed disk in a RAID 0 array. The errors XCP-ng was throwing were definitely misleading.
I just rebuilt the array as RAID 10 and have reinstalled XCP-ng v8.3. I should have the entire virtual infrastructure rebuilt in no time, before services this weekend.
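For anyone wondering why the change of RAID level matters here, below is a minimal Python sketch comparing the two layouts. The drive count and drive size are hypothetical (they are not this host's actual configuration), and the function name raid_summary is made up for illustration. It only models the textbook behaviour: RAID 0 stripes with no redundancy, while RAID 10 stripes across mirrored pairs and is guaranteed to survive one drive failure (more only if the failures land in different pairs).

```python
def raid_summary(level: str, drives: int, size_tb: float) -> dict:
    """Return usable capacity and the guaranteed number of drive
    failures the array survives. Hypothetical helper for illustration."""
    if level == "raid0":
        # Striping only: all capacity is usable, but losing any one
        # drive loses the whole array.
        if drives < 2:
            raise ValueError("this sketch assumes at least 2 drives for RAID 0")
        return {"usable_tb": drives * size_tb, "tolerated_failures": 0}
    if level == "raid10":
        # Stripe over mirrored pairs: half the raw capacity is usable.
        # Worst case, a second failure hits the same pair, so only one
        # failure is guaranteed to be survivable.
        if drives < 4 or drives % 2 != 0:
            raise ValueError("RAID 10 needs an even number of drives, at least 4")
        return {"usable_tb": drives * size_tb / 2, "tolerated_failures": 1}
    raise ValueError(f"unsupported level: {level}")


# Assumed example: four 1 TB drives (not the actual hardware in this thread).
for level in ("raid0", "raid10"):
    print(level, raid_summary(level, drives=4, size_tb=1.0))
```

The trade-off is half the usable space in exchange for the host surviving a single disk failure, which is exactly the failure that took this host down.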
-
Good evening all,
Just a quick update. It's been a couple of days now after the rebuild and everything seems to be humming along fine, so I believe this topic can be marked as resolved. I can confidently conclude that this wasn't an XCP-ng issue, although the error message seems a bit misleading.
-
Thanks for letting us know, and I'm happy you have things working nicely now.
I think to mark this as resolved you need to convert your original post to a question, and it can then be marked as resolved. I actually cannot do it myself; I think only a few people at Vates have the permission to do it for others.
-