VMs are abruptly getting shutdown
-
Is this only occurring with some VMs or all of them? Have you checked the logs for any clues? https://docs.xcp-ng.org/troubleshooting/log-files/
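If it helps, here is a rough sketch of how you could sweep the host logs for obvious clues. It is not an official tool. The log paths are assumptions that should be checked against the page linked above, and the keyword list is purely illustrative; `find_suspect_lines` and `scan_logs` are hypothetical helper names.

```python
import re

# Assumed log locations; confirm them against the log-files page linked above.
LOG_PATHS = [
    "/var/log/xensource.log",
    "/var/log/kern.log",
    "/var/log/daemon.log",
    "/var/log/xen/hypervisor.log",
]

# Illustrative keywords only; adjust to what your hardware actually reports.
KEYWORDS = ["panic", "machine check", "mce", "watchdog", "thermal", "shutdown"]


def find_suspect_lines(lines, keywords=KEYWORDS):
    """Return (line_number, text) pairs for lines containing any keyword as a whole word."""
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(k) for k in keywords) + r")\b",
        re.IGNORECASE,
    )
    return [
        (n, line.rstrip("\n"))
        for n, line in enumerate(lines, start=1)
        if pattern.search(line)
    ]


def scan_logs(paths=LOG_PATHS):
    """Scan each readable log file and return {path: [(line_number, text), ...]}."""
    results = {}
    for path in paths:
        try:
            with open(path, errors="replace") as fh:
                hits = find_suspect_lines(fh)
        except OSError:
            continue  # missing or unreadable log on this host: skip it
        if hits:
            results[path] = hits
    return results


if __name__ == "__main__":
    for path, hits in scan_logs().items():
        print(f"== {path}")
        for n, text in hits[-20:]:  # show only the last 20 hits per file
            print(f"{n}: {text}")
```

Run it on the affected host shortly after it comes back up, and look at the entries just before the gap in timestamps. An empty result does not prove the host is healthy; a sudden power loss or hardware reset often leaves nothing in the OS logs at all.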
-
We have faced multiple occurrences of server failures while using XCP-ng.
We are using version 8.2.0, and the ISO is booted on HPE physical servers. The device shuts down unexpectedly, causing the virtual machines on the server to crash and leading to downtime for our running application. We analyzed the log files of both the bay and the VMs but could not find anything that explains why the bay shut down. Please guide us step by step on how to analyze and resolve the issue.
-
This appears to be the same issue that you posted about back in January. Why haven't these hosts been updated to 8.2.1 and fully patched?
-
Will the issue be resolved after the upgrade? How can we determine the reason for virtual machines shutting down automatically?
-
Will the issue be resolved after the upgrade?
Maybe... but you won't know until you try it.
How can we determine the reason for virtual machines shutting down automatically?
If your hosts are failing, then that would explain why the VMs are shutting down.
- Which model of HPE servers are you running?
- What version of the BIOS is currently installed?
- What brand of NICs are installed?
-
@Danp
1. We use the HPE ProLiant BL460c Gen10 server model.
2. Our BIOS version is 2.72_09-29-2022.
3. Our network interface cards (NICs) are manufactured by Emulex Corporation.
-
@lritinfra said in VMs are abruptly getting shutdown:
Our BIOS version is 2.72_09-29-2022.
Once again, you are not current on patching your systems as there have been 6 new BIOS releases since that one.
-
@Danp We're running a total of 16 bays with the same BIOS and the same XCP-ng ISO, so why does this issue occur in only two or three of the production bays?
-
Faulty memory / hardware?
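One way to narrow this down is to write down what you know about each bay and look for anything that sets the failing ones apart from the healthy majority. Below is a minimal sketch of that comparison. The function name `find_outliers`, the bay names, and the `dimm_batch` attribute are all hypothetical placeholders; fill in real values from your own inventory.

```python
from collections import Counter


def find_outliers(inventory, suspects):
    """For each attribute, list suspect hosts whose value differs from the
    most common value among the healthy (non-suspect) hosts."""
    healthy = {h: attrs for h, attrs in inventory.items() if h not in suspects}
    all_attrs = sorted({key for attrs in inventory.values() for key in attrs})
    report = {}
    for attr in all_attrs:
        values = [attrs.get(attr) for attrs in healthy.values()]
        if not values:
            continue  # no healthy hosts to compare against
        baseline, _ = Counter(values).most_common(1)[0]
        differs = {
            h: inventory[h].get(attr)
            for h in suspects
            if h in inventory and inventory[h].get(attr) != baseline
        }
        if differs:
            report[attr] = {"baseline": baseline, "differs": differs}
    return report


if __name__ == "__main__":
    # Placeholder data: the bay names and dimm_batch values are made up.
    inventory = {
        "bay01": {"bios": "2.72_09-29-2022", "dimm_batch": "batch-1"},
        "bay02": {"bios": "2.72_09-29-2022", "dimm_batch": "batch-1"},
        "bay03": {"bios": "2.72_09-29-2022", "dimm_batch": "batch-2"},
    }
    print(find_outliers(inventory, suspects={"bay03"}))
```

If nothing differs on paper (same BIOS, same ISO, same firmware), that points back toward a physical fault in those specific blades, which is where hardware diagnostics come in.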
-
@lritinfra Are there any entries in the HPE iLO logs? Its health monitoring may give you some clues.
Depending on the maintenance windows for those problematic servers, is it possible to run Intelligent Provisioning and have it perform the in-depth tests of the Insight Diagnostics tools?
The Insight Diagnostics tools test all parts of the system hardware, including drives, memory, storage, etc., and report any parts that fail these tests.
The in-depth tests are also more thorough than the standard ones, so they are more likely to ferret out hardware issues, as long as the tool is up to date so it can recognize issues when the firmware on the hardware is tested.
-
@john-c We've reviewed both the VM and bay logs and found no records related to the shutdown. Currently, Intelligent Provisioning is disabled in our system, and we're unable to enable it as we're currently in production.
-
@lritinfra said in VMs are abruptly getting shutdown:
@john-c We've reviewed both the VM and bay logs and found no records related to the shutdown. Currently, Intelligent Provisioning is disabled in our system, and we're unable to enable it as we're currently in production.
Unfortunately, HPE Intelligent Provisioning is the most reliable way to run the hardware diagnostics, as the online version of Insight Diagnostics is only available for Windows or Linux. Though XCP-ng is Linux-based, it's not a good idea to install and run the Linux version, because XCP-ng is a custom Linux instance dedicated to being a hypervisor host.
HPE Insight Diagnostics Online also needs direct access to the hardware in order to work, so it can't run in a VM.
The software package could well break the XCP-ng installation, assuming it is even compatible enough to run reliably.
Are there any policies or processes that could be carried out temporarily to run the tests?
These repeated abrupt shutdowns can't be doing the VMs, or the servers themselves, any good, because a crash at the wrong moment can really do a number on data. One such wrong moment is when the VM or an app running on it is writing data: the crash interrupts the write and leaves the file incomplete, which corrupts it.
-
@lritinfra Something else to consider: outside of HPE iLO, HPE SUM, or HPE SPP, HPE Intelligent Provisioning is the main way to update the server's hardware firmware if you aren't using individual RPMs or SCEXE files for the task. HPE Intelligent Provisioning and HPE SPP can update both firmware and BIOS.
Not all firmware updates come in a format compatible with HPE iLO. I'm not sure if it has changed, but an Administrator Password set on the BIOS (at minimum) also locks out (disables) access to the Erase option in HPE Intelligent Provisioning. At least it does on my only HPE server running an up-to-date BIOS, HPE iLO, and HPE Intelligent Provisioning.
So keeping HPE Intelligent Provisioning disabled doesn't help with staying up to date enough to fix vulnerabilities and bugs at the hardware or firmware level.