Good-day Folks,
I have a small lab environment, running a proof-of-concept using XOA + XCP-ng, to see if this can be a viable solution for air-gapped networks within my organization. The testing has been going smoothly until last night, when I took a snapshot of a VM - something I've done many times in this environment. I generally power down the VM before I grab the snapshot, then power the VM back up. However, this time upon powering the VM up, it booted into the EFI Shell.
Here's what my environment looks like:
- XOA (version not shown - this is an air-gapped instance built for my POC)
- XCP-ng v8.3.0 (VMH01 and VMH02)
Last night I observed the following:
-
In XOA, the Pool was connected to both hosts (VMH01 and VMH02) but I couldn’t disable the connection. Each time I tried, I got the following error: “MISCONF Redis is configured to save RDB snapshots, but it’s currently unable to persist to disk. Commands that may modify the data set are disabled, because this instance is configured to report errors during writes if RDB snapshotting fails (stop-writes-on-bgsave-error option). Please check the Redis logs for details about the RDB error.”
Strangely enough, this error message does not appear in the XOA logs.
-
VMH01 (which is the pool master) is online, responds to pings but refusing all SSH access, so I was not able to check the Redis logs (as directed in the error message). A physical console access needs to be established with VMH01 to further troubleshoot. Given that we’re testing viability of this solution, I didn’t just want to reboot the system without first working through to identify the root cause.
I came in today and attempted to make a physical console connection with the problematic host (VMH01), but I got no video and no keyboard functionality. So, I had no choice but to do a hard reboot. It came back up, rejoined the pool, and assumed the Master role (I assume it was never relinquished).
XOA is up (running on VMH01) and all the VMs are up, but a couple are still booting to the EFI Shell, so need your help figuring out why. The other VMs that are booting fine are using the same SR.
As this is a Proof-of-Concept, I don't want to push the easy button and simply revert the VM to the previous snapshot. I want to identify the root cause and document the resolution. Any guidance is greatly appreciated, thanks.