Good day, folks,
A few days ago, I got myself into a little jam while trying to do what I thought was the proper way to reboot the only storage server in my test lab. I managed to get myself out of trouble, but I'm here for guidance on how I could have done things differently. So, here's what happened.
For those who don't know, I'm running a small test lab where I'm evaluating the Vates VMS stack as a viable drop-in replacement for VMware's VCF stack. Unfortunately, I don't have a lot of funding, so I don't have a lot of hardware. I only had 4 physical machines available to dedicate as servers: three became XCP-ng hosts, and the last one became an Active Directory Domain Controller, DHCP server, and file server (SMB/CIFS and NFS). Attached to that same box is an 8TB external HDD, which I'm sharing out over NFS and using as Remotes (to test the backup features of XO). This setup isn't ideal, but hey, it's what I've got, and it works. Actually, the fact that the Vates VMS stack works under these conditions at all is a huge testament to the resiliency of the solution. Anyway, I digress; back to the issue at hand.
Given the above setup, a need arose to reboot this server (let's call it DC01). After reading through this documentation - https://docs.xen-orchestra.com/manage_infrastructure#maintenance-mode-1 - I decided it was a good idea to place the SRs into maintenance mode before the reboot. I had done this before in another environment (at my church) and never ran into the problems I'm about to describe (in hindsight, I think the difference was that the VDI of the XO VM there was local to the host it was running on).
When I clicked the button to enable maintenance mode, it gave me the usual warning that running VMs would be halted, so I clicked OK to proceed. What I didn't realize was that because the XO appliance's VDI lived on the SR I had just put into maintenance mode, I would immediately lose connectivity to it, and it would subsequently refuse to start. I had a backup plan: use XCP-ng Center (vNext) to connect to the pool master and try to start the XOA VM, hoping I'd be prompted to move the VDI - but that prompt never appeared. The startup attempts kept failing with a timeout error. Running out of ideas, I decided to reboot all three hosts, hoping that once they came back up they would reconnect the SRs and I could then start XOA. Unfortunately, the reboots took a very long time to complete - so long that I left the lab (around 9pm) and returned around 2am. I'm not sure exactly how long the reboots actually took, but when I got back all the hosts were up, yet none of the SRs were connected.
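For what it's worth, here's roughly what I was attempting from the pool master's shell, sketched as a guarded script. The name-label `XOA` is just what my appliance is called in my lab; substitute your own.

```shell
#!/bin/sh
# Sketch: starting the XO appliance from the pool master's CLI when XO itself is down.
# Assumes the VM's name-label is "XOA" (adjust to your lab). Run on the pool master.
if command -v xe >/dev/null 2>&1; then
    # Look up the VM's UUID by its name-label
    VM_UUID=$(xe vm-list name-label=XOA params=uuid --minimal)

    # Try to start it; in my case this kept failing with a timeout,
    # because the SR holding its VDI was still in maintenance mode
    xe vm-start uuid="$VM_UUID"
else
    echo "xe CLI not found; this sketch only does anything on an XCP-ng host"
fi
```

The guard at the top just makes the sketch safe to paste anywhere; on a real host it falls through to the `xe` calls.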
At this point, my thinking was that the SRs didn't reconnect on each host because XOA wasn't running to instruct them to (I don't know if that's entirely accurate). I googled around and found that I could reattach the SRs directly on each host using the xe pbd-unplug/pbd-plug commands. Strangely, while I was able to run those commands on each host's CLI without any errors, the SR reconnected on only one host. Things only got sorted once I used XCP-ng Center (vNext) to perform a repair on the SR: the repair dialog clearly showed that the SR was connected to Host #2 but not to #1 and #3. I proceeded through the wizard, it successfully repaired the connections, and I was then able to start the XOA VM and get the lab back up and running.
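For reference, here's a sketch of the replug sequence I pieced together. A shared SR has one PBD per host, and all of them need to be plugged; the SR UUID is an assumption here (find yours with `xe sr-list`).

```shell
#!/bin/sh
# Sketch: re-plugging a shared SR's PBDs after the hosts come back up.
# Pass the SR's UUID as the first argument (look it up with `xe sr-list`).
SR_UUID="${1:-}"

if command -v xe >/dev/null 2>&1 && [ -n "$SR_UUID" ]; then
    # Show each host's PBD and whether it is currently attached
    xe pbd-list sr-uuid="$SR_UUID" params=uuid,host-uuid,currently-attached

    # Plug every PBD for the SR (roughly what the "repair" wizard does for you)
    for PBD in $(xe pbd-list sr-uuid="$SR_UUID" params=uuid --minimal | tr ',' ' '); do
        xe pbd-plug uuid="$PBD"
    done
else
    echo "xe not found or no SR UUID given; nothing to do"
fi
```

Running this from the pool master should cover all hosts, since `xe` can target any PBD in the pool; I suspect my mistake was plugging PBDs one host at a time and missing some.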
So, my ultimate question:
When the VDI of the XOA or XOCE VM resides on an SR that's being targeted for maintenance mode enablement, what is the proper procedure?
Thanks in advance to anyone who reads through my long narration and then offers a response. You are very much appreciated!