Recovering hosts from a "fire": Multiple hosts in a pool becoming the pool master
-
This is technically a response to this post, but since that issue was resolved (and my fix wasn't the same), I thought I'd do a write-up here in case some poor soul finds themselves in the same situation. I'd also like suggestions on how I could've done this better.
Background
My homelab has some key oddities that should be mentioned first.
- My homelab is entirely colocated. I have no physical access to my servers.
- I have a VPN connection from the colocation to AWS. Since I didn't have direct router access (from the colocation) at the time, this was acting as my reverse proxy out to the public internet with Cloudflare ZeroTrust in front for some security (it's a homelab, I'm not storing HIPAA info, I'm not too concerned about it).
- On AWS, I have one Docker VM with some services and a reverse proxy. Another VM is my XO From Source that manages the colocated servers. My last VM is a Windows VM, which makes managing my homelab easier than opening more ports and configuring more routing.
The Issue
One day, I logged into XO to add another host to my pool, and found that the pool didn't show up. So I restarted and updated XO, which didn't resolve the issue. I went to the BMC on one of my servers, and found this:
I restarted the server, but after 300 seconds (and every 300 seconds subsequently) the error would appear again. I logged into the BMC on my other servers, and they were in a similar state.
Now for the reason I mentioned the post at the beginning of my post. I could not SSH into Dom0 because the management network and NICs were no longer seen on any of my XCP-ng hosts despite showing up properly in `ifconfig`.
After many hours of troubleshooting, I eventually found out that each of the hosts in the pool thought it was the "master" server, each with a different list of hosts it thought were slaves. Only one server would be accessible from XO at a time, and I couldn't force remove them from the pool since all of them thought they were the pool master.
Resolution
I figured I'd have to migrate these hosts out of the pool somehow. I ended up installing XCP-ng on another server to get another pool created (although I'm sure a VM would have worked if necessary). From XO, I could select "Add Hosts" from the pool option and move one server at a time (`xe pool-join` probably would have worked too, but after 5 hours in the CLI that day, I was ready to do things from the UI). Not only did it move the host to the new pool, but the management network came back and the host was accessible, problem resolved!
RCA/Questions
- This was not entirely an XCP-ng fault. The day that the issues started happening, there were record temperatures in the environment where these servers were located (105F or 41C), causing them to power off in the first place.
- Should I have done anything differently? I couldn't find a way to forcefully demote a server from Master to Slave.
-
Did you have HA enabled on this pool? From your description, it sounds like they each promoted themselves to the pool master because the master was no longer reachable.
-
@Danp That makes sense, I did have HA enabled
-
@nick-lloyd There's a file on each host that indicates whether it is a slave / master. `cat /etc/xensource/pool.conf` will display `master` on the pool master and `slave:<IP address>` on other pool members, where <IP address> is the IP of the pool master.
-
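To make the check above easier to repeat across several hosts, here is a minimal sketch that reads the file and prints the role it records. The function name `pool_role` and the optional path argument are hypothetical (added only so it can be tried against a copy of the file). It also assumes the file holds exactly `master` or `slave:<IP address>` as described above.

```shell
#!/bin/sh
# Print the pool role recorded in a pool.conf-style file.
# pool_role is a hypothetical helper, not part of XCP-ng.
# The path defaults to the one mentioned in this thread.
pool_role() {
    conf="${1:-/etc/xensource/pool.conf}"

    # $(...) strips any trailing newline, so both "master" and "master\n" match.
    content="$(cat "$conf")" || return 1

    case "$content" in
        master)  echo "master" ;;
        slave:*) echo "slave of ${content#slave:}" ;;
        *)       echo "unrecognized: $content"; return 1 ;;
    esac
}
```

Running `pool_role` on each host (from the BMC console if SSH is down) should show at a glance whether more than one host reports `master`.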
@Danp Just to make sure I have this correct: if I had edited that file, entered the master IP address, and rebooted the hosts, that would have resolved the issue?
-
@nick-lloyd I think the answer would depend on the state of your XAPI database, given that each host thinks that it is the pool master. To convert one or more of the hosts to a slave, you would need to:
- edit the file so that it contains `slave:<IP address of master>` instead of `master`
- run `mv /var/xapi/state.db /var/xapi/state.db-old` to get rid of the old xapi database
- reboot the host
Once rebooted, the xapi database will be regenerated by syncing with the pool master.
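The steps above could be sketched as a small shell function. This is only an illustration under assumptions, not a tested recovery procedure. The function name `demote_to_slave`, its arguments, and the extra `.bak` copy of the role file are hypothetical additions. The paths default to the ones named in this thread, and the reboot is deliberately left to the operator.

```shell
#!/bin/sh
# Hypothetical helper sketching the three steps listed above.
# Usage: demote_to_slave <master IP> [pool.conf path] [state.db path]
demote_to_slave() {
    master_ip="$1"
    conf="${2:-/etc/xensource/pool.conf}"
    db="${3:-/var/xapi/state.db}"

    if [ -z "$master_ip" ]; then
        echo "usage: demote_to_slave <master IP> [conf] [db]" >&2
        return 1
    fi

    # Not part of the original steps: keep a copy of the old role file
    # so the change can be undone.
    cp "$conf" "$conf.bak" || return 1

    # Step 1: record the real pool master instead of "master".
    printf 'slave:%s\n' "$master_ip" > "$conf" || return 1

    # Step 2: move the old xapi database aside rather than deleting it.
    if [ -e "$db" ]; then
        mv "$db" "$db-old" || return 1
    fi

    # Step 3: the reboot is left to the operator.
    echo "$conf now reads: $(cat "$conf"). Reboot the host to resync with the master."
}
```

For example, `demote_to_slave 192.0.2.10` (a placeholder address) would be run on each host that should stop acting as master, followed by a reboot of that host.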