Re-add a repaired master node to the pool
-
I am doing a lot of testing before putting my environment into production.
Suppose the pool has two nodes, one master and one slave. If the master node fails due to a hardware issue, I saw that the slave can be promoted to master using the command "xe pool-emergency-transition-to-master".
But when the old master server is repaired, how can I add it back? Won't I have two masters at the same time? Will a conflict occur?
In the tests I performed, when shutting down the master node, the VMs running on it were also shut down and not migrated to the slave node.
Links consulted:
https://xcp-ng.org/forum/topic/8361/xo-new-pool-master
https://xcp-ng.org/forum/topic/4075/pool-master-down-what-steps-need-done-next
-
The process in the topic you listed would step you through standing up a new pool master.
@cairoti said in Re-add a repaired master node to the pool:
In the tests I performed, when shutting down the master node, the VMs running on it were also shut down and not migrated to the slave node.
Are you using shared storage between your master and slave servers? A NAS/SAN?
-
@DustinB I use a dedicated Dell SAN.
-
@cairoti Are you using Xen Orchestra (built from the sources, or the paid appliance)?
-
@cairoti It's explained here:
https://docs.xenserver.com/en-us/citrix-hypervisor/dr/machine-failures.html#master-failures
Quote:
If you repair or replace the server that was the original master, you can simply bring it up, install the Citrix Hypervisor software, and add it to the pool. Since the Citrix Hypervisor servers in the pool are enforced to be homogeneous, there is no real need to make the replaced server the master.
Now, there is a catch. I'm not sure what happens with the old master from a pool perspective after a new master is delegated. Is it still considered (and shown) as a member of the pool, just shut down, or is it kicked out of the pool? Anyway, if the old master is returned to the pool, i.e. a join operation is performed, then its configuration is reset and it will not cause any conflict.
If you don't want to risk anything, the best way to go is to remove the old master from the pool, reinstall it, and re-add it. That's the clean way. The reinstall will make the old master forget it was ever a master.
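For what it's worth, a rough sketch of that clean path with xe commands; the UUID, IP, and password are placeholders, not values from this thread:
# On the new master: drop the old master's record from the pool
xe host-forget uuid=<old-master-uuid>
# On the repaired host, after a fresh XCP-ng install: join the existing pool
xe pool-join master-address=<new-master-ip> master-username=root master-password=<password>
The join resets the joining host's pool state, which is why it won't come back thinking it is still a master.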
-
@DustinB I use the open community version.
-
@cairoti said in Re-add a repaired master node to the pool:
@DustinB I use the open community version.
I think you should be fine to seize the master role, rebuild your second host, and then, as Olivier said in one of the links you posted, scrub the old pool master. Though I'm not sure how that specific operation is performed.
-
@cairoti I have had this happen to me.... With HA enabled, when the master fails, a new pool member becomes the new master. With HA enabled and shared storage, designated HA VMs should be restarted elsewhere in the pool.
If you have stand-alone (per server) storage then VMs on the dead server can't be restarted because their storage is gone.
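For reference, a rough sketch of turning HA on so that protected VMs restart automatically; the SR and VM UUIDs are placeholders:
# Enable HA, using a shared SR for the heartbeat
xe pool-ha-enable heartbeat-sr-uuids=<shared-sr-uuid>
# Mark a VM as protected so HA restarts it after a host failure
xe vm-param-set uuid=<vm-uuid> ha-restart-priority=restart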
If you don't have HA enabled then you need to manually force a new master and restart failed VMs.
To force a new master on a pool member, use the command
xe pool-emergency-transition-to-master
from an XCP-ng console. You can't do much to the pool without a master. If your old master just needs a quick reboot, then just do that without changing the pool. If your master is going to be down for a while (more than a few minutes), then pick a new master beforehand, or force the change afterward. It may take a while to force another pool member to become the new master, as the members have to time out contacting the old dead master.
With a failed master or pool member you have two choices:
- Kick the dead host out of the pool forever (from the new master); a rough command sketch is at the end of this post. After you fix/replace the dead host, you need to REFORMAT and START OVER with a NEW install of XCP-ng (there may be a reset command, not sure), then add it to the existing pool as a NEW host. You CAN NOT re-add a host that has been kicked out of the pool: the clean install gets a new UUID, and you can't replace a pool member (just kick the old one and add the new one).
- Don't kick out the dead host, and fix the hardware (as long as the drives boot the same XCP-ng data as before). Just turn the host on; it should rejoin the pool, see that there is a new master now, and simply become a normal pool member.
If you fix the old master (and did NOT kick it from the pool) and then boot it with the same disks, it should just connect to the pool and recognize the newer master. You can re-designate it as the master if you wish, or just let it be a pool member.
Yes, I have done this (again, just now as a test) with XCP 8.2.1
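As a rough sketch of the "kick out forever" path above, run from the new master (the UUID is a placeholder):
# Find the dead host's UUID
xe host-list params=uuid,name-label
# Mark it dead so its VMs can be restarted on surviving hosts
xe host-declare-dead uuid=<dead-host-uuid>
# Permanently remove it from the pool
xe host-forget uuid=<dead-host-uuid>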
-
@Andrew, @bvitnik and @DustinB In my tests, I did the following. I did this process twice and it worked. To simulate a hardware failure on the master node, I simply turned it off.
If the pool master is down or unresponsive due to a hardware failure, follow these steps to restore operations:
- Use an SSH client to log in to a slave host in the pool.
- Run the following command on the slave host to promote it to the new pool master:
xe pool-emergency-transition-to-master
- Confirm the change of pool master and verify the hosts present in the pool:
xe pool-list
xe host-list
Even if it is down, the old master node will still appear in the listing.
- Remap the pool in XCP-ng Center or XO using the IP of the new master node.
- After resolving the hardware issues on the old master node, start it up. When it finishes booting, it will be recognized as a slave node.
In testing, I did not need to run any other commands. However, if the node is not recognized, try running the following on it after accessing it via SSH:
xe pool-recover-slaves
I didn't understand why it worked. It seemed like "magic"!
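Reading the docs afterwards, this may explain the "magic": xe pool-recover-slaves is documented to be run on the new master, and it tells any members still pointing at the old master to reconnect. And if you later want the repaired host to be master again, there is an orderly handover (the UUID is a placeholder):
# Run on the new master: reconnect members stuck in emergency mode
xe pool-recover-slaves
# Optional: hand the master role back to the repaired host
xe pool-designate-new-master host-uuid=<repaired-host-uuid>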
-