Transfer Pool Master / only partial success
-
Hi,
we are currently testing XCP-ng 8.3 as a complement to our VMware clusters, to reduce our VMware footprint.

Environment:
Our test environment consists of four servers distributed over four sites. VMs are stored on NetApp NFS volumes. Each of the four servers has two 10G or 25G interfaces bonded into a LAG connected to a pair of server switches. Each LAG carries the XCP-ng management VLAN, the storage VLAN, and the production VLANs of the VMs. There are already some test VMs on the cluster. All four servers have the same patch level. Host 1 is the cluster master; hosts 2 to 4 are cluster members.

Problem:
Currently I have the scenario where you lose network connection to Host 1 (the master). XOA lost contact to the cluster, so I tried to promote Host 2 to be the new master and ran "pool-emergency-transition-to-master" on Host 2's CLI. The command seemed to execute successfully; I was able to log in to its XO Lite GUI (but did not change anything there) and saw the test VMs running. Next I went to the XOA GUI and added Host 2 under Settings / Servers; the four cluster hosts reappeared, Host 2 now has the master tag, and Hosts 1, 3 and 4 are shown as disabled.
But the master (Host 2) has a grey dot (status: halted), and the available buttons in its Advanced menu are "Disable Maintenance Mode" and "Enable", as if the server itself were in maintenance mode and disabled. The VMs are still running on this server. When I click "Disable Maintenance Mode" or "Enable", a red popup appears: "server still booting".

So I assume the transfer of the pool master role is not fully completed yet. In some forum threads I also found commands like:
- xe pool-designate-new-master host-uuid=<Slave UUID>
- xe pool-recover-slaves
But I did not find official guides on how to use them. For example, for "pool-designate-new-master": do I have to use it in my case, and do I run it on the new master to onboard the other slaves, or on the slaves to send them to the new master?
The other "surviving" host, Host 4, is currently shown as disabled, but there is still a test VM running on it. I assume I have to link Host 4 to the new master Host 2, and that there is still something to complete for the master transfer?
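From what I have pieced together from forum posts so far, the recovery sequence might look roughly like this. This is my best guess, not an official procedure, and the `<...>` placeholders are mine:

```shell
# 1. On the surviving host that should become the new master (already done):
xe pool-emergency-transition-to-master

# 2a. Then, on the NEW master, to re-point the surviving members to it:
xe pool-recover-slaves

# 2b. Or alternatively, on EACH surviving slave individually:
xe pool-emergency-reset-master master-address=<new-master-address>

# pool-designate-new-master seems to be meant for a healthy pool where the
# old master is still reachable, so it may not apply in my case at all:
# xe pool-designate-new-master host-uuid=<slave-uuid>
```

Is that roughly correct?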
-
-
@xoahtw The CLI input you're looking to use in a scenario like this would be
pool-emergency-transition-to-master
The command you used is for a healthy pool where you haven't lost the existing master.
Additional details can be found here: https://docs.xcp-ng.org/appendix/cli_reference/
Edited to reference the correct command and to post the link.
-
@DustinB : Hi Dustin,
pool-emergency-transition-to-master
is what I used, as described in my first post, because the master was not available any more. I wonder if something is still missing.
-
@xoahtw If the host you ran this command on was online and running normally while the pool master was offline, it should take over and manage the remainder of the pool.
Of course you'd have to configure your XO/XOA to use the new master so that you can see and manage your VMs.
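As a rough sketch (substitute your own host UUIDs and addresses), verifying and finishing the takeover usually looks something like:

```shell
# On the new master: tell the surviving members where the master now is.
xe pool-recover-slaves

# Check which host the pool currently considers its master:
xe pool-list params=master --minimal

# Check which hosts are enabled again:
xe host-list params=uuid,name-label,enabled
```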
Also, sorry, I misread your post; the last bulleted lines stuck with me.
-
Yes, in XOA I changed the server entry to the access credentials of Host 2, and the cluster reappeared with Host 2 carrying the master tag.
But the master (Host 2) has a grey dot (status: halted), and the available buttons in its Advanced menu are "Disable Maintenance Mode" and "Enable", as if the server itself were in maintenance mode and disabled. When I click "Disable Maintenance Mode" or "Enable", a red popup appears: "HOST_STILL_BOOTING()". VMs are running on this host. Restarting the toolstack did not change the behaviour.
edit: and in the task log I see a flood of failed "API call: host.isPubKeyTooShort" tasks
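For completeness, these are the checks I tried on the new master while it was stuck (the `<host-uuid>` placeholder is mine; adapt to your own hosts):

```shell
# Does xapi still consider the host disabled?
xe host-param-get uuid=<host-uuid> param-name=enabled

# Try to enable it explicitly (the GUI equivalent gave me HOST_STILL_BOOTING):
xe host-enable uuid=<host-uuid>

# Restart the toolstack (does not touch running VMs); did not help in my case:
xe-toolstack-restart

# Watch the xapi log for startup errors:
tail -f /var/log/xensource.log
```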
-
Positive update: today I wanted to continue with this problem, but when I logged into XOA, the master status was green / enabled.
Since no one except me has credentials for the test cluster, I am sure nobody else did anything. So the master host seems to have recovered itself from grey to green within the last 8 hours. Yesterday, at least 1-2 hours passed between executing "pool-emergency-transition-to-master" and the host still showing a grey dot / halted.
I then used xe pool-emergency-reset-master master-address=... and added the remaining hosts to the new master.
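To summarize for anyone finding this thread later, the sequence that worked for me was (a sketch; `<new-master-address>` is a placeholder for your own new master's IP):

```shell
# On the host that should become the new master:
xe pool-emergency-transition-to-master

# On each remaining member, point it at the new master:
xe pool-emergency-reset-master master-address=<new-master-address>

# Then re-add the pool in XOA (Settings / Servers) using the new master's
# address, and give the hosts time to become enabled again (hours, in my case).
```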
-