Other 2 hosts reboot when 1 host in HA enabled pool is powered off
-
Hello,
Before I elaborate more on the problem, below are some details of the test setup.
Hardware
VH1: Dell PowerEdge R640, 2x Intel(R) Xeon(R) Gold 6240 CPU @ 2.60GHz, 256 GB RAM
VH2: Dell PowerEdge R640, 2x Intel(R) Xeon(R) Gold 6240 CPU @ 2.60GHz, 256 GB RAM
VW1: Dell PowerEdge R350, 1x Intel(R) Xeon(R) E-2378 CPU @ 2.60GHz, 32 GB RAM
Software
XCP-ng 8.2.1 installed on all 3 hosts. All hosts were updated 3 weeks ago using yum update.
XO from sources updated to commit 6fcb8.
Configuration
All the 3 hosts are added to a pool.
I have created XOSTOR shared storage using disks from all 3 hosts.
Enabled HA on the pool, using the XOSTOR storage as the heartbeat SR.
Created a few Windows Server 2022 and Ubuntu 24.04 VMs.
Enabled HA on some of the VMs (best-effort + restart). Made sure that the sum of RAM of the HA-enabled VMs is less than 32 GB (it is 20 GB) to account for the smallest host.
Checked the maximum number of host failures that can be tolerated by running:
[12:31 xcp-ng-vh1 ~]# xe pool-ha-compute-max-host-failures-to-tolerate
2
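For reference, the equivalent xe commands for the per-VM HA priorities are along these lines (a sketch with placeholder <VM_UUID> values; the priorities may just as well be set from the XO UI):
xe vm-param-set uuid=<VM_UUID> ha-restart-priority=restart
xe vm-param-set uuid=<VM_UUID> ha-restart-priority=best-effort
xe vm-param-get uuid=<VM_UUID> param-name=ha-restart-priority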
Test Case
Power off VW1 from iDRAC (Power Off System)
Expected Output: The 2 Ubuntu VMs running on VW1 will be migrated to surviving hosts.
Observed Output: After VW1 is powered off, the other 2 surviving hosts in the pool get rebooted. I have repeated this test case many times and the same behaviour is observed.
Disabling HA on the pool and repeating the test case does not exhibit the same behaviour: when VW1 is powered off, the other hosts are unaffected.
Anyone have any idea why this can be happening?
Thanks.
-
If the other hosts are rebooting, it means the storage heartbeat is failing for all hosts. It's really hard to answer "like this", without reading literally tons of logs. This might be a pretty complex problem to solve.
We can try to reproduce internally though.
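In the meantime, if you want to look yourself, the usual starting points on each host are the HA daemon and toolstack logs (default locations on an XCP-ng 8.2 install):
tail -f /var/log/xha.log
tail -f /var/log/xensource.log
Whatever the HA daemon logs around the moment the two surviving hosts self-fence is the interesting part.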
-
@ha_tu_su said in Other 2 hosts reboot when 1 host in HA enabled pool is powered off:
I have created XOSTOR shared storage using disks from all 3 hosts.
Can you elaborate on how you achieved this and what settings you used?
-
@olivierlambert
This doesn't happen when either of the other 2 hosts is individually rebooted in the same test.
Also, I had been seeing this issue multiple times yesterday and today. After writing the post, I repeated the same test for the other 2 hosts and then one more time for VW1. This time everything went as expected.
I will give it one more round on Monday and update the post.
-
@Danp
I followed the commands for XOSTOR thin provisioning from the XOSTOR hyperconvergence thread.
For the storage network I am experimenting with LINSTOR storage paths. Each node has 3 NICs defined using linstor commands - deflt, strg1 and strg2. Then I create specific paths between the nodes using linstor commands:
VH1 strg1 nic -> VH2 strg2 nic
VH2 strg1 nic -> VW1 strg2 nic
VW1 strg1 nic -> VH1 strg2 nic
All these NICs are 10G. Each line above is a separate network with a /30 subnet. The idea is basically connecting the 3 hosts in a 'mesh' storage network.
Sorry for not being clear with commands. I am typing from my phone without access to the setup. I will be able to give a clearer picture on Monday.
Hope you guys have a great Sunday.
-
@ha_tu_su Okay, so it's one specific host that, when cut off, breaks the other two, right?
-
@olivierlambert Yes.
-
@ha_tu_su
Here are the details, albeit a little later than what I had promised. Mondays are... not that great.
The commands below were executed on all 3 hosts after installation of XCP-ng:
yum update
wget https://gist.githubusercontent.com/Wescoeur/7bb568c0e09e796710b0ea966882fcac/raw/052b3dfff9c06b1765e51d8de72c90f2f90f475b/gistfile1.txt -O install && chmod +x install
./install --disks /dev/sdd --thin --force
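For completeness, the resulting LVM layout can be checked with standard LVM tools; the group/volume names here are the ones referenced in the sr-create command further down, so treat this as a sketch:
vgs linstor_group
lvs linstor_group/thin_device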
Then XO was installed on one of the hosts and a pool was created consisting of the 3 hosts. Before that, NIC renumbering was done on VW1 to match the NIC numbers of the other 2 hosts.
Then the XOSTOR SR was created by executing the following on the master host:
xe sr-create type=linstor name-label=XOSTOR host-uuid=<MASTER_UUID> device-config:group-name=linstor_group/thin_device device-config:redundancy=2 shared=true device-config:provisioning=thin
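To confirm the SR is visible and plugged on every host, something along these lines should do (<SR_UUID> being the UUID returned by sr-create):
xe sr-list type=linstor params=uuid,name-label,shared
xe pbd-list sr-uuid=<SR_UUID> params=host-uuid,currently-attached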
Then the commands below were executed on the host which is the LINSTOR controller. Each of the networks has a /30 subnet.
linstor node interface create xcp-ng-vh1 strg1 192.168.255.1
linstor node interface create xcp-ng-vh1 strg2 192.168.255.10
linstor node interface create xcp-ng-vh2 strg1 192.168.255.5
linstor node interface create xcp-ng-vh2 strg2 192.168.255.2
linstor node interface create xcp-ng-vw1 strg1 192.168.255.9
linstor node interface create xcp-ng-vw1 strg2 192.168.255.6
linstor node-connection path create xcp-ng-vh1 xcp-ng-vh2 strg_path strg1 strg2
linstor node-connection path create xcp-ng-vh2 xcp-ng-vw1 strg_path strg1 strg2
linstor node-connection path create xcp-ng-vw1 xcp-ng-vh1 strg_path strg1 strg2
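If it is useful for comparison, the interfaces can be listed back from the controller to double-check the mesh addressing; a sketch, in case I am misremembering the exact client syntax:
linstor node interface list xcp-ng-vh1
linstor node interface list xcp-ng-vh2
linstor node interface list xcp-ng-vw1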
After this, HA was enabled on the pool by executing the commands below on the master host:
xe pool-ha-enable heartbeat-sr-uuids=<XOSTOR_SR_UUID>
xe pool-param-set ha-host-failures-to-tolerate=2 uuid=<POOL_UUID>
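As a quick sanity check afterwards, these pool parameters should reflect the configuration above (<POOL_UUID> as before):
xe pool-param-get uuid=<POOL_UUID> param-name=ha-enabled
xe pool-param-get uuid=<POOL_UUID> param-name=ha-host-failures-to-tolerate
xe pool-param-get uuid=<POOL_UUID> param-name=ha-plan-exists-for
xe pool-param-get uuid=<POOL_UUID> param-name=ha-statefiles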
After this, some test VMs were created as mentioned in the original post. The host failure case works as expected for the VH1 and VH2 hosts. For VW1, when it is switched off, VH1 and VH2 also reboot.
Let me know if any other information is required.
Thanks.