Other 2 hosts reboot when 1 host in HA enabled pool is powered off
-
Hello,
Before I elaborate more on the problem, below are some details of the test setup.
Hardware
VH1: Dell PowerEdge R640, 2x Intel(R) Xeon(R) Gold 6240 CPU @ 2.60GHz, 256 GB RAM
VH2: Dell PowerEdge R640, 2x Intel(R) Xeon(R) Gold 6240 CPU @ 2.60GHz, 256 GB RAM
VW1: Dell PowerEdge R350, 1x Intel(R) Xeon(R) E-2378 CPU @ 2.60GHz, 32 GB RAM
Software
XCP-ng 8.2.1 installed on all 3 hosts. All hosts were updated 3 weeks ago using yum update.
XO from sources updated to commit 6fcb8.
Configuration
All the 3 hosts are added to a pool.
I have created XOSTOR shared storage using disks from all 3 hosts.
Enabled HA on the pool, using the XOSTOR storage as the heartbeat SR.
Created a few Windows Server 2022 and Ubuntu 24.04 VMs.
Enabled HA on some of the VMs (best-effort + restart). Made sure that the sum of RAM of the HA-enabled VMs is less than 32 GB (it is 20 GB) to account for the smallest host.
Checked the maximum number of host failures that can be tolerated by running:
[12:31 xcp-ng-vh1 ~]# xe pool-ha-compute-max-host-failures-to-tolerate
2
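For reference, the equivalent xe commands for the per-VM HA priorities are along these lines (a sketch with placeholder <VM_UUID> values; the priorities may just as well be set from the XO UI):
xe vm-param-set uuid=<VM_UUID> ha-restart-priority=restart
xe vm-param-set uuid=<VM_UUID> ha-restart-priority=best-effort
xe vm-param-get uuid=<VM_UUID> param-name=ha-restart-priority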
Test Case
Power off VW1 from iDRAC (Power Off System)
Expected Output: The 2 Ubuntu VMs running on VW1 will be migrated to surviving hosts.
Observed Output: After VW1 is powered off, the other 2 surviving hosts in the pool get rebooted. I have repeated this test case many times and the same behaviour is observed.
Disabling HA on the pool and repeating the test case does not exhibit the same behaviour: when VW1 is powered off, the other hosts are unaffected.
Anyone have any idea why this can be happening?
Thanks.
-
If the other hosts are rebooting, it means the storage heartbeat is failing for all hosts. It's really hard to answer "like this", without reading literally tons of logs. This might be a pretty complex problem to solve.
We can try to reproduce internally though.
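In the meantime, if you want to look yourself, the usual starting points on each host are the HA daemon and toolstack logs (default locations on an XCP-ng 8.2 install):
tail -f /var/log/xha.log
tail -f /var/log/xensource.log
Whatever the HA daemon logs around the moment the two surviving hosts self-fence is the interesting part.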
-
@ha_tu_su said in Other 2 hosts reboot when 1 host in HA enabled pool is powered off:
I have created XOSTOR shared storage using disks from all 3 hosts.
Can you elaborate on how you achieved this and what settings you used?
-
@olivierlambert
This doesn't happen when either of the other 2 hosts is individually rebooted in the same test.
Also, I had been seeing this issue multiple times yesterday and today. After writing the post, I repeated the same test for the other 2 hosts and then one more time for VW1. This time everything went as expected.
I will give it one more round on Monday and update the post.
-
@Danp
I followed the commands for XOSTOR thin provisioning from the XOSTOR hyperconvergence thread.
For the storage network I am experimenting with LINSTOR storage paths. Each node has 3 NICs defined using linstor commands - deflt, strg1 and strg2. Then I create specific paths between the nodes using linstor commands:
VH1 strg1 nic -> VH2 strg2 nic
VH2 strg1 nic -> VW1 strg2 nic
VW1 strg1 nic -> VH1 strg2 nic
All these NICs are 10G. Each line above is a separate network with a /30 subnet. The idea is basically connecting the 3 hosts in a 'mesh' storage network.
Sorry for not being clear with commands. I am typing from my phone without access to the setup. I will be able to give a clearer picture on Monday.
Hope you guys have a great Sunday.
-
@ha_tu_su Okay, so it's one specific host that, when cut off, breaks the other two, right?
-
@olivierlambert Yes.
-
@ha_tu_su
Here are the details, albeit a little later than what I had promised. Mondays are... not that great.
The commands below were executed on all 3 hosts after installation of XCP-ng:
yum update
wget https://gist.githubusercontent.com/Wescoeur/7bb568c0e09e796710b0ea966882fcac/raw/052b3dfff9c06b1765e51d8de72c90f2f90f475b/gistfile1.txt -O install && chmod +x install
./install --disks /dev/sdd --thin --force
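For completeness, the resulting LVM layout can be checked with standard LVM tools; the group/volume names here are the ones referenced in the sr-create command further down, so treat this as a sketch:
vgs linstor_group
lvs linstor_group/thin_device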
Then XO was installed on one of the hosts and a pool was created consisting of the 3 hosts. Before that, NIC renumbering was done on VW1 to match the NIC numbers of the other 2 hosts.
Then the XOSTOR SR was created by executing the following on the master host:
xe sr-create type=linstor name-label=XOSTOR host-uuid=<MASTER_UUID> device-config:group-name=linstor_group/thin_device device-config:redundancy=2 shared=true device-config:provisioning=thin
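To confirm the SR is visible and plugged on every host, something along these lines should do (<SR_UUID> being the UUID returned by sr-create):
xe sr-list type=linstor params=uuid,name-label,shared
xe pbd-list sr-uuid=<SR_UUID> params=host-uuid,currently-attached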
Then the commands below were executed on the host which is the LINSTOR controller. Each of the networks has a /30 subnet.
linstor node interface create xcp-ng-vh1 strg1 192.168.255.1
linstor node interface create xcp-ng-vh1 strg2 192.168.255.10
linstor node interface create xcp-ng-vh2 strg1 192.168.255.5
linstor node interface create xcp-ng-vh2 strg2 192.168.255.2
linstor node interface create xcp-ng-vw1 strg1 192.168.255.9
linstor node interface create xcp-ng-vw1 strg2 192.168.255.6
linstor node-connection path create xcp-ng-vh1 xcp-ng-vh2 strg_path strg1 strg2
linstor node-connection path create xcp-ng-vh2 xcp-ng-vw1 strg_path strg1 strg2
linstor node-connection path create xcp-ng-vw1 xcp-ng-vh1 strg_path strg1 strg2
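If it is useful for comparison, the interfaces can be listed back from the controller to double-check the mesh addressing; a sketch, in case I am misremembering the exact client syntax:
linstor node interface list xcp-ng-vh1
linstor node interface list xcp-ng-vh2
linstor node interface list xcp-ng-vw1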
After this, HA was enabled on the pool by executing the commands below on the master host:
xe pool-ha-enable heartbeat-sr-uuids=<XOSTOR_SR_UUID>
xe pool-param-set ha-host-failures-to-tolerate=2 uuid=<POOL_UUID>
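As a quick sanity check afterwards, these pool parameters should reflect the configuration above (<POOL_UUID> as before):
xe pool-param-get uuid=<POOL_UUID> param-name=ha-enabled
xe pool-param-get uuid=<POOL_UUID> param-name=ha-host-failures-to-tolerate
xe pool-param-get uuid=<POOL_UUID> param-name=ha-plan-exists-for
xe pool-param-get uuid=<POOL_UUID> param-name=ha-statefiles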
After this, some test VMs were created as mentioned in the original post. The host failure case works as expected for the VH1 and VH2 hosts. For VW1, when it is switched off, VH1 and VH2 also reboot.
Let me know if any other information is required.
Thanks.