Recovery from lost node in HA
-
Hello,
I have an XCP-ng 8.3 pool of 3 hosts running XOSTOR with 3 replicas and HA enabled.
This setup should allow losing up to 2 nodes without data loss.
Initial information:
- The LINSTOR controller was on node 1.
- The pool master was node 2.
- Satellites are running on all nodes.
I was able to migrate VDIs to XOSTOR successfully (although when I start a transfer to XOSTOR, I have to wait ~1 minute before the transfer really starts, as seen in XO).
For my first test, the plan was to shut down node 3 (which is neither the pool master nor the LINSTOR controller).
I didn't want to kill the LINSTOR controller host / pool master right away; that would be my second and third tests.
I stopped node 3 (powered off via IPMI).
However, the entire pool was then dead.
In xensource.log on all remaining nodes (node 1 and node 2), I can see:

Jul 5 15:32:20 node2 xapi: [debug||0 |Checking HA configuration D:9b97e277d80e|helpers] /usr/libexec/xapi/cluster-stack/xhad/ha_start_daemon exited with code 8 [stdout = ''; stderr = 'Sat Jul 5 15:32:20 CEST 2025 ha_start_daemon: the HA daemon stopped without forming a liveset (8)\x0A']
Jul 5 15:32:20 node2 xapi: [ warn||0 |Checking HA configuration D:9b97e277d80e|xapi_ha] /usr/libexec/xapi/cluster-stack/xhad/ha_start_daemon returned MTC_EXIT_CAN_NOT_ACCESS_STATEFILE (State-File is inaccessible)
Jul 5 15:32:20 gco-002-rbx-002 xapi: [ warn||0 |Checking HA configuration D:9b97e277d80e|xapi_ha] ha_start_daemon failed with MTC_EXIT_CAN_NOT_ACCESS_STATEFILE: will contact existing master and check if HA is still enabled
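For reference, the statefile that HA complains about is a VDI on the heartbeat SR, which on XOSTOR is backed by a dedicated LINSTOR resource. To check whether it was still reachable, I believe something like this would do (a minimal sketch; the resource name xcp-persistent-ha-statefile is my assumption based on how the driver names its other internal volumes):

# Which VDI(s) does XAPI use as the HA statefile?
xe pool-param-get uuid=$(xe pool-list --minimal) param-name=ha-statefiles

# DRBD state of the statefile resource on each node
# (resource name assumed: xcp-persistent-ha-statefile)
linstor resource list -r xcp-persistent-ha-statefile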
However, the storage layer was OK:
[15:33 node1 linstor-controller]# linstor node list
╭───────────────────────────────────────────────────────────────────────────────────────────╮
┊ Node ┊ NodeType ┊ Addresses                ┊ State                                        ┊
╞═══════════════════════════════════════════════════════════════════════════════════════════╡
┊ h1   ┊ COMBINED ┊ 192.168.1.1:3366 (PLAIN) ┊ Online                                       ┊
┊ h2   ┊ COMBINED ┊ 192.168.1.2:3366 (PLAIN) ┊ Online                                       ┊
┊ h3   ┊ COMBINED ┊ 192.168.1.3:3366 (PLAIN) ┊ OFFLINE (Auto-eviction: 2025-07-05 16:33:42) ┊
╰───────────────────────────────────────────────────────────────────────────────────────────╯
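A side note on the OFFLINE line above: LINSTOR starts an auto-eviction countdown for an unreachable satellite, and once a node is evicted it is not reused automatically when it comes back. A quick sketch of the relevant commands (assuming the default eviction properties; adjust the node name):

# Inspect the auto-eviction settings stored on the controller
linstor controller list-properties | grep -i AutoEvict

# Once the failed node is reachable again, allow LINSTOR to reuse it
linstor node restore h3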
Volumes were also OK using linstor volume list:

[15:33 r1 linstor-controller]# linstor volume list
╭──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
┊ Node ┊ Resource                 ┊ StoragePool                      ┊ VolNr ┊ MinorNr ┊ DeviceName    ┊ Allocated ┊ InUse  ┊ State    ┊
╞══════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╡
┊ r1   ┊ xcp-persistent-database  ┊ xcp-sr-linstor_group_thin_device ┊     0 ┊    1000 ┊ /dev/drbd1000 ┊ 52.74 MiB ┊ InUse  ┊ UpToDate ┊
┊ r2   ┊ xcp-persistent-database  ┊ xcp-sr-linstor_group_thin_device ┊     0 ┊    1000 ┊ /dev/drbd1000 ┊  6.99 MiB ┊ Unused ┊ UpToDate ┊
┊ r3   ┊ xcp-persistent-database  ┊ xcp-sr-linstor_group_thin_device ┊     0 ┊    1000 ┊ /dev/drbd1000 ┊  6.99 MiB ┊        ┊ Unknown  ┊
(I didn't include the entire list of volumes; while writing this post, I feel a bit stupid for not having saved the full output.)
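For anyone reproducing this, a per-resource view is probably the most useful output to capture; something like this should show whether the database resource keeps quorum at the DRBD level (a sketch, run from any surviving node):

# Detailed view of the shared database resource across all nodes
linstor resource list -r xcp-persistent-database

# The same information straight from DRBD, bypassing the controller
drbdadm status xcp-persistent-database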
I finally solved my issue by bringing node 3 back up, which promoted itself as master, but I need to perform this test again because the result is not the expected one.
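In case it helps someone who ends up with a dead pool in the same state: from what I understand of the XAPI emergency commands, the way out without re-upping the failed node would have been something like this (a hedged sketch, untested on my side):

# On a surviving host: force HA off locally when the statefile is unreachable
xe host-emergency-ha-disable --force

# If the master is gone for good: promote a surviving host...
xe pool-emergency-transition-to-master
# ...then reconnect the remaining hosts to the new master
xe pool-recover-slaves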
Did I do something wrong?
-
The issue is that HA cannot write to the statefile.
Have you changed the timeout duration for the HA?
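For reference, the timeout is normally passed when HA is enabled, along these lines (a sketch; <SR-UUID> being your heartbeat SR):

# Enable HA with an explicit timeout (in seconds)
xe pool-ha-enable heartbeat-sr-uuids=<SR-UUID> ha-config:timeout=60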
-
@olivierlambert No,
For once, I followed the installation steps carefully ^^'