Recovery from lost node in HA
-
Hello,
I have an XCP-ng 8.3 pool of 3 hosts running XOSTOR with 3 replicas and HA enabled.
This setup should allow losing up to 2 nodes without data loss.
Initial information:
- The LINSTOR controller was on node 1.
- The pool master was node 2.
- Satellites are running on all nodes.
I was able to migrate VDIs to XOSTOR successfully (although when I start a transfer to XOSTOR, I have to wait ~1 minute before the transfer really starts, as seen in XO).
For my first test, the plan was to shut down node 3 (which is neither the pool master nor the LINSTOR controller).
I didn't want to kill the LINSTOR controller host / pool master right away; that would be my second and third tests.
I stopped node 3 (powered off via IPMI).
However, the entire pool was then dead.
In xensource.log on all remaining nodes (node 1 and node 2), I can see:

Jul 5 15:32:20 node2 xapi: [debug||0 |Checking HA configuration D:9b97e277d80e|helpers] /usr/libexec/xapi/cluster-stack/xhad/ha_start_daemon exited with code 8 [stdout = ''; stderr = 'Sat Jul 5 15:32:20 CEST 2025 ha_start_daemon: the HA daemon stopped without forming a liveset (8)\x0A']
Jul 5 15:32:20 node2 xapi: [ warn||0 |Checking HA configuration D:9b97e277d80e|xapi_ha] /usr/libexec/xapi/cluster-stack/xhad/ha_start_daemon returned MTC_EXIT_CAN_NOT_ACCESS_STATEFILE (State-File is inaccessible)
Jul 5 15:32:20 gco-002-rbx-002 xapi: [ warn||0 |Checking HA configuration D:9b97e277d80e|xapi_ha] ha_start_daemon failed with MTC_EXIT_CAN_NOT_ACCESS_STATEFILE: will contact existing master and check if HA is still enabled
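For reference, the statefile that HA complains about is a VDI on the heartbeat SR, which on XOSTOR is backed by a dedicated LINSTOR resource. To check whether it was still reachable, I believe something like this would do (a minimal sketch; the resource name xcp-persistent-ha-statefile is my assumption based on how the driver names its other internal volumes):

# Which VDI(s) does XAPI use as the HA statefile?
xe pool-param-get uuid=$(xe pool-list --minimal) param-name=ha-statefiles

# DRBD state of the statefile resource on each node
# (resource name assumed: xcp-persistent-ha-statefile)
linstor resource list -r xcp-persistent-ha-statefile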
However, the storage layer was OK:
[15:33 node1 linstor-controller]# linstor node list
╭───────────────────────────────────────────────────────────────────────────────────────────╮
┊ Node ┊ NodeType ┊ Addresses                ┊ State                                        ┊
╞═══════════════════════════════════════════════════════════════════════════════════════════╡
┊ h1   ┊ COMBINED ┊ 192.168.1.1:3366 (PLAIN) ┊ Online                                       ┊
┊ h2   ┊ COMBINED ┊ 192.168.1.2:3366 (PLAIN) ┊ Online                                       ┊
┊ h3   ┊ COMBINED ┊ 192.168.1.3:3366 (PLAIN) ┊ OFFLINE (Auto-eviction: 2025-07-05 16:33:42) ┊
╰───────────────────────────────────────────────────────────────────────────────────────────╯
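A side note on the OFFLINE line above: LINSTOR starts an auto-eviction countdown for an unreachable satellite, and once a node is evicted it is not reused automatically when it comes back. A quick sketch of the relevant commands (assuming the default eviction properties; adjust the node name):

# Inspect the auto-eviction settings stored on the controller
linstor controller list-properties | grep -i AutoEvict

# Once the failed node is reachable again, allow LINSTOR to reuse it
linstor node restore h3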
Volumes were also OK using linstor volume list:

[15:33 r1 linstor-controller]# linstor volume list
╭──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
┊ Node ┊ Resource                 ┊ StoragePool                      ┊ VolNr ┊ MinorNr ┊ DeviceName    ┊ Allocated ┊ InUse  ┊ State    ┊
╞══════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╡
┊ r1   ┊ xcp-persistent-database  ┊ xcp-sr-linstor_group_thin_device ┊     0 ┊    1000 ┊ /dev/drbd1000 ┊ 52.74 MiB ┊ InUse  ┊ UpToDate ┊
┊ r2   ┊ xcp-persistent-database  ┊ xcp-sr-linstor_group_thin_device ┊     0 ┊    1000 ┊ /dev/drbd1000 ┊  6.99 MiB ┊ Unused ┊ UpToDate ┊
┊ r3   ┊ xcp-persistent-database  ┊ xcp-sr-linstor_group_thin_device ┊     0 ┊    1000 ┊ /dev/drbd1000 ┊  6.99 MiB ┊        ┊ Unknown  ┊
(I didn't include the entire list of volumes; while writing this post, I feel a bit stupid for not having saved the full output.)
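For anyone reproducing this, a per-resource view is probably the most useful output to capture; something like this should show whether the database resource keeps quorum at the DRBD level (a sketch, run from any surviving node):

# Detailed view of the shared database resource across all nodes
linstor resource list -r xcp-persistent-database

# The same information straight from DRBD, bypassing the controller
drbdadm status xcp-persistent-database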
I finally solved my issue by bringing node 3 back up, which promoted itself as master, but I need to perform this test again because the result is not the expected one.
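In case it helps someone who ends up with a dead pool in the same state: from what I understand of the XAPI emergency commands, the way out without re-upping the failed node would have been something like this (a hedged sketch, untested on my side):

# On a surviving host: force HA off locally when the statefile is unreachable
xe host-emergency-ha-disable --force

# If the master is gone for good: promote a surviving host...
xe pool-emergency-transition-to-master
# ...then reconnect the remaining hosts to the new master
xe pool-recover-slaves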
Did I do something wrong?
-
The issue is that HA cannot write to the statefile.
Have you changed the timeout duration for the HA?
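For reference, the timeout is normally passed when HA is enabled, along these lines (a sketch; <SR-UUID> being your heartbeat SR):

# Enable HA with an explicit timeout (in seconds)
xe pool-ha-enable heartbeat-sr-uuids=<SR-UUID> ha-config:timeout=60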
-
@olivierlambert No,
For once, I followed the installation steps carefully ^^'