    Recovery from lost node in HA

    XOSTOR
    • henri9813

      Hello,

      I have an XCP-ng 8.3 pool of 3 hosts running XOSTOR with 3 replicas and HA enabled.

      This setup should tolerate the loss of up to 2 nodes without data loss.

      Initial state:

      • The LINSTOR controller was on node 1.
      • The pool master was node 2.
      • Satellites are running on all nodes.

      I was able to migrate VDIs to XOSTOR successfully (although when I start a transfer to XOSTOR, I have to wait about 1 minute before the transfer actually starts; I can see this in XO).

      For my first test, I shut down node 3 (which is neither the pool master nor the LINSTOR controller).

      I didn't want to kill the LINSTOR controller host or the pool master right away; those are meant to be my second and third tests.

      I stopped node 3 (power off via IPMI).

      However, the entire pool then went down.

      In xensource.log on both remaining nodes (node 1 and node 2), I can see:

      Jul  5 15:32:20 node2 xapi: [debug||0 |Checking HA configuration D:9b97e277d80e|helpers] /usr/libexec/xapi/cluster-stack/xhad/ha_start_daemon  exited with code 8 [stdout = ''; stderr = 'Sat Jul  5 15:32:20 CEST 2025 ha_start_daemon: the HA daemon stopped without forming a liveset (8)\x0A']
      Jul  5 15:32:20 node2 xapi: [ warn||0 |Checking HA configuration D:9b97e277d80e|xapi_ha] /usr/libexec/xapi/cluster-stack/xhad/ha_start_daemon  returned MTC_EXIT_CAN_NOT_ACCESS_STATEFILE (State-File is inaccessible)
      Jul  5 15:32:20 gco-002-rbx-002 xapi: [ warn||0 |Checking HA configuration D:9b97e277d80e|xapi_ha] ha_start_daemon failed with MTC_EXIT_CAN_NOT_ACCESS_STATEFILE: will contact existing master and check if HA is still enabled
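
To pull the same information out of a longer xensource.log, a small script along these lines can help. It is only a sketch: the two regular expressions are derived from the three log lines above (not from any xapi documentation), and the function name `summarize_ha_failures` is made up for illustration.

```python
import re

# Patterns inferred from the log excerpt above; other xapi versions may
# word these messages differently (assumption).
CODE_RE = re.compile(r"ha_start_daemon\s+exited with code (\d+)")
STATUS_RE = re.compile(r"ha_start_daemon\s+returned (MTC_EXIT_\w+) \(([^)]*)\)")


def summarize_ha_failures(lines):
    """Collect numeric exit codes and symbolic statuses of ha_start_daemon."""
    codes = set()
    statuses = {}
    for line in lines:
        match = CODE_RE.search(line)
        if match:
            codes.add(int(match.group(1)))
        match = STATUS_RE.search(line)
        if match:
            # Map the symbolic name to its human-readable description.
            statuses[match.group(1)] = match.group(2)
    return codes, statuses


# Shortened copies of the lines quoted in the post.
SAMPLE = [
    "Jul  5 15:32:20 node2 xapi: [debug||0 |Checking HA configuration|helpers] "
    "/usr/libexec/xapi/cluster-stack/xhad/ha_start_daemon  exited with code 8 "
    "[stdout = ''; stderr = '...']",
    "Jul  5 15:32:20 node2 xapi: [ warn||0 |Checking HA configuration|xapi_ha] "
    "/usr/libexec/xapi/cluster-stack/xhad/ha_start_daemon  returned "
    "MTC_EXIT_CAN_NOT_ACCESS_STATEFILE (State-File is inaccessible)",
]

if __name__ == "__main__":
    codes, statuses = summarize_ha_failures(SAMPLE)
    print(codes, statuses)
```

On a real host, the lines would come from reading the log file instead of `SAMPLE`.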
      

      However, the storage layer was OK:

      [15:33 node1 linstor-controller]# linstor node list
      ╭────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
      ┊ Node                               ┊ NodeType ┊ Addresses               ┊ State                                        ┊
      ╞════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╡
      ┊ h1 ┊ COMBINED ┊ 192.168.1.1:3366 (PLAIN) ┊ Online                                       ┊
      ┊ h2 ┊ COMBINED ┊ 192.168.1.2:3366 (PLAIN) ┊ Online                                       ┊
      ┊ h3 ┊ COMBINED ┊ 192.168.1.3:3366 (PLAIN) ┊ OFFLINE (Auto-eviction: 2025-07-05 16:33:42) ┊
      ╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
      

      The volumes were also OK according to linstor volume list:

      [15:33 r1 linstor-controller]# linstor volume list
      ╭───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
      ┊ Node                               ┊ Resource                                        ┊ StoragePool                      ┊ VolNr ┊ MinorNr ┊ DeviceName    ┊ Allocated ┊ InUse  ┊    State ┊
      ╞═══════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╡
      ┊ r1 ┊ xcp-persistent-database                         ┊ xcp-sr-linstor_group_thin_device ┊     0 ┊    1000 ┊ /dev/drbd1000 ┊ 52.74 MiB ┊ InUse  ┊ UpToDate ┊
      ┊ r2 ┊ xcp-persistent-database                         ┊ xcp-sr-linstor_group_thin_device ┊     0 ┊    1000 ┊ /dev/drbd1000 ┊  6.99 MiB ┊ Unused ┊ UpToDate ┊
      ┊ r3 ┊ xcp-persistent-database                         ┊ xcp-sr-linstor_group_thin_device ┊     0 ┊    1000 ┊ /dev/drbd1000 ┊  6.99 MiB ┊        ┊  Unknown ┊
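
For a long volume list, the replicas that are not UpToDate can be filtered out with a few lines of Python. This is a rough sketch that assumes the table layout shown above (cells separated by `┊`, first such row being the header); the helper names `parse_linstor_rows` and `not_up_to_date` are hypothetical.

```python
def parse_linstor_rows(text):
    """Turn the '┊'-separated table printed by the linstor client into dicts."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("┊"):
            continue  # skip border lines and anything that is not a table row
        rows.append([cell.strip() for cell in line.strip("┊").split("┊")])
    if not rows:
        return []
    header, *data = rows
    return [dict(zip(header, row)) for row in data]


def not_up_to_date(rows):
    """Return (node, resource, state) for every replica whose state is not UpToDate."""
    return [
        (row["Node"], row["Resource"], row["State"])
        for row in rows
        if row["State"] != "UpToDate"
    ]


# Rows copied from the output above, with column padding trimmed.
SAMPLE = """
┊ Node ┊ Resource ┊ StoragePool ┊ VolNr ┊ MinorNr ┊ DeviceName ┊ Allocated ┊ InUse ┊ State ┊
┊ r1 ┊ xcp-persistent-database ┊ xcp-sr-linstor_group_thin_device ┊ 0 ┊ 1000 ┊ /dev/drbd1000 ┊ 52.74 MiB ┊ InUse ┊ UpToDate ┊
┊ r2 ┊ xcp-persistent-database ┊ xcp-sr-linstor_group_thin_device ┊ 0 ┊ 1000 ┊ /dev/drbd1000 ┊ 6.99 MiB ┊ Unused ┊ UpToDate ┊
┊ r3 ┊ xcp-persistent-database ┊ xcp-sr-linstor_group_thin_device ┊ 0 ┊ 1000 ┊ /dev/drbd1000 ┊ 6.99 MiB ┊ ┊ Unknown ┊
"""

if __name__ == "__main__":
    for node, resource, state in not_up_to_date(parse_linstor_rows(SAMPLE)):
        print(node, resource, state)
```

In this excerpt only the replica on the powered-off node is reported, which matches the observation that the storage layer itself looked healthy.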
      

      (I didn't include the entire list of volumes; writing this post, I feel a bit stupid for not having saved the full output.)

      I finally solved my issue by bringing node 3 back up, which then promoted itself to master, but I need to run this test again because the result was not the expected one.

      Did I do something wrong?

      • olivierlambert Vates 🪐 Co-Founder CEO

        The issue is that HA cannot write to the statefile 🤔 Have you changed the HA timeout duration?

        • henri9813 @olivierlambert

          @olivierlambert No, for once I followed the installation steps carefully ^^'
