XCP-ng

    Recovery from lost node in HA

    • henri9813

      Hello,

      I have an XCP-ng 8.3 pool of 3 hosts, using XOSTOR with 3 replicas and HA enabled.

      This setup should tolerate the loss of up to 2 nodes without data loss.

      Initial state:

      • The LINSTOR controller was on node 1.
      • The pool master was node 2.
      • Satellites were running on all nodes.

      I was able to migrate VDIs to XOSTOR successfully (although when I start a transfer to XOSTOR, I have to wait about 1 minute before the transfer actually starts; I can see this in XO).

      For my first test, I chose to shut down node 3 (which is neither the pool master nor the LINSTOR controller).

      I didn't want to kill the LINSTOR controller host or the pool master right away; those are meant to be my second and third tests.

      I stopped node 3 (power off via IPMI).

      However, the entire pool then went down.

      In xensource.log on the remaining nodes (node 1 and node 2), I can see:

      Jul  5 15:32:20 node2 xapi: [debug||0 |Checking HA configuration D:9b97e277d80e|helpers] /usr/libexec/xapi/cluster-stack/xhad/ha_start_daemon  exited with code 8 [stdout = ''; stderr = 'Sat Jul  5 15:32:20 CEST 2025 ha_start_daemon: the HA daemon stopped without forming a liveset (8)\x0A']
      Jul  5 15:32:20 node2 xapi: [ warn||0 |Checking HA configuration D:9b97e277d80e|xapi_ha] /usr/libexec/xapi/cluster-stack/xhad/ha_start_daemon  returned MTC_EXIT_CAN_NOT_ACCESS_STATEFILE (State-File is inaccessible)
      Jul  5 15:32:20 gco-002-rbx-002 xapi: [ warn||0 |Checking HA configuration D:9b97e277d80e|xapi_ha] ha_start_daemon failed with MTC_EXIT_CAN_NOT_ACCESS_STATEFILE: will contact existing master and check if HA is still enabled
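
For anyone reproducing this test, here is a minimal sketch of how the HA daemon failures could be pulled out of xensource.log on each host. It only relies on the wording visible in the excerpt above (the "exited with code N" text and the MTC_EXIT_* symbol). The function name, the regular expressions and the shortened sample lines are my own assumptions, not part of any XCP-ng tooling.

```python
import re

# Matches "ha_start_daemon  exited with code <n>" as worded in the excerpt.
EXIT_RE = re.compile(r"ha_start_daemon\s+exited with code (\d+)")
# Matches symbolic exit names such as MTC_EXIT_CAN_NOT_ACCESS_STATEFILE.
SYMBOL_RE = re.compile(r"\b(MTC_EXIT_[A-Z_]+)\b")
# Syslog-style prefix: "Jul  5 15:32:20 node2 xapi: ..."
PREFIX_RE = re.compile(r"^(\w{3}\s+\d+\s+[\d:]{8})\s+(\S+)\s+xapi:")


def find_ha_failures(lines):
    """Return one dict per xapi line that reports an ha_start_daemon failure."""
    events = []
    for raw in lines:
        line = raw.strip()
        prefix = PREFIX_RE.match(line)
        if prefix is None:
            continue  # not a xapi log line
        exit_match = EXIT_RE.search(line)
        symbol_match = SYMBOL_RE.search(line)
        if exit_match is None and symbol_match is None:
            continue  # xapi line, but unrelated to the HA daemon exit
        events.append({
            "time": prefix.group(1),
            "host": prefix.group(2),
            "exit_code": int(exit_match.group(1)) if exit_match else None,
            "symbol": symbol_match.group(1) if symbol_match else None,
        })
    return events


# Two lines from the excerpt, shortened (the stdout/stderr tail is dropped).
SAMPLE = [
    "Jul  5 15:32:20 node2 xapi: [debug||0 |Checking HA configuration "
    "D:9b97e277d80e|helpers] /usr/libexec/xapi/cluster-stack/xhad/"
    "ha_start_daemon  exited with code 8",
    "Jul  5 15:32:20 node2 xapi: [ warn||0 |Checking HA configuration "
    "D:9b97e277d80e|xapi_ha] /usr/libexec/xapi/cluster-stack/xhad/"
    "ha_start_daemon  returned MTC_EXIT_CAN_NOT_ACCESS_STATEFILE "
    "(State-File is inaccessible)",
]

for event in find_ha_failures(SAMPLE):
    print(event)
```

Running the same function over the full log of each surviving host (for example by passing an open file object as `lines`) should show whether both hosts lost access to the statefile at the same moment.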
      

      However, the storage layer was OK:

      [15:33 node1 linstor-controller]# linstor node list
      ╭────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
      ┊ Node                               ┊ NodeType ┊ Addresses               ┊ State                                        ┊
      ╞════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╡
      ┊ h1 ┊ COMBINED ┊ 192.168.1.1:3366 (PLAIN) ┊ Online                                       ┊
      ┊ h2 ┊ COMBINED ┊ 192.168.1.2:3366 (PLAIN) ┊ Online                                       ┊
      ┊ h3 ┊ COMBINED ┊ 192.168.1.3:3366 (PLAIN) ┊ OFFLINE (Auto-eviction: 2025-07-05 16:33:42) ┊
      ╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
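
To compare this output across repeated tests, the table can be turned into records with a few lines of Python. This is only a sketch that assumes the human-readable layout shown above, with cells separated by the "┊" character. The helper names are hypothetical, and the sample reuses the three rows from the output with the borders left out.

```python
def parse_linstor_table(text):
    """Parse a LINSTOR client table (cells separated by '┊') into dicts."""
    header = None
    rows = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line.startswith("┊"):
            continue  # skip borders, shell prompts and blank lines
        cells = [cell.strip() for cell in line.strip("┊").split("┊")]
        if header is None:
            header = cells  # the first data-like line holds the column names
        else:
            rows.append(dict(zip(header, cells)))
    return rows


def nodes_not_online(rows):
    """Return the names of nodes whose State column is not exactly 'Online'."""
    return [row["Node"] for row in rows if row.get("State") != "Online"]


SAMPLE_TABLE = """
┊ Node ┊ NodeType ┊ Addresses                ┊ State                                        ┊
┊ h1   ┊ COMBINED ┊ 192.168.1.1:3366 (PLAIN) ┊ Online                                       ┊
┊ h2   ┊ COMBINED ┊ 192.168.1.2:3366 (PLAIN) ┊ Online                                       ┊
┊ h3   ┊ COMBINED ┊ 192.168.1.3:3366 (PLAIN) ┊ OFFLINE (Auto-eviction: 2025-07-05 16:33:42) ┊
"""

rows = parse_linstor_table(SAMPLE_TABLE)
print(nodes_not_online(rows))
```

The same parser should also work on the volume listing, since it uses the same separator and also has a State column.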
      

      The volumes were also OK according to linstor volume list:

      [15:33 r1 linstor-controller]# linstor volume list
      ╭───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
      ┊ Node                               ┊ Resource                                        ┊ StoragePool                      ┊ VolNr ┊ MinorNr ┊ DeviceName    ┊ Allocated ┊ InUse  ┊    State ┊
      ╞═══════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╡
      ┊ r1 ┊ xcp-persistent-database                         ┊ xcp-sr-linstor_group_thin_device ┊     0 ┊    1000 ┊ /dev/drbd1000 ┊ 52.74 MiB ┊ InUse  ┊ UpToDate ┊
      ┊ r2 ┊ xcp-persistent-database                         ┊ xcp-sr-linstor_group_thin_device ┊     0 ┊    1000 ┊ /dev/drbd1000 ┊  6.99 MiB ┊ Unused ┊ UpToDate ┊
      ┊ r3 ┊ xcp-persistent-database                         ┊ xcp-sr-linstor_group_thin_device ┊     0 ┊    1000 ┊ /dev/drbd1000 ┊  6.99 MiB ┊        ┊  Unknown ┊
      

      (I didn't include the entire list of volumes. Writing this post, I feel a bit stupid for not having saved the full output.)

      I finally solved my issue by bringing node 3 back up, which promoted itself as master, but I need to run this test again because the result is not what I expected.

      Did I do something wrong?

      • olivierlambert (Vates 🪐 Co-Founder, CEO)

        The issue is that HA cannot write to the statefile 🤔 Have you changed the HA timeout duration?

        • henri9813

          @olivierlambert No.

          For once, I followed the installation steps carefully ^^'
