XCP-ng

    Working out why HA activated

    • bpbp-boop

      Hi, we had an issue over the weekend: 2 of the hosts in a 5-host pool rebooted due to HA activating, and I'm trying to work out what caused it.

      We're running XCP-ng 8.1, and the heartbeat / shared storage is NFS on FreeNAS.

      On the master I see this:

      Sep 27 01:27:44 compute-013 xapi: [debug||33 ha_monitor||xapi_ha] Liveset: online c179beef-47fe-44b1-98a1-cbc6fc5dca2d [ L  A ]; c1a44772-22fd-428e-ba96-958467d836a1 [*LM A ]; e6dcb3fb-484b-494a-a232-fbb5bd1ffc5a [ L  A ]; e30cdba9-f3e8-497a-8996-d794c7c19504 [ L  A ]; 2f9779ca-6fd8-48a1-892d-12d0adc1ce86 [ L  A ];
      Sep 27 01:27:44 compute-013 xapi: [debug||33 ha_monitor||xapi_ha] Processing warnings
      Sep 27 01:27:44 compute-013 xapi: [debug||33 ha_monitor||xapi_ha] Done with warnings
      Sep 27 01:27:44 compute-013 xapi: [debug||33 ha_monitor||xapi_ha_vm_failover] restart_auto_run_vms called
      Sep 27 01:27:44 compute-013 xapi: [debug||33 ha_monitor||xapi_ha_vm_failover] Setting all VMs running or paused to Halted
      Sep 27 01:27:44 compute-013 xapi: [debug||33 ha_monitor||xapi_ha_vm_failover] Protected VMs: (huge list of VMs)
      

      and on one of the hosts that restarted I found these weird log lines:

      Sep 27 01:28:43 compute-014 xapi: [debug||6 ha_monitor|HA monitor D:55db2d949753|xapi_ha] Liveset: online c179beef-47fe-44b1-98a1-cbc6fc5dca2d [*L  A ]; c1a44772-22fd-428e-ba96-958467d836a1 [ LM A ]; e6dcb3fb-484b-494a-a232-fbb5bd1ffc5a [ L  A ]; e30cdba9-f3e8-497a-8996-d794c7c19504 [ L  A ]; 2f9779ca-6fd8-48a1-892d-12d0adc1ce86 [ L  A ];
      Sep 27 01:28:43 compute-014 xapi: [debug||6 ha_monitor|HA monitor D:55db2d949753|xapi_ha] Processing warnings
      Sep 27 01:28:43 compute-014 xapi: [debug||6 ha_monitor|HA monitor D:55db2d949753|xapi_ha] Done with warnings
      Sep 27 01:28:43 compute-014 xapi: [debug||6 ha_monitor|HA monitor D:55db2d949753|xapi_ha] The node we think is the master is still alive and marked as master; this is OK
      Sep 27 01:28:44 compute-014 xapi: [ info||40018 UNIX /var/lib/xcp/xapi||cli] xe host-ha-xapi-healthcheck username=root password=(omitted)
      Sep 27 01:28:44 compute-014 xapi: [debug||40018 UNIX /var/lib/xcp/xapi|session.slave_local_login_with_password D:7f8d3b61600e|xapi] Add session to local storage
      Sep 27 01:28:46 compute-014 xcp-rrdd: [ info||7 ||rrdd_main] memfree has changed to 81024 in domain 21
      Sep 27 01:28:46 compute-014 xenopsd-xc: [debug||4 ||xenops_server] Received an event on managed VM ede7422e-e34f-d547-e27d-063aa20395b7
      Sep 27 01:28:46 compute-014 xenopsd-xc: [debug||4 |queue|xenops_server] Queue.push ["VM_check_state","ede7422e-e34f-d547-e27d-063aa20395b7"] onto ede7422e-e34f-d547-e27d-063aa20395b7:[  ]
      Sep 27 01:28:46 compute-014 xenopsd-xc: [debug||16 ||xenops_server] Queue.pop returned ["VM_check_state","ede7422e-e34f-d547-e27d-063aa20395b7"]
      Sep 27 01:28:46 compute-014 xenopsd-xc: [debug||16 |events|xenops_server] Task 110715 reference events: ["VM_check_state","ede7422e-e34f-d547-e27d-063aa20395b7"]
      Sep 27 01:28:46 compute-014 xenopsd-xc: [debug||16 |events|xenops_server] VM ede7422e-e34f-d547-e27d-063aa20395b7 is not requesting any attention
      Sep 27 01:28:46 compute-014 xenopsd-xc: [debug||16 |events|xenops_server] VM_DB.signal ede7422e-e34f-d547-e27d-063aa20395b7
      Sep 27 01:28:46 compute-014 xenopsd-xc: [debug||16 |events|task_server] Task 110715 completed; duration = 0
      Sep 27 01:28:46 compute-014 xenopsd-xc: [debug||16 ||xenops_server] TASK.signal 110715 (object deleted)
      Sep 27 01:28:46 compute-014 xapi: [debug||245 |org.xen.xapi.xenops.classic events D:66ada224147d|xenops] Processing event: ["Vm","ede7422e-e34f-d547-e27d-063aa20395b7"]
      Sep 27 01:28:46 compute-014 xapi: [debug||245 |org.xen.xapi.xenops.classic events D:66ada224147d|xenops] xenops event on VM ede7422e-e34f-d547-e27d-063aa20395b7
      Sep 27 01:28:46 compute-014 xenopsd-xc: [debug||196565 |org.xen.xapi.xenops.classic events D:66ada224147d|xenops_server] VM.stat ede7422e-e34f-d547-e27d-063aa20395b7
      Sep 27 01:28:46 compute-014 xapi: [debug||245 |org.xen.xapi.xenops.classic events D:66ada224147d|xenops] xenopsd event: processing event for VM ede7422e-e34f-d547-e27d-063aa20395b7
      Sep 27 01:28:46 compute-014 xapi: [debug||245 |org.xen.xapi.xenops.classic events D:66ada224147d|xenops] Supressing VM.allowed_operations update because guest_agent data is largely the same
      Sep 27 01:28:46 compute-014 xapi: [debug||245 |org.xen.xapi.xenops.classic events D:66ada224147d|xenops] xenopsd event: Updating VM ede7422e-e34f-d547-e27d-063aa20395b7 domid 21 guest_agent
      ^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@
      ^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@
      ^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@
      ^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@
      ^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@
      ^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@
      Sep 27 01:35:45 compute-014 xcp-rrdd: [debug||0 ||rrdd_main] Reading configuration file ..
      

      The switch they're all connected to doesn't have anything in its logs, so I don't think any ports flapped. Is there anywhere else I can check on the XCP-ng side to see what happened?

      Thanks

      • bpbp-boop

        Oh, it looks like FreeNAS did a pool scrub around that time. I wonder if I need to adjust HA so it doesn't panic when this happens.
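
        A quick way to confirm the scrub timing (assuming shell access to the FreeNAS box; "tank" below is a placeholder pool name) is the scan line in zpool status:

        # Run on the FreeNAS/TrueNAS host; the "scan:" line reports when the
        # last scrub started and finished, to line up against the HA event.
        zpool status tank | grep -A 2 'scan:'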

        • olivierlambert (Vates 🪐 Co-Founder & CEO)

          You can set a larger HA heartbeat timeout, if I remember correctly.

          Also, HA writes to a dedicated log file, maybe /var/log/xha.log, so there may be interesting info in there.
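
          Something like this (a rough sketch, untested; adjust the pattern to taste) should pull the interesting lines out of it:

          # Pull fencing/membership-related messages from the HA daemon's log
          # (path as above); -a treats the file as text even if it contains
          # stray binary bytes like the NULs seen in xensource.log.
          grep -aE 'warn|fence|liveset|membership' /var/log/xha.log | less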

          • bpbp-boop

            Looks like there was some kind of fight 😂

            Sep 27 01:28:39 AEST 2020 [warn] Host (0) and the local host do not agre on the view to the pool membership.
            [...]
            Sep 27 01:28:39 AEST 2020 [debug] FH: All hosts now have consistent view to the pool membership.
            **Sep 27 01:28:45 AEST 2020 [debug] FH: I have won.**
            Sep 27 01:28:45 AEST 2020 [debug] SM: Node (0) will be removed from liveset.
            

            Any idea where to change the HA timeouts, please?

            • olivierlambert (Vates 🪐 Co-Founder & CEO)

              E.g.: xe pool-ha-enable heartbeat-sr-uuids=<SR-UUID> ha-config:timeout=90
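
              In full, something like this (a sketch; the timeout is passed when HA is enabled, so plan for HA being briefly disabled, and double-check the SR UUID first):

              # Find the UUID of the NFS SR used for the heartbeat.
              xe sr-list type=nfs
              # Disable HA, then re-enable it with a 90-second timeout.
              xe pool-ha-disable
              xe pool-ha-enable heartbeat-sr-uuids=<SR-UUID> ha-config:timeout=90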
