Working out why HA activated
-
Hi, we had an issue over the weekend: two of the hosts in a 5-host pool rebooted due to HA activating, and I'm trying to work out what caused it.
We're running XCP-ng 8.1, and the heartbeat / shared storage is NFS on FreeNAS.
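For context, the pool-level HA settings can be dumped with something like this (parameter names from memory, so worth double-checking):
xe pool-list params=ha-enabled,ha-configuration,ha-statefiles,ha-host-failures-to-tolerate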
On the master I see this:
Sep 27 01:27:44 compute-013 xapi: [debug||33 ha_monitor||xapi_ha] Liveset: online c179beef-47fe-44b1-98a1-cbc6fc5dca2d [ L A ]; c1a44772-22fd-428e-ba96-958467d836a1 [*LM A ]; e6dcb3fb-484b-494a-a232-fbb5bd1ffc5a [ L A ]; e30cdba9-f3e8-497a-8996-d794c7c19504 [ L A ]; 2f9779ca-6fd8-48a1-892d-12d0adc1ce86 [ L A ];
Sep 27 01:27:44 compute-013 xapi: [debug||33 ha_monitor||xapi_ha] Processing warnings
Sep 27 01:27:44 compute-013 xapi: [debug||33 ha_monitor||xapi_ha] Done with warnings
Sep 27 01:27:44 compute-013 xapi: [debug||33 ha_monitor||xapi_ha_vm_failover] restart_auto_run_vms called
Sep 27 01:27:44 compute-013 xapi: [debug||33 ha_monitor||xapi_ha_vm_failover] Setting all VMs running or paused to Halted
Sep 27 01:27:44 compute-013 xapi: [debug||33 ha_monitor||xapi_ha_vm_failover] Protected VMs: (huge list of VMs)
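These are from /var/log/xensource.log on the master; something like the following pulls out just the HA monitor lines (default log path assumed):
grep ha_monitor /var/log/xensource.log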
On one of the hosts that restarted I found these weird log lines:
Sep 27 01:28:43 compute-014 xapi: [debug||6 ha_monitor|HA monitor D:55db2d949753|xapi_ha] Liveset: online c179beef-47fe-44b1-98a1-cbc6fc5dca2d [*L A ]; c1a44772-22fd-428e-ba96-958467d836a1 [ LM A ]; e6dcb3fb-484b-494a-a232-fbb5bd1ffc5a [ L A ]; e30cdba9-f3e8-497a-8996-d794c7c19504 [ L A ]; 2f9779ca-6fd8-48a1-892d-12d0adc1ce86 [ L A ];
Sep 27 01:28:43 compute-014 xapi: [debug||6 ha_monitor|HA monitor D:55db2d949753|xapi_ha] Processing warnings
Sep 27 01:28:43 compute-014 xapi: [debug||6 ha_monitor|HA monitor D:55db2d949753|xapi_ha] Done with warnings
Sep 27 01:28:43 compute-014 xapi: [debug||6 ha_monitor|HA monitor D:55db2d949753|xapi_ha] The node we think is the master is still alive and marked as master; this is OK
Sep 27 01:28:44 compute-014 xapi: [ info||40018 UNIX /var/lib/xcp/xapi||cli] xe host-ha-xapi-healthcheck username=root password=(omitted)
Sep 27 01:28:44 compute-014 xapi: [debug||40018 UNIX /var/lib/xcp/xapi|session.slave_local_login_with_password D:7f8d3b61600e|xapi] Add session to local storage
Sep 27 01:28:46 compute-014 xcp-rrdd: [ info||7 ||rrdd_main] memfree has changed to 81024 in domain 21
Sep 27 01:28:46 compute-014 xenopsd-xc: [debug||4 ||xenops_server] Received an event on managed VM ede7422e-e34f-d547-e27d-063aa20395b7
Sep 27 01:28:46 compute-014 xenopsd-xc: [debug||4 |queue|xenops_server] Queue.push ["VM_check_state","ede7422e-e34f-d547-e27d-063aa20395b7"] onto ede7422e-e34f-d547-e27d-063aa20395b7:[ ]
Sep 27 01:28:46 compute-014 xenopsd-xc: [debug||16 ||xenops_server] Queue.pop returned ["VM_check_state","ede7422e-e34f-d547-e27d-063aa20395b7"]
Sep 27 01:28:46 compute-014 xenopsd-xc: [debug||16 |events|xenops_server] Task 110715 reference events: ["VM_check_state","ede7422e-e34f-d547-e27d-063aa20395b7"]
Sep 27 01:28:46 compute-014 xenopsd-xc: [debug||16 |events|xenops_server] VM ede7422e-e34f-d547-e27d-063aa20395b7 is not requesting any attention
Sep 27 01:28:46 compute-014 xenopsd-xc: [debug||16 |events|xenops_server] VM_DB.signal ede7422e-e34f-d547-e27d-063aa20395b7
Sep 27 01:28:46 compute-014 xenopsd-xc: [debug||16 |events|task_server] Task 110715 completed; duration = 0
Sep 27 01:28:46 compute-014 xenopsd-xc: [debug||16 ||xenops_server] TASK.signal 110715 (object deleted)
Sep 27 01:28:46 compute-014 xapi: [debug||245 |org.xen.xapi.xenops.classic events D:66ada224147d|xenops] Processing event: ["Vm","ede7422e-e34f-d547-e27d-063aa20395b7"]
Sep 27 01:28:46 compute-014 xapi: [debug||245 |org.xen.xapi.xenops.classic events D:66ada224147d|xenops] xenops event on VM ede7422e-e34f-d547-e27d-063aa20395b7
Sep 27 01:28:46 compute-014 xenopsd-xc: [debug||196565 |org.xen.xapi.xenops.classic events D:66ada224147d|xenops_server] VM.stat ede7422e-e34f-d547-e27d-063aa20395b7
Sep 27 01:28:46 compute-014 xapi: [debug||245 |org.xen.xapi.xenops.classic events D:66ada224147d|xenops] xenopsd event: processing event for VM ede7422e-e34f-d547-e27d-063aa20395b7
Sep 27 01:28:46 compute-014 xapi: [debug||245 |org.xen.xapi.xenops.classic events D:66ada224147d|xenops] Supressing VM.allowed_operations update because guest_agent data is largely the same
Sep 27 01:28:46 compute-014 xapi: [debug||245 |org.xen.xapi.xenops.classic events D:66ada224147d|xenops] xenopsd event: Updating VM ede7422e-e34f-d547-e27d-063aa20395b7 domid 21 guest_agent ^@^@^@^@ [... long run of ^@ (NUL) bytes truncated ...]
Sep 27 01:35:45 compute-014 xcp-rrdd: [debug||0 ||rrdd_main] Reading configuration file ..
The switch they're all connected to doesn't have anything in its logs, so I don't think any ports flapped. Is there anywhere else I can check on the XCP-ng side to see what happened?
Thanks
-
Oh, it looks like FreeNAS did a pool scrub around that time. I wonder if I need to adjust HA so it doesn't panic when this happens.
-
You can set a larger HA timeout on the heartbeat if I remember correctly.
Also, HA might use a dedicated log file, /var/log/xha.log, so there may be interesting info in there.
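Something along these lines should pull out the interesting entries (path and patterns from memory, adjust as needed):
grep -E 'warn|FH:|SM:' /var/log/xha.log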
-
Looks like there was some kind of fight:
Sep 27 01:28:39 AEST 2020 [warn] Host (0) and the local host do not agre on the view to the pool membership. [...]
Sep 27 01:28:39 AEST 2020 [debug] FH: All hosts now have consistent view to the pool membership.
**Sep 27 01:28:45 AEST 2020 [debug] FH: I have won.**
Sep 27 01:28:45 AEST 2020 [debug] SM: Node (0) will be removed from liveset.
Any idea where to change the HA timeouts please?
-
E.g.:
xe pool-ha-enable heartbeat-sr-uuids=<SR-UUID> ha-config:timeout=90
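Note that if HA is already enabled on the pool you'll likely need to disable it first and re-enable it with the new timeout, roughly like this (from memory, so double-check against the docs):
xe pool-ha-disable
xe sr-list type=nfs params=uuid,name-label
xe pool-ha-enable heartbeat-sr-uuids=<SR-UUID> ha-config:timeout=90
The sr-list step is just there to look up the UUID of your NFS heartbeat SR.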