Working out why HA activated
-
Hi, we had an issue over the weekend: two of the hosts in a 5-host pool rebooted due to HA activating, and I'm trying to work out what caused it.
We're running XCP-ng 8.1, and the heartbeat / shared storage is NFS on FreeNAS.
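For context, the pool-level HA settings can be dumped with something like this (parameter names from memory, so worth double-checking):
xe pool-list params=ha-enabled,ha-configuration,ha-statefiles,ha-host-failures-to-tolerate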
On the master I see this:
Sep 27 01:27:44 compute-013 xapi: [debug||33 ha_monitor||xapi_ha] Liveset: online c179beef-47fe-44b1-98a1-cbc6fc5dca2d [ L A ]; c1a44772-22fd-428e-ba96-958467d836a1 [*LM A ]; e6dcb3fb-484b-494a-a232-fbb5bd1ffc5a [ L A ]; e30cdba9-f3e8-497a-8996-d794c7c19504 [ L A ]; 2f9779ca-6fd8-48a1-892d-12d0adc1ce86 [ L A ];
Sep 27 01:27:44 compute-013 xapi: [debug||33 ha_monitor||xapi_ha] Processing warnings
Sep 27 01:27:44 compute-013 xapi: [debug||33 ha_monitor||xapi_ha] Done with warnings
Sep 27 01:27:44 compute-013 xapi: [debug||33 ha_monitor||xapi_ha_vm_failover] restart_auto_run_vms called
Sep 27 01:27:44 compute-013 xapi: [debug||33 ha_monitor||xapi_ha_vm_failover] Setting all VMs running or paused to Halted
Sep 27 01:27:44 compute-013 xapi: [debug||33 ha_monitor||xapi_ha_vm_failover] Protected VMs: (huge list of VMs)
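These are from /var/log/xensource.log on the master; something like the following pulls out just the HA monitor lines (default log path assumed):
grep ha_monitor /var/log/xensource.log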
On one of the hosts that restarted I found these weird log lines:
Sep 27 01:28:43 compute-014 xapi: [debug||6 ha_monitor|HA monitor D:55db2d949753|xapi_ha] Liveset: online c179beef-47fe-44b1-98a1-cbc6fc5dca2d [*L A ]; c1a44772-22fd-428e-ba96-958467d836a1 [ LM A ]; e6dcb3fb-484b-494a-a232-fbb5bd1ffc5a [ L A ]; e30cdba9-f3e8-497a-8996-d794c7c19504 [ L A ]; 2f9779ca-6fd8-48a1-892d-12d0adc1ce86 [ L A ];
Sep 27 01:28:43 compute-014 xapi: [debug||6 ha_monitor|HA monitor D:55db2d949753|xapi_ha] Processing warnings
Sep 27 01:28:43 compute-014 xapi: [debug||6 ha_monitor|HA monitor D:55db2d949753|xapi_ha] Done with warnings
Sep 27 01:28:43 compute-014 xapi: [debug||6 ha_monitor|HA monitor D:55db2d949753|xapi_ha] The node we think is the master is still alive and marked as master; this is OK
Sep 27 01:28:44 compute-014 xapi: [ info||40018 UNIX /var/lib/xcp/xapi||cli] xe host-ha-xapi-healthcheck username=root password=(omitted)
Sep 27 01:28:44 compute-014 xapi: [debug||40018 UNIX /var/lib/xcp/xapi|session.slave_local_login_with_password D:7f8d3b61600e|xapi] Add session to local storage
Sep 27 01:28:46 compute-014 xcp-rrdd: [ info||7 ||rrdd_main] memfree has changed to 81024 in domain 21
Sep 27 01:28:46 compute-014 xenopsd-xc: [debug||4 ||xenops_server] Received an event on managed VM ede7422e-e34f-d547-e27d-063aa20395b7
Sep 27 01:28:46 compute-014 xenopsd-xc: [debug||4 |queue|xenops_server] Queue.push ["VM_check_state","ede7422e-e34f-d547-e27d-063aa20395b7"] onto ede7422e-e34f-d547-e27d-063aa20395b7:[ ]
Sep 27 01:28:46 compute-014 xenopsd-xc: [debug||16 ||xenops_server] Queue.pop returned ["VM_check_state","ede7422e-e34f-d547-e27d-063aa20395b7"]
Sep 27 01:28:46 compute-014 xenopsd-xc: [debug||16 |events|xenops_server] Task 110715 reference events: ["VM_check_state","ede7422e-e34f-d547-e27d-063aa20395b7"]
Sep 27 01:28:46 compute-014 xenopsd-xc: [debug||16 |events|xenops_server] VM ede7422e-e34f-d547-e27d-063aa20395b7 is not requesting any attention
Sep 27 01:28:46 compute-014 xenopsd-xc: [debug||16 |events|xenops_server] VM_DB.signal ede7422e-e34f-d547-e27d-063aa20395b7
Sep 27 01:28:46 compute-014 xenopsd-xc: [debug||16 |events|task_server] Task 110715 completed; duration = 0
Sep 27 01:28:46 compute-014 xenopsd-xc: [debug||16 ||xenops_server] TASK.signal 110715 (object deleted)
Sep 27 01:28:46 compute-014 xapi: [debug||245 |org.xen.xapi.xenops.classic events D:66ada224147d|xenops] Processing event: ["Vm","ede7422e-e34f-d547-e27d-063aa20395b7"]
Sep 27 01:28:46 compute-014 xapi: [debug||245 |org.xen.xapi.xenops.classic events D:66ada224147d|xenops] xenops event on VM ede7422e-e34f-d547-e27d-063aa20395b7
Sep 27 01:28:46 compute-014 xenopsd-xc: [debug||196565 |org.xen.xapi.xenops.classic events D:66ada224147d|xenops_server] VM.stat ede7422e-e34f-d547-e27d-063aa20395b7
Sep 27 01:28:46 compute-014 xapi: [debug||245 |org.xen.xapi.xenops.classic events D:66ada224147d|xenops] xenopsd event: processing event for VM ede7422e-e34f-d547-e27d-063aa20395b7
Sep 27 01:28:46 compute-014 xapi: [debug||245 |org.xen.xapi.xenops.classic events D:66ada224147d|xenops] Supressing VM.allowed_operations update because guest_agent data is largely the same
Sep 27 01:28:46 compute-014 xapi: [debug||245 |org.xen.xapi.xenops.classic events D:66ada224147d|xenops] xenopsd event: Updating VM ede7422e-e34f-d547-e27d-063aa20395b7 domid 21 guest_agent ^@^@^@^@ [... long run of ^@ (NUL) bytes truncated ...]
Sep 27 01:35:45 compute-014 xcp-rrdd: [debug||0 ||rrdd_main] Reading configuration file ..
The switch they're all connected to doesn't have anything in its logs, so I don't think any ports flapped. Is there anywhere else I can check on the XCP-ng side to see what happened?
Thanks
-
Oh, it looks like FreeNAS did a pool scrub around that time. I wonder if I need to adjust HA so it doesn't panic when this happens.
-
You can set a larger HA timeout on the heartbeat if I remember correctly.
Also, HA might use a dedicated log file, /var/log/xha.log, so there may be interesting info in there.
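Something along these lines should pull out the interesting entries (path and patterns from memory, adjust as needed):
grep -E 'warn|FH:|SM:' /var/log/xha.log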
-
Looks like there was some kind of fight:
Sep 27 01:28:39 AEST 2020 [warn] Host (0) and the local host do not agre on the view to the pool membership. [...]
Sep 27 01:28:39 AEST 2020 [debug] FH: All hosts now have consistent view to the pool membership.
**Sep 27 01:28:45 AEST 2020 [debug] FH: I have won.**
Sep 27 01:28:45 AEST 2020 [debug] SM: Node (0) will be removed from liveset.
Any idea where to change the HA timeouts please?
-
E.g.:
xe pool-ha-enable heartbeat-sr-uuids=<SR-UUID> ha-config:timeout=90
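Note that if HA is already enabled on the pool you'll likely need to disable it first and re-enable it with the new timeout, roughly like this (from memory, so double-check against the docs):
xe pool-ha-disable
xe sr-list type=nfs params=uuid,name-label
xe pool-ha-enable heartbeat-sr-uuids=<SR-UUID> ha-config:timeout=90
The sr-list step is just there to look up the UUID of your NFS heartbeat SR.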