Unable to enable HA on an XCP-ng 8.2.1 Compute Pool
-
Hi everyone,
I've encountered an issue I've never seen before: trying to enable HA on a compute pool fails with the error "INTERNAL_ERROR(Not_found)". NTP is enabled on all the computes, and all the computes are already in the pool. I have updated all the hosts in the pool to rule out a pending update, but nothing changed.
Below is the log output from the xensource log file from the poolmaster.
Mar 12 15:56:49 compute-01 xapi: [error||472789 /var/lib/xcp/xapi||backtrace] host.ha_join_liveset R:aa412db4d158 failed with exception Server_error(INTERNAL_ERROR, [ Not_found ])
Mar 12 15:57:04 compute-01 xapi: [error||472469 HTTPS 172.28.18.83->:::80|pool.enable_ha R:c2beab627da0|xapi_ha] Caught exception while calling Host.ha_join_liveset: 'compute-04' ('OpaqueRef:07de98bb-5cf2-4cd4-bda6-becbc843f413') INTERNAL_ERROR: [ Not_found ]
Mar 12 15:57:04 compute-01 xapi: [error||472469 HTTPS 172.28.18.83->:::80|pool.enable_ha R:c2beab627da0|xapi_ha] Attempting to disable HA pool-wide
Mar 12 15:57:04 compute-01 xapi: [debug||472469 HTTPS 172.28.18.83->:::80|pool.enable_ha R:c2beab627da0|xapi_ha] Disabling HA on the Pool
Mar 12 15:57:04 compute-01 xapi: [debug||472469 :::80||xapi_ha] Disabling HA, so also disabling writing to redo-log
Mar 12 15:57:04 compute-01 xapi: [debug||472469 :::80||helpers] about to call script: /usr/libexec/xapi/cluster-stack/xhad/ha_query_liveset
Mar 12 15:57:11 compute-01 xapi: [debug||472469 :::80||xapi_ha] Caught exception while enabling HA: INTERNAL_ERROR: [ Not_found ]
Mar 12 15:57:11 compute-01 xapi: [error||472469 :::80||backtrace] pool.enable_ha R:c2beab627da0 failed with exception Server_error(INTERNAL_ERROR, [ Not_found ])
Mar 12 15:57:11 compute-01 xapi: [error||472469 :::80||backtrace] 1/14 xapi Raised at file ocaml/xapi/xapi_ha.ml, line 1972
Mar 12 15:57:11 compute-01 xapi: [error||472469 :::80||backtrace] 4/14 xapi Called from file ocaml/xapi/xapi_ha.ml, line 1918
Mar 12 15:57:11 compute-01 xapi: [error||472469 :::80||backtrace] 5/14 xapi Called from file ocaml/xapi/xapi_ha.ml, line 2026
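In an excerpt like this, the useful detail is which host the failing call was aimed at. The following is a minimal Python sketch for pulling that out of a longer log. It assumes each entry has the layout seen above (timestamp, host name, "xapi:", a bracketed tag whose first field is the level and last field is the module, then the message). The function names `parse_entry` and `failing_targets` are made up for illustration and are not part of any XCP-ng tooling.

```python
import re

# Assumed entry layout, based only on the lines quoted in this thread:
# "<Mon> <day> <time> <host> xapi: [<level>|...|<module>] <message>"
ENTRY = re.compile(
    r"(?P<ts>[A-Z][a-z]{2} +\d+ \d\d:\d\d:\d\d) (?P<host>\S+) xapi: "
    r"\[(?P<tag>[^\]]*)\] (?P<msg>.*)"
)

# Matches messages such as:
# "Caught exception while calling Host.ha_join_liveset: 'compute-04' (...)"
FAILED_CALL = re.compile(
    r"Caught exception while calling (?P<call>[\w.]+): '(?P<target>[^']+)'"
)


def parse_entry(line):
    """Split one log line into a dict, or return None if it does not match."""
    m = ENTRY.match(line)
    if m is None:
        return None
    fields = m.group("tag").split("|")
    return {
        "timestamp": m.group("ts"),
        "host": m.group("host"),
        "level": fields[0],    # e.g. "error" or "debug"
        "module": fields[-1],  # e.g. "xapi_ha" or "backtrace"
        "message": m.group("msg"),
    }


def failing_targets(lines):
    """Return (call, target host) pairs found in error-level entries."""
    targets = []
    for line in lines:
        entry = parse_entry(line)
        if entry is None or entry["level"] != "error":
            continue
        m = FAILED_CALL.search(entry["message"])
        if m:
            targets.append((m.group("call"), m.group("target")))
    return targets
```

Run over the lines quoted above, this would single out the `Host.ha_join_liveset` call against 'compute-04', which is the host whose own log is worth reading next.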
If anyone has any pointers on how to resolve this, it would be appreciated.
2025-03-12T12_57_11.614Z - XO.txt -
@Denson I also have a shared NFS SR already configured on the pool.
-
Do you have a halted or removed host in your pool?
-
@olivierlambert I added a compute to the pool. It added successfully but HA won't come up.
-
Check the pool/host view to double-check that every host is visible and correctly connected. Do the same for all shared SRs in that pool.
-
@olivierlambert The pool is showing all hosts as connected, and the same goes for all the shared SRs. There are even VMs running on the new compute host. The only thing not working is enabling HA.
One of the SRs is showing that HA is configured. Should that be the case when HA is not enabled yet?
-
This only shows which SR will be used with HA, nothing more.
The issue seems to come from the host "compute-04" when it's trying to enable HA. I would check this host's log in more detail.
-
@olivierlambert The notable error on that compute is that, for some reason, it can't find the UUID of the pool master.
Mar 13 14:12:00 compute-04 xapi: [error||16119 HTTPS 172.21.17.96->:::80|host.ha_join_liveset R:c44ae729771f|xapi_ha] Failed to find the UUID address of host with address 172.21.18.96
Mar 13 14:12:00 compute-04 xapi: [error||16119 :::80||backtrace] host.ha_join_liveset R:c44ae729771f failed with exception Not_found
Mar 13 14:12:00 compute-04 xapi: [error||16119 :::80||backtrace] 1/12 xapi Raised at file hashtbl.ml, line 194
172.21.17.96 is the management IP of the pool master (compute-01), while 172.21.18.96 is the IP of the same host on its storage interface.
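To check this kind of mismatch across a longer log, here is a rough Python sketch that collects every address from "Failed to find the UUID address of host with address ..." messages and reports the ones that are not a known management address. The `management_addresses` mapping is an assumption: it has to be filled in by hand, for example from the output of `xe host-list params=name-label,address`. The function name `unmatched_lookups` is made up for illustration. The sketch only confirms that the address being looked up is not a management address; it does not explain why compute-04 is asking for the storage address.

```python
import re

# Message text as it appears in the compute-04 log quoted above.
LOOKUP_FAILURE = re.compile(
    r"Failed to find the UUID address of host with address "
    r"(?P<addr>\d{1,3}(?:\.\d{1,3}){3})"
)


def unmatched_lookups(log_lines, management_addresses):
    """Return addresses from failed lookups that are not management IPs.

    management_addresses: dict of host name -> management IP, supplied by
    the caller (hypothetical input, not read from the pool by this code).
    """
    known = set(management_addresses.values())
    missing = []
    for line in log_lines:
        m = LOOKUP_FAILURE.search(line)
        if m is None:
            continue
        addr = m.group("addr")
        # Keep first-seen order and skip repeats of the same address.
        if addr not in known and addr not in missing:
            missing.append(addr)
    return missing


if __name__ == "__main__":
    # Only the master's management IP is known from this thread.
    management = {"compute-01": "172.21.17.96"}
    lines = [
        "Mar 13 14:12:00 compute-04 xapi: [error||16119 HTTPS 172.21.17.96->:::80"
        "|host.ha_join_liveset R:c44ae729771f|xapi_ha] Failed to find the UUID "
        "address of host with address 172.21.18.96",
    ]
    print(unmatched_lookups(lines, management))
```

With the single line from this thread and only compute-01's management IP in the mapping, the address reported is 172.21.18.96, the storage interface address.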
-
On this problematic host, can you run a simple command like "xe vm-list"? Does it work?
-
@Denson Are all hosts properly time synchronized to NTP? Make sure they are all within reasonable limits of each other.
Might be a network thing -- are all interfaces configured alike on all hosts? Can the hosts all ping each other?