XCP-ng

    Unable to enable High Availability - INTERNAL_ERROR(Not_found)

    32 Posts 6 Posters 430 Views 5 Watching
    • tjkreidlT Offline
      tjkreidl Ambassador @olivierlambert
      last edited by

      @olivierlambert Good idea. Also, they should make sure all hosts are at the same update/patch levels, the network is set up properly among the three or more hosts, there is a compatible HA shared storage properly set up, etc.
      You folks have a good guide at: https://docs.xcp-ng.org/management/ha/

      • J Offline
        jmannik
        last edited by olivierlambert

        Well, this is what I'm getting now:

        {
          "id": "0mhbgkupy",
          "properties": {
            "method": "pool.enableHa",
            "params": {
              "pool": "213186d2-e3ba-154f-d371-4122388deb83",
              "heartbeatSrs": [
                "381caeb2-5ad9-8924-365d-4b130c67c064"
              ],
              "configuration": {}
            },
            "name": "API call: pool.enableHa",
            "userId": "71d48027-d471-4b01-83f9-830df4279f7e",
            "type": "api.call"
          },
          "start": 1761709884550,
          "status": "failure",
          "updatedAt": 1761709923544,
          "end": 1761709923544,
          "result": {
            "code": "INTERNAL_ERROR",
            "params": [
              "unable to gather the coordinator's UUID: Not_found"
            ],
            "call": {
              "duration": 38993,
              "method": "pool.enable_ha",
              "params": [
                "* session id *",
                [
                  "OpaqueRef:a83a416f-c97d-1ed8-c7fc-213af89b8f86"
                ],
                {}
              ]
            },
            "message": "INTERNAL_ERROR(unable to gather the coordinator's UUID: Not_found)",
            "name": "XapiError",
            "stack": "XapiError: INTERNAL_ERROR(unable to gather the coordinator's UUID: Not_found)\n    at Function.wrap (file:///opt/xen-orchestra/packages/xen-api/_XapiError.mjs:16:12)\n    at file:///opt/xen-orchestra/packages/xen-api/transports/json-rpc.mjs:38:21\n    at runNextTicks (node:internal/process/task_queues:65:5)\n    at processImmediate (node:internal/timers:453:9)\n    at process.callbackTrampoline (node:internal/async_hooks:130:17)"
          }
        }
        
        • olivierlambertO Offline
          olivierlambert Vates 🪐 Co-Founder CEO
          last edited by

          That's better 🙂 @psafont now we know we are missing a UUID somewhere?

          • psafontP Offline
            psafont Vates 🪐 XAPI & Network Team @jmannik
            last edited by psafont

            @jmannik
            So the problem goes like this:

            • HA uses a local-only database to avoid depending on the normal pool database
            • This database contains a mapping from UUID to the IP host_address for all hosts in an HA cluster / pool. This information should be gathered right before HA is enabled, from the normal database.
            • When trying to enable HA, the host fetches the coordinator's address from the filesystem. Then it uses the previous mapping and the coordinator address to find the coordinator's UUID. This step fails.
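The lookup in that last step can be sketched in shell. Everything here is made up for illustration (the real mapping lives in xapi's local HA database, and the UUIDs below are fake placeholders); the point is only that an address read from disk that is absent from the mapping produces Not_found:

```shell
# Sketch only: stand-in for the local HA database's address -> UUID mapping.
# The UUIDs are fake placeholders, not real values.
lookup_uuid() {
  case "$1" in
    192.168.10.13) echo "00000000-0000-0000-0000-000000000001" ;;
    192.168.10.12) echo "00000000-0000-0000-0000-000000000002" ;;
    *) echo "" ;;  # address not in the mapping
  esac
}

# Coordinator address as read from the pool role file ("slave:<addr>"):
coordinator_addr="192.168.30.13"

uuid=$(lookup_uuid "$coordinator_addr")
if [ -n "$uuid" ]; then
  echo "coordinator UUID: $uuid"
else
  echo "INTERNAL_ERROR: unable to gather the coordinator's UUID: Not_found"
fi
```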

            I'm not sure what is actually happening, but some scenarios come to mind:

            • XO isn't calling the API function Host.preconfigure_ha, which means the local database is not created (unlikely)
            • The coordinator's address has somehow changed between the local database being written and the HA being enabled

            Things to check:

            • inspect the value that the failing host holds for the host_address of the coordinator / master host, both in:
              1. the normal database. You can SSH into the failing host and run the following command, replacing POOL_UUID with the actual UUID (alternatively, delete POOL_UUID, place the cursor after the = and press Tab twice to autocomplete it):
            xe pool-param-get uuid=POOL_UUID param-name=master | xargs -I _ xe host-param-get uuid=_ param-name=address
            
              2. the pool role file. Similarly, SSH into the failing host and run:
            cat /etc/xensource/pool.conf
            

            Let us know how it goes. If the IPs don't match, there's a problem with the member's configuration; if they do match, the local database is outdated and should be refreshed before enabling HA. I don't know how XO handles that.
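For reference, pool.conf is tiny: it contains either the single word `master` or `slave:<coordinator address>`. A small parse sketch, with sample content assumed (the real file is /etc/xensource/pool.conf):

```shell
# Parse a pool.conf value; sample content assumed here, not read from disk.
conf="slave:192.168.30.13"

case "$conf" in
  master)  role="master"; addr="(this host)" ;;
  slave:*) role="member"; addr="${conf#slave:}" ;;
esac

echo "role=$role coordinator=$addr"
```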

            • olivierlambertO Offline
              olivierlambert Vates 🪐 Co-Founder CEO
              last edited by

              @psafont I'm not sure I follow, I don't remember seeing any documented endpoint related to preparing HA 🤔

              • psafontP Offline
                psafont Vates 🪐 XAPI & Network Team @olivierlambert
                last edited by

                @olivierlambert The call is indeed hidden from the docs, and only callable from inside a pool... it's called as part of Pool.enable_ha

                • olivierlambertO Offline
                  olivierlambert Vates 🪐 Co-Founder CEO
                  last edited by

                  So we probably need to tell XO team the "right way" to enable HA because there's no way to know from "outside" 😓

                  • psafontP Offline
                    psafont Vates 🪐 XAPI & Network Team @olivierlambert
                    last edited by

                    @olivierlambert

                    So we probably need to tell XO team the "right way" to enable HA because there's no way to know from "outside"

                    I don't think so, it's not meant to be called from outside: xapi makes the call automatically, so it's xapi's responsibility to make that call

                    • J Offline
                      jmannik @psafont
                      last edited by

                      @psafont
                      [22:13 vmhost13 ~]# xe pool-param-get uuid=213186d2-e3ba-154f-d371-4122388deb83 param-name=master | xargs -I _ xe host-param-get uuid=_ param-name=address
                      192.168.10.13
                      [22:13 vmhost13 ~]# cat /etc/xensource/pool.conf
                      master[22:14 vmhost13 ~]#

                      • psafontP Offline
                        psafont Vates 🪐 XAPI & Network Team @jmannik
                        last edited by

                        @jmannik Could you collect the file contents of /etc/xensource/pool.conf from all the other hosts? The command is failing in one of them, not on the master host.
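One quick way to gather those without logging into each host by hand is to generate the collection commands in a loop. The hostnames here are hypothetical (substitute your actual pool members), and the loop only prints the commands so they can be reviewed before running (pipe to `sh` to actually execute them):

```shell
# Hypothetical member hostnames; replace with your actual pool members.
hosts="vmhost11 vmhost12 vmhost13"

# Print one collection command per host; review, then pipe to `sh` to run.
for h in $hosts; do
  echo "ssh root@$h cat /etc/xensource/pool.conf"
done
```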

                        • J Offline
                          jmannik @jmannik
                          last edited by olivierlambert

                          [22:27 vmhost12 ~]# xe pool-param-get uuid=213186d2-e3ba-154f-d371-4122388deb83 param-name=master | xargs -I _ xe host-param-get uuid=_ param-name=address
                          192.168.10.13
                          [22:27 vmhost12 ~]# cat /etc/xensource/pool.conf
                          slave:192.168.30.13[22:27 vmhost12 ~]#
                          
                          [22:27 vmhost11 ~]# xe pool-param-get uuid=213186d2-e3ba-154f-d371-4122388deb83  param-name=master | xargs -I _ xe host-param-get uuid=_ param-name=address
                          192.168.10.13
                          [22:28 vmhost11 ~]# cat /etc/xensource/pool.conf
                          slave:192.168.30.13[22:28 vmhost11 ~]#
                          

                          I think I see where the issue is; not sure how to solve it, though.

                          • psafontP Offline
                            psafont Vates 🪐 XAPI & Network Team @jmannik
                            last edited by

                             @jmannik The IPs match, so now I don't have an explanation for why this is happening. I'll take another look at the codepath, but that'll have to wait a while, as work is piling up

                            • J Offline
                              jmannik @jmannik
                              last edited by

                               OK, so in this process I have come across a recurring issue I have had with XCP-ng where it will put the Ethernet interfaces in the wrong order.
                               Each of my hosts has a 1gbit interface onboard, then a 4-port 10gbit card.
                               It SHOULD be ordering the interfaces like so:
                              ETH0 1gbit
                              ETH1 10gbit
                              ETH2 10gbit
                              ETH3 10gbit
                              ETH4 10gbit

                              But it will randomly decide upon install (VMHost11 was recently rebuilt due to an id10t pebkac issue) to order them like below for no apparent reason:

                              ETH0 10gbit
                              ETH1 1gbit
                              ETH2 10gbit
                              ETH3 10gbit
                              ETH4 10gbit

                               And re-ordering the interfaces is just a lot more difficult than I think it should be.

                              • J Offline
                                jmannik @psafont
                                last edited by

                                @psafont said in Unable to enable High Availability - INTERNAL_ERROR(Not_found):

                                 @jmannik The IPs match, so now I don't have an explanation for why this is happening. I'll take another look at the codepath, but that'll have to wait a while, as work is piling up

                                 Ahh, but they don't match.
                                VMHost13 lists 192.168.10.13
                                VMHost12 and VMHost11 list 192.168.30.13
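That comparison can be written down as a quick check, with the values copied from the outputs earlier in the thread (on a live pool they would come from `xe host-param-get ... param-name=address` and each member's pool.conf):

```shell
# Values copied from the thread's console outputs.
db_master_addr="192.168.10.13"
member_conf_addr="192.168.30.13"   # the <addr> part of "slave:<addr>" on the members

if [ "$db_master_addr" = "$member_conf_addr" ]; then
  echo "match: members point at the address the database reports"
else
  echo "mismatch: database says $db_master_addr, members say $member_conf_addr"
fi
```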

                                • psafontP Offline
                                  psafont Vates 🪐 XAPI & Network Team @jmannik
                                  last edited by

                                   @jmannik ah, indeed. Do you know which server / interface holds the IP 192.168.30.13? I suspect it is still VMHost13, but a different interface.

                                   Until the members have configured their master as 192.168.30.13, you'll have this error. This can be done with a single call, but since it's a delicate operation, it's better if there are no operations running on the pool. SSH into VMHost13 and run:

                                  xe host-list name-label=VMHost13 --minimal | xargs -I _ xe pool-designate-new-master host-uuid=_
                                  

                                   This should write the new IP to the role files of all the pool members and stop this issue from blocking HA.
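After the re-designation completes, each member's role file can be checked against the expected address. A sketch with sample values (both values here are hypothetical; substitute the coordinator address your database reports and what `cat /etc/xensource/pool.conf` prints on a member):

```shell
# Hypothetical expected coordinator address and a sample member pool.conf value.
expected_addr="192.168.10.13"
conf="slave:192.168.10.13"   # sample output of `cat /etc/xensource/pool.conf`

if [ "$conf" = "slave:$expected_addr" ]; then
  echo "member points at the expected coordinator"
else
  echo "member still points elsewhere: $conf"
fi
```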

                                  • J Offline
                                    jmannik @psafont
                                    last edited by

                                     @psafont Would designating a new pool master do the same thing?
                                     I ran the above command and it's had no effect

                                    • J Offline
                                      jmannik @jmannik
                                      last edited by

                                       @jmannik said in Unable to enable High Availability - INTERNAL_ERROR(Not_found):

                                      @psafont Would designating a new pool master do the same thing?
                                      I ran the above command and its had no effect

                                       Well, I tried changing the pool master, and when VMHost11 was the master I was able to enable HA.
                                       Switching back to VMHost13 as the master now, so we'll see how that goes.
