XCP-ng

    Unable to enable High Availability - INTERNAL_ERROR(Not_found)

    • jmannik @olivierlambert

      @olivierlambert

      [18:15 vmhost13 ~]# xe pool-ha-enable heartbeat-sr-uuids=381caeb2-5ad9-8924-365d-4b130c67c064
      The server failed to handle your request, due to an internal error. The given message may give details useful for debugging the problem.
      message: Not_found

      • olivierlambert Vates 🪐 Co-Founder CEO

        That's weird. Ping @Team-XAPI-Network and maybe directly @psafont

        • andriy.sultanov Vates 🪐 XAPI & Network Team @jmannik

          @jmannik Please upload your /var/log/xensource.log from the time of the error; otherwise it's hard to see what went wrong.

          • psafont Vates 🪐 XAPI & Network Team @jmannik

            @jmannik said in Unable to enable High Availability - INTERNAL_ERROR(Not_found):

            @olivierlambert

            [18:15 vmhost13 ~]# xe pool-ha-enable heartbeat-sr-uuids=381caeb2-5ad9-8924-365d-4b130c67c064
            The server failed to handle your request, due to an internal error. The given message may give details useful for debugging the problem.
            message: Not_found

            That message is produced by the Not_found exception, which is commonly raised by List.find and List.assoc. In this case the exception wasn't caught.

            It's usually difficult to find out which call raised it, since these functions are used frequently and the exception is often caught by a caller further up, not by the function that uses them.

            Could you provide the xensource.log, as Andriy has asked? Otherwise I don't think we'll be able to find the exact cause.

            • jmannik @andriy.sultanov

              @andriy.sultanov @psafont
              https://drive.google.com/file/d/1aJyCYSAuRIBb0X-23gJ6ORtrHSciYH8a/view?usp=sharing
              Here is the log file

              • psafont Vates 🪐 XAPI & Network Team @jmannik

                @jmannik said in Unable to enable High Availability - INTERNAL_ERROR(Not_found):

                @andriy.sultanov @psafont
                https://drive.google.com/file/d/1aJyCYSAuRIBb0X-23gJ6ORtrHSciYH8a/view?usp=sharing
                Here is the log file

                It's not crystal clear what condition causes the exception, but I can see an unprotected exception being raised in the host.ha_join_liveset path when it tries to recover the host UUID and the UUID isn't found. I'll investigate.

                • psafont Vates 🪐 XAPI & Network Team @jmannik

                  @jmannik I have a test build you can try. It will hopefully provide better error messages by raising an internal error with a reason.

                  The code is based on the newest builds, so I recommend updating to the latest version of XCP-ng beforehand:

                  yum update
                  reboot
                  

                  Once that is done, the test packages can be installed by creating the file /etc/yum.repos.d/xcp-test.repo:

                  [xcp-ng-psafont1]
                  name=xcp-ng-psafont1
                  baseurl=https://koji.xcp-ng.org/repos/user/8/8.3/psafont1/x86_64/
                  enabled=0
                  gpgcheck=1
                  repo_gpgcheck=1
                  metadata_expire=0
                  gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-xcpng
                  

                  then updating the host using the test developer repo:

                  yum update --enablerepo=xcp-ng-psafont1
                  

                  and finally restarting all the daemons:

                  xe-toolstack-restart
                  

                  Note: the repository will only be available for a limited amount of time, after which I will repurpose it and delete the instructions so it's not used anymore by accident.

                  • tjkreidl Ambassador

                    Note also that if HA is turned on or off, the host must be restarted for that change to take effect, if I recall correctly.

                    • jmannik @tjkreidl

                      @tjkreidl This hasn't been my experience so far; enabling HA has just enabled HA, no reboot needed.

                      @psafont I am patching all my hosts now and will install the above test packages on Sunday night (it is Friday afternoon at the time of this post).

                      • nikade Top contributor @jmannik

                        @jmannik said in Unable to enable High Availability - INTERNAL_ERROR(Not_found):

                        @tjkreidl This hasn't been my experience so far, enabling HA has just enabled HA, no reboot needed.

                        @psafont I am patching all my hosts now, will do the above test packages on Sunday Night (it is Friday afternoon at the time of this post)

                        Correct, no reboot needed to enable/disable HA.

                        • tjkreidl Ambassador @nikade

                          @nikade Interesting, as that at some point used to be the case, at least with XenServer!
                          I stand corrected and learned something new.

                          • jmannik @psafont

                            @psafont
                            That's done now. I tried to enable HA again and it was still unsuccessful. What would you like me to do now?

                            • olivierlambert Vates 🪐 Co-Founder CEO

                              You should check /var/log/xensource.log; it should provide a more explicit error message.
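A minimal sketch for pulling the relevant lines out of the log, assuming the default log location and that the error strings quoted in this thread (INTERNAL_ERROR, Not_found, "coordinator's UUID") are what show up there. `find_ha_errors` is a hypothetical helper name, not an XCP-ng tool.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print, with line numbers, the lines of a xensource.log
# that mention the errors seen in this thread. The patterns are assumptions.
find_ha_errors() {
  local log="${1:-/var/log/xensource.log}"   # default XAPI log location
  grep -n -E "INTERNAL_ERROR|Not_found|coordinator's UUID" "$log"
}
```

On the host, `find_ha_errors | tail -n 50` shows the most recent matches; older entries may have been rotated into separate files next to the current log.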

                              • tjkreidl Ambassador @olivierlambert

                                @olivierlambert Good idea. Also, they should make sure all hosts are at the same update/patch levels, the network is set up properly among the three or more hosts, there is a compatible HA shared storage properly set up, etc.
                                You folks have a good guide at: https://docs.xcp-ng.org/management/ha/
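One way to spot-check the "same update/patch levels" point is to compare the installed toolstack package across hosts. A sketch, assuming SSH access between hosts and that the XAPI package is named `xapi-core` (check with `rpm -qa | grep xapi` if unsure); the host names and `all_versions_equal` are hypothetical, and a single package is only a partial proxy for the full update level.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads "HOST VERSION" lines on stdin and succeeds only
# when every host reports the same VERSION string.
all_versions_equal() {
  local distinct
  # Blank out the host column, then count the distinct version strings left.
  distinct=$(awk '{ $1 = ""; print }' | sort -u | wc -l | tr -d ' ')
  [ "$distinct" -eq 1 ]
}
```

Usage, with placeholder host names: `for h in host1 host2 host3; do printf '%s %s\n' "$h" "$(ssh root@"$h" rpm -q xapi-core)"; done | all_versions_equal && echo "same version" || echo "versions differ"`.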

                                • jmannik

                                  Well, this is what I'm getting now:

                                  {
                                    "id": "0mhbgkupy",
                                    "properties": {
                                      "method": "pool.enableHa",
                                      "params": {
                                        "pool": "213186d2-e3ba-154f-d371-4122388deb83",
                                        "heartbeatSrs": [
                                          "381caeb2-5ad9-8924-365d-4b130c67c064"
                                        ],
                                        "configuration": {}
                                      },
                                      "name": "API call: pool.enableHa",
                                      "userId": "71d48027-d471-4b01-83f9-830df4279f7e",
                                      "type": "api.call"
                                    },
                                    "start": 1761709884550,
                                    "status": "failure",
                                    "updatedAt": 1761709923544,
                                    "end": 1761709923544,
                                    "result": {
                                      "code": "INTERNAL_ERROR",
                                      "params": [
                                        "unable to gather the coordinator's UUID: Not_found"
                                      ],
                                      "call": {
                                        "duration": 38993,
                                        "method": "pool.enable_ha",
                                        "params": [
                                          "* session id *",
                                          [
                                            "OpaqueRef:a83a416f-c97d-1ed8-c7fc-213af89b8f86"
                                          ],
                                          {}
                                        ]
                                      },
                                      "message": "INTERNAL_ERROR(unable to gather the coordinator's UUID: Not_found)",
                                      "name": "XapiError",
                                      "stack": "XapiError: INTERNAL_ERROR(unable to gather the coordinator's UUID: Not_found)\n    at Function.wrap (file:///opt/xen-orchestra/packages/xen-api/_XapiError.mjs:16:12)\n    at file:///opt/xen-orchestra/packages/xen-api/transports/json-rpc.mjs:38:21\n    at runNextTicks (node:internal/process/task_queues:65:5)\n    at processImmediate (node:internal/timers:453:9)\n    at process.callbackTrampoline (node:internal/async_hooks:130:17)"
                                    }
                                  }
                                  
                                  • olivierlambert Vates 🪐 Co-Founder CEO

                                    That's better 🙂 @psafont so now we know we're missing a UUID somewhere?

                                    • psafont Vates 🪐 XAPI & Network Team @jmannik

                                      @jmannik
                                      So the problem goes like this:

                                      • HA uses a local-only database so that it doesn't depend on the normal database
                                      • This local database contains a mapping from host UUID to IP address (host_address) for all hosts in the HA cluster / pool. This information should be gathered from the normal database right before HA is enabled.
                                      • When trying to enable HA, the host fetches the coordinator's address from the filesystem. It then uses that mapping and the coordinator's address to find the coordinator's UUID. This is the step that fails.
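To make the failing step concrete, here is a toy model of that reverse lookup. It is not XAPI's actual code (XAPI is written in OCaml); the local mapping is fed as hypothetical "UUID ADDRESS" lines, and a missing address ends in the same Not_found condition.

```shell
#!/usr/bin/env bash
# Toy model only: "coordinator address -> coordinator UUID" over a
# hypothetical copy of the HA-local mapping. Not XAPI's implementation.

# find_uuid_by_address ADDRESS
# Reads "UUID ADDRESS" pairs from stdin, one per line, and prints the UUID
# whose address equals ADDRESS. Returns 1 when nothing matches, standing in
# for the uncaught Not_found exception.
find_uuid_by_address() {
  local wanted="$1" uuid addr
  while read -r uuid addr; do
    if [ "$addr" = "$wanted" ]; then
      printf '%s\n' "$uuid"
      return 0
    fi
  done
  echo "unable to gather the coordinator's UUID: Not_found" >&2
  return 1
}
```

For example, if the mapping was written while the coordinator was at 10.0.0.1 and the filesystem now points at 10.0.0.5, `printf 'uuid-a 10.0.0.1\n' | find_uuid_by_address 10.0.0.5` fails, which corresponds to the "address has changed" scenario.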

                                      I'm not sure what is actually happening, but some scenarios come to mind:

                                      • XO isn't calling the API function Host.preconfigure_ha, which means the local database is not created (unlikely)
                                      • The coordinator's address has somehow changed between the local database being written and HA being enabled

                                      Things to check:

                                      • Inspect the values the failing host has for the host_address of the coordinator / master host, in both:
                                        1. the normal database. SSH into the failing host and run the following command, replacing POOL_UUID with the actual UUID (you can do this by deleting POOL_UUID, placing the cursor after the = and pressing Tab twice):
                                           xe pool-param-get uuid=POOL_UUID param-name=master | xargs -I _ xe host-param-get uuid=_ param-name=address

                                        2. the pool role file. As with the previous command, SSH into the failing host and run:
                                           cat /etc/xensource/pool.conf

                                      Let us know how it goes. If the IPs don't match, there's a problem with the configuration of the member, and otherwise it's because the local database is outdated and should be refreshed before enabling HA. I don't know how XO handles it.
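The comparison can be scripted. A minimal sketch, assuming pool.conf on a member contains `slave:<coordinator address>` and on the coordinator contains just `master` (the usual format, but verify on your host); `pool_conf_coordinator` and `check_coordinator_address` are hypothetical names.

```shell
#!/usr/bin/env bash
# pool_conf_coordinator CONTENT
# Prints the coordinator address found in pool.conf CONTENT, or "self" when
# this host is the coordinator. Returns 1 for unrecognised content.
pool_conf_coordinator() {
  local content="$1"
  case "$content" in
    master)  echo self ;;
    slave:*) echo "${content#slave:}" ;;   # strip the "slave:" prefix
    *)       return 1 ;;
  esac
}

# check_coordinator_address DB_ADDRESS POOL_CONF_CONTENT
# Prints "match" (exit 0) or "mismatch: ..." (exit 1); exit 2 if pool.conf
# content is not recognised.
check_coordinator_address() {
  local db_addr="$1" conf_addr
  conf_addr=$(pool_conf_coordinator "$2") || { echo "unrecognised pool.conf content"; return 2; }
  if [ "$conf_addr" = self ]; then
    echo "this host is the coordinator"
    return 0
  fi
  if [ "$conf_addr" = "$db_addr" ]; then
    echo "match"
  else
    echo "mismatch: database=$db_addr pool.conf=$conf_addr"
    return 1
  fi
}
```

On the failing host: `check_coordinator_address "$(xe pool-param-get uuid="$(xe pool-list params=uuid --minimal)" param-name=master | xargs -I _ xe host-param-get uuid=_ param-name=address)" "$(cat /etc/xensource/pool.conf)"`. Following the interpretation above, a mismatch points at the member's configuration, while a match points at an outdated local HA database.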

                                        • olivierlambert Vates 🪐 Co-Founder CEO

                                        @psafont I'm not sure I follow; I don't remember seeing any documented endpoint related to preparing HA 🤔

                                          • psafont Vates 🪐 XAPI & Network Team @olivierlambert

                                          @olivierlambert The call is indeed hidden from the docs and can only be called from inside a pool... it's called as part of Pool.enable_ha.

                                            • olivierlambert Vates 🪐 Co-Founder CEO

                                            So we probably need to tell XO team the "right way" to enable HA because there's no way to know from "outside" 😓
