    Issue with Two-Node HA Cluster: XAPI Failing to Log In and HA Disablement

      ace_token

      Hi everyone,

      I'm currently facing an issue with my two-node HA cluster. I don't have access to the REST API, and I'm receiving the following error messages:

      Broadcast message from systemd-journald@hv02-xcp-mo (Mon 2024-05-20 18:38:07 CEST):
      
      xapi-nbd[9548]: main: Failed to log in via xapi's Unix domain socket in 300.000000 seconds
      
      Broadcast message from systemd-journald@hv02-xcp-mo (Mon 2024-05-20 18:38:07 CEST):
      
      xapi-nbd[9548]: main: Caught unexpected exception: (Failure
      
      Broadcast message from systemd-journald@hv02-xcp-mo (Mon 2024-05-20 18:38:07 CEST):
      
      xapi-nbd[9548]: main:   "Failed to log in via xapi's Unix domain socket in 300.000000 seconds")
      

      Due to this, nothing seems to be working correctly. I am unable to manage the cluster or access any of the services that rely on XAPI.
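
      For reference, this is how I've been checking the xapi service and its main log (standard commands in an XCP-ng dom0):

      # Is the xapi service itself running?
      systemctl status xapi

      # Follow the main xapi log while the login errors occur
      tail -f /var/log/xensource.log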

      Additionally, I am trying to disable HA from the CLI but encounter the following error:

      [18:47 hv02-xcp-mo d4e8e42c-e758-dd47-800e-ce2aaae3abdc]# xe pool-ha-disable
      The server could not join the liveset because the HA daemon could not access the heartbeat disk.
      

      I have the disk on storage, but I can't mount it.

      How can I disable HA from the CLI, even with these problems?

      Please, this is urgent. Any help or guidance would be greatly appreciated!

      Thank you.

          ace_token

          From /var/log/xensource.log:

          May 20 19:01:05 hv02-xcp-mo xapi: [ warn||0 |Checking HA configuration D:6662d56d1422|static_vdis] Attempt to reattach static VDIs via 'attach-static-vdis start' failed: INTERNAL_ERROR: [ Subprocess exited with unexpected code 1; stdout = [ ]; stderr = [ Redirecting to /bin/systemctl start attach-static-vdis.service\x0AJob for attach-static-vdis.service failed because the control process exited with error code. See "systemctl status attach-static-vdis.service" and "journalctl -xe" for details.\x0A ] ]
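
          The error itself points at the next diagnostic step (commands quoted from the stderr above):

          systemctl status attach-static-vdis.service
          journalctl -xe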

            Danp (Pro Support Team)

             You could try running the following command on each host to disable HA:

            xe host-emergency-ha-disable --force

             FWIW, HA requires three nodes to run properly, and not every environment needs HA, even if you think yours does. 😉
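
             Afterwards you can confirm HA is off at the pool level (standard xe commands; the nested pool-list just fetches the pool uuid):

             xe pool-param-get uuid=$(xe pool-list --minimal) param-name=ha-enabled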

              ace_token

              @Danp said in Issue with Two-Node HA Cluster: XAPI Failing to Log In and HA Disablement:

              You could try running the following command on each host to disable HA:

              xe host-emergency-ha-disable --force

              FWIW, HA requires three nodes to run properly, and not every environment needs HA, even if you think yours does. 😉

              Thank you for your input!

              XAPI is now functioning again after disabling HA. However, I'm encountering an issue while reattaching the storage to each node. Here are the commands and outputs:

              1. List Hosts:

                [21:53 hv01-xcp-mo ~]# xe host-list
                uuid ( RO)                : 0a38ea70-8529-4d30-bf44-8f01e1e4101b
                          name-label ( RW): hv01-xcp-mo
                    name-description ( RW): 
                
                uuid ( RO)                : 67c5d00a-977d-46e2-98ee-0aa620e94db0
                          name-label ( RW): hv02-xcp-mo
                    name-description ( RW): 
                
              2. List PBDs for SR:

                [21:50 hv01-xcp-mo ~]# xe pbd-list sr-uuid=d4e8e42c-e758-dd47-800e-ce2aaae3abdc 
                uuid ( RO)                  : ccc3067f-afc2-a344-028b-3815da0b5afe
                             host-uuid ( RO): 0a38ea70-8529-4d30-bf44-8f01e1e4101b
                               sr-uuid ( RO): d4e8e42c-e758-dd47-800e-ce2aaae3abdc
                         device-config (MRO): backupservers: 10.1.80.11:/san; server: 10.1.80.12:/san
                    currently-attached ( RO): false
                
                uuid ( RO)                  : 280f2c70-96e3-0404-2d20-61789c113356
                             host-uuid ( RO): 67c5d00a-977d-46e2-98ee-0aa620e94db0
                               sr-uuid ( RO): d4e8e42c-e758-dd47-800e-ce2aaae3abdc
                         device-config (MRO): backupservers: 10.1.80.11:/san; server: 10.1.80.12:/san
                    currently-attached ( RO): false
                
              3. Attempt to Plug PBD:

                [21:57 hv01-xcp-mo ~]# xe pbd-plug uuid=ccc3067f-afc2-a344-028b-3815da0b5afe
                Error code: SR_BACKEND_FAILURE_12
                Error parameters: , mount failed with return code 1, 
                
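              For reference, once the underlying mount failure is fixed, I assume both PBDs can be replugged in one pass (a sketch; the SR uuid comes from the pbd-list output above):

                # Replug every detached PBD on this SR
                for pbd in $(xe pbd-list sr-uuid=d4e8e42c-e758-dd47-800e-ce2aaae3abdc currently-attached=false --minimal | tr ',' ' '); do
                    xe pbd-plug uuid=$pbd
                done
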

              Can you help me?

              Thanks again for your support!

                Danp (Pro Support Team)

                Storage-related errors will be in SMlog.

                Edit: What type of storage is this?
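
                For example (the log path is standard on XCP-ng; the grep pattern is just one way to narrow it down):

                # Inspect the storage manager log around the failed pbd-plug
                tail -n 100 /var/log/SMlog
                grep -i mount /var/log/SMlog | tail -n 20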

                  ace_token

                  @Danp, GlusterFS.

                  Anyway, I found the problem: it was the cache size, so the FUSE client couldn't mount the volume.

                  Thank you very much for helping me.
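
                  For anyone hitting the same thing, a sketch of the kind of check and fix involved (the volume name "san" comes from the device-config above; the cache-size value is just an example):

                  # On a Gluster node: inspect and adjust the volume cache size
                  gluster volume get san performance.cache-size
                  gluster volume set san performance.cache-size 256MB

                  # Then replug the PBD on each XCP-ng host (uuids from the earlier pbd-list)
                  xe pbd-plug uuid=ccc3067f-afc2-a344-028b-3815da0b5afe
                  xe pbd-plug uuid=280f2c70-96e3-0404-2d20-61789c113356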

                  olivierlambert marked this topic as a question and later as solved.