The HA doesn't work

Danp

Can you describe this dom0 crash in more detail? You should check the logs to see why HA didn't kick in and promote a new pool master.

sixela

@Danp Hello,

I think it crashed due to a hardware problem.

HA started, but I got the error below, so I had to start the server by hand afterwards, even though the dom0 that had crashed was up again:

Feb 27 00:03:43DOM0 xapi: [error||14746299 ||backtrace] Raised Storage_error ([S(Backend_error);[S(SR_BACKEND_FAILURE_46);[S();S(The VDI is not available [opterr=['HOST_OFFLINE', 'OpaqueRef:e67d5aed-ae13-497e-ac16-29882c317ef3']]]);S()]])
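
When many VMs are affected, it can help to pull the failure code and the `opterr` details out of lines like the one above. The following is only a rough sketch: the function name `parse_storage_error` and both regular expressions are my own assumptions, derived from this single log line rather than from any documented xapi log format.

```python
import re

# Both patterns are assumptions based on the one log line quoted above.
FAILURE_RE = re.compile(r"S\((SR_BACKEND_FAILURE_\d+)\)")
OPTERR_RE = re.compile(r"opterr=\[([^\]]*)\]")


def parse_storage_error(line):
    """Return (failure_code, opterr_items), or None if no failure code is found."""
    failure = FAILURE_RE.search(line)
    if failure is None:
        return None
    opterr = OPTERR_RE.search(line)
    # The opterr payload is a list of single-quoted strings.
    items = re.findall(r"'([^']*)'", opterr.group(1)) if opterr else []
    return failure.group(1), items


sample = (
    "Feb 27 00:03:43DOM0 xapi: [error||14746299 ||backtrace] Raised Storage_error "
    "([S(Backend_error);[S(SR_BACKEND_FAILURE_46);[S();S(The VDI is not available "
    "[opterr=['HOST_OFFLINE', 'OpaqueRef:e67d5aed-ae13-497e-ac16-29882c317ef3']]]);S()]])"
)

print(parse_storage_error(sample))
```

Running this over every matching line in the log and grouping by the `OpaqueRef` value would show whether all the failed VMs point at the same offline object.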

sixela

@Danp Hello,

In addition:

28 machines were impacted: 15 ended up OK and 13 failed with the error message. HA did try, but there was a problem.

Danp

@sixela There's likely more information in the logs that would explain why a new pool master wasn't designated.

sixela

@Danp My problem isn't about a new master. My VMs should all be restarted by HA on another host in the same cluster, since they have the "restart if possible" setting. HA does try, but the VMs are still locked with the following error:

Feb 27 00:03:43DOM0 xapi: [error||14746299 ||backtrace] Raised Storage_error ([S(Backend_error);[S(SR_BACKEND_FAILURE_46);[S();S(The VDI is not available [opterr=['HOST_OFFLINE', 'OpaqueRef:e67d5aed-ae13-497e-ac16-29882c317ef3']]]);S()]])


Danp

@sixela I understand. I'm not an expert on HA functionality, but I suspect that the new pool master would need to be designated as a first step. That is why I am suggesting that you investigate why this didn't automatically occur.

tjkreidl

How many hosts in your pool? For HA to work out of the box, you need at least three hosts in a pool. Also, are all your hosts properly time synchronized to the same time source?
They need to be very close in time to each other for HA to work properly. Note that when HA is first enabled on a given host, it has to be rebooted for HA to function.

sixela

@tjkreidl Hello,

We have 17 hosts in the pool, and they are all synchronized with a time server.

tjkreidl

@sixela Hmmm ... that SR backend error makes me wonder if the place where you designate HA info to be stored (the so-called "heartbeat SR") might be corrupted or such?

sixela

@tjkreidl Hello,

But didn't about half of them end up OK too?

28 machines were impacted: 15 ended up OK and 13 failed with the error message.