HA failover reaction time question
-
-
@Danp Hello Danp,
No, just the standard DRS and High Availability configuration, no overkill FT.
In case of host failure, the VM would restart within 10 seconds (at worst).
-
Yes, that is normal: it is the default timeout. You can reduce it, but then you should expect false positives, so if you do, lower the value progressively.
The default is 60 seconds. You can try this:
xe pool-param-set uuid=<pool uuid> other-config:default_ha_timeout=<timeout in seconds>
with a value of 10 seconds. But please test the behavior outside production first. The lower you go, the higher the chance of triggering it on a minor network delay or similar. It's up to you to find the sweet spot for your infrastructure. Happy to hear your feedback.
-
@olivierlambert Thanks a lot.
We have no SPOF and a full-fiber 100Gb spine/leaf network infrastructure, so I will give it a go (we are currently only on a test platform, so I can experiment as much as I need).
-
Great, keep us posted!
-
@olivierlambert Just tried, but there is no change in reaction time.
After googling this parameter I found this page you wrote (small world) on the xcp-ng.org website, https://xcp-ng.org/blog/2024/08/22/xcp-ng-high-availability-a-guide/, which indicates that the purpose of this timeout is self-fencing in case of loss of network/storage (I actually had this page open in my browser already but missed this line).
It doesn't seem to influence the restart timer in case of a full host failure.
-
@dsmteam Did you try disabling and then enabling HA again to be sure that the new setting was being used?
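The sequence suggested here (set the timeout, then turn HA off and on again so the new value is picked up) can be sketched as below. This is only a dry-run helper, not an official procedure: `ha_timeout_plan` is a hypothetical name, the value 30 is arbitrary, and a single heartbeat SR is assumed. It prints the `xe` commands rather than executing them, so they can be reviewed first.

```shell
#!/bin/sh
# Sketch: print the xe commands that change the HA timeout and then
# disable/re-enable HA. Nothing is executed by this function.
ha_timeout_plan() {
    pool_uuid=$1   # pool UUID, e.g. taken from `xe pool-list`
    sr_uuid=$2     # heartbeat SR UUID (assumption: one heartbeat SR)
    timeout=$3     # timeout in seconds
    echo "xe pool-param-set uuid=$pool_uuid other-config:default_ha_timeout=$timeout"
    echo "xe pool-ha-disable"
    echo "xe pool-ha-enable heartbeat-sr-uuids=$sr_uuid"
}

# Placeholders are left as-is on purpose; substitute real UUIDs before use.
ha_timeout_plan "<pool uuid>" "<heartbeat SR uuid>" 30
```

As with the original advice, this is something to try on a test pool first, since HA is briefly disabled between the last two commands.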
-
@Danp Oh..................
Indeed, much faster now: down from 2 minutes to 1 minute 20 seconds.
Less than 10 seconds might be too aggressive.
This is closer to what we expect.
I can see in the GUI that when I bring a host down, the pool still takes a minute to consider the host down. Is there any way to decrease this timer further, or are there too many dependencies?
-
That's good progress! For the other number, let me ask around.
-
@olivierlambert I think I found what I need in the following documentation
https://xapi-project.github.io/features/HA/HA.html
Various parameters, which must be the same on every host, in /etc/xensource/xhad.conf:
<parameters>
  <HeartbeatInterval>4</HeartbeatInterval>
  <HeartbeatTimeout>30</HeartbeatTimeout>
  <StateFileInterval>4</StateFileInterval>
  <StateFileTimeout>30</StateFileTimeout>
  <HeartbeatWatchdogTimeout>30</HeartbeatWatchdogTimeout>
  <StateFileWatchdogTimeout>45</StateFileWatchdogTimeout>
  <BootJoinTimeout>90</BootJoinTimeout>
  <EnableJoinTimeout>90</EnableJoinTimeout>
  <XapiHealthCheckInterval>60</XapiHealthCheckInterval>
  <XapiHealthCheckTimeout>10</XapiHealthCheckTimeout>
  <XapiRestartAttempts>1</XapiRestartAttempts>
  <XapiRestartTimeout>30</XapiRestartTimeout>
  <XapiLicenseCheckTimeout>30</XapiLicenseCheckTimeout>
</parameters>
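Since these values are supposed to match on every host, one way to check is to dump them as plain `Name=value` lines and diff the output between hosts. The following is a minimal sketch, assuming the file contains simple numeric elements like the ones quoted above; `xhad_params` is a hypothetical helper name, not part of any XCP-ng tooling.

```shell
#!/bin/sh
# Print every numeric <Name>value</Name> element as "Name=value", one per line.
# Works whether the XML is pretty-printed or flattened onto a single line.
xhad_params() {
    grep -oE '<[A-Za-z]+>[0-9]+</[A-Za-z]+>' "$1" |
        sed -E 's/<([A-Za-z]+)>([0-9]+)<.*/\1=\2/'
}

# Dump the local values if the file is present (it only exists on a host
# where HA has been configured).
conf=/etc/xensource/xhad.conf
if [ -r "$conf" ]; then
    xhad_params "$conf"
fi
```

Running this on each host and comparing the results also makes it easy to see whether a hand-edited value survived, or was reset, after HA was re-enabled.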
-
Explanations here: https://github.com/xapi-project/xen-api/pull/4169
No idea how to tinker with it, but happy to hear about your experiments.
-
@olivierlambert Unfortunately, the parameters revert to their default values when I turn on HA. They might be hard-coded somewhere.
-
@dsmteam Still browsing the web and various XO forum threads, but it looks like those parameters are defined in the .c and other precompiled files, so the XCP-ng builds are probably using those default values.