Latest posts made by dsmteam
-
After days of research and tinkering : a working guide for Debian 12 template with cloud-init and DHCP
I have tried for days to make a Debian template (this probably applies to other Linux distributions as well).
The main issue I was facing was that when creating multiple machines, they would all get the same IP from our DHCP server.
The reason is that Debian sends the machine-id (under /etc/machine-id) as dhcp identifier.
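One way to confirm this diagnosis is to compare the machine-id of two VMs deployed from the same template. The sketch below assumes each VM's /etc/machine-id has been copied to a local file; `same_machine_id` and the file names are hypothetical, not part of any tool.

```shell
# Hypothetical helper: succeed when two machine-id files hold the same ID.
# $1 and $2 are paths to copies of /etc/machine-id taken from two VMs.
same_machine_id() {
    # Strip whitespace (including the trailing newline) before comparing.
    [ "$(tr -d '[:space:]' < "$1")" = "$(tr -d '[:space:]' < "$2")" ]
}
```

For example, `same_machine_id vm1.machine-id vm2.machine-id && echo "clones share a machine-id"` would print the message when both clones still carry the template's ID.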
Adding dhcp-client-identifier = hardware; to the /etc/dhcp/dhclient.conf file did not help, and deleting /etc/machine-id did not work either: for some reason cloud-init did not generate a new ID, and the VM did not request an IP at all.
This is what I did:
Downloaded from https://cdimage.debian.org/images/cloud/ the latest bookworm raw file and imported it as a disk in XO
Booted a VM with a random template and some network (internet access will be useful in a few steps)
Deleted the existing disk and attached the raw disk I uploaded
Converted the VM to template
Created a VM from this template with my ssh key
Once booted, you will need to install dmidecode (https://packages.debian.org/search?keywords=dmidecode ); do your own due diligence to get the latest .deb, then install it with dpkg -i
Also install xcp-ng guest tools
Then run :
sudo cloud-init clean
sudo cloud-init clean --logs
rm /home/debian/.ssh/authorized_keys
sudo mkdir -p /var/lib/cloud/scripts/per-once/ (folders get deleted on cloud-init clean)
cd /var/lib/cloud/scripts/per-once/
sudo nano generate-machine-id.sh
(script from user modem7 on GitHub)
#!/bin/bash
# KVM UUID Recreator
# Use this for new VM's or templates that require a unique machine ID.
if [[ $EUID -ne 0 ]]; then
    echo "This script must be run as root"
    exit 1
fi
UUID=$(dmidecode -s system-uuid | tr -d '-')
if grep -q "$UUID" /etc/machine-id; then
    echo "UUID matches"
else
    echo "UUID does not match. Recreating."
    echo -n > /etc/machine-id && echo -n > /var/lib/dbus/machine-id && systemd-machine-id-setup && reboot
fi
sudo chmod +x generate-machine-id.sh
sudo cat /dev/null > ~/.bash_history && history -c && shutdown now
You can now rename the VM and its disk, delete the network card (this prevents the template from having tags with the IPv4 and IPv6 addresses added automatically), and convert the VM to a template.
You should now have a working Debian 12 template that is accessible with your SSH key (if you add it on deploy), with DHCP working and addresses not overlapping. Hopefully, I did not forget anything.
On first start, the VM will loop once after the first prompt. The reboot is required for the change of the machine-id to be effective.
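The script above decides whether to regenerate the ID (and reboot) by comparing the dmidecode UUID with /etc/machine-id. One detail that may be worth checking: machine-id is stored as lowercase hex, while some dmidecode versions may print the UUID in uppercase (this is an assumption to verify on your own system), in which case a case-sensitive grep would never match. A minimal sketch of a case-insensitive comparison, using hypothetical helper names:

```shell
# Hypothetical helper: convert a DMI system UUID to machine-id form
# (dashes removed, hex digits lowercased).
uuid_to_machine_id() {
    printf '%s\n' "$1" | tr -d '-' | tr '[:upper:]' '[:lower:]'
}

# Hypothetical helper: succeed when the UUID ($1) matches the
# machine-id stored in the file ($2).
machine_id_matches() {
    [ "$(uuid_to_machine_id "$1")" = "$(tr -d '[:space:]' < "$2")" ]
}
```

On a VM this could be called as `machine_id_matches "$(dmidecode -s system-uuid)" /etc/machine-id` (as root, since dmidecode requires it).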
This is a lot of work and I have no doubt there is a simpler solution but I couldn't find it.
-
RE: HA failover reaction time question
@dsmteam I am still browsing the web and various XO forum threads, but it looks like those parameters are set in the .c and other precompiled files, so the XCP-ng builds are probably using those default parameters.
-
RE: HA failover reaction time question
@olivierlambert Unfortunately, the parameters are reverted back to their default value when I turn on HA. Might be hard coded somewhere.
-
RE: HA failover reaction time question
@olivierlambert I think I found what I need in the following documentation
https://xapi-project.github.io/features/HA/HA.html
Various parameters, which must be the same on every host, in /etc/xensource/xhad.conf:
<parameters>
  <HeartbeatInterval>4</HeartbeatInterval>
  <HeartbeatTimeout>30</HeartbeatTimeout>
  <StateFileInterval>4</StateFileInterval>
  <StateFileTimeout>30</StateFileTimeout>
  <HeartbeatWatchdogTimeout>30</HeartbeatWatchdogTimeout>
  <StateFileWatchdogTimeout>45</StateFileWatchdogTimeout>
  <BootJoinTimeout>90</BootJoinTimeout>
  <EnableJoinTimeout>90</EnableJoinTimeout>
  <XapiHealthCheckInterval>60</XapiHealthCheckInterval>
  <XapiHealthCheckTimeout>10</XapiHealthCheckTimeout>
  <XapiRestartAttempts>1</XapiRestartAttempts>
  <XapiRestartTimeout>30</XapiRestartTimeout>
  <XapiLicenseCheckTimeout>30</XapiLicenseCheckTimeout>
</parameters>
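To read one of these values on a host (for example, to check that it is identical everywhere), something like the following could be used. It is only a sketch: `xhad_param` is a hypothetical helper name, and it assumes each parameter appears once as a plain `<Name>value</Name>` element, as in the listing above.

```shell
# Hypothetical helper: print the value of one parameter from an
# xhad.conf-style file. $1 = parameter name, $2 = path to the file.
xhad_param() {
    # Keep only what sits between <Name> and </Name>; print the first match.
    sed -n "s:.*<$1>\([^<]*\)</$1>.*:\1:p" "$2" | head -n 1
}
```

For instance, `xhad_param HeartbeatTimeout /etc/xensource/xhad.conf` would print the configured heartbeat timeout, and nothing if the element is absent.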
-
RE: HA failover reaction time question
@Danp Oh..................
Indeed, much faster now. Down from 2:00 minutes to 1:20 minutes
Less than 10 seconds might be too aggressive.
This is closer to what we expect.
I can see in the GUI that when I bring a host down, the pool still takes a minute to consider the host down. Is there any way to decrease this timer further, or are there too many dependencies?
-
RE: HA failover reaction time question
@olivierlambert Just tried but there is no change in reaction time.
After googling this parameter, I found this page you wrote (small world) on the xcp-ng.org website: https://xcp-ng.org/blog/2024/08/22/xcp-ng-high-availability-a-guide/ which indicates that the purpose of this timeout is self-fencing in case of network/storage loss (I actually already had this page open in my browser but missed this line).
It doesn't seem to influence the restart timer in case of a full host failure.
-
RE: HA failover reaction time question
@olivierlambert Thanks a lot.
We have no SPOF and a full-fiber 100Gb spine/leaf network infrastructure, so I will give it a go (currently we are only on a test platform, so I can do as much as I need).
-
RE: HA failover reaction time question
@Danp Hello Danp,
No, just the standard DRS and High Availability configuration, no overkill FT.
In case of host failure, VMs would restart within 10 seconds (at worst).
-
HA failover reaction time question
Hello everyone,
we are testing XCP-NG and are quite satisfied with its ease of use and functionality (we are still using ESX, with around 100 hosts).
However, one caveat we saw (same issue with Proxmox) is that the failover reaction time is quite long compared to ESX.
Under ESX, VMs hosted on a failed host are restarted on a different host within seconds.
With XCP-NG it takes about 2 minutes for the VMs to be restarted on a different host (HA cluster of 3 hosts that previously ran ESX, so the physical environment is identical).
Are those delays normal? I suppose they are, according to various videos we saw online showing this kind of reaction time.
If they are, is there some way to reduce them?
We couldn't find any information or settings in Xen Orchestra or on the hosts themselves.
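For anyone wanting to put a number on the failover delay, a rough way to measure it is to probe a VM once per second and time the gap between the first failed probe and the next successful one. This is only a sketch: `measure_outage` is a hypothetical helper, the one-second polling limits the precision, and the probe command is whatever fits your setup.

```shell
# Hypothetical helper: run the given probe command once per second,
# wait for it to start failing, then print how many seconds pass
# until it succeeds again.
measure_outage() {
    # Wait while the target is still reachable.
    while "$@" >/dev/null 2>&1; do sleep 1; done
    outage_start=$(date +%s)
    # Wait until the target answers again.
    until "$@" >/dev/null 2>&1; do sleep 1; done
    outage_end=$(date +%s)
    echo $((outage_end - outage_start))
}
```

Started before pulling the plug on a host, for example as `measure_outage ping -c 1 -W 1 192.0.2.10` (a placeholder address), it prints the approximate outage in seconds once the VM is reachable again.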