Error: Connection refused (calling connect ) (XCP-ng toolstack hang on boot)
-
Hi, I use a startup script that starts my VMs in a specific order every time my single, main XCP-ng host is restarted.
For the past few days I've been getting random failures, where at first the XOA VM just loses connectivity to the host toolstack, even though all VMs are up and the host is functional (I can SSH in).
The script was configured like this:
    #!/bin/bash
    # xe vm-list for name-label, add in start order
    vms=(vm1 vm2 vm3 etc...)
    wait=30s

    # No need to modify below
    initwait=3m
    vmslength=${#vms[@]}
    log=/root/scripts/startup.log

    start_vm () {
        echo -n "[$(date +"[%Y-%m-%d %H:%M:%S]")] Starting $1 ... " >> "${log}"
        # Quote the name so labels containing spaces are passed intact
        /opt/xensource/bin/xe vm-start name-label="$1"
        if [ $? -eq 0 ]
        then
            echo "Success" >> "${log}"
        else
            echo "FAILED" >> "${log}"
        fi

        # Wait if not the last vm
        if [ "$1" != "${vms[${vmslength}-1]}" ]
        then
            echo "Waiting ${wait}" >> "${log}"
            sleep "${wait}"
        fi
    }

    # Truncate the log, then give the toolstack time to come up
    echo "[$(date +"[%Y-%m-%d %H:%M:%S]")] Running autostart script (Waiting ${initwait})" > "${log}"
    sleep "${initwait}"

    for vm in "${vms[@]}"
    do
        start_vm "${vm}"
    done

    echo "[$(date +"%T")] Startup complete." >> "${log}"
    echo
As you can see, initwait is set to 3m so that the script waits for the XCP-ng toolstack to be ready, and I've had no issues with this config for the past year.
Now I've noticed that the toolstack takes about 10 minutes to start, where it used to take about 2. I have no idea what's going wrong, because I didn't do any updates in the meantime.
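Since a fixed 3m wait is shorter than the 10 minutes the toolstack now needs, one option is to poll instead of sleeping for a fixed time. The following is only a minimal sketch, not part of the original script: `wait_for` is a hypothetical helper name, and it assumes that `xe host-list` exits non-zero (for example with "Connection refused") while the toolstack is still down. A successful call only shows that XAPI is answering, which may or may not be enough for every setup.

```shell
#!/bin/bash
# Hypothetical helper: retry a probe command until it succeeds or a timeout expires.
# Usage: wait_for <timeout_seconds> <interval_seconds> <command> [args...]
wait_for () {
    local timeout=$1 interval=$2
    shift 2
    local waited=0

    # Run the probe; its output is discarded, only the exit status matters
    until "$@" > /dev/null 2>&1
    do
        if [ "${waited}" -ge "${timeout}" ]
        then
            return 1    # gave up: probe never succeeded within the timeout
        fi
        sleep "${interval}"
        waited=$((waited + interval))
    done
    return 0
}

# Assumed usage in place of the fixed `sleep ${initwait}`:
#   wait_for 900 10 /opt/xensource/bin/xe host-list \
#       || echo "Toolstack not ready after 900s" >> "${log}"
```

With this, the script would start the first VM as soon as the probe succeeds, and the time it waited gives a rough measurement of how long the toolstack took to come up.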
Does anyone have an idea where I should look to see what's causing this 10 minute hang?
Even after rebooting the host, after the XOA VM is up, it can't connect to the toolstack for some reason:
connect ETIMEDOUT host-ip:443
Update: the XOA error is due to a kernel issue. 5.10.0-25-amd64 works, 5.10.0-26-amd64 cannot connect to any XCP-ng host. This still leaves me wondering why the XCP-ng host toolstack startup time has increased so drastically.
-
Hi,
- Are you using XOA or XO from the sources? XOA is the version you find on https://xen-orchestra.com that we consistently test before release
- And is it fully up to date?
- Is your XCP-ng host fully up to date? 8.2 or 8.3?
-
@olivierlambert Sorry for the delay, I'm not receiving emails for replies here, for some reason.
I'm using XO from the sources
Xen Orchestra, commit 3c047 xo-server 5.124.0 xo-web 5.126.0
XCP-ng was fully up to date when issues occurred, I do have a few updates pending now, but haven't rebooted since the issue:
software-version (MRO) : product_version: 8.2.1; product_version_text: 8.2; product_version_text_short: 8.2; platform_name: XCP; platform_version: 3.2.1; product_brand: XCP-ng; build_number: release/yangtze/master/58; hostname: localhost; date: 2023-08-09; dbv: 0.0.1; xapi: 1.20; xen: 4.13.5-9.36; linux: 4.19.0+1; xencenter_min: 2.16; xencenter_max: 2.16; network_backend: openvswitch; db_schema: 5.603
-
Okay, so you are not using XOA but XO from the sources (don't mix them up: XOA is the turnkey version we distribute with support on https://xen-orchestra.com).
First, you must update to the latest commit (yours is already 3 weeks old), rebuild and test again.
Then, you also need to update your XCP-ng and reboot, to see if it's better: if your toolstack takes 10 minutes to boot, it's normal that XO can't connect, so that's your main problem.
-
@olivierlambert Ah, I understand the naming convention now.
So it's XO, but XO is irrelevant to this issue. The problem was the 10 minutes it took the toolstack to boot up, compared to the 1-2 minutes it always took before.
I updated XCP-ng now, rebooted, and both hosts took 10 minutes for the stack to come up again. Any ideas what could be causing this delay and how we could troubleshoot it?
-
It could be the time spent trying to plug a storage repository. Do you have an SR (or ISO SR) that is hosted by a VM on this very host?
-
@olivierlambert I do.
I have also noticed something extremely weird.
I have 3 HDDs attached to one host.
2x2TB raid 1 (software raid done on the XCP-ng host)
1x4TB
lsblk shows:
    ... SNIP ...
    sda           8:0    0  1.8T  0 disk
    ├─sda2        8:2    0  1.8T  0 part
    └─sda1        8:1    0    2G  0 part
    ...
    sdb           8:16   0  1.8T  0 disk
    ├─sdb2        8:18   0  1.8T  0 part
    └─sdb1        8:17   0    2G  0 part
      └─md127     9:127  0    2G  0 raid1
    ...
    sde           8:64   0  3.7T  0 disk
    ├─sde2        8:66   0  3.7T  0 part
    └─sde1        8:65   0    2G  0 part
      └─md127     9:127  0    2G  0 raid1
All 3 disks are passed through to a TrueNAS VM on the host, and all the data is properly stored, but I have no idea why mdadm shows that the 4TB disk is part of the raid, instead of the other one?
    /dev/md127:
               Version : 1.2
         Creation Time : Sun Aug 27 14:32:08 2023
            Raid Level : raid1
            Array Size : 2094080 (2045.00 MiB 2144.34 MB)
         Used Dev Size : 2094080 (2045.00 MiB 2144.34 MB)
          Raid Devices : 2
         Total Devices : 2
           Persistence : Superblock is persistent

           Update Time : Sun Oct 8 12:07:28 2023
                 State : clean
        Active Devices : 2
       Working Devices : 2
        Failed Devices : 0
         Spare Devices : 0

    Consistency Policy : resync

                  Name : november:swap0
                  UUID : ae045fa0:74b00896:3134ede5:c837bec3
                Events : 27

        Number   Major   Minor   RaidDevice State
           0       8       65        0      active sync   /dev/sde1
           1       8       17        1      active sync   /dev/sdb1
Anyway, this doesn't seem to be the issue, since the other host, which has no HDDs attached (only M.2 VM SRs), also took exactly 10 minutes for the toolstack to come up.
Now XO can't reach any of the hosts, even though all the VMs are up.
-
- Any of the hosts: if they are in the same pool, that's expected. Only the master needs to be reachable.
- XAPI will probably stay in the "Starting" state as long as not all SRs are plugged. If the SR is provided by a VM on a host other than the master, reboot only the master and you should be able to connect sooner.
- Alternatively, check https://docs.xcp-ng.org/troubleshooting/
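To check the SR point above once the toolstack finally answers, something like the following could list the PBDs (the host-to-SR connections) that are still not attached. This is a rough sketch under assumptions: `unplugged_pbds` is a hypothetical helper name, the `XE` variable is only there so the binary path can be overridden, and it relies on `xe pbd-list` accepting `currently-attached=false` as a filter.

```shell
#!/bin/bash
# Path to the xe CLI (same path as in the startup script); override XE if needed
XE=${XE:-/opt/xensource/bin/xe}

# Hypothetical helper: print every PBD that is currently not attached,
# with the SR and host it belongs to
unplugged_pbds () {
    "${XE}" pbd-list currently-attached=false params=uuid,sr-uuid,host-uuid
}

# Assumed usage on the host, after XAPI has started:
#   unplugged_pbds
```

An SR that shows up here and lives on a VM of the same host would be a plausible candidate for the delay, since that VM cannot be running yet while the toolstack is still starting.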