Error: Connection refused (calling connect ) (XCP-ng toolstack hang on boot)

ScarfAntennae

Hi, I use a startup script that starts VMs in a specific order, every time my main and single XCP-ng host is restarted.

The past few days I've been getting random failures, where at first the XOA VM just loses connectivity to the host toolstack, even though all VMs are up and the host is functional (I can ssh in).

The script was configured like this:

#!/bin/bash

# xe vm-list for name-label, add in start order
vms=(vm1 vm2 vm3 etc...)
wait=30s

# No need to modify below
initwait=3m
vmslength=${#vms[@]}
log=/root/scripts/startup.log

start_vm () {
   # Log the start attempt; date output is wrapped in a single pair of brackets
   echo -n "[$(date +"%Y-%m-%d %H:%M:%S")] Starting $1 ... " >> "${log}"
   /opt/xensource/bin/xe vm-start name-label="$1"
   if [ $? -eq 0 ]
     then
       echo "Success" >> ${log}
     else
       echo "FAILED" >> ${log}
   fi

   # Wait if not the last vm
   if [ "$1" != "${vms[${vmslength}-1]}" ]
     then
       echo "Waiting ${wait}" >> ${log}
       sleep ${wait}
   fi
}

echo "[$(date +"%Y-%m-%d %H:%M:%S")] Running autostart script (Waiting ${initwait})" > "${log}"
sleep "${initwait}"

# Quote the expansions so name-labels containing spaces stay intact
for vm in "${vms[@]}"
do
  start_vm "${vm}"
done

echo "[$(date +"%T")] Startup complete." >> ${log}
echo

As you can see the initwait is set to 3m, having the script wait for the XCP-ng toolstack to get ready, and I've had no issues with this config for the past year.

Now I have noticed that the toolstack takes about 10 minutes to start, where it took about 2 beforehand. I have no idea what's going wrong because I didn't do any updates in the meantime.
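A fixed initwait only works while the toolstack startup time stays below it. A minimal sketch of the alternative is to poll until xe answers, up to a timeout. The function name wait_for_toolstack, the XE_BIN variable, and the timeout and interval values are illustrative and not part of the original script. Treating a successful "xe host-list" as "toolstack ready" is also an assumption: it only shows that XAPI accepts connections.

```shell
# Sketch: poll the toolstack instead of sleeping for a fixed initwait.
# XE_BIN and the function name are hypothetical helpers for this example.
XE_BIN=${XE_BIN:-/opt/xensource/bin/xe}

# wait_for_toolstack TIMEOUT_SECONDS INTERVAL_SECONDS
# Returns 0 as soon as "xe host-list" succeeds,
# or 1 once TIMEOUT_SECONDS have been spent waiting.
wait_for_toolstack () {
  local timeout=$1 interval=$2 waited=0
  until "$XE_BIN" host-list --minimal > /dev/null 2>&1; do
    if [ "$waited" -ge "$timeout" ]; then
      return 1
    fi
    sleep "$interval"
    waited=$((waited + interval))
  done
  return 0
}

# Possible use in place of "sleep ${initwait}" (values are examples only):
# wait_for_toolstack 1200 10 || echo "Toolstack not ready after 20 minutes" >> "${log}"
```

Logging the time at which the function returns would also show how long the toolstack actually takes on each boot, which helps narrow down when the delay started.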

Does anyone have an idea where I should look to see what's causing this 10 minute hang?

Even after rebooting the host, after the XOA VM is up, it can't connect to the toolstack for some reason:
connect ETIMEDOUT host-ip:443

Update: the XOA error is due to a kernel issue. 5.10.0-25-amd64 works, 5.10.0-26-amd64 cannot connect to any XCP-ng host. This still leaves me wondering why the XCP-ng host toolstack startup time has increased so drastically.

olivierlambert

Hi,

Are you using XOA or XO from the sources? XOA is the version you find on https://xen-orchestra.com that we consistently test before release.
And is it fully up to date?
Is your XCP-ng host fully up to date? 8.2 or 8.3?

ScarfAntennae

@olivierlambert Sorry for the delay, I'm not receiving emails for replies here, for some reason.

I'm using XO from the sources

Xen Orchestra, commit 3c047
xo-server 5.124.0
xo-web 5.126.0

XCP-ng was fully up to date when the issues occurred. I do have a few updates pending now, but I haven't rebooted since the issue:

software-version (MRO)    : product_version: 8.2.1; 
product_version_text: 8.2; 
product_version_text_short: 8.2; 
platform_name: XCP; 
platform_version: 3.2.1; 
product_brand: XCP-ng; 
build_number: release/yangtze/master/58; 
hostname: localhost; 
date: 2023-08-09; 
dbv: 0.0.1; 
xapi: 1.20;
xen: 4.13.5-9.36; 
linux: 4.19.0+1;
xencenter_min: 2.16; 
xencenter_max: 2.16;
network_backend: openvswitch; 
db_schema: 5.603

olivierlambert

Okay, so you are not using XOA but XO from the sources. Don't mix them up: XOA is the turnkey version we distribute with support on https://xen-orchestra.com.

First, you must update to the latest commit (yours is already 3 weeks old), rebuild and test again.

Then you also need to update your XCP-ng host and reboot to see if it's better. If your toolstack takes 10 minutes to boot, it's expected that XO can't connect, so that's your main problem.

ScarfAntennae

@olivierlambert Ah, I understand the naming convention now.

So it's XO, but XO is irrelevant to this issue. The problem was the 10 minutes it took the toolstack to boot up, compared to the 1-2 minutes it always took.

I updated XCP-ng now, rebooted, and both hosts took 10 minutes for the stack to come up again. Any ideas what could be causing this delay and how we could troubleshoot it?

olivierlambert

It could be the time needed to plug a storage repository. Do you have an SR (or ISO SR) that is hosted in a VM on this very host?

ScarfAntennae

@olivierlambert I do.

I have also noticed something extremely weird.

I have 3 HDDs attached to one host.
2x2TB raid 1 (software raid done on the XCP-ng host)
1x4TB

lsblk shows:

... SNIP ...
sda                                                             8:0    0   1.8T  0 disk
├─sda2                                                          8:2    0   1.8T  0 part
└─sda1                                                          8:1    0     2G  0 part
...
sdb                                                             8:16   0   1.8T  0 disk
├─sdb2                                                          8:18   0   1.8T  0 part
└─sdb1                                                          8:17   0     2G  0 part
  └─md127                                                       9:127  0     2G  0 raid1
...
sde                                                             8:64   0   3.7T  0 disk
├─sde2                                                          8:66   0   3.7T  0 part
└─sde1                                                          8:65   0     2G  0 part
  └─md127                                                       9:127  0     2G  0 raid1

All 3 disks are passed through to a TrueNAS VM on the host, and all the data is properly stored, but I have no idea why mdadm shows that the 4TB disk is part of the raid, instead of the other one?

/dev/md127:
           Version : 1.2
     Creation Time : Sun Aug 27 14:32:08 2023
        Raid Level : raid1
        Array Size : 2094080 (2045.00 MiB 2144.34 MB)
     Used Dev Size : 2094080 (2045.00 MiB 2144.34 MB)
      Raid Devices : 2
     Total Devices : 2
       Persistence : Superblock is persistent

       Update Time : Sun Oct  8 12:07:28 2023
             State : clean
    Active Devices : 2
   Working Devices : 2
    Failed Devices : 0
     Spare Devices : 0

Consistency Policy : resync

              Name : november:swap0
              UUID : ae045fa0:74b00896:3134ede5:c837bec3
            Events : 27

    Number   Major   Minor   RaidDevice State
       0       8       65        0      active sync   /dev/sde1
       1       8       17        1      active sync   /dev/sdb1

Anyway, this doesn't seem to be the issue, since the other host, which has no HDDs attached (only M.2 VM SRs), also took exactly 10 minutes for the toolstack to come up.

Now XO can't reach any of the hosts, even though all the VMs are up.

olivierlambert

Any of the hosts: if they are in the same pool, that's expected. Only the master needs to be reachable.
XAPI will probably stay in a "starting" state until all SRs are plugged. If the SR lives in a VM on a host other than the master, reboot only the master and you should be able to connect sooner.
Alternatively, check https://docs.xcp-ng.org/troubleshooting/
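
To check whether an unplugged SR is what XAPI is waiting on, one option is to list the PBDs that are not currently attached once xe responds. This is a sketch, not a confirmed diagnosis for this thread: the function name list_unplugged_pbds and the XE_BIN variable are illustrative, and an empty result only means every PBD is attached at that moment.

```shell
# Sketch: list PBDs (host-to-SR connections) that are not attached.
# XE_BIN and the function name are hypothetical helpers for this example.
XE_BIN=${XE_BIN:-/opt/xensource/bin/xe}

# Print the UUID of every PBD with currently-attached=false, one per line.
# "--minimal" returns a comma-separated list, which is split here.
list_unplugged_pbds () {
  "$XE_BIN" pbd-list currently-attached=false --minimal | tr ',' '\n' | sed '/^$/d'
}

# Possible use on the host after boot:
# list_unplugged_pbds
# xe pbd-param-list uuid=<uuid from the list>   # shows the SR and host involved
```

If a PBD for an SR served by a VM on the same host shows up here during the slow boot, that would be consistent with the toolstack waiting on storage that cannot be available yet.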