XCP-ng

    Error: Connection refused (calling connect ) (XCP-ng toolstack hang on boot)

    8 Posts 2 Posters 2.0k Views 2 Watching
    • ScarfAntennae

      Hi, I use a startup script that starts VMs in a specific order every time my single, main XCP-ng host is restarted.

      The past few days I've been getting random failures, where at first the XOA VM just loses connectivity to the host toolstack, even though all VMs are up and the host is functional (I can ssh in).

      The script was configured like this:

      #!/bin/bash
      
      # VM name-labels (see xe vm-list), listed in start order
      vms=(vm1 vm2 vm3 etc...)
      wait=30s
      
      # No need to modify below
      initwait=3m
      vmslength=${#vms[@]}
      log=/root/scripts/startup.log
      
      start_vm () {
         echo -n "$(date +"[%Y-%m-%d %H:%M:%S]") Starting $1 ... " >> "${log}"
         # Quote the name-label so VM names containing spaces still work
         /opt/xensource/bin/xe vm-start name-label="$1"
         if [ $? -eq 0 ]
           then
             echo "Success" >> "${log}"
           else
             echo "FAILED" >> "${log}"
         fi
      
         # Wait if not the last vm
         if [ "$1" != "${vms[${vmslength}-1]}" ]
           then
             echo "Waiting ${wait}" >> "${log}"
             sleep "${wait}"
         fi
      }
      
      # Truncate the log, then wait for the toolstack before starting anything
      echo "$(date +"[%Y-%m-%d %H:%M:%S]") Running autostart script (Waiting ${initwait})" > "${log}"
      sleep "${initwait}"
      
      # Quote the array expansion so each name-label stays one word
      for vm in "${vms[@]}"
      do
        start_vm "${vm}"
      done
      
      echo "$(date +"[%Y-%m-%d %H:%M:%S]") Startup complete." >> "${log}"
      echo
      

      As you can see the initwait is set to 3m, having the script wait for the XCP-ng toolstack to get ready, and I've had no issues with this config for the past year.
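A fixed initwait only works while the toolstack comes up faster than the chosen delay. One possible alternative is to poll until the toolstack answers. The following is a minimal sketch, not a tested drop-in: the function name wait_until and the 900/10 second values are hypothetical, and it assumes that xe host-list exits with an error (such as "Connection refused") while the toolstack is not yet reachable.

```shell
#!/bin/bash

# Retry a command until it succeeds or a timeout (in seconds) expires.
# Usage: wait_until <timeout_s> <interval_s> <command> [args...]
# Note: if the command itself blocks instead of failing, the timeout
# is only checked once the command returns.
wait_until () {
   local timeout=$1 interval=$2
   shift 2
   local deadline=$(( SECONDS + timeout ))

   until "$@" > /dev/null 2>&1
   do
      if [ "${SECONDS}" -ge "${deadline}" ]
        then
          return 1
      fi
      sleep "${interval}"
   done
   return 0
}

# Hypothetical replacement for "sleep ${initwait}" in the script above:
# wait up to 15 minutes, polling every 10 seconds.
#
#   if ! wait_until 900 10 /opt/xensource/bin/xe host-list
#     then
#       echo "Toolstack not ready, giving up" >> "${log}"
#       exit 1
#   fi
```

With this approach the VMs start as soon as the toolstack responds, whether that takes 2 minutes or 10, and the log records when it never came up at all.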

      Now I have noticed that the toolstack takes about 10 minutes to start, where it took about 2 beforehand. I have no idea what's going wrong because I didn't do any updates in the meantime.

      Does anyone have an idea where I should look to see what's causing this 10 minute hang?

      Even after rebooting the host, after the XOA VM is up, it can't connect to the toolstack for some reason:
      connect ETIMEDOUT host-ip:443

      Update: the XOA error is due to a kernel issue. 5.10.0-25-amd64 works, 5.10.0-26-amd64 cannot connect to any XCP-ng host. This still leaves me wondering why the XCP-ng host toolstack startup time has increased so drastically.

      • olivierlambert Vates 🪐 Co-Founder CEO

        Hi,

        1. Are you using XOA or XO from the sources? XOA is the version you find on https://xen-orchestra.com that we consistently test before release
        2. And is it fully up to date?
        3. Is your XCP-ng host fully up to date? 8.2 or 8.3?
        • ScarfAntennae @olivierlambert

          @olivierlambert Sorry for the delay; for some reason I'm not receiving emails for replies here.

          I'm using XO from the sources

          Xen Orchestra, commit 3c047
          xo-server 5.124.0
          xo-web 5.126.0
          

          XCP-ng was fully up to date when the issues occurred. I do have a few updates pending now, but I haven't rebooted since the issue:

          software-version (MRO)    : product_version: 8.2.1; 
          product_version_text: 8.2; 
          product_version_text_short: 8.2; 
          platform_name: XCP; 
          platform_version: 3.2.1; 
          product_brand: XCP-ng; 
          build_number: release/yangtze/master/58; 
          hostname: localhost; 
          date: 2023-08-09; 
          dbv: 0.0.1; 
          xapi: 1.20;
          xen: 4.13.5-9.36; 
          linux: 4.19.0+1;
          xencenter_min: 2.16; 
          xencenter_max: 2.16;
          network_backend: openvswitch; 
          db_schema: 5.603
          • olivierlambert Vates 🪐 Co-Founder CEO

            Okay so you are not using XOA but XO from the sources (don't mix them 🙂 XOA is the turnkey version we distribute with support on https://xen-orchestra.com)

            First, you must update to the latest commit (yours is already 3 weeks old), rebuild and test again.

            Then, you also need to update your XCP-ng and reboot, to see if it's better: if your toolstack takes 10 minutes to boot, it's expected that XO can't connect, so that's your main problem.

            • ScarfAntennae @olivierlambert

              @olivierlambert Ah, I understand the naming convention now.

              So it's XO, but XO is irrelevant to this issue. The problem was the 10 minutes it took the toolstack to boot up, compared to the 1-2 minutes it always took.

              I have now updated XCP-ng and rebooted, and on both hosts the toolstack again took 10 minutes to come up. Any ideas what could be causing this delay and how we could troubleshoot it?

              • olivierlambert Vates 🪐 Co-Founder CEO

                It could be the time needed to plug a storage repository. Do you have an SR (or ISO SR) that is hosted in a VM on this very host?

                • ScarfAntennae @olivierlambert

                  @olivierlambert I do.

                  I have also noticed something extremely weird.

                  I have 3 HDDs attached to one host.
                  2x2TB raid 1 (software raid done on the XCP-ng host)
                  1x4TB

                  lsblk shows:

                  ... SNIP ...
                  sda                                                             8:0    0   1.8T  0 disk
                  ├─sda2                                                          8:2    0   1.8T  0 part
                  └─sda1                                                          8:1    0     2G  0 part
                  ...
                  sdb                                                             8:16   0   1.8T  0 disk
                  ├─sdb2                                                          8:18   0   1.8T  0 part
                  └─sdb1                                                          8:17   0     2G  0 part
                    └─md127                                                       9:127  0     2G  0 raid1
                  ...
                  sde                                                             8:64   0   3.7T  0 disk
                  ├─sde2                                                          8:66   0   3.7T  0 part
                  └─sde1                                                          8:65   0     2G  0 part
                    └─md127                                                       9:127  0     2G  0 raid1
                  

                  All 3 disks are passed through to a TrueNAS VM on the host, and all the data is properly stored, but I have no idea why mdadm shows the 4TB disk as part of the RAID instead of the other 2TB disk.

                  /dev/md127:
                             Version : 1.2
                       Creation Time : Sun Aug 27 14:32:08 2023
                          Raid Level : raid1
                          Array Size : 2094080 (2045.00 MiB 2144.34 MB)
                       Used Dev Size : 2094080 (2045.00 MiB 2144.34 MB)
                        Raid Devices : 2
                       Total Devices : 2
                         Persistence : Superblock is persistent
                  
                         Update Time : Sun Oct  8 12:07:28 2023
                               State : clean
                      Active Devices : 2
                     Working Devices : 2
                      Failed Devices : 0
                       Spare Devices : 0
                  
                  Consistency Policy : resync
                  
                                Name : november:swap0
                                UUID : ae045fa0:74b00896:3134ede5:c837bec3
                              Events : 27
                  
                      Number   Major   Minor   RaidDevice State
                         0       8       65        0      active sync   /dev/sde1
                         1       8       17        1      active sync   /dev/sdb1
                  

                  Anyway, this doesn't seem to be the issue, since the other host, which has no HDDs attached (only M.2 VM SRs), also took exactly 10 minutes for the toolstack to come up.

                  Now XO can't reach any of the hosts, even though all the VMs are up.

                  • olivierlambert Vates 🪐 Co-Founder CEO

                    1. "Any of the hosts": if they are in the same pool, that's logical. Only the master needs to be reachable.
                    2. XAPI will probably stay in the "Starting" state until all SRs are plugged. If the SR lives on a VM on a host other than the master, reboot only the master and you should be able to connect sooner.
                    3. Alternatively, check https://docs.xcp-ng.org/troubleshooting/
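To check point 2 once the toolstack finally answers, one option is to list the SRs whose PBDs are still unplugged. This is a minimal sketch under assumptions: the function name unplugged_srs and the XE variable are hypothetical, and it assumes that xe pbd-list accepts currently-attached=false as a filter together with params=sr-uuid and --minimal (comma-separated output).

```shell
#!/bin/bash

# Path to the xe CLI; overridable so the function can be tried elsewhere.
XE=${XE:-/opt/xensource/bin/xe}

# Print the SR UUID of every PBD that is not currently attached, one per line.
unplugged_srs () {
   "${XE}" pbd-list currently-attached=false params=sr-uuid --minimal \
     | tr ',' '\n' \
     | sed '/^$/d'
}

# Hypothetical usage on the host: show the name of each unplugged SR.
#
#   for sr in $(unplugged_srs)
#   do
#     /opt/xensource/bin/xe sr-list uuid="${sr}" params=name-label
#   done
```

If an SR backed by a VM on the same host shows up here during the slow boot, that would be consistent with the toolstack waiting on storage that cannot be available yet.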