XCP-ng

    Rolling Pool Update - host took too long to restart

    • T Offline
      tuxpowered @olivierlambert

      @olivierlambert It has not for me on 2 different clusters. And I made sure I was on the latest XO release before attempting.

      This also occurred about a month ago, but I was not on the current release. So this time I updated to the current XO first, then proceeded with the rolling pool update.

      One of the clusters is local to the XO VM (the VM runs on the cluster).
      The other is reached over a VPN connection. Both failed with timeouts even though the machines were up.

      I also verified this while connected to the iLO on both systems. These are DL360 Gen10 systems with 2.5 - 5GB internet connections and at least 128GB of RAM, so these are not slow machines. All disks are SSDs.

      Not sure if any of that really helps, other than to point out that the systems are not slow, that they were observed coming back online via iLO, and that I could even reconnect the master node from Settings > Servers.

      The pattern seems to be that it always migrates the VMs off the master node and reboots it, but the master never seems to reconnect afterward. The only way to recover is to manually update each node, move the VMs, and then reactivate HA.

      This also started about 2 months ago; it had been working wonderfully before that. Maybe something changed?

      • olivierlambertO Online
        olivierlambert Vates 🪐 Co-Founder CEO

        It's hard to tell, are you using XOA or XO from the sources?

        • nikadeN Offline
          nikade Top contributor @tuxpowered

          @tuxpowered Just to make sure, you're using the "Rolling pool update" button, right?
          And then the master is patched, the VMs are migrated off, and then no bueno, correct?

          Did you happen to have any shared storage in this pool, or is all storage local?

          • D Offline
            dsiminiuk @tuxpowered
            last edited by dsiminiuk

            @tuxpowered I'm wondering if this is a network connectivity issue. 😎

            When the rolling pool update stops, what does your route table look like on the master (can it reach the other node)?

            Is your VPN layer 3 (routed), layer 2 (non-routed), or an IPsec tunnel?
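A quick way to test the "can it reach the other node" part of this question is a plain TCP connection attempt from one machine to the other. The sketch below is only an illustration: the function name `can_reach` is made up for this example, and the default port assumes the host's management API answers on HTTPS port 443. It checks TCP reachability only; it does not inspect the route table itself.

```python
import socket


def can_reach(host: str, port: int = 443, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds.

    Hypothetical helper for illustration; port 443 is an assumption about
    where the host's management API listens.
    """
    try:
        # create_connection resolves the name and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, unreachable networks and DNS errors.
        return False
```

Run from the master toward another pool member (or from the XO VM toward the master) while the rolling update is stalled, for example `can_reach("192.0.2.10")` with the placeholder address replaced by the real one. A `False` result points toward the network; a `True` result suggests looking elsewhere.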

            • T Offline
              tuxpowered @olivierlambert

              @olivierlambert said in Rolling Pool Update - host took too long to restart:

              Reply

              XO CE ... XO from the sources.

              • T Offline
                tuxpowered @nikade

                @nikade Both clusters have shared storage. Kind of a prerequisite for having a cluster 🙂
                Yes, both systems were online and working well. One is TrueNAS SCALE and the other is a QNAP. Both are massively overkill systems.

                • T Offline
                  tuxpowered @dsiminiuk

                  @dsiminiuk The local cluster has no VPN or anything like that. In fact it's all on the same network, so it doesn't even hit the firewall or the internet.

                  The remote location, yes, that is over IPsec, but that tunnel never goes down.
                  When the first node reboots (the master), I can see that the system is back up in 5-8 min. If I go in to XO > Settings > Servers and click the Enable/Disable status button to reconnect, it pops right up. Again, it does not resume migrating the other nodes.
                  If I leave it and wait for it to connect on its own, sometimes it does and sometimes it doesn't. (The same generally holds true when I just reboot a normal node that is not in HA.)

                  And the XOA - CE VM has 4 cores and 16GB of RAM, and in the last year it has never used more than 6GB of RAM. CPU usage is less than 15%, and the network peak is 18MiB. These figures come from the Stats tab over the last year, so they are clearly averaged out.

                  I haven't looked at the route table while the issue is happening, as there does not appear to be any network issue. Other nodes are all reachable and manageable as well. It's just the cluster's rolling update that seems not to reconnect. But on the next round of updates I will surely take a look.

                  • olivierlambertO Online
                    olivierlambert Vates 🪐 Co-Founder CEO
                    last edited by olivierlambert

                    We'll try to reproduce it internally. The code should try to reconnect every xx seconds, so it's weird that it doesn't work 🤔

                    edit: adding @pdonias to the convo, he might ask for some XO logs to look at this in more depth.
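For readers unfamiliar with the pattern described here ("try to reconnect every xx seconds" until a restart timeout is hit), a minimal sketch of a poll-with-deadline loop follows. This is not Xen Orchestra's actual code: `wait_until`, its parameters, and the injectable `sleep`/`clock` arguments are all hypothetical names chosen for this illustration, and the post does not state the real interval or timeout.

```python
import time
from typing import Callable


def wait_until(
    check: Callable[[], bool],
    interval: float,
    timeout: float,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Call `check` every `interval` seconds until it returns True or `timeout` elapses.

    Returns True as soon as `check` succeeds, False once another wait
    would run past the deadline. Illustrative sketch only.
    """
    deadline = clock() + timeout
    while True:
        if check():
            return True
        # Give up if sleeping one more interval would pass the deadline.
        if clock() + interval > deadline:
            return False
        sleep(interval)
```

With a loop of this shape, a "host took too long to restart" style of error would be raised when the function returns `False`. If `check` keeps failing even though the host is up (for example because a stale connection is reused), the loop times out no matter how quickly the machine actually rebooted, which would be consistent with the symptom reported in this thread.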

                    • nikadeN Offline
                      nikade Top contributor @tuxpowered

                      @tuxpowered said in Rolling Pool Update - host took too long to restart:

                      @nikade Both clusters have shared storage. Kind of a prerequisite for having a cluster 🙂
                      Yes, both systems were online and working well. One is TrueNAS SCALE and the other is a QNAP. Both are massively overkill systems.

                      Yeah, I was hoping that you were using shared storage, but I've actually seen people using clusters/pools without shared storage, so I felt I had to ask 🙂

                      • D Offline
                        dsiminiuk @tuxpowered

                        @tuxpowered said in Rolling Pool Update - host took too long to restart:

                        When the first node reboots (the master), I can see that the system is back up in 5-8 min. If I go in to XO > Settings > Servers and click the Enable/Disable status button to reconnect, it pops right up. Again, it does not resume migrating the other nodes.

                        That is what I am seeing as well. I logged it here: https://xcp-ng.org/forum/topic/9683/rolling-pool-update-incomplete
