XCP-ng

    Rolling pool update failed to migrate VMs back

    • Neal

      Hi,

      I have a 5-host XCP-ng 8.2 setup with 380 GB of RAM per host, a SAN, HA etc., managed by XOA. Last night I performed a rolling pool update and it successfully worked through evacuating, updating and rebooting each host before starting to migrate VMs back to their original hosts. In the middle of that final migration step, multiple VMs failed with "not enough memory" errors. When I checked in the morning I had one host with only a few GB of free RAM and other hosts with ~200 GB free - very unbalanced and definitely not what I was expecting.

      I've checked the forum and not found any other RPU issues reported at this stage of the process - if I've missed something, please let me know.

      With 1 host completely evacuated the remaining hosts were at ~85%, so there is plenty of space to shuffle VMs about, but I guess the RPU tried to move some VMs before enough had been shifted off the target to make room for them?

      We manually distribute VMs for high availability and load balancing, so would ideally like them to return to their original locations automatically when done.

      How can I ensure that the final "migrate VMs back" step completes successfully in the future?

      The error was:

                          "message": "HOST_NOT_ENOUGH_FREE_MEMORY(34642853888, 3430486016)",
                          "name": "XapiError",
                          "stack": "XapiError: HOST_NOT_ENOUGH_FREE_MEMORY(34642853888, 3430486016)\n    at Function.wrap (file:///usr/local/lib/node_modules/xo-server/node_modules/xen-api/_XapiError.mjs:16:12)\n    at default (file:///usr/local/lib/node_modules/xo-server/node_modules/xen-api/_getTaskResult.mjs:13:29)\n    at Xapi._addRecordToCache (file:///usr/local/lib/node_modules/xo-server/node_modules/xen-api/index.mjs:1068:24)\n    at file:///usr/local/lib/node_modules/xo-server/node_modules/xen-api/index.mjs:1102:14\n    at Array.forEach (<anonymous>)\n    at Xapi._processEvents (file:///usr/local/lib/node_modules/xo-server/node_modules/xen-api/index.mjs:1092:12)\n    at Xapi._watchEvents (file:///usr/local/lib/node_modules/xo-server/node_modules/xen-api/index.mjs:1265:14)"
      
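      If I'm reading the error parameters correctly (the first number being the memory needed and the second the memory actually free on the target host, both in bytes), the figures decode roughly like this:

          echo $(( 34642853888 / 1024 / 1024 / 1024 ))   # ~32 GiB needed for the incoming VM
          echo $(( 3430486016 / 1024 / 1024 / 1024 ))    # ~3 GiB actually free on the target host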

      Thanks in advance for any pointers,
      Neal.

    • olivierlambert Vates 🪐 Co-Founder CEO

        Hi,

        It seems you don't have XOA but XO from the sources, right?

        Also, I would check whether you use dynamic memory for your VMs, as that might complicate the RPU's life 🙂

    • Neal @olivierlambert

          Hi @olivierlambert,

          Thanks for replying so quickly.

          No, we are using XOA with an enterprise license, fully updated to version 5.102.1. None of our VMs use dynamic memory (I just double-checked and for all of them memory-dynamic-max = memory-dynamic-min) as I've been burned by that and migrations in the past 🙂
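          For anyone wanting to run the same check, an xe query along these lines shows the relevant values (just a sketch):

              # dynamic-min == dynamic-max for every VM means no ballooning is in play
              xe vm-list is-control-domain=false params=name-label,memory-dynamic-min,memory-dynamic-max,memory-static-max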

          Regards,
          Neal.

    • olivierlambert Vates 🪐 Co-Founder CEO

            Thanks for the details. It's weird, I've never heard of such RPU issues after all the hosts were shuffled; in theory, it should use the same placement as before the RPU to make sure there are no surprises 🤔 Could it be some VMs that were halted before the RPU with "auto restart on boot" enabled? That might explain it.

    • Neal @olivierlambert

              Hi,

              Unfortunately not - before starting, I checked for any halted VMs set to auto power on, using this filter in XOA:

              auto_poweron? power_state:halted
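              Roughly the same check can be done from the CLI, assuming auto_poweron lives in the VM's other-config as usual:

                  # list halted VMs and their other-config; look for auto_poweron=true entries
                  xe vm-list power-state=halted params=name-label,other-config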
              

              I have another near identical setup (identical hardware, almost identical VM numbers, sizes etc) which I will be doing an RPU on next week. I'll try the updates there and see if that pool has the same issues.

              Thanks for confirming I'm not missing anything obvious.

              Cheers,
              Neal.

    • olivierlambert Vates 🪐 Co-Founder CEO

                Yeah, that would be great if you could follow the process and try to spot where things are going south 🤔

    • nikade Top contributor

                   We've also experienced trouble almost every time we've updated our pools, ever since the old XenServer days, and Citrix kind of recommended "manual intervention" because there was no mechanism to check which hosts were suitable before a VM was migrated.

                   I think there has been a lot of work done in XOA to handle this, though I might be mistaken; we just ended up re-installing our hosts and setting up a new pool, which we then live migrate our VMs over to, scrapping the old ones.

                   VMware has some logic which will try to balance the load between the hosts, and if you have DRS it will even balance your hosts automatically during runtime.
                   I'm pretty sure XOA has this logic as well, but XCP-ng Center definitely doesn't, so avoid it as much as possible.

    • Neal @olivierlambert

                     Updating our second pool had the same issue. This time I stayed up to 1am to watch it. VMs are migrated "back" to their original host in the wrong order, causing some hosts to fill up and therefore VM migrations to fail. Specifically, we have 5 hosts - xcp01, xcp02, xcp03, xcp04 and xcp05 - and xcp02 is the pool master.

                     Hitting the RPU button drained each host, updated and rebooted it, then repeated with the next.

                    Update order and where VMs were drained to:

                    • 02 -> all
                    • 01 -> 02
                    • 03 -> 01
                    • 05 -> 03
                    • 04 -> 05

                     "Move back" order (should be the reverse of the update order):

                    • 05->04
                    • 02->01 xcp01 Full!
                    • 01->03 xcp03 Full!
                    • 03->05
                     • then multiple hosts back to 02 to finish up.

                     We are running our hosts at about 60% RAM used, and all our VMs have the same min and max set for dynamic memory so they cannot shrink to make space. When one host is drained during the rolling upgrade and move back, the remaining hosts are closer to 75% used. All our hosts have identical hardware, and the pool master was xcp02, which explains why that was the first one to start and the last to finish. We have a SAN in use, so only memory is being migrated; disk space is not a factor.
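                     Rough arithmetic for why the move-back order matters (illustrative only, using the ~60% figure above rather than exact numbers from our pool):

                         # 5 hosts at ~60% RAM used; drain one and its load spreads over the other 4
                         echo $(( 5 * 60 / 4 ))    # 75  -> the remaining hosts sit at ~75% used
                         # if another host's VMs are then sent to one of those ~75% hosts before its
                         # own surplus has been moved off, the target would need roughly
                         echo $(( 75 + 60 ))       # 135 -> over 100%, hence HOST_NOT_ENOUGH_FREE_MEMORY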

                     Can you confirm the logic that XOA uses for deciding the order of migrations?

                    Thanks,
                    Neal.

    • Andrew Top contributor @olivierlambert

                       @olivierlambert I have also run into a different problem. When I start a rolling pool update and want to make things move faster, I'll also manually migrate VMs off the server that is pending a reboot. The problem is that XO will then migrate those already-moved VMs again to a different server. The process should check whether the next VM to be migrated is actually still on the server to be rebooted; if not, it should recognise that the VM has already been migrated off and not migrate it again.
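                       On the XAPI side the check could be as simple as something like this (just a sketch, placeholder uuid):

                           # only VMs still resident on the host pending reboot actually need to move
                           xe vm-list resident-on=<host_uuid> power-state=running params=uuid,name-label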

                       It would also be nice to have a dynamic number of VMs to migrate concurrently. If the VMs are not busy and will be easy to migrate (i.e., low active CPU and memory), it should migrate more of them concurrently. And/or have a manual selection when you click the pool update button (dynamic/all/some number).

    • tjkreidl Ambassador

                         Ever since the early days of XenServer, I have always done the upgrade procedure manually, starting of course with the pool master, and manually migrating VMs to other hosts to make sure they all remain running (tracking, of course, which VMs should run on which host - the so-called host affinity setting). This can be set on individual VMs with the command:
                         xe vm-param-set uuid=<vm_uuid> affinity=<host_uuid>
                         That way, you can make sure all VMs are successfully migrated off any given host before it's updated.
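                         Roughly, the workflow looks like this (uuids are placeholders):

                             xe host-list params=uuid,name-label                   # find the uuid of the VM's "home" host
                             xe vm-param-set uuid=<vm_uuid> affinity=<host_uuid>   # set the VM's preferred host
                             xe vm-param-get uuid=<vm_uuid> param-name=affinity    # confirm the setting took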

    • Neal @olivierlambert

                           @olivierlambert Any further thoughts on this? With the order the rolling pool upgrade seemed to use for migrating VMs back to their original hosts, it looks to me like it would fail any time the servers in a pool were over 50% committed on RAM. Previously, when running the RPU, we would have been under 50% committed, which may be why we have not seen this before.

                           I do not think there is anything special in our setup that would impact this, but obviously we are hitting some corner case that most do not. Would it be worth raising this as a support ticket for XOA?

                          Cheers,
                          Neal.

    • olivierlambert Vates 🪐 Co-Founder CEO

                             I'm AFK for multiple weeks, so I have zero bandwidth. Please open a ticket; on my side I'm not aware of many similar reports (which would have made it easier to fix).

    • BenjiReis Vates 🪐 XCP-ng Team @Neal

                              @Neal hi

                               Is HA enabled in your pool? If so, and there are VMs not protected by HA on the host you're trying to evacuate, that is the cause of the error.

                               You can either (rough CLI equivalents below):

                               • set all VMs to be HA-protected before attempting the evacuation
                               • disable HA for the duration of the RPU and re-enable it afterwards
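                               Roughly, from the CLI (a sketch - substitute your own uuids, and note the heartbeat SR uuid is needed to re-enable HA):

                                   # option 1: mark a VM as HA-protected before the RPU
                                   xe vm-param-set uuid=<vm_uuid> ha-restart-priority=restart
                                   # option 2: disable HA for the duration of the RPU, then re-enable it
                                   xe pool-ha-disable
                                   xe pool-ha-enable heartbeat-sr-uuids=<sr_uuid>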

                              Regards

    • Neal @BenjiReis

                                 @BenjiReis HA is automatically disabled by the RPU when it starts, then re-enabled at the end - I can see the task for that in the task log. We do have about 5 VMs that do not have HA enabled, but they are very small (~4 GB each) so they should not make any difference regardless of which hosts they were on.
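                                 For anyone wanting to check the same thing, listing HA protection per VM is straightforward (an empty ha-restart-priority means the VM is not protected):

                                     xe vm-list is-control-domain=false params=name-label,ha-restart-priority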

                                 Also, all VMs successfully evacuate from each host for the updates; it's only when VMs are migrated back after all hosts are upgraded that I see a problem.

                                 I've raised a support ticket; if anything relevant comes out of it I'll report back here for future readers.

                                Thanks for the suggestions,
                                Neal.
