XCP-ng

    Rolling pool update failed to migrate VMs back

    • Neal

      Hi,

      I have a 5-host XCP-ng 8.2 setup with 380 GB of RAM per host, a SAN, HA etc., managed by XOA. Last night I performed a rolling pool update and it successfully worked through evacuating, updating and rebooting each host before starting to migrate VMs back to their original hosts. In the middle of that final migration step, multiple VMs failed with "not enough memory" errors. When I checked in the morning I had one host with only a few GB of free RAM and other hosts with ~200 GB free - very unbalanced and definitely not what I was expecting.

      I've checked the forum and not found any other RPU issues reported at this stage of the process - if I've missed something, please let me know.

      With 1 host completely evacuated the remaining hosts were at ~85%, so there is plenty of space to shuffle VMs about, but I guess the RPU tried to move some VMs before enough had been shifted off the target to make room for them?

      We manually distribute VMs for high availability and load balancing, so would ideally like them to return to their original locations automatically when done.

      How can I ensure that the final "migrate VMs back" step completes successfully in the future?

      The error was:

                          "message": "HOST_NOT_ENOUGH_FREE_MEMORY(34642853888, 3430486016)",
                          "name": "XapiError",
                          "stack": "XapiError: HOST_NOT_ENOUGH_FREE_MEMORY(34642853888, 3430486016)\n    at Function.wrap (file:///usr/local/lib/node_modules/xo-server/node_modules/xen-api/_XapiError.mjs:16:12)\n    at default (file:///usr/local/lib/node_modules/xo-server/node_modules/xen-api/_getTaskResult.mjs:13:29)\n    at Xapi._addRecordToCache (file:///usr/local/lib/node_modules/xo-server/node_modules/xen-api/index.mjs:1068:24)\n    at file:///usr/local/lib/node_modules/xo-server/node_modules/xen-api/index.mjs:1102:14\n    at Array.forEach (<anonymous>)\n    at Xapi._processEvents (file:///usr/local/lib/node_modules/xo-server/node_modules/xen-api/index.mjs:1092:12)\n    at Xapi._watchEvents (file:///usr/local/lib/node_modules/xo-server/node_modules/xen-api/index.mjs:1265:14)"
      
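      If I'm reading the error parameters correctly (the first number being the memory needed and the second the memory actually free on the target host, both in bytes), the figures decode roughly like this:

          echo $(( 34642853888 / 1024 / 1024 / 1024 ))   # ~32 GiB needed for the incoming VM
          echo $(( 3430486016 / 1024 / 1024 / 1024 ))    # ~3 GiB actually free on the target host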

      Thanks in advance for any pointers,
      Neal.

    • olivierlambert Vates 🪐 Co-Founder CEO

        Hi,

        It seems you don't have XOA but XO from the sources, right?

        Also, I would check whether you use dynamic memory for your VMs, as that might complicate the RPU's life 🙂

    • Neal @olivierlambert

          Hi @olivierlambert,

          Thanks for replying so quickly.

          No, we are using XOA with an enterprise license, fully updated to version 5.102.1. None of our VMs use dynamic memory (I just double-checked and for all of them memory-dynamic-max = memory-dynamic-min) as I've been burned by that and migrations in the past 🙂
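          For anyone wanting to run the same check, an xe query along these lines shows the relevant values (just a sketch):

              # dynamic-min == dynamic-max for every VM means no ballooning is in play
              xe vm-list is-control-domain=false params=name-label,memory-dynamic-min,memory-dynamic-max,memory-static-max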

          Regards,
          Neal.

    • olivierlambert Vates 🪐 Co-Founder CEO

            Thanks for the details. It's weird, I've never heard of such RPU issues after all the hosts were shuffled; in theory, it should use the same placement as before the RPU to make sure there are no surprises 🤔 Could it be some VMs that were halted before the RPU with "auto restart on boot" enabled? That might explain it.

    • Neal @olivierlambert

              Hi,

              Unfortunately not - before starting, I checked for any halted VMs set to auto power on, using this filter in XOA:

              auto_poweron? power_state:halted
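              Roughly the same check can be done from the CLI, assuming auto_poweron lives in the VM's other-config as usual:

                  # list halted VMs and their other-config; look for auto_poweron=true entries
                  xe vm-list power-state=halted params=name-label,other-config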
              

              I have another near identical setup (identical hardware, almost identical VM numbers, sizes etc) which I will be doing an RPU on next week. I'll try the updates there and see if that pool has the same issues.

              Thanks for confirming I'm not missing anything obvious.

              Cheers,
              Neal.

    • olivierlambert Vates 🪐 Co-Founder CEO

                Yeah, that would be great if you could follow the process and try to spot where things are going south 🤔

    • nikade Top contributor

                   We've also experienced trouble almost every time we've updated our pools, ever since the old XenServer days, and Citrix kind of recommended "manual intervention" because there was no mechanism to check which hosts were suitable before a VM was migrated.

                   I think there has been a lot of work done in XOA to handle this, though I might be mistaken; we just ended up re-installing our hosts and setting up a new pool, which we then live migrate our VMs over to, scrapping the old ones.

                   VMware has some logic which will try to balance the load between the hosts, and if you have DRS it will even balance your hosts automatically during runtime.
                   I'm pretty sure XOA has this logic as well, but XCP-ng Center definitely doesn't, so avoid it as much as possible.

    • Neal @olivierlambert

                     Updating our second pool had the same issue. This time I stayed up to 1am to watch it. VMs are migrated "back" to their original host in the wrong order, causing some hosts to fill up and therefore VM migrations to fail. Specifically, we have 5 hosts - xcp01, xcp02, xcp03, xcp04 and xcp05 - and xcp02 is the pool master.

                     Hitting the RPU button drained each host, updated and rebooted it, then repeated with the next.

                    Update order and where VMs were drained to:

                    • 02 -> all
                    • 01 -> 02
                    • 03 -> 01
                    • 05 -> 03
                    • 04 -> 05

                     "Move back" order (should be the reverse of the update order):

                    • 05->04
                    • 02->01 xcp01 Full!
                    • 01->03 xcp03 Full!
                    • 03->05
                     • then multiple hosts back to 02 to finish up.

                     We are running our hosts at about 60% RAM used, and all our VMs have the same min and max set for dynamic memory so they cannot shrink to make space. When one host is drained during the rolling upgrade and move back, the remaining hosts are closer to 75% used. All our hosts have identical hardware, and the pool master was xcp02, which explains why that was the first one to start and the last to finish. We have a SAN in use, so only memory is being migrated; disk space is not a factor.
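                     Rough arithmetic for why the move-back order matters (illustrative only, using the ~60% figure above rather than exact numbers from our pool):

                         # 5 hosts at ~60% RAM used; drain one and its load spreads over the other 4
                         echo $(( 5 * 60 / 4 ))    # 75  -> the remaining hosts sit at ~75% used
                         # if another host's VMs are then sent to one of those ~75% hosts before its
                         # own surplus has been moved off, the target would need roughly
                         echo $(( 75 + 60 ))       # 135 -> over 100%, hence HOST_NOT_ENOUGH_FREE_MEMORY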

                     Can you confirm the logic that XOA uses for deciding the order of migrations?

                    Thanks,
                    Neal.

    • Andrew Top contributor @olivierlambert

                       @olivierlambert I have also run into a different problem. When I start a rolling pool update and want to make things move faster, I'll also manually migrate VMs off the server that is pending a reboot. The problem is that XO will then migrate those already-moved VMs again to a different server. The process should check whether the next VM to be migrated is actually still on the server to be rebooted; if not, it should recognise that the VM has already been migrated off and not migrate it again.
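                       On the XAPI side the check could be as simple as something like this (just a sketch, placeholder uuid):

                           # only VMs still resident on the host pending reboot actually need to move
                           xe vm-list resident-on=<host_uuid> power-state=running params=uuid,name-label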

                       It would also be nice to have a dynamic number of VMs to migrate concurrently. If the VMs are not busy and will be easy to migrate (i.e., low active CPU and memory), it should migrate more of them concurrently. And/or have a manual selection when you click the pool update button (dynamic/all/some number).

    • tjkreidl Ambassador

                         Ever since the early days of XenServer, I have always done the upgrade procedure manually, starting of course with the pool master, and manually migrating VMs to other hosts to make sure they all remain running (tracking, of course, which VMs should run on which host - the so-called host affinity setting). This can be set on individual VMs with the command:
                         xe vm-param-set uuid=<vm_uuid> affinity=<host_uuid>
                         That way, you can make sure all VMs are successfully migrated off any given host before it's updated.
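                         Roughly, the workflow looks like this (uuids are placeholders):

                             xe host-list params=uuid,name-label                   # find the uuid of the VM's "home" host
                             xe vm-param-set uuid=<vm_uuid> affinity=<host_uuid>   # set the VM's preferred host
                             xe vm-param-get uuid=<vm_uuid> param-name=affinity    # confirm the setting took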

    • Neal @olivierlambert

                           @olivierlambert Any further thoughts on this? With the order the rolling pool upgrade seemed to use for migrating VMs back to their original hosts, it looks to me like it would fail any time the servers in a pool were over 50% committed on RAM. Previously, when running the RPU, we would have been under 50% committed, which may be why we have not seen this before.

                           I do not think there is anything special in our setup that would impact this, but obviously we are hitting some corner case that most do not. Would it be worth raising this as a support ticket for XOA?

                          Cheers,
                          Neal.

    • olivierlambert Vates 🪐 Co-Founder CEO

                             I'm AFK for multiple weeks, so I have zero bandwidth. Please open a ticket; on my side I'm not aware of many similar reports (which would have made it easier to fix).

    • BenjiReis Vates 🪐 XCP-ng Team @Neal

                              @Neal hi

                               Is HA enabled in your pool? If so, and there are VMs not protected by HA on the host you're trying to evacuate, that is the cause of the error.

                               You can either (rough CLI equivalents below):

                               • set all VMs to be HA-protected before attempting the evacuation
                               • disable HA for the duration of the RPU and re-enable it afterwards
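                               Roughly, from the CLI (a sketch - substitute your own uuids, and note the heartbeat SR uuid is needed to re-enable HA):

                                   # option 1: mark a VM as HA-protected before the RPU
                                   xe vm-param-set uuid=<vm_uuid> ha-restart-priority=restart
                                   # option 2: disable HA for the duration of the RPU, then re-enable it
                                   xe pool-ha-disable
                                   xe pool-ha-enable heartbeat-sr-uuids=<sr_uuid>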

                              Regards

    • Neal @BenjiReis

                                 @BenjiReis HA is automatically disabled by the RPU when it starts, then re-enabled at the end - I can see the task for that in the task log. We do have about 5 VMs that do not have HA enabled, but they are very small (~4 GB each) so they should not make any difference regardless of which hosts they were on.
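                                 For anyone wanting to check the same thing, listing HA protection per VM is straightforward (an empty ha-restart-priority means the VM is not protected):

                                     xe vm-list is-control-domain=false params=name-label,ha-restart-priority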

                                 Also, all VMs successfully evacuate from each host for the updates; it's only when VMs are migrated back after all hosts are upgraded that I see a problem.

                                 I've raised a support ticket; if anything relevant comes out of it I'll report back here for future readers.

                                Thanks for the suggestions,
                                Neal.
