Rolling pool update failed to migrate VMs back
-
Hi,
I have a five-host XCP-ng 8.2 pool with 380 GB of RAM per host, SAN storage, HA, etc., managed with XOA. Last night I performed a rolling pool update (RPU) and it successfully worked through evacuating, updating and rebooting each host before starting to migrate VMs back to their original hosts. Partway through that final migrate-back phase, multiple migrations failed with "not enough memory" errors. When I checked in the morning, one host had only a few GB of free RAM while the others had ~200 GB free - very unbalanced and definitely not what I was expecting.
I've checked the forum and not found any other reports of RPU issues at this stage of the process - if I've missed something, please let me know.
With one host completely evacuated, the remaining hosts were at ~85% memory usage, so there is plenty of room to shuffle VMs about. My guess is the RPU tried to move some VMs back before enough had been shifted off the target host to make room for them?
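Rough numbers to back that up: four remaining hosts x 380 GB at ~85% is roughly 1,290 GB of VM RAM, against 1,900 GB of total capacity once all five hosts are back - about 68% overall, so a clean re-placement should always have been possible.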
We manually distribute VMs for high availability and load balancing, so we'd ideally like them to return to their original locations automatically when the RPU completes.
How can I ensure that the final "migrate VMs back" step completes successfully in the future?
The error was:
"message": "HOST_NOT_ENOUGH_FREE_MEMORY(34642853888, 3430486016)", "name": "XapiError", "stack": "XapiError: HOST_NOT_ENOUGH_FREE_MEMORY(34642853888, 3430486016)\n at Function.wrap (file:///usr/local/lib/node_modules/xo-server/node_modules/xen-api/_XapiError.mjs:16:12)\n at default (file:///usr/local/lib/node_modules/xo-server/node_modules/xen-api/_getTaskResult.mjs:13:29)\n at Xapi._addRecordToCache (file:///usr/local/lib/node_modules/xo-server/node_modules/xen-api/index.mjs:1068:24)\n at file:///usr/local/lib/node_modules/xo-server/node_modules/xen-api/index.mjs:1102:14\n at Array.forEach (<anonymous>)\n at Xapi._processEvents (file:///usr/local/lib/node_modules/xo-server/node_modules/xen-api/index.mjs:1092:12)\n at Xapi._watchEvents (file:///usr/local/lib/node_modules/xo-server/node_modules/xen-api/index.mjs:1265:14)"
Thanks in advance for any pointers,
Neal. -
Hi,
It seems you don't have XOA but XO from the sources, right?
Also, I would check whether you use dynamic memory for your VMs; that can complicate the life of the RPU.
-
Hi @olivierlambert,
Thanks for replying so quickly.
No, we are using XOA with an Enterprise license, fully updated to version 5.102.1. None of our VMs use dynamic memory (I just double-checked and for all of them memory-dynamic-max = memory-dynamic-min), as I've been burned by dynamic memory and migrations in the past.
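For reference, here's roughly how I checked, run from the pool master (xe exposes both fields per VM):

```
# Statically configured VMs show memory-dynamic-min == memory-dynamic-max
xe vm-list is-control-domain=false params=name-label,memory-dynamic-min,memory-dynamic-max
```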
Regards,
Neal. -
Thanks for the details. That's weird; I've never heard of such RPU issues after all the hosts were updated. In theory, it should use the same placement as before the RPU so there are no surprises.
Could there have been some halted VMs before the RPU with "auto restart on boot" enabled? That might explain it.
-
Hi,
Unfortunately not - before starting I checked for any halted VMs set to auto power on, using this filter in XOA:
```
auto_poweron? power_state:halted
```
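For anyone doing this from the CLI instead, the equivalent check should be something along these lines (XO stores the flag in the VM's other-config map):

```
# Halted VMs flagged to auto power on with the pool
xe vm-list power-state=halted other-config:auto_poweron=true params=name-label
```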
I have another near-identical setup (identical hardware, almost identical VM numbers, sizes, etc.) which I will be doing an RPU on next week. I'll try the updates there and see if that pool has the same issues.
Thanks for confirming I'm not missing anything obvious.
Cheers,
Neal. -
Yeah, it would be great if you could follow the process and try to spot where things are going south.
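While it runs, you can also watch what XAPI is doing live on the pool master, something like:

```
# Follow migration activity and memory errors during the RPU
tail -f /var/log/xensource.log | grep -Ei 'migrate|NOT_ENOUGH_FREE_MEMORY'
```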
-
We've also experienced trouble almost every time we've updated our pools, ever since the old XenServer days, and Citrix kind of recommended "manual intervention" because there was no mechanism to check which hosts were suitable before a VM was migrated.
I think a lot of work has been done in XOA to handle this, though I might be mistaken; we just ended up re-installing our hosts, setting up a new pool, live-migrating our VMs over to it, and scrapping the old one.
VMware has some logic that tries to balance the load between hosts, and with DRS it will even rebalance your hosts automatically at runtime.
I'm pretty sure XOA has this logic as well, but XCP-ng Center definitely doesn't, so avoid it as much as possible.
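If you do end up placing VMs by hand, a small wrapper can at least pre-check the target before each move. A minimal sketch with xe, assuming intra-pool live migration (the script name and arguments are placeholders):

```
#!/bin/bash
# Hypothetical helper: migrate a VM only if the target host has enough free memory.
# Usage: ./safe-migrate.sh <vm name-label> <host name-label>
VM="$1"; HOST="$2"
vm_uuid=$(xe vm-list name-label="$VM" --minimal)
host_uuid=$(xe host-list name-label="$HOST" --minimal)
# Worst case the VM can claim: its static memory maximum (bytes)
needed=$(xe vm-param-get uuid="$vm_uuid" param-name=memory-static-max)
# What the host can actually hand out right now (bytes)
free=$(xe host-compute-free-memory uuid="$host_uuid")
if [ "$free" -gt "$needed" ]; then
  xe vm-migrate uuid="$vm_uuid" host-uuid="$host_uuid" live=true
else
  echo "Skipping: $HOST has $free bytes free, $VM needs $needed" >&2
fi
```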