Neal

@BenjiReis HA is automatically disabled by the RPU when it starts, then re-enabled at the end; I can see the task for that in the task log. We do have about 5 VMs that do not have HA enabled, but they are very small (~4GB each), so they should not make any difference regardless of which hosts they were on.

Also, all VMs successfully evacuate from each host for the updates; it's only when VMs are migrated back after all hosts are upgraded that I see a problem.

I've raised a support ticket. If anything relevant comes out of it, I'll try to report back here for future readers.

Thanks for the suggestions,
Neal.

Neal

@olivierlambert Any further thoughts on this? Given the order the rolling pool upgrade seemed to use for migrating VMs back to their original hosts, it looks to me like it would fail any time the servers in a pool were over 50% committed on RAM. Previously when running the RPU we would have been under 50% committed, which may be why we have not seen this before.

I do not think there is anything special in our setup that would impact this, but obviously we are hitting some corner case that most do not. Would it be worth raising as a support ticket for XOA?

Cheers,
Neal.

Neal

Updating our second pool had the same issue. This time I stayed up to 1am to watch it. VMs are migrated "back" to their original host in the wrong order, causing some hosts to fill and therefore VM migrations to fail. Specifically, we have 5 hosts - xcp01, xcp02, xcp03, xcp04 and xcp05 - xcp02 is the pool master.

Hitting the RPU button drained each host, updated, rebooted and then repeated.

Update order and where VMs were drained to:

02 -> all
01 -> 02
03 -> 01
05 -> 03
04 -> 05

"Move back" order (should be the reverse of the update order):

05 -> 04
02 -> 01 (xcp01 full!)
01 -> 03 (xcp03 full!)
03 -> 05
then multiple hosts to 02 to finish up.

We are running our hosts at about 60% RAM used, and all our VMs have the same min and max set for dynamic memory, so they cannot shrink to make space. While one host is drained during the rolling upgrade and the move back, the remaining hosts are closer to 75% used. All our hosts have identical hardware, and the pool master was xcp02, which explains why that was the first one to start and the last to finish. We use a SAN, so only memory is migrated; disk space is not a factor.
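To illustrate why the order matters, here is a rough Python sketch of the sequence above. It is a toy model, not XOA's actual logic: it assumes 380GB hosts at 60% (228GB) each, treats each host's VMs as one indivisible block, and assumes xcp02's VMs were split evenly across the other four hosts. All function names are made up for the example.

```python
# Toy model only: the numbers and the "one block of RAM per original host"
# simplification are assumptions, not how XOA or XAPI place VMs.
CAPACITY_GB = 380  # RAM per host
LOAD_GB = 228      # ~60% of 380GB, normal per-host usage
HOSTS = ["01", "02", "03", "04", "05"]


def used(placement, host):
    """Total GB of VM memory currently sitting on `host`."""
    return sum(placement[host].values())


def drain(placement, src, targets):
    """Evacuate `src`, splitting each block evenly across `targets`."""
    for origin, gb in placement[src].items():
        for target in targets:
            share = gb / len(targets)
            placement[target][origin] = placement[target].get(origin, 0) + share
    placement[src] = {}


def state_after_updates():
    """Replay the drain order observed during the update phase."""
    # placement[current_host][original_host] = GB of VM memory
    placement = {h: {h: LOAD_GB} for h in HOSTS}
    drain(placement, "02", ["01", "03", "04", "05"])
    drain(placement, "01", ["02"])
    drain(placement, "03", ["01"])
    drain(placement, "05", ["03"])
    drain(placement, "04", ["05"])
    return placement


def move_back(placement, steps):
    """Try each (current_host, original_host) move; return those that do not fit."""
    failed = []
    for current, origin in steps:
        gb = placement[current].get(origin, 0)
        if used(placement, origin) + gb > CAPACITY_GB:
            failed.append((current, origin))
            continue
        placement[origin][origin] = placement[origin].get(origin, 0) + gb
        placement[current].pop(origin, None)
    return failed


observed_order = [("05", "04"), ("02", "01"), ("01", "03"), ("03", "05")]
reverse_order = [("05", "04"), ("03", "05"), ("01", "03"), ("02", "01")]

observed_failures = move_back(state_after_updates(), observed_order)
reverse_failures = move_back(state_after_updates(), reverse_order)
print("observed order failures:", observed_failures)
print("reverse order failures: ", reverse_failures)
```

In this model the observed order fails on the moves to xcp01 and xcp03, because each still holds another host's VMs (about 75% used) when its own 60% tries to come back. The reverse of the update order always moves VMs onto a host that has just been emptied, so nothing fails. In reality some individual VMs would still fit, so the failures would be partial rather than all-or-nothing.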

Can you confirm the logic that XOA uses for deciding the order of migrations?

Thanks,
Neal.

Neal

Hi,

Unfortunately not - I checked before starting for any halted VMs which were set to auto poweron with this filter in XOA.

auto_poweron? power_state:halted

I have another near-identical setup (identical hardware, almost identical VM numbers, sizes etc.) on which I will be running an RPU next week. I'll try the updates there and see if that pool has the same issues.

Thanks for confirming I'm not missing anything obvious.

Cheers,
Neal.

Neal

Hi @olivierlambert,

Thanks for replying so quickly.

No, we are using XOA, enterprise license, fully updated to version 5.102.1. None of our VMs use dynamic memory (I just double checked, and for all of them memory-dynamic-max = memory-dynamic-min), as I've been burned by that and migrations in the past.

Regards,
Neal.

Neal

Hi,

I have a 5 host XCP 8.2 setup with 380GB per host, SAN, HA etc and XOA. Last night I performed a rolling pool update and it successfully worked through evacuating, updating and rebooting each host before starting to migrate VMs back to their original hosts. In the middle of migrating VMs back to their original host, multiple VMs failed with "not enough memory" errors. When I checked in the morning I had one host with only a few GB free RAM and other hosts with ~200GB free - very unbalanced and definitely not what I was expecting.

I've checked the forum and not found any other RPU issues reported with this stage of the process. If I've missed this, please let me know.

With 1 host completely evacuated the remaining hosts were at ~85%, so there is plenty of space to shuffle VMs about, but I guess the RPU tried to move some VMs before enough had been shifted off the target to make room for them?

We manually distribute VMs for high availability and load balancing, so would ideally like them to return to their original locations automatically when done.

How can I ensure that the final "migrate VMs back" step completes successfully in the future?

The error was:

                    "message": "HOST_NOT_ENOUGH_FREE_MEMORY(34642853888, 3430486016)",
                    "name": "XapiError",
                    "stack": "XapiError: HOST_NOT_ENOUGH_FREE_MEMORY(34642853888, 3430486016)\n    at Function.wrap (file:///usr/local/lib/node_modules/xo-server/node_modules/xen-api/_XapiError.mjs:16:12)\n    at default (file:///usr/local/lib/node_modules/xo-server/node_modules/xen-api/_getTaskResult.mjs:13:29)\n    at Xapi._addRecordToCache (file:///usr/local/lib/node_modules/xo-server/node_modules/xen-api/index.mjs:1068:24)\n    at file:///usr/local/lib/node_modules/xo-server/node_modules/xen-api/index.mjs:1102:14\n    at Array.forEach (<anonymous>)\n    at Xapi._processEvents (file:///usr/local/lib/node_modules/xo-server/node_modules/xen-api/index.mjs:1092:12)\n    at Xapi._watchEvents (file:///usr/local/lib/node_modules/xo-server/node_modules/xen-api/index.mjs:1265:14)"
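For anyone reading the numbers in that error, here is a small Python sketch that converts them to GiB. It assumes the two values are bytes, in the order (memory needed, memory available); that ordering is my reading of the error, and the helper name is made up for the example.

```python
import re

GIB = 1024 ** 3
ERROR = "HOST_NOT_ENOUGH_FREE_MEMORY(34642853888, 3430486016)"


def parse_memory_error(message):
    """Return (needed_bytes, available_bytes) from the error string.

    Assumption: the first value is the memory needed and the second is
    the memory available on the target host.
    """
    match = re.fullmatch(r"HOST_NOT_ENOUGH_FREE_MEMORY\((\d+),\s*(\d+)\)", message)
    if match is None:
        raise ValueError(f"unexpected error format: {message!r}")
    needed, available = (int(group) for group in match.groups())
    return needed, available


needed, available = parse_memory_error(ERROR)
print(f"needed:    {needed / GIB:.2f} GiB")
print(f"available: {available / GIB:.2f} GiB")
```

Under that assumption the migration needed roughly 32 GiB on a host with only about 3 GiB free, which matches the host I found with only a few GB of free RAM in the morning.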

Thanks in advance for any pointers,
Neal.

Neal

Hi,

I have an instance of XOA managing an XCP-ng pool at a datacentre, and another XCP-ng pool in the office for dev/test servers. Both the XCP-ng hosts and the XOA appliance have private IP addresses, and there are NAT rules and IP allowlists for connections between sites.

Ideally I would like to manage the XCP-ng pool at the office via the XOA at the datacentre, but I'm not sure how XOA will deal with connecting to a NATed address if, for example, the master host changes, since the XCP-ng pool has no way to tell XOA the new master's IP.

Is the best way to approach this to NAT a public IP/port at the office to whichever XCP-ng host is the master, and change it manually as/when the pool master changes (hopefully infrequently)?

Thanks,
Neal.

Neal

Thanks @olivierlambert.

It may be worth updating the documentation and the welcome email when starting a trial then, as the welcome email "10 tips to start with Xen Orchestra" also says to check the guidance column to see if a host needs to be restarted for each update (tip #8).

Cheers,
Neal.

Neal

Hi,

I'm currently in a trial of XO and trying to understand the update processes for the XCP-ng host servers. Any pointers appreciated if I'm just not looking at the right docs.

The documentation has a screenshot showing a guidance column indicating if a restart is required, however when I view updates either via the pool updates page or directly on a host in Xen Orchestra I have a different set of columns (eg release number instead of release date, Size of update instead of Guidance). How can I easily determine if any of the updates require a host restart? One of the updates I'm offered is the kernel which I would expect to require a restart, another is "Xen Hypervisor Domain 0 libraries" which mentions a restart in its changelog.

The same page also states that "All the hosts in a pool must run the same XCP-ng version" which makes sense. How close does it need to be? eg is 8.2 OK regardless of patches, or do patches for the Xen hypervisor also need to match? If so is there an easy way to get a host to match the updates deployed to a pool (eg new server to add to pool but don't want to do the update/reboot dance on all existing servers right now...).

Thanks in advance,
Neal.

Latest posts made by Neal