Rolling Pool Update - not possible to resume a failed RPU
-
Rolling Pool Update succeeded on the master but then failed on the first slave due to yum caching errors, leaving the RPU task finished in a failed state. I've corrected the yum issue and manually updated and rebooted this slave.
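A minimal sketch of that manual fix, assuming a standard XCP-ng yum setup (the exact commands on the affected slave may differ):

    # on the affected slave, over SSH
    yum clean all    # drop the stale/corrupted yum cache that broke the RPU
    yum update       # apply the missing patches manually
    reboot           # reboot so the new kernel/Xen are actually in use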
Now, when I go to Pools > Patches, it shows there are no missing patches, so the RPU button is disabled. This is true for the master and one slave, but the remaining slaves are still missing their patches. It seems XO is only checking the pool's missing patches on the master and ignoring the other slaves.
I guess I have two options:
- wait for the release of a new XCP-ng patch, so that the RPU is enabled once again
- manually update each host
Shouldn't XO check for missing pool patches on all hosts, and allow an RPU to be performed if any host has missing patches?
Orchestra, commit 6ecab
Master, commit 6ecab
-
Hi,
In theory, the pool should be at the exact same patch level before an RPU, so you are in an unplanned/abnormal case.
Future updates will be easier to spot, and then we could probably make the RPU check more resilient (adding @stormi to the loop).
-
I had a similar experience with this latest round due to a variety of issues: 2 of my hosts lost their gateway and started using the DHCP server, 1 VM wasn't on shared storage and couldn't be migrated, and then of course the Windows driver 9.4.1 issue, where all of my hosts were reporting "this VM doesn't contain the feature" (to live migrate) even though they were updated.
What you need to do is below (a CLI sketch follows the list):
- Migrate VMs off of your Pool Master
- Patch your Pool Master
- Reboot the Pool Master
- Migrate VMs back onto the pool master, emptying a Slave
- Patch said slave server
- Reboot said slave
- Repeat from the Migration step for each remaining host.
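A rough CLI sketch of those per-host steps, assuming xe and SSH access (the <host-uuid> placeholders are hypothetical; take the real UUIDs from xe host-list):

    # on the pool master, for the host being patched:
    xe host-disable uuid=<host-uuid>     # stop new VMs from starting on it
    xe host-evacuate uuid=<host-uuid>    # live-migrate its VMs to other pool members
    # then, over SSH on that host:
    yum update                           # apply the missing patches
    reboot
    # once the host is back up:
    xe host-enable uuid=<host-uuid>      # allow VMs on it again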
-
@DustinB Actually, you don't have to migrate VMs if you're fine with them shutting down.
Just update the missing slaves with "yum update" and reboot them. VMs will be shut down and restarted (following their start settings under Advanced)...
-
@ecoutinho Performing manual updates is an option. The master probably checked the hosts for hotfix uniformity before the rolling pool update started, but since it failed to complete, you now have a discrepancy in the patch level for those two hosts. I had that happen once because of a root space error, which was a pain to deal with; though I cannot recall the specific fix, I think I had to migrate the VMs to the updated hosts and do a whole new install after redoing the partition table (it was that dreaded extra "Dell" partition that caused the issue at the time).
-
@manilx If the VMs can be shut down, yes; otherwise, migrate the VMs. Luckily, you can migrate from a host with a lower hotfix level to one with a higher level, but I do not believe the reverse is possible.
-
@tjkreidl True. VMs that can't be shut down you'll have to shut down manually.
I find that this takes a lot less time than migrating the VMs around. But if they need to stay running, then there's no other way ¯\_(ツ)_/¯
-
@olivierlambert Any backups that run during the RPU cause it to fail and not continue/restart. In my case, hourly continuous replication breaks the RPU.
-
@Andrew Right, backups should be shut off during the RPU process.
-
I thought we already disabled that, but I will ask internally.
-
In case of a manual RPU, maintenance mode is disabled after the update/toolstack restart, so you also need to disable the load balancer plugin.
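For example, something like this after the toolstack restart, assuming xe access (the host UUID placeholder is hypothetical); the load balancer plugin itself is toggled from XO's plugin settings:

    # if the toolstack restart dropped the host out of maintenance mode,
    # put it back before continuing with the remaining hosts:
    xe host-disable uuid=<host-uuid>
    xe host-evacuate uuid=<host-uuid>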
-
I think the load balancer is already disabled in the RPU
-
@olivierlambert During the RPU, yes. I mean a manual update in case of failure.