Update strategy for a consistent XCP-ng pool
-
Hello XCP-ng community,
We are running XCP-ng on 9 hosts in 3 pools. To maintain continuous operation of the cluster, we perform rolling updates for security and other fixes, one host at a time, in a weekly maintenance window. The whole process typically takes about 3-4 windows, i.e. spans 2-3 weeks. If a new update is published during that time, version skew can occur between components installed on different hosts, and such skew has already disrupted cluster operation for us: specifically, VM backup via XenOrchestra stopped working. (And yes, we did follow the documentation and upgraded the pool master first.)
Is there a good practice for this kind of scenario, to make sure that the update cycle results in consistent versions installed across the cluster? I can imagine recording the package versions installed by `yum upgrade` on the first host and then scripting the update on the subsequent hosts to use the same package versions (roughly as sketched below), but maybe there is a better way?
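Something along these lines for the recording step (just a sketch; the query format and file path are arbitrary):

```sh
# On the first host, right after "yum upgrade" completes:
# record the full package set in name-version-release.arch form
rpm -qa --qf '%{NAME}-%{VERSION}-%{RELEASE}.%{ARCH}\n' | sort > /root/host1.pkglist
```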
Thank you.
-
Hi,
Thanks for your feedback. Can you explain why you are not updating all the hosts in the same scheduled window? A pool isn't meant to run with different update levels; that is known to potentially cause issues (as you discovered).
-
Hi @olivierlambert - the reason is that we typically bundle the physical host reboot with other updates (e.g. host firmware, software running in the host's VMs). Also, the software stack running in the VMs on the hosts often requires special care when shutting down (for example Kubernetes node VMs running production workloads where some components are a bit fragile, or the Ceph filesystem, which is HA but may take a long time to recover after a node is taken down, etc.). In many cases, we also cannot use VM migration, especially for VMs using large local storage. So far, the procedure has been to schedule a 2-hour maintenance window every week, which typically allows us to update 2-3 hosts. I have read this post https://xcp-ng.org/forum/topic/7200/patching-to-a-specific-version/4 , but digging into the behavior of `yum update`, it looks like it cannot update to a specific version (unlike `yum install`).
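For illustration, this is the difference I mean (the package name and version string are made up, just to show the syntax):

```sh
# "yum update" only takes a package name and moves it to the newest version
# available in the enabled repos; you cannot name a target version:
yum update some-package

# "yum install" accepts an explicit name-version-release, which effectively
# pins the version you end up with:
yum install some-package-1.2.3-4.xcpng8.2
```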
-
Adding @stormi in the loop so we can think about something.
My first approach would be to reduce the pool size then: with a 3-host pool, the whole maintenance window is enough to get all the nodes fully up to date and consistent.
-
Of course, we can also try negotiating an expansion of the window with the business / management, so that the entire pool update fits in it. But that may not solve the whole problem, as various other preparation steps need to happen between the maintenance windows. Also, splitting a pool is inconvenient: it complicates management and reduces the options to move VMs around (we make limited, but essential, use of that). I am asking here hoping we can find a technical solution within the existing XCP-ng features.
-
You could host a local mirror (look up `reposync`), update from it, and then stop syncing it when you start your maintenance operations. However, I must stress that it's not good for a pool to be in a heterogeneous state for so long.
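A rough sketch of what that could look like, on a CentOS 7-based machine (or an XCP-ng host) that can install yum-utils and createrepo, using /mnt/repo-mirror as an example shared path (adjust repo IDs and paths to your setup):

```sh
# Mirror the XCP-ng repos into a local directory (e.g. an NFS share)
yum install -y yum-utils createrepo
reposync -p /mnt/repo-mirror -r xcp-ng-base -r xcp-ng-updates --newest-only

# Generate the repository metadata so hosts can consume the mirror
createrepo /mnt/repo-mirror/xcp-ng-base
createrepo /mnt/repo-mirror/xcp-ng-updates

# On each host, point the existing repo definitions at the mirror
# (file:// or http:// URL, depending on how you expose the share),
# then stop running reposync once your maintenance cycle starts.
```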
-
@stormi and @olivierlambert, thank you for your advice.
I did some exploration on my side, too, and I think we have two workable strategies:
- Use a `reposync` mirror for the `xcp-ng-base` and `xcp-ng-updates` repos on a shared filesystem visible to all hosts. Sync it, update the master, stop syncing, and gradually update the remaining hosts.
- Use a variation of the `rpm -qa`-based approach discussed earlier: update the master, collect the package state with `rpm -qa > reference.pkglist`, then for each of the remaining hosts run `yum upgrade-to $(cat reference.pkglist)` and check with `yum check-update` or `yum --assumeno upgrade` for any irregularities, e.g. due to packages installed on some hosts only, and resolve these manually (see the sketch below).
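Roughly what I have in mind for the second strategy (the paths and the dry-run ordering are just my assumptions, not a tested procedure):

```sh
# On the freshly updated pool master: record the exact package set
rpm -qa > /root/reference.pkglist

# Copy reference.pkglist to each remaining host, then on that host:
yum --assumeno upgrade                        # dry run: see what a plain upgrade would pull in
yum upgrade-to $(cat /root/reference.pkglist) # upgrade to the recorded versions

# Finally, look for leftovers (e.g. packages present on this host only)
yum check-update
```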
That's a good point about a pool being in a heterogeneous state for too long - we will definitely reconsider our maintenance procedures.
We will try this approach in our upcoming maintenance window and report back here on how we fared.
-
Thanks, and also thank you for your feedback; it's important for us to understand the pain points so we can improve the product.
Keep us posted!