Update strategy for a consistent XCP-ng pool
-
Hello XCP-ng community,
We are running XCP-ng on 9 hosts in 3 pools. To maintain continuous operation of the cluster, we perform rolling updates for security and other fixes, one host at a time, in a weekly maintenance window. The whole process typically takes about 3-4 windows, i.e. spans 2-3 weeks. If a new update is published during that time, version skew can occur between components installed on different hosts, and such skew has already disrupted cluster operation for us: specifically, VM backup via XenOrchestra stopped working. (And yes, we did follow the documentation and upgraded the pool master first.)
Is there a good practice for this kind of scenario, to make sure that the update cycle results in consistent versions installed across the cluster? I can imagine recording the package versions installed by `yum upgrade` on the first host and then scripting the update on the subsequent hosts to use the same package versions (roughly as sketched below), but maybe there is a better way?
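Something along these lines for the recording step (just a sketch; the query format and file path are arbitrary):

```sh
# On the first host, right after "yum upgrade" completes:
# record the full package set in name-version-release.arch form
rpm -qa --qf '%{NAME}-%{VERSION}-%{RELEASE}.%{ARCH}\n' | sort > /root/host1.pkglist
```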
Thank you.
-
Hi,
Thanks for your feedback. Can you explain why you are not updating all the hosts in the same scheduled window? A pool isn't meant to run with different update levels; that is known to potentially cause issues (as you discovered).
-
Hi @olivierlambert - the reason is that we typically bundle the physical host reboot with other updates (e.g. host firmware, software running in the host's VMs). Also, the software stack running in the VMs on the hosts often requires special care when shutting down (for example Kubernetes node VMs running production workloads where some components are a bit fragile, or the Ceph filesystem, which is HA but may take a long time to recover after a node is taken down, etc.). In many cases, we also cannot use VM migration, especially for VMs using large local storage. So far, the procedure has been to schedule a 2-hour maintenance window every week, which typically allows us to update 2-3 hosts. I have read this post https://xcp-ng.org/forum/topic/7200/patching-to-a-specific-version/4 , but digging into the behavior of `yum update`, it looks like it cannot update to a specific version (unlike `yum install`).
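For illustration, this is the difference I mean (the package name and version string are made up, just to show the syntax):

```sh
# "yum update" only takes a package name and moves it to the newest version
# available in the enabled repos; you cannot name a target version:
yum update some-package

# "yum install" accepts an explicit name-version-release, which effectively
# pins the version you end up with:
yum install some-package-1.2.3-4.xcpng8.2
```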
-
Adding @stormi in the loop so we can think about something.
My first approach would be to reduce the pool size then: with a 3-host pool, the whole maintenance window is enough to get all the nodes fully up to date and consistent.
-
Of course, we can also try negotiating an expansion of the window with the business / management, so that the entire pool update fits in it. But that may not solve the whole problem, as various other preparation steps need to happen between the maintenance windows. Also, splitting a pool is inconvenient: it complicates management and reduces the options to move VMs around (we make limited, but essential, use of that). I am asking here hoping we can find a technical solution within the existing XCP-ng features.
-
You could host a local mirror (look up `reposync`), update from it, and then stop syncing it when you start your maintenance operations. However, I must stress that it's not good for a pool to be in a heterogeneous state for so long.
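A rough sketch of what that could look like, on a CentOS 7-based machine (or an XCP-ng host) that can install yum-utils and createrepo, using /mnt/repo-mirror as an example shared path (adjust repo IDs and paths to your setup):

```sh
# Mirror the XCP-ng repos into a local directory (e.g. an NFS share)
yum install -y yum-utils createrepo
reposync -p /mnt/repo-mirror -r xcp-ng-base -r xcp-ng-updates --newest-only

# Generate the repository metadata so hosts can consume the mirror
createrepo /mnt/repo-mirror/xcp-ng-base
createrepo /mnt/repo-mirror/xcp-ng-updates

# On each host, point the existing repo definitions at the mirror
# (file:// or http:// URL, depending on how you expose the share),
# then stop running reposync once your maintenance cycle starts.
```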
-
@stormi and @olivierlambert, thank you for your advice.
I did some exploration on my side, too, and I think we have two workable strategies:
- Use a `reposync` mirror for the `xcp-ng-base` and `xcp-ng-updates` repos on a shared filesystem visible to all hosts. Sync it, update the master, stop syncing, and gradually update the remaining hosts.
- Use a variation of the `rpm -qa`-based approach discussed earlier: update the master, collect the package state with `rpm -qa > reference.pkglist`, then for each of the remaining hosts run `yum upgrade-to $(cat reference.pkglist)` and check with `yum check-update` or `yum --assumeno upgrade` for any irregularities, e.g. due to packages installed on some hosts only, and resolve these manually (see the sketch below).
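Roughly what I have in mind for the second strategy (the paths and the dry-run ordering are just my assumptions, not a tested procedure):

```sh
# On the freshly updated pool master: record the exact package set
rpm -qa > /root/reference.pkglist

# Copy reference.pkglist to each remaining host, then on that host:
yum --assumeno upgrade                        # dry run: see what a plain upgrade would pull in
yum upgrade-to $(cat /root/reference.pkglist) # upgrade to the recorded versions

# Finally, look for leftovers (e.g. packages present on this host only)
yum check-update
```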
That's a good point about a pool being in a heterogeneous state for too long - we will definitely reconsider our maintenance procedures.
We will try this approach in our upcoming maintenance window and report back here on how we fared.
-
Thanks, and also thank you for your feedback; it's important for us to understand the pain points so we can improve the product.
Keep us posted!