Solving live migration crashes in RHEL 8.8 and derivatives

Devblog Jun 9, 2023

Welcome to our behind-the-scenes tech adventure! We're tackling a tricky crash issue that happened when live migrating Rocky Linux 8.8 (and other RHEL 8.8 derivatives) between two hosts. In less than two days, thanks to some great teamwork and the power of open-source, we found the problem and zapped it. Buckle up, and let's dive into this quick tale of tech troubleshooting and great teamwork.

⚠️
Until the issue is fixed in RHEL 8.8 and its derivatives, do not suspend or live migrate a VM running one of these.
The issue has already been fixed in CentOS 8 Stream, so we can expect the fix to be included in the next kernel update for RHEL 8.8.
See also "What to do, as a user", below.

💥 The problem

As highlighted by Weppel in a forum post, beginning with Rocky Linux 8.8, live migration invariably crashed the virtual machine. The problem did not occur before the 8.8 update.

Our initial hypothesis was a kernel issue. To validate it, we ran a series of tests on both Rocky Linux and AlmaLinux with various kernel versions. The crash reproduced consistently on both distributions, and could also be triggered by a plain suspend/resume, which is the mechanism live migration relies on.
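Since a plain suspend/resume is enough to trigger the crash, a reproduction attempt on a test VM could look like the sketch below. It assumes an XCP-ng or XenServer host where the `xe` CLI is available. The helper names and the dry-run default are our own illustration, not the exact procedure used during the investigation. Only try this on a disposable VM, as an affected guest is expected to crash.

```python
import subprocess


def suspend_resume_commands(vm_uuid):
    """Build the xe commands for one suspend/resume cycle of a VM."""
    return [
        ["xe", "vm-suspend", f"uuid={vm_uuid}"],
        ["xe", "vm-resume", f"uuid={vm_uuid}"],
    ]


def run_cycle(vm_uuid, dry_run=True):
    """Return the commands as strings; execute them only when dry_run is False."""
    rendered = []
    for cmd in suspend_resume_commands(vm_uuid):
        rendered.append(" ".join(cmd))
        if not dry_run:
            # check=True raises CalledProcessError if xe reports a failure
            subprocess.run(cmd, check=True)
    return rendered
```

With the default `dry_run=True`, `run_cycle` only returns the two command lines so they can be reviewed before anything touches the VM.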

🎯 Tracking the root cause

During our tests, we obtained a call stack and shared it with Xen developers. A quick analysis by Roger Pau Monné revealed that a patch was missing. Thanks to the detailed RPM changelogs, we were able to identify the version in which the issue first surfaced, and the version in which the missing patch was integrated into the CentOS 8 Stream repository.

Starting with 4.18.0-466.el8, the patch "x86/idt: Annotate alloc_intr_gate() with __init" was integrated into the kernel package. It depends on a prerequisite patch, "x86/xen: Split HVM vector callback setup and interrupt gate allocation", which was not integrated at the same time.
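To check whether an installed kernel package mentions the prerequisite patch, one can search its RPM changelog for the patch subject. The following is a minimal sketch, assuming the changelog entries contain the upstream subject line and that the package is named `kernel-core` (both are assumptions to verify on your system). The function names are ours.

```python
import subprocess

# Subject of the prerequisite patch, as quoted in this post
PREREQ_SUBJECT = (
    "x86/xen: Split HVM vector callback setup and interrupt gate allocation"
)


def changelog_mentions(changelog, subject):
    """True if any changelog line contains the patch subject (case-insensitive)."""
    needle = subject.lower()
    return any(needle in line.lower() for line in changelog.splitlines())


def installed_changelog(package="kernel-core"):
    """Return the changelog of an installed package via `rpm -q --changelog`."""
    result = subprocess.run(
        ["rpm", "-q", "--changelog", package],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

On a RHEL-like guest, `changelog_mentions(installed_changelog(), PREREQ_SUBJECT)` would then indicate whether the patch is listed. If several kernels are installed, `rpm` prints the changelog of each one, so a match only shows that at least one installed kernel has it.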

Updating from 8.7 to 8.8 switches your installation to kernel 4.18.0-477, which still lacks the prerequisite patch.

According to the RPM changelog, the first version to integrate it is 4.18.0-488, found in CentOS 8 Stream. We installed it manually and confirmed that the issue is gone.
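The version numbers above translate into a simple range check on the kernel release string. This is a rough sketch based only on the build numbers quoted in this post (466 and 488). The function names are ours, and the check cannot tell whether a later update on the 477 line carries a backported fix, so treat a positive result as "possibly affected" rather than a verdict.

```python
import re

FIRST_AFFECTED = 466  # first build with the __init annotation, per this post
FIRST_FIXED = 488     # first CentOS 8 Stream build with the prerequisite patch


def el8_build(release):
    """Extract the build number from a release like '4.18.0-477.10.1.el8_8.x86_64'."""
    match = re.match(r"4\.18\.0-(\d+)", release)
    return int(match.group(1)) if match else None


def possibly_affected(release):
    """True if the build number falls in the range described in this post."""
    build = el8_build(release)
    if build is None:
        # Not a 4.18.0 (el8) kernel: this check does not apply
        return False
    return FIRST_AFFECTED <= build < FIRST_FIXED
```

Inside a guest, passing the output of `platform.release()` (or `uname -r`) to `possibly_affected` gives a quick first indication before planning a migration.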

🛠️ What to do, as a user

If you are using RHEL 8 or a derivative and are currently on version 8.7 or lower, you might want to wait for the next 8.8 kernel update before updating.

If you are already running version 8.8, or need to update to 8.8 quickly (to apply security fixes, for example), then you have several options. None is perfect, as they all work around a bug whose proper fix will come through future official RHEL(-like) updates:

  • Wait for the next kernel update, and avoid any suspend operation on your VM. This also rules out any operation that suspends the VM behind the scenes, such as live migration or backups with RAM in Xen Orchestra.

OR

  • Temporarily switch to another kernel, such as kernel-lt. Make sure the chosen kernel is suitable for your needs.

OR

  • Download and install the latest kernel update from CentOS 8 Stream. It should be compatible with any RHEL-like distribution, but be aware that it has not yet received the same level of QA as a RHEL update: such packages first go into CentOS 8 Stream (with some level of testing) and only later may become official RHEL updates.

💡
Remember to exercise caution and thoroughly test any changes in a controlled environment before implementing them in a live setting.

🚀 Conclusion

Solving complex problems such as these truly underscores the power of community collaboration. With the collective effort of dedicated individuals from various backgrounds, we've managed to navigate through the complications of this live migration issue.

A big thank you to everyone who contributed to this solution. Your hard work and perseverance underscore the strength and power of our community. It's through our collaborative efforts that we're able to improve, adapt, and continue pushing the boundaries of what our systems can do. By sharing our insights, we hope to assist others who may encounter similar issues in their IT environments.

So, let's continue to learn, share, and grow together as a community. Here's to many more collaborative victories in the future!

David Morel

Hypervisor & Kernel Software Engineer at Vates and XCP-ng Security Coordinator. Open Source enthusiast, using IRC for everything. Raccoon lover.