Live migrate of Rocky Linux 8.8 VM crashes/reboots VM

Weppel

I've updated a few VMs from Rocky Linux 8.7 to 8.8. Live migrating these 8.8 VMs now reboots the VM instead of performing a regular live migration. This issue doesn't happen with Rocky Linux 8.7.

I've tried updating the xe-tools to the latest version available on GitHub (7.20.2-1), but this didn't solve the issue.

Is this a known issue already? Any idea how to debug this further?

olivierlambert

Hi,

We managed to reproduce the issue. We are investigating, and we suspect a bug in 8.8 or in one of its components.

Weppel

@olivierlambert Thank you very much for the quick follow-up.

I've done some testing with a colleague and it looks to be kernel related. The stock Rocky Linux 8.8 kernel (4.18.0-477.13.1.el8_8.x86_64) causes the reboot to happen. Upgrading the kernel to kernel-lt (5.4.245-1.1.el8.elrepo.x86_64) allows the VM to be live-migrated again without reboot/crash.

olivierlambert

Thanks for the investigation. I think that might be worth an upstream report to Rocky, right?

bogikornel

I had the same problem with Oracle Linux 8 on kernel versions 4.18.0-477.10 and 4.18.0-477.13.
The previous kernel series and the UEK kernel don't show the problem.

This is what I saw in the log:

Jun 8 09:40:31 gpool05-nodeA xenopsd-xc: [error||4030359 ||backtrace] Connection to VM console R:a08260fac6a2 failed, exception Xenops_interface.Xenopsd_error([S(Does_not_exist);[S(VM);S(87b062dd-b53f-04d7-c49d-32deec6716d6/config)]])
Jun 8 09:40:31 gpool05-nodeA xenopsd-xc: [error||4030359 ||backtrace] Raised Xenops_interface.Xenopsd_error([S(Does_not_exist);[S(VM);S(87b062dd-b53f-04d7-c49d-32deec6716d6/config)]])
Jun 8 09:40:31 gpool05-nodeA xenopsd-xc: [error||4030359 ||xenops_interface] Xenops_interface.Xenopsd_error([S(Does_not_exist);[S(VM);S(87b062dd-b53f-04d7-c49d-32deec6716d6/config)]]) (File "xen/xenops_interface.ml", line 165, characters 51-58)
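
For anyone who wants to check their own host logs for the same symptom, here is a minimal sketch that pulls the error kind and VM UUID out of lines like the ones above. The regular expression is inferred only from these three log lines, not from any xenopsd documentation, and the function name `parse_xenopsd_error` is hypothetical.

```python
import re

# Assumption: the error payload always looks like the excerpt above, i.e.
# Xenopsd_error([S(<kind>);[S(<object>);S(<uuid>/<suffix>)]])
ERROR_RE = re.compile(
    r"Xenopsd_error\(\[S\((?P<kind>\w+)\);"
    r"\[S\((?P<obj>\w+)\);S\((?P<ref>[^)]+)\)\]\]\)"
)


def parse_xenopsd_error(line):
    """Return (kind, object, uuid) for a matching log line, else None."""
    match = ERROR_RE.search(line)
    if match is None:
        return None
    # The reference is "<uuid>/config" in the excerpt; keep only the UUID part.
    uuid = match.group("ref").split("/", 1)[0]
    return match.group("kind"), match.group("obj"), uuid
```

Feeding it each line of the host log and collecting the non-`None` results gives the list of VM UUIDs that hit the `Does_not_exist` error.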

Weppel

@olivierlambert

I've created a bug report at Rocky Linux: https://bugs.rockylinux.org/view.php?id=3565

Feel free to add to this if I missed any relevant information.

stormi

@Weppel It looks like you wrote KVM instead of Xen in the title.

Weppel

@stormi Thanks for noticing, my bad

stormi

Adding @bleader to the discussion. He's trying to debug the issue.

It looks like a simple VM suspend + resume also crashes the VM. Do you confirm, @Weppel?

Weppel

@stormi Confirmed

bleader

So, after our investigations, we were able to pinpoint the issue.

It seems to happen on most RHEL derivative distributions when upgrading from 8.7 to 8.8. As suggested, the bug is in the kernel.

Starting with 4.18.0-466.el8, the patch "x86/idt: Annotate alloc_intr_gate() with __init" is integrated, and it creates the issue. It is missing "x86/xen: Split HVM vector callback setup and interrupt gate allocation", which should have been integrated as well.

Upgrading to 8.8 moves you to the 4.18.0-477.* kernels, which also have this issue; that is what you reported.

We found that the 4.18.0-488 kernel from CentOS 8 Stream integrates the missing patch and does indeed work when installed manually.

Your report helped us identify and reproduce the issue. That allowed us to provide a call stack to the Xen devs. Roger Pau Monné then quickly found that this patch was missing, and we were able to find which versions of the kernel RPMs integrated the faulty patch and when the fix was integrated.

This means the issue has been identified on the Red Hat side, and it is now a matter of having an updated kernel in derivative distributions like Rocky and Alma.
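
To check a guest quickly against the versions above, here is a rough sketch in Python. The range bounds come only from the kernel builds named in this thread (466 as the first bad build, 488 as the first known good one); the function name `maybe_affected` is hypothetical, and the check cannot see fixes backported into a build inside that range, so treat a `True` result as "worth testing", not as proof.

```python
import platform
import re

# Assumption: bounds taken only from the builds mentioned in this thread.
FIRST_BAD_BUILD = 466   # 4.18.0-466.el8 picked up the __init annotation patch
FIRST_GOOD_BUILD = 488  # 4.18.0-488 (CentOS 8 Stream) carries the missing patch

# Matches the "<base>-<build>" prefix of a release such as
# 4.18.0-477.13.1.el8_8.x86_64
RELEASE_RE = re.compile(r"^(\d+\.\d+\.\d+)-(\d+)")


def maybe_affected(release):
    """Return True if the release string falls in the reported bad range."""
    match = RELEASE_RE.match(release)
    if match is None:
        return False
    base, build = match.group(1), int(match.group(2))
    return base == "4.18.0" and FIRST_BAD_BUILD <= build < FIRST_GOOD_BUILD


if __name__ == "__main__":
    # platform.release() returns the same string as `uname -r` on Linux.
    running = platform.release()
    print(running, "possibly affected:", maybe_affected(running))
```

Run inside the guest, it prints the running kernel release and whether it falls in the range. Kernels outside the 4.18.0 series, such as ELRepo's kernel-lt, are reported as not affected.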

Weppel

@bleader Thank you very much for the quick discovery of this, impressive work! I'm glad I could help!

Weppel

FYI, this is not fixed yet in the latest EL kernel, 4.18.0-477.15.1.el8_8.x86_64.

olivierlambert

Thanks for keeping us posted…

Hopefully things will be fixed at some point. Maybe Red Hat is focused on doing other things right now…

bleader

Makes sense. I was hoping they would go with a -488 on update... Sorry to hear it's not the case.

qnx

@bleader Unfortunately, I think we're stuck on -477 until EL 8.9 comes out.

It's frustrating that it's taking them so long to fix this, especially when it seems like the bug was caused by a human error to begin with.

bleader

Not entirely sure, but that may be related to what's happening on the Red Hat side of things.

KFC-Netearth

Just installed the latest kernel from Rocky, and a live migrate seems to work on the 2 dev servers I have tried so far:
4.18.0-477.21.1.el8_8.x86_64 #1 SMP Tue Aug 8 21:30:09 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

bleader

@KFC-Netearth Surprising, as it is still -477, but maybe the patch was backported. Thanks for telling us!

Weppel

@KFC-Netearth

The Rocky Linux bugtracker indeed mentions it's mostly fixed, but there are still some kernel errors present: https://bugs.rockylinux.org/view.php?id=3565#c4293