CPU pegged at 100% in several Rocky Linux 8 VMs without workload in guest

jgrafton

We recently encountered this issue during a migration from VMware.

Unfortunately, we've had to halt our migration until we can figure out what is happening to the VMs.

When the problem occurs, we see in the XOA interface (version 5.94.2) the guest VM CPU pegged at 100%.

The spike in CPU often happens after a migration to another host within a pool or to a different pool.

Sometimes the spike in CPU occurs randomly without an accompanying host to host migration.

With the pegged CPU, the guest VM is no longer accessible in any meaningful way.

All the services running in the VM go offline and the VM is no longer pingable.

Each of our pools uses lvmohba storage with several LUNs attached to each host in the pool.

We've seen the CPU spike occur on 5 VMs so far, all running Rocky 8.10 with the latest kernel (4.18.0-553.5.1.el8_10.x86_64).

We tested several older kernel revisions and encountered the same problem. (4.18.0-513.24.1.el8_9.x86_64)

It seems only the primary CPU (CPU0) is pegged at 100%.

On systems with more than a single CPU, we are able to SSH (or console) into the VM, but it runs extremely slowly. The guest is effectively unusable.

Running top on the VM shows no load from processes but CPU0 is at 100%. There is no appreciable I/O on the system.

Interestingly, on the XCP-ng host, the qemu process running the VM with the pegged CPU does not have a high load itself.

The pegged CPU appears to be contained entirely within the guest.

All of our XCP-ng hosts are running version 8.2.1.

All of the affected VMs are running version 8.2.0-2 of the management agent.

All the affected hosts were migrated from ESXi.

The affected VMs use a mix of UEFI and BIOS.

We've upgraded one of our problematic systems to Rocky 9 that has a 5.14 kernel to see if the newer kernel is affected.

We have roughly 100 VMs split across two pools.

Has anyone experienced a problem similar to this?

olivierlambert

Hi,

Can you check the VM template you are using? You should probably also ask Vates support via your subscription. That's a good reflex in this case, since community support can take longer.

jgrafton

@olivierlambert We have an existing ticket (7726289) and the first suggestion was to validate the template. I created a new VM with the correct Rocky 8 template and attached the existing disk to it. Unfortunately, the problem still occurred a couple of days later.

I want to make clear I have no problem with the support we've received. It's just that this is such an intermittent and difficult to diagnose problem I wanted to see if anyone in the community had run into it.

olivierlambert

Understood, and good to know

So with Rocky 9 (kernel 5.14), do you still have the issue? I wonder if it could be a PV driver bug in the old kernel. It might be interesting to check if you can identify a specific process taking all that CPU.

jgrafton

@olivierlambert That was my initial thought, PV driver in the older kernel. No process is using very much CPU in the guest though the total CPU is at 100% (when running top in the VM).

Haven't been able to get Rocky 9 to fail yet, but it can take a day or two.

olivierlambert

Keep us posted! I will try to start a Rocky 8 VM to see if it's doing this too. Anything in the xl dmesg?

jshiells

@jgrafton Check your VM's CPU steal time. My guess is that's where it's going.

Make sure VMware Tools is not running (or has been removed).

Give the VM a reboot. That should clear the steal time CPU usage, if that is the problem.

We have seen this issue when hot migrating VMs between pools (XCP-ng to XCP-ng or Xen to XCP-ng).
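
For anyone who wants to check steal time without relying on top, here is a rough Python sketch. It assumes the standard `/proc/stat` layout from proc(5), where steal is the eighth counter on each `cpu` line (top shows it as `st`). The function names and the 5-second sampling interval are my own choices for illustration, not anything from XCP-ng or the guest tools.

```python
import time

# Counter order on a "cpu" line in /proc/stat, as documented in proc(5).
FIELDS = ("user", "nice", "system", "idle", "iowait",
          "irq", "softirq", "steal", "guest", "guest_nice")


def parse_proc_stat(text):
    """Return {cpu_name: {field: ticks}} for every cpu line in /proc/stat text."""
    cpus = {}
    for line in text.splitlines():
        parts = line.split()
        if not parts or not parts[0].startswith("cpu"):
            continue  # skip intr, ctxt, btime, ...
        values = [int(v) for v in parts[1:1 + len(FIELDS)]]
        cpus[parts[0]] = dict(zip(FIELDS, values))
    return cpus


def steal_percent(before, after):
    """Percentage of elapsed ticks counted as steal between two snapshots."""
    result = {}
    for name, new in after.items():
        old = before.get(name)
        if old is None:
            continue
        # guest and guest_nice are already included in user and nice,
        # so only the first eight counters are summed.
        total = sum(new.get(k, 0) - old.get(k, 0) for k in FIELDS[:8])
        steal = new.get("steal", 0) - old.get("steal", 0)
        result[name] = 100.0 * steal / total if total > 0 else 0.0
    return result


def sample_steal(interval=5.0, path="/proc/stat"):
    """Take two snapshots `interval` seconds apart and return steal % per CPU."""
    with open(path) as f:
        first = parse_proc_stat(f.read())
    time.sleep(interval)
    with open(path) as f:
        second = parse_proc_stat(f.read())
    return steal_percent(first, second)
```

Calling `sample_steal()` inside the guest returns a dict such as `{"cpu": ..., "cpu0": ...}`. A value near 100 on `cpu0` while the other CPUs stay low would match the single pegged CPU described earlier in the thread, but that is only a hint and not a diagnosis.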

jgrafton

@olivierlambert Nothing out of the ordinary in xl dmesg that I can tell.

@jshiells I'm pretty sure the VMs have had VMware Tools removed since that's a part of our migration procedure, but I'll double check.

Annoyingly, we haven't been able to get a VM to fail all day.

jshiells

@jgrafton The steal time CPU usage "may not" have anything to do with VMware Tools.

I have seen this happen by just hot migrating older Linux systems from host to host inside the same pool, as well as hot migrating between two different pools. I have also seen the load balance plugin trigger this on old Linux versions when it moves a VM from host to host. I honestly don't think it has anything to do with XCP-ng, but more with how the Linux VM deals with the very short pause during migrations, which causes 100% CPU steal time to kick in.

jgrafton

@jshiells I was wrong, open-vm-tools is installed on a lot of the systems we migrated. I just assumed it wasn't instead of checking. We'll remove it from all the systems, test further, and report back. Thank you for the insight!
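
A quick way to check rather than assume, sketched in Python: `rpm -q <package>` exits with status 0 when the package is installed and non-zero otherwise. The helper name `rpm_installed` and the injectable `run` parameter are hypothetical and exist only so the function can be exercised without rpm present.

```python
import subprocess


def rpm_installed(package, run=subprocess.run):
    """Return True if `rpm -q <package>` exits with status 0.

    `run` defaults to subprocess.run and is only a parameter so that a
    stand-in can be supplied on machines without rpm.
    """
    result = run(["rpm", "-q", package],
                 stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    return result.returncode == 0
```

On a Rocky guest, `rpm_installed("open-vm-tools")` would report whether the package is still there. If the `rpm` binary itself is missing, `subprocess.run` raises `FileNotFoundError` instead of returning.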

olivierlambert

Ah great catch and suggestion @jshiells ! It's not impossible previous VM tools are causing issues

jgrafton

@olivierlambert So it turns out this issue wasn't caused by open-vm-tools.

Even after uninstalling it from all our hosts in XCP, we still had several hosts climb to 100% CPU shortly after a migration.

While combing through sar logs and several crash dumps, I found that the system load would rapidly increase in a short amount of time until the host was unreachable.

I gathered from the crash dumps that the high load appeared to be caused by threads in spinlocks waiting on storage.

This led me to believe the older kernel (4.18) was having difficulty recovering from the migration process.

The simple fix was to upgrade the OS to Rocky 9 on some hosts and upgrade the kernel on ones not ready to have the OS upgraded.

We've been running for a couple weeks without an issue.
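
For anyone trying to catch this earlier, here is a rough Python sketch for spotting tasks stuck on I/O from inside the guest. The Linux load average counts tasks in uninterruptible sleep (state `D`) as well as runnable ones, so a climbing load with little CPU use per process often shows up as a growing list of `D` tasks. This is an assumption about what to look for, not what the crash dumps proved: threads spinning inside the kernel would show as running, not `D`, and the guest has to be responsive enough to read `/proc` at all. The function names are hypothetical.

```python
import os


def parse_stat_line(line):
    """Return (pid, comm, state) from one /proc/<pid>/stat line."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so locate the first '(' and the last ')' instead of splitting on spaces.
    open_idx = line.index("(")
    close_idx = line.rindex(")")
    pid = int(line[:open_idx])
    comm = line[open_idx + 1:close_idx]
    state = line[close_idx + 1:].split()[0]
    return pid, comm, state


def uninterruptible_tasks(proc_root="/proc"):
    """List (tid, comm) for every thread currently in state D."""
    blocked = []
    for pid in filter(str.isdigit, os.listdir(proc_root)):
        task_dir = os.path.join(proc_root, pid, "task")
        try:
            tids = os.listdir(task_dir)
        except OSError:
            continue  # process exited while we were scanning
        for tid in tids:
            try:
                with open(os.path.join(task_dir, tid, "stat")) as f:
                    tid_num, comm, state = parse_stat_line(f.read())
            except (OSError, ValueError):
                continue  # thread vanished or line was unreadable
            if state == "D":
                blocked.append((tid_num, comm))
    return blocked
```

Running `uninterruptible_tasks()` periodically (from cron, for example) and logging the result next to the sar data would show whether the blocked-task count rises before a VM becomes unreachable.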

jshiells

@jgrafton It's a good theory. Just for awareness, I have seen this problem on:

Debian 7, 8, 9
Ubuntu 18
CentOS 7, 8
Alma 8

So it could be an XCP-ng and kernel 4 issue, but it is definitely not limited to CentOS/Rocky/Alma (which are essentially the same).

Oddly enough, I have not seen this issue on CloudLinux 7 or 8.

aflons

We experience the exact same issue with CloudLinux OS 8, seemingly random after live migration. This has been ongoing for years. Seems to happen far less now with shared storage.

My theory is that somehow the kernel and/or PVE module doesn't handle the freeze during live migration: the longer the freeze, the higher the risk of this happening.

VMs start to crash a random amount of time after live migration, never immediately. It could be hours, or even days, which makes it hard to diagnose. No crash dump, nothing, just 100% CPU on all cores and a frozen console.

One consistent thing we see, almost every time, is that top and other tools stop working. They are frozen in a state where no CPU load etc. is reported, but there is load on the server.

We've been going back and forth with CloudLinux support, and they made some changes to the tuned profile regarding disk buffers/cache that made things a bit more stable, but the problem is not 100% gone.

We don't see the same error in AlmaLinux 9 and CloudLinux OS 9.

More busy VM = more chance of happening. Uptime may be a factor, too.

laszlobortel

I am afraid that we have the same problem: ~90 Rocky8 VMs migrated from VMware, pegging one CPU very often. We have suspended further migration to XCP-ng due to this issue.
Has the root cause been identified since 2024? Is there a solution or workaround (apart from upgrading to Rocky9)?

jgrafton

@laszlobortel We never reached a definitive root cause and did end up fully migrating to XCP-NG from VMware.

We still have roughly 100 VMs running Rocky 8.10. The 4.18.0-553.94.1 kernels and above don't seem to have the same CPU issues but I'm not sure if that's because a kernel bug was mitigated or because we upgraded our backend storage to all flash arrays (Pure Storage C50's).

The CPU still gets pegged on a Rocky 8 VM every once in a blue moon but not often enough to warrant more time being spent tracking it down.
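
To find which Rocky 8 guests are still below that kernel, something like the following Python sketch could be used. It compares only the numeric version and build parts of the release string (it is not a full RPM version comparison), and the 4.18.0-553.94.1 threshold is simply the version mentioned above. Whether that version marks an actual fix is unknown. The helper names are hypothetical.

```python
import platform
import re

# Threshold taken from the post above; it is an observation, not a confirmed fix.
THRESHOLD = "4.18.0-553.94.1"


def kernel_key(release):
    """Turn '4.18.0-553.5.1.el8_10.x86_64' into ((4, 18, 0), (553, 5, 1))."""
    match = re.match(r"(\d+(?:\.\d+)*)-(\d+(?:\.\d+)*)", release)
    if match is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    version, build = match.groups()
    return (tuple(int(p) for p in version.split(".")),
            tuple(int(p) for p in build.split(".")))


def at_least(release, threshold=THRESHOLD):
    """True if `release` is numerically at or above `threshold`."""
    return kernel_key(release) >= kernel_key(threshold)


def running_kernel_ok():
    """Apply the check to the kernel this script is running on."""
    return at_least(platform.release())
```

Comparing integer tuples matters here: as plain strings, "553.5.1" would sort after "553.94.1", which is the wrong way round.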

aflons

@laszlobortel We've seen far less of this issue since my last message, though I'm not sure what made it better or when. We still make sure to reboot monthly (during patching, as we normally do anyway) and after live migration, and that helps. We don't use load balancing, so once a VM stays put on one hypervisor, there is no issue. Live migration and time trigger the issue for us.

What changed in our infra is upgrade to XCP-NG 8.3 and moving to XOSTOR as shared storage. We've seen no issue with AlmaLinux 9 and CloudLinux 9 at all. They also perform better I/O wise.

laszlobortel

@aflons @jgrafton First of all, I would like to thank both of you very much for replying so quickly to this old thread!
Our failure rate is roughly 1 frozen VM / 90 Rocky8 VMs / day, which is not tolerable. We have hundreds more Rocky8 VMs on VMware, waiting for migration to XCP-ng.
I tried to summarise our options:

Our kernels are pretty fresh, but we can try the very latest available for Rocky 8.
Upgrading to Rocky 9 in the short term is not an option. We have to migrate Rocky 8 from VMware to XCP-ng first, then we can think about switching to Rocky 9 later.
VMware tools are removed during migration as part of the migration procedure.
We are already on shared lvmohba storage, which is a production grade Hitachi Vantara all SSD, same as under VMware, so I see no room for change/improvement here.
As a last resort we can try disabling the load-balancing plugin and rebooting monthly during our maintenance window, but this would be an ugly workaround.

Is there anything I forgot?

@jgrafton Was there any useful suggestion or conclusion in your Vates support ticket #7726289? I am afraid that we are facing a tricky interworking issue between the Xen hypervisor and the 4.18.0 kernel, and both components are independent of XCP-ng and Vates.

aflons

@laszlobortel Yes, I definitely think load balancing is the issue for you, since live migrations are the biggest trigger.

olivierlambert

It would be interesting to see whether the issue is triggered by live migrations. That could be a useful hint about the cause.