XCP-ng

    CPU pegged at 100% in several Rocky Linux 8 VMs without workload in guest

      jshiells @jgrafton

      @jgrafton Check your VM's CPU steal time. My guess is that's where it's going.

      Make sure VMware Tools is not running (or has been removed).

      Give the VM a reboot; that should clear the steal-time CPU usage, if that is the problem.

      We have seen this issue when hot-migrating VMs between pools (XCP-ng to XCP-ng, or Xen to XCP-ng).
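
For anyone who wants to check steal time from inside the guest without extra tools, here is a minimal Python sketch. It assumes a Linux guest where the aggregate `cpu` line of `/proc/stat` carries steal as its eighth counter; the function names and the 5-second sampling window are illustrative choices, not anything provided by XCP-ng.

```python
import time
from typing import Tuple


def parse_cpu_line(line: str) -> Tuple[int, int]:
    """Return (total_jiffies, steal_jiffies) from the aggregate 'cpu' line."""
    fields = line.split()
    if not fields or fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line of /proc/stat")
    values = [int(v) for v in fields[1:]]
    # Order: user nice system idle iowait irq softirq steal [guest guest_nice].
    # Guest time is already included in user/nice, so only the first eight
    # counters are summed to avoid double counting.
    total = sum(values[:8])
    steal = values[7] if len(values) > 7 else 0
    return total, steal


def steal_percent(before: str, after: str) -> float:
    """Percentage of CPU time reported as steal between two samples."""
    total0, steal0 = parse_cpu_line(before)
    total1, steal1 = parse_cpu_line(after)
    delta = total1 - total0
    return 100.0 * (steal1 - steal0) / delta if delta > 0 else 0.0


def sample_steal(interval: float = 5.0, path: str = "/proc/stat") -> float:
    """Read /proc/stat twice, `interval` seconds apart (interval is an assumption)."""
    with open(path) as f:
        first = f.readline()
    time.sleep(interval)
    with open(path) as f:
        second = f.readline()
    return steal_percent(first, second)
```

Running `print(sample_steal())` on a suspect VM shortly after a migration should show whether steal is where the CPU time is going; tools such as `top` expose the same counter as `st`.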

        jgrafton @jshiells

        @olivierlambert Nothing out of the ordinary in xl dmesg that I can tell.

        @jshiells I'm pretty sure the VMs have had VMware Tools removed, since that's part of our migration procedure, but I'll double-check.

        Annoyingly, we haven't been able to get a VM to fail all day.

          jshiells @jgrafton

          @jgrafton The steal-time CPU usage may not have anything to do with VMware Tools.

          I have seen this happen just by hot-migrating older Linux systems from host to host inside the same pool, as well as when hot-migrating between two different pools. I have also seen the load balancer plugin trigger it on old Linux versions when it moves a VM from host to host. I honestly don't think it has anything to do with XCP-ng, but rather with how the Linux VM deals with the very short pause during migration, which causes 100% CPU steal time to kick in.

            jgrafton @jshiells

            @jshiells I was wrong, open-vm-tools is installed on a lot of the systems we migrated. I just assumed it wasn't instead of checking. We'll remove it from all the systems, test further, and report back. Thank you for the insight!
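
Rather than assuming, a quick check like the following can be run on each migrated guest. This is only a sketch: it assumes an RPM-based guest (Rocky, Alma, CentOS), and the prefix list is an illustrative guess at package names, so adjust it to whatever your VMware-era images actually shipped.

```python
import subprocess
from typing import Iterable, List, Tuple

# Assumption: only open-vm-tools is matched by default; add other prefixes
# if your templates installed differently named VMware guest packages.
SUSPECT_PREFIXES: Tuple[str, ...] = ("open-vm-tools",)


def leftover_vmware_packages(
    names: Iterable[str], prefixes: Tuple[str, ...] = SUSPECT_PREFIXES
) -> List[str]:
    """Return the installed package names that look like VMware guest tools."""
    return sorted(n for n in names if n.startswith(prefixes))


def installed_rpm_names() -> List[str]:
    """List installed package names via `rpm -qa` with a name-only query format."""
    result = subprocess.run(
        ["rpm", "-qa", "--qf", "%{NAME}\\n"],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout.split()
```

On a guest, `leftover_vmware_packages(installed_rpm_names())` returning an empty list means none of the matched packages are installed.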

              olivierlambert Vates 🪐 Co-Founder CEO

              Ah, great catch and suggestion, @jshiells! It's not impossible that the previous VM tools are causing issues 🙂

                jgrafton @olivierlambert

                @olivierlambert So it turns out this issue wasn't caused by open-vm-tools.

                Even after uninstalling it from all our hosts in XCP, we still had several hosts climb to 100% CPU shortly after a migration.

                While combing through sar logs and several crash dumps, I found that the system load would rapidly increase in a short amount of time until the host was unreachable.

                I gathered from the crash dumps that the high load appeared to be caused by threads in spinlocks waiting on storage.

                This led me to believe the older kernel (4.18) was having difficulty recovering from the migration process.

                The simple fix was to upgrade the OS to Rocky 9 on some hosts and upgrade the kernel on ones not ready to have the OS upgraded.

                We've been running for a couple weeks without an issue.
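
The rapid load climb described above is the kind of pattern a small watchdog inside the guest could flag before the VM becomes unreachable. The sketch below is an assumption-laden illustration: the "strictly rising and above 2x the CPU count" rule, the five-sample window, and the 60-second interval are arbitrary choices, not values taken from the sar logs.

```python
import os
import time
from typing import List


def load_is_ramping(samples: List[float], cpu_count: int, factor: float = 2.0) -> bool:
    """True if the load rose on every sample and the latest exceeds factor * cpu_count."""
    if len(samples) < 3:
        return False
    rising = all(later > earlier for earlier, later in zip(samples, samples[1:]))
    return rising and samples[-1] > factor * cpu_count


def watch(interval: float = 60.0, window: int = 5) -> None:
    """Poll the 1-minute load average forever and print a warning when it ramps."""
    cpus = os.cpu_count() or 1
    history: List[float] = []
    while True:
        history.append(os.getloadavg()[0])
        history = history[-window:]
        if load_is_ramping(history, cpus):
            print("load ramping:", history, flush=True)
        time.sleep(interval)
```

Logging the warning to a remote syslog target would be more useful than printing, since the guest itself may freeze shortly afterwards.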

                  jshiells @jgrafton

                  @jgrafton It's a good theory. Just for awareness, I have seen this problem on:

                  • Debian 7, 8, 9
                  • Ubuntu 18
                  • CentOS 7, 8
                  • Alma 8

                  So it could be an XCP-ng and kernel 4.x issue, but it is definitely not limited to CentOS/Rocky/Alma (which are essentially the same).

                  Oddly enough, I have not seen this issue on CloudLinux 7, 8.

                    aflons

                    We experience the exact same issue with CloudLinux OS 8, seemingly random after live migration. This has been ongoing for years. Seems to happen far less now with shared storage.

                    My theory is that the kernel and/or PVE module somehow doesn't handle the freeze during live migration: the longer the freeze, the higher the risk of this happening.

                    VMs start to crash a random amount of time after live migration, never immediately. It could be hours, or even days, which makes it hard to diagnose. No crash dump, nothing, just 100% CPU on all cores and a frozen console.

                    One consistent thing we see, almost every time, is that top and other tools stop working: they are frozen in a state where no CPU load etc. is reported, but there is load on the server.

                    We've been going back and forth with CloudLinux support, and they made some changes to the tuned profile regarding disk buffers/cache that made things a bit more stable, but the problem is not 100% gone.

                    We don't see the same error in AlmaLinux 9 and CloudLinux OS 9.

                    More busy VM = more chance of happening. Uptime may be a factor, too.

                      laszlobortel

                      I am afraid that we have the same problem: ~90 Rocky8 VMs migrated from VMware, pegging one CPU very often. We have suspended further migration to XCP-ng due to this issue.
                      Has the root cause been identified since 2024? Is there a solution or workaround (apart from upgrading to Rocky 9)?

                        jgrafton @laszlobortel

                        @laszlobortel We never reached a definitive root cause and did end up fully migrating to XCP-NG from VMware.

                        We still have roughly 100 VMs running Rocky 8.10. The 4.18.0-553.94.1 kernels and above don't seem to have the same CPU issues but I'm not sure if that's because a kernel bug was mitigated or because we upgraded our backend storage to all flash arrays (Pure Storage C50's).
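
To see which guests are still below that kernel level, something like the following can be run against the output of `uname -r`. It is a rough sketch: the comparison only looks at the leading numeric parts of the release string and is not a full RPM version comparison, and the sample release strings in the usage note are illustrative.

```python
import platform
import re
from typing import Tuple

# Threshold taken from the post above; whether it marks an actual fix is unknown.
THRESHOLD = "4.18.0-553.94.1"


def kernel_key(release: str) -> Tuple[int, ...]:
    """Turn '4.18.0-553.94.1.el8_10.x86_64' into (4, 18, 0, 553, 94, 1)."""
    match = re.match(r"\d+(?:[.\-]\d+)*", release)
    if not match:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    return tuple(int(part) for part in re.split(r"[.\-]", match.group(0)))


def at_or_above(release: str, threshold: str = THRESHOLD) -> bool:
    """True if the numeric prefix of `release` is at or above `threshold`."""
    return kernel_key(release) >= kernel_key(threshold)


def running_kernel_ok() -> bool:
    """Check the kernel of the machine this script runs on."""
    return at_or_above(platform.release())
```

For example, `at_or_above("4.18.0-513.5.1.el8_9.x86_64")` is false, while a 6.x release string compares as newer than the threshold.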

                        The CPU still gets pegged on a Rocky 8 VM every once in a blue moon but not often enough to warrant more time being spent tracking it down.

                          aflons @laszlobortel

                          @laszlobortel We've seen far less of this issue since my last message; I'm not sure what made it better, or when. But we still make sure to reboot monthly (during patching, as we normally do anyway) and after live migration, and that helps. We don't use load balancing, so once a VM stays put on one hypervisor, there is no issue. Live migration plus time is what triggers the issue for us.

                          What changed in our infra is the upgrade to XCP-ng 8.3 and the move to XOSTOR as shared storage. We've seen no issue with AlmaLinux 9 and CloudLinux 9 at all. They also perform better I/O-wise.

                            laszlobortel @aflons

                            @aflons @jgrafton First of all, I would like to thank both of you very much for replying so quickly to this old thread!
                            Our failure rate is roughly 1 frozen VM per 90 Rocky 8 VMs per day, which is not tolerable. We have hundreds more Rocky 8 VMs on VMware waiting for migration to XCP-ng.
                            I tried to summarise our options:

                            • Our kernels are pretty fresh, but we can try the very latest available for Rocky 8.
                            • Upgrading to Rocky 9 in the short term is not an option. We have to migrate Rocky 8 from VMware to XCP-ng first; then we can think about switching to Rocky 9 later.
                            • VMware Tools is removed during migration, as part of the migration procedure.
                            • We are already on shared lvmohba storage, a production-grade, all-SSD Hitachi Vantara array, the same one we used under VMware, so I see no room for change or improvement here.
                            • As a last resort we can try disabling the load-balancing plugin and rebooting monthly during our maintenance window, but this would be an ugly workaround.

                            Is there anything I forgot?

                            @jgrafton Was there any useful suggestion or conclusion in your Vates support ticket #7726289? I am afraid that we are facing a tricky interaction issue between the Xen hypervisor and the 4.18.0 kernel, and that both components are independent of XCP-ng and Vates.

                              aflons @laszlobortel

                              @laszlobortel Yes, I definitely think load balancing is the issue for you, since live migrations are the biggest trigger.

                                olivierlambert Vates 🪐 Co-Founder CEO

                                It would be an interesting lead to check whether the issue is triggered by live migrations; that could be a hint about the cause.

                                  DustinB @laszlobortel

                                  @laszlobortel While I can understand "upgrading not being an option", you're lifting and shifting the workload (or at least have been attempting to, to date).

                                  Are you unable to build new and migrate the data over to XCP-ng? While I can see this causing more work, lifting and shifting is almost always a guaranteed way to cause headaches, like the ones you're experiencing.

                                  That is why each service provider recommends building new if you can. Building new also means updating, which of course can cause issues of its own, but Rocky 8 only receives security updates until 2029. Sure, it has a few years left, but why not take the opportunity to upgrade?

                                    john.c @DustinB

                                    @DustinB said:

                                    @laszlobortel While I can understand "upgrading not being an option", you're lifting and shifting the workload (or at least have been attempting to, to date).

                                    Are you unable to build new and migrate the data over to XCP-ng? While I can see this causing more work, lifting and shifting is almost always a guaranteed way to cause headaches, like the ones you're experiencing.

                                    That is why each service provider recommends building new if you can. Building new also means updating, which of course can cause issues of its own, but Rocky 8 only receives security updates until 2029. Sure, it has a few years left, but why not take the opportunity to upgrade?

                                    They can possibly go up to Rocky 9, but moving to Rocky 10 later may be harder, as it requires a higher CPU baseline. They may also have legacy software that only works on Rocky 8.

                                      jgrafton @laszlobortel

                                      @laszlobortel We concluded that older Linux kernels plus live migrations plus lvmohba storage seem to trigger the issue. Our workaround was to upgrade to a mainline 6.x kernel packaged by ELRepo (https://elrepo.org/wiki/doku.php?id=start) on Rocky 8 systems that were especially prone to the CPU hang.

                                      The kernel upgrades effectively stopped the issue from occurring.

                                        olivierlambert Vates 🪐 Co-Founder CEO

                                        @jgrafton said:

                                        @laszlobortel We concluded that older Linux kernels plus live migrations plus lvmohba storage seem to trigger the issue. Our workaround was to upgrade to a mainline 6.x kernel packaged by ELRepo (https://elrepo.org/wiki/doku.php?id=start) on Rocky 8 systems that were especially prone to the CPU hang.

                                        The kernel upgrades effectively stopped the issue from occurring.

                                        That's ultra helpful and interesting, @jgrafton 🤔

                                        Maybe it's even worth a KB/known-issue entry in our official docs. Let me ping @thomas-dkmt.

                                        I suppose https://docs.xcp-ng.org/troubleshooting/common-problems/ might be the right place to document it.

                                          laszlobortel @DustinB

                                          @DustinB I wrote "Upgrading to Rocky 9 on the short term is not an option." Please let me explain why! We are a telco with a layered operating model: our team is responsible for virtualisation (VMware/Broadcom, Hyper-V, XCP-ng), and another team is responsible for OS operation. The IaaS team is tasked with the VMware exit, which means that we must migrate hundreds of VMs from VMware to XCP-ng as quickly as possible this year, unchanged, using the "lift-and-shift" method. It is a requirement that a VM which runs on VMware should run on XCP-ng, preferably unchanged. Even a simple kernel upgrade causes some delay in our migration plan. We can propose to the OS team that they migrate to Rocky 9, and they might consider and schedule it, but it will not happen immediately.
                                          Apart from this organisational reason, my experience tells me that while upgrading to Rocky 9 would most probably solve this issue, it would raise others (probably in the Docker/Kubernetes layer or in the application layer).

                                            laszlobortel @jgrafton

                                            @jgrafton I am a bit confused about the role of lvmohba storage in triggering this problem, because @aflons stated above (back in 2024) that it "Seems to happen far less now with shared storage."
                                            It is not clear to me whether shared storage helps to solve the problem or makes it worse. Or is lvmohba a special kind of "bad" shared storage in this respect?
                                            In any case, lvmohba is a fixed point in our architecture that we cannot replace. I am just curious whether we should experiment with another type of storage to rule out or confirm the contribution of lvmohba to this problem.
