XCP-ng

    CPU pegged at 100% in several Rocky Linux 8 VMs without workload in guest

    • A Offline
      aflons @laszlobortel

      @laszlobortel Yes, I definitely think load balancing is the issue for you, since live migrations are the biggest trigger.

      • olivierlambertO Online
        olivierlambert Vates 🪐 Co-Founder CEO

        It would be interesting to check whether the issue is triggered by live migrations; that could be a hint about the cause.

        • D Offline
          DustinB @laszlobortel

          @laszlobortel While I can understand "Upgrading not being an option", you're lifting and shifting the workload (or at least have been attempting to so far).

          Are you unable to build new and migrate the data over to XCP-ng? While I could see this causing more work, lift-and-shift is almost always a guaranteed way to cause headaches, like the ones you're experiencing.

          That is why each service provider recommends building new if you can. Building new also means updating, which of course can cause issues of its own, but Rocky 8 will only receive security updates until 2029. Sure, it has a few years left, but why not take the opportunity to upgrade?

          • J Offline
            john.c @DustinB

            They can possibly go up to Rocky 9, but moving to Rocky 10 later may be harder, as it requires a higher CPU baseline. They may also have legacy software that only works on Rocky 8.

            • jgraftonJ Offline
              jgrafton @laszlobortel

              @laszlobortel We concluded that older Linux kernels plus live migrations plus lvmohba storage seem to trigger the issue. Our workaround was to upgrade to a mainline 6.x kernel packaged by ELRepo (https://elrepo.org/wiki/doku.php?id=start) on the Rocky 8 systems that were especially prone to the CPU hang.

              The kernel upgrades effectively stopped the issue from occurring.
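
When rolling out a workaround like this across many guests, it can help to confirm which kernel each VM is actually running after the reboot. The following is a minimal Python sketch, not part of the original workaround: it only parses the release string reported by the guest and checks whether the major version is 6 or higher. The function names are hypothetical, and the threshold of 6 is an assumption taken from the "mainline 6.x" wording above.

```python
import platform
import re


def kernel_major_minor(release):
    """Return (major, minor) parsed from a kernel release string."""
    match = re.match(r"^(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2))


def is_6x_or_newer(release):
    """True if the release string belongs to a 6.x or newer kernel."""
    return kernel_major_minor(release)[0] >= 6


if __name__ == "__main__":
    # platform.release() reports the running kernel, like `uname -r`.
    release = platform.release()
    status = "6.x or newer" if is_6x_or_newer(release) else "older than 6.x"
    print(f"{release}: {status}")
```

Run inside a guest, this prints the running kernel release and whether it is already on a 6.x kernel; it does not install or change anything.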

              • olivierlambertO Online
                olivierlambert Vates 🪐 Co-Founder CEO

                That's ultra helpful and interesting, @jgrafton 🤔

                Maybe it's even worth a KB/known issue in our official doc, let me ping @thomas-dkmt

                I suppose https://docs.xcp-ng.org/troubleshooting/common-problems/ might be the right place to document it.

                • laszlobortelL Offline
                  laszlobortel @DustinB

                  @DustinB I wrote "Upgrading to Rocky 9 on the short term is not an option." Please let me explain why. We are a telco with a layered operating model: our team is responsible for virtualisation (VMware/Broadcom, Hyper-V, XCP-ng), and another team is responsible for OS operation. The IaaS team is tasked with the VMware exit, which means that we must migrate hundreds of VMs from VMware to XCP-ng as quickly as possible this year, unchanged, using the "lift-and-shift" method. It is a requirement that a VM which runs on VMware should run on XCP-ng, preferably unchanged. Even a simple kernel upgrade causes some delay in our migration plan. We can propose to the OS team that they migrate to Rocky 9, and they might consider and schedule it, but it will not happen immediately.
                  Apart from this organisational reason, my experience tells me that while upgrading to Rocky 9 would most probably solve this issue, it would raise others (probably in the Docker/Kubernetes layer or in the application layer).

                  • laszlobortelL Offline
                    laszlobortel @jgrafton

                    @jgrafton I am a bit confused about the role of lvmohba storage in triggering this problem, because @aflons stated above (back in 2024) that "Seems to happen far less now with shared storage."
                    It is not clear to me whether shared storage helps to solve the problem or makes it worse. Or is lvmohba a special kind of "bad" shared storage in this respect?
                    In any case, lvmohba is a fixed point in our architecture that we cannot replace. I am just curious whether we should experiment with another type of storage to rule out or confirm the contribution of lvmohba to this problem.

                    • laszlobortelL Offline
                      laszlobortel

                      We have checked our kernel versions (number of VMs, then kernel release); some are very old:

                           10  4.18.0-553.109.1.el8_10.x86_64
                            3  4.18.0-553.117.1.el8_10.x86_64
                            2  4.18.0-553.120.1.el8_10.x86_64
                            4  4.18.0-553.16.1.el8_10.x86_64
                            2  4.18.0-553.22.1.el8_10.x86_64
                           12  4.18.0-553.30.1.el8_10.x86_64
                            2  4.18.0-553.34.1.el8_10.x86_64
                            4  4.18.0-553.36.1.el8_10.x86_64
                            1  4.18.0-553.40.1.el8_10.x86_64
                            2  4.18.0-553.47.1.el8_10.x86_64
                            3  4.18.0-553.51.1.el8_10.x86_64
                            2  4.18.0-553.54.1.el8_10.x86_64
                            3  4.18.0-553.56.1.el8_10.x86_64
                            1  4.18.0-553.58.1.el8_10.x86_64
                            4  4.18.0-553.62.1.el8_10.x86_64
                            2  4.18.0-553.63.1.el8_10.x86_64
                            3  4.18.0-553.69.1.el8_10.x86_64
                            2  4.18.0-553.74.1.el8_10.x86_64
                            1  4.18.0-553.81.1.el8_10.x86_64
                            1  4.18.0-553.83.1.el8_10.x86_64
                            2  4.18.0-553.87.1.el8_10.x86_64
                           16  4.18.0-553.89.1.el8_10.x86_64
                            4  4.18.0-553.94.1.el8_10.x86_64
                      

                      The OS team will do the kernel upgrade and I will come back with the result. Versions .109, .117 and .120 have not failed yet in our environment. We have high hopes!
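
The tally above is sorted lexically, so .109 appears before .16, which makes it harder to see how many VMs are on older releases. Below is a minimal Python sketch, assuming the tally is available as `count version` lines, that orders the kernels by their numeric release and counts the VMs below a chosen cut-off. The function names are hypothetical, and the cut-off of 109 is an assumption based only on the observation that .109 and later have not failed so far; the sample uses three lines from the list.

```python
import re

# Matches e.g. "4.18.0-553.109.1.el8_10.x86_64" and captures the part of the
# release that differs between the kernels in the list (here: 109).
KERNEL_RE = re.compile(r"^4\.18\.0-553\.(\d+)\.\d+\.el8_10\.x86_64$")


def parse_tally(text):
    """Parse 'count version' lines into sorted (release, count, version) tuples."""
    rows = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        count, version = parts
        match = KERNEL_RE.match(version)
        if match and count.isdigit():
            rows.append((int(match.group(1)), int(count), version))
    return sorted(rows)  # numeric order, so 16 sorts before 109


def vms_below(rows, cutoff):
    """Total number of VMs whose kernel release number is below the cut-off."""
    return sum(count for release, count, _ in rows if release < cutoff)


SAMPLE = """
    10  4.18.0-553.109.1.el8_10.x86_64
     4  4.18.0-553.16.1.el8_10.x86_64
    16  4.18.0-553.89.1.el8_10.x86_64
"""

if __name__ == "__main__":
    rows = parse_tally(SAMPLE)
    for release, count, version in rows:
        print(f"{count:4d}  {version}")
    print("VMs below .109:", vms_below(rows, 109))
```

Feeding the full list into `parse_tally` instead of `SAMPLE` gives the same ordering for every kernel in the inventory.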

                      • jgraftonJ Offline
                        jgrafton @laszlobortel

                        @laszlobortel Hehe yeah, those are pretty old kernels. I'd say there's a good chance kernel upgrades will go a long way toward alleviating the CPU hangs.

                        I can't say it's exclusively a problem with lvmohba storage; that's just what we use because our previous VMware infrastructure was block storage over Fibre Channel. We knew a physical infra overhaul wasn't in the cards for us for this migration, so we stayed with our existing storage system. This bug bit us halfway through the migration, until we figured out that upgrading the kernel generally fixed it.

