XCP-ng

    CPU pegged at 100% in several Rocky Linux 8 VMs without workload in guest

    Compute
    14 Posts 4 Posters 2.3k Views 6 Watching
    • jgrafton @olivierlambert

      @olivierlambert That was my initial thought: the PV driver in the older kernel. No single process is using much CPU in the guest, though total CPU is at 100% (when running top in the VM).

      Haven't been able to get Rocky 9 to fail yet, but it can take a day or two.

      • olivierlambert Vates 🪐 Co-Founder CEO

        Keep us posted! I will try to start a Rocky 8 VM to see if it's doing this too. Anything in the xl dmesg?

        • jshiells @jgrafton

          @jgrafton check your VM's CPU steal time. My guess is that's where it's going.

          Make sure VMware Tools is not running (or has been removed).

          Give the VM a reboot; that should clear the steal-time CPU usage, if that is the problem.

          We have seen this issue when hot migrating VMs between pools (XCP-ng to XCP-ng or Xen to XCP-ng).
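
For anyone who wants to check steal time without relying on top, here is a minimal Python sketch. It assumes a Linux guest where the aggregate `cpu` line in /proc/stat lists steal as its eighth counter (user, nice, system, idle, iowait, irq, softirq, steal). The helper names are illustrative only, not part of any tool mentioned in this thread.

```python
import time


def parse_cpu_line(line):
    """Return the numeric counters from the aggregate 'cpu' line of /proc/stat."""
    parts = line.split()
    if not parts or parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    return [int(value) for value in parts[1:]]


def steal_percent(before, after):
    """Percentage of CPU time reported as steal between two samples."""
    first = parse_cpu_line(before)
    second = parse_cpu_line(after)
    # Only the first eight counters are summed; guest time is already
    # included in user time, so adding it would double count.
    deltas = [b - a for a, b in zip(first[:8], second[:8])]
    total = sum(deltas)
    if total <= 0:
        return 0.0
    steal = deltas[7] if len(deltas) > 7 else 0
    return 100.0 * steal / total


def read_cpu_line(path="/proc/stat"):
    """Read the first line of /proc/stat, which is the aggregate cpu line."""
    with open(path) as handle:
        return handle.readline()


if __name__ == "__main__":
    sample_a = read_cpu_line()
    time.sleep(1)  # sampling interval is an arbitrary choice
    sample_b = read_cpu_line()
    print(f"steal: {steal_percent(sample_a, sample_b):.1f}%")
```

A value that stays near 100% on an otherwise idle guest would match the symptom described above; a value near zero would point elsewhere.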

          • jgrafton @jshiells

            @olivierlambert Nothing out of the ordinary in xl dmesg that I can tell.

            @jshiells I'm pretty sure the VMs have had VMware Tools removed, since that's part of our migration procedure, but I'll double check.

            Annoyingly, we haven't been able to get a VM to fail all day.

            • jshiells @jgrafton

              @jgrafton the steal-time CPU usage may not have anything to do with VMware Tools.

              I have seen this happen just by hot migrating older Linux systems from host to host inside the same pool, as well as by hot migrating between two different pools. I have also seen the load balance plugin trigger this on old Linux versions when it moves a VM from host to host. I honestly don't think it has anything to do with XCP-ng, but rather with how the Linux VM deals with the very short pause during migration, which causes 100% CPU steal time to kick in.

              • jgrafton @jshiells

                @jshiells I was wrong, open-vm-tools is installed on a lot of the systems we migrated. I just assumed it wasn't instead of checking. We'll remove it from all the systems, test further, and report back. Thank you for the insight!

                • olivierlambert Vates 🪐 Co-Founder CEO

                  Ah, great catch and suggestion @jshiells! It's not impossible that the previous VM tools are causing issues 🙂

                  • jgrafton @olivierlambert

                    @olivierlambert So it turns out this issue wasn't caused by open-vm-tools.

                    Even after uninstalling it from all our hosts in XCP, we still had several hosts climb to 100% CPU shortly after a migration.

                    While combing through sar logs and several crash dumps, I found that the system load would rapidly increase in a short amount of time until the host was unreachable.

                    I gathered from the crash dumps that the high load appeared to be caused by threads in spinlocks waiting on storage.

                    This led me to believe the older kernel (4.18) was having difficulty recovering from the migration process.

                    The simple fix was to upgrade the OS to Rocky 9 on some hosts and upgrade the kernel on ones not ready to have the OS upgraded.

                    We've been running for a couple weeks without an issue.
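
To find which guests are still on an older kernel series, something like the following Python sketch could be run inside each VM. It assumes the release string looks like the output of `uname -r` (for example `4.18.0-...`). The 5.0 cutoff is an illustrative assumption, not a threshold established in this thread, and the function names are hypothetical.

```python
import platform
import re


def kernel_major_minor(release):
    """Extract (major, minor) from a kernel release string such as uname -r prints."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2))


def is_older_kernel(release, cutoff=(5, 0)):
    """True when the kernel is older than the cutoff (assumed, not verified)."""
    return kernel_major_minor(release) < cutoff


if __name__ == "__main__":
    running = platform.release()
    if is_older_kernel(running):
        print(f"{running}: older kernel series, candidate for upgrade")
    else:
        print(f"{running}: at or above the cutoff")
```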

                    • jshiells @jgrafton

                      @jgrafton it's a good theory. Just for awareness, I have seen this problem on:

                      • Debian 7, 8, 9
                      • Ubuntu 18
                      • CentOS 7, 8
                      • Alma 8

                      So it could be an XCP-ng and kernel 4 issue, but it is definitely not limited to CentOS/Rocky/Alma (which are essentially the same).

                      Oddly enough, I have not seen this issue on CloudLinux 7, 8.

                      • aflons

                        We experience the exact same issue with CloudLinux OS 8, seemingly at random after live migration. This has been ongoing for years. It seems to happen far less now with shared storage.

                        My theory is that the kernel and/or PVE module somehow doesn't handle the freeze during live migration: the longer the freeze, the higher the risk of this happening.

                        VMs start to crash a random amount of time after live migration, never immediately. It could be hours, or even days, making it hard to diagnose. There is no crash dump, nothing, just 100% CPU on all cores and a frozen console.

                        One consistent thing we see, almost every time, is that top and other tools stop working: they are frozen in a state where no CPU load etc. is reported, but there is load on the server.

                        We've been going back and forth with CloudLinux support, and they made some changes to the tuned profile regarding disk buffers/cache that made things a bit more stable, but the problem is not 100% gone.

                        We don't see the same error in AlmaLinux 9 and CloudLinux OS 9.

                        More busy VM = more chance of happening. Uptime may be a factor, too.
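
Because the failure shows up hours or days after a migration and leaves no crash dump, one option is to record the load average periodically and look for the point where it jumps. The sketch below is only an illustration: the `ratio` and `minimum` thresholds are arbitrary assumptions, and `find_load_jump` is a hypothetical helper, not an existing tool.

```python
import os


def find_load_jump(samples, ratio=3.0, minimum=4.0):
    """Return the index of the first sample that jumps sharply, or None.

    A jump is a sample that is at least `minimum` and at least `ratio`
    times the previous sample. Both thresholds are illustrative guesses.
    """
    for index in range(1, len(samples)):
        previous, current = samples[index - 1], samples[index]
        if current >= minimum and current >= ratio * previous:
            return index
    return None


if __name__ == "__main__":
    # os.getloadavg() returns the 1, 5 and 15 minute load averages on Unix.
    one_minute, five_minute, fifteen_minute = os.getloadavg()
    print(f"load averages: {one_minute:.2f} {five_minute:.2f} {fifteen_minute:.2f}")
    position = find_load_jump([fifteen_minute, five_minute, one_minute])
    print("recent jump detected" if position is not None else "no sharp jump")
```

Writing the samples to a remote log with timestamps would make it possible to line the jump up against the time of the last migration, even if the console is frozen by the time anyone looks.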
