XCP-ng

    Rolling pool update failure: not enough PCPUs even though all should fit (dom0 culprit?)

      pkgw

      Hi,

      I have a small test VM cluster that I'm trying to apply a rolling pool update to. There are three physical hosts, with 32, 32, and 12 CPUs, respectively. When I try to initiate the update, it insta-fails with the error:

      "CANNOT_EVACUATE_HOST(HOST_NOT_ENOUGH_PCPUS,16,12)"
      

      My understanding is that this means that the updater needs to move a VM requiring 16 vCPUs onto the machine with 12 pCPUs.
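To make that reading of the error concrete, here is a minimal sketch that splits the message into its parts. It assumes, as the post does, that the first number is the vCPU count required and the second is the pCPU count available on the target host; the function name and field names are hypothetical and not part of any XCP-ng tooling.

```python
import re

# Assumed layout: CODE(REASON,<vCPUs needed>,<pCPUs available>)
ERROR_PATTERN = re.compile(
    r"(?P<code>[A-Z_]+)\((?P<reason>[A-Z_]+),(?P<needed>\d+),(?P<available>\d+)\)"
)

def parse_pcpu_error(message):
    """Return the parts of the error as a dict, or None if the text does not match."""
    match = ERROR_PATTERN.search(message)
    if match is None:
        return None
    return {
        "code": match.group("code"),
        "reason": match.group("reason"),
        "vcpus_needed": int(match.group("needed")),        # assumption: first number
        "pcpus_available": int(match.group("available")),  # assumption: second number
    }

info = parse_pcpu_error('"CANNOT_EVACUATE_HOST(HOST_NOT_ENOUGH_PCPUS,16,12)"')
print(info)
```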

      The mystery is that none of my VMs need nearly that many CPUs! I've dialed them all down to 2 vCPUs, and the error message is the same.

      Looking at the xe vm-list output, I do see that two of the "Control domain on host: ..." VMs want 16 vCPUs. Are those potentially the culprit here? What would be the recommended way to dial down their CPU allocations? I've seen some messages about using the host-cpu-tune command and I could try playing around with xe, but I'm a little hesitant to fiddle with these parts of the infrastructure without really knowing what I'm doing.

        pkgw

        Of course, just after posting, I think I figured out what's happening.

        It looks like the relevant parameter isn't the current number of allowed vCPUs set via the UI (VCPUs-number), but the maximum number of vCPUs (VCPUs-max). One of the VMs in my cluster had VCPUs-max = 16. After powering it off, I could reduce this number, and now the RPU appears to be proceeding.
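The behaviour described above can be sketched as a simple comparison. This is only a simplified model based on what was observed in this thread, not the actual XAPI logic: it assumes a VM blocks evacuation to a host whenever its VCPUs-max exceeds that host's pCPU count, regardless of VCPUs-number. The VM names and the dict layout are made up for illustration.

```python
def blocking_vms(vms, host_pcpus):
    """Return the names of VMs whose VCPUs-max exceeds the host's pCPU count."""
    return [vm["name"] for vm in vms if vm["vcpus_max"] > host_pcpus]

# Hypothetical inventory: both VMs currently use 2 vCPUs,
# but one of them still has VCPUs-max = 16.
vms = [
    {"name": "vm-a", "vcpus_number": 2, "vcpus_max": 16},
    {"name": "vm-b", "vcpus_number": 2, "vcpus_max": 2},
]

blockers_12 = blocking_vms(vms, host_pcpus=12)  # the 12-pCPU host
blockers_32 = blocking_vms(vms, host_pcpus=32)  # a 32-pCPU host
print(blockers_12, blockers_32)
```

In this model, lowering VCPUs-number alone changes nothing; only reducing VCPUs-max (which required powering the VM off in this case) removes the VM from the blocking list.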

          kagbasi-ngc

          I have seen this problem before in my test lab. Unfortunately, I didn't document it enough to report here. For me, the solution was also to simply power off the culprit VM to prevent the attempted migration.

          To my mind, the RPU logic should use the current running state of the VMs to determine which resources are actually in use and which hosts can support them, since the move is only temporary. Then again, I'm not in the know about all the factors that went into the decision to have it work the way it does. I'm sure there's a valid reason.

            olivierlambert Vates 🪐 Co-Founder CEO

            That's a security problem: because of Spectre/Meltdown, it's very dangerous to run a VM that could have more vCPUs than the host has pCPUs.
