Rolling pool update failure: not enough PCPUs even though all should fit (dom0 culprit?)

pkgw

Hi,

I have a small test VM cluster that I'm trying to apply a rolling pool update to. There are three physical hosts, with 32, 32, and 12 CPUs, respectively. When I try to initiate the update, it insta-fails with the error:

"CANNOT_EVACUATE_HOST(HOST_NOT_ENOUGH_PCPUS,16,12)"

My understanding is that this means that the updater needs to move a VM requiring 16 vCPUs onto the machine with 12 pCPUs.
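For anyone trying to read the same error, here is a minimal Python sketch that splits the message into its parts. The field names `vcpus_needed` and `pcpus_available` are my own labels based on the interpretation above, not anything documented, and the helper `parse_evacuation_error` is hypothetical.

```python
import re

def parse_evacuation_error(message):
    """Split an error of the form ERROR(REASON,a,b) into named parts.

    The meaning assigned to the two numbers is an assumption taken from
    the interpretation in this post, not from official documentation.
    """
    # Optional surrounding double quotes, then NAME(REASON,int,int).
    match = re.fullmatch(r'"?(\w+)\((\w+),(\d+),(\d+)\)"?', message.strip())
    if match is None:
        raise ValueError(f"unrecognized error format: {message!r}")
    error, reason, vcpus, pcpus = match.groups()
    return {
        "error": error,
        "reason": reason,
        "vcpus_needed": int(vcpus),      # assumed: vCPUs the VM requires
        "pcpus_available": int(pcpus),   # assumed: pCPUs on the target host
    }

info = parse_evacuation_error('"CANNOT_EVACUATE_HOST(HOST_NOT_ENOUGH_PCPUS,16,12)"')
print(info["reason"], info["vcpus_needed"], info["pcpus_available"])
```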

The mystery is that none of my VMs need nearly that many CPUs! I've dialed them all down to 2 vCPUs, and the error message is the same.

Looking at the xe vm-list output, I do see that two of the "Control domain on host: ..." VMs want 16 vCPUs. Could those be the culprit here? What would be the recommended way to reduce their CPU allocations? I've seen some messages about using the host-cpu-tune command, and I could try playing around with xe, but I'm a little hesitant to fiddle with these parts of the infrastructure without really knowing what I'm doing.

pkgw

Of course, just after posting, I think I figured out what's happening.

It looks like the relevant parameter isn't the current number of allowed vCPUs set via the UI (VCPUs-number), but the maximum number of vCPUs (VCPUs-max). One of the VMs in my cluster had VCPUs-max = 16. After powering it off, I could reduce this number, and now the RPU appears to be proceeding.
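To illustrate the distinction, here is a rough Python sketch of the kind of comparison that seems to be happening. This is only my simplified model, not the actual RPU logic: the VM records, the field names `vcpus_number` and `vcpus_max`, and the helper `vms_too_big_for_host` are all made up for the example.

```python
def vms_too_big_for_host(vms, host_pcpus):
    """Return names of VMs whose maximum vCPU count exceeds the host's pCPUs.

    Simplified model: only the maximum is compared, so a VM currently
    running with 2 vCPUs is still flagged if its maximum is 16.
    """
    return [vm["name"] for vm in vms if vm["vcpus_max"] > host_pcpus]

# Hypothetical inventory: both VMs currently use 2 vCPUs,
# but one of them still has a maximum of 16.
vms = [
    {"name": "vm-a", "vcpus_number": 2, "vcpus_max": 2},
    {"name": "vm-b", "vcpus_number": 2, "vcpus_max": 16},
]

print(vms_too_big_for_host(vms, 12))  # ['vm-b']
print(vms_too_big_for_host(vms, 32))  # []
```

Under this model, lowering only the current vCPU count changes nothing, which matches what I saw: the error went away only once the maximum was reduced.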

kagbasi-ngc

I have seen this problem before in my test lab. Unfortunately, I didn't document it enough to report here. For me, the solution was also to simply power off the culprit VM to prevent the attempted migration.

In my mind, the RPU logic should use the current running state of the VMs to determine which resources are in use and which hosts can support them, since the move is only temporary. Then again, I'm not in the know about all the factors that went into the decision to have it work the way it does. I'm sure there's a valid reason.

olivierlambert

That's a security problem: due to Spectre/Meltdown, it's very dangerous to run a VM that could have more vCPUs than the host has pCPUs.