XCP-ng

    Latest posts made by pkgw

    • RE: Rolling pool update failure: not enough PCPUs even though all should fit (dom0 culprit?)

      Of course, just after posting, I think I figured out what's happening.

      It looks like the relevant parameter isn't the current number of allowed vCPUs set via the UI (VCPUs-number), but the maximum number of vCPUs (VCPUs-max). One of the VMs in my cluster had VCPUs-max = 16. After powering it off, I could reduce this number, and now the RPU appears to be proceeding.
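
      For anyone who hits the same thing, the check and fix looked roughly like this from dom0 (the UUID is a placeholder, and note that VCPUs-at-startup has to stay at or below VCPUs-max, so lower it first if needed):

        # List every VM's maximum and startup vCPU counts
        xe vm-list params=name-label,VCPUs-max,VCPUs-at-startup

        # With the VM shut down, lower the startup count, then the maximum
        xe vm-param-set uuid=<vm-uuid> VCPUs-at-startup=2
        xe vm-param-set uuid=<vm-uuid> VCPUs-max=2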

      posted in Management
    • Rolling pool update failure: not enough PCPUs even though all should fit (dom0 culprit?)

      Hi,

      I have a small test VM cluster that I'm trying to apply a rolling pool update to. There are three physical hosts, with 32, 32, and 12 CPUs, respectively. When I try to initiate the update, it insta-fails with the error:

      "CANNOT_EVACUATE_HOST(HOST_NOT_ENOUGH_PCPUS,16,12)"
      

      My understanding is that this means that the updater needs to move a VM requiring 16 vCPUs onto the machine with 12 pCPUs.

      The mystery is that none of my VMs need nearly that many CPUs! I've dialed them all down to 2 vCPUs, and the error message is the same.

      Looking at the xe vm-list output, I do see that two of the "Control domain on host: ..." VMs want 16 vCPUs. Are those potentially the culprits here? What would be the recommended way to dial down their CPU allocations? I've seen some messages about using the host-cpu-tune command, and I could try playing around with xe, but I'm hesitant to fiddle with these parts of the infrastructure without really knowing what I'm doing.
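
      In case it helps: this is what I've been using to inspect the control domains so far (read-only; I haven't changed anything yet, and my understanding from the docs is that host-cpu-tune show only reports the current dom0 layout rather than modifying it):

        # Show how many vCPUs each control domain has, and which host it lives on
        xe vm-list is-control-domain=true params=name-label,VCPUs-max,resident-on

        # Reportedly shows the current dom0 vCPU count and pinning without changing anything
        host-cpu-tune show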

      posted in Management
    • RE: Migrating an offline VM disk between two local SRs is slow

      @olivierlambert This is all offline. Unfortunately I can't describe exactly what was done, since someone else was doing the work and they were trying a bunch of different things in a row. I suspect that the apparently fast migration is a red herring (maybe a previous attempt left a copy of the disk on the destination SR, and the system noticed that and avoided the actual I/O?), but if there turned out to be a magical fast path, I wouldn't complain!

      posted in Xen Orchestra
    • RE: Migrating an offline VM disk between two local SRs is slow

      @olivierlambert Thanks, that's good to know. I appreciate your taking the time to discuss. I don't suppose there are any settings we can fiddle with that would speed up the single-disk scenario? Or some workaround approach that might get closer to the hardware's native speed? (In one experiment, someone did something that caused the system to transfer the OS disk with the log message "Cloning VDI" rather than "Creating a blank remote VDI", and the effective throughput was higher by a factor of 20 ...)

      posted in Xen Orchestra
    • RE: Migrating an offline VM disk between two local SRs is slow

      @olivierlambert Ah, yes, that is true. We are migrating one extremely large VM, so there isn't much to parallelize for us, unfortunately. But it is true that someone tried something that caused the system to migrate two disks at once, and the total throughput did double.

      (Specifics: the machine is 4 TiB spread across 2 big disks and 1 small OS disk, and the way we're attempting the migration right now, everything goes serially, so the task takes multiple days and hits various timeouts. We think we can solve the timeouts, and we don't expect to need this kind of migration very often, but I'd still like to understand why the single-disk throughput is so much lower than what we believe the hardware is capable of.)
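
      (If it's useful context: the workaround we're considering is to copy the two big disks in parallel from dom0 instead of letting the migration run them serially, roughly like the sketch below. The UUIDs are placeholders, and I'm assuming xe vdi-copy is an acceptable way to move an offline disk between local SRs.)

        # Copy each large VDI to the destination SR in its own background job
        xe vdi-copy uuid=<data-disk-1-uuid> sr-uuid=<destination-sr-uuid> &
        xe vdi-copy uuid=<data-disk-2-uuid> sr-uuid=<destination-sr-uuid> &
        wait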

      posted in Xen Orchestra
    • RE: Migrating an offline VM disk between two local SRs is slow

      @olivierlambert No, this is an array of SSDs on a PowerVault, and we have evidence that the raw throughput we can get out of the system is much, much higher. I'm not 100% sure, but it seems that some software aspect of the migration framework is the bottleneck, although I've poked around and don't see anything that appears to be CPU-bound either.

      posted in Xen Orchestra
    • RE: Migrating an offline VM disk between two local SRs is slow

      What generally limits the migration speed? My installation is seeing a phenomenon that might be similar — when migrating between SRs we're only getting about 20 MB/s, while everything is wired together with 10 Gbps links.

      posted in Xen Orchestra