Wide VMs on XCP-ng

plaidypus

Good Morning Everyone,

We are currently in the midst of testing XCP-ng as a replacement for our vSphere infrastructure. I have a general idea of how wide VMs work in VMware, and what settings should be in place for wide VMs that are larger than a NUMA node on our servers.

Are there any recommendations for VMs on XCP-ng that span NUMA nodes? Any advice or information would be greatly appreciated.

Thanks!

olivierlambert

I think @dthenot or @TeddyAstie could provide some details. I would advise opening a support ticket to get a more detailed answer, though (depending on your specific infrastructure and such).

TeddyAstie

@plaidypus I don't know a lot about NUMA on Xen, but we have a part in the docs regarding that
https://docs.xcp-ng.org/compute/#numa-affinity

And also other documentation on the subject
https://xapi-project.github.io/new-docs/toolstack/features/NUMA/index.html
There was a design session regarding NUMA at the latest Xen Summit: https://youtu.be/KoNwEYMlhyU?list=PLQMQQsKgvLnvjRgDnb-5T51e1kGHgs1SO

planedrop

I haven't had any issues with NUMA node balancing, do you need this for a specific high performance application? Generally speaking it should "just work".

There are some things you can do to optimize but I'd only go down that road if you are running into issues with very wide VMs.

Also, are we talking cross socket NUMA nodes (e.g. multiple CPUs) or just nodes within an EPYC CPU?

plaidypus

@planedrop This is for wide VMs spanning across socket NUMA nodes. These are all dual-socket servers.

I was hoping there were some best practices for topology or anything like that. I don't have any desire/intention to pin the vCPUs onto a specific NUMA node. Coming from vSphere, I was instructed, if at all possible, not to have more vCPUs on a VM than there are pCPUs on a NUMA node. That way you can let the scheduler handle it optimally.

Andrew

@plaidypus Yes, it just works. The guest VM does not see NUMA info and just sees one node even if the config is set for multi socket.

You can't assign more cores than actual threads but you can assign as many as you have to one VM. There will be some performance penalty as the cores in use may use memory from another node.

If you have dual 16-core CPUs with HT enabled, then you could assign 64 cores to a VM.
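
To make that arithmetic explicit, here is a rough Python sketch. It is not anything XCP-ng ships: the function names are made up for illustration, and it assumes one NUMA node per socket.

```python
def threads_per_node(cores_per_socket: int, threads_per_core: int) -> int:
    """Hardware threads in one NUMA node, assuming one node per socket."""
    return cores_per_socket * threads_per_core


def host_threads(sockets: int, cores_per_socket: int, threads_per_core: int) -> int:
    """Total hardware threads on the host (upper bound on vCPUs for one VM)."""
    return sockets * threads_per_node(cores_per_socket, threads_per_core)


def fits_in_one_node(vcpus: int, cores_per_socket: int, threads_per_core: int) -> bool:
    """True if the vCPU count does not exceed one node's hardware threads."""
    return vcpus <= threads_per_node(cores_per_socket, threads_per_core)


if __name__ == "__main__":
    # Dual 16-core CPUs with HT (2 threads per core), as in the example above.
    print(host_threads(2, 16, 2))       # 64
    print(fits_in_one_node(32, 16, 2))  # True (by thread count only)
    print(fits_in_one_node(33, 16, 2))  # False
```

This only compares counts. It says nothing about where the scheduler actually places the vCPUs or which node the memory comes from.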

plaidypus

@Andrew Per your example, if the NUMA size is 16 cores with 32 threads, would a 32 vCPU VM fit inside the NUMA node or would that spread them out across both nodes?

Andrew

@plaidypus I am not 100% sure what the correct answer is for the default XCP configuration. I think the basic answer is: no. Xen/XCP does not care what cores it uses for your VM. So on average your performance will be a little worse than not crossing NUMA nodes but better than always interleaving NUMA nodes. Some systems will be better/worse than others.

The Xen/XCP hypervisor does have a NUMA-aware scheduler. There are two basic modes. One is CPU hard pinning, where you specify which cores a VM (domain) uses. This forces the VM to use only the cores it is assigned. The other is to let Xen/XCP do its own work, where it tries to schedule a VM's (domain's) core use within a single CPU pool. The problem with this is that the default config puts all cores (and HT) in a single default pool. There are some options to try to enable best-effort NUMA assignment, but I believe it is not set that way by default.

You can configure the CPUs of a NUMA node into an individual pool (see below). A VM can be given an affinity for a single pool (soft CPU pinning). This would keep most of the work on that single node, as you want.
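
As a rough illustration of grouping CPUs by node, here is a small Python sketch. The topology dictionary is made-up example data (on a real host the CPU-to-node mapping would come from the hypervisor's topology output), and the function names are hypothetical. It only produces the per-node CPU lists that you would then feed into whatever pool or pinning configuration you choose; it does not configure anything itself.

```python
from collections import defaultdict


def cpus_by_node(cpu_to_node: dict[int, int]) -> dict[int, list[int]]:
    """Group logical CPU ids by the NUMA node they belong to."""
    nodes: dict[int, list[int]] = defaultdict(list)
    for cpu, node in sorted(cpu_to_node.items()):
        nodes[node].append(cpu)
    return dict(nodes)


def cpu_list(cpus: list[int]) -> str:
    """Render CPU ids as a comma-separated list, e.g. '0,1,2,3'."""
    return ",".join(str(c) for c in cpus)


if __name__ == "__main__":
    # Hypothetical dual-socket host: CPUs 0-3 on node 0, CPUs 4-7 on node 1.
    example = {cpu: (0 if cpu < 4 else 1) for cpu in range(8)}
    for node, cpus in cpus_by_node(example).items():
        print(f"node {node}: {cpu_list(cpus)}")
```

Real hosts do not always number CPUs contiguously per node, which is why the sketch groups by an explicit mapping rather than assuming ranges.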

The links listed before have good information about NUMA and CPU pinning. Below are some more:

Here is an older link about Xen on NUMA machines.
Here is a link about Xen CPU pools.
Here is a link about performance improvements on an AMD EPYC CPU (mostly related to AMD cache design).

There are also APIs in the guest tools to allow the VM to request resources based on NUMA nodes.

If you start hard limiting VMs where/how they can run you may break migration and HA for your XCP pool.

planedrop

@plaidypus I think the real question you should ask yourself is: do your workloads actually need to worry about the extra latency from a NUMA node span?

If you're not doing something pretty darn extreme here, I don't think it really will matter. There has been lots of talk for decades about this on VMware, but I just don't think it's that relevant anymore. Latency between sockets has gotten pretty good, so unless it's some really special workload, I don't think you should worry about this much.

The best way to validate is to just test and see how things go.

But do you have info on what workload this VM will be running?

plaidypus

@planedrop One workload is a .NET application that regularly uses 80-100% of 26 vCPUs, but unfortunately, I don't know much about the application as I am not on the development team. The other is for ElasticSearch. The latter is not much of an issue anymore (I convinced them to reduce the number of vCPUs as we showed it rarely uses that much).

planedrop

@plaidypus High CPU usage doesn't necessarily mean that NUMA spanning will be much of an issue; it really comes down to latency at that point. I'd say you should be OK to just go with it. I get the hesitation, though.

plaidypus

@planedrop Thanks for the information! Back to the more general idea of the wide VMs, I think it was originally more of an efficiency issue. Our Support team noticed high CPU usage in the VMs, but the pCPU and overall host usage were very low.

Turns out we had stacked multiple heavily utilized wide VMs on the same hosts. After looking at the stats, there was so much co-stop that they were wasting a lot of time trying to co-schedule the vCPUs. After spreading out the wide VMs, we actually saw the hosts consume more CPU overall, and the performance issues went away.

With us getting a fresh start on a new hypervisor, instilling a desire for right-sizing VMs and scaling out versus up will probably be the way to go.

Thanks again for all your help!

planedrop

@plaidypus Ah gotcha, this makes sense.

I second scaling out instead of up.

If you're getting new hosts, I'd also keep in mind that newer CPUs have much higher per-core performance (not sure what your current stuff is), so you might be able to get away with fewer vCPUs and a lower likelihood of NUMA spanning.

Either way though I think scaling out is the better direction to go.