Wide VMs on XCP-ng
-
Good Morning Everyone,
We are currently in the midst of testing XCP-ng as a replacement for our vSphere infrastructure. I have a general idea of how wide VMs work in VMware, and which settings should be in place for wide VMs that are larger than a NUMA node on our servers.
Are there any recommendations for VMs on XCP-ng that span NUMA nodes? Any advice or information would be greatly appreciated.
Thanks!
-
I think @dthenot or @TeddyAstie could provide some details. I would advise opening a support ticket to get a more detailed answer, though (it depends on the specifics of your infrastructure and such).
-
@plaidypus I don't know a lot about NUMA on Xen, but we have a part in the docs regarding that
https://docs.xcp-ng.org/compute/#numa-affinity
And also other documentation on the subject:
https://xapi-project.github.io/new-docs/toolstack/features/NUMA/index.html
There was a design session regarding NUMA at the latest Xen Summit: https://youtu.be/KoNwEYMlhyU?list=PLQMQQsKgvLnvjRgDnb-5T51e1kGHgs1SO
-
I haven't had any issues with NUMA node balancing, do you need this for a specific high performance application? Generally speaking it should "just work".
There are some things you can do to optimize but I'd only go down that road if you are running into issues with very wide VMs.
Also, are we talking cross socket NUMA nodes (e.g. multiple CPUs) or just nodes within an EPYC CPU?
-
@planedrop This is for wide VMs spanning across socket NUMA nodes. These are all dual-socket servers.
I was hoping there were some best practices for topology or anything like that. I don't have any desire/intention to pin the vCPUs onto a specific NUMA node. Coming from vSphere, I was instructed, if at all possible, not to have more vCPUs on a VM than there are pCPUs on a NUMA node. That way you can let the scheduler handle it optimally.
-
@plaidypus Yes, it just works. The guest VM does not see NUMA info and just sees one node even if the config is set for multi socket.
You can't assign more vCPUs than the host has hardware threads, but you can assign all of them to one VM. There will be some performance penalty, as the cores in use may access memory on another node.
If you have dual 16-core CPUs with HT enabled, then you could assign 64 vCPUs to a VM.
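As a rough sketch of that arithmetic (Python, with made-up names such as `HostTopology` and `fits_in_one_node`, and assuming one NUMA node per socket): it only counts threads and cores. It says nothing about where Xen will actually schedule the vCPUs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HostTopology:
    """Hypothetical description of a host; one NUMA node per socket is assumed."""
    sockets: int
    cores_per_socket: int
    threads_per_core: int  # 2 with HT/SMT enabled, 1 otherwise

    @property
    def threads_per_node(self) -> int:
        # Hardware threads available inside a single NUMA node
        return self.cores_per_socket * self.threads_per_core

    @property
    def total_threads(self) -> int:
        # Upper bound on the vCPUs a single VM can be given
        return self.sockets * self.threads_per_node


def fits_in_one_node(topo: HostTopology, vcpus: int, count_threads: bool = True) -> bool:
    """Return True if `vcpus` is no larger than one node's thread (or core) count."""
    limit = topo.threads_per_node if count_threads else topo.cores_per_socket
    return vcpus <= limit


host = HostTopology(sockets=2, cores_per_socket=16, threads_per_core=2)
print(host.total_threads)                                # 64
print(fits_in_one_node(host, 32))                        # True (counting threads)
print(fits_in_one_node(host, 32, count_threads=False))   # False (counting cores only)
```

So on this example host, a 32 vCPU VM is the same size as one node's thread count, but twice its physical core count.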
-
@Andrew Per your example, if the NUMA size is 16 cores with 32 threads, would a 32 vCPU VM fit inside the NUMA node or would that spread them out across both nodes?
-
@plaidypus I am not 100% sure what the correct answer is for the default XCP configuration. I think the basic answer is: no. Xen/XCP does not care what cores it uses for your VM. So on average your performance will be a little worse than not crossing NUMA nodes but better than always interleaving NUMA nodes. Some systems will be better/worse than others.
The Xen/XCP hypervisor does have a NUMA-aware scheduler. There are two basic modes. One is CPU hard pinning, where you specify which cores a VM (domain) uses. This forces the VM to use only the cores it is assigned. The other is to let Xen/XCP do its own work, where it tries to schedule a VM's (domain's) core use within a single CPU pool. The problem with this is that the default config puts all cores (and HT threads) in a single default pool. There are some options to try to enable best-effort NUMA assignment, but I believe it is not set that way by default.
You can configure CPUs of a NUMA node into an individual pool (see below). A VM can be set for an affinity for a single pool (soft CPU pinning). This would keep most of the work on that single node as you want.
The links listed earlier have good information about NUMA and CPU pinning. Below are some more:
Here is an older link about Xen on NUMA machines.
Here is a link about Xen CPU pools.
Here is a link about performance improvements on an AMD EPYC CPU (mostly related to AMD cache design).
There are also APIs in the guest tools to allow the VM to request resources based on NUMA nodes.
If you start hard-limiting where/how VMs can run, you may break migration and HA for your XCP pool.
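For the hard-pinning mode mentioned above, a CPU list per NUMA node is what you would end up writing out. Here is a minimal Python sketch that builds one. The function name `node_cpu_mask` is made up, and it assumes CPU IDs are numbered contiguously per node, which is not guaranteed; check the real layout first (for example with `xl info -n`). As far as I know, a comma-separated list like this is the kind of value the XCP-ng docs pass to `xe vm-param-set` as `VCPUs-params:mask`, but verify against the docs linked earlier before relying on it.

```python
def node_cpu_mask(node: int, threads_per_node: int) -> str:
    """Build a comma-separated CPU list for one NUMA node.

    Assumption: node N owns CPU IDs N*threads_per_node .. (N+1)*threads_per_node - 1.
    Real hosts may number CPUs differently, so confirm the topology first.
    """
    if node < 0 or threads_per_node <= 0:
        raise ValueError("node must be >= 0 and threads_per_node must be > 0")
    first = node * threads_per_node
    return ",".join(str(cpu) for cpu in range(first, first + threads_per_node))


# Small example: 4 threads per node
print(node_cpu_mask(0, 4))  # 0,1,2,3
print(node_cpu_mask(1, 4))  # 4,5,6,7
```

The same caveat applies: a VM pinned like this is tied to that host's CPU layout, which is where the migration and HA problems come from.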
-
@plaidypus I think the real question you should ask yourself is: do your workloads actually need to worry about the extra latency from a NUMA node span?
If you're not doing something pretty darn extreme here, I don't think it really will matter. There has been lots of talk for decades about this on VMware, but I just don't think it's that relevant anymore. Latency between sockets has gotten pretty good, so unless it's some really special workload, I don't think you should worry about this much.
The best way to validate is to just test and see how things go.
But do you have info on what workload this VM will be running?