NUMA-impact - Xeon/Epyc - 1P vs 2P
-
Hi!
I need to size some systems, and there are often comments about the advantage of EPYC CPUs over Xeon CPUs (both latest gen).
But there are some things I do not really understand:
How big is the impact of NUMA effects with these CPUs?
Example:
Taking one Xeon CPU with 32C --> no NUMA, no Problem
Taking two Xeon CPUs with 16C each --> 2 NUMA domains. --> Would this configuration be slower than the 1P system, although the TDP per CPU is lower and the potential CPU frequency is therefore higher?
More interesting: EPYC
Taking one EPYC with 32C can lead to 1 NUMA domain
Taking one EPYC with 32C can lead to 8 NUMA domains with CCX as NUMA
--> Would an 8C EPYC 72F3 be slower than a 75F3, even if only 4 cores are under load, because on the 75F3 those cores could share L3 cache despite its lower frequency?
So the resulting question:
Without considering the costs: How can you get the best performance on a 32C system?
- 2x Xeon 6346
- 1x Xeon 8358
- 2x EPYC 73F3 with/without NUMA per CCX
- 1x EPYC 75F3 with/without NUMA per CCX
...where rarely more than 50% of the cores are under load...
Thank you for your thoughts
KPS
Interesting sources:
https://wiki.tnonline.net/w/Blog/Xen_performance_using_NUMA_on_EPYC_CPUs
https://dl.dell.com/manuals/common/dell-emc-dfd-numa-amd-epyc-2ndgen.pdf
-
There is no universal answer, because it mostly depends on your VM load and what you expect. As usual, my advice is to keep it simple if you don't have a problem (i.e. you are satisfied with the performance). Even a default EPYC configuration will very likely be better than a Xeon one.
After that, if you want to go deeper and learn the details, it's OK, let me just ping @tjkreidl who did a remarkable job (if I remember correctly) on this very topic.
-
I'd say that the EPYCs would be better than the Xeons no matter the NUMA per CCX setting. But I would suggest looking at EPYC Genoa, which is 2-3x faster for the same cost as the previous EPYCs! Absolutely amazing performance and value.
https://www.phoronix.com/review/amd-epyc-9654-9554-benchmarks
https://www.phoronix.com/review/amd-epyc-9374f
Xen supports NUMA scheduling. The issue, I think, is whether the VM fits within a NUMA node or not, and whether the application and/or guest VM understands NUMA in a guest environment. The only way to know is to properly benchmark the specific application with NUMA per CCX on and off, using the number of cores you think the VM will need.
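To make the "does the VM fit within a NUMA node" question a bit more concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: the 32-core / 256 GiB host, the even split of cores and memory across nodes, and the names NumaNode, VM, split_host and fits_in_node are hypothetical and not part of Xen or any vendor tool. It only does the sizing arithmetic and says nothing about real performance, which still has to be benchmarked.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NumaNode:
    cores: int       # physical cores that belong to this node
    memory_gib: int  # memory local to this node


@dataclass(frozen=True)
class VM:
    name: str
    vcpus: int
    memory_gib: int


def split_host(total_cores: int, total_memory_gib: int, nodes: int) -> list[NumaNode]:
    """Split a host evenly into NUMA nodes (a simplifying assumption)."""
    return [NumaNode(total_cores // nodes, total_memory_gib // nodes)
            for _ in range(nodes)]


def fits_in_node(vm: VM, node: NumaNode) -> bool:
    """True if both the vCPUs and the memory of the VM fit into one node."""
    return vm.vcpus <= node.cores and vm.memory_gib <= node.memory_gib


if __name__ == "__main__":
    vm = VM("app-server", vcpus=8, memory_gib=24)  # hypothetical VM
    for node_count in (1, 2, 8):
        node = split_host(32, 256, node_count)[0]
        print(f"{node_count} node(s), {node.cores} cores each: "
              f"fits={fits_in_node(vm, node)}")
```

In this toy model an 8-vCPU VM fits into one node with 1 or 2 NUMA domains on a 32C host, but not with 8 domains of 4 cores each, which is the case where a benchmark with NUMA per CCX on and off becomes interesting.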
-
@Forza
Thank you for your answer. If I take the "EPYC path": Did you ever see a 2P system that is slower than its 1P counterpart?
-
@KPS I have no experience of a 2P system so far, so I cannot say : (
-
@olivierlambert said in NUMA-impact - Xeon/Epyc - 1P vs 2P:
There is no universal answer, because it mostly depends on your VM load and what you expect. As usual, my advice is to keep it simple if you don't have a problem (i.e. you are satisfied with the performance). Even a default EPYC configuration will very likely be better than a Xeon one.
After that, if you want to go deeper and learn the details, it's OK, let me just ping @tjkreidl who did a remarkable job (if I remember correctly) on this very topic.
Thanks for the mention, @olivierlambert! Here's a link to part 3, which contains links back to parts 1 and 2. Note that NUMA will affect EPYC processors differently, as they changed the die configuration at one point with the number of cores. I'm open to any questions on this topic.
https://blogs.mycugc.org/2019/04/30/a-tale-of-two-servers-part-3-the-influence-of-numa-cpus-and-sockets-cores-persocket-plus-other-vm-settings-on-apps-and-gpu-performance/
-
Ah yes, that was exactly this great article I had in mind!
-
@tjkreidl
Hi Tobias! Nice to see your answer. We had a call about 10 years ago about XenServer.
Thank you for your analysis. This topic seems to be much more complicated than I had hoped. In your tests, adding a second socket never led to lower performance than the 1P system.
In theory, your 8-vCPU test should be faster if it does not need to access e.g. the memory of the second CPU, but in real life this does not seem to be so relevant...
What would be your "what to buy" recommendation today?
-
Some more resources:
https://developer.amd.com/wp-content/resources/56827-1-0.pdf chapter 2.5 NUMA and CCX/CCD
-
@KPS said in NUMA-impact - Xeon/Epyc - 1P vs 2P:
@tjkreidl
Hi Tobias! Nice to see your answer. We had a call about 10 years ago about XenServer.
Thank you for your analysis. This topic seems to be much more complicated than I had hoped. In your tests, adding a second socket never led to lower performance than the 1P system.
In theory, your 8-vCPU test should be faster if it does not need to access e.g. the memory of the second CPU, but in real life this does not seem to be so relevant...
What would be your "what to buy" recommendation today?
Hey, @KPS! Nice to hear from you and, yes, it's a pretty complex interaction of pieces that makes tuning so hard. There are whole books on tuning I've seen, some going way back to Digital Equipment Corporation VAX machines.
As to recommendations, especially if you have a lot of external storage I/O, I'd opt for CPUs with no less than 3.0 GHz clock speeds and a fair amount of internal cache, as loads are also going to be potentially bottlenecked there. As to CPU-caused NUMA effects, as my tests show, how large the effect is can vary. Note also, as mentioned in one of the articles, that the order in which you start VMs can make a big difference; those that are more affected by NUMA should be launched first to better ensure they are contained on one of the physical CPU modules and its associated memory banks.
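The effect of start order can be illustrated with a toy Python model. This is only a sketch under loud assumptions: it uses a simple first-fit rule on free cores per node, which is not Xen's actual placement algorithm, it ignores memory and over-provisioning, and the VM names, sizes and the place function are hypothetical.

```python
def place(vms, node_cores):
    """First-fit placement of VMs onto NUMA nodes, in the given start order.

    vms        -- list of (name, vcpus) tuples, in start order
    node_cores -- list with the number of free cores per node
    Returns a dict name -> node index, or None if the VM fits in no single
    node and would have to span nodes.
    """
    free = list(node_cores)  # copy so the caller's list is not modified
    placement = {}
    for name, vcpus in vms:
        for node, cores in enumerate(free):
            if vcpus <= cores:
                free[node] -= vcpus
                placement[name] = node
                break
        else:
            # No single node has enough free cores left (toy model: a
            # spanning VM does not reduce the free-core count of any node).
            placement[name] = None
    return placement


if __name__ == "__main__":
    nodes = [16, 16]  # hypothetical 2P host with 16 cores per socket
    # "db" stands for the NUMA-sensitive VM in this made-up example.
    print("sensitive VM last: ", place([("a", 10), ("b", 10), ("db", 12)], nodes))
    print("sensitive VM first:", place([("db", 12), ("a", 10), ("b", 10)], nodes))
```

In this model the NUMA-sensitive VM only stays within one node when it is started first; when it is started last, the remaining free cores are scattered over both nodes and it has to span them.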
Generally, each system is unique enough that it may entail a lot of experimentation to find the best settings. And don't forget to check your BIOS settings as well, to see how they are configured. Hyperthreading is quite a controversial topic, too, and I'd just put in my $0.02 worth to say that for us it helped a lot, since our CPUs were over-provisioned by something like a factor of six because we were running a lot of XenDesktop VMs.
In short, get the fastest processors and memory you can afford!
-
Also something to keep in mind: It's not only about NUMA (which is different since 2nd-gen EPYC, as they have all memory channels on an I/O die and only split the caches now), it's also about memory bandwidth!
So it adds more complexity and depends on the needs of your workload.
If your workload benefits from high memory bandwidth, a 2nd socket doubles it (technically)!
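As a rough back-of-the-envelope illustration of the "doubles it (technically)" point, here is the usual peak-bandwidth arithmetic in Python. The figures are assumptions, not measurements: 8 memory channels per socket, DDR4-3200 and 8 bytes per transfer are assumed here, so check the spec sheet of the actual CPU. The function name peak_bandwidth_gbs is made up for this sketch.

```python
def peak_bandwidth_gbs(sockets, channels_per_socket, mega_transfers_per_s,
                       bytes_per_transfer=8):
    """Theoretical peak memory bandwidth in GB/s (1 GB = 10**9 bytes).

    MT/s * bytes per transfer gives MB/s per channel; dividing by 1000
    converts the total to GB/s.
    """
    total_mb_per_s = (sockets * channels_per_socket
                      * mega_transfers_per_s * bytes_per_transfer)
    return total_mb_per_s / 1000


if __name__ == "__main__":
    # Assumed configuration: 8 channels of DDR4-3200 per socket.
    one_socket = peak_bandwidth_gbs(1, 8, 3200)
    two_sockets = peak_bandwidth_gbs(2, 8, 3200)
    print(f"1P: {one_socket:.1f} GB/s, 2P: {two_sockets:.1f} GB/s")
```

This is only the theoretical aggregate. A VM that stays on one socket still sees roughly the local figure, and accesses to the other socket's memory go over the inter-socket link, so the real gain depends on how the load is spread.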