HPC with 2x64core (256 threads) possible with XCP-ng?
If you can, please provide feedback. We'll be happy to learn if there are any problems!
Does anyone else have any experience with HPC on this scale with XCP-ng/Xen/Xenserver?
In the testing we did on the EPYC, we saw that the best performance is gained when only physical cores are allocated to the VM. So giving 24 cores to the VM was faster than giving 48 virtual threads. I suspect that the bottleneck is RAM bandwidth. The simulation uses about 100-200GB RAM (in these tests). I am not sure how it would scale with a dual-CPU setup (and thus a NUMA situation).
We did the tests on an older dual Xeon CPU workstation (not virtualised) with 512GB RAM. The software seems to detect hyperthreading and only uses half of the available threads. This detection did not happen when we ran it in a VM, which might explain the results.
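One quick way to see what the guest kernel actually detects is to compare the CPU topology inside the VM with the bare-metal box. This is a generic sketch using standard Linux tools; on XCP-ng guests the vCPUs are typically presented as flat single-thread cores, which would explain why the software's SMT detection doesn't kick in:

```shell
# Show sockets/cores/threads as the kernel sees them.
# In a VM you'll usually see "Thread(s) per core: 1" even on an SMT host.
lscpu | grep -E '^(Socket|Core|Thread|CPU\(s\))'

# Simple count of logical CPUs visible to the guest
nproc
```

If the VM reports 1 thread per core, the software has no way to halve the thread count the way it does on the bare-metal workstation.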
For HPC, you might want to use CPU pinning or things like that. The flexibility of virtualization is maybe not required when squeezing out maximum performance is the key parameter.
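For reference, pinning on XCP-ng can be done with the standard `xe` CLI. A minimal sketch (the UUID and the CPU list are placeholders; pick physical CPUs belonging to one NUMA node on your host):

```shell
# Pin the VM's vCPUs to a fixed set of physical CPUs.
# Replace <vm-uuid> with the UUID from `xe vm-list`.
xe vm-param-set uuid=<vm-uuid> VCPUs-params:mask=0,1,2,3,4,5,6,7

# Verify the setting took effect
xe vm-param-get uuid=<vm-uuid> param-name=VCPUs-params
```

Keeping the mask within a single NUMA node avoids cross-node memory traffic, which matters if RAM bandwidth is indeed the bottleneck.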
The question relates to effective use of expensive hardware. Virtualizing it makes more possible: when simulations aren't running, other VMs can use that host.
But do we know if more than 64 threads are a possibility with xcp-ng?
The VM will run, yes. It's not well-known territory, though, which is why I'm asking for feedback.
@Forza Take a look:
At the time of this topic, I remember asking a coworker to boot a CentOS 7.9 VM with more than 64 vCPUs on a 48C/96T Xeon server. The VM started normally, but it didn't recognize the vCPUs > 64.
I've not tested the VM param `platform:acpi=0` as a possible solution, or its trade-offs. In the past, some old RHEL 5.x VMs without ACPI support would simply power off (like pulling the power cord) instead of doing a clean shutdown on a vm-shutdown command.
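For anyone who wants to experiment with it, the parameter is set per VM with `xe` while the VM is halted (sketch only; `<vm-uuid>` is a placeholder, and remember the trade-off above: non-ACPI guests can't do a clean guest-initiated shutdown):

```shell
# Disable ACPI for a VM. The VM must be halted before changing platform params.
xe vm-shutdown uuid=<vm-uuid>
xe vm-param-set uuid=<vm-uuid> platform:acpi=0
xe vm-start uuid=<vm-uuid>
```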
Regarding that CFD software, does it support a worker/farm design? vGPU offload? I'm not an HPC expert, but considering the EPYC MCM architecture, instead of one big VM, spreading the workload across many workers pinned to each CCD (or each NUMA node in an NPS4 config) may be interesting.
Before buying those monsters, I would ask AMD to deploy a PoC using the target server model. For such demands, it's very important to do some sort of certification/validation.
@tuxen thank you. Those are very valuable thoughts. There is a remote render mode that can be used to render on a farm of nodes. The problem is making models that scale well in such a configuration. This is why we started with a Xeon workstation many years ago, but I do agree that it might be worth looking at this option again! The cost for render licensing is also higher than that of the hardware which is another factor. Maybe it's possible to rent some cloud hw space and test.
I run a few Linux HPC clusters; one particular platform runs on XCP-ng VMs using AMD EPYCs. We see a maximum of 64 vCPUs recognised by the VM on XCP-ng. You can assign more, but they are not visible from the VM OS. There also seems to be a maximum RAM size per VM, which is 512GB if memory serves (no pun intended).
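For reference, this is how the vCPU count is raised on a halted VM (whether the guest actually sees anything past 64 is exactly the open question here; `<vm-uuid>` is a placeholder):

```shell
# Set the maximum first, then the startup count (startup must be <= max).
xe vm-param-set uuid=<vm-uuid> VCPUs-max=96
xe vm-param-set uuid=<vm-uuid> VCPUs-at-startup=96

# After booting the VM, check inside the guest what is actually visible:
#   nproc
```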
With regard to your other question - 7773X vs 7204 - on linux I would suspect that most codes would run the same binaries on both, but you may see a performance hit if an optimized binary wasn't compiled against one or the other cpu. But of course, there could just as easily be lots of other reasons for differences in performance across these boxes.
@mersper What kind of VMs do you use with this, and how do you think the performance scales to 64x cores?
@Forza , Virtualization is PVHVM, running RockyLinux8. VM root disk is local to the host on RAID10 SAS drives, CPUs are dual-socket AMD EPYC 7552 48C/96T, and 512GB physical RAM. And no GPU.
Flexibility is prioritised over performance on this cluster - it's used for undergraduate teaching and projects. We don't do cpu-pinning for instance.
We typically run bioinformatics and Molecular Dynamics codes. If we look at MD codes (high CPU, low RAM, low I/O), they scale as expected up to 64 cores - I'm pretty happy with the performance. But having said that, I haven't compared directly with bare metal.
@mersper Thank you. I will re-think the setup. Having 256 threads in one VM isn't possible, it seems. I have scheduled a meeting with the software manufacturer to talk about network rendering etc. It might be better to have several VMs with pinned CPUs and run render jobs. I'll update on the progress.