XCP-ng

    More than 64 vCPU on Debian11 VM and AMD EPYC

Category: Compute · 35 posts · 7 posters · 3.9k views
alexredston

I'm getting stuck with this too - on a Debian 11 VM, on a DL580 with 4 x Xeon E7-8880 v4 and 3 Samsung 990 Pro 4TB drives in RAID 1.

Effectively the XCP-ng host has 176 "cores", i.e. counting hyperthreading, but I'm only able to use 64 of them. I was also only able to configure the VM with 120 vCPUs, as 30 cores x 4 sockets (the physical machine has 4 sockets), but I think only 64 actually work.

So I'm compiling AOSP. For a clean build the VM sits at max CPU for 30 minutes, and I would dearly like to reduce that time, since even a tiny change can trigger a long compile, so progress is painfully slow. The other issue is the linking phase of this build: I'm only seeing 7000 IOPS on the last-10-minutes display. I realize this may under-read because the traffic can be quite "bursty", but with 3 mirrored Samsung 990 Pro drives I would expect more. That makes this part heavily disk-bound; the overall process takes 70 minutes.

olivierlambert (Vates 🪐 Co-Founder & CEO)

If you are heavily relying on disk perf, either:

1. use multiple VDIs and RAID0 them (you'll get more than double the performance, because tapdisk is single-threaded) - see the sketch after this list
2. PCI passthrough a drive to the VM
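
A minimal sketch of option 1 inside the guest, assuming the extra VDIs show up as /dev/xvdb, /dev/xvdc and /dev/xvdd (check with lsblk; the mount point is arbitrary):

    # stripe three VDIs into a single RAID0 array
    mdadm --create /dev/md0 --level=0 --raid-devices=3 /dev/xvdb /dev/xvdc /dev/xvdd
    mkfs.ext4 /dev/md0
    mount /dev/md0 /mnt/build
    # persist the array definition so it assembles on boot
    mdadm --detail --scan >> /etc/mdadm/mdadm.conf
    update-initramfs -u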
POleszkiewicz

          @olivierlambert said in More than 64 vCPU on Debian11 VM and AMD EPYC:

          If you are heavily relying on disk perf, either:

1. use multiple VDIs and RAID0 them (you'll get more than double the performance, because tapdisk is single-threaded)
          2. PCI passthrough a drive to the VM

Another option is to do NVMe-oF plus SR-IOV on the NIC: performance is pretty similar to bare-metal PCI passthrough, yet one NVMe drive can be divided between VMs (if it supports namespaces) and you can attach NVMe from more than one source to the VM (for redundancy).
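
As a rough sketch of the initiator side inside the VM, using nvme-cli over TCP (the address, port and NQN are placeholders; an RDMA fabric over the SR-IOV VF works the same way with -t rdma):

    # load the NVMe over TCP initiator
    modprobe nvme-tcp
    # list the subsystems exported by the target
    nvme discover -t tcp -a 192.168.10.10 -s 4420
    # connect to one subsystem; it shows up as a regular /dev/nvmeXnY device
    nvme connect -t tcp -a 192.168.10.10 -s 4420 -n nqn.2014-08.org.example:build-disk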

olivierlambert (Vates 🪐 Co-Founder & CEO)

            DPU is also an option (it's exactly what we do with Kalray's DPUs)

TodorPetkov

@alexredston What kernel do you use? Can you show the kernel boot parameters (/proc/cmdline)? In our case we used the Debian 11 image from their website, which had the cloud kernel and acpi=on by default. Once we switched to the regular kernel and turned off ACPI, we saw all the vCPUs in the VM.
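
For example, a quick way to check (a sketch; package names assume Debian 11):

    cat /proc/cmdline    # boot parameters the running kernel was started with
    uname -r             # a "-cloud-amd64" suffix means the cloud kernel is in use
    # install and boot the regular kernel instead of the cloud one
    apt install linux-image-amd64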

POleszkiewicz

@olivierlambert What exactly do you support from Kalray? Could you tell us more?

olivierlambert (Vates 🪐 Co-Founder & CEO)

                  https://xcp-ng.org/blog/2021/07/12/dpus-and-the-future-of-virtualization/

                  https://xcp-ng.org/blog/2021/12/20/dpu-for-storage-a-first-look/

POleszkiewicz

@olivierlambert Interesting, but where is the benefit over NVMe-oF + SR-IOV, which is doable on a Mellanox CX3 or, better, CX5 and up? Offloading dom0 work to specialized hardware is interesting, but what I see in these articles is basically equivalent to connecting to an NVMe-oF target via an SR-IOV NIC, which has been doable for quite a while without any changes in XCP-ng?

olivierlambert (Vates 🪐 Co-Founder & CEO)

It's using local NVMe drives and splitting them, so no need for external storage (but you can also use remote NVMe as in -oF, and potentially multiple hosts in HCI mode).

POleszkiewicz

@olivierlambert With NVMe-oF I can split them easily too (one target per namespace), and I actually get redundancy compared to a local device (connect to two targets on different hosts and RAID1 them in the VM). Some newer NVMe drives support SR-IOV natively too, so no additional hardware would be needed to split one and pass it through to VMs (I did not test this, though). I'm not sure of the price of those cards, but CX3s are really cheap, while CX5/6 are getting more affordable too.
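
The mirror-in-the-VM part could look like this (a sketch only; /dev/nvme1n1 and /dev/nvme2n1 are assumed to be namespaces connected from two different target hosts):

    # mirror two remote namespaces inside the guest for redundancy
    mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/nvme1n1 /dev/nvme2n1
    mkfs.ext4 /dev/md0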

olivierlambert (Vates 🪐 Co-Founder & CEO)

If you can afford dedicated storage, sure 🙂 For local storage, a DPU is a good option (and it should be less than €1.5k per card, probably less).

alexredston

@olivierlambert @POleszkiewicz Thanks to you both for all of these ideas - I will have a go at changing the kernel and moving the NVMe drives to passthrough in the first instance. Will report back on results.

alexredston

                              @TodorPetkov Top tip! Thank you - going to try this out

alexredston

@TodorPetkov That was very helpful. I've added acpi=off to GRUB and I am now able to get 128 "CPUs" running, which is double.
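
For anyone following along, the change amounts to something like this inside the guest (a sketch; your existing /etc/default/grub options will differ):

    # /etc/default/grub
    GRUB_CMDLINE_LINUX_DEFAULT="quiet acpi=off"
    # regenerate the GRUB config and reboot
    update-grub
    reboot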

When I go beyond this I get the following error when attempting to start the VM:

    INTERNAL_ERROR(xenopsd internal error: Xenctrl.Error("22: Invalid argument"))

For reference, lscpu inside the VM reports:

                                Architecture: x86_64
                                CPU op-mode(s): 32-bit, 64-bit
                                Byte Order: Little Endian
                                CPU(s): 128
                                On-line CPU(s) list: 0-127
                                Thread(s) per core: 1
                                Core(s) per socket: 32
                                Socket(s): 4
                                NUMA node(s): 1
                                Vendor ID: GenuineIntel
                                CPU family: 6
                                Model: 79
                                Model name: Intel(R) Xeon(R) CPU E7-8880 v4 @ 2.20GHz
                                Stepping: 1
                                CPU MHz: 2194.589
                                BogoMIPS: 4389.42
                                Hypervisor vendor: Xen
                                Virtualization type: full
                                L1d cache: 32K
                                L1i cache: 32K
                                L2 cache: 256K
                                L3 cache: 56320K
                                NUMA node0 CPU(s): 0-127

Going to move some stuff around and try passthrough for the M.2 drives next, as IOPS is now the biggest performance barrier for this particular workload.
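
The passthrough steps I'm planning to follow are roughly these, going by my reading of the XCP-ng docs (the PCI address is a placeholder; find the real ones with lspci on the host):

    # on the host: hide the NVMe controller from dom0, then reboot the host
    /opt/xensource/libexec/xen-cmdline --set-dom0 "xen-pciback.hide=(0000:03:00.0)"
    # after the reboot, attach the hidden device to the (halted) VM
    xe vm-param-set other-config:pci=0/0000:03:00.0 uuid=<vm-uuid>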

alexredston

@olivierlambert Following a similar approach with multiple VDIs, but going RAID 1 as a 3-way mirror (integrity is critical): will I still see a similar read performance increase? I'm not so worried about the write penalty.

olivierlambert (Vates 🪐 Co-Founder & CEO)

Yes, since you'll read from multiple disks. You shouldn't see any difference in writes, though.

alexredston

@olivierlambert Interestingly, so far I've seen about a 40% increase in write performance and IOPS from adjusting the scheduler in dom0 by adding elevator=noop as a kernel parameter, and a further 10% from repeating the same on the VM.
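
Roughly what that looks like (a sketch; xvdb is a placeholder device, and on newer kernels using blk-mq the equivalent scheduler is "none" rather than "noop"):

    # check which scheduler a block device is currently using
    cat /sys/block/xvdb/queue/scheduler
    # switch it at runtime
    echo noop > /sys/block/xvdb/queue/scheduler
    # or make it persistent by adding elevator=noop to the kernel command line
    # (GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub, then update-grub and reboot)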

I'm going to experiment next with migrating the disks so that the mirror is built inside the VM from three separate virtual disks (VDIs) instead of in dom0. Then I may try other more radical approaches like passthrough.

olivierlambert (Vates 🪐 Co-Founder & CEO)

                                        That's a very nice increase. Indeed, noop is the best option for NVMe devices.

alexredston

                                          @olivierlambert will repeat on everything!

alexredston

@olivierlambert Thanks to everyone for the great advice. I've now managed a further, more than 20-fold increase by using PCI passthrough on the 3 NVMe drives. The machine is only PCIe 3.x, but I'm still getting 10.5 GB/s read in the fio test and just over 1 GB/s write.
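
For context, the fio run was along these lines (job parameters here are illustrative, not the exact command; /dev/md0 is the assumed target device):

    # large sequential reads against the array, bypassing the page cache
    fio --name=seqread --filename=/dev/md0 --rw=read --bs=1M --iodepth=32 \
        --ioengine=libaio --direct=1 --numjobs=4 --runtime=60 --time_based --group_reporting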

                                            My bottleneck for compiling is now once again the CPUs.

I seem to be unable to exceed 128 CPUs. I was hoping to assign more, as the host has 176, but it is struggling; at the moment my build is pinning those 128 at 100% CPU for 30 minutes, so more vCPUs could potentially offer a fairly significant improvement.

Overall I'm quite pleased to be squeezing this much performance out of some old HPE Gen9 hardware. I may look at adding another disk to the mirror, but at some point the write penalty may outweigh the excellent read performance. I've chosen slots to ensure each NVMe's PCIe lanes are connected to a different host CPU.

I may try another experiment with smaller PCIe devices and bifurcation to see if I can test the upper limits of the throughput. 9 slots to play with!
