I'll be the one investigating this further; we're trying to compile a list of CPUs and their behavior. First, thank you for your reports and tests, they were very helpful and already gave us some insight.
If some of you can help us cover more ground that would be awesome, so here is the ideal setup for testing, to get everyone on the same page:
- An AMD host, obviously
- 2 VMs on the same host, with the distribution of your choice¹
- each with 4 cores if possible
- 1GB of RAM should be enough if you don't have a desktop environment to load
¹: it seems some recent kernels do provide a slight boost, but in any case the performance is quite low for such high-grade CPUs.
²: iperf3 is single-threaded: the `-P` option will establish multiple connections, but all of them are processed in a single thread, so once that thread reaches 100% CPU usage, adding connections won't increase throughput much and won't help identify how the CPU scales. For example, on a Ryzen 5 7600 processor we see about the same low single-thread performance, but using multiple threads does scale, which does not seem to be the case for EPYC Zen1 CPUs.
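As a side note, if you want to check whether that single iperf3 thread is the limit on your machine, a workaround is to run several independent iperf3 processes instead of one process with `-P`. This is just a sketch, not part of the protocol below; the port numbers are arbitrary and `IP_VM1` is a placeholder for the server VM's address:

```sh
# On the server VM: one iperf3 server per port, daemonized with -D.
for port in 5201 5202 5203 5204; do
    iperf3 -s -D -p "$port"
done

# On the client: one iperf3 process per port, run in parallel.
# IP_VM1 is a placeholder, replace it with the server VM's address.
for port in 5201 5202 5203 5204; do
    iperf3 -c "$IP_VM1" -t 60 -p "$port" &
done
wait    # each process prints its own bitrate; sum them by hand
```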
- do not disable mitigations for now: disabling them only acts on the kernel side and mitigations remain active in Xen, from my testing it doesn't seem to help much anyway, and it would multiply the number of result combinations
- for each test, run `xentop` on the host, and try to get an idea of the top values for each domain while the test is running
- run `iperf -s` on VM1, and let it run (no `-P X`, as that would stop the server after X connections established)
- vm2vm 1 thread: on VM2, run `iperf -c <ip_VM1> -t 60`, note result for v2v 1 thread
- vm2vm 4 threads: on VM2, run `iperf -c <ip_VM1> -t 60 -P4`, note result for v2v 4 threads
- host2vm 1 thread: on the host, run `iperf -c <ip_VM1> -t 60`, note result for h2v 1 thread
- host2vm 4 threads: on the host, run `iperf -c <ip_VM1> -t 60 -P4`, note result for h2v 4 threads
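To make runs easier to reproduce, here is a minimal client-side sketch of the steps above. Assumptions on my side: `iperf -s` is already running on VM1, and the `IP_VM1` value below is a placeholder you need to replace. Run it on VM2 for the v2v numbers and on the host for the h2v numbers, with `xentop` watched on the host in another terminal:

```sh
#!/bin/sh
# Minimal sketch of the test sequence above.
# IP_VM1 is a placeholder address: replace it with VM1's actual IP.
IP_VM1="192.0.2.10"

echo "=== 1 thread ==="
iperf -c "$IP_VM1" -t 60        # note as v2v (from VM2) or h2v (from host)

echo "=== 4 threads ==="
iperf -c "$IP_VM1" -t 60 -P4    # same, with 4 parallel connections
```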
Here is an example of a report template:
- number of sockets:
- cpu pinning: yes (detail) / no (using automated settings)
- xcp-ng version:
- output of `xl info -n`, especially the `cpu_topology` section, in a code block
- distribution & version
- kernel version
- v2v 1 thread: throughput / cpu usage from xentop³
- v2v 4 threads: throughput / cpu usage from xentop³
- h2v 1 thread: throughput / cpu usage from xentop³
- h2v 4 threads: throughput / cpu usage from xentop³
³: I note the max I see while the test is running, in vm-client/vm-server/host order.
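If catching the peak values by eye is awkward, xentop can also be logged in batch mode and inspected afterwards. A sketch of that idea: the grep/sort part assumes CPU(%) is the 4th column of the batch output (this may differ between versions), and `my-vm` is a placeholder domain name:

```sh
# Sample xentop once per second for the 60s test window:
#   -b: batch mode (plain text on stdout), -d: delay, -i: iterations
xentop -b -d 1 -i 60 > xentop.log

# Highest CPU(%) samples for one domain ("my-vm" is a placeholder):
grep my-vm xentop.log | sort -n -k 4 | tail -n 3
```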
What was tested
Mostly for information, here are a few tests I ran which did not seem to improve performance.
- disabling the mitigations for various security issues at host and VM boot time using the kernel boot parameters `noibrs noibpb nopti nospectre_v2 spectre_v2_user=off spectre_v2=off nospectre_v1 l1tf=off nospec_store_bypass_disable no_stf_barrier mds=off mitigations=off`. Note this won't disable them at the Xen level, as there are patches that enable the fixes for the affected hardware with no flags to disable them.
- disabling AVX by passing `noxsave` in the kernel boot parameters, as there is a known issue on Zen CPUs where boosting is avoided when a core is under heavy AVX load; still no change.
- pinning: I tried using a single "node" in case the memory controllers are separated, I tried avoiding the "threads" on the same core, and I tried spreading the load across nodes; although it seems to give a slight boost, it is still far from what we should expect from such CPUs (see the pinning sketch after this list).
- XCP-ng 8.2 and 8.3-beta1: 8.3 seems a tiny bit faster, but it tends to jitter a bit more, so I would not deem that relevant either.
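For reference, here is roughly how the pinning variants can be expressed with `xl`; the domain name and CPU numbers are purely illustrative, the right ones depend on the `cpu_topology` section of your `xl info -n` output:

```sh
# Check the core/node layout first, to pick pCPUs on a single node
# (or one thread per core, depending on the variant being tested).
xl info -n

# Pin vCPUs 0-3 of a domain ("my-vm" is a placeholder) to pCPUs 0-3.
for v in 0 1 2 3; do
    xl vcpu-pin my-vm "$v" "$v"
done

# Verify the resulting placement.
xl vcpu-list my-vm
```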
I have not tested it myself, but @nicols tried VMware on the same machine that gives him about 3Gbps (as we all see): it went to ~25Gbps single threaded and about 40Gbps with 4 threads, and with Proxmox about 21.7Gbps (I assume single threaded). Both are much more in line with what I would expect this hardware to produce.
@JamesG tested Windows and Debian guests and got about the same results.
Although we do get a small boost by increasing threads (or connections in the case of iperf3), it is still far from what we can see on other setups with VMware or Proxmox.
Although Olivier's pool with a Zen4 desktop CPU scales a lot better than the EPYCs when increasing the number of threads, it still does not provide the results expected from such powerful CPUs in single thread (we do not even reach VMware's single-thread performance with 4 threads).
Although @Ajmind-0's tests show a difference between Debian versions, the results even on Debian 11 are still not on par with expectations.
Disabling AVX only provided an improvement on my home FX CPU; FX chips are known to not have real "threads" and share a compute unit between the 2 threads of a core, so that does make sense. (This is not shown in the table.)
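If you try the `noxsave` route yourself, a quick way to confirm it took effect in a guest is to check the advertised CPU flags: AVX depends on XSAVE, so the avx* flags should disappear. The grep pattern here is just one way to do it:

```sh
# AVX-related flags should no longer be listed when xsave is disabled.
grep -o 'avx[0-9a-z_]*' /proc/cpuinfo | sort -u

# Double-check the kernel command line actually in use.
cat /proc/cmdline
```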
It seems that memcpy in the glibc is not related to the issue: `dd if=/dev/zero of=/dev/null` has decent performance on these machines (1.2-1.3 GBytes/s). It's worth keeping in mind that both the kernel and Xen have their own memcpy implementations, so memcpy could play a small role in filling the ring buffer in iperf, but I feel like the libc memcpy() is not at play here.
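For reference, one way to run that dd check (the block size and count are my choice here, not a fixed protocol):

```sh
# Roughly measures in-kernel copy throughput between pseudo-devices;
# ~10 GB copied in total, dd prints the average rate at the end.
dd if=/dev/zero of=/dev/null bs=1M count=10000
```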
I'll update this table with new results, or maybe repost it in a further post.
Throughputs are in Gbit/s, noted as `G` for shorter table entries. CPU usages are given as (VMclient/VMserver/dom0), in percent as shown in xentop.
| cpu | v2v 1 thread | v2v 4 threads | h2v 1 thread | h2v 4 threads | comments |
|-----|--------------|---------------|--------------|---------------|----------|
| ? | 5.64 G (120/150/220) | 7.5 G (180/230/330) | 9.5 G (0/110/160) | 13.6 G (0/300/350) | not a zen cpu, no boost |
| ? | 4.6 G (110/180/250) | 6.08 G (180/220/300) | 7.73 G (0/150/230) | 11.2 G (0/320/350) | |
| Ryzen 5 7600 | 9.74 G (70/80/100) | 19.7 G (190/260/300) | ? | 33.9 G (0/310/350) | Olivier's pool, no boost |
| ? | 3.38 G (?) | 2.78 G (?) | 4.44 G (?) | 6.58 G (?) | |
| ? | 7.6 G (?) | 10.3 G (?) | 4.4 G (?) | ? | |
| ? | 1.16 G (16/17/??⁴) | 1.35 G (20/25/??⁴) | ? | ? | !xcp-ng, Xen 4.18-rc + suse 15 |
| ? | 5.70 G (100/140/200) | 10.4 G (230/250/420) | 10.7 G (0/120/200) | 15.8 G (0/320/380) | |
| Ryzen 9 5950x | 7.25 G (30/35/60) | 16.5 G (160/210/300) | 17.5 G (0/110/140) | 27.6 G (0/270/330) | |
⁴: xentop on this host shows 3200% on dom0 all the time; profiling does not seem to show anything actually using CPU, but it may be related to the extremely poor performance.
last updated: 2023-11-29 16:46
All help is welcome! For those of you who already provided tests that I integrated in the table, feel free not to rerun them: it looks like re-running under the exact protocol to provide more data points won't make much of a difference, and I don't want to waste your time!
Thanks again to all of you for your insight and your patience. It looks like this is going to be a deep rabbit hole; I'll do my best to get to the bottom of it as soon as possible.