Update published: https://xcp-ng.org/blog/2024/09/27/september-2024-security-updates/
Thank you for the tests!
Two new XSAs were published on the 30th of January.
Packages:
- xen-*

yum clean metadata --enablerepo=xcp-ng-testing
yum update "xen-*" --enablerepo=xcp-ng-testing
reboot
The usual update rules apply: pool coordinator first, etc.
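If you need a reminder of which host is the pool coordinator before starting, the standard xe CLI can tell you; this is only a convenience, not part of the update procedure:

```
# Show the pool's coordinator (master) UUID, then match it against the host list.
xe pool-list params=name-label,master
xe host-list params=uuid,name-label
```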
Versions:
- xen: 4.13.5-9.38.2.xcpng8.2

What to test: Normal use and anything else you want to test. If you are using PCI passthrough devices, that's even better, but we would also be glad to have confirmation from others that their normal use case still works as intended.
Test window before official release of the updates
2 days because of security updates.
A new XSA was published on the 23rd of January, so we have a new security update to include it.
Packages:
- kernel

yum clean metadata --enablerepo=xcp-ng-testing
yum update kernel --enablerepo=xcp-ng-testing
reboot
The usual update rules apply: pool coordinator first, etc.
Versions:
- kernel: 4.19.19-7.0.23.1.xcpng8.2

What to test: Normal use and anything else you want to test. The closer to your actual use of XCP-ng, the better.
Test window before official release of the updates
~2 days due to security updates.
Update published: https://xcp-ng.org/blog/2024/07/18/july-2024-security-updates/
Thank you everyone for your tests!
Two new XSAs were published on the 16th of July.
Packages:
- xen-*
- xapi, xsconsole

yum clean metadata --enablerepo=xcp-ng-testing
yum update "xen-*" "xapi-*" xsconsole --enablerepo=xcp-ng-testing
reboot
The usual update rules apply: pool coordinator first, etc.
Versions:
- xen: xen-4.13.5-9.40.2.xcpng8.2
- xapi: xapi-1.249.36-1.2.xcpng8.2
- xsconsole: xsconsole-10.1.13-1.2.xcpng8.2

What to test: Normal use and anything else you want to test.

Test window before official release of the updates: ~1 day because of security updates.
The update has been published, thanks for testing.
https://xcp-ng.org/blog/2024/02/02/february-2024-security-update/
The update has been published, thanks for the feedback and tests.
https://xcp-ng.org/blog/2024/01/26/january-2024-security-update/
Hello guys,
I'll be the one investigating this further; we're trying to compile a list of CPUs and their behavior. First, thank you for your reports and tests, that's very helpful and has already given us some insight.
If some of you can help us cover more ground, that would be awesome, so here is what the ideal testing protocol would look like, to get everyone on the same page:
yum install iperf
¹: it seems some recent kernels do provide a slight boost, but in any case the performance is pretty low for such high-grade CPUs.
²: iperf3 is single-threaded; the -P option will establish multiple connections, but it processes all of them in a single thread, so once it reaches 100% CPU usage it won't get much of an increase and won't help identify the scaling on such a CPU. For example, on a Ryzen 5 7600 processor we get about the same low performance, but using multiple threads does scale, which does not seem to be the case for EPYC Zen1 CPUs.
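If you want genuinely parallel iperf3 streams despite that limitation, one common workaround (not part of the protocol below, just a suggestion) is to run one iperf3 process per port:

```
# On VM1: start several iperf3 servers, one per port, in daemon mode.
for p in 5201 5202 5203 5204; do iperf3 -s -p "$p" -D; done

# On the client (VM2 or the host): one client process per port, run in parallel.
for p in 5201 5202 5203 5204; do iperf3 -c <ip_VM1> -t 60 -p "$p" & done; wait
```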
- xentop on the host, and try to get an idea of the top values of each domain while the test is running
- iperf -s on VM1, and let it run (no -P X, as this would stop after X connections established)
- iperf -c <ip_VM1> -t 60 from VM2, note the result for v2v 1 thread
- iperf -c <ip_VM1> -t 60 -P4 from VM2, note the result for v2v 4 threads
- iperf -c <ip_VM1> -t 60 from the host, note the result for h2v 1 thread
- iperf -c <ip_VM1> -t 60 -P4 from the host, note the result for h2v 4 threads

Here is an example of report template:
The output of xl info -n, especially the cpu_topology section, in a code block.

³: I note the max I see while the test is running, in vm-client/vm-server/host order.
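To make the runs less error-prone, here is a minimal sketch of how the client side could be scripted; the script itself and the idea of looping are mine, only the iperf invocations come from the protocol above:

```
#!/bin/sh
# Run the two client-side tests (1 thread, then 4 threads) against VM1 and print results.
# Run it once from VM2 for the v2v numbers and once from dom0 for the h2v numbers.
IP="$1"   # IP address of VM1 (the iperf server)
for args in "" "-P4"; do
    echo "=== iperf -c $IP -t 60 $args ==="
    iperf -c "$IP" -t 60 $args
done
```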
Mostly for information, here are a few tests I ran which did not seem to improve performance.
- Kernel boot parameters to disable mitigations: noibrs noibpb nopti nospectre_v2 spectre_v2_user=off spectre_v2=off nospectre_v1 l1tf=off nospec_store_bypass_disable no_stf_barrier mds=off mitigations=off. Note this won't disable them at the Xen level, as there are patches that enable the fixes for the related hardware with no flags to disable them.
- noxsave in the kernel boot parameters, as there is a known issue on Zen CPUs where boosting is avoided when a core is under heavy AVX load; still no changes.
- Not tested by myself, but @nicols tried VMware on the same machine that gives him about 3 Gbps (as we all see): it went to ~25 Gbps single-threaded and about 40 Gbps with 4 threads, and with Proxmox about 21.7 Gbps (I assume single-threaded), both of which are a lot more in line with what I would expect this hardware to produce.
@JamesG tested Windows and Debian guests and got about the same results.
Although we do get a small boost by increasing threads (or connections in the case of iperf3), it is still far from what we can see on other setups with VMware or Proxmox.
Although Olivier's pool with a Zen4 desktop CPU scales a lot better than the EPYCs when increasing the number of threads, it still does not provide the expected results for such powerful CPUs in single thread (we do not even reach VMware's single-thread performance with 4 threads).
Although @Ajmind-0's tests show a difference between Debian versions, the results even on Debian 11 are still not on par with what we would expect.
Disabling AVX only provided an improvement on my home FX CPU, which is known to not have real "threads" but to share a computing unit between the 2 threads of a core, so it does make sense (this is not shown in the table).
It seems that memcpy in glibc is not related to the issue: dd if=/dev/zero of=/dev/null has decent performance on these machines (1.2-1.3 GBytes/s). It's worth keeping in mind that both the kernel and Xen have their own implementations, so memcpy could play a small role in filling the ring buffer in iperf, but I feel like the libc memcpy() is not at play here.
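For anyone who wants to compare on their own hardware, a copy-throughput check along those lines could look like this (the block size and count are my own choice, not necessarily the exact command I used):

```
# Rough in-memory copy throughput test, in dom0 or in a guest (~10 GB copied).
dd if=/dev/zero of=/dev/null bs=1M count=10000
```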
I'll update this table with new results, or maybe repost it in a later post.
Throughputs are in Gbit/s, noted as G for shorter table entries.
CPU usages are given as (VMclient/VMserver/dom0) in percent, as shown in xentop.
user | cpu | family | market | v2v 1T | v2v 4T | h2v 1T | h2v 4T | notes |
---|---|---|---|---|---|---|---|---|
vates | fx8320-e | piledriver | desktop | 5.64 G (120/150/220) | 7.5 G (180/230/330) | 9.5 G (0/110/160) | 13.6 G (0/300/350) | not a zen cpu, no boost |
vates | EPYC 7451 | Zen1 | server | 4.6 G (110/180/250) | 6.08 G (180/220/300) | 7.73 G (0/150/230) | 11.2 G (0/320/350) | no boost |
vates | Ryzen 5 7600 | Zen4 | desktop | 9.74 G (70/80/100) | 19.7 G (190/260/300) | 19.2G (0/110/140) | 33.9 G (0/310/350) | Olivier's pool, no boost |
nicols | EPYC 7443 | Zen3 | server | 3.38 G (?) | | | | iperf3 |
nicols | EPYC 7443 | Zen3 | server | 2.78 G (?) | 4.44 G (?) | | | iperf2 |
nicols | EPYC 7502 | Zen2 | server | similar ^ | similar ^ | | | iperf2 |
JamesG | EPYC 7302p | Zen2 | server | 6.58 G (?) | | | | iperf3 |
Ajmind-0 | EPYC 7313P | Zen3 | server | 7.6 G (?) | 10.3 G (?) | | | iperf3, debian11 |
Ajmind-0 | EPYC 7313P | Zen3 | server | 4.4 G (?) | 3.07 G (?) | | | iperf3, debian12 |
vates | EPYC 9124 | Zen4 | server | 1.16 G (16/17/??⁴) | 1.35 G (20/25/??⁴) | N/A | N/A | !xcp-ng, Xen 4.18-rc + suse 15 |
vates | EPYC 9124 | Zen4 | server | 5.70 G (100/140/200) | 10.4 G (230/250/420) | 10.7 G (0/120/200) | 15.8 G (0/320/380) | no boost |
vates | Ryzen 9 5950x | Zen3 | desktop | 7.25 G (30/35/60) | 16.5 G (160/210/300) | 17.5 G (0/110/140) | 27.6 G (0/270/330) | no boost |
⁴: xentop on this host shows 3200% on dom0 all the time; profiling does not seem to show anything actually using CPU, but it may be related to the extremely poor performance.
last updated: 2023-11-29 16:46
All help is welcome! For those of you who already provided tests that I integrated in the table, feel free not to rerun them; it looks like following the exact protocol and providing more data won't make much of a difference, and I don't want to waste your time!
Thanks again to all of you for your insight and your patience, it looks like this is going to be a deep rabbit hole, I'll do my best to get to the bottom of this as soon as possible.
So, after our investigations, we were able to pinpoint the issue.
It seems to happen on most RHEL derivative distributions when migrating from 8.7 to 8.8. As suggested, the bug is in the kernel.
Starting with 4.18.0-466.el8, the patch "x86/idt: Annotate alloc_intr_gate() with __init" is integrated and creates the issue. It is missing "x86/xen: Split HVM vector callback setup and interrupt gate allocation", which should have been integrated as well.
The migration to 8.8 will move you to 4.18.0-477.* versions, which also exhibit this issue; that's what you reported.
We found that the 4.18.0-488 kernel that can be found in CentOS 8 Stream integrates the missing patch, and it does indeed work when installed manually.
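If you want to check whether a given guest is running an affected kernel, something as simple as this is enough (the version boundaries in the comments are the ones described above):

```
# Inside the RHEL 8.x derivative guest:
uname -r          # 4.18.0-466.el8 through 4.18.0-477.* are affected
rpm -q kernel     # list installed kernels; 4.18.0-488 and later carry the missing patch
```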
Your reports helped us identify and reproduce the issue, which allowed us to provide a call stack to the Xen developers. Roger Pau Monné then quickly found that this patch was missing, and we were able to determine which versions of the kernel RPMs integrate it and when the fix landed.
This means the issue was identified on RH side, and it is now a matter of having an updated kernel in derivative distributions like Rocky and Alma.
Could one of you try the kernel-alt package? It is not meant for production as it is not fully tested and supported, but if a higher patch level of the 4.19 kernel helps, it would give us a better idea of what's happening.
EDIT: it should be updated to a new patch level soon-ish, so if the current one does not fix the issue, we should soon have another shot with a more recent update.
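For reference, installing the alternate kernel is normally just a yum install from the standard XCP-ng repositories; please double-check the documentation for your release, this is only the usual pattern:

```
# On the host (dom0): install the alternate 4.19 kernel and reboot on it.
yum install kernel-alt
reboot    # make sure the -alt kernel entry is the one selected at boot
```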
A new XSA was published on September 24th 2024.
Intel published a microcode update on September 10th, 2024.
We also included an updated xcp-ng-release for testing, although it is not related to security.
Packages:
- xen-*
- microcode_ctl
- xcp-ng-release

yum clean metadata --enablerepo=xcp-ng-candidates
yum update "xen-*" microcode_ctl xcp-ng-release --enablerepo=xcp-ng-candidates
reboot
The usual update rules apply: pool coordinator first, etc.
Versions:
- xen: 4.13.5-9.44.1.xcpng8.2
- microcode_ctl: microcode_ctl-2.1-26.xs29.5.xcpng8.2
- xcp-ng-release: xcp-ng-release-8.2.1-13

What to test: Normal use and anything else you want to test.
~ 1 day because of security updates.
Update published: https://xcp-ng.org/blog/2024/08/16/august-security-update/
Thank you all for testing
My bad, we were a bit late and I tried to be quick and forgot to move it... I just did that; it should be good soon, as it needs some time to sync the repos.
The update has been published, thank you for testing it out.
https://xcp-ng.org/blog/2024/03/15/march-2024-security-update/
Two new XSAs were published on the 12th of March, in conjunction with microcode updates from Intel.
Packages:
- xen-*
- microcode_ctl: security updates from Intel

yum clean metadata --enablerepo=xcp-ng-testing
yum update "xen-*" microcode_ctl --enablerepo=xcp-ng-testing
reboot
The usual update rules apply: pool coordinator first, etc.
Versions:
- xen: 4.13.5-9.39.1.xcpng8.2
- microcode_ctl: 2.1-26.xs28.1.xcpng8.2

What to test: Normal use and anything else you want to test.
2 days because of security updates.
We're still actively working on it; unfortunately, we're still not 100% sure what the root cause is.
From what we could gather, it does seem to affect all Zen generations, slightly differently: it seems to be a bit better on Zen3 and Zen4, but it still always leads to underwhelming network performance for such machines.
To provide some status/context: I worked on this internally for a while, then, as I had to attend to other tasks, we hired external help, which gave us some insight but no solution, and now we have @andSmv working on it (but not this week, as he's at the Xen Summit).
From the contractors we had, we found that grant table and event channel operations occur more often than on an Intel Xeon, which looks like more packets being processed at first, but they then take much more time.
What Andrei found most recently is that PV and PVH (which we do not support officially) get about twice the performance of HVM and PVHVM. Also, pinning both dom0 and a guest to a single physical core gives better results. This seems to indicate that it may come from the handling of cache coherency, and could be related to guest memory settings that differ between Intel and AMD. That's what is under investigation right now, but we're not sure there will be any possibility to change that.
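For those who want to reproduce the pinning experiment, a rough sketch with the standard xl tool in dom0 could look like this; the exact CPU numbers depend on your topology (see xl info -n), the ones below are just an example:

```
# Pin dom0 (domain 0) and a guest onto the same physical core, e.g. CPUs 0-1
# (a core and its SMT sibling), then verify the affinities.
xl vcpu-pin 0 all 0-1
xl vcpu-pin <guest-domid-or-name> all 0-1
xl vcpu-list
```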
I hope this helps make things a bit clearer to you guys, and shows we do invest a lot of time and money digging into this.
I'll investigate this further today to be 100% sure, but the version of XZ we have is not impacted; moreover, we build from a copied tarball in our build system, so even if the upstream tarball of this version had been tampered with after we downloaded it, we would not be impacted.
We'll make a communication about it once I've finished double-checking.
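For anyone who wants to verify on their own hosts in the meantime, checking the installed xz version is straightforward; the compromised upstream releases were 5.6.0 and 5.6.1:

```
# On the host: show the installed xz packages and their version.
rpm -q xz xz-libs
xz --version
```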
Olivier is currently on holiday, so I'll try to answer: it will mostly depend on where the fix is implemented by XenServer. They do have some closed-source parts; if the fix goes into one of those, we won't be able to get it. If it is implemented in one of the open-source parts, we should be able to pick it up and integrate the change on our side.
Let's hope the fix ends up on the right side
@rmaclachlan it could be the kernel, but there is Xen between the kernel and the hardware, which for example handles CPU frequency scaling. If you're on a test machine and can spare the time, maybe you can give Xen 4.17 on XCP-ng 8.3 a shot. To be honest, I don't think this will change much, but who knows until it's tested.
Given that Rix_IT is on 8.2, that lets us know both versions seem affected, thanks!