This is the story of how a performance regression was tracked down and solved, thanks to great teamwork between our community, our XCP-ng team and the Xen Project.
It all started in October last year (the 20th, to be exact), with the security report and fix called XSA-332 (details here). It delivered a series of patches to the Linux kernel that changed the way events are handled, protecting dom0 from DoS attacks by rogue guests. It added a throttling mechanism: a pause after too many spurious events are received. As usual, after running some tests on our side, we bundled it with other security fixes and released them in XCP-ng 8.1, as you can see in our November 2020 security update blog post.
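To make the throttling idea concrete, here is a minimal sketch in Python. It is not the kernel code from XSA-332: the `EventChannel` class, the threshold and the delay values are all hypothetical, chosen only to show the principle of pausing a channel once it has delivered too many spurious events in a row.

```python
from dataclasses import dataclass

# All values below are hypothetical, picked for illustration only.
SPURIOUS_THRESHOLD = 3   # spurious events tolerated before throttling starts
BASE_DELAY_MS = 1        # first pause applied once the threshold is exceeded
MAX_DELAY_MS = 1000      # upper bound on the pause


@dataclass
class EventChannel:
    """Toy model of one event channel between a guest and dom0."""
    spurious_count: int = 0

    def handle(self, spurious: bool) -> int:
        """Process one event and return the pause (in ms) before re-enabling it."""
        if not spurious:
            # A useful event resets the counter: no throttling.
            self.spurious_count = 0
            return 0
        self.spurious_count += 1
        if self.spurious_count <= SPURIOUS_THRESHOLD:
            return 0
        # Past the threshold, the pause doubles with each further spurious event.
        exponent = self.spurious_count - SPURIOUS_THRESHOLD - 1
        return min(BASE_DELAY_MS << exponent, MAX_DELAY_MS)
```

With a model like this, a guest that only sends useful events is never delayed. A guest whose events are classified as spurious, rightly or wrongly, quickly spends most of its time paused.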
And this was when the fun started!
Only a few days after the update, members of our community started reporting a huge drop in network performance inside their pfSense virtual machines. This thread would become the main medium between XCP-ng and the affected users in a joint effort to debug the issue.
After a few weeks, our users identified the recent kernel update as the cause. Samuel quickly provided them with instructions to rebuild the kernel while disabling the recent security patches one by one. Thanks to their tests, the 12th and last patch from XSA-332 was identified as the cause of the network performance drop. And guess what? It was indeed the patch adding throttling!
Obviously, it couldn't be that easy. The following weeks were more confusing, with various inconclusive tests and reports. We attempted (thanks Jon!) to reproduce every report to be sure we were digging in the right direction. And sometimes, we weren't 🤷
Fortunately, on the first day of 2021, someone reproduced the issue on Citrix Hypervisor. It was an important step: the problem wasn't specific to XCP-ng. At this stage Citrix had not identified the issue, since they don't support (or test) FreeBSD guests. On our side, however, our big community has many users relying on pfSense. This clearly helped bring the performance drop to light at such short notice.
A few days later, our team could reproduce the issue consistently, and we were able to build a test environment to investigate it in depth. Indeed, after disabling the infamous patch 12, we saw performance return to a normal, pre-patch level in our internal lab. Sharing that modification with our community for people who wanted to test was also tremendously useful: we were going in the right direction! As one community member said:
This is the power of open source distributions like XCP-ng. You're able to build a custom RPM that excludes a troublesome patch while we wait for it to be addressed upstream.
So what about upstream then? Well, it was time to report it to the Xen Project! And we did 🐛
Oh, and don't forget to update your documentation while you are working upstream to fix the root cause. That's why our "Known issues" section was updated while we were providing the test kernel without patch 12 for users who needed it (with all the warnings that come with it).
In February, Citrix was also able to confirm the issue (thanks Andrew!), not through FreeBSD VMs but via PV-shim guests.
We started to work on IRC (Freenode, #xendevel channel) with the author of the XSA-332 patch series, Juergen Gross (from SuSE), but also with Roger Pau Monné (from Citrix). Roger helped because we weren't sure yet if there was something to fix in Xen's FreeBSD support itself.
A new patch series was in preparation to add debugging facilities to the throttling mechanism, in addition to fixing other issues caused by XSA-332. Spoiler alert: it didn't solve the performance issue yet (though the tests yielded significantly different results). Patch backporting to the kernel version in XCP-ng 8.2 and numerous kernel rebuilds followed 🛠️
At some point, Sam noticed that more than 98% of the events from a given test run with our pfSense VM were considered spurious, and the remaining events weren't numerous enough to account for the amount of data transferred in the test.
Immediately after this, Juergen noticed a mistake in the code that flagged an event as spurious. Hang on: it was a bitwise OR instead of a bitwise AND, a one-character fix. This error was severely impacting FreeBSD VMs (though not only them). That was it.
But that's not even the funniest thing. It was in one of the XSA-332 patches, but not actually in patch 12. Disabling patch 12 did solve the issue for users because it disabled throttling altogether, but the real issue was that events were incorrectly flagged spurious, thus triggering the throttling.
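A small sketch helps show how a single character can flag most events as spurious. This is not the actual kernel code, and the post does not say which operands were involved. The example assumes, purely for illustration, a single event shared by a receive and a transmit direction, with hypothetical `has_rx` and `has_tx` indicators telling whether each direction has pending work.

```python
def is_spurious_buggy(has_rx: bool, has_tx: bool) -> bool:
    """OR: flagged spurious as soon as EITHER direction has nothing to do."""
    return (not has_rx) | (not has_tx)


def is_spurious_fixed(has_rx: bool, has_tx: bool) -> bool:
    """AND: flagged spurious only when BOTH directions have nothing to do."""
    return (not has_rx) & (not has_tx)


def spurious_ratio(events, classifier) -> float:
    """Fraction of (has_rx, has_tx) events that the classifier flags as spurious."""
    flagged = sum(1 for has_rx, has_tx in events if classifier(has_rx, has_tx))
    return flagged / len(events)


# Hypothetical workload: most events carry work for one direction only.
sample_events = [(True, False)] * 49 + [(True, True)] * 1
```

On this made-up workload, the OR variant flags every one-directional event as spurious, while the AND variant flags none of them. Under these assumptions, that would explain why a VM doing perfectly legitimate traffic ended up throttled.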
In the following days, Sam announced to our users that a fix had been found, and provided test kernels for both XCP-ng 8.1 and 8.2. Finally, still in February (the 26th), we took the opportunity of a kernel security update to include the fix and solve the issue for everyone.
One month later, a hotfix for Citrix Hypervisor 8.2 was released by Citrix, fixing this issue in addition to other consequences of XSA-332.
It was a bit of a saga, from initial report to the final resolution. Kudos to our community for their amazing work. This is also great proof that our team can handle very difficult tasks while being able to make a real bridge between our "end users" and deeply technical upstream projects like Xen. A big thanks to Sam and Jon on the XCP-ng team who handled most of the work 👏. And to the Xen Project team: thanks to Juergen, Roger and Andrew. It's always a pleasure to work with you guys!
In the end, so many people contributed one way or another to find or fix the problem. This is really why we love Open Source.