Photo from Markus Winkler

Behind the scenes: debugging a pool join failure

Devblog Jun 30, 2026

Most support stories end when the user gets a solution.
This one is a bit different.

A user reported a pool join failure after upgrading from XCP-ng 8.2 to 8.3. What followed was a debugging journey that went from forum discussion to code analysis, root-cause identification and ultimately an upstream contribution improving error reporting for every future user.

We are sharing this article as it is a good illustration of how our engineering and support teams work together when investigating real-world issues.

Like many investigations, this one started on the XCP-ng forum, where users, contributors and Vates engineers regularly work through problems together.


✒️
Written by LucienLassalle (Loka)
Behind this mysterious pseudonym is a DevOps & Cybersecurity Specialist at Vates.

When a pool join fails on a certificate

A user on the forum hit a wall that looks scary at first and turns out to be a small, fixable gap. They had a pool running XCP-ng, upgraded it from 8.2 to 8.3, and then tried to add a fresh host to grow their capacity. Adding the server in Xen Orchestra worked. Joining it to the pool did not. The join failed with this:

Stunnel.Stunnel_verify_error("0A000086:SSL routines::certificate verify failed")

If you have never seen it before, a TLS verification error on a pool join is the kind of message that tells you something is wrong without telling you what. This is the story of how I reproduced it, the thing that kept me puzzled, the source of truth that settled it, and the small change we landed upstream so the next person gets a clearer hint.

Reproducing it, and the part that puzzled me

The way I like to work a problem like this is to get into the same state as the person reporting it. That is harder when they are not a customer, and I cannot look at their setup, so I rebuilt the scenario myself: take a host on 8.2, upgrade it to 8.3, then try to join it to a pool.

Here is the part that kept me stuck for a while: at first, it just worked for me. My join succeeded. So the real question was not "what is broken" but "what is different between their setup and mine." That gap is what I had to chase.

Ruling out the usual suspects

TLS problems have a few usual suspects, and the thread worked through them. Listing them is worth it, because ruling things out is half the job:

  • Clock skew. With TLS, time matters: a certificate seen as "not yet valid" or "expired" because the clocks disagree will fail verification. Both hosts were synced to an NTP pool, so this was not it.
  • Certificate validity. The master's own certificate (/etc/xensource/xapi-pool-tls.pem) and the host certificate (/etc/xensource/xapi-ssl.pem) were valid and not expired. The certificates the hosts already had were fine.
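The expiry half of these checks can be scripted. What follows is a minimal sketch, assuming openssl is installed on the host; the function name cert_not_expired is hypothetical and not part of XCP-ng. It only covers expiry, so a "not yet valid" certificate still has to be spotted by reading the notBefore date by eye.

```shell
# Hypothetical helper (not part of XCP-ng): succeeds if the certificate
# can be read and has not expired yet.
cert_not_expired() {
    # -checkend 0 makes openssl exit non-zero when the certificate
    # has already expired (or when the file cannot be read).
    openssl x509 -in "$1" -noout -checkend 0 >/dev/null 2>&1
}

# Usage on a host, with the paths named in the list above:
#   date -u    # compare the clock on both hosts
#   openssl x509 -in /etc/xensource/xapi-ssl.pem -noout -dates
#   cert_not_expired /etc/xensource/xapi-pool-tls.pem && echo "not expired"
```

Running the same checks on both hosts makes a clock disagreement visible next to the notBefore and notAfter dates.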

The source of truth is the source code

From there, the best source of truth is the source code. I read through how XAPI handles the certificate exchange during a pool join. With the code in front of me, I asked the user to list every certificate XAPI actually uses.
Their listing showed that one of those certificates was simply not present. That was the signal that there was something worth digging into.
With this answer, I was able to recreate the same situation:

  • Two 8.3 hosts in a pool. I removed the certificate in question and hit the same error again.

The cause was clear: XCP-ng hosts talk to each other over TLS, tunneled through stunnel. When a host joins a pool, it has to trust the pool master's certificate. If the master's certificate is not where the joining host expects to find it, verification fails, and the join is refused. The generic SSL error is the symptom. The missing certificate is the cause.

Next, we had to fix the problem.
You might think it's just two commands, but when you value safety, you don't trust something without understanding it. So we had to be sure we were using the correct commands to avoid any side effects.

The fix

⚠️
While this fix resolved the issue described in this article, it may not be appropriate for every environment. Avoid applying troubleshooting commands blindly on production systems. If your infrastructure is covered by a support contract, open a support ticket first so the issue can be properly investigated before any corrective action is taken.

On the master, refresh its server certificate and sync it across the pool:

xe host-refresh-server-certificate host=$(hostname)
xe pool-certificate-sync

Then check that the certificate landed where it belongs:

ls -l /etc/stunnel/certs-pool

You should see a file named after the master's UUID, ending in .pem. It may instead appear as .new.pem; that form seems to work too, but if the join is still refused, copy it to the same name without the .new. After that, the host joins the pool cleanly.
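To avoid eyeballing the directory listing, the check can be wrapped in a few lines of shell. This is only a sketch: pool_cert_form is a hypothetical helper name, and it does nothing more than test which of the two file names described above exists.

```shell
# Hypothetical helper: report which form of the master's certificate
# exists in a certificate directory.
# Prints "pem", "new.pem" or "missing".
pool_cert_form() {
    dir="$1"
    uuid="$2"
    if [ -f "$dir/$uuid.pem" ]; then
        echo "pem"
    elif [ -f "$dir/$uuid.new.pem" ]; then
        echo "new.pem"
    else
        echo "missing"
    fi
}

# Usage on a host (MASTER_UUID is the master's UUID):
#   pool_cert_form /etc/stunnel/certs-pool "$MASTER_UUID"
```

The helper only reads the directory; copying a .new.pem file to its plain .pem name stays a deliberate manual step.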

You can confirm you are looking at the right certificate by matching fingerprints. Grab the master UUID, then compare the master's certificate with the one stored for the pool:

grep INSTALLATION_UUID /etc/xensource-inventory | cut -d"'" -f2
openssl x509 -in /etc/xensource/xapi-pool-tls.pem -noout -fingerprint -sha256
openssl x509 -in /etc/stunnel/certs-pool/{MASTER_UUID}.pem -noout -fingerprint -sha256

Replace {MASTER_UUID} with the UUID the first command prints. The two fingerprints should match.
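The comparison can also be done by the shell instead of by eye. A minimal sketch, assuming openssl is available; same_fingerprint is a hypothetical helper name, not an XCP-ng tool.

```shell
# Hypothetical helper: succeeds only if both certificate files have the
# same SHA-256 fingerprint. Returns 2 if either file cannot be parsed.
same_fingerprint() {
    fp_a=$(openssl x509 -in "$1" -noout -fingerprint -sha256 2>/dev/null) || return 2
    fp_b=$(openssl x509 -in "$2" -noout -fingerprint -sha256 2>/dev/null) || return 2
    [ -n "$fp_a" ] && [ "$fp_a" = "$fp_b" ]
}

# Usage on the master, with the paths used above:
#   MASTER_UUID=$(grep INSTALLATION_UUID /etc/xensource-inventory | cut -d"'" -f2)
#   same_fingerprint /etc/xensource/xapi-pool-tls.pem \
#       "/etc/stunnel/certs-pool/$MASTER_UUID.pem" && echo "fingerprints match"
```

The helper compares the full output line of each openssl call, so it works the same whatever case the installed openssl version uses for the fingerprint label.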

Why it happened

The trigger is the 8.2 to 8.3 upgrade path. During the 8.2 line, a change landed in how XAPI handles the certificate exchange, so a host that was not fully up to date on 8.2 before moving to 8.3 can end up without the master certificate in its pool bundle. The certificates themselves are valid. The one that is needed is just not in the right place.

Making the next person's life easier

The hardest part here was not the fix. It was that there was no obvious error: the message said "certificate verify failed" and nothing more. So rather than stop at solving one case, I proposed a change to XAPI so the message points straight at the problem. Instead of a generic SSL failure, a pool join that fails this way now reports:

POOL_JOINING_MASTER_CERTIFICATE_NOT_IN_POOL_BUNDLE

That is a message you can act on without reading any code. The change went in as xen-api PR #7112 and is merged on master and on the 8.3 branch, so it will reach users in an upcoming release.
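For anyone scripting pool joins, the difference between the two messages can be sketched as a small classifier. This is an illustration only: explain_join_error is a hypothetical helper, and it assumes the error text from the failed join has already been captured in a variable. It matches nothing but the two strings quoted in this article.

```shell
# Hypothetical helper: turn the error text of a failed pool join into a hint.
# It only knows the two messages quoted in this article.
explain_join_error() {
    case "$1" in
        *POOL_JOINING_MASTER_CERTIFICATE_NOT_IN_POOL_BUNDLE*)
            echo "master certificate missing from the pool bundle" ;;
        *Stunnel_verify_error*|*"certificate verify failed"*)
            echo "generic TLS verification failure" ;;
        *)
            echo "unrecognized error" ;;
    esac
}

# Usage, assuming ERR holds the captured error text:
#   explain_join_error "$ERR"
```

The specific error code is checked first, so a message that carries both the new code and the older SSL text still gets the more precise hint.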

That last step is the part of this story I care about most. We are a software vendor, and what we fix does not stop at our own product. Downstream matters, but we would rather the fix land upstream, so that everyone on XAPI benefits, not only our customers. A forum thread that starts with one user's failed pool join and ends as a merged upstream improvement is exactly how that is supposed to work.

From user report to platform improvement

A TLS error on a pool join sounds like a deep cryptography problem. Here, it was a certificate that did not get carried across an upgrade.

The path to it was ordinary: reproduce the failure, notice what differs from a working setup, ask the host what it actually has, and when something is missing, read what the code really does.

And when the error message is the thing that slowed you down, the best fix is to improve the message, upstream, for everyone.

The result

What started as a single user issue led to:

  • a documented root cause
  • a validated recovery procedure
  • a clearer error message in XAPI
  • an upstream contribution benefiting all future users

At Vates, solving an issue is only part of the job.

When possible, we aim to understand the root cause, improve the platform and contribute those improvements upstream.

In this case, a forum discussion led not only to a solution for one user, but also to a clearer error message that will help everyone running XCP-ng in the future.

Marc Pezin

CMO at Vates since 2017, I have been at the forefront of shaping and advancing the communication strategies and business development initiatives for Vates Virtualization Management Stack.