XCP-ng 8.3 Varstored Update: Unbootable VM Risk and Remediation
A varstored update for XCP-ng 8.3 was published for several hours on Thursday, October 23. Under certain conditions, this could lead to unbootable VMs that require recovery using a remediation script. This article describes the issue, who may be affected, how to verify your hosts and VMs before problems arise, and how to remediate if affected.
On Thursday, October 23, an update for the XCP-ng 8.3 component varstored was made available for several hours.
While this update aimed to improve the user experience with Guest UEFI Secure Boot by providing the necessary certificates automatically, we discovered that it was incompatible with subsequent certificate updates performed by the VM's operating system. Moreover, due to a preexisting bug, this could prevent VMs from starting after such an attempt.
We immediately pulled the update, updated the related blog post, and informed our support teams so they could assist users in recovering from any issues caused by this update.
Note: It is safe to install updates now, and has been since we removed the faulty varstored update on Thursday, October 23.
This article will address the following topics:
- How to determine if your XCP-ng servers were affected by the faulty update and how to revert it if necessary.
- How to detect Virtual Machines or VM templates affected and how to fix them proactively.
- How to recover an unbootable VM.
- What's coming next regarding varstored and Guest Secure Boot.
- Technical details about what happened, for those interested.
But first, we want to apologize for this issue. The security and reliability of XCP-ng are our top priorities, and we regret not having foreseen this problem before the update was released: it went through despite our internal testing infrastructure. We will add new tests to prevent similar issues in the future.
However, even with the faulty update, the likelihood that VMs were affected is low. Let's review how to verify this.
Are my hosts affected?
If you updated on Thursday, October 23, before the varstored update was pulled, your hosts are affected. All active XCP-ng update mirrors synchronize within one hour at most, so we believe that no affected update was delivered after 23:59 UTC that day. However, we still advise verifying your hosts if you have installed updates since Thursday, October 23.
Verify your hosts via SSH with:
# rpm -q varstored
- If it says varstored-1.2.0-3.1.xcpng8.3.x86_64, then your host is currently affected.
- If it says varstored-1.2.0-2.3.xcpng8.3.x86_64, then your host has not been affected. In this case, you can stop reading here.
- If the version is higher than 1.2.0-3.1, this means that you have already installed a more recent update, published after this article was written. In that case, the version alone cannot tell you whether you installed the faulty update on October 23. Check /var/log/yum.log if in doubt, or follow the next steps, including verifying VMs.
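The version check above can be scripted. The following is a minimal sketch, not an official tool: the function names `classify_varstored` and `was_faulty_installed` and the three result labels are hypothetical, and the log check assumes that yum.log records the full package name of each installed or updated package.

```shell
#!/bin/sh
# Hypothetical helpers, based only on the version strings named in this article.

# Classify one line of `rpm -q varstored` output.
classify_varstored() {
    case "$1" in
        varstored-1.2.0-3.1.xcpng8.3*) echo "affected" ;;
        varstored-1.2.0-2.3.xcpng8.3*) echo "not-affected" ;;
        # Any other version: the package version alone is inconclusive.
        *) echo "check-yum-log" ;;
    esac
}

# Succeed if the given yum log mentions the faulty build (fixed-string match).
was_faulty_installed() {
    grep -q -F 'varstored-1.2.0-3.1.xcpng8.3' "$1"
}
```

On a host, these could be called as `classify_varstored "$(rpm -q varstored)"` and `was_faulty_installed /var/log/yum.log`.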
⚠️ If your hosts are affected, immediately revert the varstored package to the previous version on every host:
# yum clean metadata --enablerepo=xcp-ng-lab
# yum update varstored-1.2.0-3.3.xcpng8.3 varstored-tools-1.2.0-3.3.xcpng8.3 --enablerepo=xcp-ng-lab
No reboot required. Just restart the XAPI toolstack with:
# xe-toolstack-restart
Then proceed to verify your pool certificates and your VMs, and to fix them if needed.
Are my VMs affected?
Even on affected hosts, several conditions must be met for a VM to be affected (which partly explains why we didn't detect the issue sooner).
As most users will not be affected, let's reverse the question: none of your VMs are affected if any of these conditions are met:
- Your pool never received the faulty update.
- You have not started any UEFI VM for the first time since it was created.
- You started new UEFI VMs for the first time, but they were created from a template derived from a VM that had already been started previously, from an XOA Hub template, or through other means based on a VM copy, export or snapshot.
- Your pool was already set up for Guest Secure Boot, and you did not run secureboot-certs clear after the faulty update to switch to the system defaults provided by the update.
In all other cases, we recommend checking your VMs.
Scan for affected VMs, snapshots and templates
The simplest way to determine whether VMs (or VM templates and snapshots created from an affected VM) are affected is to use our remediation script in detect mode.
First, ensure you have varstored-1.2.0-3.3 or later installed (see Are my hosts affected?).
Scan the pool:
# fix-efivars.py scan-pool
The script will first check whether your pool-level UEFI certificates are affected. If it finds affected certificates at the pool level, it will display this:
Scanning pool certs
Found pool UEFI variable db which needs fixing
Found pool UEFI variable KEK which needs fixing
Found pool UEFI variable dbx which needs fixing
Please update varstored to a release higher or equal than 1.2.0-3.3, and scan again. If the message persists, run 'secureboot-certs clear' and scan again to list affected VMs if there are any.
If you see the above, run secureboot-certs clear, then scan again.
If the pool certificates are healthy, the same command will now scan your VMs:
# fix-efivars.py scan-pool
It will list any affected VM, snapshot or template.
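The two-stage scan can be sketched as a small filter over the script's output. This is an illustrative helper only: the function name `pool_certs_state` and its two result labels are hypothetical, and it matches only the "Found pool UEFI variable ... which needs fixing" lines quoted above. The output for a healthy pool is not shown in this article, so anything else is simply treated as "ok".

```shell
#!/bin/sh
# Hypothetical helper: reads `fix-efivars.py scan-pool` output on stdin and
# reports whether the pool-level certificates were flagged.
pool_certs_state() {
    if grep -q 'Found pool UEFI variable .* which needs fixing'; then
        echo "pool-certs-need-fixing"
    else
        echo "pool-certs-ok"
    fi
}
```

A possible use is `fix-efivars.py scan-pool 2>&1 | pool_certs_state`. Redirecting stderr is an assumption, made in case the script writes some messages there. If the result is "pool-certs-need-fixing", run secureboot-certs clear and scan again as described above.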
How to fix an affected VM, snapshot, or template
At this point, we assume that you have already updated varstored in order to get the remediation script.
Affected snapshots cannot be fixed. The only options here, to avoid any future issues, are to remove them, or to flag them in a way which ensures that no VM will be restored, or backup made, from them.
For each affected VM or template, you can first run the remediation script to scan it, without making permanent changes to it:
# fix-efivars.py check-vm <UUID>
INFO:root:Scanning VM <UUID>
INFO:root:Variable contains bogus data: d719b2cb-3d3a-4596-a3bc-dad00e67656f db
INFO:root:Variable contains bogus data: 8be4df61-93ca-11d2-aa0d-00e098032b8c KEK
INFO:root:Variable exceeds limit: d719b2cb-3d3a-4596-a3bc-dad00e67656f dbx (60440 > 57344)
Found 3 affected variable(s) (1 oversized) in VM <UUID>. VM needs propagating with pool certs.
Then you can fix the VM, or template, by appending the --fix parameter:
# fix-efivars.py check-vm <UUID> --fix
INFO:root:Scanning VM <UUID>
Backing up existing variables to <UUID>.<TIMESTAMP>.efivars.b64
INFO:root:Variable contains bogus data: d719b2cb-3d3a-4596-a3bc-dad00e67656f db
INFO:root:Variable contains bogus data: 8be4df61-93ca-11d2-aa0d-00e098032b8c KEK
INFO:root:Variable exceeds limit: d719b2cb-3d3a-4596-a3bc-dad00e67656f dbx (60440 > 57344)
Fixed 3 affected variable(s) (1 oversized) in VM <UUID>. VM has been propagated with pool certs.
Scan it again to verify that the VM is now healthy:
# fix-efivars.py check-vm <UUID>
INFO:root:Scanning VM <UUID>
(no more output)
You can then start your VM. If your VM fails to start up (e.g. stuck at the console), we recommend disabling Secure Boot on your VM while waiting for the next, imminent, varstored update.
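When checking many VMs, the check-vm output can be summarized automatically. The sketch below counts the per-variable findings using only the two message formats quoted above ("Variable contains bogus data" and "Variable exceeds limit"). The function name `count_affected_vars` is hypothetical, and the script may emit other messages not covered here.

```shell
#!/bin/sh
# Hypothetical helper: reads `fix-efivars.py check-vm <UUID>` output on stdin
# and prints how many variables were reported as affected.
count_affected_vars() {
    # grep -c prints 0 but exits non-zero when nothing matches; `|| true`
    # keeps the function usable in scripts that run under `set -e`.
    grep -c -E 'Variable (contains bogus data|exceeds limit):' || true
}
```

For example, `fix-efivars.py check-vm <UUID> 2>&1 | count_affected_vars` should print 0 for a healthy VM. Capturing stderr is an assumption, since the INFO lines look like Python logging output, which usually goes there.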
Technical details and context
The varstored update we released and then pulled back had been in development for months, within a context that spans years of work on Guest Secure Boot management.
The complexity we have always faced with this topic revolves around UEFI certificate management. For Secure Boot to be enforced, a set of certificates is needed to determine which binaries are allowed to boot. The default certificates, used by all mainstream operating systems, are managed and distributed by Microsoft. However, they are licensed in a way that does not allow free redistribution, which conflicts with the open-source licenses under which XCP-ng is distributed.
To work around this obstacle, we initially designed a way for you to download and install the required certificates with a simple command, once per pool, documented in our official documentation as secureboot-certs install. This is how it worked in XCP-ng 8.2 and currently works in XCP-ng 8.3.
However, this is not a perfect solution:
- It is a manual step after deploying a pool, whereas we would prefer it to work out of the box.
- It requires internet connectivity from the pool, or alternatively, downloading the files separately and running more complex commands.
- After the initial setup, certificate updates cannot be provided automatically; you need to run the command again. This includes the 2023 certificates that supersede the old 2011 certificates, as well as the regularly updated revocation database, both of which we would prefer to deliver through system updates.
Thus, when Microsoft made it possible to build our own certificate databases from source files distributed under a more permissive license, we decided to include them with the system. This is what the varstored update aimed to deliver.
Our tests were all green, the documentation updated, and the update would simplify new deployments. Existing pools would not be affected unless admins chose to switch from their initially deployed certificates to the new defaults.
However, there was one problem we had not foreseen, one case we had not tested (and should have): what happens when Windows or another OS wants to update the certificates in a VM? Each VM, after its first boot, has a copy of the certificates that are then managed by the OS itself. We had seen countless times how the update would simply append new certificates to the database. But this was not the case when our certificates were there.
We had followed Microsoft's documentation when building our certificate databases, and it mentioned that we should use our own identifier in the resulting artifacts, not Microsoft's. We did this, but it turned out that this instruction was intended only for items not coming from Microsoft's source files (after our recent feedback, they updated the documentation to avoid any confusion).
The consequence: Windows Update detects that the revocation database doesn't match its latest version. It attempts to update, but since it does not recognize the certificates as Microsoft's, it adds its own copy. There's only about 57 KB of space (57,344 bytes) to hold the revocation database, and the addition of Microsoft's copy of it makes us go over the limit. That's where a pre-existing bug in varstored plays its role: instead of refusing the update, it lets the attempt corrupt the variables, and nothing works anymore, because varstored now complains about the bizarre state of the variable. The VM is now unbootable until we fix its UEFI variables (which, thankfully, we can do).
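The size arithmetic behind this failure can be illustrated as follows. This is only a sketch of the bounds check one would expect before an append, not varstored's actual code. The names `DBX_LIMIT`, `append_fits` and `overflow_bytes` are hypothetical, and the limit value is taken from the check-vm output shown earlier (57344 bytes, which is 56 × 1024).

```shell
#!/bin/sh
# Illustrative only: names are hypothetical, not part of varstored.
DBX_LIMIT=57344   # bytes, from "(60440 > 57344)" in the check-vm output

# Succeed only if appending $2 bytes to a variable of $1 bytes stays within the limit.
append_fits() {
    [ $(( $1 + $2 )) -le "$DBX_LIMIT" ]
}

# Print how many bytes a variable of size $1 exceeds the limit by (negative if it fits).
overflow_bytes() {
    echo $(( $1 - DBX_LIMIT ))
}
```

With the 60440-byte dbx reported in the example output, `overflow_bytes 60440` prints 3096: the duplicated revocation entries pushed the variable about 3 KB past the limit. A safe implementation would reject the append at that point rather than store an inconsistent variable.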
A few hours after publishing the update, one of our developers found out about this issue and its consequences. That's when we decided to pull the update, and to update the announcement with a first set of instructions to avoid the issue and/or repair affected VMs. The present article is another step taken to address the issue.
What's next regarding Guest Secure Boot in XCP-ng
Thanks to new insight from Microsoft on https://github.com/microsoft/secureboot_objects, we now have a way to provide the default certificates without causing the previous issues.
Our developers are working on a new update for varstored that will allow every pool to use Guest Secure Boot without requiring any manual action. These certificates will also be updated to the latest versions recommended by Microsoft and will continue receiving updates in the future. We will cover the details when the update is released.