XCP-ng 8.3 updates announcements and testing

john.c

@stormi said in XCP-ng 8.3 updates announcements and testing:

IMPORTANT NOTICE!

After publishing the updates, we discovered a very nasty bug when using the UEFI certificates that we distribute. Long story short, they're too big for the limited space available (57K), and combined with a preexisting bug in varstored, this will cause the VM to stop booting after Windows or any other OS attempts to append to the DBX (revocation database).

We pulled the varstored update, but those who updated can be affected.

There are conditions for the issue:

Existing VMs are not affected, unless you propagated the new certs to them

New VMs are affected only if you never installed UEFI certs to the pool yourself (through XOA or secureboot-certs install), or cleared them using secureboot-certs clear in order to use our default certificates.

If you have the affected version of varstored (rpm -q varstored yields varstored-1.2.0-3.1.xcpng8.3):

on every host, downgrade it with yum downgrade varstored-1.2.0-2.3.xcpng8.3. No reboot or toolstack restart required.

if you have affected UEFI VMs, that is, VMs that meet the conditions above but are not broken yet, don't install updates, turn them off, and fix them by deleting their DBX database: https://docs.xcp-ng.org/guides/guest-UEFI-Secure-Boot/#remove-certificates-from-a-vm. This has to be done when the VM is off. Your OS will add its own DBX afterwards.

If you already have broken VMs (this warning reaching you too late), revert to a snapshot or backup. Other ways to fix them will require a patched varstored currently in the making.
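
For anyone who wants to script the version check above, here is a minimal sketch, not an official tool. It only classifies the string printed by rpm -q varstored. The function name classify_varstored is made up for illustration, and the only version treated as affected is the one named in the notice.

```shell
#!/bin/sh
# Hypothetical helper: classify the output of `rpm -q varstored`.
# Only the version named in the notice is treated as affected.
classify_varstored() {
    case "$1" in
        # Prefix match, so an optional ".x86_64" suffix is accepted too.
        varstored-1.2.0-3.1.xcpng8.3*) echo "affected" ;;
        varstored-*)                   echo "ok" ;;
        # e.g. "package varstored is not installed"
        *)                             echo "unknown" ;;
    esac
}
```

On a host this could be called as classify_varstored "$(rpm -q varstored)". If it prints "affected", the downgrade command from the notice applies.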

@dinhngtu A little trick for the future when determining whether a user's system is affected by a bad update based on version, as well as for remediation checks.

You can use "yum history list <packagename>" to retrieve transaction IDs. A script can then iterate over the transaction IDs and retrieve the package versions.

The specific transaction info can be retrieved with "yum history info <transaction_id>". This lets you go back much further, making it easier to see whether remediation is required.
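
A rough sketch of that idea, under stated assumptions: it assumes "yum history list" prints its usual pipe-separated table with the transaction ID in the first column, and that the version-release string appears somewhere in the "yum history info" output. The function names (extract_transaction_ids, mentions_version, report_package_history) are hypothetical, not part of yum.

```shell
#!/bin/sh
# Read `yum history list <pkg>` output on stdin, print one transaction ID per line.
# Assumption: rows look like "    12 | update | 2025-12-16 10:00 | Update | 5".
extract_transaction_ids() {
    awk -F'|' 'NF > 1 && $1 ~ /^[ \t]*[0-9]+[ \t]*$/ {
        gsub(/[ \t]/, "", $1)
        print $1
    }'
}

# Read `yum history info <id>` output on stdin.
# Succeeds if the given version-release string appears in it (literal match).
mentions_version() {
    grep -qF -- "$1"
}

# Print every transaction for package $1 that mentions version-release $2.
report_package_history() {
    pkg=$1
    version=$2
    yum history list "$pkg" | extract_transaction_ids | while read -r id; do
        # </dev/null keeps yum from touching the list of IDs on stdin.
        if yum history info "$id" </dev/null | mentions_version "$version"; then
            echo "transaction $id mentions $pkg $version"
        fi
    done
}
```

For the varstored case this would be run as report_package_history varstored 1.2.0-3.1.xcpng8.3 on each host. A host that lists a transaction has had the affected build at some point, even if it was downgraded since.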

stormi

New security and maintenance update candidate for you to test!

A hardware issue was found in AMD Zen 5 CPU devices, related to how random numbers are generated. It's best fixed via a firmware update, but we also provide updated microcode to mitigate it, and Xen is updated to support loading the newer microcode. We also publish other non-urgent updates which we had in the pipe for the next update release.

Security updates:

amd-microcode: This release fixes vulnerability CVE-2025-62626 in the microcode of AMD Zen 5 CPUs, which may generate an excessive number of zeros in random outputs, potentially compromising cryptographic security.
xen:
- Introduce support for the new Linux AMD microcode container format (multiple blobs per CPU)
- Address the XSA-476 vulnerability (CVE-2025-58149), low severity on XCP-ng (affects an unsupported feature of Xen)
- Enable passthrough of devices on non-zero PCI segments
- Improve performance of resumed or migrated VMs by supporting superpage restoration
- Fix detection of the Self Snooping feature on capable Intel CPUs
gpumon, xcp-featured: rebuilt for updated XAPI
qemu:
- Synchronize with XenServer's fix for the Windows Server 2025 NVMe write cache issue that we fixed previously
- Fix device passthrough with devices in a PCI segment different from 0
sm:
- Upstream changes:
  - Robustify CBT enable/disable calls to prevent errors.
  - Various fixes regarding SCSI commands/functions.
  - Add tolerance in the GC during leaf coalesce.
  - Improve GC logging and correct rare race conditions.
- Our changes
  - Use serial instead of SCSI ID for SR on USB devices to prevent bad match.
  - Explicit error message during LVM metadata generation when VDI type is missing.
  - Correct and robustify LINSTOR deletion algorithm to manage in-use volumes.
  - Avoid throwing LINSTOR exceptions in case of impossible temporary volume deletion in order to properly terminate higher-level API calls.
  - Prevent XOSTOR operations if LINSTOR versions mismatch in a pool.
varstored:
- Restore and update the default dbx for new VMs. That's the main change for users: we now embed the latest UEFI certificates with XCP-ng, making pools ready for Secure Boot out of the box. We'll update the documentation to explain how to handle the transition for existing pools (ranging from "nothing to do" to "do something to ensure that future certificate updates automatically become the pool's default").
- Fix the format of the default included KEK/db/dbx to ensure safe updates
- Fix an issue with UEFI variable length limit
xapi:
- Support up to 16 VIFs (virtual network interfaces) per VM (previously: 7)
- Runnable metrics:
  - runnable_any
  - runnable_vcpus
- Various fixes, optimizations, small improvements, and foundational changes (such as preparing for a newer version of OCaml)
xcp-ng-pv-tools:
- Properly detect Red Hat 10 and its derivatives when installing the Linux guest agent
- Update Windows Tools to 9.1.100
xcp-ng-release: fix benign "unary operator expected" error, displayed when connecting from some terminal software
xha: Nothing of note, minor changes such as logging typos...
xo-lite: version 0.17.0
- [VM/New] Fix the default topology by setting the platform:cores-per-socket value correctly (PR #9136)
- [Host/HostSystemResourceManagement] Fix display when control domain memory is undefined (PR #9197)
xsconsole: Prepare for a future feature.

Optional packages updated:

qlogic-netxtreme2-alt: alternate driver for NetXtreme2 updated to version 7.15.24.
qlogic-qla2xxx-alt: alternate driver qla2xxx updated to version 10.02.14.01_k

Test on XCP-ng 8.3

yum clean metadata --enablerepo=xcp-ng-testing,xcp-ng-candidates
yum update --enablerepo=xcp-ng-testing,xcp-ng-candidates
reboot

The usual update rules apply: pool coordinator first, etc.

Do not apply these updates if you are using the QCOW2 disk format. QCOW2 testing requires specific update repositories. Updating via the normal test channels would render your disks invisible, and even once the necessary packages are restored, their metadata (which disk is attached to what VM, etc.) will be lost.

For QCOW2 testers, update with:

yum update --enablerepo=xcp-ng-testing,xcp-ng-candidates,xcp-ng-qcow2

For others who'd like to start testing with the QCOW2 format, please head towards the dedicated thread: https://xcp-ng.org/forum/topic/10308/dedicated-thread-removing-the-2tib-limit-with-qcow2-volumes

Versions:

amd-microcode: 20251203-1.1.xcpng8.3
gpumon: 24.1.0-71.1.xcpng8.3
qemu: 4.2.1-5.2.15.1.xcpng8.3
sm: 3.2.12-16.1.xcpng8.3
varstored: 1.2.0-3.4.xcpng8.3
xapi: 25.33.1-2.1.xcpng8.3
xcp-featured: 1.1.8-3.xcpng8.3
xcp-ng-pv-tools: 8.3-15.xcpng8.3
xcp-ng-release: 8.3.0-35
xen: 4.17.5-23.1.xcpng8.3
xha: 25.2.0-1.1.xcpng8.3
xo-lite: 0.17.0-1.xcpng8.3
xsconsole: 11.0.9.1-1.1.xcpng8.3.3

Optional packages:

qlogic-netxtreme2-alt: 7.15.24-1.xcpng8.3
qlogic-qla2xxx-alt: 10.02.14.01_k-1.xcpng8.3
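
After updating, a quick way to confirm that a host actually carries the versions listed above is to compare them against what rpm reports. This is only a sketch: check_versions and installed_version are hypothetical helper names, and the input is expected in the same "name: version" form as the list in this post.

```shell
#!/bin/sh
# Print the installed VERSION-RELEASE of package $1, as reported by rpm.
installed_version() {
    rpm -q --qf '%{VERSION}-%{RELEASE}\n' "$1" 2>/dev/null
}

# Read "name: version-release" lines on stdin and compare each one with
# the installed package. Returns non-zero if any line does not match.
check_versions() {
    rc=0
    while IFS=': ' read -r name expected; do
        [ -n "$name" ] || continue   # skip blank lines
        actual=$(installed_version "$name")
        if [ "$actual" = "$expected" ]; then
            echo "OK       $name $actual"
        else
            echo "MISMATCH $name expected=$expected actual=$actual"
            rc=1
        fi
    done
    return $rc
}
```

Example use: save the "Versions:" list to a file, say expected.txt (a made-up filename), and run check_versions < expected.txt on each host. Optional packages will show as mismatches if they are not installed.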

What to test

Normal use and anything else you want to test.

Test window before official release of the updates

2 days.

ovicz

Updated to testing. Now all the VMs I have boot into the UEFI shell.
Something is broken with this update.

Screenshot from 2025-12-17 11-02-09.png

stormi

@ovicz Is Secure Boot enabled on these VMs?

ovicz

@stormi On some yes, on some no... I did enable it on one, but no go.
They used to work before the update, no matter if secure boot was enabled or not.

flakpyro

@stormi Installed on my usual test hosts probably an hour after your initial post and let them run throughout the day (Intel Minisforum MS-01, and Supermicro running a Xeon E-2336 CPU). Also installed onto a 2-host AMD EPYC pool. Updates went smoothly, backups continue to function as before.

A couple of Windows VMs had Secure Boot enabled on our test pool. After the initial reboot I ran "secureboot-certs clear" on the pool master, then in XOA I clicked "Copy pool's default UEFI certificates to the VM". The VMs continued to reboot without issue afterwards. Strange to see someone else having issues with VMs not booting.

ovicz

@flakpyro I did what you said about clearing the certs and reinstalling them, still no go.

dinhngtu

@ovicz No UEFI mapping displayed in the shell suggests that the VM couldn't detect its disks. Is your SR doing OK?

ovicz

@dinhngtu [11:31 xcp-ng-akz ~]# zpool status
pool: ZFS_Pool
state: ONLINE
config:

NAME        STATE     READ WRITE CKSUM
ZFS_Pool    ONLINE       0     0     0
  sda       ONLINE       0     0     0

errors: No known data errors

I guess so...

Screenshot from 2025-12-17 11-34-47.png

dinhngtu

@ovicz How is your SR connected to the host, via NFS or something else?

ovicz

@dinhngtu No. It's a local SSD with the ZFS pool on it. Everything worked before the update. I don't see other errors on the host.

dinhngtu

@ovicz Let me call @storage

dinhngtu

@ovicz To confirm, you were using the ZFS integration documented here https://docs.xcp-ng.org/storage/#zfs right?

ovicz

@dinhngtu Yes. Like I said, all was good before the update from testing. The XCP-ng host is on another drive and the ZFS pool is on a separate SSD, if that matters. The host is using ext4 partitions.

lsblk
NAME        MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
nvme0n1     259:0    0 238.5G  0 disk
├─nvme0n1p5 259:4    0     4G  0 part /var/log
├─nvme0n1p3 259:3    0   512M  0 part /boot/efi
├─nvme0n1p1 259:1    0    18G  0 part /
├─nvme0n1p6 259:5    0     1G  0 part [SWAP]
└─nvme0n1p2 259:2    0    18G  0 part
sda           8:0    0 465.8G  0 disk
├─sda9        8:9    0     8M  0 part
└─sda1        8:1    0 465.8G  0 part

dinhngtu

@ovicz I'd start by listing the VDIs on your SR to see if things are still there.

ovicz

@dinhngtu xe sr-list
uuid ( RO) : 5956893e-7041-d424-a35e-6e7449238663
name-label ( RW): LocalISO
name-description ( RW):
host ( RO): xcp-ng-akz
type ( RO): iso
content-type ( RO): iso

uuid ( RO) : dccbfe9d-3e28-2163-2ea9-b0e972a42804
name-label ( RW): DVD drives
name-description ( RW): Physical DVD drives
host ( RO): xcp-ng-akz
type ( RO): udev
content-type ( RO): iso

uuid ( RO) : 8e908932-0577-d6f0-3133-14d94a317b90
name-label ( RW): LocalZFS
name-description ( RW):
host ( RO): xcp-ng-akz
type ( RO): zfs
content-type ( RO): user

uuid ( RO) : b37d8ad6-d9fa-203b-b057-a834908ae0e7
name-label ( RW): Removable storage
name-description ( RW):
host ( RO): xcp-ng-akz
type ( RO): udev
content-type ( RO): disk

uuid ( RO) : 34133934-0724-f9c1-3831-79cf54445fae
name-label ( RW): XCP-ng Tools
name-description ( RW): XCP-ng Tools ISOs
host ( RO): xcp-ng-akz
type ( RO): iso
content-type ( RO): iso

Screenshot from 2025-12-17 11-42-47.png

dinhngtu

@ovicz How about xe vdi-list sr-uuid=8e908932-0577-d6f0-3133-14d94a317b90 ?

ovicz

@dinhngtu No output from that command.

dinhngtu

@ovicz Ok, I've contacted the storage team for a look.

ronan-a

@ovicz Can you share the /var/log/SMlog file from the master? It's quite strange considering that this driver is small and hasn't been modified.