@RealTehreal I've got a fix from Intel, and @stormi has packaged it.
yum update microcode_ctl --enablerepo=xcp-ng-testing
should get you microcode_ctl-2.1-26.xs29.2.xcpng8.2, which has the fixed microcode for this issue in it.
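As a quick sanity check afterwards (a hedged sketch; the exact log wording varies by version), you can confirm what's installed and what Xen actually loaded:
rpm -q microcode_ctl              # should now report 2.1-26.xs29.2.xcpng8.2
xl dmesg | grep -i microcode      # shows the revision Xen loaded at boot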
@rubberhose I've got a fix from Intel, and @stormi has packaged it.
yum update microcode_ctl --enablerepo=xcp-ng-testing
should get you microcode_ctl-2.1-26.xs29.2.xcpng8.2, which has the fixed microcode for this issue in it.
When you've got that installed, it should be safe to update back to the latest firmware.
@tomg That is the work, but it needs rebasing over the XSA-400 work, so a v4 series is going to be needed at a minimum.
HAP is Xen's vendor-neutral name for Intel EPT or AMD NPT hardware support. We have had superpage support for many years here.
IOMMU pagetables can either be shared with EPT/NPT (which reduces the memory overhead of running the VM), or split (required for AMD due to hardware incompatibilities, and also required to support migration of a VM with IO devices).
When pagetables are shared, the HAP superpage support gives the IOMMU superpages too (because they're literally the same set of pagetables in memory). When pagetables are split, HAP gets superpages while the IOMMU logic currently uses small pages.
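If you're curious which mode a given host is in, a hedged way to check (the exact wording of the log lines varies by Xen version and IOMMU vendor):
xl dmesg | grep -i -e iommu -e 'shared ept'    # look for whether shared EPT tables are enabled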
@tomg said in PCI Nvidia GPU Passthrough enable permissive?:
It appears to be consistently 20 - 30 seconds per RTX Ampere GPU, about 20 - 25 seconds on Quadros and ~90 seconds on an A100.
What's worse, on the A100 it seems the calls are made linearly, so say I pass through four A100s, the wait time to boot will be 4x90s; not optimal.
These are known, and yeah - they are not great. It's an issue in Xen where the IOMMU logic doesn't (yet) support superpage mappings, so the time delay you're observing is the time taken to map, unmap, and remap the GPU's massive BAR using 4k pages. (It's Qemu taking action in response to the actions of the guest.)
The good news is that IOMMU superpage support is in progress upstream, and should turn this delay into milliseconds.
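To put rough numbers on why superpages matter here (hypothetical sizing, not a measurement of your A100s):
# a 64 GiB BAR mapped with 4 KiB pages = 64 GiB / 4 KiB = 16,777,216 IOMMU mappings per pass
# the same BAR mapped with 1 GiB superpages = 64 mappings
Each map/unmap/remap cycle repeats that work, which is where the seconds go.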
I suggest using this as a learning opportunity. Look at the RPM log and see what depends on busybox, and therefore what (else) got uninstalled in order to keep the dependencies satisfied.
(Hint: you uninstalled all of Xapi, hence why nothing works)
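A hedged sketch of where to look (assuming default yum logging; transaction IDs will differ on your host):
grep -i erased /var/log/yum.log       # everything removed alongside busybox
yum history list busybox              # find the transaction that removed it
yum history info <transaction-id>     # full list of packages in that transaction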
@RealTehreal Thank-you very much for that information. I'll follow up with Intel.
In the short term, I'd recommend just using the old microcode.
@RealTehreal In addition to the XTF testing, could you also please try (with the bad microcode) booting Xen with spec-ctrl=no-verw on the command line, and see whether that changes the behaviour of your regular VMs? Please capture xl dmesg from this run too.
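For reference, a hedged sketch of making that change (grub.cfg lives in a different place on BIOS vs UEFI installs; adjust for your host):
# append spec-ctrl=no-verw to the line loading /boot/xen.gz in your grub.cfg, then:
reboot
xl dmesg > /tmp/xl-dmesg-no-verw.txt    # capture the log to attach here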
@RealTehreal It's an Intel issue, but while this is enough to show that there is an issue, it's not enough to figure out what is wrong.
Sadly, a VM falling into a busy loop can be one of many things. It's clearly on the (v)BSP prior to starting (v)APs, hence why it's only ever a single CPU spinning.
Can you switch to using the debug hypervisor (change the /boot/xen.gz symlink to point at the -d suffixed hypervisor), and then capture xl dmesg after trying to boot one VM? Depending on how broken things are, we might see some diagnostics.
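A hedged sketch of the symlink change (the versioned filename below is hypothetical; check what's actually in /boot on your host):
ls -l /boot/xen*                              # find the -d suffixed build
ln -sf xen-4.13.5-9.30.2-d.gz /boot/xen.gz    # hypothetical filename, use yours
reboot
xl dmesg > /tmp/xl-dmesg-debug.txt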
Could you also try running xtf as described here: https://xcp-ng.org/forum/post/57804 It's a long-shot, but if it does happen to stumble on the issue, then it will be orders of magnitude easier to debug than something miscellaneous broken in the middle of OVMF.
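For anyone following along, a rough outline of building XTF from source (the linked post has the authoritative instructions; treat this as a sketch):
git clone https://xenbits.xen.org/git-http/xtf.git
cd xtf && make -j$(nproc)
./xtf-runner selftest    # then run the tests the linked post asks for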
@t-chamberlain I've got a fix from Intel, and @stormi has packaged it.
yum update microcode_ctl --enablerepo=xcp-ng-testing
should get you microcode_ctl-2.1-26.xs29.2.xcpng8.2, which has the fixed microcode for this issue in it.
@t-chamberlain In addition to the XTF testing, could you also please try (with the bad microcode) booting Xen with spec-ctrl=no-verw on the command line, and see whether that changes the behaviour of your regular VMs? Please capture xl dmesg from this run too.
@mgigirey said in Issue after latest host update:
@andyhhp Any plans to update the intel-microcode for XCP-ng 8.3? The latest known version working in my setup is intel-microcode-20231009-1.xcpng8.3.noarch.rpm
I am not an XCP-ng developer. You'll have to ask @stormi for that.
@eb-xcp said in XCP-ng 8.3 betas and RCs feedback:
Edit: Confirmed; after enabling execution disable option within bios, installer booted without issues and the install is currently ongoing.
That is a bug. Xen is supposed to be able to detect this case and re-activate NX on its own.
The EFI path in your screenshot doesn't have the logic to re-activate it. IIRC, we weren't sure whether it was needed, because surely an EFI system wasn't still using Pentium4 compatibility. That reasoning was clearly wrong, and it's fairly easy to adjust.
However, fixing that path won't fix the normal MB2 path, which does have logic to reactivate and should have been able to cope fine.
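If you want to see what the firmware handed over (a hedged sketch from a bare-metal Linux boot on the same box, assuming msr-tools is installed; bit 34 of IA32_MISC_ENABLE is the XD-disable bit):
modprobe msr
rdmsr -f 34:34 0x1a0    # 1 means the firmware disabled NX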
What system do you have?
@flakpyro If Singlewire have already fixed the bug, then just do whatever is necessary to update the VM and be done with it.
That screenshot of grub poses far more questions than it answers, and I doubt we want to get into any of them.
This is ultimately a bug in Linux. There was a range of Linux kernels which did something unsafe on kexec that worked most of the time, but only by luck. (Specifically - holding a 64bit value in a register while passing through 32bit mode, and expecting it to still be intact later; both Intel and AMD identify this as having model-specific behaviour and say not to rely on it.)
A consequence of a security fix in Xen (https://xenbits.xen.org/xsa/advisory-454.html) is that it now reliably fails when depended upon in a VM.
Linux fixed the bug years ago, but one distro managed to pick it up.
Ideally, get SingleWire to fix their kernel. Failing that, adjust the VM's kernel command line to take any ,low or ,high off the crashkernel= line, because that was the underlying way to tickle the bug IIRC.
The property you need to end up with is that /proc/iomem shows the Crash kernel range being below the 4G boundary, because the handover logic from one kernel to the other simply didn't work correctly if the new kernel was above 4G.
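A hedged sketch of what to check and change inside the VM (the crashkernel= values are illustrative only):
grep -i 'crash kernel' /proc/iomem    # want this range below the 4G boundary
# in the VM's grub config, e.g. change:  crashkernel=512M,high  ->  crashkernel=512M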
Intel Xeon E5-2683 v4 CPUs vs E5-2697 v4 CPUs
You are correct. These are adjacent rows in the SKU table; they've got the same core count, and only differ by 500MHz frequency. They're basically identical as far as software is concerned.