Best posts made by andyhhp | XCP-ng and XO forum

andyhhp

@RealTehreal I've got a fix from Intel, and @stormi has packaged it.

yum update microcode_ctl --enablerepo=xcp-ng-testing should get you microcode_ctl-2.1-26.xs29.2.xcpng8.2 which has the fixed microcode for this issue in it.

andyhhp

@rubberhose I've got a fix from Intel, and @stormi has packaged it.

yum update microcode_ctl --enablerepo=xcp-ng-testing should get you microcode_ctl-2.1-26.xs29.2.xcpng8.2 which has the fixed microcode for this issue in it.

When you've got that installed, it should be safe to update back to the latest firmware.

andyhhp

@tomg That is the work, but it needs rebasing over the XSA-400 work, so a v4 series is going to be needed at a minimum.

HAP is Xen's vendor-neutral name for Intel EPT or AMD NPT hardware support. We have had superpage support for many years here.

IOMMU pagetables can either be shared with EPT/NPT (reduces the memory overhead of running the VM), or split (required for AMD due to hardware incompatibilities, and also required to support migration of a VM with IO devices).

When pagetables are shared, the HAP superpage support gives the IOMMU superpages too (because they're literally the same set of pagetables in memory). When pagetables are split, HAP gets superpages while the IOMMU logic currently uses small pages.

andyhhp

@tomg said in PCI Nvidia GPU Passthrough enable permissive?:

It appears to be consistently 20 - 30 seconds per RTX Ampere GPU, about 20 - 25 seconds on Quadros and ~90 seconds on an A100.
What's worse on the A100, it seems the calls are made linear so say I pass through four A100s the wait time to boot will be 4x90s, not optimal.

These are known, and yeah - they are not great. It's an issue in Xen where the IOMMU logic doesn't (yet) support superpage mappings, so the time delay you're observing is the time taken to map, unmap, and remap the GPU's massive BAR using 4k pages. (It's Qemu taking action in response to the actions of the guest.)

The good news is that IOMMU superpage support is in progress upstream, and should turn this delay into milliseconds.
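To give a feel for the scale involved, here is a rough back-of-the-envelope count of how many IOMMU entries are needed to cover a BAR at different page sizes. The 64 GiB BAR size is an assumed figure chosen purely for illustration, and `mappings_needed` is a hypothetical helper, not anything from Xen.

```python
KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB


def mappings_needed(bar_bytes, page_bytes):
    """Number of page-sized mappings needed to cover bar_bytes."""
    return -(-bar_bytes // page_bytes)  # ceiling division


bar = 64 * GIB  # assumed BAR size, for illustration only

for name, size in (("4k", 4 * KIB), ("2M", 2 * MIB), ("1G", GIB)):
    print(f"{name:>2} pages: {mappings_needed(bar, size):>10,} mappings")
```

Under that assumption, 4k pages need roughly 16.8 million entries, versus 32,768 with 2M superpages or 64 with 1G superpages, and the map/unmap/remap cycle pays that cost several times over.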

andyhhp

I suggest using this as a learning opportunity. Look at the RPM log and see what depends on busybox, and therefore what (else) got uninstalled in order to keep the dependencies satisfied.

(Hint: you uninstalled all of Xapi, hence why nothing works)

andyhhp

@RealTehreal Thank you very much for that information. I'll follow up with Intel.

In the short term, I'd recommend just using the old microcode.

andyhhp

@RealTehreal In addition to the XTF testing, could you also please try (with the bad microcode) booting Xen with spec-ctrl=no-verw on the command line, and seeing whether that changes the behaviour of your regular VMs? Please capture xl dmesg from this run too.

andyhhp

@RealTehreal It's an Intel issue, but while this is enough to show that there is an issue, it's not enough to figure out what is wrong.

Sadly, a VM falling into a busy loop can be one of many things. It's clearly on the (v)BSP prior to starting (v)APs, hence why it's only ever a single CPU spinning.

Can you switch to using the debug hypervisor (change the /boot/xen.gz symlink to point at the -d suffixed hypervisor), and then capture xl dmesg after trying to boot one VM? Depending on how broken things are, we might see some diagnostics.
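If it helps, the symlink switch can be sketched like this. It assumes the debug build sits next to the regular one and has a name of the form xen-<version>-d.gz; that exact pattern is an assumption, so check what is actually in /boot first. Both helper names are hypothetical.

```python
import os


def debug_candidates(boot_dir):
    """List names in boot_dir that look like debug hypervisors.

    Assumes a 'xen-' prefix and a '-d.gz' suffix, which may not match your system.
    """
    return sorted(
        name for name in os.listdir(boot_dir)
        if name.startswith("xen-") and name.endswith("-d.gz")
    )


def repoint_symlink(link_path, target_name):
    """Make link_path a symlink to target_name, replacing the old link in one step."""
    tmp = link_path + ".tmp"
    if os.path.lexists(tmp):
        os.remove(tmp)
    os.symlink(target_name, tmp)   # target is relative to the link's directory
    os.replace(tmp, link_path)     # rename over the existing symlink
```

On a host this would be something like `repoint_symlink("/boot/xen.gz", debug_candidates("/boot")[0])`, run as root and followed by a reboot. Note down the old target with `os.readlink("/boot/xen.gz")` beforehand so you can switch back.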

Could you also try running xtf as described here: https://xcp-ng.org/forum/post/57804 It's a long-shot, but if it does happen to stumble on the issue, then it will be orders of magnitude easier to debug than something misc broken in the middle of OVMF.

andyhhp

@AlbertK That looks suspiciously like you've enabled nested virt in the VM. Can you confirm whether you have or not?

andyhhp

@steff22 said in Gpu passthrough on Asrock rack B650D4U3-2L2Q will not work:

what kind of magic have you put in the last 7 patches?

You've got a very recent AMD processor, so it's probably this fix https://xenbits.xen.org/gitweb/?p=xen.git;a=commitdiff;h=86001b3970fea4536048607ea6e12541736c48e1 from upstream.

andyhhp

@t-chamberlain I've got a fix from Intel, and @stormi has packaged it.

yum update microcode_ctl --enablerepo=xcp-ng-testing should get you microcode_ctl-2.1-26.xs29.2.xcpng8.2 which has the fixed microcode for this issue in it.

andyhhp

@t-chamberlain In addition to the XTF testing, could you also please (with the bad microcode) try booting Xen with spec-ctrl=no-verw on the command line, and seeing whether that changes the behaviour of your regular VMs? Please capture xl dmesg from this run too.

andyhhp

@flakpyro

This is ultimately a bug in Linux. There was a range of Linux kernels which did something unsafe on kexec which worked most of the time but only by luck. (Specifically - holding a 64bit value in a register while passing through 32bit mode, and expecting it to still be intact later; both Intel and AMD document this as model-specific behaviour that should not be relied upon.)

A consequence of a security fix in Xen (https://xenbits.xen.org/xsa/advisory-454.html) makes it reliably fail when depended upon in a VM.

Linux fixed the bug years ago, but one distro managed to pick it up.

Ideally, get SingleWire to fix their kernel. Failing that, adjust the VM's kernel command line to take any ,low or ,high off the crashkernel= line, because that was the underlying way to tickle the bug IIRC.
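As a minimal sketch of that edit, the following strips a trailing ,low or ,high from every crashkernel= argument in a command-line string. `strip_crashkernel_suffix` is a hypothetical helper and the sample command line is made up; where the command line actually lives (GRUB config or otherwise) depends on the distro.

```python
import re


def strip_crashkernel_suffix(cmdline):
    """Drop a trailing ,low or ,high from each crashkernel= argument."""
    return re.sub(r"(\bcrashkernel=\S*?),(?:low|high)\b", r"\1", cmdline)


# Made-up example command line, for illustration only.
print(strip_crashkernel_suffix("ro quiet crashkernel=512M,high"))
```

Arguments without either suffix (for example crashkernel=auto) are left untouched. The VM needs a reboot for the new reservation to take effect.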

The property you need to end up with is that /proc/iomem shows the Crash kernel range being below the 4G boundary, because the handover logic from one kernel to the other simply didn't work correctly if the new kernel was above 4G.
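A quick way to check that property is to parse the Crash kernel lines out of /proc/iomem and compare the end address against 4G. This is a sketch under the assumption that entries look like `start-end : Crash kernel` in hex; the sample text below is invented, and the helper names are hypothetical.

```python
FOUR_GIB = 1 << 32


def crash_kernel_ranges(iomem_text):
    """Return (start, end) pairs for every 'Crash kernel' entry in /proc/iomem text."""
    ranges = []
    for line in iomem_text.splitlines():
        span, _, name = line.partition(" : ")
        if name.strip() == "Crash kernel":
            start, end = (int(part, 16) for part in span.strip().split("-"))
            ranges.append((start, end))
    return ranges


def all_below_4g(ranges):
    """True when at least one range exists and every range ends below 4 GiB."""
    return bool(ranges) and all(end < FOUR_GIB for _, end in ranges)


# Invented sample in the usual /proc/iomem layout.
SAMPLE = """\
00100000-bfffffff : System RAM
  2b000000-3affffff : Crash kernel
100000000-23fffffff : System RAM
"""

print(all_below_4g(crash_kernel_ranges(SAMPLE)))
```

In the VM, pass it `open("/proc/iomem").read()` instead of the sample, and read the file as root, because unprivileged reads show the addresses as zeros.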

andyhhp

@Andrew said in Xen 4.17 on XCP-ng 8.3!:

xtf HARD system freeze at test-hvm64-xsa-304. (only XCP hard lockup I have seen)
xtf With ept=no-exec-sp, all tests SKIP/SUCCESS.

XSA-304 is https://www.intel.com/content/www/us/en/developer/articles/troubleshooting/software-security-guidance/technical-documentation/machine-check-error-avoidance-page-size-change.html

It's guest exploitable, and locks up the CPU so hard it doesn't even reset properly. It's also very expensive to work around, hence why it's not mitigated by default.

andyhhp

@Andrew Intel E5450, that's very retro.

It's also first-gen VT-x and doesn't have HAP, which is why the test that is looking explicitly for HAP doesn't work.

As a stopgap, remove hap from the VARY-CFG := hap shadow line in tests/invlpg/Makefile and rebuild. In the meantime I'll try to figure out a nice way to cope with this.
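The stopgap edit can be sketched as below. It only relies on the `VARY-CFG := hap shadow` line quoted above; `drop_hap` is a hypothetical helper and the rest of the Makefile is passed through unchanged.

```python
import re


def drop_hap(makefile_text):
    """Remove the 'hap' word from the VARY-CFG assignment, leaving other lines alone."""
    def fix(match):
        words = [w for w in match.group(2).split() if w != "hap"]
        return match.group(1) + " ".join(words)

    return re.sub(r"^(VARY-CFG[ \t]*:=[ \t]*)(.*)$", fix, makefile_text,
                  flags=re.MULTILINE)


print(drop_hap("VARY-CFG := hap shadow\n"), end="")
```

To apply it, read tests/invlpg/Makefile, write the result of `drop_hap` back, and rebuild.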

andyhhp

@olivierlambert said in XCP-ng 8.3 public alpha:

Your Xen guru badge is well earned @andyhhp

"purveyor of general grumpiness"

andyhhp

@Andrew Those are normal.

Bad rIP is actually an error introduced in XSA-170 because someone misread the Intel manual. I've been trying to delete it upstream for years now. It's been so long that Intel nearly released a feature which would have required us to delete that check, and I successfully persuaded the Intel documentation team to add a footnote clarifying the statement which was misinterpreted during XSA-170.

At some point in my copious free never, I should restart the argument to delete it upstream...

The other two are logging from the XSA-260 fix. There's an error(/misfeature) in the x86 architecture and those would have been privilege escalations before the fix was in place. I decided when fixing XSA-260 that such attempts shouldn't be entirely silent, hence the one-liner. That particular printk() is actually common with other debugging routines, so can occur during regular development.

andyhhp

Intel Xeon E5-2683 v4 CPUs vs E5-2697 v4 CPUs

You are correct. These are adjacent rows in the SKU table; they've got the same core count, and differ only by 500MHz in frequency. They're basically identical as far as software is concerned.

andyhhp

So, we've had reports on xen-devel which look a little like this.

@BlueBadger are you able to switch back to your 7950x and try booting Xen with x2apic_phys=true ? It appears that the -X processors are missing a feature in their IOMMU and Xen was getting confused when setting up interrupt handling.

https://xenbits.xen.org/gitweb/?p=xen.git;a=commitdiff;h=0d2686f6b66b4b1b3c72c3525083b0ce02830054 is at least part of the fix, but so far feedback on the mailing lists suggests it's not a complete fix.

andyhhp

This is way, way outside of a normal-ish looking server use case. I'm honestly surprised you've got anything to function...

To start with, you're probably booting Xen with console=vga (because that's the default). It will be handed over to dom0 too, so start by going through the bootloader configuration and making sure that neither Xen nor dom0 are trying to use the display at all.

I suspect this is the root cause of the display going periodically back to black.
