XCP-ng

    Andrew Cooper

    @andyhhp

    Xen Guru

    Hypervisor and kernel hacker. XenServer (formerly Citrix Hypervisor), upstream Xen maintainer and security team member.

    Probably knows a thing or two...

    69 Reputation · 77 Profile views · 40 Posts · 1 Follower · 0 Following


    Best posts made by andyhhp

    • RE: Issue after latest host update

      @RealTehreal I've got a fix from Intel, and @stormi has packaged it.

      yum update microcode_ctl --enablerepo=xcp-ng-testing should get you microcode_ctl-2.1-26.xs29.2.xcpng8.2 which has the fixed microcode for this issue in it.

      posted in XCP-ng
    • RE: Dell Wyse FW update breaks VM booting; console frozen; TianoCore/EDK2 related?

      @rubberhose I've got a fix from Intel, and @stormi has packaged it.

      yum update microcode_ctl --enablerepo=xcp-ng-testing should get you microcode_ctl-2.1-26.xs29.2.xcpng8.2 which has the fixed microcode for this issue in it.

      When you've got that installed, it should be safe to update back to the latest firmware.

      posted in Compute
    • RE: Question on CPU masking with qemu and xen

      @cg said in Question on CPU masking with qemu and xen:

      In the early days (~XenServer 6) it had to be done manually

      Yes, and I rewrote it entirely in XenServer 7 because doing it manually was absurd.

      tl;dr, for your case:

      1. Add the Gen12s to the pool
      2. Migrate remaining VMs off the Gen9s
        2a. Any VMs which can't migrate for feature reasons, reboot first, then migrate
      3. Remove the Gen9s from the pool
      4. Reboot all VMs

      The longer answer:

      When Xen boots, it calculates what it can offer to guests, feature-wise. This takes into account the CPU, firmware settings, errata, command line parameters, etc. This feature information is made available to the toolstack/xapi to work with. On a per-VM basis, Xen knows the features that the guest was given. Different VMs can have different configurations, even if they're running on the same host.

      An individual VM's features are fixed during its uptime (including migrate). The only point at which the features can safely change is when the VM reboots. All the migration safety checks are performed as "is the featureset this VM saw at boot compatible with the destination host it's trying to run on".

      At a pool level, Xapi always dynamically calculates the "pool level", i.e. the common subset[*] of features that will allow a VM to migrate to anywhere in the pool. Importantly, this is recalculated as pool members join and leave the pool, including a pool member rebooting (where it leaves temporarily, then rejoins; feature information may change after the reboot, e.g. after changing a firmware or command line setting).

      When a VM boots, it gets given the "pool level" by default, meaning that it should be able to migrate anywhere in the pool as the pool existed at the point of booting the VM. If you subsequently add a new host to the pool, the pool level may drop and already-running VMs will be unable to migrate to this new host, but will be able to migrate to other pool members.

      As you remove members from the pool, the pool level may rise, e.g. if you removed the only host that was lacking a certain feature. The final reboot in your case is to allow the VMs to start using the Gen12 feature baseline, now that it's not "levelled down" for compatibility with the Gen9s.

      ~Andrew

      [*] While subset is the intuitive way to think of this operation, it's not actually a subset in the mathematical sense. Some features behave differently to maintain safety for the VM.
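      The levelling logic above can be sketched with plain Python sets. This is an illustrative model only: real featuresets are CPUID/MSR bitmaps, and per the [*] footnote the real operation is not a pure mathematical intersection.

```python
# Toy model of feature levelling: featuresets as plain sets.
# Real Xen featuresets are CPUID/MSR bitmaps, and the real
# operation is not a pure mathematical intersection/subset.

gen9 = {"sse4.2", "avx"}
gen12 = {"sse4.2", "avx", "avx2", "avx512f"}

def pool_level(hosts):
    """Common subset of features across all pool members."""
    return set.intersection(*hosts)

def can_migrate(vm_features, dest_host):
    """A VM may migrate if the featureset it saw at boot is
    compatible with (here: a subset of) the destination host."""
    return vm_features <= dest_host

# A VM booted while the Gen9s were still in the pool gets the
# levelled-down featureset, and can migrate anywhere:
vm = pool_level([gen9, gen12])
assert can_migrate(vm, gen9) and can_migrate(vm, gen12)

# After the Gen9s leave, the pool level rises...
assert pool_level([gen12]) > vm
# ...but a running VM only picks up the new baseline when it reboots.
```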

      posted in Compute
    • RE: PCI Nvidia GPU Passthrough boot delay

      @tomg That is the work, but it needs rebasing over the XSA-400 work, so a v4 series is going to be needed at a minimum.

      HAP is Xen's vendor-neutral name for Intel EPT or AMD NPT hardware support. We have had superpage support for many years here.

      IOMMU pagetables can either be shared with EPT/NPT (reduces the memory overhead of running the VM), or split (required for AMD due to hardware incompatibilities, and also required to support migration of a VM with IO devices).

      When pagetables are shared, the HAP superpage support gives the IOMMU superpages too (because they're literally the same set of pagetables in memory). When pagetables are split, HAP gets superpages while the IOMMU logic currently uses small pages.
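      The shared-vs-split behaviour can be modelled in a few lines (a toy sketch with made-up names, not Xen's implementation): in shared mode the IOMMU "tables" are literally the same object as the HAP tables, so superpage entries come for free, while in split mode they are maintained independently.

```python
# Toy model of shared vs split pagetables (names are illustrative,
# not Xen's). A "pagetable" here is just a dict of GFN -> (MFN, size).

def make_tables(shared):
    hap = {}
    iommu = hap if shared else {}   # shared: literally the same object
    return hap, iommu

# Shared: a 2M superpage mapping written via HAP is seen by the IOMMU.
hap, iommu = make_tables(shared=True)
hap[0x100000] = ("mfn", "2M")
assert hap is iommu
assert iommu[0x100000] == ("mfn", "2M")

# Split: the IOMMU tables must be populated separately, and (currently)
# only with small pages.
hap, iommu = make_tables(shared=False)
hap[0x100000] = ("mfn", "2M")
assert 0x100000 not in iommu
```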

      posted in Compute
    • RE: PCI Nvidia GPU Passthrough boot delay

      @tomg said in PCI Nvidia GPU Passthrough enable permissive?:

      It appears to be consistently 20 - 30 seconds per RTX Ampere GPU, about 20 - 25 seconds on Quadros and ~90 seconds on an A100.
      What's worse on the A100, it seems the calls are made linear so say I pass through four A100s the wait time to boot will be 4x90s, not optimal.

      These are known, and yeah - they are not great. It's an issue in Xen where the IOMMU logic doesn't (yet) support superpage mappings, so the time delay you're observing is the time taken to map, unmap, and remap the GPU's massive BAR using 4k pages. (It's Qemu taking action in response to the actions of the guest.)

      The good news is that IOMMU superpage support is in progress upstream, and should turn this delay into milliseconds.
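      Back-of-the-envelope arithmetic shows why page size dominates the delay (the 64 GiB BAR size below is an assumed figure, purely for illustration):

```python
# Number of IOMMU map operations needed to cover a large GPU BAR.
# The BAR size is an assumed figure for illustration only.
GiB = 1 << 30
bar = 64 * GiB

maps_4k = bar // (4 << 10)   # 4 KiB pages
maps_2m = bar // (2 << 20)   # 2 MiB superpages

print(maps_4k)   # 16777216 operations with 4 KiB pages
print(maps_2m)   # 32768 operations with 2 MiB superpages
```

      Each map/unmap/remap pass repeats that count, which is why moving to superpages turns tens of seconds of hypercall work into milliseconds.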

      posted in Compute
    • RE: Oops! We removed busybox

      I suggest using this as a learning opportunity. Look at the RPM log and see what depends on busybox, and therefore what (else) got uninstalled in order to keep the dependencies satisfied.

      (Hint: you uninstalled all of Xapi, hence why nothing works)

      posted in XCP-ng
    • RE: Issue after latest host update

      @RealTehreal Thank you very much for that information. I'll follow up with Intel.

      In the short term, I'd recommend just using the old microcode.

      posted in XCP-ng
    • RE: Issue after latest host update

      @RealTehreal In addition to the XTF testing, could you also please try (with the bad microcode) booting Xen with spec-ctrl=no-verw on the command line, and seeing whether that changes the behaviour of your regular VMs? Please capture xl dmesg from this run too.

      posted in XCP-ng
    • RE: Issue after latest host update

      @RealTehreal It's an Intel issue, but while this is enough to show that there is an issue, it's not enough to figure out what is wrong.

      Sadly, a VM falling into a busy loop can be one of many things. It's clearly on the (v)BSP prior to starting (v)APs, hence why it's only ever a single CPU spinning.

      Can you switch to using the debug hypervisor (change the /boot/xen.gz symlink to point at the -d suffixed hypervisor), and then capture xl dmesg after trying to boot one VM? Depending on how broken things are, we might see some diagnostics.

      Could you also try running xtf as described here: https://xcp-ng.org/forum/post/57804 - it's a long shot, but if it does happen to stumble on the issue, then it will be orders of magnitude easier to debug than something obscure broken in the middle of OVMF.

      posted in XCP-ng
    • RE: XCP-ng 8.3 with VM crashing

      @AlbertK That looks suspiciously like you've enabled nested virt in the VM. Can you confirm whether you have or not?

      posted in Hardware

    Latest posts made by andyhhp

    • RE: Question on CPU masking with qemu and xen

      For documentation purposes, there's a more general step of "Any VM you can shut down, do".

      Live Migration is great for VMs which need to stay up, but it's not free, and not even cheap. You will finish quicker if you can shut down VMs you don't need, migrate fewer things, and then (re)boot everything at the end.

      posted in Compute
    • RE: Non-server CPU compatibility - Ryzen and Intel

      Xen has no awareness of 3D V-Cache. All 16 cores will be considered equal. Your vCPU may be on a 3D V-Cache core one millisecond, then on a non-3D V-Cache core the next.

      If you really want to alter this, you can pin your VM to one group of cores or the other.

      However, do not make the mistake of thinking of some of these cores as "performance cores" and the others not. The ones with 3D V-Cache will outperform the others on a wide variety of workloads, despite not being able to turbo to the same degree.

      posted in Compute
    • RE: XCP-ng 8.3 with VM crashing

      As I said before, this is looking like a buggy CPU, and you've proved it, given a week with no incidents when CPU8 is excluded.

      posted in Hardware
    • RE: Diagnosing frequent crashes on host

      @the_jest Ok, so it's a logical bug in Linux. Have you updated the dom0 kernel recently? Can you revert back to the older build and see if that changes the behaviour?

      posted in XCP-ng
    • RE: Diagnosing frequent crashes on host

      @the_jest said in Diagnosing frequent crashes on host:

      but I figured I'd mention it. (Also, "Shot down" should be "Shut down".)

      Shot down is correct. It is the past tense of "Shoot down", because the companion message you get when something went wrong is "Failed to shoot down $CPUS", and is the single most valuable print message I've ever inserted into the code.

      @the_jest said in Diagnosing frequent crashes on host:

      I've looked at /var/crash, but there's so much stuff there I don't know where to start,

      The snippet of xen.log you've posted suggests it's a Linux kernel crash, so look at dom0.log, right at the end.

      posted in XCP-ng
    • RE: XCP-ng 8.3 with VM crashing

      @AlbertK Thanks. There's no nested-virt configured there.

      I have to admit this is looking more and more like a buggy CPU. Memory corruption is a possibility, but this is a clearly corrupt field in the middle of otherwise sane-looking fields in the VMCB.

      Do you have any other identical systems? Can you swap this CPU out for another one to see what happens?

      posted in Hardware
    • RE: XCP-ng 8.3 with VM crashing

      @AlbertK None of those commands are relevant in a Xen system. You want xe vm-param-list uuid=$VM

      posted in Hardware
    • RE: Gpu passthrough on Asrock rack B650D4U3-2L2Q will not work

      @steff22 said in Gpu passthrough on Asrock rack B650D4U3-2L2Q will not work:

      what kind of magic have you put in the last 7 patches?

      You've got a very recent AMD processor, so it's probably this fix https://xenbits.xen.org/gitweb/?p=xen.git;a=commitdiff;h=86001b3970fea4536048607ea6e12541736c48e1 from upstream.

      posted in Hardware