@RealTehreal Thank you very much for that information. I'll follow up with Intel.
In the short term, I'd recommend just using the old microcode.
@RealTehreal Sorry to keep adding to the list of diagnostics, but everything here will help. After you've tried the other options, could you try this:
If the XTF testing shows any XTF test looping, use that single test; otherwise use your regular VM. Get one VM into the looping state. Check xl list to confirm that you've only got Domain-0 and the one other VM, and note its domid (the "ID" column).
In dom0, run xentrace to capture a system trace. It's looping so the dump file is going to be large, but it also means that you can CTRL-C as quickly as you can on the shell and it will be fine (a few hundred milliseconds of samples will almost certainly be enough).
Anyway, run xentrace -D -e 0x0008f000 xentrace.dmp and then send me the created xentrace.dmp file. If you're interested in what's in it, you can decode it using xenalyze -a xentrace.dmp |& less.
Then, run xen-hvmctx $domid two or three times, and share all of the output.
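In case it helps, the whole sequence looks roughly like this. It's a sketch only: the xl list output and the domid below are made up, and the real capture commands are commented out because they need a live Xen host.

```shell
# Illustrative only: sample `xl list` output with one looping guest.
xl_list_output='Name                                        ID   Mem VCPUs      State   Time(s)
Domain-0                                     0  2048     4     r-----     123.4
looping-guest                                3   128     1     r-----     999.9'

# Pull the ID of the first non-Domain-0 guest (the "ID" column):
domid=$(printf '%s\n' "$xl_list_output" | awk 'NR > 1 && $1 != "Domain-0" { print $2; exit }')
echo "domid=$domid"

# On the real host (not run here):
#   xentrace -D -e 0x0008f000 xentrace.dmp   # Ctrl-C after a moment
#   xenalyze -a xentrace.dmp |& less         # optional: decode locally
#   xen-hvmctx "$domid"                      # run two or three times
```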
@t-chamberlain In addition to the XTF testing, could you also please (with the bad microcode) try booting Xen with spec-ctrl=no-verw on the command line, and see whether that changes the behaviour of your regular VMs? Please capture xl dmesg from this run too.
@RealTehreal In addition to the XTF testing, could you also please try (with the bad microcode) booting Xen with spec-ctrl=no-verw on the command line, and see whether that changes the behaviour of your regular VMs? Please capture xl dmesg from this run too.
@RealTehreal It's an Intel issue, but while this is enough to show that there is an issue, it's not enough to figure out what is wrong.
Sadly, a VM falling into a busy loop can be one of many things. It's clearly on the (v)BSP prior to starting (v)APs, hence why it's only ever a single CPU spinning.
Can you switch to using the debug hypervisor (change the /boot/xen.gz symlink to point at the -d suffixed hypervisor), and then capture xl dmesg after trying to boot one VM? Depending on how broken things are, we might see some diagnostics.
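If it's useful, the symlink dance looks something like this. It's done in a scratch directory here so it's safe to copy-paste as-is; on a real host you'd operate in /boot, and the versioned filenames are examples (check ls /boot for the real ones).

```shell
# Scratch directory with stand-in files, so this is safe to run as-is.
boot=$(mktemp -d)
touch "$boot/xen-4.17.0.gz" "$boot/xen-4.17.0-d.gz"   # stand-in files
ln -sf xen-4.17.0.gz "$boot/xen.gz"      # the usual release hypervisor
ln -sf xen-4.17.0-d.gz "$boot/xen.gz"    # repoint at the -d debug build
readlink "$boot/xen.gz"
```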
Could you also try running xtf as described here: https://xcp-ng.org/forum/post/57804. It's a long shot, but if it does happen to stumble on the issue, then it will be orders of magnitude easier to debug than something miscellaneous broken in the middle of OVMF.
Windows isn't going to be tricked into being happy about the CPU just by changing the reported model. It cross-checks real features, and you simply can't fake those up.
There is no ability in XenServer/XCP-ng to configure this, and I have no intention to offer people the ability to shoot themselves in the foot like this.
@Andrew said in Xen 4.17 on XCP-ng 8.3!:
xtf HARD system freeze at test-hvm64-xsa-304. (only XCP hard lockup I have seen)
xtf With ept=no-exec-sp, all tests SKIP/SUCCESS.
It's guest exploitable, and locks up the CPU so hard it doesn't even reset properly. It's also very expensive to work around, hence why it's not mitigated by default.
So, we've had reports on xen-devel which look a little like this.
@BlueBadger are you able to switch back to your 7950x and try booting Xen with x2apic_phys=true? It appears that the -X processors are missing a feature in their IOMMU, and Xen was getting confused when setting up interrupt handling.
https://xenbits.xen.org/gitweb/?p=xen.git;a=commitdiff;h=0d2686f6b66b4b1b3c72c3525083b0ce02830054 is at least part of the fix, but so far feedback on the mailing lists suggests it's not a complete fix.
This is way, way outside of a normal-ish looking server use case. I'm honestly surprised you've got anything functioning...
To start with, you're probably booting Xen with console=vga (because that's the default). It will be handed over to dom0 too, so start by going through the bootloader configuration and making sure that neither Xen nor dom0 is trying to use the display at all.
I suspect this is the root cause of the display going periodically back to black.
You cannot mix Intel and AMD CPUs in the same pool. They're simply not compatible enough for a VM to survive a migration between the two, so we explicitly disallow it. (You can't even --force a pool join to this effect.)
@Andrew Intel E5450, that's very retro.
It's also first-gen VT-x and doesn't have HAP, which is why the test that is looking explicitly for HAP doesn't work.
As a stopgap, remove hap from the VARY-CFG := hap shadow line in tests/invlpg/Makefile and rebuild. In the meantime I'll try to figure out a nice way to cope with this.
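i.e. something like this. The sed is just to illustrate the one-line edit on a string; you can equally delete "hap" from the Makefile in an editor.

```shell
# The one-line Makefile edit, shown on a string for illustration.
line='VARY-CFG := hap shadow'
patched=$(printf '%s\n' "$line" | sed 's/:= hap /:= /')
echo "$patched"
```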
@olivierlambert said in XCP-ng 8.3 public alpha :
Your Xen guru badge is well earned @andyhhp
"purveyor of general grumpiness"
@Andrew Those are normal.
Bad rIP is actually an error introduced in XSA-170 because someone misread the Intel manual. I've been trying to delete it upstream for years now. It's been so long that Intel nearly released a feature which would have required us to delete that check, and I successfully persuaded the Intel documentation team to add a footnote clarifying the statement which was misinterpreted during XSA-170.
At some point in my copious free never, I should restart the argument to delete it upstream...
The other two are logging from the XSA-260 fix. There's an error(/misfeature) in the x86 architecture and those would have been privilege escalations before the fix was in place. I decided when fixing XSA-260 that such attempts shouldn't be entirely silent, hence the one-liner. That particular printk() is actually common with other debugging routines, so can occur during regular development.
You're trying to lie to the VM and tell it that it's running on a system with 24 physical sockets, each with a single core.
For reference, 2 sockets is the biggest AMD server that you can buy (these days), and Intel top out at 8. If you want a larger system, you could buy a SuperDome which can manage up to 32 sockets (before hitting other limits of UPI switching).
The various historical enumeration schemes can't encode that high, which is why there's a sanity check in XenCenter.
You typically want 1 socket, so select 24 cores / socket.
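As a sketch, the topology arithmetic (and the xe plumbing I'd expect you to end up using) looks like this. The uuid is a placeholder, and platform:cores-per-socket is assumed to be the key your host uses for guest topology, so verify against your own setup.

```shell
# Topology arithmetic for 24 vCPUs presented as 1 socket.
vcpus=24
sockets=1
cores_per_socket=$((vcpus / sockets))
echo "cores-per-socket=$cores_per_socket"
# On the host (not run here; uuid is a placeholder):
#   xe vm-param-set uuid=<vm-uuid> platform:cores-per-socket=24
```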
~Andrew
This is unconditional for a reason. The CSTATE errata in Nehalem are crippling - IIRC a core going in and out of a deep C-state does not maintain cache coherency correctly, resulting in arbitrary memory corruption.
You really do care about not hitting these errata, even on a test/hobby server.
@planedrop If you boot Xen with nmi=dom0, they'll be forwarded to dom0 rather than being treated as fatal.
Could you also get lspci -tv for this system? The IO_PAGE_FAULT is for a different device to the one reporting an AER BadTLP in dom0, and has a wildly bogus address, so we need to figure out if the two errors are related or independent.
BadTLP is a problem, usually indicative of an electrical contact issue in the slot. Whatever is downstream of 00:01.1 wants unplugging, dusting out thoroughly, then confirming that it's adequately reseated.
@planedrop said in Passed Through GPU Crashes Host During Driver Install:
Had a Panic on CPU 0 code and a reboot.
Ok - let's do things one at a time. Can you start a new thread and provide the logs? (Ignore the vcpu/domain/stack hexdump log files; xca.log/xen.log/dom0.log are the interesting ones.)
@planedrop Ok, so it's a host lockup rather than a crash. That's a bit more irritating to debug.
First of all, can you update to the debug hypervisor? Adjust the /boot/xen.gz -> $foo symlink to use the version of Xen with the -d.gz suffix. This is the same hypervisor changeset, but with assertions and extra verbosity enabled.
Also, can you append ,keep to Xen's vga= option on the command line? This should cause Xen to keep on writing out onto the screen even after dom0 has started up. Depending on the system, this might be a bit glacial, but dom0 will come up eventually.
Then reproduce the hang. Hopefully there'll be some output from Xen before the system locks up. You might also want to consider adding noreboot to Xen's command line too, especially if there's a backtrace and you want to take a photo of it to attach here.
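For illustration, the command-line edit would look something like this, operating on a sample string; the actual vga mode and the file holding the Xen options depend on your bootloader setup.

```shell
# Sketch on a sample Xen command line; the vga mode is an example.
cmdline='/boot/xen.gz dom0_mem=4096M,max:4096M vga=mode-0x0311'
patched=$(printf '%s\n' "$cmdline" | sed 's/\(vga=[^ ]*\)/\1,keep/')
patched="$patched noreboot"
echo "$patched"
```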
@planedrop By host crash, do you mean a reboot, or something getting wedged and requiring manual intervention? Any logs in /var/crash/ in dom0?
Judging by the consumer motherboard, I presume you don't have a serial console. Anything show up on the screen at the point of crash?
@tomg That is the work, but it needs rebasing over the XSA-400 work, so a v4 series is going to be needed at a minimum.
HAP is Xen's vendor-neutral name for Intel EPT or AMD NPT hardware support. We have had superpage support for many years here.
IOMMU pagetables can either be shared with EPT/NPT (reduces the memory overhead of running the VM), or split (required for AMD due to hardware incompatibilities, and also required to support migration of a VM with IO devices).
When pagetables are shared, the HAP superpage support gives the IOMMU superpages too (because they're literally the same set of pagetables in memory). When pagetables are split, HAP gets superpages while the IOMMU logic currently uses small pages.
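To put numbers on why superpages matter, here's a quick back-of-envelope count of leaf pagetable entries needed to map 1 GiB of guest memory at each leaf size (pure arithmetic, safe to run anywhere):

```shell
# Leaf pagetable entries needed to map 1 GiB at each leaf size.
gib=$((1024 * 1024 * 1024))
echo "4KiB leaves: $((gib / 4096))"               # 262144 entries
echo "2MiB leaves: $((gib / (2 * 1024 * 1024)))"  # 512 entries
echo "1GiB leaves: $((gib / gib))"                # 1 entry
```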