As I said before, this is looking like a buggy CPU, and you've proved it, given a week with no incident if CPU8 is excluded.
Posts
-
RE: XCP-ng 8.3 with VM crashing
-
RE: Diagnosing frequent crashes on host
@the_jest Ok, so it's a logical bug in Linux. Have you updated the dom0 kernel recently? Can you revert back to the older build and see if that changes the behaviour?
-
RE: Diagnosing frequent crashes on host
@the_jest said in Diagnosing frequent crashes on host:
but I figured I'd mention it. (Also, "Shot down" should be "Shut down".)
Shot down is correct. It is the past tense of "Shoot down", because the companion message you get when something went wrong is "Failed to shoot down $CPUS", which is the single most valuable print message I've ever inserted into the code.
@the_jest said in Diagnosing frequent crashes on host:
I've looked at /var/crash, but there's so much stuff there I don't know where to start,
The snippet of xen.log you've posted suggests it's a Linux kernel crash, so look at dom0.log, and right at the end.
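A minimal sketch of "look at dom0.log, right at the end". It assumes each crash is stored in its own subdirectory under /var/crash and that the subdirectory contains a dom0.log; the function name `last_dom0_log` is made up for illustration.

```shell
# Print the last N lines of dom0.log from the most recent crash.
# Assumption: the crash root holds one subdirectory per crash,
# each with its own dom0.log.
last_dom0_log() {
    crash_root="${1:-/var/crash}"
    lines="${2:-50}"
    # Pick the most recently modified dom0.log.
    newest=$(ls -t "$crash_root"/*/dom0.log 2>/dev/null | head -n 1)
    if [ -z "$newest" ]; then
        echo "no dom0.log found under $crash_root" >&2
        return 1
    fi
    echo "== $newest =="
    tail -n "$lines" "$newest"
}
```

Calling `last_dom0_log` with no arguments looks under /var/crash and prints the final 50 lines, which is normally where the kernel's panic or oops message ends up.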
-
RE: XCP-ng 8.3 with VM crashing
@AlbertK Thanks. There's no nested-virt configured there.
I have to admit this is looking more and more like a buggy CPU. Memory corruption is a possibility, but this is a clearly corrupt field in the middle of otherwise sane-looking fields in the VMCB.
Do you have any other identical systems? Can you swap this CPU out for another one to see what happens?
-
RE: XCP-ng 8.3 with VM crashing
@AlbertK None of those commands are relevant in a Xen system. You want
xe vm-param-list uuid=$VM
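The output of that command is long. A small sketch for narrowing it down to anything that mentions nested virt follows. The helper name `nested_virt_lines` is made up for illustration, and it matches on the word "nested" rather than a specific key, since the exact parameter name may differ between releases.

```shell
# Read `xe vm-param-list` output on stdin and keep only the lines that
# mention nested virt. Matching on "nested" is an assumption about how
# the setting is named, not a documented key.
nested_virt_lines() {
    grep -i 'nested' || echo "no nested-virt settings found"
}
```

Usage would be along the lines of `xe vm-param-list uuid=$VM | nested_virt_lines`.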
-
RE: XCP-ng 8.3 with VM crashing
@AlbertK That looks suspiciously like you've enabled nested virt in the VM. Can you confirm whether you have or not?
-
RE: Gpu passthrough on Asrock rack B650D4U3-2L2Q will not work
@steff22 said in Gpu passthrough on Asrock rack B650D4U3-2L2Q will not work:
what kind of magic have you put in the last 7 patches?
You've got a very recent AMD processor, so it's probably this fix https://xenbits.xen.org/gitweb/?p=xen.git;a=commitdiff;h=86001b3970fea4536048607ea6e12541736c48e1 from upstream.
-
RE: Issue after latest host update
@mgigirey said in Issue after latest host update:
@andyhhp Any plans to update the intel-microcode for XCP-ng 8.3? latest know version working in my setup is intel-microcode-20231009-1.xcpng8.3.noarch.rpm
I am not an XCP-ng developer. You'll have to ask @stormi for that.
-
RE: XCP-ng 8.3 betas and RCs feedback 🚀
@eb-xcp said in XCP-ng 8.3 betas and RCs feedback:
Edit: Confirmed; after enabling execution disable option within bios, installer booted without issues and the install is currently ongoing.
That is a bug. Xen is supposed to be able to detect this case and re-activate NX on its own.
For the EFI path in your screenshot, that one doesn't have logic to re-activate. IIRC, we weren't sure whether it was needed, because surely an EFI system wasn't still using Pentium4 compatibility. Clearly some wrong reasoning, and it's fairly easy to adjust.
However, fixing that path won't fix the normal MB2 path, which does have logic to reactivate and should have been able to cope fine.
What system do you have?
-
RE: XCP-ng 8.3 betas and RCs feedback 🚀
@flakpyro If Singlewire have already fixed the bug, then just do what is necessary to update the VM and be done with it.
That screenshot of grub poses far more questions than it answers, and I doubt we want to get into any of them.
-
RE: XCP-ng 8.3 betas and RCs feedback 🚀
This is ultimately a bug in Linux. There was a range of Linux kernels which did something unsafe on kexec which worked most of the time but only by luck. (Specifically - holding a 64bit value in a register while passing through 32bit mode, and expecting it to still be intact later; both Intel and AMD identify this as having model specific behaviour and not to rely on it).
A consequence of a security fix in Xen (https://xenbits.xen.org/xsa/advisory-454.html) makes it reliably fail when depended upon in a VM.
Linux fixed the bug years ago, but one distro managed to pick it up.
Ideally, get SingleWire to fix their kernel. Failing that, adjust the VM's kernel command line to take any `,low` or `,high` off the `crashkernel=` line, because that was the underlying way to tickle the bug IIRC.
The property you need to end up with is that `/proc/iomem` shows the `Crash kernel` range being below the 4G boundary, because the handover logic from one kernel to the other simply didn't work correctly if the new kernel was above 4G.
-
RE: Any way to know what features will be CPU masked before adding a host to a pool?
Intel Xeon E5-2683 v4 CPUs vs E5-2697 v4 CPUs
You are correct. These are adjacent rows in the SKU table; they've got the same core count, and only differ by 500MHz frequency. They're basically identical as far as software is concerned.
-
RE: Oops! We removed busybox
I suggest using this as a learning opportunity. Look at the RPM log and see what depends on busybox, and therefore what (else) got uninstalled in order to keep the dependencies satisfied.
(Hint: you uninstalled all of Xapi, hence why nothing works)
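A rough sketch of how the log could be checked. It assumes the host records package removals in /var/log/yum.log with lines of the form `<Mon> <DD> <HH:MM:SS> Erased: <package>`; the function name `erased_packages` is made up for illustration.

```shell
# List every package recorded as erased in a yum.log-style file.
# Assumption: the fourth whitespace-separated field is the action
# ("Installed:", "Updated:", "Erased:") and the fifth is the package.
erased_packages() {
    log="${1:-/var/log/yum.log}"
    awk '$4 == "Erased:" { print $5 }' "$log"
}
```

Running `rpm -q --whatrequires busybox` on a healthy host should also show which installed packages declare a dependency on it, which is the list to look at before removing anything.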
-
RE: Dell Wyse FW update breaks VM booting; console frozen; TianoCore/EDK2 related?
@rubberhose I've got a fix from Intel, and @stormi has packaged it.
`yum update microcode_ctl --enablerepo=xcp-ng-testing` should get you `microcode_ctl-2.1-26.xs29.2.xcpng8.2`, which has the fixed microcode for this issue in it.
When you've got that installed, it should be safe to update back to the latest firmware.
-
RE: Wyse 5070 VM won't booting after update bios 1.27
@t-chamberlain I've got a fix from Intel, and @stormi has packaged it.
`yum update microcode_ctl --enablerepo=xcp-ng-testing` should get you `microcode_ctl-2.1-26.xs29.2.xcpng8.2`, which has the fixed microcode for this issue in it.
-
RE: Issue after latest host update
@RealTehreal I've got a fix from Intel, and @stormi has packaged it.
`yum update microcode_ctl --enablerepo=xcp-ng-testing` should get you `microcode_ctl-2.1-26.xs29.2.xcpng8.2`, which has the fixed microcode for this issue in it.
-
RE: Issue after latest host update
@RealTehreal Thank you very much for that information. I'll follow up with Intel.
In the short term, I'd recommend just using the old microcode.
-
RE: Issue after latest host update
@RealTehreal Sorry to keep adding to the list of diagnostics, but everything here will help. After you've tried the other options, could you try this:
If the XTF testing shows any XTF test looping, use that single test, otherwise use your regular VM. Get one VM into the looping state. Check `xl list` to confirm that you've only got `Domain-0` and the one other VM, and note its domid (the "ID" column).
In dom0, run xentrace to capture a system trace. It's looping so the dump file is going to be large, but it also means that you can CTRL-C as quickly as you can on the shell and it will be fine (a few hundred milliseconds of samples will almost certainly be enough).
Anyway, run `xentrace -D -e 0x0008f000 xentrace.dmp` and then give me the created xentrace.dmp file. If you're interested in what's in it, you can decode it using `xenalyze -a xentrace.dmp |& less`.
Then, run `xen-hvmctx $domid` two or three times, and share the output of all.
-
RE: Wyse 5070 VM won't booting after update bios 1.27
@t-chamberlain In addition to the XTF testing, could you also please (with the bad microcode) try booting Xen with `spec-ctrl=no-verw` on the command line, and seeing whether that changes the behaviour of your regular VMs? Please capture `xl dmesg` from this run too.
-
RE: Issue after latest host update
@RealTehreal In addition to the XTF testing, could you also please try (with the bad microcode) booting Xen with `spec-ctrl=no-verw` on the command line, and seeing whether that changes the behaviour of your regular VMs? Please capture `xl dmesg` from this run too.
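Before sharing the log, it is worth confirming that the option actually reached Xen. The sketch below assumes the boot log contains a `Command line:` line listing the options Xen was started with; the helper name `xen_cmdline_has` is made up for illustration. On XCP-ng the option is usually added with `/opt/xensource/libexec/xen-cmdline --set-xen spec-ctrl=no-verw` followed by a reboot, but check the documentation for your release.

```shell
# Succeed if the Xen command line found in `xl dmesg` output (stdin)
# contains the given option text.
# Assumption: the log has a line containing "Command line:".
xen_cmdline_has() {
    opt="$1"
    # Take the first "Command line:" entry, then look for the option
    # as a fixed string delimited by non-word characters.
    grep -m1 'Command line:' | grep -qwF -- "$opt"
}
```

One way to use it is to save the log first with `xl dmesg > xl-dmesg.txt`, then run `xen_cmdline_has spec-ctrl=no-verw < xl-dmesg.txt && echo "option is active"`, and attach xl-dmesg.txt to the reply.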