@RealTehreal Thank you very much for that information. I'll follow up with Intel.
In the short term, I'd recommend just using the old microcode.
@RealTehreal Sorry to keep adding to the list of diagnostics, but everything here will help. After you've tried the other options, could you try this:
If the XTF testing shows any XTF test looping, use that single test; otherwise use your regular VM. Get one VM into the looping state. Check xl list to confirm that you've only got Domain-0 and the one other VM, and note its domid (the "ID" column).
In dom0, run xentrace to capture a system trace. It's looping so the dump file is going to be large, but it also means that you can CTRL-C as quickly as you can on the shell and it will be fine (a few hundred milliseconds of samples will almost certainly be enough).
Anyway, run xentrace -D -e 0x0008f000 xentrace.dmp and then send me the created xentrace.dmp file. If you're interested in what's in it, you can decode it using xenalyze -a xentrace.dmp |& less.
Then, run xen-hvmctx $domid two or three times, and share all of the output.
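In case it helps, the whole sequence looks roughly like this. It's a sketch only: the xl list output and the domid below are made up, and the real capture commands are commented out because they need a live Xen host.

```shell
# Illustrative only: sample `xl list` output with one looping guest.
xl_list_output='Name                                        ID   Mem VCPUs      State   Time(s)
Domain-0                                     0  2048     4     r-----     123.4
looping-guest                                3   128     1     r-----     999.9'

# Pull the ID of the first non-Domain-0 guest (the "ID" column):
domid=$(printf '%s\n' "$xl_list_output" | awk 'NR > 1 && $1 != "Domain-0" { print $2; exit }')
echo "domid=$domid"

# On the real host (not run here):
#   xentrace -D -e 0x0008f000 xentrace.dmp   # Ctrl-C after a moment
#   xenalyze -a xentrace.dmp |& less         # optional: decode locally
#   xen-hvmctx "$domid"                      # run two or three times
```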
@t-chamberlain In addition to the XTF testing, could you also please (with the bad microcode) try booting Xen with spec-ctrl=no-verw on the command line, and see whether that changes the behaviour of your regular VMs? Please capture xl dmesg from this run too.
@RealTehreal In addition to the XTF testing, could you also please try (with the bad microcode) booting Xen with spec-ctrl=no-verw on the command line, and see whether that changes the behaviour of your regular VMs? Please capture xl dmesg from this run too.
@RealTehreal It's an Intel issue, but while this is enough to show that there is an issue, it's not enough to figure out what is wrong.
Sadly, a VM falling into a busy loop can be one of many things. It's clearly on the (v)BSP prior to starting (v)APs, hence why it's only ever a single CPU spinning.
Can you switch to using the debug hypervisor (change the /boot/xen.gz symlink to point at the -d suffixed hypervisor), and then capture xl dmesg after trying to boot one VM? Depending on how broken things are, we might see some diagnostics.
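If it's useful, the symlink dance looks something like this. It's done in a scratch directory here so it's safe to copy-paste as-is; on a real host you'd operate in /boot, and the versioned filenames are examples (check ls /boot for the real ones).

```shell
# Scratch directory with stand-in files, so this is safe to run as-is.
boot=$(mktemp -d)
touch "$boot/xen-4.17.0.gz" "$boot/xen-4.17.0-d.gz"   # stand-in files
ln -sf xen-4.17.0.gz "$boot/xen.gz"      # the usual release hypervisor
ln -sf xen-4.17.0-d.gz "$boot/xen.gz"    # repoint at the -d debug build
readlink "$boot/xen.gz"
```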
Could you also try running xtf as described here: https://xcp-ng.org/forum/post/57804. It's a long shot, but if it does happen to stumble on the issue, then it will be orders of magnitude easier to debug than something miscellaneous broken in the middle of OVMF.
Windows isn't going to be tricked into being happy about the CPU just by changing the reported model. It cross-checks real features, and you simply can't fake those up.
There is no ability in XenServer/XCP-ng to configure this, and I have no intention to offer people the ability to shoot themselves in the foot like this.
@Andrew said in Xen 4.17 on XCP-ng 8.3!:
xtf HARD system freeze at test-hvm64-xsa-304. (only XCP hard lockup I have seen)
xtf With ept=no-exec-sp, all tests SKIP/SUCCESS.
It's guest exploitable, and locks up the CPU so hard it doesn't even reset properly. It's also very expensive to work around, hence why it's not mitigated by default.
So, we've had reports on xen-devel which look a little like this.
@BlueBadger are you able to switch back to your 7950x and try booting Xen with x2apic_phys=true? It appears that the -X processors are missing a feature in their IOMMU, and Xen was getting confused when setting up interrupt handling.
https://xenbits.xen.org/gitweb/?p=xen.git;a=commitdiff;h=0d2686f6b66b4b1b3c72c3525083b0ce02830054 is at least part of the fix, but so far feedback on the mailing lists suggests it's not a complete fix.
This is way, way outside of a normal-ish looking server use case. I'm honestly surprised you've got anything functioning...
To start with, you're probably booting Xen with console=vga (because that's the default). It will be handed over to dom0 too, so start by going through the bootloader configuration and making sure that neither Xen nor dom0 is trying to use the display at all.
I suspect this is the root cause of the display going periodically back to black.
You cannot mix Intel and AMD CPUs in the same pool. They're simply not compatible enough for a VM to survive a migration between the two, so we explicitly disallow it. (You can't even --force a pool join to this effect.)
@Andrew Intel E5450, that's very retro.
It's also first-gen VT-x and doesn't have HAP, which is why the test that is looking explicitly for HAP doesn't work.
As a stopgap, remove hap from the VARY-CFG := hap shadow line in tests/invlpg/Makefile and rebuild. In the meantime I'll try to figure out a nice way to cope with this.
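i.e. something like this. The sed is just to illustrate the one-line edit on a string; you can equally delete "hap" from the Makefile in an editor.

```shell
# The one-line Makefile edit, shown on a string for illustration.
line='VARY-CFG := hap shadow'
patched=$(printf '%s\n' "$line" | sed 's/:= hap /:= /')
echo "$patched"
```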
@olivierlambert said in XCP-ng 8.3 public alpha :
Your Xen guru badge is well earned @andyhhp
"purveyor of general grumpiness"
@Andrew Those are normal.
Bad rIP is actually an error introduced in XSA-170 because someone misread the Intel manual. I've been trying to delete it upstream for years now. It's been so long that Intel nearly released a feature which would have required us to delete that check, and I successfully persuaded the Intel documentation team to add a footnote clarifying the statement which was misinterpreted during XSA-170.
At some point in my copious free never, I should restart the argument to delete it upstream...
The other two are logging from the XSA-260 fix. There's an error(/misfeature) in the x86 architecture and those would have been privilege escalations before the fix was in place. I decided when fixing XSA-260 that such attempts shouldn't be entirely silent, hence the one-liner. That particular printk() is actually common with other debugging routines, so can occur during regular development.
You're trying to lie to the VM and tell it that it's running on a system with 24 physical sockets, each with a single core.
For reference, 2 sockets is the biggest AMD server that you can buy (these days), and Intel top out at 8. If you want a larger system, you could buy a SuperDome which can manage up to 32 sockets (before hitting other limits of UPI switching).
The various historical enumeration schemes can't encode that high, which is why there's a sanity check in XenCenter.
You typically want 1 socket, so select 24 cores / socket.
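As a sketch, the topology arithmetic (and the xe plumbing I'd expect you to end up using) looks like this. The uuid is a placeholder, and platform:cores-per-socket is assumed to be the key your host uses for guest topology, so verify against your own setup.

```shell
# Topology arithmetic for 24 vCPUs presented as 1 socket.
vcpus=24
sockets=1
cores_per_socket=$((vcpus / sockets))
echo "cores-per-socket=$cores_per_socket"
# On the host (not run here; uuid is a placeholder):
#   xe vm-param-set uuid=<vm-uuid> platform:cores-per-socket=24
```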
~Andrew
This is unconditional for a reason. The CSTATE errata in Nehalem are crippling - IIRC a core going in and out of a deep C-state does not maintain cache coherency correctly, resulting in arbitrary memory corruption.
You really do care about not hitting these errata, even on a test/hobby server.
@planedrop If you boot Xen with nmi=dom0, they'll be forwarded to dom0 rather than being treated as fatal.
Could you also get lspci -tv for this system? The IO_PAGE_FAULT is for a different device to the one reporting an AER BadTLP in dom0, and has a wildly bogus address, so we need to figure out if the two errors are related or independent.
BadTLP is a problem, usually indicative of an electrical contact issue in the slot. Whatever is downstream of 00:01.1 wants unplugging, dusting out thoroughly, then confirming that it's adequately reseated.
@planedrop said in Passed Through GPU Crashes Host During Driver Install:
Had a Panic on CPU 0 code and a reboot.
Ok - let's do things one at a time. Can you start a new thread and provide the logs? (Ignore the vcpu/domain/stack hexdump log files; xca.log/xen.log/dom0.log are the interesting ones.)
@planedrop Ok, so it's a host lockup rather than a crash. That's a bit more irritating to debug.
First of all, can you update to the debug hypervisor? Adjust the /boot/xen.gz -> $foo symlink to use the version of Xen with the -d.gz suffix. This is the same hypervisor changeset, but with assertions and extra verbosity enabled.
Also, can you append ,keep to Xen's vga= option on the command line? This should cause Xen to keep on writing out onto the screen even after dom0 has started up. Depending on the system, this might be a bit glacial, but dom0 will come up eventually.
Then reproduce the hang. Hopefully there'll be some output from Xen before the system locks up. You might also want to consider adding noreboot to Xen's command line too, especially if there's a backtrace and you want to take a photo of it to attach here.
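For illustration, the command-line edit would look something like this, operating on a sample string; the actual vga mode and the file holding the Xen options depend on your bootloader setup.

```shell
# Sketch on a sample Xen command line; the vga mode is an example.
cmdline='/boot/xen.gz dom0_mem=4096M,max:4096M vga=mode-0x0311'
patched=$(printf '%s\n' "$cmdline" | sed 's/\(vga=[^ ]*\)/\1,keep/')
patched="$patched noreboot"
echo "$patched"
```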
@planedrop By host crash, do you mean a reboot, or something getting wedged and requiring manual intervention? Any logs in /var/crash/ in dom0?
Judging by the consumer motherboard, I presume you don't have a serial console. Anything show up on the screen at the point of crash?
@tomg That is the work, but it needs rebasing over the XSA-400 work, so a v4 series is going to be needed at a minimum.
HAP is Xen's vendor-neutral name for Intel EPT or AMD NPT hardware support. We have had superpage support for many years here.
IOMMU pagetables can either be shared with EPT/NPT (reduces the memory overhead of running the VM), or split (required for AMD due to hardware incompatibilities, and also required to support migration of a VM with IO devices).
When pagetables are shared, the HAP superpage support gives the IOMMU superpages too (because they're literally the same set of pagetables in memory). When pagetables are split, HAP gets superpages while the IOMMU logic currently uses small pages.
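To put numbers on why superpages matter, here's a quick back-of-envelope count of leaf pagetable entries needed to map 1 GiB of guest memory at each leaf size (pure arithmetic, safe to run anywhere):

```shell
# Leaf pagetable entries needed to map 1 GiB at each leaf size.
gib=$((1024 * 1024 * 1024))
echo "4KiB leaves: $((gib / 4096))"               # 262144 entries
echo "2MiB leaves: $((gib / (2 * 1024 * 1024)))"  # 512 entries
echo "1GiB leaves: $((gib / gib))"                # 1 entry
```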