Issue after latest host update
-
I'd even be fine with using only two machines and keeping one of them offline for further testing.
-
For reference: I have now decided on a less intrusive approach and changed the default boot entry in the GRUB config to the working failover entry. I will now try to get the pool up again.
-
What's the CPU on this? I would suspect a microcode update issue then.
-
@olivierlambert Here is the info from /proc/cpuinfo:
Intel(R) Celeron(R) J4105 CPU @ 1.50GHz
True enough, regarding the Wyse topic. I'll try reverting only the microcode update and see what happens.
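For anyone who wants to compare hosts, here is a minimal sketch that pulls the CPU model and the loaded microcode revision out of /proc/cpuinfo. It assumes an x86 Linux /proc/cpuinfo that exposes "model name" and "microcode" fields; the function name cpu_summary is made up for this example.

```shell
#!/usr/bin/env bash
# Print the CPU model and the currently loaded microcode revision.
# Assumption: an x86 Linux /proc/cpuinfo with "model name" and "microcode" fields.
cpu_summary() {
    local cpuinfo="${1:-/proc/cpuinfo}"
    local model ucode
    # Every core normally reports the same values, so the first match is enough.
    model=$(grep -m1 '^model name' "$cpuinfo" | cut -d: -f2- | sed 's/^ *//')
    ucode=$(grep -m1 '^microcode' "$cpuinfo" | cut -d: -f2- | sed 's/^ *//')
    printf 'model: %s\nmicrocode: %s\n' "$model" "$ucode"
}
```

Running cpu_summary before and after changing the microcode package (and rebooting) should show whether the loaded revision actually changed.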
-
@RealTehreal said in Issue after latest host update:
Intel(R) Celeron(R) J4105 CPU @ 1.50GHz
Another Gemini Lake… So it's clearly related.
-
@olivierlambert Yep, I can confirm that in this case the microcode update is the culprit, too.
I just downgraded
microcode_ctl-2.1-26.xs28.1.xcpng8.2.x86_64
to
microcode_ctl-2.1-26.xs26.2.xcpng8.2.x86_64
and it's working again. Man, what a mess.
-
@RealTehreal
Step-by-step instructions, in case someone else has the same issue:
1. Run
yum history list
to get the transaction ID of the last update.
2. Run
yum history info #
with # being the ID from step 1, to list the updates done in this transaction. The interesting part for me was:
Updated microcode_ctl-2:2.1-26.xs26.2.xcpng8.2.x86_64
Update 2:2.1-26.xs28.1.xcpng8.2.x86_64
3. Run
yum downgrade microcode_ctl-2:2.1-26.xs26.2.xcpng8.2.x86_64
to downgrade to the previous version. You will have to enter the older version for this command.
4. Wait until it's done, reboot, test, and pray it'll work again.
This is just a workaround! Microcode updates are important security and/or functional updates. Downgrading can lead to security issues.
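If you'd rather not copy the old version string by hand, steps 2 and 3 can be sketched as a small filter. This is only an illustration: previous_version is a hypothetical helper, and the sample text is modeled on the line quoted above (the repository column after each package is an assumption about the output layout).

```shell
#!/usr/bin/env bash
# Read `yum history info <id>` output on stdin and print the package spec
# that was replaced for the given package name (the downgrade target).
previous_version() {
    local pkg="$1"
    # "Updated" lines carry the old package; the "Update" line after it carries the new version.
    awk -v pkg="$pkg" '$1 == "Updated" && index($2, pkg "-") == 1 { print $2; exit }'
}

# Sample modeled on the post; the "@xcp-ng-updates" column is an assumption.
sample='    Updated microcode_ctl-2:2.1-26.xs26.2.xcpng8.2.x86_64 @xcp-ng-updates
    Update                2:2.1-26.xs28.1.xcpng8.2.x86_64 @xcp-ng-updates'

printf '%s\n' "$sample" | previous_version microcode_ctl
# prints: microcode_ctl-2:2.1-26.xs26.2.xcpng8.2.x86_64
```

On a real host you would pipe the output of yum history info for your transaction ID into previous_version and pass the result to yum downgrade. Check the printed value before downgrading anything.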
-
@olivierlambert Thank you very much for pointing out the real issue.
-
What should happen now? Who should be informed about this issue with the microcode update? Is it still an XCP-ng issue, a Linux issue, or an Intel issue? Thank you in advance for clarification.
-
@RealTehreal It's an Intel issue, but while this is enough to show that there is an issue, it's not enough to figure out what is wrong.
Sadly, a VM falling into a busy loop can be one of many things. It's clearly on the (v)BSP prior to starting (v)APs, hence why it's only ever a single CPU spinning.
Can you switch to using the debug hypervisor (change the /boot/xen.gz symlink to point at the -d suffixed hypervisor), and then capture
xl dmesg
after trying to boot one VM? Depending on how broken things are, we might see some diagnostics.
Could you also try running xtf as described here: https://xcp-ng.org/forum/post/57804
It's a long shot, but if it does happen to stumble on the issue, then it will be orders of magnitude easier to debug than something misc broken in the middle of OVMF.
-
@andyhhp Sure thing. I'll just need some time, as I can only do such things in my free time.
-
@RealTehreal Thanks for sharing the resolution. I'm sure it will help someone else in the future.
-
@olivierlambert said in Issue after latest host update:
@RealTehreal said in Issue after latest host update:
Intel(R) Celeron(R) J4105 CPU @ 1.50GHz
Another Gemini Lake… So it's clearly related.
I had already found its code name, but unfortunately things got busy, so I was unable to check the microcode notes or post this to the forum. I found it without using cat /proc/cpuinfo.
I took the CPU model from this web page (https://www.fujitsu.com/uk/products/computing/pc/thin-clients/futro-s740/). Looking up the Intel Celeron processor J4105 on Intel Ark then revealed its code name, along with a wealth of other useful information (https://ark.intel.com/content/www/us/en/ark/products/128989/intel-celeron-j4105-processor-4m-cache-up-to-2-50-ghz.html).
-
@RealTehreal In addition to the XTF testing, could you also please try (with the bad microcode) booting Xen with
spec-ctrl=no-verw
on the command line, and seeing whether that changes the behaviour of your regular VMs? Please capture
xl dmesg
from this run too.
-
Doc about XTF testing: https://docs.xcp-ng.org/project/development-process/tests/#test-the-xen-hypervisor-itself
-
I'll do the testing on the weekend.
-
@RealTehreal Sorry to keep adding to the list of diagnostics, but everything here will help. After you've tried the other options, could you try this:
If the XTF testing shows any XTF test looping, use that single test; otherwise use your regular VM. Get one VM into the looping state. Check
xl list
to confirm that you've only got Domain-0 and the one other VM, and note its domid (the "ID" column).
In dom0, run xentrace to capture a system trace. It's looping so the dump file is going to be large, but it also means that you can CTRL-C as quickly as you can on the shell and it will be fine (a few hundred milliseconds of samples will almost certainly be enough).
Anyway, run
xentrace -D -e 0x0008f000 xentrace.dmp
and then send me the resulting xentrace.dmp file. If you're interested in what's in it, you can decode it using
xenalyze -a xentrace.dmp |& less
Then, run
xen-hvmctx $domid
two or three times, and share the output of all runs.
-
@andyhhp First things first, here is some information:
xl dmesg with the debug hypervisor, bad microcode, and after trying to run a VM: xl_dmesg_bad_microcode.txt
xtf short: xtf_short.txt
xtf long: xtf_long.txt
-
@andyhhp I sent you a PM.
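For anyone repeating the xentrace steps requested earlier in the thread, the xl list check could be scripted roughly like this. It is a sketch under assumptions: looping_domid is a hypothetical helper, the sample table is invented for illustration, and it assumes domain names contain no spaces so that the ID is the second column.

```shell
#!/usr/bin/env bash
# Read `xl list` output on stdin. If exactly one domain besides Domain-0 is
# listed, print its domid; otherwise print nothing and return non-zero.
looping_domid() {
    # Skip the header row, ignore Domain-0, and count what is left.
    awk 'NR > 1 && $1 != "Domain-0" { n++; id = $2 }
         END { if (n == 1) print id; else exit 1 }'
}

# Invented sample in the usual `xl list` column layout.
sample='Name                ID   Mem VCPUs State   Time(s)
Domain-0             0  2048     4 r-----    123.4
testvm               5  1024     1 r-----     56.7'

printf '%s\n' "$sample" | looping_domid
# prints: 5
```

On the host, the idea would be to feed the real xl list output into looping_domid and only continue with xen-hvmctx when it succeeds, so the captures are not polluted by other running VMs.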