Issue after latest host update

RealTehreal

@olivierlambert I didn't change anything, at least. I just ran yum update and everything went down the drain.

olivierlambert

I'm not sure the yum update is really related. It could be a coincidence, otherwise we would have been swamped in similar reports. Or it's a very specific combo that's unseen elsewhere.

What kind of hardware are we talking about?

RealTehreal

@olivierlambert I finally made some progress. And it really seems to be update related.

I took one of the hosts and plugged a display and keyboard into it. When booting up, I can choose to use an older version of Xen from the boot menu. Doing so makes VMs work again.

Culprit: Xen 4.13.5-9.39 (current default)
Working: Xen 4.13.4-9.19.1 (which I can choose from boot menu)

All three hosts are Fujitsu Futro 740 thin clients.

john.c

@RealTehreal What's the BIOS version of the Fujitsu Futro 740, and the exact model, please? There are lots of Fujitsu Futro 740 thin clients, so you could be using any one of them.

RealTehreal

@john-c
Model: FUJITSU FUTRO S740/D3544-A1
BIOS: V5.0.0.13 R1.13.0 for D3544-A1x (09/23/2022)

john.c

@RealTehreal Thanks, that will help. It makes it possible to identify any issues specific to that device, as well as its CPU and the CPU's features, especially its instruction set capabilities.

RealTehreal

@john-c All such information should be available in the dmesg file in post: https://xcp-ng.org/forum/post/74791

Any ideas on how to revert the update? I would really like to have the setup running again. It may be "just" a home lab, but I was still using it (at least semi-) productively...

RealTehreal

I'd be even fine to only use two machines and keep one of them offline for further testing.

RealTehreal

For reference: I have now decided to use a less intrusive approach and changed the default boot entry in the grub config to the working failover entry (the one with the older Xen version). I will now try to get the pool up again.
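
For anyone wanting to do the same, here is a minimal sketch of how the boot entries could be listed to find the index of the working one. The config path, the single-quoted entry titles and the "set default=" line are assumptions about a typical GRUB 2 setup (UEFI installs usually keep the file elsewhere), and list_boot_entries is a hypothetical helper name. Back up the file before editing it.

```shell
#!/bin/sh
# Hypothetical helper: print each GRUB menu entry with a zero-based index.
# Assumes titles are single-quoted and that there are no submenus
# (entries nested in a submenu are numbered differently by GRUB).
list_boot_entries() {
    awk -F"'" '/^[ \t]*menuentry / { printf "%d: %s\n", n++, $2 }' "$1"
}

# Usage on the host (path is an assumption, not run here):
#   cp /boot/grub/grub.cfg /boot/grub/grub.cfg.bak
#   list_boot_entries /boot/grub/grub.cfg
#   # then set the "set default=" line to the index of the working entry
```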

olivierlambert

What's the CPU on this? I would suspect a microcode update issue then.

olivierlambert

Could be related: https://xcp-ng.org/forum/topic/8736/wyse-5070-vm-won-t-booting-after-update-bios-1-27

RealTehreal

@olivierlambert Following info from /proc/cpuinfo:
Intel(R) Celeron(R) J4105 CPU @ 1.50GHz

True enough regarding the Wyse topic. I'll try reverting only the microcode update and see what happens.
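
A minimal sketch for pulling the relevant lines out of /proc/cpuinfo, in case someone wants to compare hosts. cpu_summary is a hypothetical helper, and whether dom0 exposes a meaningful "microcode" line there is an assumption; rpm -q microcode_ctl shows the installed package version either way.

```shell
#!/bin/sh
# Hypothetical helper: print the first "model name" and "microcode" lines
# from a cpuinfo-style file (defaults to /proc/cpuinfo).
cpu_summary() {
    grep -m1 '^model name' "${1:-/proc/cpuinfo}"
    grep -m1 '^microcode' "${1:-/proc/cpuinfo}"
}

# Usage on the host (not run here):
#   cpu_summary
#   rpm -q microcode_ctl
```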

olivierlambert

@RealTehreal said in Issue after latest host update:

Intel(R) Celeron(R) J4105 CPU @ 1.50GHz

Another Gemini Lake… So it's clearly related.

RealTehreal

@olivierlambert Yep, I can confirm that in this case the microcode update is the culprit, too.

I just downgraded
microcode_ctl-2.1-26.xs28.1.xcpng8.2.x86_64
to
microcode_ctl-2.1-26.xs26.2.xcpng8.2.x86_64

and it's working again. Man, what a mess.

RealTehreal

@RealTehreal
Step-by-step instructions, in case someone else has the same issue:

1. Run yum history list to get the transaction ID of the last update.

2. Run yum history info #, with # being the ID from step 1, to list the updates done in this transaction. The interesting part for me was:

Updated microcode_ctl-2:2.1-26.xs26.2.xcpng8.2.x86_64
Update                2:2.1-26.xs28.1.xcpng8.2.x86_64

3. Run yum downgrade microcode_ctl-2:2.1-26.xs26.2.xcpng8.2.x86_64 to downgrade to the previous version. You have to specify the older version explicitly in this command.

4. Wait until it's done, reboot, test, and pray it works again.

This is just a workaround! Microcode updates are important security and/or functional updates. Downgrading can lead to security issues.
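
The steps above can be sketched as a short shell session. previous_version is a hypothetical helper that pulls the old package version out of the yum history info output, assuming the "Updated <name>-<epoch>:<version>" line format quoted in step 2; ID is a placeholder for your own transaction ID.

```shell
#!/bin/sh
# Hypothetical helper: read `yum history info <id>` output on stdin and
# print the pre-update version of the given package, based on the
# "Updated <name>-<epoch>:<version>..." line format quoted in step 2.
previous_version() {
    awk -v pkg="$1" '$1 == "Updated" && index($2, pkg "-") == 1 { print $2 }'
}

# Usage on the host (not run here; replace ID with your transaction ID):
#   yum history list
#   old=$(yum history info ID | previous_version microcode_ctl)
#   yum downgrade "$old"
#   rpm -q microcode_ctl    # confirm the installed version afterwards
```

Check the value of $old by eye before running the downgrade, since the parsing depends on the output format.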

RealTehreal

@olivierlambert Thank you very much for pointing out the real issue.

RealTehreal

What should happen now? Who should be informed about this issue with the microcode update? Is it an XCP-ng issue, a Linux issue, or an Intel issue? Thank you in advance for clarification.

andyhhp

@RealTehreal It's an Intel issue, but while this is enough to show that there is an issue, it's not enough to figure out what is wrong.

Sadly, a VM falling into a busy loop can be one of many things. It's clearly on the (v)BSP prior to starting (v)APs, hence why it's only ever a single CPU spinning.

Can you switch to using the debug hypervisor (change the /boot/xen.gz symlink to point at the -d suffixed hypervisor), and then capture xl dmesg after trying to boot one VM? Depending on how broken things are, we might see some diagnostics.
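
A minimal sketch of the symlink switch, assuming /boot/xen.gz is a relative symlink to a file such as xen-<version>.gz and that the debug build sits next to it as xen-<version>-d.gz. Those file names and the use_debug_xen helper are assumptions for illustration; check ls -l /boot/xen* on your own host first.

```shell
#!/bin/sh
# Hypothetical helper: repoint <bootdir>/xen.gz at the "-d" build matching
# the currently linked hypervisor. Prints the previous link target so it
# can be restored later. Assumes the link target is a bare file name.
use_debug_xen() {
    bootdir="${1:-/boot}"
    current=$(readlink "$bootdir/xen.gz") || return 1
    case "$current" in
        *-d.gz) echo "$current"; return 0 ;;   # already the debug build
    esac
    debug="${current%.gz}-d.gz"
    [ -e "$bootdir/$debug" ] || { echo "not found: $bootdir/$debug" >&2; return 1; }
    ln -sfn "$debug" "$bootdir/xen.gz"
    echo "$current"
}

# Usage on the host (not run here):
#   use_debug_xen /boot          # note the printed name for switching back
#   reboot, try to start one VM, then:
#   xl dmesg > xl-dmesg.txt
```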

Could you also try running xtf as described here: https://xcp-ng.org/forum/post/57804 It's a long-shot, but if it does happen to stumble on the issue, then it will be orders of magnitude easier to debug than something misc broken in the middle of OVMF.

RealTehreal

@andyhhp Sure thing. I'll just need some time, as I can only do such things in my free time.

nikade

@RealTehreal Thanks for sharing the resolution, I'm sure it will help someone else in the future.