Applied recent patches ... Now getting CPU errors

jcdick1

I installed the patches that were posted just a few days ago, and now one of my nodes is having an issue with CPUs.

These are HP DL360 G8s with Xeon E5-2690 v2 CPUs.

At the point in the boot when it applies microcode, it fails, seemingly at random, to "bring up" one of the CPUs and then reboots.

Sometimes it's CPU 1, as in the attached screenshot. On another reboot, it might get to CPU 12. The next time around, only to CPU 5. I have tried disabling hyperthreading, just to reduce the possibilities.

I don't know what "error -5" is and I can't find anything online about it.

The two attached screenshots are the same host, different cycles through rebooting.

Anyone have an idea of what may be at issue?

Thanks!

TheNorthernLight

@jcdick1 Are you fully firmware patched (BIOS microcode)?

stormi

It's always good to check on the firmware side indeed.

If you want to test with the previous kernel:

# yum downgrade won't work for the kernel because it's a protected package, so let's use rpm --oldpackage
yumdownloader kernel-4.19.19-7.0.11.1.xcpng8.2
rpm -Uv --oldpackage kernel-4.19.19-7.0.11.1.xcpng8.2.x86_64.rpm
reboot

If this does solve your issue, then with your help we could try to find what changed between the two kernels that caused this issue.

jcdick1

@thenorthernlight

I am at the latest I have access to, which is P71 05/24/2019 ... and near as I can tell searching HPE Support, that is the latest available. If there's a later one, I can't find it.

jcdick1

@stormi

The problem is, the system won't boot. It gets to that error and immediately just reboots. I got a screenshot only via the iLO remote console and having my finger on the printscreen key ready to grab before it went back to POST.

Andrew

@jcdick1 Yes, that's the latest (last) BIOS update for the DL360p G8. There is a new iLO update, but that does not matter for this.

I have a G8 with the E5-2680 (v1 chips) and I don't get the boot errors. But I do see reports of other (newer) CPUs with microcode update issues (not reports for XCP).

You can try the Linux "dis_ucode_ldr" option to disable the Intel microcode loader, but I have not tested it in XCP. For a quick test, you should be able to add it manually at boot time by editing the entry in the GRUB boot menu. If it works, you can then add it to grub.cfg, downgrade the microcode package, or make kernel changes.

https://www.kernel.org/doc/html/latest/admin-guide/kernel-parameters.html

If it does not work then you can try booting other kernels from grub or from USB...
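For a quick check after trying the option, something like the sketch below could confirm that the parameter actually reached the running kernel, and show the microcode revision Linux reports. The helper name has_dis_ucode_ldr is made up for illustration. Since the option is untested in XCP, this only verifies the command line, not that the boot problem is avoided.

```shell
#!/bin/sh
# has_dis_ucode_ldr is a hypothetical helper, not part of any package.
# It succeeds if the given kernel command line contains dis_ucode_ldr
# as a standalone, space-separated parameter.
has_dis_ucode_ldr() {
    case " $1 " in
        *" dis_ucode_ldr "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Check the command line of the currently running kernel, if /proc is available.
if [ -r /proc/cmdline ]; then
    if has_dis_ucode_ldr "$(cat /proc/cmdline)"; then
        echo "dis_ucode_ldr is present on the kernel command line"
    else
        echo "dis_ucode_ldr is NOT present on the kernel command line"
    fi
fi

# Show the microcode revision reported for the first CPU (x86 Linux only).
if [ -r /proc/cpuinfo ]; then
    grep -m1 '^microcode' /proc/cpuinfo || echo "no microcode field reported"
fi
```

Comparing the reported revision with and without the option is one way to tell whether the loader was really skipped.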

stormi

@jcdick1 you can boot the last entry in grub, it will boot the original kernel and xen from when you installed your host.

jcdick1

@stormi Even using the old kernel, I get issues booting. I'll try using my original install media to see if it will boot from USB. The problem I am having is that this is my master node, and I can't get another node to become master and get control of my VMs. Another node isn't becoming the new master for the remaining two nodes.

Update: Booting to the 8.2.0 install ISO, it booted into the installer just fine.

stormi

You should be able to trigger an emergency transition to master on another host of the pool.

jcdick1

@stormi I guess I don't know how to do that. Using the console "Resource Pool Configuration" -> "Designate a new pool master" results in a "This host is not a Pool member" message for both nodes.

Edit: I found an older post with the xe commands for emergency transition.
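For anyone landing here later, the usual xe sequence for this looks roughly like the sketch below. It is not copied from that older post. The wrapper names (run, promote_this_host) and the DRY_RUN switch are made up for illustration, and the script only prints the commands unless DRY_RUN=0 is set. It is meant to be run on the surviving host that should take over as master.

```shell
#!/bin/sh
# Dry-run by default: print the xe commands instead of executing them.
# Set DRY_RUN=0 on the surviving host to actually run them.
DRY_RUN="${DRY_RUN:-1}"

# run is a hypothetical wrapper: echo the command in dry-run mode, else execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# promote_this_host is a hypothetical name for the two-step recovery.
promote_this_host() {
    # Step 1: force this pool member to become the new master.
    run xe pool-emergency-transition-to-master
    # Step 2: tell the remaining members to reconnect to the new master.
    run xe pool-recover-slaves
}

promote_this_host
```

Once both commands have run for real, xe host-list on the new master should show which pool members have reconnected.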

jcdick1

@stormi I got a new master, and recovered the other slave. I did a fresh reinstall of the 8.2.0 ISO and the host came up just fine. So there's definitely something in the new code that breaks these machines.

stormi

Since reverting to the previous kernel did not solve your issue, I suspect this might be due to the microcode update.

But this microcode comes from Intel directly so it's surprising (although not impossible, they do break things from time to time).

If it's the microcode, this would also mean that you had not updated in a long time, as the recent update train did not contain any such microcode update.

You may try to update everything but the microcode. You could add something like this (untested) in /etc/yum.conf:

exclude=microcode_ctl*

And then update everything else.

jcdick1

@stormi I'll try that. If I update via CLI, I think I can use "yum update --exclude=microcode_ctl*" and see what happens. If I remember correctly, it was stating 22 patches missing when I started this process with the patches released last week. Now the fresh install is stating 43 missing patches.

jcdick1

@stormi

Just as an FYI, the latest patches - 20190314-2.xcp - seem to not have an issue. The system patched and booted up just fine with no issues bringing up CPUs or anything like that.

stormi

@jcdick1 They only update microcode for Fam 17h and 19h AMD CPUs.

jcdick1

@stormi Yeah, I spoke too soon. Two of my hosts came right up fine after the patches. The third has been in a boot loop with CPU panics and fatal page faults.

Danp

@jcdick1 Could it be the same problem described here?

jcdick1

@Danp No, but I seem to have figured it out. It's a weird one, considering the platform.

These are HP DL360s with fully licensed iLOs. But the CPU errors are gone if I physically connect a keyboard to the server. If I reboot while only monitoring via the iLO remote console, I get the errors. If I go to the machines and connect a USB keyboard before the reboot, then go back to my workstation, do it all through XO, and watch via remote console, they come up fine. As for my earlier post about two hosts coming up fine: I remembered that, coincidentally, I had switched the KVM to them.

So just an FYI to anyone who might have the same problem: plug in a keyboard.

Andrew

@jcdick1 I have almost the same servers and everything has been fine (without keyboards)...

HP DL360p G8 with E5-2680 v2, and also some with E5-2680 (not v2). One machine did have problems, but that was a hardware failure (CPU fault), not XCP.

These HP G8 machines are about a decade old now, so hardware issues are not a surprise.