XCP-ng

    Applied recent patches ... Now getting CPU errors

      TheNorthernLight @jcdick1

      @jcdick1 Are you fully patched on the firmware side (BIOS and microcode)?

        stormi Vates 🪐 XCP-ng Team

        It's always good to check on the firmware side indeed.

        If you want to test with the previous kernel:

        # yum downgrade won't work for the kernel because it's a protected package, so let's use rpm --oldpackage
        yumdownloader kernel-4.19.19-7.0.11.1.xcpng8.2
        rpm -Uv --oldpackage kernel-4.19.19-7.0.11.1.xcpng8.2.x86_64.rpm
        reboot
        

        If this does solve your issue, then with your help we could try to find what changed between the two kernels that caused this issue.

          jcdick1 @TheNorthernLight

          @thenorthernlight

          I am at the latest I have access to, which is P71 05/24/2019 ... and near as I can tell searching HPE Support, that is the latest available. If there's a later one, I can't find it.

            jcdick1 @stormi

            @stormi

            The problem is, the system won't boot. It gets to that error and immediately reboots. I only got a screenshot via the iLO remote console by keeping my finger on the Print Screen key, ready to capture it before the machine went back to POST.

              Andrew Top contributor @jcdick1

              @jcdick1 Yes, that's the latest (last) BIOS update for the DL360p G8. There is a new iLO update, but that does not matter for this.

              I have a G8 with the E5-2680 (v1 chips) and I don't get the boot errors. But I do see reports of microcode update issues on other (newer) CPUs, though none of those reports are for XCP-ng.

              You can try the Linux "dis_ucode_ldr" option to disable the Intel microcode loader, but I have not tested it in XCP-ng. For a quick test, you should be able to add it manually at the GRUB boot menu at boot time. If it works, you can add it to grub.cfg, downgrade the microcode package, or make kernel changes.

              https://www.kernel.org/doc/html/latest/admin-guide/kernel-parameters.html
              

              If it does not work then you can try booting other kernels from grub or from USB...
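
              One way to judge whether a boot option such as dis_ucode_ldr changed anything is to compare the microcode revision the kernel reports before and after the reboot. A minimal sketch, assuming dom0's /proc/cpuinfo exposes a "microcode" field the way a stock Linux kernel on x86 does (not verified on XCP-ng); the function name ucode_rev is made up for illustration:

              ```shell
              # ucode_rev [FILE]: print the first "microcode" revision found in a
              # cpuinfo-style file. FILE defaults to /proc/cpuinfo.
              # Hypothetical helper, not part of XCP-ng.
              ucode_rev() {
                  awk -F': *' '/^microcode/ { print $2; exit }' "${1:-/proc/cpuinfo}"
              }
              ```

              Calling ucode_rev once before and once after the reboot, and comparing the two values, would show whether a different revision ended up loaded.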

                stormi Vates 🪐 XCP-ng Team @jcdick1

                @jcdick1 You can boot the last entry in GRUB; it boots the original kernel and Xen from when you installed your host.

                  jcdick1 @stormi
                  last edited by jcdick1

                  @stormi Even using the old kernel, I get issues booting. I'll try using my original install media to see if it will boot from USB. The problem I am having is that this is my master node, and I can't get another node to become master and get control of my VMs. Another node isn't becoming the new master for the remaining two nodes.

                  Update: Booting to the 8.2.0 install ISO, it booted into the installer just fine.

                    stormi Vates 🪐 XCP-ng Team

                    You should be able to trigger an emergency transition to master on another host of the pool.

                      jcdick1 @stormi
                      last edited by jcdick1

                      @stormi I guess I don't know how to do that. Using the console "Resource Pool Configuration" -> "Designate a new pool master" results in a "This host is not a Pool member" message for both nodes.

                      Edit: I found an older post with the xe commands for emergency transition.
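
                      For readers landing here: the commands in question are most likely xe pool-emergency-transition-to-master followed by xe pool-recover-slaves, both run on the surviving host that should become the new master. A minimal sketch under that assumption; the function name and the XE override are hypothetical additions that allow a dry run:

                      ```shell
                      # Run on the surviving host that should take over as pool master.
                      # XE defaults to the real xe CLI; overriding it (e.g. XE=echo) is a
                      # hypothetical dry-run aid added for this sketch.
                      promote_to_master() {
                          # Make this host the master even though the old master is unreachable.
                          "${XE:-xe}" pool-emergency-transition-to-master || return 1
                          # Tell the remaining pool members to follow the new master.
                          "${XE:-xe}" pool-recover-slaves
                      }
                      ```

                      Running XE=echo promote_to_master first prints the two xe subcommands without touching the pool, which is a cheap check before doing it for real.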

                        jcdick1 @stormi

                        @stormi I got a new master, and recovered the other slave. I did a fresh reinstall of the 8.2.0 ISO and the host came up just fine. So there's definitely something in the new code that breaks these machines.

                          stormi Vates 🪐 XCP-ng Team

                          Since reverting to the previous kernel did not solve your issue, I suspect this might be due to the microcode update.

                          But this microcode comes from Intel directly so it's surprising (although not impossible, they do break things from time to time).

                          If it's the microcode, this would also mean that you had not updated in a long time, as the recent update train did not contain any such microcode update.

                          You may try to update everything but the microcode. You could add something like this (untested) in /etc/yum.conf:

                          exclude=microcode_ctl*
                          

                          And then update everything else.
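
                          The yum.conf change above can be applied idempotently with a small helper. This is a sketch only: add_yum_exclude is a made-up name, and it assumes the config file contains nothing but a [main] section, so that an appended line is read as part of it.

                          ```shell
                          # add_yum_exclude CONF PATTERN: append "exclude=PATTERN" to CONF unless
                          # that exact line is already present. Hypothetical helper; it assumes
                          # CONF contains only a [main] section, so an appended line lands in it.
                          add_yum_exclude() {
                              conf=$1
                              pattern=$2
                              if ! grep -qxF "exclude=${pattern}" "$conf" 2>/dev/null; then
                                  printf 'exclude=%s\n' "$pattern" >> "$conf"
                              fi
                          }
                          ```

                          Typical use would be add_yum_exclude /etc/yum.conf 'microcode_ctl*' followed by yum update; comparing the output of rpm -q microcode_ctl before and after confirms the package was left alone.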

                            jcdick1 @stormi

                            @stormi I'll try that. If I update via CLI, I think I can use yum update --exclude='microcode_ctl*' (quoted so the shell does not expand the glob) and see what happens. If I remember correctly, it reported 22 missing patches when I started this process with the patches released last week. Now the fresh install reports 43 missing patches.

                              jcdick1 @stormi

                              @stormi

                              Just as an FYI, the latest patches - 20190314-2.xcp - seem to not have an issue. The system patched and booted up just fine with no issues bringing up CPUs or anything like that.

                                stormi Vates 🪐 XCP-ng Team @jcdick1

                                @jcdick1 They only update microcode for Fam 17h and 19h AMD CPUs 🙂

                                  jcdick1 @stormi

                                  @stormi Yeah, I spoke too soon. Two of my hosts came right up fine after the patches. The third has been in a boot loop with CPU panics and fatal page faults.

                                    Danp Pro Support Team @jcdick1

                                    @jcdick1 Could it be the same problem described here?

                                      jcdick1 @Danp
                                      last edited by jcdick1

                                      @Danp No, but I seem to have figured it out. It's a weird one, considering the platform.

                                      These are HP DL360s with fully licensed iLOs. But the CPU errors are gone if I physically connect a keyboard to the server. If I reboot while only monitoring via the iLO remote console, I get the errors. If I go to the machines and connect a USB keyboard before the reboot, then go back to my workstation, do everything through XO, and watch via the remote console, they come up fine. As for my earlier post about two hosts coming up fine: I remembered that, coincidentally, I had switched the KVM to them.

                                      So just an FYI to anyone who might have the same problem, plug in a keyboard.

                                        Andrew Top contributor @jcdick1

                                        @jcdick1 I have almost the same servers and everything has been fine (without keyboards)...

                                        Mine are HP DL360p G8 servers with the E5-2680 v2, plus some with the E5-2680 (not v2). One machine did have problems, but that was a hardware failure (CPU fault), not XCP-ng.

                                        These HP G8 machines are about a decade old now, so hardware issues are not a surprise.
