XCP-ng

    Applied recent patches ... Now getting CPU errors

    Compute
    19 Posts 5 Posters 1.7k Views 2 Watching
      jcdick1 @stormi

      @stormi

      The problem is, the system won't boot. It gets to that error and immediately reboots. I only got a screenshot by watching the iLO remote console with my finger on the Print Screen key, ready to grab it before the machine went back to POST.

        Andrew Top contributor @jcdick1

        @jcdick1 Yes, that's the latest (last) BIOS update for the DL360p G8. There is a new iLO update, but that does not matter for this.

        I have a G8 with the E5-2680 (v1 chips) and I don't get the boot errors. But I do see reports of microcode update issues on other (newer) CPUs, though none of those reports are for XCP.

        You can try the Linux "dis_ucode_ldr" option to disable the Intel microcode loader, but I have not tested it in XCP. For a quick test, you should be able to add it manually at the GRUB boot menu. If it works, you can add it to grub.cfg, downgrade the microcode package, or make kernel changes.

        https://www.kernel.org/doc/html/latest/admin-guide/kernel-parameters.html
        

        If it does not work then you can try booting other kernels from grub or from USB...
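
For the persistent variant of the "dis_ucode_ldr" test, the grub.cfg edit could be sketched roughly as below. This is an untested sketch, not a confirmed XCP-ng procedure: it assumes the dom0 kernel is loaded by a line of the form "module2 /boot/vmlinuz-...", and append_kernel_param is a hypothetical helper name. It prints the modified file to stdout instead of editing in place, so the result can be reviewed first. Under Xen, the hypervisor's own "ucode=" option on the multiboot2 line may also influence microcode loading, so a negative result here is not conclusive.

```shell
#!/bin/sh
# Hypothetical helper: print a copy of a grub.cfg with PARAM appended to every
# dom0 kernel line. Assumes kernel lines look like "module2 /boot/vmlinuz-...".
append_kernel_param() {
    cfg="$1"
    param="$2"
    awk -v p="$param" '
        # Match only the kernel line, not the initrd "module2" line.
        /^[ \t]*module2[ \t]+[^ \t]*vmlinuz/ {
            found = 0
            for (i = 1; i <= NF; i++) {
                if ($i == p) found = 1
            }
            # Append the parameter only if it is not already present.
            if (!found) $0 = $0 " " p
        }
        { print }
    ' "$cfg"
}
```

A possible use is `append_kernel_param /boot/grub/grub.cfg dis_ucode_ldr > /tmp/grub.cfg.new`, followed by a diff of the two files before copying anything over. The grub.cfg path is an assumption and may differ on UEFI installs.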

          stormi Vates 🪐 XCP-ng Team @jcdick1

          @jcdick1 You can boot the last entry in GRUB; it will boot the original kernel and Xen from when you installed your host.

            jcdick1 @stormi

            @stormi Even using the old kernel, I get issues booting. I'll try using my original install media to see if it will boot from USB. The problem I am having is that this is my master node, and none of the remaining nodes is taking over as master, so I can't get control of my VMs.

            Update: Booting to the 8.2.0 install ISO, it booted into the installer just fine.

              stormi Vates 🪐 XCP-ng Team

              You should be able to trigger an emergency transition to master on another host of the pool.

                jcdick1 @stormi

                @stormi I guess I don't know how to do that. Using the console "Resource Pool Configuration" -> "Designate a new pool master" results in a "This host is not a Pool member" message for both nodes.

                Edit: I found an older post with the xe commands for emergency transition.
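
For anyone else looking for them, a common approach uses two xe commands run on the surviving member that should take over. The sketch below is an assumption about the general procedure, not a record of the exact commands used in this thread. The run and promote_to_master helpers are hypothetical, and with DRY_RUN=1 (the default here) the script only prints what it would do.

```shell
#!/bin/sh
# Sketch: promote a surviving pool member when the master is unreachable.
# DRY_RUN=1 (default) prints the commands instead of executing them.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

promote_to_master() {
    # Run on the member that should become the new master.
    run xe pool-emergency-transition-to-master
    # Ask the remaining members to follow the new master.
    run xe pool-recover-slaves
}
```

Calling promote_to_master with DRY_RUN=1 shows the two commands. Setting DRY_RUN=0 on the chosen host would actually run them, so only do that once the old master is confirmed to be down.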

                  jcdick1 @stormi

                  @stormi I got a new master, and recovered the other slave. I did a fresh reinstall of the 8.2.0 ISO and the host came up just fine. So there's definitely something in the new code that breaks these machines.

                    stormi Vates 🪐 XCP-ng Team

                    Since reverting to the previous kernel did not solve your issue, I suspect this might be due to the microcode update.

                    But this microcode comes from Intel directly so it's surprising (although not impossible, they do break things from time to time).

                    If it's the microcode, this would also mean that you had not updated in a long time, as the recent update train did not contain any such microcode update.

                    You may try to update everything but the microcode. You could add something like this (untested) in /etc/yum.conf:

                    exclude=microcode_ctl*
                    

                    And then update everything else.
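
A sketch of that yum.conf change, equally untested on a real host, might look like the following. add_yum_exclude is a hypothetical helper. It assumes the config file has a single [main] section and ends with a newline, so that a line appended at the end lands inside [main], and it keeps a .bak copy before writing.

```shell
#!/bin/sh
# Hypothetical helper: add "exclude=PATTERN" to a yum config file once.
# Assumes a single [main] section, so an appended line belongs to [main].
add_yum_exclude() {
    conf="$1"
    pattern="$2"
    # -x: match the whole line, -F: treat the pattern (including '*') literally.
    if grep -qxF "exclude=${pattern}" "$conf"; then
        return 0
    fi
    # Keep a backup of the original file before modifying it.
    cp "$conf" "${conf}.bak"
    printf 'exclude=%s\n' "$pattern" >> "$conf"
}
```

For example, `add_yum_exclude /etc/yum.conf 'microcode_ctl*'` followed by a normal `yum update`. The single quotes keep the shell from expanding the `*` before yum sees it.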

                      jcdick1 @stormi

                      @stormi I'll try that. If I update via CLI, I think I can use "yum update --exclude='microcode_ctl*'" and see what happens. If I remember correctly, it reported 22 missing patches when I started this process with the patches released last week. Now the fresh install reports 43 missing patches.

                        jcdick1 @stormi

                        @stormi

                        Just as an FYI, the latest patches - 20190314-2.xcp - seem to not have an issue. The system patched and booted up just fine with no issues bringing up CPUs or anything like that.

                          stormi Vates 🪐 XCP-ng Team @jcdick1

                          @jcdick1 They only update microcode for Fam 17h and 19h AMD CPUs 🙂

                            jcdick1 @stormi

                            @stormi Yeah, I spoke too soon. Two of my hosts came right up fine after the patches. The third has been in a boot loop with CPU panics and fatal page faults.

                              Danp Pro Support Team @jcdick1

                              @jcdick1 Could it be the same problem described here?

                                jcdick1 @Danp

                                @Danp No, but I seem to have figured it out. It's a weird one, considering the platform.

                                These are HP DL360s, with fully licensed iLOs. But ... the CPU errors are gone if I physically connect a keyboard to the server. If I reboot while just monitoring via the iLO remote console, I get the errors. If I go to the machines and connect a USB keyboard before the reboot, then go back to my workstation and do it all through XO and watch via remote console, they come up fine. As for my earlier post about two hosts coming up fine: I remembered that, coincidentally, I'd switched the KVM to them.

                                So just an FYI to anyone who might have the same problem, plug in a keyboard.

                                  Andrew Top contributor @jcdick1

                                  @jcdick1 I have almost the same servers and everything has been fine (without keyboards)...

                                  HP DL360p G8 with E5-2680 v2 and also some with E5-2680 (not v2). One machine did have problems, but that was a hardware failure (CPU fault), not XCP.

                                  These HP G8 machines are about a decade old now, so hardware issues are not a surprise.
