    Issue after latest host update

      RealTehreal @olivierlambert

      @olivierlambert Thank you very much for pointing out the real issue.

        RealTehreal

        What should happen now? Who should be informed about this issue with the microcode update? Is it still an XCP-ng issue, a Linux issue, or an Intel issue? Thank you in advance for the clarification.

          andyhhp Xen Guru @RealTehreal

          @RealTehreal It's an Intel issue, but while this is enough to show that there is an issue, it's not enough to figure out what is wrong.

          Sadly, a VM falling into a busy loop can be one of many things. It's clearly on the (v)BSP prior to starting (v)APs, hence why it's only ever a single CPU spinning.

          Can you switch to using the debug hypervisor (change the /boot/xen.gz symlink to point at the -d suffixed hypervisor), and then capture xl dmesg after trying to boot one VM. Depending on how broken things are, we might see some diagnostics.
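A minimal sketch of the symlink switch described above. It assumes the debug builds sit next to the regular ones in /boot and end in "-d.gz"; the helper name list_debug_xen and the output file name are made up for illustration.

```shell
#!/bin/sh
# List candidate debug hypervisor images in a boot directory.
# Assumption: debug builds end in "-d.gz", following the "-d suffixed" hint.
list_debug_xen() {
    dir=${1:-/boot}
    for f in "$dir"/xen-*-d.gz; do
        # Skip the unexpanded glob when nothing matches.
        [ -e "$f" ] && basename "$f"
    done
    return 0
}

# Hypothetical usage on the host (not executed here):
#   readlink /boot/xen.gz        # note the current target so it can be restored
#   list_debug_xen /boot         # pick the build matching the current version
#   ln -sfn <chosen -d image> /boot/xen.gz
#   reboot, try to start one VM, then:
#   xl dmesg > xl_dmesg_debug.txt
```

Restoring the original symlink target afterwards returns the host to the release hypervisor.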

          Could you also try running xtf as described here: https://xcp-ng.org/forum/post/57804 It's a long-shot, but if it does happen to stumble on the issue, then it will be orders of magnitude easier to debug than something misc broken in the middle of OVMF.

            RealTehreal @andyhhp

            @andyhhp Sure thing. I'll just need some time, as I can only do such things in my free time.

              nikade Top contributor @RealTehreal

              @RealTehreal said in Issue after latest host update:

              @RealTehreal
              Step-by-step instructions, in case someone else has the same issue:

              1.: yum history list to get the transaction id of the last update.

              2.: yum history info # with # being the id from step 1, to list the updates done in this transaction. The interesting part for me was

              Updated microcode_ctl-2:2.1-26.xs26.2.xcpng8.2.x86_64  
              Update                2:2.1-26.xs28.1.xcpng8.2.x86_64
              

              3.: yum downgrade microcode_ctl-2:2.1-26.xs26.2.xcpng8.2.x86_64 to downgrade to the previous version. You will have to enter the older version for this command.

              4.: Wait until it's done, reboot, test, pray it'll work again.

              This is just a workaround! Microcode updates are important security and/or functional updates. Downgrading can lead to security issues.
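The steps quoted above could be scripted roughly as follows. This is only a sketch: the helper name previous_microcode_pkg is hypothetical, and it assumes yum history info prints the "Updated <old package>" line in the layout shown in step 2.

```shell
#!/bin/sh
# Read `yum history info <id>` output on stdin and print the previously
# installed microcode_ctl build (the package named on the "Updated" line).
previous_microcode_pkg() {
    awk '$1 == "Updated" && $2 ~ /^microcode_ctl-/ { print $2; exit }'
}

# Hypothetical usage on the affected host (not executed here):
#   yum history list                      # find the id of the last update
#   old=$(yum history info <id> | previous_microcode_pkg)
#   [ -n "$old" ] && yum downgrade "$old"
#   reboot
```

If the helper prints nothing, the transaction did not touch microcode_ctl and the downgrade should not be attempted.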

              Thanks for sharing the resolution, I'm sure it will help someone else in the future.

                john.c @olivierlambert

                @olivierlambert said in Issue after latest host update:

                @RealTehreal said in Issue after latest host update:

                Intel(R) Celeron(R) J4105 CPU @ 1.50GHz

                Another Gemini Lake… So it's clearly related.

                I had already found out its code name, but then things got busy, so I was unable to check the microcode notes or post this to the forum. I found it without using cat /proc/cpuinfo.

                I took the CPU model from this web page (https://www.fujitsu.com/uk/products/computing/pc/thin-clients/futro-s740/). Looking up the Intel Celeron processor J4105 on Intel Ark then revealed its code name, along with a wealth of other useful information (https://ark.intel.com/content/www/us/en/ark/products/128989/intel-celeron-j4105-processor-4m-cache-up-to-2-50-ghz.html).

                  andyhhp Xen Guru @RealTehreal

                  @RealTehreal In addition to the XTF testing, could you also please try (with the bad microcode) booting Xen with spec-ctrl=no-verw on the command line, and seeing whether that changes the behaviour of your regular VMs? Please capture xl dmesg from this run too.
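One possible way to run this test is sketched below. Two assumptions are made: that the host ships the /opt/xensource/libexec/xen-cmdline helper for editing the Xen command line, and that xl info reports the active command line in its xen_commandline field. The function name xen_cmdline_has is made up for illustration.

```shell
#!/bin/sh
# Succeed if the Xen command line in $1 contains the option $2 as a whole word.
xen_cmdline_has() {
    case " $1 " in
        *" $2 "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Hypothetical usage on the host (not executed here):
#   /opt/xensource/libexec/xen-cmdline --set-xen spec-ctrl=no-verw   # assumed helper
#   reboot
#   cmdline=$(xl info | sed -n 's/^xen_commandline *: *//p')
#   xen_cmdline_has "$cmdline" spec-ctrl=no-verw && echo "option active"
#   xl dmesg > xl_dmesg_no_verw.txt
```

If a spec-ctrl= option is already present on the command line, the values would probably need to be merged by hand rather than overwritten.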

                    stormi Vates 🪐 XCP-ng Team

                    Doc about XTF testing: https://docs.xcp-ng.org/project/development-process/tests/#test-the-xen-hypervisor-itself

                      RealTehreal

                      I'll do the testing on the weekend.

                        andyhhp Xen Guru @RealTehreal

                        @RealTehreal Sorry to keep adding to the list of diagnostics, but everything here will help. After you've tried the other options, could you try this:

                        If the XTF testing shows any XTF test looping, use that single test, otherwise use your regular VM. Get one VM into the looping state. Check xl list to confirm that you've only got Domain-0 and the one other VM, and note its domid (the "ID" column).

                        In dom0, run xentrace to capture a system trace. It's looping so the dump file is going to be large, but it also means that you can CTRL-C as quickly as you can on the shell and it will be fine (a few hundred milliseconds of samples will almost certainly be enough).

                        Anyway, run xentrace -D -e 0x0008f000 xentrace.dmp and then give me the resulting xentrace.dmp file. If you're interested in what's in it, you can decode it using xenalyze -a xentrace.dmp |& less.

                        Then, run xen-hvmctx $domid two or three times, and share the output of all.
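The capture sequence above can be sketched as a small script. The commands are the ones named in the post; the helper single_guest_domid and the output file names are hypothetical, and the parsing assumes domain names without spaces in the xl list output.

```shell
#!/bin/sh
# Read `xl list` output on stdin and print the domid of the only guest.
# Fails unless exactly one domain besides Domain-0 is listed.
single_guest_domid() {
    awk 'NR > 1 && $1 != "Domain-0" { n++; id = $2 }
         END { if (n == 1) print id; else exit 1 }'
}

# Hypothetical usage in dom0 while the VM is looping (not executed here):
#   domid=$(xl list | single_guest_domid) || echo "expected exactly one guest"
#   xentrace -D -e 0x0008f000 xentrace.dmp     # stop with CTRL-C almost immediately
#   xenalyze -a xentrace.dmp 2>&1 | less       # optional: inspect the trace
#   for i in 1 2 3; do xen-hvmctx "$domid" > "hvmctx_$i.txt"; done
```

Refusing to continue when more than one guest is running keeps the trace limited to the looping VM, which is the point of the xl list check.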

                          RealTehreal @andyhhp

                          First things first: here is some information.

                          xl dmesg with the debug hypervisor and bad microcode, after trying to run a VM: xl_dmesg_bad_microcode.txt

                          xtf short: xtf_short.txt

                          xtf long: xtf_long.txt

                            RealTehreal @andyhhp

                            I sent you a pm.

                              andyhhp Xen Guru @RealTehreal

                              @RealTehreal Thank you very much for that information. I'll follow up with Intel.

                              In the short term, I'd recommend just using the old microcode.

                                olivierlambert Vates 🪐 Co-Founder CEO

                                FYI I ordered and received a Mini PC based on a Celeron N4000 for internal testing (Gemini Lake "non-refresh"), and we were able to reproduce the issue 🙂

                                So as @andyhhp said, now that we are 100% sure it's the microcode, it's up to Intel, which is now aware of this!

                                    mr_zz @nikade

                                    @nikade Here I am!
                                    Same problem (obviously) with an Intel J4005 (I'm new to XCP-ng and I'm thinking about a homelab migration from Proxmox, with an old NUC 7 as a test).
                                    So thank you all very much for clarifying which of the 65 updates (since the initially downloaded image) was the problem that had been driving me crazy during these days of testing!

                                      nikade Top contributor @mr_zz

                                      @mr_zz welcome to the forum 🙂

                                        andyhhp Xen Guru @nikade

                                        @RealTehreal I've got a fix from Intel, and @stormi has packaged it.

                                        yum update microcode_ctl --enablerepo=xcp-ng-testing should get you microcode_ctl-2.1-26.xs29.2.xcpng8.2 which has the fixed microcode for this issue in it.
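A small sketch for confirming that the fixed build was actually installed. It assumes rpm -q microcode_ctl prints the package as name-version-release.arch; the function name is_fixed_microcode is made up, and only the exact build named above is matched, so later builds would need the pattern adjusted.

```shell
#!/bin/sh
# Succeed if the given `rpm -q microcode_ctl` output names the fixed build.
is_fixed_microcode() {
    case "$1" in
        microcode_ctl-2.1-26.xs29.2.xcpng8.2*) return 0 ;;
        *) return 1 ;;
    esac
}

# Hypothetical usage on the host (not executed here):
#   yum update microcode_ctl --enablerepo=xcp-ng-testing
#   is_fixed_microcode "$(rpm -q microcode_ctl)" && echo "fixed build installed"
#   reboot    # assumption: the new microcode is only loaded at the next boot
```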

                                          mgigirey @andyhhp

                                          @andyhhp Any plans to update intel-microcode for XCP-ng 8.3? The latest known version working in my setup is intel-microcode-20231009-1.xcpng8.3.noarch.rpm

                                            andyhhp Xen Guru @mgigirey

                                            I am not an XCP-ng developer. You'll have to ask @stormi for that.
