XCP-ng

    XCP-ng 8.2.1 crash

    • andSmvA Offline
      andSmv Vates 🪐 XCP-ng Team Xen Guru
      last edited by

      Hello, both issues seem to be related to memory corruption.

      • The first trace is an #NMI exception (one possible cause is a parity error detected by the HW). Moreover, CPU#12 gets the #MC (machine check) exception. The #MC is triggered by the HW to notify the system software that there's an unrecoverable issue with the HW.
      • The second one is the invalid opcode in the Xen Hypervisor context. So it means that either the instruction flow is corrupted, or the instruction pointer is corrupted.

      My hypothesis is:

      In the first case - the ECC memory error is detected (and reported by HW) which makes the hypervisor panic and stop

      In the second case - the memory error is not detected (but the memory is still corrupted) but at some point, this corruption provokes the same result on the Xen hypervisor.

      Can you check with the Hetzner guys whether the memory modules can be replaced?

      The other way to validate this hypothesis is to install a different system software (another OS/hypervisor, another version of hypervisor) and see if you experience the same issue.

      You can also add the "ler=true" option to the Xen command line. This can give us more traces (provided by the HW) to check that nothing is abnormal at the software level. I will probably also need your Xen image with its symbol table (xen-syms-XXX and xen-syms-XXX.map).

      • fdrcrtlF Offline
        fdrcrtl @andSmv
        last edited by

        andSmv I'm blown away by your professionalism, thank you!
        Another crash today. I've reverted grub to the basic dom0_max_vcpus=4 dom0_vcpus_pin max_cstate=0 plus ler=true (hoping for another crash within 1-2 days).

        I'll schedule a deep check/memtest with Hetzner this weekend to see if they can address this issue, I'll keep you updated!

        PS. Is cpufreq=xen:performance max_cstate=1 iommu=0 a good combination for better performance/stability (no HW passthrough)?

        • andSmvA Offline
          andSmv Vates 🪐 XCP-ng Team Xen Guru
          last edited by andSmv

          Thank you 🙂 👍 I hope we will quickly pinpoint the issue and find the solution for it.

          For your command line - I think it's a good choice if you are looking for performance and have no use for PCI passthrough. Normally the IOMMU is not involved if you have no passed-through devices, but we have already seen issues on some platforms where the IOMMU itself exhibits unstable behavior. So yes - it is better to disable it if you don't need it.

          • fdrcrtlF Offline
            fdrcrtl @andSmv
            last edited by

            andSmv three crashes in a row just now!
            grub.cfg: multiboot2 /boot/xen.gz dom0_mem=4096M,max:4096M watchdog ucode=scan dom0_max_vcpus=4 dom0_vcpus_pin ler=true cpufreq=xen:performance max_cstate=1 iommu=0 crashkernel=256M,below=4G console=vga vga=mode-0x0311

            xen.log:(XEN) [ 711.242947] Panic on CPU 14:
            xen.log:(XEN) [ 711.242948] FATAL TRAP: vector = 6 (invalid opcode)

            xen.log:(XEN) [ 854.061272] Panic on CPU 8:
            xen.log:(XEN) [ 854.061273] FATAL TRAP: vector = 2 (nmi)

            xen.log:(XEN) [ 556.104951] Panic on CPU 14:
            xen.log:(XEN) [ 556.104951] FATAL TRAP: vector = 6 (invalid opcode)
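The panic lines above can be tallied with a short script to see which CPUs and trap vectors recur. This is only a minimal sketch: it assumes the log keeps the exact "Panic on CPU N:" / "FATAL TRAP: vector = N (name)" wording shown in this post, and the helper name `summarize_panics` is made up for illustration.

```python
import re
from collections import Counter

# Patterns for the two line shapes quoted in the post, e.g.
#   xen.log:(XEN) [ 711.242947] Panic on CPU 14:
#   xen.log:(XEN) [ 711.242948] FATAL TRAP: vector = 6 (invalid opcode)
PANIC_RE = re.compile(r"Panic on CPU (\d+):")
TRAP_RE = re.compile(r"FATAL TRAP: vector = (\d+) \(([^)]*)\)")


def summarize_panics(lines):
    """Pair each 'Panic on CPU' line with the FATAL TRAP line that follows it.

    Returns a list of (cpu, vector, trap_name) tuples. Hypothetical helper,
    not part of Xen or XCP-ng.
    """
    panics = []
    cpu = None
    for line in lines:
        match = PANIC_RE.search(line)
        if match:
            cpu = int(match.group(1))
            continue
        match = TRAP_RE.search(line)
        if match and cpu is not None:
            panics.append((cpu, int(match.group(1)), match.group(2)))
            cpu = None
    return panics


# The six log lines quoted above, copied verbatim.
sample = """\
xen.log:(XEN) [ 711.242947] Panic on CPU 14:
xen.log:(XEN) [ 711.242948] FATAL TRAP: vector = 6 (invalid opcode)
xen.log:(XEN) [ 854.061272] Panic on CPU 8:
xen.log:(XEN) [ 854.061273] FATAL TRAP: vector = 2 (nmi)
xen.log:(XEN) [ 556.104951] Panic on CPU 14:
xen.log:(XEN) [ 556.104951] FATAL TRAP: vector = 6 (invalid opcode)
""".splitlines()

panics = summarize_panics(sample)
by_vector = Counter(name for _, _, name in panics)
cpus = sorted({cpu for cpu, _, _ in panics})

for cpu, vector, name in panics:
    print(f"CPU {cpu}: vector {vector} ({name})")
```

For the three crashes here, this gives two invalid-opcode traps (both on CPU 14) and one NMI (on CPU 8).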

            I've dumped the crash folder, kdump and .map files (where I could find them). What do you need, and where should I send it? I'm powering off the host now for an extended memtest by Hetzner.

            • fdrcrtlF Offline
              fdrcrtl
              last edited by

              Update before starting hw test by hetzner, they said "Please note that this server is 5000 series ryzen and it needs at least Linux kernel version 5.1 to run smoothly as it gets proper support from kernel version 5.12 and above. We have seen many problem from customers running kernel version below 5.1"

              Deep into the rabbit hole: https://bugzilla.kernel.org/show_bug.cgi?id=212087 - and XCP-ng is running on 4.19 😓 ..

              • olivierlambertO Online
                olivierlambert Vates 🪐 Co-Founder CEO
                last edited by

                It's not Linux that is really "running" on the CPU but Xen (since your dom0 is a VM, not the "host").

                So the idea is to try to find what's causing issues on Xen with this CPU.

                • fdrcrtlF Offline
                  fdrcrtl
                  last edited by fdrcrtl

                  Thanks for the clarification olivierlambert, just seen in the docs: Citrix Hypervisor 8.2, Base version of CentOS in dom0: 7.5, Xen 4.13.1 + patches, Kernel 4.19 + patches

                  Just want to give more info to the support team! Anyway, from Hetzner's perspective this is a negative point. Also, is the AMD microcode installed by default? The server is now under testing; I hope they find something HW related.

                  Update
                  Unfortunately the test completed without any errors 😞

                  Your server finished the hardware check test without any hardware related issues. We boot the server back to the installed System. As we recommended try to use kernel version at least 5.1.

                  Summary of the test:

                  -----------------%<-----------------
                  DMESG: Ok
                  CPUFREQ-CHECK: Ok
                  STRESSTEST-CPU-TEMP: Ok
                  FANCHECK: Ok
                  STRESSTEST: Ok
                  MCE-CHECK: Ok

                  HDDTEST S64HNE0T******: Ok
                  HDDTEST S64HNE0T******: Ok

                  -----------------%<-----------------

                  • andSmvA Offline
                    andSmv Vates 🪐 XCP-ng Team Xen Guru
                    last edited by

                    Hmm, in the bugzilla thread the guys talk about adjusting SoC voltage and updating the BIOS. It still seems to me to be a HW problem... I will look through the whole thread and I will do some research about possible workarounds in newer Linux kernels for 5000 series ryzen.

                    • fdrcrtlF Offline
                      fdrcrtl
                      last edited by fdrcrtl

                      Right andSmv, here is what I've found so far:

                      • Due to wrong voltage reporting in kernels < 5.12, the offset voltage had to be higher
                      • Implementing ZenStates may help: https://forum.level1techs.com/t/overclock-your-ryzen-cpu-from-linux/126025
                      • Some success from AMD forum: https://community.amd.com/t5/processors/ryzen-5900x-system-constantly-crashing-restarting-whea-logger-id/td-p/423321/page/84
                      • Some kernel patches needed for the Ryzen 5000 series: https://unix.stackexchange.com/questions/628222/what-changes-had-to-be-made-to-linux-kernel-in-order-to-support-ryzen-5000-serie

                      I don't know if it helps, but I've added max_cstate=5 and cpufreq=xen:powersave to limit CPU usage and reduce power requirements. Do those settings apply system-wide or only to Xen?

                      • andSmvA Offline
                        andSmv Vates 🪐 XCP-ng Team Xen Guru
                        last edited by

                        To be honest, I would put cpufreq=none and max_cstate=0. This should disable all CPU P-state and C-state management by Xen. That way, if there's a bug in the firmware ACPI tables (or maybe in the way Xen handles them), it should be possible to pinpoint it.
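To make the suggested change concrete, here is a minimal sketch of swapping those two options in the Xen option string posted earlier in the thread. It only manipulates a string and does not touch grub.cfg; `set_xen_options` is a hypothetical helper written for this illustration, not an XCP-ng or Xen tool.

```python
def set_xen_options(cmdline, updates):
    """Return `cmdline` with each option in `updates` replaced or appended.

    Options are space-separated tokens of the form `key` or `key=value`.
    Hypothetical helper for illustration only.
    """
    seen = set()
    out = []
    for token in cmdline.split():
        key = token.split("=", 1)[0]
        if key in updates:
            if key in seen:
                continue  # drop duplicate occurrences of an updated option
            seen.add(key)
            out.append(f"{key}={updates[key]}")
        else:
            out.append(token)
    # Append options that were not present in the original line.
    for key, value in updates.items():
        if key not in seen:
            out.append(f"{key}={value}")
    return " ".join(out)


# Xen options as posted earlier in the thread (after "multiboot2 /boot/xen.gz").
current = (
    "dom0_mem=4096M,max:4096M watchdog ucode=scan dom0_max_vcpus=4 "
    "dom0_vcpus_pin ler=true cpufreq=xen:performance max_cstate=1 iommu=0 "
    "crashkernel=256M,below=4G console=vga vga=mode-0x0311"
)

suggested = set_xen_options(current, {"cpufreq": "none", "max_cstate": "0"})
print(suggested)
```

All other options, including ler=true and iommu=0, are left in place and in their original order, so only the power-management settings change between test runs.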

                        • andSmvA Offline
                          andSmv Vates 🪐 XCP-ng Team Xen Guru
                          last edited by

                          Thank you for all these links! I will look through them (need some time though)

                          • ron-gR Offline
                            ron-g
                            last edited by

                            FWIW, I was having similar kernel panics on my HP DL380G8 today. Two Xeon E5-2620 2 GHz, microcode version 0x71a. It's happened before, but only on a reboot.

                            Today, the kernel panics weren't consistent as to which CPU it was. I saw it get as high as CPU 22 and as low as CPU 3.

                            I was viewing POST via iLO remote console. After about an hour of allowing it to reboot on its own or with my manually resetting it via iLO GUI, I went to my data center and turned on the monitor and switched to the KVM channel the server was on. It came back up then. HTH.

                            • fdrcrtlF Offline
                              fdrcrtl
                              last edited by

                              Good morning, any update on this?
                              Meanwhile 60+ days stable with max_cstate=5 cpufreq=xen:powersave

                              • fdrcrtlF Offline
                                fdrcrtl @fdrcrtl
                                last edited by

                                fdrcrtl andSmv olivierlambert
                                Bump, ty
