XCP-ng

    XCP-ng 8.2.1 crash

    • fdrcrtl

      Hi forum, I'm experiencing crashes on my AX101 hosted at Hetzner (5950X, ECC RAM, tested with no problems) under almost no load (2 VMs x 2 cores, 4 GB). I found the following in the dumps under /var/crash. How can I investigate further? TY

      grub cfg

      multiboot2 /boot/xen.gz dom0_mem=4096M,max:4096M watchdog ucode=scan dom0_max_vcpus=4 dom0_vcpus_pin cpufreq=xen:performance max_cstate=1 iommu=0 crashkernel=256M,below=4G console=vga vga=mode-0x0311
      	module2 /boot/vmlinuz-4.19-xen root=LABEL=root-ebdyoh ro nolvm hpet=disable rd.auto console=hvc0 console=tty0 vga=785 plymouth.ignore-serial-consoles
      
      
      (XEN) [73341.747357] ----[ Xen-4.13.4-9.22.2  x86_64  debug=n   Not tainted ]----
      (XEN) [73341.747358] CPU:    12
      (XEN) [73341.747358] RIP:    e008:[<ffff82d08022cb69>] common/sched_credit.c#csched_unit_wake+0xf9/0x6e0
      (XEN) [73341.747362] RFLAGS: 0000000000010046   CONTEXT: hypervisor
      (XEN) [73341.747363] rax: ffff831fd31961e0   rbx: ffff82d080598088   rcx: ffff831fd31961e0
      (XEN) [73341.747363] rdx: ffff831fd31961e0   rsi: 00000000d318ef30   rdi: ffff831fd31961e0
      (XEN) [73341.747364] rbp: ffff83202bcad5f8   rsp: ffff831fd39ffc30   r8:  000042b403efae21
      (XEN) [73341.747364] r9:  ffff83202bcac000   r10: 000042b434cc56d9   r11: 0000000000000001
      (XEN) [73341.747365] r12: ffff82d0805bc300   r13: ffff82d08059704c   r14: ffff831fd318ef30
      (XEN) [73341.747365] r15: ffff831fd318ee60   cr0: 000000008005003b   cr4: 00000000003506e0
      (XEN) [73341.747366] cr3: 00000000cce7b000   cr2: ffff82d092cd836c
      (XEN) [73341.747366] fsb: 0000000000000000   gsb: 0000000000000000   gss: 0000000000000000
      (XEN) [73341.747366] ds: 0000   es: 0000   fs: 0000   gs: 0000   ss: 0000   cs: e008
      (XEN) [73345.479207] Watchdog timer detects that CPU13 is stuck!
      (XEN) [73345.479208] ----[ Xen-4.13.4-9.22.2  x86_64  debug=n   Not tainted ]----
      (XEN) [73345.479208] CPU:    13
      (XEN) [73345.479209] RIP:    e008:[<ffff82d08023faf5>] _spin_lock_recursive+0x35/0x70
      (XEN) [73345.479210] RFLAGS: 0000000000000002   CONTEXT: hypervisor
      (XEN) [73345.479211] rax: 00000000000044aa   rbx: 0000000000000096   rcx: 000000000000000d
      (XEN) [73345.479211] rdx: 00000000000044ab   rsi: ffff82d0803c44aa   rdi: ffff82d08047eb18
      (XEN) [73345.479212] rbp: 0000000000000000   rsp: ffff831fd39efb70   r8:  0000000000000000
      (XEN) [73345.479212] r9:  0000000000000000   r10: 0000000000000000   r11: 0000000000000000
      (XEN) [73345.479213] r12: 0000000000000000   r13: 0000000000000000   r14: ffff831fd39efc28
      (XEN) [73345.479213] r15: ffff82d092cd836c   cr0: 000000008005003b   cr4: 00000000003506e0
      (XEN) [73345.479214] cr3: 00000000cce7b000   cr2: ffff82d092cd836c
      (XEN) [73345.479214] fsb: 0000000000000000   gsb: 0000000000000000   gss: 0000000000000000
      (XEN) [73345.479214] ds: 0000   es: 0000   fs: 0000   gs: 0000   ss: 0000   cs: e008
      (XEN) [73345.479216] Xen code around <ffff82d08023faf5> (_spin_lock_recursive+0x35/0x70):
      (XEN) [73345.479216]  74 0b 90 f3 90 66 8b 07 <66> 39 c2 75 f6 48 89 e2 48 8d 05 48 75 35 00 48
      (XEN) [73345.479218] Xen stack trace from rsp=ffff831fd39efb70:
      (XEN) [73345.479218]    ffff82d08024f140 ffff831fd39efc28 ffff82d0802ae75e 0000000000000000
      (XEN) [73345.479219]    0000000000000000 0000000000000000 ffff82d0802af837 ffff82d000000000
      (XEN) [73345.479220]    00000000d3193210 0000000000000000 0000000000000087 0000000000000000
      (XEN) [73345.479221]    00000000cce77063 0000000000000000 0000000000000096 000000000000000d
      (XEN) [73345.479221]    ffff831fd39f3000 0000000000000000 0000000000000000 0000000000000000
      (XEN) [73345.479222]    ffff831fd39effff 0000000000000000 ffff82d080370a17 ffff831206417da0
      (XEN) [73345.479223]    ffff831206417e90 ffff82d08059704c ffff82d0805bc300 ffff831fd37298f8
      (XEN) [73345.479223]    ffff82d080598088 ffff82d080597e28 000042b43831e5c9 ffff83121d04c000
      (XEN) [73345.479224]    000042b403efae15 ffff831fd31931f0 ffff831fd31931f0 ffff831fd31931f0
      (XEN) [73345.479225]    0000000006417e90 ffff831fd31931f0 0000000e00000000 ffff82d08022cb69
      (XEN) [73345.479225]    000000000000e008 0000000000010046 ffff831fd39efcd0 0000000000000000
      (XEN) [73345.479226]    0000000000000000 ffff831206417da0 0000000000000010 ffff82d080598088
      (XEN) [73345.479227]    0000000001c9c380 000042b432d43ad5 ffff831fd31931f0 ffff83131bd83db8
      (XEN) [73345.479228]    0000000100000000 ffff82d080270a80 ffff82d080288d1c ffff831206402068
      (XEN) [73345.479228]    ffff831fd37299a0 ffff831206402000 0000000000000000 ffff831fd37298e0
      (XEN) [73345.479229]    ffff83121d04c000 0000000000000000 0000000080000000 0000000000000000
      (XEN) [73345.479230]    0000000000000000 ffff831fd39f3000 00000000003506e0 0000000000000046
      (XEN) [73345.479230]    0000000000000000 ffff82d08027f216 000000000000000d ffff831206402000
      (XEN) [73345.479231]    ffff831fd37298f8 ffff83121d04cb84 ffff82d08059704c ffff82d080598088
      (XEN) [73345.479232]    ffff831206417da0 ffff82d08023af68 0000000000000202 ffff83121d04c000
      (XEN) [73345.479232] Xen call trace:
      (XEN) [73345.479233]    [<ffff82d08023faf5>] R _spin_lock_recursive+0x35/0x70
      (XEN) [73345.479234]    [<ffff82d08024f140>] S console_lock_recursive_irqsave+0x10/0x20
      (XEN) [73345.479236]    [<ffff82d0802ae75e>] S show_execution_state+0xe/0x40
      (XEN) [73345.479237]    [<ffff82d0802af837>] S do_page_fault+0x657/0x760
      (XEN) [73345.479239]    [<ffff82d080370a17>] S x86_64/entry.S#handle_exception_saved+0x68/0x94
      (XEN) [73345.479240]    [<ffff82d08022cb69>] S common/sched_credit.c#csched_unit_wake+0xf9/0x6e0
      (XEN) [73345.479242]    [<ffff82d080270a80>] S apic_timer_interrupt+0/0x30
      (XEN) [73345.479243]    [<ffff82d080288d1c>] S do_IRQ+0x29c/0x720
      (XEN) [73345.479244]    [<ffff82d08027f216>] S arch/x86/i387.c#_vcpu_save_fpu+0x86/0x180
      (XEN) [73345.479245]    [<ffff82d08023af68>] S vcpu_wake+0x258/0x6c0
      (XEN) [73345.479246]    [<ffff82d08020ae96>] S send_guest_vcpu_virq+0xd6/0xe0
      (XEN) [73345.479247]    [<ffff82d08027c9f5>] S vcpu_kick+0x15/0x60
      (XEN) [73345.479248]    [<ffff82d080241c4d>] S common/tasklet.c#do_tasklet_work+0x6d/0xb0
      (XEN) [73345.479249]    [<ffff82d080241ccc>] S common/tasklet.c#tasklet_softirq_action+0x3c/0x70
      (XEN) [73345.479250]    [<ffff82d08023ef82>] S common/softirq.c#__do_softirq+0x62/0x90
      (XEN) [73345.479251]    [<ffff82d080278a35>] S arch/x86/domain.c#idle_loop+0x65/0xf0
      (XEN) [73345.479251]    [<ffff82d0802789d0>] S arch/x86/domain.c#idle_loop+0/0xf0
      (XEN) [73345.479251] 
      (XEN) [73345.479255] CPU0 @ e008:ffff82d08023f905 (_spin_lock_irqsave+0x25/0x50)
      (XEN) [73345.479256] CPU1 @ e008:ffff82d08023f8b5 (_spin_lock_irq+0x25/0x50)
      (XEN) [73345.479258] CPU2 @ e008:ffff82d0802a71f0 (flush_area_mask+0xf0/0x140)
      (XEN) [73345.479260] CPU3 @ e008:ffff82d08023f865 (_spin_lock+0x25/0x50)
      (XEN) [73345.479261] CPU12 @ e008:ffff82d0802e7ca2 (mcheck_cmn_handler+0x302/0x470)
      (XEN) [73345.479773] CPU11 @ e008:ffff82d0802d9c68 (arch/x86/acpi/cpu_idle.c#acpi_idle_do_entry+0x98/0xc0)
      (XEN) [73345.479775] CPU10 @ e008:ffff82d0802d9c68 (arch/x86/acpi/cpu_idle.c#acpi_idle_do_entry+0x98/0xc0)
      (XEN) [73345.479781] CPU26 @ e008:ffff82d0802d9c68 (arch/x86/acpi/cpu_idle.c#acpi_idle_do_entry+0x98/0xc0)
      (XEN) [73345.479785] CPU27 @ e008:ffff82d0802d9c68 (arch/x86/acpi/cpu_idle.c#acpi_idle_do_entry+0x98/0xc0)
      (XEN) [73345.479788] CPU9 @ e008:ffff82d0802d9c68 (arch/x86/acpi/cpu_idle.c#acpi_idle_do_entry+0x98/0xc0)
      (XEN) [73345.479791] CPU8 @ e008:ffff82d0802d9c68 (arch/x86/acpi/cpu_idle.c#acpi_idle_do_entry+0x98/0xc0)
      (XEN) [73345.479795] CPU24 @ e008:ffff82d0802d9c68 (arch/x86/acpi/cpu_idle.c#acpi_idle_do_entry+0x98/0xc0)
      (XEN) [73345.479798] CPU25 @ e008:ffff82d0802d9c68 (arch/x86/acpi/cpu_idle.c#acpi_idle_do_entry+0x98/0xc0)
      (XEN) [73345.479801] CPU7 @ e008:ffff82d0802d9c68 (arch/x86/acpi/cpu_idle.c#acpi_idle_do_entry+0x98/0xc0)
      (XEN) [73345.479803] CPU6 @ e008:ffff82d0802d9c68 (arch/x86/acpi/cpu_idle.c#acpi_idle_do_entry+0x98/0xc0)
      (XEN) [73345.479807] CPU23 @ e008:ffff82d0802d9c68 (arch/x86/acpi/cpu_idle.c#acpi_idle_do_entry+0x98/0xc0)
      (XEN) [73345.479810] CPU22 @ e008:ffff82d0802d9c68 (arch/x86/acpi/cpu_idle.c#acpi_idle_do_entry+0x98/0xc0)
      (XEN) [73345.479814] CPU5 @ e008:ffff82d0802d9c68 (arch/x86/acpi/cpu_idle.c#acpi_idle_do_entry+0x98/0xc0)
      (XEN) [73345.479816] CPU4 @ e008:ffff82d0802d9c68 (arch/x86/acpi/cpu_idle.c#acpi_idle_do_entry+0x98/0xc0)
      (XEN) [73345.479819] CPU21 @ e008:ffff82d0802d9c68 (arch/x86/acpi/cpu_idle.c#acpi_idle_do_entry+0x98/0xc0)
      (XEN) [73345.479823] CPU20 @ e008:ffff82d0802d9c68 (arch/x86/acpi/cpu_idle.c#acpi_idle_do_entry+0x98/0xc0)
      (XEN) [73345.479826] CPU15 @ e008:ffff82d0802d9c68 (arch/x86/acpi/cpu_idle.c#acpi_idle_do_entry+0x98/0xc0)
      (XEN) [73345.479828] CPU14 @ e008:ffff82d0802d9c68 (arch/x86/acpi/cpu_idle.c#acpi_idle_do_entry+0x98/0xc0)
      (XEN) [73345.479831] CPU19 @ e008:ffff82d0802d9c68 (arch/x86/acpi/cpu_idle.c#acpi_idle_do_entry+0x98/0xc0)
      (XEN) [73345.479834] CPU18 @ e008:ffff82d0802d9c68 (arch/x86/acpi/cpu_idle.c#acpi_idle_do_entry+0x98/0xc0)
      (XEN) [73345.479838] CPU17 @ e008:ffff82d0802d9c68 (arch/x86/acpi/cpu_idle.c#acpi_idle_do_entry+0x98/0xc0)
      (XEN) [73345.479840] CPU16 @ e008:ffff82d0802d9c68 (arch/x86/acpi/cpu_idle.c#acpi_idle_do_entry+0x98/0xc0)
      (XEN) [73345.479844] CPU31 @ e008:ffff82d0802d9c68 (arch/x86/acpi/cpu_idle.c#acpi_idle_do_entry+0x98/0xc0)
      (XEN) [73345.479846] CPU30 @ e008:ffff82d0802d9c68 (arch/x86/acpi/cpu_idle.c#acpi_idle_do_entry+0x98/0xc0)
      (XEN) [73345.479850] CPU29 @ e008:ffff82d0802d9c68 (arch/x86/acpi/cpu_idle.c#acpi_idle_do_entry+0x98/0xc0)
      (XEN) [73345.479852] CPU28 @ e008:ffff82d0802d9c68 (arch/x86/acpi/cpu_idle.c#acpi_idle_do_entry+0x98/0xc0)
      (XEN) [73345.480261] 
      (XEN) [73345.480261] ****************************************
      (XEN) [73345.480262] Panic on CPU 13:
      (XEN) [73345.480262] FATAL TRAP: vector = 2 (nmi)
      (XEN) [73345.480263] [error_code=0000] , IN INTERRUPT CONTEXT
      (XEN) [73345.480263] ****************************************
      (XEN) [73345.480263] 
      (XEN) [73345.480263] Reboot in five seconds...
      (XEN) [73345.480264] Executing kexec image on cpu13
      (XEN) [73345.481278] Shot down all CPUs
      
      (XEN) [  459.021596] r15: ffff82d0805bc300   cr0: 000000008005003b   cr4: 00000000003506e0
      (XEN) [  459.021596] cr3: 00000012064cd000   cr2: 0000000800418c60
      (XEN) [  459.021596] fsb: 0000000801006120   gsb: ffffffff82610000   gss: 0000000000000000
      (XEN) [  459.021597] ds: 0000   es: 0000   fs: 0013   gs: 001b   ss: 0000   cs: e008
      (XEN) [  459.021598] Xen code around <ffff82d08027bdd2> (context_switch+0x862/0xdd0):
      (XEN) [  459.021599]  d2 e9 1d fa ff ff f3 48 <0f> ae d8 eb ce 0f 01 f8 f3 48 0f ae d8 0f 01 f8
      (XEN) [  459.021601] Xen stack trace from rsp=ffff831fd39c7da8:
      (XEN) [  459.021601]    0000000000000000 ffff83120640d060 ffff82d080597160 ffff831fd39d2000
      (XEN) [  459.021602]    ffff83120640d000 ffff831fd39e7ec0 0000006adfd1840c ffff83120640d000
      (XEN) [  459.021602]    ffff82d0805bc300 ffff82d08023d243 ffff8312064e7f00 000000000000000f
      (XEN) [  459.021603]    000000000000000f ffff8312064e7f00 ffff831fd39e7ec0 ffff82d08023da94
      (XEN) [  459.021604]    ffff831f00000001 ffff831fd39e7e00 0000000000000200 ffff831fd39e7e18
      (XEN) [  459.021604]    ffff831206409000 ffff82d0803087c5 0000000000000078 ffff82d080308873
      (XEN) [  459.021605]    ffff83120640d000 ffff83120640d000 ffff83122de4a000 ffff82d080300440
      (XEN) [  459.021606]    ffff831206409000 00000000ffffffff ffffffffffffffff ffff831fd39c7fff
      (XEN) [  459.021607]    ffff82d08059db00 0000000000000000 0000000000000000 ffff82d08023ef82
      (XEN) [  459.021607]    ffff83120640d000 ffff83120640d000 0000000000000000 0000000000000000
      (XEN) [  459.021608]    0000000000000000 ffff82d0803144fa 0000000000000002 0000013f46a2a9c6
      (XEN) [  459.021608]    0000000000000001 fffff8000389ac28 fffffe0007388dd0 fffff8000389ac00
      (XEN) [  459.021609]    0000000000000001 00000000ffffffff fffff80003970258 0000006ac6495968
      (XEN) [  459.021610]    0000013f46a2a9c6 0000000000000000 0000013f00000000 0000000000000000
      (XEN) [  459.021610]    0000000000000000 0000000000000000 ffffffff8110d476 0000000000000000
      (XEN) [  459.021611]    0000000000000246 fffffe0007388dd0 0000000000000000 0000000000000000
      (XEN) [  459.021612]    0000000000000000 0000000000000000 0000000000000000 0000e0100000000f
      (XEN) [  459.021612]    ffff83120640d000 0000004f53436000 00000000003506e0 0000000000000000
      (XEN) [  459.021613]    0000000000000000 0000000000000000 0000000000000000
      (XEN) [  459.021613] Xen call trace:
      (XEN) [  459.021614]    [<ffff82d08027bdd2>] R context_switch+0x862/0xdd0
      (XEN) [  459.021616]    [<ffff82d08023d243>] S common/schedule.c#sched_context_switch+0x53/0x170
      (XEN) [  459.021617]    [<ffff82d08023da94>] S common/schedule.c#schedule+0x1e4/0x240
      (XEN) [  459.021618]    [<ffff82d0803087c5>] S vlapic_accept_pic_intr+0x45/0x80
      (XEN) [  459.021619]    [<ffff82d080308873>] S vlapic_has_pending_irq+0x53/0x120
      (XEN) [  459.021621]    [<ffff82d080300440>] S hvm_vcpu_has_pending_irq+0x60/0xb0
      (XEN) [  459.021622]    [<ffff82d08023ef82>] S common/softirq.c#__do_softirq+0x62/0x90
      (XEN) [  459.021623]    [<ffff82d0803144fa>] S svm_stgi_label+0x13/0x18
      (XEN) [  459.021623] 
      (XEN) [  459.021624] 
      (XEN) [  459.021625] ****************************************
      (XEN) [  459.021625] Panic on CPU 15:
      (XEN) [  459.021625] FATAL TRAP: vector = 6 (invalid opcode)
      (XEN) [  459.021626] ****************************************
      (XEN) [  459.021626] 
      (XEN) [  459.021626] Reboot in five seconds...
      (XEN) [  459.021627] Executing kexec image on cpu15
      (XEN) [  459.022640] Shot down all CPUs
      
      • olivierlambert Vates 🪐 Co-Founder CEO

        It seems to be related to your CPU having issues.

        Can you give more context for those two logs? Are they two different crashes on the same machine?

        How often does this occur?

        • fdrcrtl @olivierlambert

          Thank you olivierlambert, it's the same host at different times. When this happened, I was setting up and playing with an Oracle Linux 8 VM via the console (the other VM is OPNsense).

          Grepping the logs, I can see:

          ./20220619-112257-UTC/xen.log:(XEN) [73345.480262] FATAL TRAP: vector = 2 (nmi)
          ./20220619-112257-UTC/xen.log:(XEN) [73345.480262] Panic on CPU 13:
          
          ./20220621-145540-UTC/xen.log:(XEN) [  459.021625] FATAL TRAP: vector = 6 (invalid opcode)
          ./20220621-145540-UTC/xen.log:(XEN) [  459.021625] Panic on CPU 15:
          
          ./20220617-133059-UTC/xen.log:(XEN) [ 6286.633327] FATAL TRAP: vector = 2 (nmi)
          ./20220617-133059-UTC/xen.log:(XEN) [ 6286.633327] Panic on CPU 8:
          
          ./20220621-144636-UTC/xen.log:(XEN) [66874.346274] FATAL TRAP: vector = 6 (invalid opcode)
          ./20220621-144636-UTC/xen.log:(XEN) [66874.346273] Panic on CPU 15:
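
A short script can produce the same summary as the grep above. This is a minimal sketch, assuming each crash sits in its own timestamped directory with a `xen.log` inside (as in the paths above). The names `parse_xen_log` and `summarize_panics` and the `/var/crash` default are illustrative, not part of any XCP-ng tooling.

```python
import re
from collections import Counter
from pathlib import Path

# Patterns for the two lines grepped above.
PANIC_RE = re.compile(r"Panic on CPU (\d+):")
TRAP_RE = re.compile(r"FATAL TRAP: vector = (\d+) \(([^)]+)\)")


def parse_xen_log(text):
    """Return (cpu, vector, trap_name) for the first panic in a xen.log, or None."""
    cpu = PANIC_RE.search(text)
    trap = TRAP_RE.search(text)
    if not cpu or not trap:
        return None
    return int(cpu.group(1)), int(trap.group(1)), trap.group(2)


def summarize_panics(crash_root="/var/crash"):
    """Map each crash directory name to its parsed panic; logs without one are skipped."""
    results = {}
    for log in sorted(Path(crash_root).glob("*/xen.log")):
        parsed = parse_xen_log(log.read_text(errors="replace"))
        if parsed:
            results[log.parent.name] = parsed
    return results


if __name__ == "__main__":
    panics = summarize_panics()
    for name, (cpu, vector, trap) in panics.items():
        print(f"{name}: CPU {cpu}, vector {vector} ({trap})")
    # Count how often each trap type occurs across all dumps.
    print(Counter(trap for _, _, trap in panics.values()))
```

Counting the trap types makes it easier to see whether the NMI and invalid-opcode panics alternate or cluster on particular CPUs.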
          

          I don't know if it matters, but I'm wondering about the scaling governor. For now I've edited the GRUB config to remove cpufreq=xen:performance and set max_cstate=0 and iommu=0.

          • olivierlambert Vates 🪐 Co-Founder CEO

            It would be interesting to get debug messages to know exactly what's going on.

            Can you reproduce it, or does it seem completely random?

            • fdrcrtl

              From what I've seen, it's random and happens while I'm working on the host.

              For debug, ask me anything 🙂

              • olivierlambert Vates 🪐 Co-Founder CEO

                andSmv can you help fdrcrtl to get more debug info? Thanks 🙂

                • andSmv Vates 🪐 XCP-ng Team Xen Guru

                  Hello, both issues seem to be related to memory corruption.

                  • The first trace is an #NMI exception (one possible cause is a parity error detected by the HW). Moreover, CPU#12 gets the #MC (machine check) exception. The #MC is triggered by the HW to notify the system software that there's an unrecoverable issue with the HW.
                  • The second one is the invalid opcode in the Xen Hypervisor context. So it means that either the instruction flow is corrupted, or the instruction pointer is corrupted.
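
To check other dumps for the same pattern (a CPU sitting in the machine-check handler when the watchdog fires), the per-CPU state lines can be grouped by symbol. This is a rough sketch, assuming the `CPUn @ e008:addr (symbol+off/len)` line format shown in the first log; the helper names are made up for illustration.

```python
import re

# Matches per-CPU state lines such as:
# (XEN) [73345.479261] CPU12 @ e008:ffff82d0802e7ca2 (mcheck_cmn_handler+0x302/0x470)
CPU_STATE_RE = re.compile(
    r"CPU(\d+) @ [0-9a-f]+:[0-9a-f]+ \((.+?)\+0x[0-9a-f]+/0x[0-9a-f]+\)"
)


def cpus_by_symbol(log_text):
    """Group CPU numbers by the symbol each CPU was executing in the dump."""
    groups = {}
    for cpu, symbol in CPU_STATE_RE.findall(log_text):
        groups.setdefault(symbol, []).append(int(cpu))
    return groups


def cpus_in_machine_check(log_text):
    """Return CPUs whose function name (after any 'file.c#' prefix) starts with 'mcheck'."""
    return sorted(
        cpu
        for symbol, cpus in cpus_by_symbol(log_text).items()
        if symbol.split("#")[-1].startswith("mcheck")
        for cpu in cpus
    )


if __name__ == "__main__":
    # Three lines copied from the first crash log in this thread.
    sample = """\
(XEN) [73345.479260] CPU3 @ e008:ffff82d08023f865 (_spin_lock+0x25/0x50)
(XEN) [73345.479261] CPU12 @ e008:ffff82d0802e7ca2 (mcheck_cmn_handler+0x302/0x470)
(XEN) [73345.479773] CPU11 @ e008:ffff82d0802d9c68 (arch/x86/acpi/cpu_idle.c#acpi_idle_do_entry+0x98/0xc0)
"""
    print(cpus_in_machine_check(sample))
```

Matching on the `mcheck` prefix is a heuristic based on the single handler name visible in this log, so other machine-check entry points may need to be added.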

                  My hypothesis is:

                  In the first case - the ECC memory error is detected (and reported by HW) which makes the hypervisor panic and stop

                  In the second case - the memory error is not detected (but the memory is still corrupted) but at some point, this corruption provokes the same result on the Xen hypervisor.

                  Can you check with Hetzner whether the memory modules can be replaced?

                  The other way to validate this hypothesis is to install a different system software (another OS/hypervisor, another version of hypervisor) and see if you experience the same issue.

                  You can also add the "ler=true" option to the Xen command line. This can give us more traces (provided by the HW) to check whether anything is abnormal at the software level. I will probably need your Xen image with its symbol table (xen-syms-XXX and xen-syms-XXX.map).

                  • fdrcrtl @andSmv

                    andSmv I'm blown away by your professionalism, thank you!
                    Another crash today. I've reverted GRUB to the basic options dom0_max_vcpus=4 dom0_vcpus_pin max_cstate=0 and added ler=true (hoping for another crash within 1-2 days).

                    I'll schedule a deep check/memtest with Hetzner this weekend to see if they can address this issue, I'll keep you updated!

                    PS. Are cpufreq=xen:performance max_cstate=1 iommu=0 a good combination for better performance/stability (no HW passthrough)?

                    • andSmv Vates 🪐 XCP-ng Team Xen Guru

                      Thank you 🙂 👍 I hope we will quickly pinpoint the issue and find the solution for it.

                      For your command line - I think it's a good choice if you are looking for performance and have no use for PCI passthrough. Normally the IOMMU is not involved if you do not have passed-through devices, but we have already experienced issues on some platforms where the IOMMU itself exhibits unstable behavior. So yes - it is better to disable it if you have no use for it.

                      • fdrcrtl @andSmv

                        andSmv three crashes in a row just now!
                        grub.cfg: multiboot2 /boot/xen.gz dom0_mem=4096M,max:4096M watchdog ucode=scan dom0_max_vcpus=4 dom0_vcpus_pin ler=true cpufreq=xen:performance max_cstate=1 iommu=0 crashkernel=256M,below=4G console=vga vga=mode-0x0311

                        xen.log:(XEN) [ 711.242947] Panic on CPU 14:
                        xen.log:(XEN) [ 711.242948] FATAL TRAP: vector = 6 (invalid opcode)

                        xen.log:(XEN) [ 854.061272] Panic on CPU 8:
                        xen.log:(XEN) [ 854.061273] FATAL TRAP: vector = 2 (nmi)

                        xen.log:(XEN) [ 556.104951] Panic on CPU 14:
                        xen.log:(XEN) [ 556.104951] FATAL TRAP: vector = 6 (invalid opcode)

                        I've dumped the crash folder, kdump and .map files (where I could find them). What do you need, and where should I send it? I'm powering off the host now for an extended memtest by Hetzner.

                        • fdrcrtl

                          Update before Hetzner starts the HW test. They said: "Please note that this server is 5000 series ryzen and it needs at least Linux kernel version 5.1 to run smoothly as it gets proper support from kernel version 5.12 and above. We have seen many problem from customers running kernel version below 5.1"

                          Deep into the rabbit hole: https://bugzilla.kernel.org/show_bug.cgi?id=212087 - as XCP-ng is running on 4.19 😓 ..

                          • olivierlambert Vates 🪐 Co-Founder CEO

                            It's not Linux that is really "running" on the CPU but Xen (since your dom0 is a VM, not the "host").

                            So the idea is to try to find what's causing issues on Xen with this CPU.

                            • fdrcrtl

                              Thanks for the clarification olivierlambert, just seen in the docs: Citrix Hypervisor 8.2, Base version of CentOS in dom0: 7.5, Xen 4.13.1 + patches, Kernel 4.19 + patches

                              I just want to give more info to the support team! Anyway, from Hetzner's perspective it's a negative point. Just for info, is the AMD microcode installed by default? The server is now under testing; I hope they find something HW related.

                              Update
                              Unfortunately the test completed without any errors 😞

                              Your server finished the hardware check test without any hardware related issues. We boot the server back to the installed System. As we recommended try to use kernel version at least 5.1.

                              Summary of the test:

                              -----------------%<-----------------
                              DMESG: Ok
                              CPUFREQ-CHECK: Ok
                              STRESSTEST-CPU-TEMP: Ok
                              FANCHECK: Ok
                              STRESSTEST: Ok
                              MCE-CHECK: Ok

                              HDDTEST S64HNE0T******: Ok
                              HDDTEST S64HNE0T******: Ok

                              -----------------%<-----------------

                              • andSmv Vates 🪐 XCP-ng Team Xen Guru

                                Hmm, in the bugzilla thread the guys talk about adjusting SoC voltage and updating the BIOS. It still seems to me to be a HW problem... I will look through the whole thread and I will do some research about possible workarounds in newer Linux kernels for 5000 series ryzen.

                                • fdrcrtl

                                  Right andSmv, here's what I've found so far:

                                  • Due to wrong voltage reporting in kernels < 5.12, the offset voltage had to be higher
                                  • Implementing ZenStates may help: https://forum.level1techs.com/t/overclock-your-ryzen-cpu-from-linux/126025
                                  • Some success reported on the AMD forum: https://community.amd.com/t5/processors/ryzen-5900x-system-constantly-crashing-restarting-whea-logger-id/td-p/423321/page/84
                                  • Some kernel patches needed for the Ryzen 5000 series: https://unix.stackexchange.com/questions/628222/what-changes-had-to-be-made-to-linux-kernel-in-order-to-support-ryzen-5000-serie

                                  I don't know if it helps, but I've added max_cstate=5 and cpufreq=xen:powersave to limit CPU usage and reduce power requirements. Will those settings be system-wide or apply only to Xen?

                                  • andSmv Vates 🪐 XCP-ng Team Xen Guru

                                    To be honest, I would use cpufreq=none and max_cstate=0. This should disable all CPU P-state and C-state management by Xen. That way, if there's a bug in the firmware ACPI tables (or maybe in the way Xen handles them), it would be possible to pinpoint it.
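
For anyone applying this by hand, here is a small sketch of the edit applied to the `multiboot2` line from the first post. It only transforms a string and does not touch any file on the host; `set_xen_options` is a hypothetical helper, and adding `ler=true` alongside is just an example taken from earlier in the thread.

```python
def set_xen_options(cmdline, **options):
    """Return cmdline with each key=value option replaced in place, or appended if absent."""
    remaining = dict(options)
    result = []
    for token in cmdline.split():
        key = token.split("=", 1)[0]
        if key in remaining:
            # Replace only the first occurrence of the key; later duplicates stay untouched.
            result.append(f"{key}={remaining.pop(key)}")
        else:
            result.append(token)
    # Options that were not already present go at the end of the line.
    result.extend(f"{key}={value}" for key, value in remaining.items())
    return " ".join(result)


if __name__ == "__main__":
    original = (
        "multiboot2 /boot/xen.gz dom0_mem=4096M,max:4096M watchdog ucode=scan "
        "dom0_max_vcpus=4 dom0_vcpus_pin cpufreq=xen:performance max_cstate=1 "
        "iommu=0 crashkernel=256M,below=4G console=vga vga=mode-0x0311"
    )
    print(set_xen_options(original, cpufreq="none", max_cstate=0, ler="true"))
```

Replacing options in place rather than appending duplicates avoids having to know which of two conflicting values Xen would honor.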

                                    • andSmv Vates 🪐 XCP-ng Team Xen Guru

                                      Thank you for all these links! I will look through them (need some time though)

                                      • ron-g

                                        FWIW, I was having similar kernel panics on my HP DL380G8 today. Two Xeon E5-2620 2 GHz, microcode version 0x71a. It's happened before, but only on a reboot.

                                        Today, the kernel panics weren't consistent as to which CPU it was. I saw it get as high as CPU 22 and as low as CPU 3.

                                        I was viewing POST via iLO remote console. After about an hour of allowing it to reboot on its own or with my manually resetting it via iLO GUI, I went to my data center and turned on the monitor and switched to the KVM channel the server was on. It came back up then. HTH.

                                        • fdrcrtl

                                          Good morning, any update on this?
                                          Meanwhile 60+ days stable with max_cstate=5 cpufreq=xen:powersave

                                          • fdrcrtl @fdrcrtl

                                            fdrcrtl andSmv olivierlambert
                                            Bump, ty
