XCP-ng


    XcpNG - Xen kernel crash (FATAL TRAP: vector = 2 (nmi))

    • olivierlambert
      olivierlambert Vates 🪐 Co-Founder🦸 CEO 🧑‍💼 last edited by olivierlambert

      Yes, this message comes from your motherboard, from the PCI subsystem. I wouldn't be very confident in this hardware, but if you have backups or if it's not in production, whatever 😛

      • P
        petr.bena last edited by

        It is running one of our production CEPH nodes, but if it crashes, CEPH will fail over transparently. The VMs running there are just for backup and non-prod stuff. If I knew which hardware was causing it, I would replace it, but this message isn't very clear about what is really going on.

        Other than that everything is running OK, so far no crash...

        • F
          fbifido @petr.bena last edited by

          @petr-bena You have CEPH running on XCP-ng 8.0?
          How many servers are you using with CEPH?
          How did you set up CEPH on XCP-ng 8.0?

          • P
            petr.bena @fbifido last edited by

            @fbifido Yes, I have 3 CEPH nodes running in separate VMs that have direct passthrough to the underlying physical disks. The CEPH volume is connected as an RBD that forms a shared block device on the XCP-ng servers. On that shared block device I use LVM.

            It's all described here: https://github.com/xcp-ng/xcp/wiki/Ceph-on-XCP-ng-7.5-or-later#lvm-on-rbd

            • dave
              dave last edited by dave

              Hi!

              @petr-bena Have you had any crashes since you changed to nmi=dom0?

              We have a similar problem.

              There are 4 servers in different locations, two standalone, two of them in pools, all with the same hardware:

              Supermicro X11SRA-RF Version: 1.02
              and
              Intel(R) Xeon(R) W-2145 CPU

              We tried all BIOS versions and a lot of different settings.

              Two of them are running XCP-ng 7.6 and have uptimes of 143 and 160 days. No problems at all.

              Two of them are running XCP-ng 8.0 and crash regularly, after anywhere between 2 and 30 days, every time with the same error:

              NMI - PCI system error (SERR)

              The crash is more likely to happen if we produce high IO and/or network load on those hosts.

              We suspected a hardware error, so we took one of the crashing servers to our workshop and tested it for almost two weeks with Prime95, Memtest86 and other things that came to mind.

              We were not able to produce any crash. Neither were we able to detect any errors.

              We put this particular server back in production and it crashed within the first hours, while we were migrating some VMs back to it (with Storage Migration).

              So I think it has something to do with XCP-ng 8.0.

              I will try the change nmi=dom0 next.

              (XEN) [395218.940883] 
              (XEN) [395218.940886] 
              (XEN) [395218.940886] NMI - PCI system error (SERR)
              (XEN) [395218.940889] ----[ Xen-4.11.1-7.8.xcpng8.0  x86_64  debug=n   Not tainted ]----
              (XEN) [395218.940889] CPU:    0
              (XEN) [395218.940890] RIP:    e008:[<ffff82d0802c6d38>] mwait_idle_with_hints+0xf8/0x160
              (XEN) [395218.940894] RFLAGS: 0000000000000046   CONTEXT: hypervisor
              (XEN) [395218.940896] rax: 0000000000000001   rbx: 000167730fc96b09   rcx: 0000000000000001
              (XEN) [395218.940897] rdx: 0000000000000000   rsi: ffff83006f667ef8   rdi: ffff83006f667fff
              (XEN) [395218.940898] rbp: 0000000000000000   rsp: ffff83006f667e00   r8:  0000000000000048
              (XEN) [395218.940899] r9:  000530dec0dbea3e   r10: 0000000000000008   r11: ffff83207cac1a68
              (XEN) [395218.940900] r12: 0000000000000000   r13: 0000000000000001   r14: 0000000000000001
              (XEN) [395218.940902] r15: ffff82d080573d00   cr0: 0000000080050033   cr4: 0000000000362660
              (XEN) [395218.940903] cr3: 00000012c899a000   cr2: ffffe783981eb000
              (XEN) [395218.940904] fsb: 0000000000000000   gsb: ffff88827bf40000   gss: 0000000000000000
              (XEN) [395218.940906] ds: 002b   es: 002b   fs: 0000   gs: 0000   ss: e010   cs: e008
              (XEN) [395218.940908] Xen code around <ffff82d0802c6d38> (mwait_idle_with_hints+0xf8/0x160):
              (XEN) [395218.940908]  89 f0 44 89 e9 0f 01 c9 <0f> b6 47 f5 80 a6 fd 00 00 00 fe 44 89 c1 0f 30
              (XEN) [395218.940912] Xen stack trace from rsp=ffff83006f667e00:
              (XEN) [395218.940913]    ffff83207cac4f08 0000000000000000 ffff83207cac4e90 ffff82d080573d00
              (XEN) [395218.940915]    ffff82d0805baa50 ffff82d080592b20 ffff83207cac4f08 ffff82d0802ccd07
              (XEN) [395218.940916]    000167730faf9015 0000000100000002 00000108000004c9 0000000000000000
              (XEN) [395218.940918]    0000000000000000 ffff82d08035b43e ffff83006f7fc000 ffffffffffffffff
              (XEN) [395218.940919]    ffff82d08035b400 ffff82d080573d00 ffff82d0805baa50 0000000000000000
              (XEN) [395218.940921]    0000000000000000 ffff82d080592b20 ffff83006f667fff ffff82d08026e505
              (XEN) [395218.940922]    ffff83006f7fc000 ffff83006f7fc000 ffff83006f7bf000 ffff83207cb69000
              (XEN) [395218.940924]    00000000ffffffff ffff8320246cc000 ffff82d080592b20 ffff88827ae3d700
              (XEN) [395218.940926]    ffff88827ae3d700 0000000000000000 0000000000000000 0000000000000005
              (XEN) [395218.940927]    ffff88827ae3d700 0000000000000246 ffffc9004106b930 0000000000000000
              (XEN) [395218.940928]    000000000001ca00 0000000000000000 ffffffff810013aa ffffffff8203c190
              (XEN) [395218.940930]    0000000000000000 0000000000000001 0000010000000000 ffffffff810013aa
              (XEN) [395218.940931]    000000000000e033 0000000000000246 ffffc90040113eb0 000000000000e02b
              (XEN) [395218.940933]    6f5b7c2b6f667fe0 6f5b7cae00097f76 6f5b7da200000000 6f5b79516f667fe0
              (XEN) [395218.940934]    0000e01000000000 ffff83006f7fc000 0000000000000000 0000000000362660
              (XEN) [395218.940936]    0000000000000000 800000207caef002 0000070100000000 6f5b883e00097f00
              (XEN) [395218.940938] Xen call trace:
              (XEN) [395218.940939]    [<ffff82d0802c6d38>] mwait_idle_with_hints+0xf8/0x160
              (XEN) [395218.940942]    [<ffff82d0802ccd07>] mwait-idle.c#mwait_idle+0x337/0x3d0
              (XEN) [395218.940945]    [<ffff82d08035b43e>] lstar_enter+0xae/0x120
              (XEN) [395218.940946]    [<ffff82d08035b400>] lstar_enter+0x70/0x120
              (XEN) [395218.940950]    [<ffff82d08026e505>] domain.c#idle_loop+0x85/0xb0
              (XEN) [395218.940951] 
              (XEN) [395218.940952] 
              (XEN) [395218.940953] ****************************************
              (XEN) [395218.940953] Panic on CPU 0:
              (XEN) [395218.940954] FATAL TRAP: vector = 2 (nmi)
              (XEN) [395218.940955] [error_code=0000] , IN INTERRUPT CONTEXT
              (XEN) [395218.940955] ****************************************
              (XEN) [395218.940956] 
              (XEN) [395218.940956] Reboot in five seconds...
              (XEN) [395218.940958] Executing kexec image on cpu0
              (XEN) [395218.941963] Shot down all CPUs
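
When comparing crashes across several hosts, it can help to pull the few decisive fields out of a console dump like the one above. The following is a minimal sketch, assuming the `(XEN) [timestamp] message` line layout seen in this log; the function name `summarize_xen_panic` and the dictionary keys are illustrative and not part of any Xen or XCP-ng tool.

```python
import re

# A few lines copied from the crash log above.
SAMPLE = """\
(XEN) [395218.940886] NMI - PCI system error (SERR)
(XEN) [395218.940889] ----[ Xen-4.11.1-7.8.xcpng8.0  x86_64  debug=n   Not tainted ]----
(XEN) [395218.940953] Panic on CPU 0:
(XEN) [395218.940954] FATAL TRAP: vector = 2 (nmi)
"""

# Assumed line layout: "(XEN) [seconds.micros] message".
LINE_RE = re.compile(r"^\(XEN\) \[\s*(?P<ts>\d+\.\d+)\]\s?(?P<msg>.*)$")

# Hypothetical field names mapped to patterns matched against the message part.
FIELD_PATTERNS = {
    "nmi_reason": re.compile(r"^NMI - (.+)$"),
    "xen_version": re.compile(r"^----\[ Xen-(\S+)"),
    "panic_cpu": re.compile(r"^Panic on CPU (\d+):"),
    "trap_vector": re.compile(r"^FATAL TRAP: vector = (\d+)"),
}


def summarize_xen_panic(text):
    """Collect the NMI reason, Xen version, panic CPU and trap vector if present."""
    summary = {}
    for raw in text.splitlines():
        match = LINE_RE.match(raw.strip())
        if match is None:
            continue  # not a Xen console line
        msg = match.group("msg").strip()
        for key, pattern in FIELD_PATTERNS.items():
            found = pattern.search(msg)
            if found:
                summary[key] = found.group(1)
    return summary


if __name__ == "__main__":
    for key, value in summarize_xen_panic(SAMPLE).items():
        print(f"{key}: {value}")
```

All values are returned as strings; lines that do not match the assumed layout are skipped rather than treated as errors.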
              
              • P
                petr.bena last edited by

                Hello, no, since I changed this, the server has been rock solid:

                20:59:01 up 136 days, 22:40, 1 user, load average: 0.45, 0.31, 0.36

                • olivierlambert
                  olivierlambert Vates 🪐 Co-Founder🦸 CEO 🧑‍💼 last edited by

                  @dave you should try with 8.1 beta

                  • dave
                    dave last edited by

                    @petr-bena Thanks.

                    I can confirm: so far everything is stable for us, too (with nmi=dom0).

                    @olivierlambert Since I only have production servers with the affected hardware at the moment, I can't test the 8.1 beta right now. But after the release I will try 8.1 final. Do you think there is a real chance that this error won't appear in stock 8.1? Or should I make the same change?

                    • olivierlambert
                      olivierlambert Vates 🪐 Co-Founder🦸 CEO 🧑‍💼 last edited by

                      8.1 is bundled with the latest and greatest Xen, 4.13. So yeah, it might change (e.g. if it's a bug fixed in a more recent Xen version). Otherwise, keep the nmi configuration as it is 🙂
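
The thread does not show how the `nmi=dom0` option was applied on these hosts. In the Xen command-line documentation, `nmi` accepts `ignore`, `dom0` or `fatal`. As a rough illustration only, the sketch below shows the string edit involved in setting such an option on a Xen command line. The helper `set_xen_option` and the example command line are hypothetical; the code does not touch any boot configuration, and the actual procedure for a given XCP-ng release should be taken from its documentation.

```python
def set_xen_option(cmdline, key, value):
    """Return cmdline with key=value set, replacing any existing token for key."""
    new_token = f"{key}={value}"
    result = []
    replaced = False
    for token in cmdline.split():
        if token == key or token.startswith(key + "="):
            # Keep only one occurrence of the option, with the new value.
            if not replaced:
                result.append(new_token)
                replaced = True
        else:
            result.append(token)
    if not replaced:
        result.append(new_token)
    return " ".join(result)


# Hypothetical example command line, not taken from a real host.
EXAMPLE = "dom0_mem=4096M,max:4096M watchdog nmi=fatal console=vga"

if __name__ == "__main__":
    print(set_xen_option(EXAMPLE, "nmi", "dom0"))
```

Replacing an existing `nmi=` token instead of appending a second one avoids relying on how duplicate options are resolved.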

                      • M
                        mauricio_hps last edited by

                        Hi! Excuse my bad English. I've installed XenServer 7.2 for the first time in my life and it crashes with FATAL TRAP: vector = 2 (nmi).
                        How do I edit the Xen boot parameters and add nmi=dom0?
                        Thanks!

                        • olivierlambert
                          olivierlambert Vates 🪐 Co-Founder🦸 CEO 🧑‍💼 last edited by

                          Hi @mauricio_hps

                          This is an XCP-ng forum, please try with XCP-ng 😉 https://xcp-ng.org
