XCP-ng

    Host crash with general protection fault: 0000 [#1] SMP NOPTI

      NielsH

      Hello,

      We have a host that crashes once every couple of months, taking down all VMs on it, and then reboots itself after a few minutes.

      We are running XCP-ng 8.1 on a Supermicro chassis. Supermicro's own health-check tooling reports the hardware is OK. We have a few other servers with the exact same hardware that do not have this issue.

      The crash logs are very verbose, which is good, but I am not sure which information is the most relevant. In dom0.log I see the following, which looks to be the cause:

      [8463005.075512]   WARN: general protection fault: 0000 [#1] SMP NOPTI
      [8463005.075524]   WARN: CPU: 11 PID: 2467 Comm: handler74 Tainted: G           O      4.19.0+1 #1
      [8463005.075532]   WARN: Hardware name: Supermicro SYS-2029TP-HC1R/X11DPT-PS, BIOS 3.0a 01/12/2019
      [8463005.075547]   WARN: RIP: e030:__tcp_get_metrics+0x2e/0xa0
      [8463005.075553]   WARN: Code: 00 48 8b 05 ac dc a6 00 89 c9 48 8d 04 c8 48 8b 00 48 85 c0 74 77 45 31 d2 eb 0c 48 8b 00 41 83 c2 01 48 85 c0 74 5c 45 31 c0 <66> 83 78 20 02 4c 8d 48 10 41 0f 95 c0 31 c9 47 8d 44 00 02 44 8b
      [8463005.075568]   WARN: RSP: e02b:ffff8888a56c3bc0 EFLAGS: 00010246
      [8463005.075574]   WARN: RAX: ff66e90000441f0f RBX: 0000000000000001 RCX: 0000000000000000
      [8463005.075581]   WARN: RDX: ffffffff820c0640 RSI: ffff8888a56c3bf0 RDI: ffff8888a56c3bd0
      [8463005.075587]   WARN: RBP: ffff8888a56c3c38 R08: 0000000000000000 R09: ffffffffc050f6f0
      [8463005.075594]   WARN: R10: 0000000000000001 R11: 000000002207f80a R12: 00000000000000bb
      [8463005.075601]   WARN: R13: ffff8888a13da180 R14: ffff8888a56c3bd0 R15: ffff8888a56c3bf0
      [8463005.075617]   WARN: FS:  00007f99d58b5700(0000) GS:ffff8888a56c0000(0000) knlGS:0000000000000000
      [8463005.075624]   WARN: CS:  e033 DS: 0000 ES: 0000 CR0: 0000000080050033
      [8463005.075629]   WARN: CR2: 00007f83ce8060a0 CR3: 000000089d094000 CR4: 0000000000040660
      [8463005.075640]   WARN: Call Trace:
      [8463005.075645]   WARN:  <IRQ>
      [8463005.075650]   WARN:  tcp_get_metrics+0xd2/0x2c0
      [8463005.075660]   WARN:  ? rt_cpu_seq_stop+0x10/0x10
      [8463005.075666]   WARN:  tcp_init_metrics+0x44/0x190
      [8463005.075673]   WARN:  tcp_init_transfer+0x40/0x100
      [8463005.075679]   WARN:  tcp_finish_connect+0x76/0xf0
      [8463005.075684]   WARN:  tcp_rcv_state_process+0x6c3/0xde8
      [8463005.075691]   WARN:  ? sk_filter_trim_cap+0x47/0x220
      [8463005.075697]   WARN:  tcp_v4_do_rcv+0x70/0x1e0
      [8463005.075702]   WARN:  tcp_v4_rcv+0x993/0xa90
      [8463005.075710]   WARN:  ip_local_deliver_finish+0x98/0x1e0
      [8463005.075716]   WARN:  ip_local_deliver+0x6b/0xe0
      [8463005.075721]   WARN:  ? ip_rcv_core.isra.18+0x290/0x290
      [8463005.075727]   WARN:  ip_rcv+0x52/0xd0
      [8463005.075731]   WARN:  ? ip_local_deliver_finish+0x1e0/0x1e0
      [8463005.075739]   WARN:  __netif_receive_skb_one_core+0x52/0x70
      [8463005.075746]   WARN:  process_backlog+0xa3/0x150
      [8463005.075751]   WARN:  net_rx_action+0x2a5/0x3e0
      [8463005.075758]   WARN:  __do_softirq+0xd1/0x28c
      [8463005.075766]   WARN:  do_softirq_own_stack+0x2a/0x40
      [8463005.075771]   WARN:  </IRQ>
      [8463005.075778]   WARN:  do_softirq+0x4b/0x70
      [8463005.075784]   WARN:  __local_bh_enable_ip+0x57/0x60
      [8463005.075794]   WARN:  ovs_packet_cmd_execute+0x296/0x2c0 [openvswitch]
      [8463005.075803]   WARN:  genl_family_rcv_msg+0x1f7/0x3b0
      [8463005.075809]   WARN:  genl_rcv_msg+0x47/0x90
      [8463005.075814]   WARN:  ? genl_family_rcv_msg+0x3b0/0x3b0
      [8463005.075820]   WARN:  netlink_rcv_skb+0xd4/0x110
      [8463005.075825]   WARN:  genl_rcv+0x24/0x40
      [8463005.075830]   WARN:  netlink_unicast+0x182/0x230
      [8463005.075836]   WARN:  netlink_sendmsg+0x2ed/0x3e0
      [8463005.075841]   WARN:  sock_sendmsg+0x36/0x50
      [8463005.075846]   WARN:  ___sys_sendmsg+0x2b5/0x2d0
      [8463005.075855]   WARN:  ? ep_send_events_proc+0x86/0x1a0
      [8463005.075860]   WARN:  ? ep_modify+0x160/0x160
      [8463005.075866]   WARN:  ? ep_scan_ready_list.isra.13+0x1d8/0x200
      [8463005.075872]   WARN:  ? ep_poll+0x1fe/0x3c0
      [8463005.075878]   WARN:  ? _copy_to_user+0x22/0x30
      [8463005.075884]   WARN:  __sys_sendmsg+0x58/0xa0
      [8463005.075892]   WARN:  do_syscall_64+0x4e/0x100
      [8463005.075897]   WARN:  entry_SYSCALL_64_after_hwframe+0x44/0xa9
      [8463005.075904]   WARN: RIP: 0033:0x7f99da3f0d5d
      [8463005.075908]   WARN: Code: c6 20 00 00 75 10 b8 2e 00 00 00 0f 05 48 3d 01 f0 ff ff 73 31 c3 48 83 ec 08 e8 be f6 ff ff 48 89 04 24 b8 2e 00 00 00 0f 05 <48> 8b 3c 24 48 89 c2 e8 07 f7 ff ff 48 89 d0 48 83 c4 08 48 3d 01
      [8463005.075923]   WARN: RSP: 002b:00007f99d585d7f0 EFLAGS: 00000293 ORIG_RAX: 000000000000002e
      [8463005.075930]   WARN: RAX: ffffffffffffffda RBX: 00007f99d585e630 RCX: 00007f99da3f0d5d
      [8463005.075937]   WARN: RDX: 0000000000000000 RSI: 00007f99d585d850 RDI: 000000000000001b
      [8463005.075944]   WARN: RBP: 0000000000000002 R08: 0000000000000000 R09: 0000000000000001
      [8463005.075951]   WARN: R10: 00007f99bc001540 R11: 0000000000000293 R12: 0000000002574310
      [8463005.075958]   WARN: R13: 00007f99d585dcf0 R14: 0000000003aad63e R15: 00007f99d585d850
      [8463005.075965]   WARN: Modules linked in: tun nfsv3 nfs_acl nfs lockd grace fscache bnx2fc(O) cnic(O) uio fcoe libfcoe libfc scsi_transport_fc openvswitch nsh nf_nat_ipv6 nf_nat_ipv4 nf_conncount nf_nat 8021q garp mrp stp llc ipt_REJECT nf_reject_ipv4 xt_tcpudp xt_multiport dm_multipath xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c iptable_filter sr_mod cdrom sunrpc skx_edac intel_powerclamp crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc aesni_intel aes_x86_64 crypto_simd cryptd glue_helper dm_mod uas usb_storage ipmi_si lpc_ich i2c_i801 ipmi_devintf sg ipmi_msghandler acpi_power_meter ip_tables x_tables hid_generic usbhid hid sd_mod megaraid_sas(O) ahci i40e(O) libahci xhci_pci libata xhci_hcd scsi_dh_rdac scsi_dh_hp_sw scsi_dh_emc scsi_dh_alua scsi_mod ipv6 crc_ccitt
      [8463005.076066]   WARN: ---[ end trace db0040d21ba45c02 ]---
      

      Does anyone have a clue as to what could be done to resolve this, or how to further narrow down where it is coming from?

      Cheers,
      Niels
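
One way to narrow a trace like the one above down is to pull out the faulting symbol and the register dump, then check whether any register that is being dereferenced holds a valid address. The following is a rough sketch in Python, not an XCP-ng or kernel tool; the function names are made up for illustration, and it assumes 48-bit virtual addresses (4-level paging). In this trace the marked instruction bytes `<66> 83 78 20 02` decode to a 16-bit compare against memory at offset 0x20 from RAX, and RAX is `ff66e90000441f0f`, which is not a canonical address. That is consistent with a general protection fault caused by a corrupted pointer, but it does not say what corrupted it.

```python
import re

# Matches e.g. "RIP: e030:__tcp_get_metrics+0x2e/0xa0" (kernel symbol + offset).
RIP_RE = re.compile(r"RIP: [0-9a-f]{4}:(\w+)\+0x[0-9a-f]+/0x[0-9a-f]+")
# Matches register dumps such as "RAX: ff66e90000441f0f" or "R08: 0000000000000000".
REG_RE = re.compile(r"\b(R[A-Z0-9]{1,2}): ([0-9a-f]{16})\b")


def is_canonical(addr: int, va_bits: int = 48) -> bool:
    """True if bits 63..(va_bits - 1) are all equal (x86-64 canonical form)."""
    top = addr >> (va_bits - 1)
    return top == 0 or top == (1 << (64 - va_bits + 1)) - 1


def summarize_oops(lines):
    """Return (faulting symbol, registers) from the first dump in the log lines."""
    symbol = None
    regs = {}
    for line in lines:
        if symbol is None:
            match = RIP_RE.search(line)
            if match:
                symbol = match.group(1)
        for name, value in REG_RE.findall(line):
            # The kernel-side dump is printed first; keep it and ignore the
            # user-space register dump that follows the call trace.
            regs.setdefault(name, int(value, 16))
    return symbol, regs


if __name__ == "__main__":
    import sys

    # Usage (file name is an example): python3 oops_summary.py dom0.log
    with open(sys.argv[1]) as fh:
        symbol, regs = summarize_oops(fh)
    print("faulting symbol:", symbol)
    for name, value in sorted(regs.items()):
        flag = "" if is_canonical(value) else "  <-- non-canonical"
        print(f"{name} = {value:016x}{flag}")
```

Small integers in registers (counters, flags) are canonical by this test, so only the flagged entries are interesting, and only if the faulting instruction actually uses that register as an address.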

        olivierlambert Vates 🪐 Co-Founder CEO

        Hi,

        You mean XCP-ng 8.2.1, right?

        The usual tour: check that all your firmware/BIOS are up to date, but that looks like a firmware bug to me 🤔

          Andrew Top contributor @NielsH

          @NielsH The X11DPT BIOS update has about 100 bug fix/update notes. Also check the hardware SEL (System Event Log). It's a good idea to run a full memory check (it should take an hour or two; run it longer if you have the time). You can boot memtest86+ from an ISO to run a standalone memory check.

            NielsH

            Hi,

            Yes, indeed 8.2.1. Was a typo 🙂

            I've updated the BIOS to 3.5. We'll have to wait and see, since it only happens every few months and I have not found a way to reproduce it.

            Andrew: the System Event Log (in the IPMI interface) shows nothing. Memtest and the Supermicro health-check tool also show no errors (sadly).

            Since the error mentions TCP, could the cause also be a NIC or its firmware?

              olivierlambert Vates 🪐 Co-Founder CEO

              Yes, this is something I would double check: NIC issue or NIC firmware.
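
Since the other hosts with the same hardware do not crash, a cheap first check is to compare NIC driver and firmware versions between the crashing host and a healthy one. Below is a minimal sketch, assuming `ethtool` is installed in dom0 and Python 3 is available; the helper names are hypothetical. It parses the `key: value` lines printed by `ethtool -i <interface>` and reports the fields that differ. The module list in the trace shows `i40e` loaded, so those interfaces are the obvious ones to compare first.

```python
import subprocess


def parse_ethtool_info(text: str) -> dict:
    """Turn `ethtool -i <iface>` output ("key: value" per line) into a dict."""
    info = {}
    for line in text.splitlines():
        # Split on the first colon only, so values like a PCI bus address
        # that contain colons themselves stay intact.
        key, sep, value = line.partition(":")
        if sep:
            info[key.strip()] = value.strip()
    return info


def nic_info(iface: str) -> dict:
    """Run `ethtool -i` locally for one interface and parse the result."""
    result = subprocess.run(
        ["ethtool", "-i", iface],
        check=True,
        stdout=subprocess.PIPE,
        universal_newlines=True,
    )
    return parse_ethtool_info(result.stdout)


def diff_info(a: dict, b: dict) -> dict:
    """Fields whose values differ, as {field: (value_in_a, value_in_b)}."""
    keys = sorted(set(a) | set(b))
    return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}
```

To use it, save the `ethtool -i` output from each host to a file, pass both texts through `parse_ethtool_info`, and feed the results to `diff_info`. A difference in `firmware-version` or `version` between the crashing host and the healthy ones would be worth following up; identical values do not rule the NIC out.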
