Host crash with general protection fault: 0000 [#1] SMP NOPTI
-
Hello,
We have a host that crashes once every couple of months, taking down all VMs on it and rebooting itself after a few minutes.
We are running XCP-ng 8.1 on a Supermicro chassis. Supermicro's own health-check tooling reports the hardware is OK. We have a few other servers with the exact same hardware that do not have this issue.
The crash logs are very verbose, which is good, but I am not sure which information is the most relevant. In dom0.log I see the following, which looks to be the cause:
[8463005.075512] WARN: general protection fault: 0000 [#1] SMP NOPTI
[8463005.075524] WARN: CPU: 11 PID: 2467 Comm: handler74 Tainted: G O 4.19.0+1 #1
[8463005.075532] WARN: Hardware name: Supermicro SYS-2029TP-HC1R/X11DPT-PS, BIOS 3.0a 01/12/2019
[8463005.075547] WARN: RIP: e030:__tcp_get_metrics+0x2e/0xa0
[8463005.075553] WARN: Code: 00 48 8b 05 ac dc a6 00 89 c9 48 8d 04 c8 48 8b 00 48 85 c0 74 77 45 31 d2 eb 0c 48 8b 00 41 83 c2 01 48 85 c0 74 5c 45 31 c0 <66> 83 78 20 02 4c 8d 48 10 41 0f 95 c0 31 c9 47 8d 44 00 02 44 8b
[8463005.075568] WARN: RSP: e02b:ffff8888a56c3bc0 EFLAGS: 00010246
[8463005.075574] WARN: RAX: ff66e90000441f0f RBX: 0000000000000001 RCX: 0000000000000000
[8463005.075581] WARN: RDX: ffffffff820c0640 RSI: ffff8888a56c3bf0 RDI: ffff8888a56c3bd0
[8463005.075587] WARN: RBP: ffff8888a56c3c38 R08: 0000000000000000 R09: ffffffffc050f6f0
[8463005.075594] WARN: R10: 0000000000000001 R11: 000000002207f80a R12: 00000000000000bb
[8463005.075601] WARN: R13: ffff8888a13da180 R14: ffff8888a56c3bd0 R15: ffff8888a56c3bf0
[8463005.075617] WARN: FS: 00007f99d58b5700(0000) GS:ffff8888a56c0000(0000) knlGS:0000000000000000
[8463005.075624] WARN: CS: e033 DS: 0000 ES: 0000 CR0: 0000000080050033
[8463005.075629] WARN: CR2: 00007f83ce8060a0 CR3: 000000089d094000 CR4: 0000000000040660
[8463005.075640] WARN: Call Trace:
[8463005.075645] WARN: <IRQ>
[8463005.075650] WARN: tcp_get_metrics+0xd2/0x2c0
[8463005.075660] WARN: ? rt_cpu_seq_stop+0x10/0x10
[8463005.075666] WARN: tcp_init_metrics+0x44/0x190
[8463005.075673] WARN: tcp_init_transfer+0x40/0x100
[8463005.075679] WARN: tcp_finish_connect+0x76/0xf0
[8463005.075684] WARN: tcp_rcv_state_process+0x6c3/0xde8
[8463005.075691] WARN: ? sk_filter_trim_cap+0x47/0x220
[8463005.075697] WARN: tcp_v4_do_rcv+0x70/0x1e0
[8463005.075702] WARN: tcp_v4_rcv+0x993/0xa90
[8463005.075710] WARN: ip_local_deliver_finish+0x98/0x1e0
[8463005.075716] WARN: ip_local_deliver+0x6b/0xe0
[8463005.075721] WARN: ? ip_rcv_core.isra.18+0x290/0x290
[8463005.075727] WARN: ip_rcv+0x52/0xd0
[8463005.075731] WARN: ? ip_local_deliver_finish+0x1e0/0x1e0
[8463005.075739] WARN: __netif_receive_skb_one_core+0x52/0x70
[8463005.075746] WARN: process_backlog+0xa3/0x150
[8463005.075751] WARN: net_rx_action+0x2a5/0x3e0
[8463005.075758] WARN: __do_softirq+0xd1/0x28c
[8463005.075766] WARN: do_softirq_own_stack+0x2a/0x40
[8463005.075771] WARN: </IRQ>
[8463005.075778] WARN: do_softirq+0x4b/0x70
[8463005.075784] WARN: __local_bh_enable_ip+0x57/0x60
[8463005.075794] WARN: ovs_packet_cmd_execute+0x296/0x2c0 [openvswitch]
[8463005.075803] WARN: genl_family_rcv_msg+0x1f7/0x3b0
[8463005.075809] WARN: genl_rcv_msg+0x47/0x90
[8463005.075814] WARN: ? genl_family_rcv_msg+0x3b0/0x3b0
[8463005.075820] WARN: netlink_rcv_skb+0xd4/0x110
[8463005.075825] WARN: genl_rcv+0x24/0x40
[8463005.075830] WARN: netlink_unicast+0x182/0x230
[8463005.075836] WARN: netlink_sendmsg+0x2ed/0x3e0
[8463005.075841] WARN: sock_sendmsg+0x36/0x50
[8463005.075846] WARN: ___sys_sendmsg+0x2b5/0x2d0
[8463005.075855] WARN: ? ep_send_events_proc+0x86/0x1a0
[8463005.075860] WARN: ? ep_modify+0x160/0x160
[8463005.075866] WARN: ? ep_scan_ready_list.isra.13+0x1d8/0x200
[8463005.075872] WARN: ? ep_poll+0x1fe/0x3c0
[8463005.075878] WARN: ? _copy_to_user+0x22/0x30
[8463005.075884] WARN: __sys_sendmsg+0x58/0xa0
[8463005.075892] WARN: do_syscall_64+0x4e/0x100
[8463005.075897] WARN: entry_SYSCALL_64_after_hwframe+0x44/0xa9
[8463005.075904] WARN: RIP: 0033:0x7f99da3f0d5d
[8463005.075908] WARN: Code: c6 20 00 00 75 10 b8 2e 00 00 00 0f 05 48 3d 01 f0 ff ff 73 31 c3 48 83 ec 08 e8 be f6 ff ff 48 89 04 24 b8 2e 00 00 00 0f 05 <48> 8b 3c 24 48 89 c2 e8 07 f7 ff ff 48 89 d0 48 83 c4 08 48 3d 01
[8463005.075923] WARN: RSP: 002b:00007f99d585d7f0 EFLAGS: 00000293 ORIG_RAX: 000000000000002e
[8463005.075930] WARN: RAX: ffffffffffffffda RBX: 00007f99d585e630 RCX: 00007f99da3f0d5d
[8463005.075937] WARN: RDX: 0000000000000000 RSI: 00007f99d585d850 RDI: 000000000000001b
[8463005.075944] WARN: RBP: 0000000000000002 R08: 0000000000000000 R09: 0000000000000001
[8463005.075951] WARN: R10: 00007f99bc001540 R11: 0000000000000293 R12: 0000000002574310
[8463005.075958] WARN: R13: 00007f99d585dcf0 R14: 0000000003aad63e R15: 00007f99d585d850
[8463005.075965] WARN: Modules linked in: tun nfsv3 nfs_acl nfs lockd grace fscache bnx2fc(O) cnic(O) uio fcoe libfcoe libfc scsi_transport_fc openvswitch nsh nf_nat_ipv6 nf_nat_ipv4 nf_conncount nf_nat 8021q garp mrp stp llc ipt_REJECT nf_reject_ipv4 xt_tcpudp xt_multiport dm_multipath xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c iptable_filter sr_mod cdrom sunrpc skx_edac intel_powerclamp crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc aesni_intel aes_x86_64 crypto_simd cryptd glue_helper dm_mod uas usb_storage ipmi_si lpc_ich i2c_i801 ipmi_devintf sg ipmi_msghandler acpi_power_meter ip_tables x_tables hid_generic usbhid hid sd_mod megaraid_sas(O) ahci i40e(O) libahci xhci_pci libata xhci_hcd scsi_dh_rdac scsi_dh_hp_sw scsi_dh_emc scsi_dh_alua scsi_mod ipv6 crc_ccitt
[8463005.076066] WARN: ---[ end trace db0040d21ba45c02 ]---
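One detail in the trace that may help narrow things down: the bytes marked `<66> 83 78 20 02` in the first Code line appear to decode to a 16-bit compare against memory at RAX+0x20, and RAX holds ff66e90000441f0f. On x86-64 with 48-bit virtual addresses, a memory access through a non-canonical pointer raises a general protection fault rather than a page fault. The sketch below checks the register values from the log. It assumes 4-level paging (48-bit addresses), and `is_canonical` is a helper name made up for this illustration.

```python
def is_canonical(addr: int, va_bits: int = 48) -> bool:
    """Return True if addr is a canonical x86-64 virtual address.

    Assumption: 48-bit virtual addresses (4-level paging). An address is
    canonical when bits 63..47 are all zeros or all ones.
    """
    top = addr >> (va_bits - 1)                # bit 47 and everything above it
    all_ones = (1 << (64 - va_bits + 1)) - 1   # 17 one-bits for va_bits=48
    return top == 0 or top == all_ones


# Register values copied from the kernel part of the trace.
registers = {
    "RAX": 0xFF66E90000441F0F,
    "RSP": 0xFFFF8888A56C3BC0,
    "R13": 0xFFFF8888A13DA180,
}

for name, value in registers.items():
    state = "canonical" if is_canonical(value) else "NON-canonical"
    print(f"{name} = {value:#018x} -> {state}")
```

If RAX is indeed the pointer being dereferenced, this suggests the kernel followed a corrupted pointer while walking the TCP metrics list. That would be consistent with memory corruption from faulty hardware, firmware, or a driver, but it does not by itself identify which.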
Does anyone have a clue as to what could be done to resolve this, or how to further narrow down where it is coming from?
Cheers,
Niels
-
Hi,
You mean XCP-ng 8.2.1, right?
The usual tour: check that all your firmware/BIOS are up to date. This looks like a firmware bug to me.
-
@NielsH The X11DPT BIOS update has about 100 bug fix/update notes. Also check the hardware SEL (System Event Log). It's a good idea to run a full memory check (it should take an hour or two; run it longer if you have the time). You can boot memtest86+ from an ISO to run a standalone memory check.
-
Hi,
Yes, indeed 8.2.1. That was a typo.
I've updated the BIOS to 3.5. We'll have to wait and see, since it only happens every few months and I have not found a way to reproduce it.
Andrew: the System Event Log (in the IPMI interface) shows nothing. Memtest and the Supermicro health-check tool also show no errors (sadly).
Since the error mentions TCP, could the cause also be a NIC or its firmware?
-
Yes, this is something I would double check: NIC issue or NIC firmware.
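Since there are other hosts with the same hardware that do not crash, one way to double check is to compare the NIC driver and firmware versions across hosts. Below is a minimal sketch that wraps `ethtool -i` and turns its "key: value" output into a dictionary. It assumes Python 3 and `ethtool` are available on the host; the interface name `eth0` is only a placeholder.

```python
import subprocess
import sys


def parse_ethtool_info(text: str) -> dict:
    """Parse the 'key: value' lines printed by `ethtool -i <iface>`."""
    info = {}
    for line in text.splitlines():
        # Split on the first colon only, since values such as bus-info
        # contain colons themselves.
        key, sep, value = line.partition(":")
        if sep:
            info[key.strip()] = value.strip()
    return info


def nic_info(iface: str) -> dict:
    """Run `ethtool -i` for one interface and return the parsed fields."""
    out = subprocess.check_output(["ethtool", "-i", iface],
                                  universal_newlines=True)
    return parse_ethtool_info(out)


if __name__ == "__main__":
    # Placeholder default; pass the real interface name as an argument.
    iface = sys.argv[1] if len(sys.argv) > 1 else "eth0"
    info = nic_info(iface)
    for field in ("driver", "version", "firmware-version", "bus-info"):
        print(f"{field}: {info.get(field, '(not reported)')}")
```

Running this on the crashing host and on a healthy one and diffing the output should show whether the driver (the module list above includes i40e) or the NIC firmware differs between them.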