Host crash with general protection fault: 0000 [#1] SMP NOPTI
-
Hello,
We have a host that crashes once every couple of months, taking down all VMs on it and rebooting itself after a few minutes.
We are running XCP-ng 8.1 on a Supermicro chassis. Supermicro's own health-check tooling reports the hardware is OK. We have a few other servers with the exact same hardware that do not have this issue.
The crash logs are very verbose, which is good, but I am not sure which information is the most relevant. In the dom0.log I see the following, which looks to be the cause:
[8463005.075512] WARN: general protection fault: 0000 [#1] SMP NOPTI
[8463005.075524] WARN: CPU: 11 PID: 2467 Comm: handler74 Tainted: G O 4.19.0+1 #1
[8463005.075532] WARN: Hardware name: Supermicro SYS-2029TP-HC1R/X11DPT-PS, BIOS 3.0a 01/12/2019
[8463005.075547] WARN: RIP: e030:__tcp_get_metrics+0x2e/0xa0
[8463005.075553] WARN: Code: 00 48 8b 05 ac dc a6 00 89 c9 48 8d 04 c8 48 8b 00 48 85 c0 74 77 45 31 d2 eb 0c 48 8b 00 41 83 c2 01 48 85 c0 74 5c 45 31 c0 <66> 83 78 20 02 4c 8d 48 10 41 0f 95 c0 31 c9 47 8d 44 00 02 44 8b
[8463005.075568] WARN: RSP: e02b:ffff8888a56c3bc0 EFLAGS: 00010246
[8463005.075574] WARN: RAX: ff66e90000441f0f RBX: 0000000000000001 RCX: 0000000000000000
[8463005.075581] WARN: RDX: ffffffff820c0640 RSI: ffff8888a56c3bf0 RDI: ffff8888a56c3bd0
[8463005.075587] WARN: RBP: ffff8888a56c3c38 R08: 0000000000000000 R09: ffffffffc050f6f0
[8463005.075594] WARN: R10: 0000000000000001 R11: 000000002207f80a R12: 00000000000000bb
[8463005.075601] WARN: R13: ffff8888a13da180 R14: ffff8888a56c3bd0 R15: ffff8888a56c3bf0
[8463005.075617] WARN: FS: 00007f99d58b5700(0000) GS:ffff8888a56c0000(0000) knlGS:0000000000000000
[8463005.075624] WARN: CS: e033 DS: 0000 ES: 0000 CR0: 0000000080050033
[8463005.075629] WARN: CR2: 00007f83ce8060a0 CR3: 000000089d094000 CR4: 0000000000040660
[8463005.075640] WARN: Call Trace:
[8463005.075645] WARN: <IRQ>
[8463005.075650] WARN: tcp_get_metrics+0xd2/0x2c0
[8463005.075660] WARN: ? rt_cpu_seq_stop+0x10/0x10
[8463005.075666] WARN: tcp_init_metrics+0x44/0x190
[8463005.075673] WARN: tcp_init_transfer+0x40/0x100
[8463005.075679] WARN: tcp_finish_connect+0x76/0xf0
[8463005.075684] WARN: tcp_rcv_state_process+0x6c3/0xde8
[8463005.075691] WARN: ? sk_filter_trim_cap+0x47/0x220
[8463005.075697] WARN: tcp_v4_do_rcv+0x70/0x1e0
[8463005.075702] WARN: tcp_v4_rcv+0x993/0xa90
[8463005.075710] WARN: ip_local_deliver_finish+0x98/0x1e0
[8463005.075716] WARN: ip_local_deliver+0x6b/0xe0
[8463005.075721] WARN: ? ip_rcv_core.isra.18+0x290/0x290
[8463005.075727] WARN: ip_rcv+0x52/0xd0
[8463005.075731] WARN: ? ip_local_deliver_finish+0x1e0/0x1e0
[8463005.075739] WARN: __netif_receive_skb_one_core+0x52/0x70
[8463005.075746] WARN: process_backlog+0xa3/0x150
[8463005.075751] WARN: net_rx_action+0x2a5/0x3e0
[8463005.075758] WARN: __do_softirq+0xd1/0x28c
[8463005.075766] WARN: do_softirq_own_stack+0x2a/0x40
[8463005.075771] WARN: </IRQ>
[8463005.075778] WARN: do_softirq+0x4b/0x70
[8463005.075784] WARN: __local_bh_enable_ip+0x57/0x60
[8463005.075794] WARN: ovs_packet_cmd_execute+0x296/0x2c0 [openvswitch]
[8463005.075803] WARN: genl_family_rcv_msg+0x1f7/0x3b0
[8463005.075809] WARN: genl_rcv_msg+0x47/0x90
[8463005.075814] WARN: ? genl_family_rcv_msg+0x3b0/0x3b0
[8463005.075820] WARN: netlink_rcv_skb+0xd4/0x110
[8463005.075825] WARN: genl_rcv+0x24/0x40
[8463005.075830] WARN: netlink_unicast+0x182/0x230
[8463005.075836] WARN: netlink_sendmsg+0x2ed/0x3e0
[8463005.075841] WARN: sock_sendmsg+0x36/0x50
[8463005.075846] WARN: ___sys_sendmsg+0x2b5/0x2d0
[8463005.075855] WARN: ? ep_send_events_proc+0x86/0x1a0
[8463005.075860] WARN: ? ep_modify+0x160/0x160
[8463005.075866] WARN: ? ep_scan_ready_list.isra.13+0x1d8/0x200
[8463005.075872] WARN: ? ep_poll+0x1fe/0x3c0
[8463005.075878] WARN: ? _copy_to_user+0x22/0x30
[8463005.075884] WARN: __sys_sendmsg+0x58/0xa0
[8463005.075892] WARN: do_syscall_64+0x4e/0x100
[8463005.075897] WARN: entry_SYSCALL_64_after_hwframe+0x44/0xa9
[8463005.075904] WARN: RIP: 0033:0x7f99da3f0d5d
[8463005.075908] WARN: Code: c6 20 00 00 75 10 b8 2e 00 00 00 0f 05 48 3d 01 f0 ff ff 73 31 c3 48 83 ec 08 e8 be f6 ff ff 48 89 04 24 b8 2e 00 00 00 0f 05 <48> 8b 3c 24 48 89 c2 e8 07 f7 ff ff 48 89 d0 48 83 c4 08 48 3d 01
[8463005.075923] WARN: RSP: 002b:00007f99d585d7f0 EFLAGS: 00000293 ORIG_RAX: 000000000000002e
[8463005.075930] WARN: RAX: ffffffffffffffda RBX: 00007f99d585e630 RCX: 00007f99da3f0d5d
[8463005.075937] WARN: RDX: 0000000000000000 RSI: 00007f99d585d850 RDI: 000000000000001b
[8463005.075944] WARN: RBP: 0000000000000002 R08: 0000000000000000 R09: 0000000000000001
[8463005.075951] WARN: R10: 00007f99bc001540 R11: 0000000000000293 R12: 0000000002574310
[8463005.075958] WARN: R13: 00007f99d585dcf0 R14: 0000000003aad63e R15: 00007f99d585d850
[8463005.075965] WARN: Modules linked in: tun nfsv3 nfs_acl nfs lockd grace fscache bnx2fc(O) cnic(O) uio fcoe libfcoe libfc scsi_transport_fc openvswitch nsh nf_nat_ipv6 nf_nat_ipv4 nf_conncount nf_nat 8021q garp mrp stp llc ipt_REJECT nf_reject_ipv4 xt_tcpudp xt_multiport dm_multipath xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c iptable_filter sr_mod cdrom sunrpc skx_edac intel_powerclamp crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc aesni_intel aes_x86_64 crypto_simd cryptd glue_helper dm_mod uas usb_storage ipmi_si lpc_ich i2c_i801 ipmi_devintf sg ipmi_msghandler acpi_power_meter ip_tables x_tables hid_generic usbhid hid sd_mod megaraid_sas(O) ahci i40e(O) libahci xhci_pci libata xhci_hcd scsi_dh_rdac scsi_dh_hp_sw scsi_dh_emc scsi_dh_alua scsi_mod ipv6 crc_ccitt
[8463005.076066] WARN: ---[ end trace db0040d21ba45c02 ]---

Does anyone have a clue to what could be done to resolve this or further narrow down where this is coming from?
Cheers,
Niels
-
Hi,
You mean XCP-ng 8.2.1, right?
The usual first step: check that all your firmware and the BIOS are up to date. That said, this looks like a firmware bug to me.

-
@NielsH The X11DPT BIOS update has about 100 bug-fix and update notes. Also check the hardware SEL (System Event Log). It's a good idea to run a full memory check (it should take an hour or two; run it longer if you have the time). You can boot memtest86+ from an ISO to run a standalone memory check.
-
Hi,
Yes, indeed 8.2.1. That was a typo.

I've updated the BIOS to 3.5. We'll have to wait and see, since the crash only happens every few months and I have not found a way to reproduce it.
Andrew: the System Event Log (in the IPMI interface) shows nothing. Memtest and the Supermicro health-check tool also show no errors (sadly).
Since the error mentions TCP, could the cause also be a NIC or its firmware?
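One more observation from reading the trace itself. The bytes at the faulting instruction (`<66> 83 78 20 02`) appear to decode to a 16-bit compare against memory at RAX+0x20, so RAX is probably the pointer being followed when the fault hit. On x86-64, dereferencing a non-canonical address raises a general protection fault rather than a page fault, so it may be worth checking whether that RAX value is canonical. Below is a minimal Python sketch of that check. It assumes 48-bit virtual addresses (4-level paging) on this host, and the helper names are made up for illustration:

```python
import re

# One register line copied from the dom0.log excerpt earlier in this thread.
LOG_LINE = ("[8463005.075574] WARN: RAX: ff66e90000441f0f "
            "RBX: 0000000000000001 RCX: 0000000000000000")


def parse_registers(line):
    """Return {name: value} for every 'NAME: <16 hex digits>' pair in a log line."""
    pairs = re.findall(r"\b([A-Z][A-Z0-9]{1,2}): ([0-9a-f]{16})\b", line)
    return {name: int(value, 16) for name, value in pairs}


def is_canonical(addr, va_bits=48):
    """True if bits 63..(va_bits-1) of addr are all 0 or all 1.

    va_bits=48 is an assumption (4-level paging); it is not taken from the log.
    """
    top = addr >> (va_bits - 1)
    all_ones = (1 << (64 - va_bits + 1)) - 1
    return top == 0 or top == all_ones


if __name__ == "__main__":
    for name, value in parse_registers(LOG_LINE).items():
        state = "canonical" if is_canonical(value) else "NON-canonical"
        print(f"{name} = {value:#018x} ({state})")
```

Under that assumption RAX comes out as non-canonical, which would be consistent with a corrupted pointer being followed inside `__tcp_get_metrics` rather than a logic bug in the TCP code itself. That does not identify what corrupted it (bad RAM, firmware, or a driver writing where it should not are all possible), so I am treating it as a hint only.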
-
Yes, this is something I would double check: NIC issue or NIC firmware.
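Since there are identical hosts that do not crash, one way to do that check is to dump the driver and firmware version of every NIC on each host and compare the results. Here is a rough Python 3 sketch. It assumes `ethtool` and `python3` are available in dom0, and the function names are only for illustration:

```python
import shutil
import subprocess
from pathlib import Path


def parse_ethtool_info(text):
    """Turn the 'key: value' lines printed by `ethtool -i` into a dict."""
    info = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            info[key.strip()] = value.strip()
    return info


def nic_versions(sys_net="/sys/class/net"):
    """Return {interface: ethtool info} for every interface ethtool can query."""
    result = {}
    root = Path(sys_net)
    if shutil.which("ethtool") is None or not root.is_dir():
        return result  # nothing to report on this machine
    for iface in sorted(p.name for p in root.iterdir()):
        try:
            proc = subprocess.run(["ethtool", "-i", iface],
                                  stdout=subprocess.PIPE,
                                  stderr=subprocess.PIPE,
                                  universal_newlines=True)
        except OSError:
            continue
        if proc.returncode == 0:
            result[iface] = parse_ethtool_info(proc.stdout)
    return result


if __name__ == "__main__":
    for iface, info in nic_versions().items():
        print(iface, info.get("driver"), info.get("version"),
              info.get("firmware-version"))
```

Run it on the crashing host and on one of the healthy ones and diff the output. The module list in the trace shows `i40e(O)`, so the `i40e` lines are the ones to look at first.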