Kernel panic on fresh install
-
I think packets inside WG use a smaller MTU because each one has to carry the tunnel's encapsulation overhead (outer headers and authentication data).
You might wonder how I guessed that you are using BSD VMs + VPNs: in fact, you are not the first to report the problem. We even had it at some point here (but we never managed to reproduce it).
What we know:
- it's an OVS bug
- it's happening when you use BSD-like VMs
- and also probably VPNs
So we suspect a packet that OVS can't decode without exploding. However, it's hard to move forward without being able to reproduce it, because only then can we find out exactly which packet is doing this.
-
@olivierlambert would my complete crash dump directory help you?
Another thing that's annoying me is that the VMs on the host don't start even with the autostart option checked.
-
For autostart: just disable/re-enable it in XO, that will do the trick.
The crash dump is sadly not enough to know which packet actually causes OVS to crash.
-
@olivierlambert Is it because some packets coming from WireGuard are too large? Should I set the MTU of the WireGuard servers and clients lower than the MTU of the host interface?
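For reference, here is a rough sketch of the MTU arithmetic behind that question. The byte counts are the standard IPv4/IPv6/UDP header sizes plus WireGuard's data-message header and authentication tag; the function name `max_tunnel_mtu` is just made up for this example. It only shows where the usual 1420 default comes from and says nothing about whether MTU is what triggers the crash.

```python
# Per-packet overhead added by WireGuard on the wire, in bytes.
IPV4_HEADER = 20
IPV6_HEADER = 40
UDP_HEADER = 8
WG_HEADER = 16    # message type + reserved (4), receiver index (4), counter (8)
WG_AUTH_TAG = 16  # Poly1305 authentication tag


def max_tunnel_mtu(link_mtu: int, outer_ipv6: bool = True) -> int:
    """Largest inner packet that still fits in one outer packet of link_mtu bytes."""
    outer_ip = IPV6_HEADER if outer_ipv6 else IPV4_HEADER
    return link_mtu - outer_ip - UDP_HEADER - WG_HEADER - WG_AUTH_TAG


print(max_tunnel_mtu(1500))         # worst case, outer IPv6: 1420
print(max_tunnel_mtu(1500, False))  # outer IPv4 only: 1440
```

So with a 1500-byte host interface, a tunnel MTU of 1420 or lower leaves room for the encapsulation in both the IPv4 and IPv6 cases.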
-
Sadly, I don't know exactly what's causing it. Ideally, if you can find a way to trigger it on purpose, that would be wonderful.
-
Same here. XCP-ng started to crash unexpectedly in the last 3 months for no obvious reason, with a similar crash log:
[ 214278.799922] ALERT: BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
[ 214278.799944] INFO: PGD 0 P4D 0
[ 214278.799956] WARN: Oops: 0000 [#1] SMP NOPTI
[ 214278.799967] WARN: CPU: 4 PID: 0 Comm: swapper/4 Tainted: G O 4.19.0+1 #1
[ 214278.799976] WARN: Hardware name: Quanta Cloud Technology Inc. QuantaPlex T22HF-1U/S5HF MB, BIOS 3A05.ON02 03/20/2019
[ 214278.799994] WARN: RIP: e030:skb_copy_ubufs+0x19c/0x5f0
[ 214278.800001] WARN: Code: 90 cc 00 00 00 48 03 90 d0 00 00 00 48 63 44 24 40 48 83 c0 03 48 c1 e0 04 48 01 d0 48 89 18 c7 40 08 00 00 00 00 44 89 78 0c <48> 8b 43 08 a8 01 0f 85 3f 04 00 00 48 8b 44 24 30 48 83 78 20 ff
[ 214278.800017] WARN: RSP: e02b:ffff888235303668 EFLAGS: 00010282
[ 214278.800025] WARN: RAX: ffff8880a540dce0 RBX: 0000000000000000 RCX: 00000000000000c0
[ 214278.800033] WARN: RDX: ffff8880a540dcc0 RSI: ffff8880a540dcc0 RDI: ffffea0008ff6380
[ 214278.800042] WARN: RBP: 0000000000000000 R08: ffff8880a540dc00 R09: 0000000000000001
[ 214278.800050] WARN: R10: 0000000000000259 R11: ffff88812597f540 R12: ffff888042ae1d00
[ 214278.800057] WARN: R13: 0000000000000000 R14: ffff888122bec8c0 R15: 0000000000000000
[ 214278.800079] WARN: FS: 0000000000000000(0000) GS:ffff888235300000(0000) knlGS:0000000000000000
[ 214278.800088] WARN: CS: e033 DS: 002b ES: 002b CR0: 0000000080050033
[ 214278.800095] WARN: CR2: 0000000000000008 CR3: 00000001f049e000 CR4: 0000000000040660
[ 214278.800105] WARN: Call Trace:
[ 214278.800113] WARN: <IRQ>
[ 214278.800129] WARN: tun_net_xmit+0x3de/0x460 [tun]
[ 214278.800140] WARN: dev_hard_start_xmit+0xa4/0x210
[ 214278.800151] WARN: sch_direct_xmit+0x10d/0x350
[ 214278.800159] WARN: __qdisc_run+0x167/0x4e0
[ 214278.800167] WARN: ? pfifo_fast_enqueue+0x92/0xf0
[ 214278.800176] WARN: __dev_queue_xmit+0x511/0x900
[ 214278.800189] WARN: do_execute_actions+0x157f/0x1750 [openvswitch]
[ 214278.800203] WARN: ? __wake_up_common_lock+0x87/0xc0
[ 214278.800214] WARN: ? __raw_callee_save_xen_vcpu_stolen+0x11/0x20
[ 214278.800226] WARN: ? __radix_tree_lookup+0x80/0xf0
[ 214278.800237] WARN: ovs_execute_actions+0x47/0x120 [openvswitch]
[ 214278.800249] WARN: ovs_dp_process_packet+0x7d/0x110 [openvswitch]
[ 214278.800261] WARN: ? key_extract+0xa53/0xd60 [openvswitch]
[ 214278.800274] WARN: ovs_vport_receive+0x6e/0xd0 [openvswitch]
[ 214278.800285] WARN: ? hrtimer_init+0x190/0x190
[ 214278.800294] WARN: ? xen_vcpuop_set_next_event+0x69/0xa0
[ 214278.800302] WARN: ? __alloc_skb+0x76/0x270
[ 214278.800312] WARN: ? arch_local_irq_restore+0x5/0x10
[ 214278.800320] WARN: ? __slab_alloc.constprop.81+0x42/0x4e
[ 214278.800327] WARN: ? __alloc_skb+0x76/0x270
[ 214278.800334] WARN: ? __kmalloc_track_caller+0x195/0x200
[ 214278.800343] WARN: ? __kmalloc_reserve.isra.48+0x29/0x70
[ 214278.800357] WARN: netdev_frame_hook+0x105/0x180 [openvswitch]
[ 214278.800367] WARN: __netif_receive_skb_core+0x211/0xb30
[ 214278.800377] WARN: __netif_receive_skb_one_core+0x36/0x70
[ 214278.800385] WARN: netif_receive_skb_internal+0x34/0xe0
[ 214278.800396] WARN: xenvif_tx_action+0x4b8/0x900
[ 214278.800406] WARN: xenvif_poll+0x27/0x70
[ 214278.800416] WARN: net_rx_action+0x2a5/0x3e0
[ 214278.800427] WARN: __do_softirq+0xd1/0x28c
[ 214278.800438] WARN: irq_exit+0xa8/0xc0
[ 214278.800448] WARN: xen_evtchn_do_upcall+0x2c/0x50
[ 214278.800459] WARN: xen_do_hypervisor_callback+0x29/0x40
[ 214278.800468] WARN: </IRQ>
[ 214278.800477] WARN: RIP: e030:xen_hypercall_sched_op+0xa/0x20
[ 214278.800485] WARN: Code: 51 41 53 b8 1c 00 00 00 0f 05 41 5b 59 c3 cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc 51 41 53 b8 1d 00 00 00 0f 05 <41> 5b 59 c3 cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc
[ 214278.800502] WARN: RSP: e02b:ffffc900400b3eb0 EFLAGS: 00000246
[ 214278.800510] WARN: RAX: 0000000000000000 RBX: ffff88822c239d00 RCX: ffffffff810013aa
[ 214278.800519] WARN: RDX: ffffffff8203d250 RSI: 0000000000000000 RDI: 0000000000000001
[ 214278.800528] WARN: RBP: 0000000000000004 R08: 000000000001ca00 R09: 0000000000000000
[ 214278.800537] WARN: R10: 0000000000007ff0 R11: 0000000000000246 R12: 0000000000000000
[ 214278.800545] WARN: R13: 0000000000000000 R14: ffff88822c239d00 R15: ffff88822c239d00
[ 214278.800557] WARN: ? xen_hypercall_sched_op+0xa/0x20
[ 214278.800567] WARN: ? xen_safe_halt+0xc/0x20
[ 214278.800576] WARN: ? default_idle+0x1a/0x140
[ 214278.800585] WARN: ? do_idle+0x1ea/0x260
[ 214278.800594] WARN: ? cpu_startup_entry+0x6f/0x80
[ 214278.800602] WARN: Modules linked in: tun rpcsec_gss_krb5 nfsv4 nfs fscache bnx2fc(O) cnic(O) uio fcoe libfcoe libfc scsi_transport_fc 8021q garp mrp stp llc openvswitch nsh nf_nat_ipv6 nf_nat_ipv4 nf_conncount nf_nat dm_multipath ipt_REJECT nf_reject_ipv4 xt_tcpudp xt_multiport xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c iptable_filter nls_iso8859_1 nls_cp437 vfat fat raid0 crct10dif_pclmul crc32_pclmul ghash_clmulni_intel md_mod pcbc aesni_intel dm_mod aes_x86_64 crypto_simd cryptd glue_helper i2c_piix4 k10temp ipmi_si ipmi_devintf ipmi_msghandler nfsd auth_rpcgss oid_registry nfs_acl lockd grace sunrpc ip_tables x_tables ahci libahci nvme xhci_pci libata nvme_core xhci_hcd ixgbe(O) scsi_dh_rdac scsi_dh_hp_sw scsi_dh_emc scsi_dh_alua scsi_mod efivarfs ipv6 crc_ccitt
[ 214278.800746] WARN: CR2: 0000000000000008
[ 214278.800768] WARN: ---[ end trace 0f1c8a4f455bc1b3 ]---
[ 214280.803918] WARN: RIP: e030:skb_copy_ubufs+0x19c/0x5f0
[ 214280.803947] WARN: Code: 90 cc 00 00 00 48 03 90 d0 00 00 00 48 63 44 24 40 48 83 c0 03 48 c1 e0 04 48 01 d0 48 89 18 c7 40 08 00 00 00 00 44 89 78 0c <48> 8b 43 08 a8 01 0f 85 3f 04 00 00 48 8b 44 24 30 48 83 78 20 ff
[ 214280.803971] WARN: RSP: e02b:ffff888235303668 EFLAGS: 00010282
[ 214280.803977] WARN: RAX: ffff8880a540dce0 RBX: 0000000000000000 RCX: 00000000000000c0
[ 214280.803982] WARN: RDX: ffff8880a540dcc0 RSI: ffff8880a540dcc0 RDI: ffffea0008ff6380
[ 214280.803987] WARN: RBP: 0000000000000000 R08: ffff8880a540dc00 R09: 0000000000000001
[ 214280.803992] WARN: R10: 0000000000000259 R11: ffff88812597f540 R12: ffff888042ae1d00
[ 214280.804003] WARN: R13: 0000000000000000 R14: ffff888122bec8c0 R15: 0000000000000000
[ 214280.804018] WARN: FS: 0000000000000000(0000) GS:ffff888235300000(0000) knlGS:0000000000000000
[ 214280.804027] WARN: CS: e033 DS: 002b ES: 002b CR0: 0000000080050033
[ 214280.804032] WARN: CR2: 0000000000000008 CR3: 00000001f049e000 CR4: 0000000000040660
[ 214280.804040] EMERG: Kernel panic - not syncing: Fatal exception in interrupt
In my setup I run OPNsense (yes, FreeBSD-based) on top as firewall/VPN (WG, OpenVPN). I'll check the MTU and reply here.
-
lspci on host:
01:00.0 Ethernet controller: Intel Corporation Ethernet Controller 10G X550T (rev 01)
01:00.1 Ethernet controller: Intel Corporation Ethernet Controller 10G X550T (rev 01)
MTU on OPNsense:
LAN interface (lan, xn1) Status up MTU 1500
RoadWarriorWG0 interface (opt1, wg0) Status up MTU 16304
WAN interface (wan, xn0) Status up MTU 1500
site2siteWG1 interface (opt2, wg1) Status up MTU 1420
xcp interface (opt3, xn2) Status up MTU 1500
Unassigned interface (lo0) MTU 16384
Unassigned interface (enc0) Status down MTU 1536
Unassigned interface (pflog0) Status down MTU 33160
Unassigned interface (ovpns1) Status up MTU 1500
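A quick sanity check on that listing, as a sketch only: it flags WireGuard interfaces whose MTU is larger than what fits in a 1500-byte underlay once the worst-case WireGuard overhead (80 bytes with an outer IPv6 header) is subtracted. The regular expression is written against the lines as pasted above and the helper name `oversized_tunnels` is made up for this example. A flagged interface is not proof of anything, since nothing in this thread establishes MTU as the cause of the panic.

```python
import re

# A few lines copied from the listing above.
LISTING = """\
LAN interface (lan, xn1) Status up MTU 1500
RoadWarriorWG0 interface (opt1, wg0) Status up MTU 16304
WAN interface (wan, xn0) Status up MTU 1500
site2siteWG1 interface (opt2, wg1) Status up MTU 1420
"""

# Outer IPv6 header (40) + UDP header (8) + WireGuard header and tag (32).
WG_OVERHEAD_WORST_CASE = 80

LINE_RE = re.compile(r"\((?P<label>\w+), (?P<dev>\w+)\)\s*Status \w+ MTU (?P<mtu>\d+)")


def oversized_tunnels(listing: str, underlay_mtu: int = 1500) -> list:
    """Return (device, mtu) pairs for wg* interfaces above underlay_mtu minus overhead."""
    limit = underlay_mtu - WG_OVERHEAD_WORST_CASE
    hits = []
    for line in listing.splitlines():
        match = LINE_RE.search(line)
        if match and match["dev"].startswith("wg") and int(match["mtu"]) > limit:
            hits.append((match["dev"], int(match["mtu"])))
    return hits


print(oversized_tunnels(LISTING))  # [('wg0', 16304)]
```

Here only wg0 stands out; wg1 is already at 1420.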
-
We just released new patches that might solve this. Please update and keep us posted the next time you have the problem.
-
@olivierlambert
Thank you for the quick fix!
I'll leave it without the patch for a couple of days to see how it behaves with a reduced MTU on the WireGuard interface before applying the patch, just in case the MTU change does the trick.
-
@olivierlambert Thanks! I will try it as soon as I'm back from vacation in early September.
-
@olivierlambert The server just rebooted with the same error. The updates are installed and waiting for the next reboot.
The log file is almost identical. The main difference compared to the previous crash 2 days ago:
old:
WARN: CPU: 4 PID: 0 Comm: swapper/4 Tainted: G O 4.19.0+1 #1
new:
WARN: CPU: 2 PID: 0 Comm: swapper/2 Tainted: G O 4.19.0+1 #1
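In case it helps others compare crash logs the same way, here is a small sketch that splits a kernel log on its `[ seconds.micros]` timestamps, drops them, and lists the entries that appear in only one of the two logs. The function names are made up for this example, and the timestamps in `new_log` are placeholders, since only the old log's timestamps were posted here.

```python
import re

# Matches kernel log timestamps such as "[ 214278.799967] ".
STAMP_RE = re.compile(r"\[\s*\d+\.\d+\]\s*")


def entries(raw: str) -> list:
    """Split a log (flattened or one entry per line) into entries without timestamps."""
    joined = " ".join(raw.split())
    return [part.strip() for part in STAMP_RE.split(joined) if part.strip()]


def differing_entries(old: str, new: str) -> tuple:
    """Return (entries only in old, entries only in new), ignoring timestamps."""
    old_entries, new_entries = entries(old), entries(new)
    only_old = [e for e in old_entries if e not in new_entries]
    only_new = [e for e in new_entries if e not in old_entries]
    return only_old, only_new


old_log = (
    "[ 214278.799967] WARN: CPU: 4 PID: 0 Comm: swapper/4 Tainted: G O 4.19.0+1 #1 "
    "[ 214278.800105] WARN: Call Trace:"
)
# Placeholder timestamps: the real ones from the second crash are not in this thread.
new_log = (
    "[ 1.000000] WARN: CPU: 2 PID: 0 Comm: swapper/2 Tainted: G O 4.19.0+1 #1 "
    "[ 1.000001] WARN: Call Trace:"
)

only_old, only_new = differing_entries(old_log, new_log)
print(only_old)
print(only_new)
```

Register dumps and addresses will usually differ between crashes too, so on full logs it is mostly the call trace lines that are worth comparing.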
-
So you had the same issue after all the last updates + a "manual" reboot?
-
@olivierlambert said in Kernel panic on fresh install:
So you had the same issue after all the last updates + a "manual" reboot?
No reboot after the updates. Just to mention that, in my case, reducing the MTU on the heavily used WireGuard interface didn't help.
Also, these reboots are completely unpredictable: sometimes during a busy day, but more often during night hours when only backups can be running.
-
Okay so now you have the updates really installed, we'll see if it happens
-
@olivierlambert said in Kernel panic on fresh install:
Okay so now you have the updates really installed, we'll see if it happens
I just got a crash and reboot again, while trying to restart a VM from XCP-center running in another VM in the pool. This reboot should apply all the patches from the last updates.
-
Just had two consecutive crashes 15 minutes apart.
This is a comparison between the old crash, before the updates, and the one after the latest updates.
-
New crash, same message...
-
It's weird: OVS is not involved, so it might be something else.
Any chance you know how to trigger it artificially? That would be really helpful to pinpoint the issue.
-
@olivierlambert
I would love to! The only thing I can say is that when I was using the Windows XCP-console from a VM inside the pool, starting a VM once caused the whole system to crash with this error, and on another day changing the vApp config also crashed the server. That is why I am trying to avoid using xcp-console during business hours. I'll try to reproduce it once again and reply.
-
@sasha It's worth noting that the BIOS (from 2019) is relatively old. It's recommended to update it to a more recent version.