Kernel panic on fresh install

olivierlambert

I mean the physical NIC in your host. "NIC type" doesn't matter: as soon as your OS has booted, it will switch to Xen PV NICs.

Are you using Wireguard and/or any VPN that might use a different MTU than 1500?

JEDIBC

@olivierlambert Here's the NIC brand/model: Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5720 Gigabit Ethernet PCIe

Yes, we use WireGuard, but the MTU is not specified, so it should be 1500:

olivierlambert

I think packets inside WG are using a smaller MTU because they must contain VPN keys and such.
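To make the MTU point concrete, here is a minimal sketch of the arithmetic, assuming a 1500-byte underlying link and the standard WireGuard data-packet framing (outer IP header, 8-byte UDP header, 16-byte WireGuard data header, 16-byte authentication tag). The function and constant names are illustrative only.

```python
# Per-packet overhead that WireGuard adds around the inner (tunnelled) packet.
OUTER_IP_HEADER = {"ipv4": 20, "ipv6": 40}  # outer IP header size in bytes
UDP_HEADER = 8                               # WireGuard is carried over UDP
WG_DATA_OVERHEAD = 32                        # 16-byte data header + 16-byte auth tag


def wireguard_tunnel_mtu(link_mtu: int = 1500, outer: str = "ipv6") -> int:
    """Largest inner packet that fits in one outer packet without fragmentation."""
    return link_mtu - OUTER_IP_HEADER[outer] - UDP_HEADER - WG_DATA_OVERHEAD


print(wireguard_tunnel_mtu(1500, "ipv6"))  # 1420
print(wireguard_tunnel_mtu(1500, "ipv4"))  # 1440
```

This is why a WireGuard interface on a 1500-byte link commonly ends up with an MTU of 1420 rather than 1500, even when nothing is set explicitly.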

You might wonder how I guessed that you are using BSD VMs + VPNs: in fact, you are not the first to report the problem. We even had it ourselves at some point (but we never managed to reproduce it).

What we know:

it's an OVS bug
it's happening when you use BSD-like VMs
and also probably VPNs

So we suspect a packet that OVS can't decode without exploding. However, it's hard to move forward without being able to reproduce it, so that we can find exactly which packet is doing this.

JEDIBC

@olivierlambert Would my complete crash dump directory help you?

Another thing that's annoying me is that the VMs on the host don't start even with the autostart option checked.

olivierlambert

For autostart: just disable/re-enable it in XO, that will do the trick.

The crash dump is sadly not enough to know which packet actually causes OVS to crash.

JEDIBC

@olivierlambert Is it because some packets coming from WireGuard are too large? Should I set the MTU of the WireGuard servers and clients lower than the MTU of the host interface?

olivierlambert

Sadly, I don't know exactly what's causing it. Ideally, if you can find a way to trigger it on purpose, that would be wonderful.

sasha

Same here. XCP-ng started crashing unexpectedly in the last 3 months, for no obvious reason, with a similar crash log:

[ 214278.799922]  ALERT: BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
[ 214278.799944]   INFO: PGD 0 P4D 0
[ 214278.799956]   WARN: Oops: 0000 [#1] SMP NOPTI
[ 214278.799967]   WARN: CPU: 4 PID: 0 Comm: swapper/4 Tainted: G           O      4.19.0+1 #1
[ 214278.799976]   WARN: Hardware name: Quanta Cloud Technology Inc. QuantaPlex T22HF-1U/S5HF MB, BIOS 3A05.ON02 03/20/2019
[ 214278.799994]   WARN: RIP: e030:skb_copy_ubufs+0x19c/0x5f0
[ 214278.800001]   WARN: Code: 90 cc 00 00 00 48 03 90 d0 00 00 00 48 63 44 24 40 48 83 c0 03 48 c1 e0 04 48 01 d0 48 89 18 c7 40 08 00 00 00 00 44 89 78 0c <48> 8b 43 08 a8 01 0f 85 3f 04 00 00 48 8b 44 24 30 48 83 78 20 ff
[ 214278.800017]   WARN: RSP: e02b:ffff888235303668 EFLAGS: 00010282
[ 214278.800025]   WARN: RAX: ffff8880a540dce0 RBX: 0000000000000000 RCX: 00000000000000c0
[ 214278.800033]   WARN: RDX: ffff8880a540dcc0 RSI: ffff8880a540dcc0 RDI: ffffea0008ff6380
[ 214278.800042]   WARN: RBP: 0000000000000000 R08: ffff8880a540dc00 R09: 0000000000000001
[ 214278.800050]   WARN: R10: 0000000000000259 R11: ffff88812597f540 R12: ffff888042ae1d00
[ 214278.800057]   WARN: R13: 0000000000000000 R14: ffff888122bec8c0 R15: 0000000000000000
[ 214278.800079]   WARN: FS:  0000000000000000(0000) GS:ffff888235300000(0000) knlGS:0000000000000000
[ 214278.800088]   WARN: CS:  e033 DS: 002b ES: 002b CR0: 0000000080050033
[ 214278.800095]   WARN: CR2: 0000000000000008 CR3: 00000001f049e000 CR4: 0000000000040660
[ 214278.800105]   WARN: Call Trace:
[ 214278.800113]   WARN:  <IRQ>
[ 214278.800129]   WARN:  tun_net_xmit+0x3de/0x460 [tun]
[ 214278.800140]   WARN:  dev_hard_start_xmit+0xa4/0x210
[ 214278.800151]   WARN:  sch_direct_xmit+0x10d/0x350
[ 214278.800159]   WARN:  __qdisc_run+0x167/0x4e0
[ 214278.800167]   WARN:  ? pfifo_fast_enqueue+0x92/0xf0
[ 214278.800176]   WARN:  __dev_queue_xmit+0x511/0x900
[ 214278.800189]   WARN:  do_execute_actions+0x157f/0x1750 [openvswitch]
[ 214278.800203]   WARN:  ? __wake_up_common_lock+0x87/0xc0
[ 214278.800214]   WARN:  ? __raw_callee_save_xen_vcpu_stolen+0x11/0x20
[ 214278.800226]   WARN:  ? __radix_tree_lookup+0x80/0xf0
[ 214278.800237]   WARN:  ovs_execute_actions+0x47/0x120 [openvswitch]
[ 214278.800249]   WARN:  ovs_dp_process_packet+0x7d/0x110 [openvswitch]
[ 214278.800261]   WARN:  ? key_extract+0xa53/0xd60 [openvswitch]
[ 214278.800274]   WARN:  ovs_vport_receive+0x6e/0xd0 [openvswitch]
[ 214278.800285]   WARN:  ? hrtimer_init+0x190/0x190
[ 214278.800294]   WARN:  ? xen_vcpuop_set_next_event+0x69/0xa0
[ 214278.800302]   WARN:  ? __alloc_skb+0x76/0x270
[ 214278.800312]   WARN:  ? arch_local_irq_restore+0x5/0x10
[ 214278.800320]   WARN:  ? __slab_alloc.constprop.81+0x42/0x4e
[ 214278.800327]   WARN:  ? __alloc_skb+0x76/0x270
[ 214278.800334]   WARN:  ? __kmalloc_track_caller+0x195/0x200
[ 214278.800343]   WARN:  ? __kmalloc_reserve.isra.48+0x29/0x70
[ 214278.800357]   WARN:  netdev_frame_hook+0x105/0x180 [openvswitch]
[ 214278.800367]   WARN:  __netif_receive_skb_core+0x211/0xb30
[ 214278.800377]   WARN:  __netif_receive_skb_one_core+0x36/0x70
[ 214278.800385]   WARN:  netif_receive_skb_internal+0x34/0xe0
[ 214278.800396]   WARN:  xenvif_tx_action+0x4b8/0x900
[ 214278.800406]   WARN:  xenvif_poll+0x27/0x70
[ 214278.800416]   WARN:  net_rx_action+0x2a5/0x3e0
[ 214278.800427]   WARN:  __do_softirq+0xd1/0x28c
[ 214278.800438]   WARN:  irq_exit+0xa8/0xc0
[ 214278.800448]   WARN:  xen_evtchn_do_upcall+0x2c/0x50
[ 214278.800459]   WARN:  xen_do_hypervisor_callback+0x29/0x40
[ 214278.800468]   WARN:  </IRQ>
[ 214278.800477]   WARN: RIP: e030:xen_hypercall_sched_op+0xa/0x20
[ 214278.800485]   WARN: Code: 51 41 53 b8 1c 00 00 00 0f 05 41 5b 59 c3 cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc 51 41 53 b8 1d 00 00 00 0f 05 <41> 5b 59 c3 cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc
[ 214278.800502]   WARN: RSP: e02b:ffffc900400b3eb0 EFLAGS: 00000246
[ 214278.800510]   WARN: RAX: 0000000000000000 RBX: ffff88822c239d00 RCX: ffffffff810013aa
[ 214278.800519]   WARN: RDX: ffffffff8203d250 RSI: 0000000000000000 RDI: 0000000000000001
[ 214278.800528]   WARN: RBP: 0000000000000004 R08: 000000000001ca00 R09: 0000000000000000
[ 214278.800537]   WARN: R10: 0000000000007ff0 R11: 0000000000000246 R12: 0000000000000000
[ 214278.800545]   WARN: R13: 0000000000000000 R14: ffff88822c239d00 R15: ffff88822c239d00
[ 214278.800557]   WARN:  ? xen_hypercall_sched_op+0xa/0x20
[ 214278.800567]   WARN:  ? xen_safe_halt+0xc/0x20
[ 214278.800576]   WARN:  ? default_idle+0x1a/0x140
[ 214278.800585]   WARN:  ? do_idle+0x1ea/0x260
[ 214278.800594]   WARN:  ? cpu_startup_entry+0x6f/0x80
[ 214278.800602]   WARN: Modules linked in: tun rpcsec_gss_krb5 nfsv4 nfs fscache bnx2fc(O) cnic(O) uio fcoe libfcoe libfc scsi_transport_fc 8021q garp mrp stp llc openvswitch nsh nf_nat_ipv6 nf_nat_ipv4 nf_conncount nf_nat dm_multipath ipt_REJECT nf_reject_ipv4 xt_tcpudp xt_multiport xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c iptable_filter nls_iso8859_1 nls_cp437 vfat fat raid0 crct10dif_pclmul crc32_pclmul ghash_clmulni_intel md_mod pcbc aesni_intel dm_mod aes_x86_64 crypto_simd cryptd glue_helper i2c_piix4 k10temp ipmi_si ipmi_devintf ipmi_msghandler nfsd auth_rpcgss oid_registry nfs_acl lockd grace sunrpc ip_tables x_tables ahci libahci nvme xhci_pci libata nvme_core xhci_hcd ixgbe(O) scsi_dh_rdac scsi_dh_hp_sw scsi_dh_emc scsi_dh_alua scsi_mod efivarfs ipv6 crc_ccitt
[ 214278.800746]   WARN: CR2: 0000000000000008
[ 214278.800768]   WARN: ---[ end trace 0f1c8a4f455bc1b3 ]---
[ 214280.803918]   WARN: RIP: e030:skb_copy_ubufs+0x19c/0x5f0
[ 214280.803947]   WARN: Code: 90 cc 00 00 00 48 03 90 d0 00 00 00 48 63 44 24 40 48 83 c0 03 48 c1 e0 04 48 01 d0 48 89 18 c7 40 08 00 00 00 00 44 89 78 0c <48> 8b 43 08 a8 01 0f 85 3f 04 00 00 48 8b 44 24 30 48 83 78 20 ff
[ 214280.803971]   WARN: RSP: e02b:ffff888235303668 EFLAGS: 00010282
[ 214280.803977]   WARN: RAX: ffff8880a540dce0 RBX: 0000000000000000 RCX: 00000000000000c0
[ 214280.803982]   WARN: RDX: ffff8880a540dcc0 RSI: ffff8880a540dcc0 RDI: ffffea0008ff6380
[ 214280.803987]   WARN: RBP: 0000000000000000 R08: ffff8880a540dc00 R09: 0000000000000001
[ 214280.803992]   WARN: R10: 0000000000000259 R11: ffff88812597f540 R12: ffff888042ae1d00
[ 214280.804003]   WARN: R13: 0000000000000000 R14: ffff888122bec8c0 R15: 0000000000000000
[ 214280.804018]   WARN: FS:  0000000000000000(0000) GS:ffff888235300000(0000) knlGS:0000000000000000
[ 214280.804027]   WARN: CS:  e033 DS: 002b ES: 002b CR0: 0000000080050033
[ 214280.804032]   WARN: CR2: 0000000000000008 CR3: 00000001f049e000 CR4: 0000000000040660
[ 214280.804040]  EMERG: Kernel panic - not syncing: Fatal exception in interrupt

In my setup I run OPNsense (yes, FreeBSD-based) on top for firewall/VPN (WG, OpenVPN). I'll check the MTU and reply here.
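When comparing several crashes, it can help to list which kernel modules actually appear in the call trace. The following is a rough sketch, not an official tool: it assumes the log format shown above (`WARN:` prefix, `symbol+0xoff/0xsize [module]` frames, `? ` marking unreliable frames), and the names `frames` and `modules_in_trace` are made up for illustration.

```python
import re

# A few lines copied from the trace above; in practice, read the whole log file.
SAMPLE = """\
[ 214278.799994]   WARN: RIP: e030:skb_copy_ubufs+0x19c/0x5f0
[ 214278.800129]   WARN:  tun_net_xmit+0x3de/0x460 [tun]
[ 214278.800140]   WARN:  dev_hard_start_xmit+0xa4/0x210
[ 214278.800189]   WARN:  do_execute_actions+0x157f/0x1750 [openvswitch]
[ 214278.800261]   WARN:  ? key_extract+0xa53/0xd60 [openvswitch]
[ 214278.800396]   WARN:  xenvif_tx_action+0x4b8/0x900
"""

# Matches "symbol+0xoff/0xsize", optionally preceded by "? " and followed by "[module]".
FRAME = re.compile(
    r"WARN:\s+(\? )?(\w[\w.]*)\+0x[0-9a-f]+/0x[0-9a-f]+(?: \[(\w+)\])?\s*$"
)


def frames(log):
    """Yield (symbol, module, reliable) for each call-trace frame in the log."""
    for line in log.splitlines():
        m = FRAME.search(line)
        if m:
            # Frames without a [module] tag belong to the core kernel image.
            yield m.group(2), m.group(3) or "kernel", m.group(1) is None


def modules_in_trace(log):
    """Return the loadable modules that own at least one reliable frame."""
    return sorted({mod for _, mod, ok in frames(log) if ok and mod != "kernel"})


print(modules_in_trace(SAMPLE))  # ['openvswitch', 'tun']
```

Running this over each crash log gives a quick way to check whether `openvswitch` or `tun` frames are present in a given trace.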

sasha

lspci on host:

01:00.0 Ethernet controller: Intel Corporation Ethernet Controller 10G X550T (rev 01)
01:00.1 Ethernet controller: Intel Corporation Ethernet Controller 10G X550T (rev 01)

MTU on OpnSense:

LAN interface (lan, xn1)	Status	up 
MTU	1500

RoadWarriorWG0 interface (opt1, wg0) Status	up 
MTU	16304

WAN interface (wan, xn0) Status	up 
MTU	1500

site2siteWG1 interface (opt2, wg1)Status	up 
MTU	1420

xcp interface (opt3, xn2) Status	up 
MTU	1500

Unassigned interface (lo0) 
MTU	16384

Unassigned interface (enc0) Status	down 
MTU	1536

Unassigned interface (pflog0) Status	down 
MTU	33160

Unassigned interface (ovpns1) Status	up 
MTU	1500
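As a quick sanity check on the listing above, this small sketch flags WireGuard interfaces whose MTU is larger than what fits in a single packet on the underlying link. It assumes the tunnels run over the WAN interface and uses the worst-case WireGuard overhead of 80 bytes (outer IPv6); the dictionary keys and the helper name are illustrative. It only points out a mismatch and does not establish that this is what triggers the crash.

```python
# MTU values copied from the OPNsense listing above.
mtus = {
    "wan (xn0)": 1500,
    "lan (xn1)": 1500,
    "xcp (xn2)": 1500,
    "wg0": 16304,
    "wg1": 1420,
    "ovpns1": 1500,
}

# 40 (outer IPv6) + 8 (UDP) + 32 (WireGuard header and tag); 60 over IPv4.
WG_OVERHEAD_WORST_CASE = 80


def oversized_tunnels(mtus, underlay="wan (xn0)", prefix="wg"):
    """Return tunnel interfaces whose MTU exceeds underlay MTU minus overhead."""
    limit = mtus[underlay] - WG_OVERHEAD_WORST_CASE
    return {
        name: mtu
        for name, mtu in mtus.items()
        if name.startswith(prefix) and mtu > limit
    }


print(oversized_tunnels(mtus))  # {'wg0': 16304}
```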

olivierlambert

We just released new patches that might solve this. Please update and keep us posted the next time you have the problem.

sasha

@olivierlambert
Thank you for the quick fix!
I'll leave it without the patch for a couple of days to see how it behaves with a reduced MTU on the WireGuard interface before applying the patch, just in case the MTU change does the trick.

JEDIBC

@olivierlambert Thanks! I'll try it as soon as I'm back from vacation in early September.

sasha

@olivierlambert The server just rebooted with the same error. The updates are installed and waiting for the next reboot.

The log file is almost identical. The main difference from the previous crash 2 days ago:

old

WARN: CPU: 4 PID: 0 Comm: swapper/4 Tainted: G           O      4.19.0+1 #1

new

WARN: CPU: 2 PID: 0 Comm: swapper/2 Tainted: G           O      4.19.0+1 #1
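For comparing two crash logs like this without the timestamps getting in the way, a minimal sketch using Python's standard `difflib` could look as follows. The `OLD` lines are taken from the earlier trace; the timestamps in `NEW` are made-up placeholders and its RIP line is assumed to be unchanged, since the full new log is not shown here.

```python
import difflib
import re

# Leading "[ seconds.micros]" kernel timestamp plus the padding after it.
TIMESTAMP = re.compile(r"^\[\s*\d+\.\d+\]\s*")

OLD = """\
[ 214278.799967]   WARN: CPU: 4 PID: 0 Comm: swapper/4 Tainted: G           O      4.19.0+1 #1
[ 214278.799994]   WARN: RIP: e030:skb_copy_ubufs+0x19c/0x5f0
"""

# Placeholder timestamps (not from a real log); RIP line assumed identical.
NEW = """\
[ 100.000001]   WARN: CPU: 2 PID: 0 Comm: swapper/2 Tainted: G           O      4.19.0+1 #1
[ 100.000002]   WARN: RIP: e030:skb_copy_ubufs+0x19c/0x5f0
"""


def strip_timestamps(log):
    """Return the log lines with the leading kernel timestamp removed."""
    return [TIMESTAMP.sub("", line) for line in log.splitlines()]


def changed_lines(old, new):
    """Return only the lines that differ, prefixed with '- ' (old) or '+ ' (new)."""
    diff = difflib.ndiff(strip_timestamps(old), strip_timestamps(new))
    return [line for line in diff if line.startswith(("- ", "+ "))]


for line in changed_lines(OLD, NEW):
    print(line)
```

With the sample above, only the two `CPU:` lines are reported; the unchanged RIP line is filtered out.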

olivierlambert

So you had the same issue after all the last updates + a "manual" reboot?

sasha

@olivierlambert said in Kernel panic on fresh install:

So you had the same issue after all the last updates + a "manual" reboot?

No, there was no reboot after the updates. Just to mention: in my case, reducing the MTU on the heavily used WireGuard interface didn't help.

Also, these reboots are completely unpredictable: sometimes during a busy day, but more often during night hours, when only backups can be running.

olivierlambert

Okay so now you have the updates really installed, we'll see if it happens

sasha

@olivierlambert said in Kernel panic on fresh install:

Okay so now you have the updates really installed, we'll see if it happens

I just got a crash and reboot again, while trying to restart a VM from XCP-center, from another VM in the pool. This reboot should apply all the patches from the latest updates.

sasha

@olivierlambert

I just had two consecutive crashes 15 minutes apart.
This is a comparison between the old crash (before the updates) and the new one (after the latest updates).

sasha

New crash, same message...

olivierlambert

It's weird: OVS is not involved, so it might be something else.

Any chance you know how to trigger it artificially? That would be really helpful to pinpoint the issue.