Very scary host reboot issue
-
@olivierlambert Is there a way I could send you the crash logs for analysis? Figuring out these logs is way above my skill level unfortunately. Could you please help?
-
[ 334371.865769] ALERT: BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
[ 334371.865787] INFO: PGD 2250ed067 P4D 2250ed067 PUD 228c9f067 PMD 0
[ 334371.865803] WARN: Oops: 0000 [#1] SMP NOPTI
[ 334371.865810] WARN: CPU: 9 PID: 57 Comm: ksoftirqd/9 Tainted: G O 4.19.0+1 #1
[ 334371.865818] WARN: Hardware name: Dell Inc. PowerEdge R720/0C4Y3R, BIOS 2.9.0 12/06/2019
[ 334371.865832] WARN: RIP: e030:skb_copy_ubufs+0x19c/0x5f0
[ 334371.865839] WARN: Code: 90 cc 00 00 00 48 03 90 d0 00 00 00 48 63 44 24 40 48 83 c0 03 48 c1 e0 04 48 01 d0 48 89 18 c7 40 08 00 00 00 00 44 89 78 0c <48> 8b 43 08 a8 01 0f 85 3f 04 00 00 48 8b 44 24 30 48 83 78 20 ff
[ 334371.865858] WARN: RSP: e02b:ffffc9004026b6f8 EFLAGS: 00010282
[ 334371.865864] WARN: RAX: ffff888099621ae0 RBX: 0000000000000000 RCX: 00000000000000c0
[ 334371.865873] WARN: RDX: ffff888099621ac0 RSI: ffff888099621ac0 RDI: ffffea00031da880
[ 334371.865881] WARN: RBP: 0000000000000000 R08: ffff888099621a00 R09: ffff8881f0d43e98
[ 334371.865890] WARN: R10: ffffc9004026b8b0 R11: 0000000000000000 R12: ffff888096e61c00
[ 334371.865898] WARN: R13: 0000000000000000 R14: ffff88822b867a80 R15: 0000000000000000
[ 334371.865918] WARN: FS: 0000000000000000(0000) GS:ffff88822d440000(0000) knlGS:0000000000000000
[ 334371.865927] WARN: CS: e033 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 334371.865935] WARN: CR2: 0000000000000008 CR3: 00000002281aa000 CR4: 0000000000040660
[ 334371.865949] WARN: Call Trace:
[ 334371.865958] WARN: skb_clone+0x71/0xa0
[ 334371.865968] WARN: do_execute_actions+0x4ec/0x1750 [openvswitch]
[ 334371.865978] WARN: ? ovs_dp_process_packet+0x7d/0x110 [openvswitch]
[ 334371.865988] WARN: ? ovs_vport_receive+0x6e/0xd0 [openvswitch]
[ 334371.865997] WARN: ? arch_local_irq_restore+0x5/0x10
[ 334371.866005] WARN: ? get_page_from_freelist+0xa4f/0xf00
[ 334371.866012] WARN: ? arch_local_irq_restore+0x5/0x10
[ 334371.866020] WARN: ? get_page_from_freelist+0xa4f/0xf00
[ 334371.866031] WARN: ovs_execute_actions+0x47/0x120 [openvswitch]
[ 334371.866040] WARN: ovs_dp_process_packet+0x7d/0x110 [openvswitch]
[ 334371.866050] WARN: ? key_extract+0xa53/0xd60 [openvswitch]
[ 334371.866058] WARN: ovs_vport_receive+0x6e/0xd0 [openvswitch]
[ 334371.866066] WARN: ? __alloc_skb+0x4e/0x270
[ 334371.866075] WARN: ? notify_remote_via_irq+0x4a/0x70
[ 334371.866085] WARN: ? __raw_callee_save_xen_vcpu_stolen+0x11/0x20
[ 334371.866091] WARN: ? __alloc_skb+0x76/0x270
[ 334371.866100] WARN: ? arch_local_irq_restore+0x5/0x10
[ 334371.866108] WARN: ? __slab_alloc.constprop.81+0x42/0x4e
[ 334371.866114] WARN: ? __alloc_skb+0x4e/0x270
[ 334371.866120] WARN: ? __kmalloc_track_caller+0x58/0x200
[ 334371.866127] WARN: ? __slab_alloc.constprop.81+0x42/0x4e
[ 334371.866136] WARN: ? __kmalloc_reserve.isra.48+0x29/0x70
[ 334371.866146] WARN: netdev_frame_hook+0x105/0x180 [openvswitch]
[ 334371.866154] WARN: __netif_receive_skb_core+0x211/0xb30
[ 334371.866163] WARN: __netif_receive_skb_one_core+0x36/0x70
[ 334371.866170] WARN: netif_receive_skb_internal+0x34/0xe0
[ 334371.866179] WARN: xenvif_tx_action+0x55c/0x990
[ 334371.866187] WARN: xenvif_poll+0x27/0x70
[ 334371.866193] WARN: net_rx_action+0x2a5/0x3e0
[ 334371.866200] WARN: __do_softirq+0xd1/0x28c
[ 334371.866208] WARN: run_ksoftirqd+0x26/0x40
[ 334371.866215] WARN: smpboot_thread_fn+0x10e/0x160
[ 334371.866223] WARN: kthread+0xf8/0x130
[ 334371.866229] WARN: ? sort_range+0x20/0x20
[ 334371.866235] WARN: ? kthread_bind+0x10/0x10
[ 334371.866242] WARN: ret_from_fork+0x35/0x40
[ 334371.866250] WARN: Modules linked in: tun bnx2fc(O) cnic(O) uio fcoe libfcoe libfc scsi_transport_fc openvswitch nsh nf_nat_ipv6 nf_nat_ipv4 nf_conncount nf_nat 8021q garp mrp stp llc dm_multipath ipt_REJECT nf_reject_ipv4 xt_tcpu$
[ 334371.866374] WARN: scsi_mod efivarfs ipv6 crc_ccitt
[ 334371.866384] WARN: CR2: 0000000000000008
[ 334371.866396] WARN: ---[ end trace 8b74661a79be8268 ]---
[ 334371.868712] WARN: RIP: e030:skb_copy_ubufs+0x19c/0x5f0
[ 334371.868721] WARN: Code: 90 cc 00 00 00 48 03 90 d0 00 00 00 48 63 44 24 40 48 83 c0 03 48 c1 e0 04 48 01 d0 48 89 18 c7 40 08 00 00 00 00 44 89 78 0c <48> 8b 43 08 a8 01 0f 85 3f 04 00 00 48 8b 44 24 30 48 83 78 20 ff
[ 334371.868740] WARN: RSP: e02b:ffffc9004026b6f8 EFLAGS: 00010282
[ 334371.868748] WARN: RAX: ffff888099621ae0 RBX: 0000000000000000 RCX: 00000000000000c0
[ 334371.868759] WARN: RDX: ffff888099621ac0 RSI: ffff888099621ac0 RDI: ffffea00031da880
[ 334371.868769] WARN: RBP: 0000000000000000 R08: ffff888099621a00 R09: ffff8881f0d43e98
[ 334371.868778] WARN: R10: ffffc9004026b8b0 R11: 0000000000000000 R12: ffff888096e61c00
[ 334371.868788] WARN: R13: 0000000000000000 R14: ffff88822b867a80 R15: 0000000000000000
[ 334371.868805] WARN: FS: 0000000000000000(0000) GS:ffff88822d440000(0000) knlGS:0000000000000000
[ 334371.868815] WARN: CS: e033 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 334371.868823] WARN: CR2: 0000000000000008 CR3: 00000002281aa000 CR4: 0000000000040660
[ 334371.868837] EMERG: Kernel panic - not syncing: Fatal exception in interrupt
-
@olivierlambert Is this log snippet helpful? Do I need to dig somewhere else specifically?
-
@darabontors @olivierlambert I use OPNsense and have never had this problem... It is currently FreeBSD 13.2-based and shows management agent 6.2.0-76888, using the default Xen drivers included in FreeBSD 13.2. I have run many, many gigabytes of data through the firewalls (several VMs). I am running an older processor (Xeon E5 v2), but no issues.
-
@darabontors Thanks, that's what I suspected…
@Andrew as I said, the hard part of this bug is making it reproducible. It seems to be related to WireGuard AND FreeBSD: as you can see, OVS is crashing the whole Dom0 kernel at some point. Xen detects it and finally decides (logically) to reboot the Dom0.
-
@olivierlambert I'll spin up some OPNsense/WireGuard VMs and see what happens for me...
-
Thanks, we are doing the same. However, I'm not optimistic about the possibility of triggering it artificially… We have thousands of users all around the world using BSD + WireGuard. We even do it here ourselves.
And yes, we had the issue 6 months ago. Then it stopped and never happened again. It's a really difficult problem to investigate in the first place.
-
@olivierlambert I loaded everything and ran 1 TB of data over WireGuard and nothing failed... So, another non-failure here too.
-
@olivierlambert Is there something specific I could do? A specific way to test maybe?
@Andrew Are you using WireGuard kmod in OPNsense?
-
@darabontors I'm using the current OPNsense (23.7.5) install and I added the WG (2.1) plugin from the GUI. I built a WG tunnel between two OPNsense VMs and attached a Debian VM to each firewall. Then I transferred data between the Debian VMs (through the firewall/WG tunnel).
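In case it helps anyone reproduce this, below is a minimal sketch of that kind of bulk transfer, assuming plain Python 3.8+ on both Debian VMs; iperf3 or a simple file copy would do just as well, and the script name, port, and sizes are placeholders rather than what was actually used:
#!/usr/bin/env python3
# wg_push.py -- push sustained TCP traffic through the WG tunnel.
# Run "python3 wg_push.py serve" on the Debian VM behind one firewall and
# "python3 wg_push.py send <receiver-ip>" on the VM behind the other.
import socket
import sys

PORT = 5001                  # placeholder port; allow it through both firewalls
CHUNK = 64 * 1024            # 64 KiB per send
TOTAL = 26 * 1024 ** 3       # ~26 GiB of zeroes, enough for a long transfer

def serve() -> None:
    # Accept one connection and discard everything the sender pushes at us.
    with socket.create_server(("0.0.0.0", PORT)) as srv:
        conn, peer = srv.accept()
        received = 0
        with conn:
            while data := conn.recv(CHUNK):
                received += len(data)
        print(f"received {received / 1024 ** 3:.1f} GiB from {peer}")

def send(host: str) -> None:
    # Push TOTAL bytes of zeroes through the tunnel as fast as TCP allows.
    payload = bytes(CHUNK)
    sent = 0
    with socket.create_connection((host, PORT)) as conn:
        while sent < TOTAL:
            conn.sendall(payload)
            sent += CHUNK
    print(f"sent {sent / 1024 ** 3:.1f} GiB")

if __name__ == "__main__":
    if sys.argv[1] == "serve":
        serve()
    else:
        send(sys.argv[2])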
-
@darabontors we need all the information you can provide on your setup so we can trigger the bug.
My feeling is that a malformed packet is crashing OVS, maybe due to WireGuard's lower MTU, but ANY detail on the configuration/setup you have will help us build something similar and, ideally, reproduce it.
Without a reproducible way to trigger the bug, it will be nearly impossible to fix it.
-
Just wanted to add a few things here: I've never had this happen running pfSense VMs on all 3 of my hosts, some of them moving quite a bit of data around between WireGuard connections, so it does seem hard to reproduce.
Might be worth a try, @darabontors, to run this on pfSense instead of OPNsense just to see if you run into the same issue or not; it may help narrow things down.
Though maybe I'm speaking out of turn here; I haven't really seen this bug before, so maybe pf/OPN has nothing to do with it and it's just BSD.
-
We had the issue with pfSense, so IMHO it's related to a combination of FreeBSD and OVS. Likely the PV drivers in BSD, which are less tested.
-
@olivierlambert Gotcha, makes sense. I'll do some more testing to see if I can replicate the issue.
-
Please do so. Gut feeling is something related to the MTU/WireGuard, but it's hard to point at anything specific at the moment.
-
Guys, I might be onto something.
I started having this issue in September this year, right after switching to a new laptop with Windows 11.
I also have VMware Player and VirtualBox installed on my laptop.
I often have a weird issue with WG not being able to bring up the tunnel; it fails with an error message. I googled the error and it was something related to the other virtual network interfaces that VirtualBox and VMware Player install.
I think the issue could be related to Windows 11 and my other type 2 virtualization platforms.
I did try on my other laptop, running Windows 10 with VirtualBox installed, and the host reboot isn't triggered there.
Could someone help replicate this specific combo that I have?
-
I just triggered the reboot with the setup I detailed above. I started transferring 26 GB worth of video files through my tunnel. My host restarted. I continued the transfer and now, strangely, my tunnel is somehow capped at 100 Mb/s.
During the transfer, when the host reboot happened, I was getting 300 Mb/s.
Very strange behavior.
-
I continued with the transfer capped at 100 Mb/s (most probably capped by WireGuard) and after ~8 GB transferred, my tunnel suddenly collapsed. After a short while, less than 2 minutes, it came back up, and no host reboot happened. WireGuard crashed somehow but didn't cause a Dom0 crash.
Another detail that might be unrelated: my PPPoE connection to my ISP has an MTU of 1492. The WireGuard connection also has an MTU of 1492. Is this relevant in any way?
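For what it's worth, here is a rough sketch of the MTU arithmetic behind that question, assuming the usual WireGuard-over-IPv4 encapsulation overhead of 60 bytes (20 IPv4 + 8 UDP + 32 WireGuard); the numbers are only illustrative, not a statement about what actually happens on this setup:
# Rough MTU sanity check for WireGuard over a PPPoE uplink (Python 3).
# Assumption: the 1492 underlay MTU already accounts for PPPoE's 8 bytes,
# and WireGuard over IPv4 adds 60 bytes (20 IPv4 + 8 UDP + 32 WireGuard).
PPPOE_MTU = 1492           # underlay MTU reported above
WG_MTU = 1492              # tunnel MTU reported above
WG_OVERHEAD_V4 = 20 + 8 + 32

encapsulated = WG_MTU + WG_OVERHEAD_V4
print(f"largest encapsulated packet: {encapsulated} bytes")   # 1552
print(f"underlay MTU:                {PPPOE_MTU} bytes")      # 1492

if encapsulated > PPPOE_MTU:
    # Full-size tunnel packets exceed the underlay MTU, so they get
    # fragmented (or dropped when DF is set).
    print(f"a safer tunnel MTU would be {PPPOE_MTU - WG_OVERHEAD_V4} bytes")  # 1432
Whether that fragmentation has anything to do with the OVS crash is pure speculation, but it does line up with the malformed-packet/MTU theory mentioned earlier in the thread.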
-
Thanks for the info. Hard to tell if it's related or not, but we'll take any info you can provide on your setup. Thanks!