8.3beta2 dom0 kernel panic, possibly triggered by over-mtu packet?

wttw

I'm not clear on where it's most useful to report bugs in beta releases, so...

Yesterday I installed 8.3beta2 on bare hardware (NUC24OXGv9), installed orchestra and started migrating some VMs from ESXi.

While the second VM was being migrated, everything stopped.

At that point the two VMs that had been running (one vanilla Ubuntu, and the orchestra VM) were no longer running, and it looked like dom0 had rebooted.

Looking at the crash logs, the relevant snippets of dom0.log seem to be:

[   7955.734205]   INFO: block tdc: sector-size: 512/512 capacity: 125829120
[  10363.803886]   WARN: vif2.0: dropped over-mtu packet: 68785 > 1500
[  10363.803905]   WARN: WARNING: CPU: 0 PID: 8940 at lib/iov_iter.c:825 page_copy_sane.part.7+0x0/0x11
[  10363.803906]   WARN: Modules linked in: tun bnx2fc(O) cnic(O) uio fcoe libfcoe libfc scsi_transport_fc openvswitch nsh nf_nat_ipv6 nf_nat_ipv4 nf_conncount nf_nat 8021q garp mrp stp llc ipt_REJECT nf_reject_ipv4 xt_tcpudp xt_multiport xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c iptable_filter dm_multipath sunrpc nls_iso8859_1 nls_cp437 intel_powerclamp crct10dif_pclmul crc32_pclmul ghash_clmulni_intel vfat pcbc fat aesni_intel aes_x86_64 crypto_simd dm_mod cryptd glue_helper video backlight ip_tables x_tables hid_generic usbhid hid xhci_pci igc(O) nvme xhci_hcd i40e(O) nvme_core scsi_dh_rdac scsi_dh_hp_sw scsi_dh_emc scsi_dh_alua scsi_mod efivarfs ipv6 crc_ccitt
[  10363.803925]   WARN: CPU: 0 PID: 8940 Comm: handler122 Tainted: G           O      4.19.0+1 #1
[  10363.803925]   WARN: Hardware name: Simply NUC NUC24OXGv9/AHWSA, BIOS AHWSA.1.23 04/12/2024
[  10363.803926]   WARN: RIP: e030:page_copy_sane.part.7+0x0/0x11

[...]

[  10363.806328]   WARN: CR2: 00007f0dd2f68000 CR3: 000000023e91e000 CR4: 0000000000040660
[  10363.806335]  EMERG: Kernel panic - not syncing: Fatal exception in interrupt
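
For anyone who wants to check their own crash logs for the same pattern, here is a minimal sketch in Python. It assumes the log lines look exactly like the excerpt above; the function name `scan_dom0_log` and the regexes are my own and not part of any XCP-ng tooling.

```python
import re

# Assumed line format, taken from the excerpt above:
# [  10363.803886]   WARN: vif2.0: dropped over-mtu packet: 68785 > 1500
OVER_MTU = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+WARN:\s+(?P<vif>vif\d+\.\d+): "
    r"dropped over-mtu packet: (?P<size>\d+) > (?P<mtu>\d+)"
)
# [  10363.806335]  EMERG: Kernel panic - not syncing: ...
PANIC = re.compile(r"\[\s*(?P<ts>\d+\.\d+)\]\s+EMERG:\s+Kernel panic")


def scan_dom0_log(lines):
    """Return (over_mtu_events, panic_timestamps) found in an iterable of log lines.

    Each over-mtu event is a tuple (timestamp, vif, packet_size, mtu).
    """
    events, panics = [], []
    for line in lines:
        m = OVER_MTU.search(line)
        if m:
            events.append(
                (float(m["ts"]), m["vif"], int(m["size"]), int(m["mtu"]))
            )
            continue
        m = PANIC.search(line)
        if m:
            panics.append(float(m["ts"]))
    return events, panics


# Two lines copied from the excerpt above, used as sample input.
SAMPLE = [
    "[  10363.803886]   WARN: vif2.0: dropped over-mtu packet: 68785 > 1500",
    "[  10363.806335]  EMERG: Kernel panic - not syncing: Fatal exception in interrupt",
]
sample_events, sample_panics = scan_dom0_log(SAMPLE)
```

To run it against a real log, pass an open file object instead of `SAMPLE`, for example `scan_dom0_log(open(path))` with `path` pointing at wherever your copy of dom0.log lives. A panic timestamp that closely follows an over-mtu drop only suggests the two are related; it does not prove it.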

I've not yet been able to replicate the issue, and everything else seems to be working fine, including redoing the migration that was happening during the crash.

It looks similar to the bug referred to in the thread "very scary host reboot issue".

The running Linux VM was idle, but was connected to Tailscale. Our edge device, the only thing it would have been talking to other than the ESXi box and my MacBook, is running OPNsense.

bleader

Do you use WireGuard on your OPNsense? We did have an issue with FreeBSD-based systems and WireGuard, but if I remember correctly it is supposed to be fixed. Still worth asking.

wttw

@bleader No, the OPNsense box itself doesn't have WireGuard (or anything else VPN-ish) running on it. It's mostly just a NAT with the normal variety of DHCP, DNS, ... services running on it.