Very scary host reboot issue

Andrew

@olivierlambert I loaded everything and ran 1 TB of data over WireGuard and nothing failed... So, another non-fail here too.

darabontors

@olivierlambert Is there something specific I could do? A specific way to test maybe?
@Andrew Are you using WireGuard kmod in OPNsense?

Andrew

@darabontors I'm using the current OPNsense (23.7.5) install and I added the WG (2.1) plugin from the GUI. I built a WG tunnel between two OPNsense VMs and put a Debian VM attached to each firewall. Then I transferred data between the Debian VMs (through the firewall/WG tunnel).

olivierlambert

@darabontors we need all the information you can provide on your setup so we can trigger the bug.

My feeling on this is that a malformed packet is crashing OVS, maybe due to the lower MTU of WireGuard, but ANY detail on the configuration/setup you have will help us build something similar, and ideally reproduce it.

Without a reproducible way to trigger the bug, it will be nearly impossible to fix it.

planedrop

Just wanted to add a few things here: I've never had this happen running pfSense VMs on all 3 of my hosts, some of which move quite a bit of data over WireGuard connections, so it does seem hard to reproduce.

It might be worth a try, @darabontors, to run this on pfSense instead of OPNsense just to see whether you run into the same issue; it may help narrow things down.

Though maybe I'm speaking out of turn here, haven't really seen this bug before so maybe pf/opn has nothing to do with it and it's just BSD.

olivierlambert

We had the issue with pfSense, so IMHO it's related to a combination of FreeBSD and OVS. Likely the PV drivers in BSD, which are less tested.

planedrop

@olivierlambert Gotcha, makes sense. I'll do some more testing to see if I can replicate the issue.

olivierlambert

Please do so. Gut feeling is something related to the MTU/WireGuard, but it's hard to suspect anything specific at the moment.

darabontors

Guys, I might be onto something.

I started having this issue in September this year, right after switching to a new laptop with Windows 11.

I also have VMware Player and VirtualBox installed on my laptop.

I often have a weird issue where WG can't bring up the tunnel and shows an error message. I googled the error and it was related to the virtual network interfaces that VirtualBox and VMware Player install.

I think the issue could be related to Windows 11 and my other Type 2 Virtualization platforms.

I did try on my other laptop running Windows 10 and having VirtualBox installed and the host reboot isn't triggered.

Could someone help replicate this specific combo that I have?

darabontors

I just triggered the reboot with the setup I detailed above. I started transferring 26 GB worth of video files through my tunnel. My host restarted. I continued the transfer and now, strangely, my tunnel is capped at 100 Mb/s.

During the transfer, when the host reboot happened, I was getting 300 Mb/s.

So strange behavior.

darabontors

I continued with the transfer capped at 100 Mb/s (capped by WireGuard most probably) and after ~8 GB transferred, my tunnel suddenly collapsed. After a short while, less than 2 minutes, it came back up with no host reboot. WireGuard crashed somehow but didn't cause the Dom0 crash.

Some other detail that might be unrelated: my PPPoE connection to my ISP has MTU 1492. WireGuard connection also has MTU 1492. Is this relevant in any way?
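For reference, the usual arithmetic for sizing a WireGuard MTU on top of a PPPoE link can be sketched as below. This is only an illustration: the function name `max_tunnel_mtu` is made up for this example, and the overhead figures are the commonly cited ones (20 or 40 bytes for the outer IPv4/IPv6 header, 8 bytes for UDP, 32 bytes for the WireGuard data header and authentication tag). It says nothing about whether the MTU is actually involved in the crash.

```python
# Rough sketch: largest WireGuard tunnel MTU that fits in one outer packet.
# Overheads are the commonly cited values; names here are hypothetical.
WG_OVERHEAD = 32                      # 16-byte data header + 16-byte auth tag
UDP_HEADER = 8
IP_HEADER = {"ipv4": 20, "ipv6": 40}  # outer header of the tunnel endpoint


def max_tunnel_mtu(link_mtu: int, outer: str = "ipv4") -> int:
    """Return the largest tunnel MTU that avoids fragmenting the outer packet."""
    return link_mtu - IP_HEADER[outer] - UDP_HEADER - WG_OVERHEAD


pppoe_mtu = 1492  # PPPoE link MTU mentioned above
print("IPv4 endpoint:", max_tunnel_mtu(pppoe_mtu, "ipv4"))
print("IPv6 endpoint:", max_tunnel_mtu(pppoe_mtu, "ipv6"))
```

Under these assumptions, a tunnel MTU of 1492 on a 1492-byte PPPoE link means full-size tunnel packets no longer fit in a single outer packet and have to be fragmented somewhere along the path.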

olivierlambert

Thanks for the info. Hard to tell if it's related or not, but we'll take any info you can provide on your setup. Thanks!

darabontors

@olivierlambert Just produced another reboot. I'm closing in on the way to replicate this issue.

olivierlambert

That will be helpful for everyone having the issue, thanks for contributing with your time and efforts!

darabontors

@olivierlambert It's the least I can do. I really like XCP-ng and Xen Orchestra. I have around 15 clients with XCP-ng stacks in production. I run an MSP company. You understand this issue scares me a lot. Right now I'm randomly rebooting my own production server where a bunch of VM and TrueNAS backups land. I am fully motivated to mitigate this issue.

darabontors

I do have an update. I tried it from a Windows 10 VM. Same issue. I uninstalled VMware Player on the Windows 10 VM just to be sure. The reboot happened.

I tried copying the same file from my fileserver to my laptop and I couldn't cause the reboot. It only happens when I transfer files from my laptop to my server. So only traffic sent from the laptop to my OPNsense VM produces the reboot.

I checked, TX checksumming is disabled on my OPNsense VM VIFs.

I can confirm 100% that I didn't have this issue before September this year. Maybe it is related to the WireGuard version on the server or client side.

OPNsense version 23.7.2
wireguard-kmod 0.0.20220615_1
wireguard-tools 1.0.20210914_1
OPNsense has xn0 for WAN and xn1 for LAN

On my other host that also produced the reboot, the hardware setup is different. The metal itself is different, but more notably WAN is connected through a dual-port Intel NIC via PCIe passthrough. The host reboot happened while copying an ISO through the WG tunnel to the host's local ISO repository. So potentially the LAN interface (xn0) produced the vSwitch crash. It couldn't have happened on the WAN interface, since that NIC is passed through.

In summary: I managed to reproduce the issue 4 times within 2 hours. It should be replicable. Maybe I'll spin up a completely new setup to try to replicate this outside my current production host.

Andrew

@darabontors What ethernet card is in use on your crashing system?

If it's using the first ethernet then ethtool -i eth0 should show enough info.

darabontors

@Andrew
eth4 is LAN:
driver: igb
version: 5.3.5.20
firmware-version: 1.67, 0x80000fc9, 19.5.12
expansion-rom-version:
bus-info: 0000:0a:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: no

eth5 is WAN:
driver: igb
version: 5.3.5.20
firmware-version: 1.67, 0x80000fc9, 19.5.12
expansion-rom-version:
bus-info: 0000:0a:00.1
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: no
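
When collecting this from several hosts, the `ethtool -i` output can be turned into a dictionary so driver and firmware versions are easy to compare. The sketch below is only an illustration: `parse_ethtool_info` is a hypothetical helper, and the sample text is an excerpt of the eth4 output pasted above.

```python
def parse_ethtool_info(text: str) -> dict[str, str]:
    """Parse 'key: value' lines as printed by `ethtool -i <iface>`."""
    info = {}
    for line in text.splitlines():
        # Split on the first colon only, so values such as the PCI
        # address "0000:0a:00.0" stay intact.
        key, sep, value = line.partition(":")
        if sep:
            info[key.strip()] = value.strip()
    return info


# Excerpt of the eth4 output from this post.
sample = """driver: igb
version: 5.3.5.20
firmware-version: 1.67, 0x80000fc9, 19.5.12
expansion-rom-version:
bus-info: 0000:0a:00.0
"""

info = parse_ethtool_info(sample)
print(info["driver"], info["version"], info["bus-info"])
```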

darabontors

It is the DELL X540 2 x 10 GbE and 2 x 1 GbE daughter board in a DELL R720.

olivierlambert

Could be a statistical bias, but for now, absolutely ALL the reports we had came from Dell PowerEdge servers (between x20 and x30 series). Most of the time, it was with Intel cards, but I'm not 100% sure it's due to that, since the crash logs indicate that OVS crashed before the packet even reached the NIC. But it could also be an "answer" packet to a specifically crafted incoming packet that causes this.