XCP-ng
    Very scary host reboot issue

    • olivierlambert Vates 🪐 Co-Founder CEO

      Please do so. My gut feeling is that it's related to the MTU or WireGuard, but it's hard to suspect anything specific at the moment 😞

      • darabontors

        Guys, I might be onto something.

        I started having this issue in September this year, right after switching to a new laptop with Windows 11.

        I also have VMware Player and VirtualBox installed on my laptop.

        I often have a weird issue where WG can't bring up the tunnel and shows an error message. I googled the error, and it was related to the virtual network interfaces that VirtualBox and VMware Player install.

        I think the issue could be related to Windows 11 and my other Type 2 virtualization platforms.

        I did try on my other laptop, which runs Windows 10 and has VirtualBox installed, and the host reboot isn't triggered there.

        Could someone help replicate this specific combo that I have?

        • darabontors @darabontors

          I just triggered the reboot with the setup I detailed above. I started transferring 26 GB of video files through my tunnel, and my host restarted. I continued the transfer, and now, strangely, my tunnel is capped at 100 Mb/s.

          During the transfer, when the host reboot happened, I was getting 300 Mb/s.

          So strange behavior.

          • darabontors @darabontors

            I continued with the transfer capped at 100 Mb/s (most probably capped by WireGuard), and after ~8 GB transferred, my tunnel suddenly collapsed. After a short while, less than 2 minutes, it came back up, and no host reboot happened. WireGuard crashed somehow but didn't cause the Dom0 crash.

            Some other detail that might be unrelated: my PPPoE connection to my ISP has MTU 1492. WireGuard connection also has MTU 1492. Is this relevant in any way?

            • olivierlambert Vates 🪐 Co-Founder CEO

              Thanks for the info. Hard to tell if it's related or not, but we take any info you can provide on your setup 🙂 Thanks!

              • darabontors @olivierlambert

                @olivierlambert Just produced another reboot. I'm closing in on the way to replicate this issue.

                • olivierlambert Vates 🪐 Co-Founder CEO

                  That will be helpful for everyone having the issue, thanks for contributing with your time and efforts!

                  • darabontors @olivierlambert

                    @olivierlambert It's the least I can do. I really like XCP-ng and Xen Orchestra. I have around 15 clients with XCP-ng stacks in production. I run an MSP company. You understand this issue scares me a lot. Right now I'm randomly rebooting my own production server where a bunch of VM and TrueNAS backups land. I am fully motivated to mitigate this issue.

                    • darabontors

                      I do have an update. I tried it from a Windows 10 VM. Same issue. I uninstalled VMware Player on the Windows 10 VM just to be sure. The reboot happened.

                      I tried copying the same file from my fileserver to my laptop and couldn't cause the reboot. It only happens when I transfer files from my laptop to my server. So only traffic sent from the laptop to my OPNsense VM produces the reboot.

                      I checked: TX checksumming is disabled on my OPNsense VM's VIFs.

                      I can confirm 100% that I didn't have this issue before September this year. Maybe it is related to the WireGuard version on the server or client side.

                      OPNsense version 23.7.2
                      wireguard-kmod 0.0.20220615_1
                      wireguard-tools 1.0.20210914_1
                      OPNsense has xn0 for WAN and xn1 for LAN

                      On my other host that also produced the reboot, the hardware setup is different. The metal itself is different, but more notably, WAN is connected through a dual-port Intel NIC via PCIe passthrough. The host reboot happened while copying an ISO through the WG tunnel to the host's local ISO repository. So potentially the LAN xn0 produced the vSwitch crash. It couldn't have happened on the WAN interface, since a passed-through NIC doesn't go through the vSwitch.

                      In summary: I managed to reproduce the issue 4 times within 2 hours. It should be replicable. Maybe I'll spin up a completely new setup to try to replicate this outside my current production host.

                      • Andrew Top contributor @darabontors

                        @darabontors What ethernet card is in use on your crashing system?

                        If it's using the first Ethernet interface, then ethtool -i eth0 should show enough info.

                        • darabontors @Andrew

                          @Andrew
                          eth4 is LAN:
                          driver: igb
                          version: 5.3.5.20
                          firmware-version: 1.67, 0x80000fc9, 19.5.12
                          expansion-rom-version:
                          bus-info: 0000:0a:00.0
                          supports-statistics: yes
                          supports-test: yes
                          supports-eeprom-access: yes
                          supports-register-dump: yes
                          supports-priv-flags: no

                          eth5 is WAN:
                          driver: igb
                          version: 5.3.5.20
                          firmware-version: 1.67, 0x80000fc9, 19.5.12
                          expansion-rom-version:
                          bus-info: 0000:0a:00.1
                          supports-statistics: yes
                          supports-test: yes
                          supports-eeprom-access: yes
                          supports-register-dump: yes
                          supports-priv-flags: no
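
To compare driver and firmware details across several interfaces or hosts, the `ethtool -i` output above can be parsed into key/value pairs. The following is a minimal sketch, not something used in the thread: the function names `parse_ethtool_info` and `read_driver_info` are hypothetical, and the sample strings are shortened copies of the output posted above.

```python
import subprocess


def parse_ethtool_info(text):
    """Turn the 'key: value' lines printed by `ethtool -i` into a dict."""
    info = {}
    for line in text.splitlines():
        # Split on the first colon only, so values such as the PCI
        # address "0000:0a:00.0" keep their own colons.
        key, sep, value = line.partition(":")
        if sep:
            info[key.strip()] = value.strip()
    return info


def read_driver_info(iface):
    """Run `ethtool -i <iface>` on the host and parse its output."""
    result = subprocess.run(
        ["ethtool", "-i", iface],
        capture_output=True, text=True, check=True,
    )
    return parse_ethtool_info(result.stdout)


# Shortened copies of the output posted above, so the sketch runs anywhere.
SAMPLE_ETH4 = """\
driver: igb
version: 5.3.5.20
firmware-version: 1.67, 0x80000fc9, 19.5.12
expansion-rom-version:
bus-info: 0000:0a:00.0
"""

SAMPLE_ETH5 = """\
driver: igb
version: 5.3.5.20
firmware-version: 1.67, 0x80000fc9, 19.5.12
expansion-rom-version:
bus-info: 0000:0a:00.1
"""

eth4 = parse_ethtool_info(SAMPLE_ETH4)
eth5 = parse_ethtool_info(SAMPLE_ETH5)

# Fields that differ between the two ports.
differing = sorted(k for k in eth4 if eth4[k] != eth5.get(k))
print(differing)  # ['bus-info']
```

On a real host, `read_driver_info("eth4")` would replace the sample strings; collecting the same dict from each affected machine makes it easier to spot a shared driver or firmware version.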

                          • darabontors

                            It is the DELL X540 2 x 10 GbE and 2 x 1 GbE daughter board in a DELL R720.

                            • olivierlambert Vates 🪐 Co-Founder CEO

                              Could be a statistical bias, but so far absolutely ALL the reports we've had came from Dell PowerEdge servers (between the x20 and x30 series). Most of the time it was with Intel cards, but I'm not 100% sure that's the cause, since the crash logs indicate that OVS crashed before the packet even reached the NIC. But it could also be an "answer" packet to a specifically crafted incoming packet that causes this 🤔

                              • darabontors @olivierlambert

                                @olivierlambert I can confirm 100% that a Dell workstation I use at one of my clients did the same thing.

                                • Andrew Top contributor @darabontors

                                  @darabontors Same network interface type? I just set up an HP with the same Ethernet interface (igb driver) for testing.

                                  • tuxen Top contributor @darabontors

                                    @darabontors said in Very scary host reboot issue:

                                    Some other detail that might be unrelated: my PPPoE connection to my ISP has MTU 1492. WireGuard connection also has MTU 1492. Is this relevant in any way?

                                    I'm not into firewall/tunneling stuff, but shouldn't the WireGuard MTU be lower than the PPPoE one to leave room for the WG protocol overhead? I read that the default is 1420 and the minimum is 1280. I'd first reset the WG MTU to the default and, if the crash still persists, also test lower values within this range.

                                    Regardless of the tests, there's indeed a bug somewhere, because a malformed packet/frame should be handled without triggering a crash.
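
The MTU question can be checked with simple arithmetic. The sketch below assumes the usual per-packet overhead of a WireGuard data message (16-byte header plus 16-byte authentication tag), carried in UDP over an outer IPv4 or IPv6 header without options. The function name `max_wg_mtu` is hypothetical, and the link MTUs are the ones mentioned in this thread plus plain Ethernet for comparison.

```python
# Bytes added around each tunnelled packet (assumed values, see lead-in).
WG_HEADER = 16      # message type/reserved (4) + receiver index (4) + counter (8)
WG_AUTH_TAG = 16    # Poly1305 authentication tag
UDP_HEADER = 8
OUTER_IP_HEADER = {"ipv4": 20, "ipv6": 40}  # no IP options / extension headers


def max_wg_mtu(link_mtu, outer="ipv4"):
    """Largest tunnel MTU that fits in one unfragmented packet on the link."""
    overhead = OUTER_IP_HEADER[outer] + UDP_HEADER + WG_HEADER + WG_AUTH_TAG
    return link_mtu - overhead


for link_name, link_mtu in (("Ethernet", 1500), ("PPPoE", 1492)):
    for outer in ("ipv4", "ipv6"):
        limit = max_wg_mtu(link_mtu, outer)
        print(f"{link_name} (MTU {link_mtu}), outer {outer}: WG MTU <= {limit}")
```

Under these assumptions, the default of 1420 corresponds to a 1500-byte link with an IPv6 outer header (1500 - 80). On a 1492-byte PPPoE link the limit would be 1432 over IPv4 or 1412 over IPv6, so a tunnel MTU of 1492 would not fit without fragmentation of the outer packets. Whether that has anything to do with the crash is not established here.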

                                    • olivierlambert Vates 🪐 Co-Founder CEO

                                      @tuxen said in Very scary host reboot issue:

                                      Regardless of the tests, there's indeed a bug somewhere, because a malformed packet/frame should be handled without triggering a crash.

                                      Obviously, but it might be a clue on how to trigger the bug 🙂

                                      • darabontors @tuxen

                                        @tuxen You might be on to something, but I need to clarify one thing. I am positive this issue is related to the Windows WireGuard client. On the same host, with the same OPNsense VM, I have 10+ site-to-site WireGuard connections configured, moving 100+ GB daily, and the host never reboots. I can only trigger it from a Windows WG connection.

                                        How do I verify MTU size for the WG connection in Windows 11? I cannot find it for the life of me...

                                        • darabontors

                                          I found the MTU parameter. This time it was 1420 on both the OPNsense WG interface and in Windows (client side). I was happy for about 5 minutes as I wasn't able to reproduce the crash, but then it happened again. My "favorite" way to trigger it is by pausing the file transfer, waiting for a couple of minutes, and then resuming it. The transfer's MB/s jumps up like crazy in Windows, then freezes until it gets in sync with the real progress of the transfer. After two tries of pausing and resuming, the crash happened.

                                          @olivierlambert I've used this setup on my own infrastructure and my clients' for at least 4 years, and I never experienced this issue until September this year. You guys first saw it ~6 months ago. Isn't there a way to trace back any recent updates to Open vSwitch? It might also be that some update on the FreeBSD side made this Open vSwitch bug surface only recently... I know there was little to no development on the WireGuard side of things this year.

                                          • olivierlambert Vates 🪐 Co-Founder CEO

                                            The first report I heard of was in April of this year, with the Intel ixgbe driver on an R630, from someone using OPNsense in a VM + WireGuard. I don't remember any OVS change that could explain this.

                                            We saw the crash around May/June IIRC, on a Dell 430 (or 420), but with the Intel e1000e driver.

                                            But after inspection, we saw that the issue was happening inside OVS, before the packet entered the NIC, so it might not be related to the hardware at all. Maybe the hardware "helps" produce the packet or instruction that crashes OVS. But to me, it's more likely related to an update or some change in the FreeBSD PV driver inside OPNsense or pfSense. It would be interesting to see whether something changed in that area in both projects in 2023.
