XCP-ng

    Very scary host reboot issue

      Andrew Top contributor @darabontors

      @darabontors What ethernet card is in use on your crashing system?

      If it's using the first ethernet then ethtool -i eth0 should show enough info.

        darabontors @Andrew

        @Andrew
        eth4 is LAN:
        driver: igb
        version: 5.3.5.20
        firmware-version: 1.67, 0x80000fc9, 19.5.12
        expansion-rom-version:
        bus-info: 0000:0a:00.0
        supports-statistics: yes
        supports-test: yes
        supports-eeprom-access: yes
        supports-register-dump: yes
        supports-priv-flags: no

        eth5 is WAN:
        driver: igb
        version: 5.3.5.20
        firmware-version: 1.67, 0x80000fc9, 19.5.12
        expansion-rom-version:
        bus-info: 0000:0a:00.1
        supports-statistics: yes
        supports-test: yes
        supports-eeprom-access: yes
        supports-register-dump: yes
        supports-priv-flags: no
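
For comparing driver and firmware details across several hosts, the `ethtool -i` output above can be turned into a dictionary. This is only a minimal sketch: the function names are hypothetical, and the sample text is copied from the eth4 output in this post.

```python
import subprocess


def parse_ethtool_info(text: str) -> dict:
    """Parse 'key: value' lines as printed by `ethtool -i` into a dict."""
    info = {}
    for line in text.splitlines():
        # Split on the first colon only, so values such as the PCI
        # address "0000:0a:00.0" keep their own colons.
        key, sep, value = line.partition(":")
        if not sep:
            continue
        info[key.strip()] = value.strip()
    return info


def read_ethtool_info(interface: str) -> dict:
    """Run `ethtool -i <interface>` (requires ethtool on the host) and parse it."""
    result = subprocess.run(
        ["ethtool", "-i", interface],
        capture_output=True, text=True, check=True,
    )
    return parse_ethtool_info(result.stdout)


# Sample copied from the eth4 (LAN) output quoted in this thread.
sample = """\
driver: igb
version: 5.3.5.20
firmware-version: 1.67, 0x80000fc9, 19.5.12
expansion-rom-version:
bus-info: 0000:0a:00.0
"""

info = parse_ethtool_info(sample)
print(info["driver"], info["bus-info"])
```

On a real host, `read_ethtool_info("eth4")` would return the same structure for the live interface.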

          darabontors

          It is the DELL X540 2 x 10 GbE and 2 x 1 GbE daughter board in a DELL R720.

            olivierlambert Vates 🪐 Co-Founder CEO

            Could be a statistical bias, but for now, absolutely ALL the reports we had came from Dell PowerEdge servers (between x20 and x30 series). Most of the time, it was with Intel cards, but I'm not 100% sure it's due to that, since the crash logs indicate that OVS crashed before the packet even reached the NIC. But it could also be an "answer" packet to a specifically crafted incoming packet that causes this 🤔

              darabontors @olivierlambert

              @olivierlambert I can confirm 100% that a Dell workstation that I use at one of my clients did the same thing.

                Andrew Top contributor @darabontors

                @darabontors Same network interface type? I just set up an HP with the same ethernet interface (igb driver) for testing.

                  tuxen Top contributor @darabontors

                  @darabontors said in Very scary host reboot issue:

                  Some other detail that might be unrelated: my PPPoE connection to my ISP has MTU 1492. WireGuard connection also has MTU 1492. Is this relevant in any way?

                  I'm not into firewall/tunneling stuff but shouldn't the WireGuard MTU be lower than the PPPoE one in order to fit the WG protocol overhead? I read that the default=1420 and minimum=1280. I'd first reset the WG MTU to default and also test lower values within this range if the crash still persists.
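
The overhead argument above can be checked with a little arithmetic. The sketch below assumes the usual header sizes (20-byte IPv4 or 40-byte IPv6 outer header, 8-byte UDP header, 32 bytes of WireGuard data-message framing, 8 bytes of PPPoE/PPP encapsulation) and ignores any IP options or extension headers; the function name is hypothetical.

```python
IPV4_HEADER = 20
IPV6_HEADER = 40
UDP_HEADER = 8
# WireGuard transport data message: 4 bytes type/reserved, 4 bytes receiver
# index, 8 bytes counter, 16 bytes authentication tag.
WG_DATA_OVERHEAD = 32
# PPPoE: 6-byte PPPoE header plus 2-byte PPP protocol field.
PPPOE_OVERHEAD = 8


def max_wg_mtu(link_mtu: int, outer_ipv6: bool = False) -> int:
    """Largest tunnel MTU whose encapsulated packets still fit in link_mtu."""
    outer_ip = IPV6_HEADER if outer_ipv6 else IPV4_HEADER
    return link_mtu - outer_ip - UDP_HEADER - WG_DATA_OVERHEAD


pppoe_mtu = 1500 - PPPOE_OVERHEAD

print(pppoe_mtu)                                # 1492
print(max_wg_mtu(1500, outer_ipv6=True))        # 1420, the WireGuard default
print(max_wg_mtu(pppoe_mtu))                    # 1432 over an IPv4 outer path
print(max_wg_mtu(pppoe_mtu, outer_ipv6=True))   # 1412 over an IPv6 outer path
```

Under these assumptions, a WireGuard MTU of 1492 on a 1492-byte PPPoE link cannot fit without fragmentation, while 1420 fits when the tunnel endpoints talk over IPv4.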

                  Regardless of the tests, there is indeed a bug somewhere, because a malformed packet/frame should be handled and not trigger a crash.

                    olivierlambert Vates 🪐 Co-Founder CEO

                    @tuxen said in Very scary host reboot issue:

                    Regardless of the tests, there is indeed a bug somewhere, because a malformed packet/frame should be handled and not trigger a crash.

                    Obviously, but it might be a clue on how to trigger the bug 🙂

                      darabontors @tuxen

                      @tuxen You might be on to something. I need to clarify one thing. I am positive this issue is related to the Windows WireGuard client. On the same host and the same OPNsense VM, I have 10+ site-to-site WireGuard connections configured, moving 100+ GB daily, and the host never reboots. I can only trigger it from a Windows WG connection.

                      How do I verify MTU size for the WG connection in Windows 11? I cannot find it for the life of me...
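
One place to look, as far as I know, is the tunnel configuration itself: the wg-quick style config used by the WireGuard clients accepts an optional `MTU = <value>` key in the `[Interface]` section, and on Windows `netsh interface ipv4 show subinterfaces` lists the effective MTU per interface. The sketch below only reads the config text; the function name is hypothetical and the sample config is made up with placeholder values.

```python
from typing import Optional


def interface_mtu(config_text: str) -> Optional[int]:
    """Return the MTU set in the [Interface] section, or None if it is unset."""
    section = None
    for raw in config_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        if line.startswith("[") and line.endswith("]"):
            section = line[1:-1].strip().lower()
            continue
        key, sep, value = line.partition("=")
        if sep and section == "interface" and key.strip().lower() == "mtu":
            return int(value.strip())
    return None


# Hypothetical tunnel config; addresses and keys are placeholders.
sample_config = """\
[Interface]
PrivateKey = <client-private-key>
Address = 10.0.0.2/32
MTU = 1420

[Peer]
PublicKey = <server-public-key>
AllowedIPs = 0.0.0.0/0
Endpoint = vpn.example.com:51820
"""

print(interface_mtu(sample_config))
```

If the key is absent the function returns `None`, in which case the client chooses the MTU on its own.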

                        darabontors

                        I found the MTU parameter. This time it was 1420 on both OPNsense WG interface and in Windows (client side). I was happy for about 5 minutes as I wasn't able to reproduce the crash, but then it happened again. My "favorite" way to trigger it is by pausing the file transfer, waiting for a couple of minutes and then resuming it. The transfer's MB/s jumps up like crazy in Windows, then freezes until it gets in sync with the real progress of the transfer. After two tries of pausing and resuming, the crash happened.

                        @olivierlambert I have used this setup on my own infrastructure and my clients' for at least 4 years. I never experienced this issue until as recently as September this year. You guys saw this issue ~6 months ago. Is there a way to trace back any recent updates to Open vSwitch? I know it might be some update on the FreeBSD side that made this Open vSwitch bug surface only recently... I know there was little to no development on the WireGuard side of things this year.

                          olivierlambert Vates 🪐 Co-Founder CEO

                          The first report I heard was in April of this year, with the Intel ixgbe driver on an R630, from someone using OPNsense in a VM + WireGuard. I don't remember any OVS change that could explain this.

                          We had the crash around May/June IIRC, on a Dell 430 (or 420), but with the Intel e1000e driver.

                          But after inspection, we've seen that the issue was happening inside OVS, before entering the NIC, so it might not be related to the hardware at all. Maybe the hardware "helps" to get the packet or instruction that crashes OVS. But to me, it's more likely related to an update or something in the FreeBSD PV driver inside OPNsense or pfSense. It would be interesting to see whether something changed in that area in both projects in 2023.

                            darabontors @olivierlambert

                            @olivierlambert said in Very scary host reboot issue:

                            FreeBSD PV driver inside OPNsense or pfSense.

                            Who is maintaining the FreeBSD PV drivers?

                              olivierlambert Vates 🪐 Co-Founder CEO

                              Maybe it's time to ask OPNsense devs 🙂

                                Andrew Top contributor @olivierlambert

                                @olivierlambert Last time I asked about a Xen driver issue, they deferred to the FreeBSD maintainers.

                                  olivierlambert Vates 🪐 Co-Founder CEO

                                  Maybe it's time to ask FreeBSD devs, then 😛

                                    darabontors @olivierlambert

                                    @olivierlambert I'm thinking of a quick workaround. What if I use PCI pass-through for the LAN and WAN interfaces, physically connect the LAN port to another, non-passed-through port of the server, and use that port to interface with my other VMs via OVS? Does that make any sense? Does it seem viable to mitigate this issue?

                                      Andrew Top contributor @darabontors

                                      @darabontors Yes, if you pass the PCIe LAN/WAN hardware to the VM then it will bypass the FreeBSD Xen network drivers and the Dom0 Xen OVS drivers.

                                      FreeBSD will use its own hardware drivers for the network interfaces.

                                      You won't be able to use the interfaces for any shared VM. You won't be able to hot migrate the VM to another host.
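
As a rough sketch of what the pass-through setup could look like, the helper below only builds the command strings and does not run anything. The commands follow the PCI passthrough procedure as I recall it from the XCP-ng 8.2 documentation (hide the devices from Dom0 with `xen-pciback.hide`, reboot, then attach them to the VM through `other-config:pci`); treat them as an assumption and check the current docs before use. The function name and the VM UUID placeholder are hypothetical; the PCI addresses are the `bus-info` values posted earlier in this thread.

```python
import re

# Full PCI address: domain:bus:device.function, e.g. 0000:0a:00.0
PCI_ADDR = re.compile(r"^[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]$")


def passthrough_commands(pci_addrs, vm_uuid):
    """Build (not execute) the Dom0 commands to pass PCI devices to one VM."""
    for addr in pci_addrs:
        if not PCI_ADDR.match(addr):
            raise ValueError(f"not a full PCI address: {addr!r}")
    hide = "".join(f"({addr})" for addr in pci_addrs)
    attach = ",".join(f"0/{addr}" for addr in pci_addrs)
    return [
        # Hide the devices from Dom0 so they can be assigned to a guest.
        f'/opt/xensource/libexec/xen-cmdline --set-dom0 "xen-pciback.hide={hide}"',
        # The hide setting only takes effect after a host reboot.
        "reboot",
        # After the reboot, the devices should be listed as assignable.
        "xl pci-assignable-list",
        # Attach the devices to the (halted) VM.
        f"xe vm-param-set other-config:pci={attach} uuid={vm_uuid}",
    ]


commands = passthrough_commands(["0000:0a:00.0", "0000:0a:00.1"], "<vm-uuid>")
for command in commands:
    print(command)
```

Printing the commands first, rather than executing them, leaves room to confirm that the addresses really belong to the ports intended for the firewall VM.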

                                        darabontors @Andrew

                                        @Andrew That makes sense. I think I'll do just that. In the meantime, I'll try to replicate the problem on test hardware. I really need a permanent fix for this.

                                          olivierlambert Vates 🪐 Co-Founder CEO

                                          We'll all be happy when we find that bug 🙂

                                            tuxen Top contributor @darabontors

                                            @darabontors some additional tests that I could think of:

                                            1. Minimum WG MTU on client-side (MTU=1280);
                                            2. OPNsense with emulated e1000 interfaces (bypasses the PV driver but not OVS). It'll keep the VM 'agile' (hot-migrate) but at a big cost in performance;
                                            3. The latest OPNsense version, 23.7.5.

                                            As for the latest version, I found this important info posted by the devs about a change in the MTU code [1]:

                                            Today introduces a change in MTU handling for parent interfaces mostly
                                            noticed by PPPoE use where the respective MTU values need to fit the
                                            parent plus the additional header of the VLAN or PPPoE. Should the
                                            MTU already be misconfigured to a smaller value it will be used as
                                            configured so check your configuration and clear the MTU value if you
                                            want the system to decide about the effective parent MTU size.
                                            (...)

                                            Hope it helps.


                                            [1] https://forum.opnsense.org/index.php?topic=36163.0
