    Very scary host reboot issue

    • darabontors

      Hello dear community,

      I have a very scary and unpleasant issue with two of my XCP-ng hosts (so far). I'm running the latest stable 8.2.1, fully patched except for the latest 5 patches that appeared a few days ago. I have OPNsense running as a VM, providing a VPN tunnel via WireGuard.

      If I copy a large file (multiple GB) to a file server (on the same network as OPNsense) or to the host's local ISO repository, the host restarts at some point during the copy operation.

      I've had this issue for around 2 months now. Initially I thought a hardware failure in my Dell R720 server was causing it, but recently I experienced the same phenomenon on another server, hardware-wise unrelated to the first one. The second host rebooted while copying an ISO file to its local ISO repository; the copy went over a WireGuard tunnel hosted by an OPNsense VM.

      No configuration change whatsoever took place in the preceding months. These are production systems with critical workloads; the only thing I did was patch them via Xen Orchestra.

      Is there anything in the recent (Q2-Q3) patches that could cause something like this? How can I diagnose the problem? What can I do?

      Thanks in advance!

      • olivierlambert (Vates 🪐 Co-Founder & CEO)

        OPNsense is FreeBSD-based, right?

        • darabontors @olivierlambert

          @olivierlambert That's right.

          • olivierlambert (Vates 🪐 Co-Founder & CEO)

            We've suspected an issue between the FreeBSD PV drivers and Open vSwitch for a while now. Please check your logs for the crash dump. I'm betting a few $ on OVS crashing the dom0 because of a malformed packet coming from the BSD VM.

            The catch: we've never managed to reproduce the problem here.

            If you have a way to trigger it 100% of the time, we are very interested in investigating deeper.

            • darabontors @olivierlambert

              @olivierlambert That sounds like bad news to me…

              Could you please specify exactly which log files to check, and their exact locations?

              • olivierlambert (Vates 🪐 Co-Founder & CEO)

                Check the /var/crash folder and the related kern log.
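
                If it helps, here's a rough sketch of the kind of thing you can run from the dom0 to pull out the interesting bits. It's only a sketch and it just assumes the standard /var/crash folder and /var/log/kern.log location; adjust the paths if your layout differs.

                #!/usr/bin/env python3
                # Sketch only: list crash dumps under /var/crash (newest first) and print
                # the panic/oops/openvswitch lines from kern.log. Paths assume a default
                # XCP-ng 8.2 dom0; adjust them if your layout differs.
                import os
                import re

                CRASH_DIR = "/var/crash"
                KERN_LOG = "/var/log/kern.log"

                if os.path.isdir(CRASH_DIR):
                    entries = sorted(
                        os.listdir(CRASH_DIR),
                        key=lambda e: os.path.getmtime(os.path.join(CRASH_DIR, e)),
                        reverse=True,
                    )
                    print(f"{len(entries)} entries in {CRASH_DIR} (newest first):")
                    for entry in entries:
                        print("   ", entry)

                if os.path.isfile(KERN_LOG):
                    interesting = re.compile(r"Kernel panic|Oops|BUG:|openvswitch", re.IGNORECASE)
                    with open(KERN_LOG, errors="replace") as log:
                        for line in log:
                            if interesting.search(line):
                                print(line.rstrip())

                Whatever shows up under /var/crash, plus the lines around the panic in kern.log, is what's worth posting here.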

                • darabontors @olivierlambert

                  @olivierlambert Is there a way I could send you the crash logs for analysis? Figuring out these logs is way above my skill level unfortunately. Could you please help?

                    • darabontors

                      [ 334371.865769]  ALERT: BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
                      [ 334371.865787]   INFO: PGD 2250ed067 P4D 2250ed067 PUD 228c9f067 PMD 0
                      [ 334371.865803]   WARN: Oops: 0000 [#1] SMP NOPTI
                      [ 334371.865810]   WARN: CPU: 9 PID: 57 Comm: ksoftirqd/9 Tainted: G           O      4.19.0+1 #1
                      [ 334371.865818]   WARN: Hardware name: Dell Inc. PowerEdge R720/0C4Y3R, BIOS 2.9.0 12/06/2019
                      [ 334371.865832]   WARN: RIP: e030:skb_copy_ubufs+0x19c/0x5f0
                      [ 334371.865839]   WARN: Code: 90 cc 00 00 00 48 03 90 d0 00 00 00 48 63 44 24 40 48 83 c0 03 48 c1 e0 04 48 01 d0 48 89 18 c7 40 08 00 00 00 00 44 89 78 0c <48> 8b 43 08 a8 01 0f 85 3f 04 00 00 48 8b 44 24 30 48 83 78 20 ff
                      [ 334371.865858]   WARN: RSP: e02b:ffffc9004026b6f8 EFLAGS: 00010282
                      [ 334371.865864]   WARN: RAX: ffff888099621ae0 RBX: 0000000000000000 RCX: 00000000000000c0
                      [ 334371.865873]   WARN: RDX: ffff888099621ac0 RSI: ffff888099621ac0 RDI: ffffea00031da880
                      [ 334371.865881]   WARN: RBP: 0000000000000000 R08: ffff888099621a00 R09: ffff8881f0d43e98
                      [ 334371.865890]   WARN: R10: ffffc9004026b8b0 R11: 0000000000000000 R12: ffff888096e61c00
                      [ 334371.865898]   WARN: R13: 0000000000000000 R14: ffff88822b867a80 R15: 0000000000000000
                      [ 334371.865918]   WARN: FS:  0000000000000000(0000) GS:ffff88822d440000(0000) knlGS:0000000000000000
                      [ 334371.865927]   WARN: CS:  e033 DS: 0000 ES: 0000 CR0: 0000000080050033
                      [ 334371.865935]   WARN: CR2: 0000000000000008 CR3: 00000002281aa000 CR4: 0000000000040660
                      [ 334371.865949]   WARN: Call Trace:
                      [ 334371.865958]   WARN:  skb_clone+0x71/0xa0
                      [ 334371.865968]   WARN:  do_execute_actions+0x4ec/0x1750 [openvswitch]
                      [ 334371.865978]   WARN:  ? ovs_dp_process_packet+0x7d/0x110 [openvswitch]
                      [ 334371.865988]   WARN:  ? ovs_vport_receive+0x6e/0xd0 [openvswitch]
                      [ 334371.865997]   WARN:  ? arch_local_irq_restore+0x5/0x10
                      [ 334371.866005]   WARN:  ? get_page_from_freelist+0xa4f/0xf00
                      [ 334371.866012]   WARN:  ? arch_local_irq_restore+0x5/0x10
                      [ 334371.866020]   WARN:  ? get_page_from_freelist+0xa4f/0xf00
                      [ 334371.866031]   WARN:  ovs_execute_actions+0x47/0x120 [openvswitch]
                      [ 334371.866040]   WARN:  ovs_dp_process_packet+0x7d/0x110 [openvswitch]
                      [ 334371.866050]   WARN:  ? key_extract+0xa53/0xd60 [openvswitch]
                      [ 334371.866058]   WARN:  ovs_vport_receive+0x6e/0xd0 [openvswitch]
                      [ 334371.866066]   WARN:  ? __alloc_skb+0x4e/0x270
                      [ 334371.866075]   WARN:  ? notify_remote_via_irq+0x4a/0x70
                      [ 334371.866085]   WARN:  ? __raw_callee_save_xen_vcpu_stolen+0x11/0x20
                      [ 334371.866091]   WARN:  ? __alloc_skb+0x76/0x270
                      [ 334371.866100]   WARN:  ? arch_local_irq_restore+0x5/0x10
                      [ 334371.866108]   WARN:  ? __slab_alloc.constprop.81+0x42/0x4e
                      [ 334371.866114]   WARN:  ? __alloc_skb+0x4e/0x270
                      [ 334371.866120]   WARN:  ? __kmalloc_track_caller+0x58/0x200
                      [ 334371.866127]   WARN:  ? __slab_alloc.constprop.81+0x42/0x4e
                      [ 334371.866136]   WARN:  ? __kmalloc_reserve.isra.48+0x29/0x70
                      [ 334371.866146]   WARN:  netdev_frame_hook+0x105/0x180 [openvswitch]
                      [ 334371.866154]   WARN:  __netif_receive_skb_core+0x211/0xb30
                      [ 334371.866163]   WARN:  __netif_receive_skb_one_core+0x36/0x70
                      [ 334371.866170]   WARN:  netif_receive_skb_internal+0x34/0xe0
                      [ 334371.866179]   WARN:  xenvif_tx_action+0x55c/0x990
                      [ 334371.866187]   WARN:  xenvif_poll+0x27/0x70
                      [ 334371.866193]   WARN:  net_rx_action+0x2a5/0x3e0
                      [ 334371.866200]   WARN:  __do_softirq+0xd1/0x28c
                      [ 334371.866208]   WARN:  run_ksoftirqd+0x26/0x40
                      [ 334371.866215]   WARN:  smpboot_thread_fn+0x10e/0x160
                      [ 334371.866223]   WARN:  kthread+0xf8/0x130
                      [ 334371.866229]   WARN:  ? sort_range+0x20/0x20
                      [ 334371.866235]   WARN:  ? kthread_bind+0x10/0x10
                      [ 334371.866242]   WARN:  ret_from_fork+0x35/0x40
                      [ 334371.866250]   WARN: Modules linked in: tun bnx2fc(O) cnic(O) uio fcoe libfcoe libfc scsi_transport_fc openvswitch nsh nf_nat_ipv6 nf_nat_ipv4 nf_conncount nf_nat 8021q garp mrp stp llc dm_multipath ipt_REJECT nf_reject_ipv4 xt_tcpu$
                      [ 334371.866374]   WARN:  scsi_mod efivarfs ipv6 crc_ccitt
                      [ 334371.866384]   WARN: CR2: 0000000000000008
                      [ 334371.866396]   WARN: ---[ end trace 8b74661a79be8268 ]---
                      [ 334371.868712]   WARN: RIP: e030:skb_copy_ubufs+0x19c/0x5f0
                      [ 334371.868721]   WARN: Code: 90 cc 00 00 00 48 03 90 d0 00 00 00 48 63 44 24 40 48 83 c0 03 48 c1 e0 04 48 01 d0 48 89 18 c7 40 08 00 00 00 00 44 89 78 0c <48> 8b 43 08 a8 01 0f 85 3f 04 00 00 48 8b 44 24 30 48 83 78 20 ff
                      [ 334371.868740]   WARN: RSP: e02b:ffffc9004026b6f8 EFLAGS: 00010282
                      [ 334371.868748]   WARN: RAX: ffff888099621ae0 RBX: 0000000000000000 RCX: 00000000000000c0
                      [ 334371.868759]   WARN: RDX: ffff888099621ac0 RSI: ffff888099621ac0 RDI: ffffea00031da880
                      [ 334371.868769]   WARN: RBP: 0000000000000000 R08: ffff888099621a00 R09: ffff8881f0d43e98
                      [ 334371.868778]   WARN: R10: ffffc9004026b8b0 R11: 0000000000000000 R12: ffff888096e61c00
                      [ 334371.868788]   WARN: R13: 0000000000000000 R14: ffff88822b867a80 R15: 0000000000000000
                      [ 334371.868805]   WARN: FS:  0000000000000000(0000) GS:ffff88822d440000(0000) knlGS:0000000000000000
                      [ 334371.868815]   WARN: CS:  e033 DS: 0000 ES: 0000 CR0: 0000000080050033
                      [ 334371.868823]   WARN: CR2: 0000000000000008 CR3: 00000002281aa000 CR4: 0000000000040660
                      [ 334371.868837]  EMERG: Kernel panic - not syncing: Fatal exception in interrupt
                      
                      • darabontors @olivierlambert

                        @olivierlambert Is this log snippet helpful? Do I need to dig somewhere else specifically?

                        • Andrew (Top contributor) @olivierlambert

                          @darabontors @olivierlambert I use OPNsense and have never had this problem... It is currently FreeBSD 13.2-based and shows management agent 6.2.0-76888, using the default Xen drivers included in FreeBSD 13.2. I have run many, many gigabytes of data through the firewalls (several VMs). I am running an older processor (Xeon E5 v2), but no issues.

                          • olivierlambert (Vates 🪐 Co-Founder & CEO)

                            @darabontors Thanks, that's what I suspected…

                            @Andrew As I said, the hard part of this bug is making it reproducible. It seems to be related to WireGuard AND FreeBSD; as you can see, OVS is crashing the whole dom0 kernel at some point. Xen detects it and finally (logically) decides to reboot the dom0.

                            • Andrew (Top contributor) @olivierlambert

                              @olivierlambert I'll spin up some OPNsense/WireGuard VMs and see what happens for me...

                              • olivierlambert (Vates 🪐 Co-Founder & CEO)

                                Thanks, we are doing the same. However, I'm not optimistic about being able to trigger it artificially… We have thousands of users all around the world using BSD + WireGuard. We even do it here.

                                And yes, we had the issue ourselves 6 months ago. Then it stopped and never happened again. It's a really difficult problem to investigate in the first place.

                                • Andrew (Top contributor) @olivierlambert

                                  @olivierlambert I loaded everything and ran 1 TB of data over WireGuard and nothing failed... So, another non-failure here too.

                                  • darabontors @olivierlambert

                                    @olivierlambert Is there something specific I could do? A specific way to test, maybe?
                                    @Andrew Are you using the WireGuard kmod in OPNsense?

                                    • Andrew (Top contributor) @darabontors

                                      @darabontors I'm using the current OPNsense (23.7.5) install and I added the WG (2.1) plugin from the GUI. I built a WG tunnel between two OPNsense VMs and put a Debian VM attached to each firewall. Then I transferred data between the Debian VMs (through the firewall/WG tunnel).
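
                                      In case anyone wants to run a similar test, here is a minimal sketch of the kind of bulk transfer I mean (the peer address and data volume below are made up for the example; iperf3 or copying a big file over scp through the tunnel exercises the same path).

                                      #!/usr/bin/env python3
                                      # Sketch only: bulk TCP sender/receiver to push traffic through
                                      # the WG tunnel between the two test VMs. The peer address and
                                      # the total volume are hypothetical placeholders.
                                      import socket
                                      import sys

                                      PEER = ("10.0.20.10", 5001)  # hypothetical VM behind the far firewall
                                      CHUNK = b"\x00" * 65536      # 64 KiB per send
                                      TOTAL = 10 * 1024 ** 3       # ~10 GiB overall

                                      def server() -> None:
                                          # Receiving side: count how much data arrives.
                                          with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as srv:
                                              srv.bind(("0.0.0.0", PEER[1]))
                                              srv.listen(1)
                                              conn, addr = srv.accept()
                                              received = 0
                                              with conn:
                                                  while True:
                                                      data = conn.recv(65536)
                                                      if not data:
                                                          break
                                                      received += len(data)
                                              print(f"received {received / 1024 ** 3:.2f} GiB from {addr[0]}")

                                      def client() -> None:
                                          # Sending side: stream TOTAL bytes through the tunnel.
                                          with socket.create_connection(PEER) as sock:
                                              sent = 0
                                              while sent < TOTAL:
                                                  sock.sendall(CHUNK)
                                                  sent += len(CHUNK)
                                          print(f"sent {sent / 1024 ** 3:.2f} GiB")

                                      if __name__ == "__main__":
                                          server() if "server" in sys.argv[1:] else client()

                                      Start it with "server" on one Debian VM, then run it without arguments on the other, and keep an eye on the hosts while the transfer runs.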

                                      • olivierlambert (Vates 🪐 Co-Founder & CEO)

                                        @darabontors We need all the information you can provide about your setup so we can try to trigger the bug.

                                        My feeling on this is that a malformed packet is crashing OVS, maybe due to WireGuard's lower MTU, but ANY detail about the configuration/setup you have will help us build something similar and, ideally, reproduce it.

                                        Without a reproducible way to trigger the bug, it will be nearly impossible to fix it.

                                        • planedrop (Top contributor)

                                          Just wanted to add a few things here: I've never had this happen running pfSense VMs on any of my 3 hosts, some of them moving quite a bit of data around over WireGuard connections, so it does seem hard to reproduce.

                                          It might be worth a try, @darabontors, to run this on pfSense instead of OPNsense just to see whether you run into the same issue; it may help narrow things down.

                                          Though maybe I'm speaking out of turn here; I haven't really seen this bug before, so maybe pfSense/OPNsense has nothing to do with it and it's just BSD.

                                          • olivierlambert (Vates 🪐 Co-Founder & CEO)

                                            We had the issue with pfSense as well, so IMHO it's related to a combination of FreeBSD and OVS. Likely the PV drivers in BSD, which are less tested.
