    kswapd0: page allocation failure under high load

    • Weppel

      When running VM exports, the pool master occasionally reports multiple page allocation failures like this one:

      [1015432.935572] kswapd0: page allocation failure: order:0, mode:0x480020(GFP_ATOMIC), nodemask=(null)
      [1015432.935572] kswapd0 cpuset=/ mems_allowed=0
      [1015432.935573] CPU: 4 PID: 109 Comm: kswapd0 Tainted: G           O      4.19.0+1 #1
      [1015432.935573] Hardware name: Supermicro Super Server/H12SSL-CT, BIOS 2.3 10/20/2021
      [1015432.935573] Call Trace:
      [1015432.935574]  <IRQ>
      [1015432.935574]  dump_stack+0x5a/0x73
      [1015432.935575]  warn_alloc+0xee/0x180
      [1015432.935576]  __alloc_pages_slowpath+0x84d/0xa09
      [1015432.935577]  ? get_page_from_freelist+0x14c/0xf00
      [1015432.935578]  ? ttwu_do_wakeup+0x19/0x140
      [1015432.935579]  ? _raw_spin_unlock_irqrestore+0x14/0x20
      [1015432.935580]  ? try_to_wake_up+0x54/0x450
      [1015432.935581]  __alloc_pages_nodemask+0x271/0x2b0
      [1015432.935582]  bnxt_rx_pages+0x194/0x4f0 [bnxt_en]
      [1015432.935584]  bnxt_rx_pkt+0xccd/0x1510 [bnxt_en]
      [1015432.935586]  __bnxt_poll_work+0x10e/0x2a0 [bnxt_en]
      [1015432.935588]  bnxt_poll+0x8d/0x640 [bnxt_en]
      [1015432.935589]  net_rx_action+0x2a5/0x3e0
      [1015432.935590]  __do_softirq+0xd1/0x28c
      [1015432.935590]  irq_exit+0xa8/0xc0
      [1015432.935591]  xen_evtchn_do_upcall+0x2c/0x50
      [1015432.935592]  xen_do_hypervisor_callback+0x29/0x40
      [1015432.935592]  </IRQ>
      [1015432.935593] RIP: e030:xen_hypercall_xen_version+0xa/0x20
      [1015432.935593] Code: 51 41 53 b8 10 00 00 00 0f 05 41 5b 59 c3 cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc 51 41 53 b8 11 00 00 00 0f 05 <41> 5b 59 c3 cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc
      [1015432.935594] RSP: e02b:ffffc900404af950 EFLAGS: 00000246
      [1015432.935594] RAX: 000000000004000d RBX: 000000000000000f RCX: ffffffff8100122a
      [1015432.935595] RDX: 000000000000000f RSI: 0000000000000000 RDI: 0000000000000000
      [1015432.935595] RBP: ffffffffffffffff R08: 00000000ffffff31 R09: 000000000000000f
      [1015432.935595] R10: 0002f632ef000006 R11: 0000000000000246 R12: 0000000000000000
      [1015432.935596] R13: 0002f632ef040006 R14: ffffc900404afa50 R15: ffffc900404afac8
      [1015432.935596]  ? xen_hypercall_xen_version+0xa/0x20
      [1015432.935597]  ? xen_force_evtchn_callback+0x9/0x10
      [1015432.935598]  ? check_events+0x12/0x20
      [1015432.935598]  ? xen_irq_enable_direct+0x19/0x20
      [1015432.935599]  ? truncate_exceptional_pvec_entries.part.16+0x175/0x1d0
      [1015432.935600]  ? truncate_inode_pages_range+0x280/0x7d0
      [1015432.935601]  ? deactivate_slab.isra.74+0xef/0x400
      [1015432.935602]  ? __inode_wait_for_writeback+0x75/0xe0
      [1015432.935603]  ? init_wait_var_entry+0x40/0x40
      [1015432.935605]  ? nfs4_evict_inode+0x15/0x70 [nfsv4]
      [1015432.935606]  ? evict+0xc6/0x1a0
      [1015432.935607]  ? dispose_list+0x35/0x50
      [1015432.935608]  ? prune_icache_sb+0x52/0x70
      [1015432.935608]  ? super_cache_scan+0x13c/0x190
      [1015432.935609]  ? do_shrink_slab+0x166/0x300
      [1015432.935610]  ? shrink_slab+0xdd/0x2a0
      [1015432.935611]  ? shrink_node+0xf1/0x480
      [1015432.935612]  ? kswapd+0x2b7/0x730
      [1015432.935613]  ? kthread+0xf8/0x130
      [1015432.935614]  ? mem_cgroup_shrink_node+0x180/0x180
      [1015432.935615]  ? kthread_bind+0x10/0x10
      [1015432.935616]  ? ret_from_fork+0x22/0x40
      

      I've done extensive memory tests and everything looks fine. The hosts are currently on 8.2.1 (with the latest updates). This looks to be a recent issue; I had not seen it before upgrading to XCP-ng 8.2.x.
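
      In case it's useful, here's a rough sketch of how dom0 memory state could be sampled while the exports run, so the situation around a failure can be inspected afterwards (the log path and interval are arbitrary placeholders, not my actual setup):

      # Sketch: sample dom0 memory state every 10 seconds during an export.
      # An order-0 GFP_ATOMIC failure means free pages dipped below the atomic
      # watermark, so MemFree/Slab around the time of the event are the
      # interesting numbers; /proc/buddyinfo shows the per-order free lists.
      LOG=/var/log/dom0-mem-watch.log
      while true; do
          {
              date
              grep -E '^(MemFree|Buffers|Cached|Slab)' /proc/meminfo
              cat /proc/buddyinfo
              echo '---'
          } >> "$LOG"
          sleep 10
      done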

      • olivierlambert (Vates 🪐 Co-Founder CEO)

        Hi,

        Thanks for your feedback. Let's see if this rings a bell for @dthenot @andSmv and/or @gduperrey @stormi

        • bleader (Vates 🪐 XCP-ng Team)

          When you say

          under high load

           Do you mean the export itself is creating the heavy load, or is there something else in your setup creating a heavy load at the same time?

           Since nfsv4 appears in the call trace, I would guess you're doing an export from XO, i.e. downloading the export over HTTP, and that the VM is originally on a shared SR. Is that right?

           The bottom of the call trace goes kswapd -> shrink_slab, so I would venture it is trying to free up memory. As in my first question: if there is something else putting heavy load on the host, this could be expected; if not, it could be a new leak or higher memory consumption during the export for one reason or another. But I doubt it's a leak, as I would expect that to end in an OOM rather than this kind of warning.

           If you're sure the setup and the load outside of the export are the same as before, then it could indeed be something from the updates.
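
           One way to tell those apart would be to snapshot slab usage before and after each export run; steady growth across runs points more toward a leak, while a spike that settles back down points toward transient pressure. A rough sketch, not an official procedure:

           # Snapshot overall slab usage plus the biggest slab caches; run it
           # before and after an export and compare across several runs.
           echo "=== $(date) ==="
           grep -E '^(Slab|SReclaimable|SUnreclaim)' /proc/meminfo
           # Top caches by active objects (needs root; the two header lines
           # of /proc/slabinfo sort to the bottom and are harmless)
           sort -rn -k2 /proc/slabinfo 2>/dev/null | head -n 10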

           • Weppel @bleader

            The export is creating heavy load. It's doing full VM dumps with compression on all 3 nodes in the cluster at the same time (one VM per node).

             The export is done through the CLI (/usr/bin/xe vm-export vm=$VM filename="$FILENAME" compress=zstd) to a locally mounted NFSv4 folder, not via XO.

             The original VM storage is indeed on a shared SR, a Ceph RBD.

             The setup has not changed for ~1.5 years. This issue started popping up about 1-2 months ago (it could be longer; I unfortunately don't have a specific date for when it first happened).
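
             For completeness, the per-node job is essentially along these lines (the VM name, mount point and filename pattern below are placeholders, not my actual configuration):

             #!/bin/bash
             # Per-node export job sketch: one full, zstd-compressed dump of the
             # node's VM onto the NFSv4 mount. Names and paths are placeholders.
             VM="vm-on-this-node"
             DEST=/mnt/backup-nfs
             FILENAME="$DEST/${VM}_$(date +%Y%m%d).xva"
             /usr/bin/xe vm-export vm="$VM" filename="$FILENAME" compress=zstd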

             • olivierlambert (Vates 🪐 Co-Founder CEO)

               I know it's not an answer to your original problem, but FYI, you would have far less Dom0 load doing XO incremental backups (especially using NBD).

               • Weppel @olivierlambert

                @olivierlambert said in kswapd0: page allocation failure under high load:

                 I know it's not an answer to your original problem, but FYI, you would have far less Dom0 load doing XO incremental backups (especially using NBD).

                 Thanks for the hint. I've been looking at switching to XO for these (and other) tasks, but pricing is currently holding me back from switching.

                 • olivierlambert (Vates 🪐 Co-Founder CEO)

                  Use it from the sources then 😉
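
                   Rough outline of what that looks like, assuming a dedicated VM with Node.js, Yarn and the usual build dependencies already in place (check the official documentation for the current, authoritative steps):

                   # XO from the sources, sketched; see the docs for details.
                   git clone -b master https://github.com/vatesfr/xen-orchestra
                   cd xen-orchestra
                   yarn && yarn build
                   cd packages/xo-server
                   yarn start    # then point a browser at the configured host/port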
