    kswapd0: page allocation failure under high load

    • Weppel

      When running VM exports, the pool master occasionally reports multiple page allocation failures like this one:

      [1015432.935572] kswapd0: page allocation failure: order:0, mode:0x480020(GFP_ATOMIC), nodemask=(null)
      [1015432.935572] kswapd0 cpuset=/ mems_allowed=0
      [1015432.935573] CPU: 4 PID: 109 Comm: kswapd0 Tainted: G           O      4.19.0+1 #1
      [1015432.935573] Hardware name: Supermicro Super Server/H12SSL-CT, BIOS 2.3 10/20/2021
      [1015432.935573] Call Trace:
      [1015432.935574]  <IRQ>
      [1015432.935574]  dump_stack+0x5a/0x73
      [1015432.935575]  warn_alloc+0xee/0x180
      [1015432.935576]  __alloc_pages_slowpath+0x84d/0xa09
      [1015432.935577]  ? get_page_from_freelist+0x14c/0xf00
      [1015432.935578]  ? ttwu_do_wakeup+0x19/0x140
      [1015432.935579]  ? _raw_spin_unlock_irqrestore+0x14/0x20
      [1015432.935580]  ? try_to_wake_up+0x54/0x450
      [1015432.935581]  __alloc_pages_nodemask+0x271/0x2b0
      [1015432.935582]  bnxt_rx_pages+0x194/0x4f0 [bnxt_en]
      [1015432.935584]  bnxt_rx_pkt+0xccd/0x1510 [bnxt_en]
      [1015432.935586]  __bnxt_poll_work+0x10e/0x2a0 [bnxt_en]
      [1015432.935588]  bnxt_poll+0x8d/0x640 [bnxt_en]
      [1015432.935589]  net_rx_action+0x2a5/0x3e0
      [1015432.935590]  __do_softirq+0xd1/0x28c
      [1015432.935590]  irq_exit+0xa8/0xc0
      [1015432.935591]  xen_evtchn_do_upcall+0x2c/0x50
      [1015432.935592]  xen_do_hypervisor_callback+0x29/0x40
      [1015432.935592]  </IRQ>
      [1015432.935593] RIP: e030:xen_hypercall_xen_version+0xa/0x20
      [1015432.935593] Code: 51 41 53 b8 10 00 00 00 0f 05 41 5b 59 c3 cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc 51 41 53 b8 11 00 00 00 0f 05 <41> 5b 59 c3 cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc
      [1015432.935594] RSP: e02b:ffffc900404af950 EFLAGS: 00000246
      [1015432.935594] RAX: 000000000004000d RBX: 000000000000000f RCX: ffffffff8100122a
      [1015432.935595] RDX: 000000000000000f RSI: 0000000000000000 RDI: 0000000000000000
      [1015432.935595] RBP: ffffffffffffffff R08: 00000000ffffff31 R09: 000000000000000f
      [1015432.935595] R10: 0002f632ef000006 R11: 0000000000000246 R12: 0000000000000000
      [1015432.935596] R13: 0002f632ef040006 R14: ffffc900404afa50 R15: ffffc900404afac8
      [1015432.935596]  ? xen_hypercall_xen_version+0xa/0x20
      [1015432.935597]  ? xen_force_evtchn_callback+0x9/0x10
      [1015432.935598]  ? check_events+0x12/0x20
      [1015432.935598]  ? xen_irq_enable_direct+0x19/0x20
      [1015432.935599]  ? truncate_exceptional_pvec_entries.part.16+0x175/0x1d0
      [1015432.935600]  ? truncate_inode_pages_range+0x280/0x7d0
      [1015432.935601]  ? deactivate_slab.isra.74+0xef/0x400
      [1015432.935602]  ? __inode_wait_for_writeback+0x75/0xe0
      [1015432.935603]  ? init_wait_var_entry+0x40/0x40
      [1015432.935605]  ? nfs4_evict_inode+0x15/0x70 [nfsv4]
      [1015432.935606]  ? evict+0xc6/0x1a0
      [1015432.935607]  ? dispose_list+0x35/0x50
      [1015432.935608]  ? prune_icache_sb+0x52/0x70
      [1015432.935608]  ? super_cache_scan+0x13c/0x190
      [1015432.935609]  ? do_shrink_slab+0x166/0x300
      [1015432.935610]  ? shrink_slab+0xdd/0x2a0
      [1015432.935611]  ? shrink_node+0xf1/0x480
      [1015432.935612]  ? kswapd+0x2b7/0x730
      [1015432.935613]  ? kthread+0xf8/0x130
      [1015432.935614]  ? mem_cgroup_shrink_node+0x180/0x180
      [1015432.935615]  ? kthread_bind+0x10/0x10
      [1015432.935616]  ? ret_from_fork+0x22/0x40
      

      I've done extensive memory tests and everything looks fine. The hosts are currently on 8.2.1 (with the latest updates). This looks to be a recent issue; I had not seen it before upgrading to XCP-ng 8.2.x.
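
      In case it's useful, here's a rough sketch of how dom0 memory state could be sampled while the exports run, so the situation around a failure can be inspected afterwards (the log path and interval are arbitrary placeholders, not my actual setup):

      # Sketch: sample dom0 memory state every 10 seconds during an export.
      # An order-0 GFP_ATOMIC failure means free pages dipped below the atomic
      # watermark, so MemFree/Slab around the time of the event are the
      # interesting numbers; /proc/buddyinfo shows the per-order free lists.
      LOG=/var/log/dom0-mem-watch.log
      while true; do
          {
              date
              grep -E '^(MemFree|Buffers|Cached|Slab)' /proc/meminfo
              cat /proc/buddyinfo
              echo '---'
          } >> "$LOG"
          sleep 10
      done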

      • olivierlambert (Vates 🪐 Co-Founder CEO)

        Hi,

        Thanks for your feedback. Let's see if this rings a bell for @dthenot @andSmv and/or @gduperrey @stormi

        • bleader (Vates 🪐 XCP-ng Team)

          When you say

          under high load

           Do you mean the export itself is creating the heavy load, or is there something else in your setup creating a heavy load at the same time?

           Since nfsv4 appears in the call trace, I would guess you're doing an export from XO, i.e. downloading the export over HTTP, and that the VM is originally on a shared SR. Is that right?

           The bottom of the call trace goes kswapd -> shrink_slab, so I would venture it is trying to free up memory. As in my first question: if there is something else putting heavy load on the host, this could be expected; if not, it could be a new leak or higher memory consumption during the export for one reason or another. But I doubt it's a leak, as I would expect that to end in an OOM rather than this kind of warning.

           If you're sure the setup and the load outside of the export are the same as before, then it could indeed be something from the updates.
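
           One way to tell those apart would be to snapshot slab usage before and after each export run; steady growth across runs points more toward a leak, while a spike that settles back down points toward transient pressure. A rough sketch, not an official procedure:

           # Snapshot overall slab usage plus the biggest slab caches; run it
           # before and after an export and compare across several runs.
           echo "=== $(date) ==="
           grep -E '^(Slab|SReclaimable|SUnreclaim)' /proc/meminfo
           # Top caches by active objects (needs root; the two header lines
           # of /proc/slabinfo sort to the bottom and are harmless)
           sort -rn -k2 /proc/slabinfo 2>/dev/null | head -n 10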

           • Weppel @bleader

            The export is creating heavy load. It's doing full VM dumps with compression on all 3 nodes in the cluster at the same time (one VM per node).

             The export is done through the CLI (/usr/bin/xe vm-export vm=$VM filename="$FILENAME" compress=zstd) to a locally mounted NFSv4 folder, not via XO.

             The original VM storage is indeed on a shared SR, a Ceph RBD.

             The setup has not changed for ~1.5 years. This issue started popping up about 1-2 months ago (it could be longer; I unfortunately don't have a specific date for when it first happened).
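
             For completeness, the per-node job is essentially along these lines (the VM name, mount point and filename pattern below are placeholders, not my actual configuration):

             #!/bin/bash
             # Per-node export job sketch: one full, zstd-compressed dump of the
             # node's VM onto the NFSv4 mount. Names and paths are placeholders.
             VM="vm-on-this-node"
             DEST=/mnt/backup-nfs
             FILENAME="$DEST/${VM}_$(date +%Y%m%d).xva"
             /usr/bin/xe vm-export vm="$VM" filename="$FILENAME" compress=zstd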

             • olivierlambert (Vates 🪐 Co-Founder CEO)

               I know it's not an answer to your original problem, but FYI, you would have far less Dom0 load doing XO incremental backups (especially using NBD).

               • Weppel @olivierlambert

                @olivierlambert said in kswapd0: page allocation failure under high load:

                 I know it's not an answer to your original problem, but FYI, you would have far less Dom0 load doing XO incremental backups (especially using NBD).

                 Thanks for the hint. I've been looking at switching to XO for these (and other) tasks, but pricing is currently holding me back from switching.

                 • olivierlambert (Vates 🪐 Co-Founder CEO)

                  Use it from the sources then 😉
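
                   Rough outline of what that looks like, assuming a dedicated VM with Node.js, Yarn and the usual build dependencies already in place (check the official documentation for the current, authoritative steps):

                   # XO from the sources, sketched; see the docs for details.
                   git clone -b master https://github.com/vatesfr/xen-orchestra
                   cd xen-orchestra
                   yarn && yarn build
                   cd packages/xo-server
                   yarn start    # then point a browser at the configured host/port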
