The Rocky Linux bugtracker indeed mentions it's mostly fixed, but there are still some kernel errors present: https://bugs.rockylinux.org/view.php?id=3565#c4293
Posts
-
RE: Live migrate of Rocky Linux 8.8 VM crashes/reboots VM
-
RE: Live migrate of Rocky Linux 8.8 VM crashes/reboots VM
FYI, this is not fixed yet in the latest EL kernel, 4.18.0-477.15.1.el8_8.x86_64.
-
RE: kswapd0: page allocation failure under high load
@olivierlambert said in kswapd0: page allocation failure under high load:
I know it's not an answer to your original problem, but FYI, you would have far less Dom0 load doing XO incremental backups (especially using NBD).
Thanks for the hint. I've been looking at switching to XO for these (and other) tasks, but the pricing is currently keeping me from switching.
-
RE: kswapd0: page allocation failure under high load
The export is creating heavy load. It's doing full VM dumps with compression on all 3 nodes in the cluster at the same time (one VM per node).
The export is done through the CLI (/usr/bin/xe vm-export vm=$VM filename="$FILENAME" compress=zstd) to a locally mounted NFSv4 folder, not via XO. The original VM storage is indeed on a shared SR, a Ceph RBD.
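For reference, the nightly job looks roughly like this; it's a minimal sketch, and the VM name, NFS mount point, and file naming are assumptions, not the exact script:

#!/bin/bash
# Hedged sketch of the export job described above.
NFS_DIR=/mnt/backup-nfs            # locally mounted NFSv4 share (assumed path)
VM="app-server-01"                 # VM name-label (assumed name)
FILENAME="$NFS_DIR/${VM}-$(date +%F).xva"

# xe vm-export streams a full dump of the VM to the file; compress=zstd
# makes dom0 compress the stream, which is where the heavy load comes from.
/usr/bin/xe vm-export vm="$VM" filename="$FILENAME" compress=zstd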
The setup has not changed for ~1.5 years. This issue started popping up about 1-2 months ago (possibly longer; unfortunately I don't have the exact date it first appeared).
-
kswapd0: page allocation failure under high load
When running VM exports, the pool master intermittently logs multiple page allocation failures, like this one:
[1015432.935572] kswapd0: page allocation failure: order:0, mode:0x480020(GFP_ATOMIC), nodemask=(null)
[1015432.935572] kswapd0 cpuset=/ mems_allowed=0
[1015432.935573] CPU: 4 PID: 109 Comm: kswapd0 Tainted: G O 4.19.0+1 #1
[1015432.935573] Hardware name: Supermicro Super Server/H12SSL-CT, BIOS 2.3 10/20/2021
[1015432.935573] Call Trace:
[1015432.935574] <IRQ>
[1015432.935574] dump_stack+0x5a/0x73
[1015432.935575] warn_alloc+0xee/0x180
[1015432.935576] __alloc_pages_slowpath+0x84d/0xa09
[1015432.935577] ? get_page_from_freelist+0x14c/0xf00
[1015432.935578] ? ttwu_do_wakeup+0x19/0x140
[1015432.935579] ? _raw_spin_unlock_irqrestore+0x14/0x20
[1015432.935580] ? try_to_wake_up+0x54/0x450
[1015432.935581] __alloc_pages_nodemask+0x271/0x2b0
[1015432.935582] bnxt_rx_pages+0x194/0x4f0 [bnxt_en]
[1015432.935584] bnxt_rx_pkt+0xccd/0x1510 [bnxt_en]
[1015432.935586] __bnxt_poll_work+0x10e/0x2a0 [bnxt_en]
[1015432.935588] bnxt_poll+0x8d/0x640 [bnxt_en]
[1015432.935589] net_rx_action+0x2a5/0x3e0
[1015432.935590] __do_softirq+0xd1/0x28c
[1015432.935590] irq_exit+0xa8/0xc0
[1015432.935591] xen_evtchn_do_upcall+0x2c/0x50
[1015432.935592] xen_do_hypervisor_callback+0x29/0x40
[1015432.935592] </IRQ>
[1015432.935593] RIP: e030:xen_hypercall_xen_version+0xa/0x20
[1015432.935593] Code: 51 41 53 b8 10 00 00 00 0f 05 41 5b 59 c3 cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc 51 41 53 b8 11 00 00 00 0f 05 <41> 5b 59 c3 cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc
[1015432.935594] RSP: e02b:ffffc900404af950 EFLAGS: 00000246
[1015432.935594] RAX: 000000000004000d RBX: 000000000000000f RCX: ffffffff8100122a
[1015432.935595] RDX: 000000000000000f RSI: 0000000000000000 RDI: 0000000000000000
[1015432.935595] RBP: ffffffffffffffff R08: 00000000ffffff31 R09: 000000000000000f
[1015432.935595] R10: 0002f632ef000006 R11: 0000000000000246 R12: 0000000000000000
[1015432.935596] R13: 0002f632ef040006 R14: ffffc900404afa50 R15: ffffc900404afac8
[1015432.935596] ? xen_hypercall_xen_version+0xa/0x20
[1015432.935597] ? xen_force_evtchn_callback+0x9/0x10
[1015432.935598] ? check_events+0x12/0x20
[1015432.935598] ? xen_irq_enable_direct+0x19/0x20
[1015432.935599] ? truncate_exceptional_pvec_entries.part.16+0x175/0x1d0
[1015432.935600] ? truncate_inode_pages_range+0x280/0x7d0
[1015432.935601] ? deactivate_slab.isra.74+0xef/0x400
[1015432.935602] ? __inode_wait_for_writeback+0x75/0xe0
[1015432.935603] ? init_wait_var_entry+0x40/0x40
[1015432.935605] ? nfs4_evict_inode+0x15/0x70 [nfsv4]
[1015432.935606] ? evict+0xc6/0x1a0
[1015432.935607] ? dispose_list+0x35/0x50
[1015432.935608] ? prune_icache_sb+0x52/0x70
[1015432.935608] ? super_cache_scan+0x13c/0x190
[1015432.935609] ? do_shrink_slab+0x166/0x300
[1015432.935610] ? shrink_slab+0xdd/0x2a0
[1015432.935611] ? shrink_node+0xf1/0x480
[1015432.935612] ? kswapd+0x2b7/0x730
[1015432.935613] ? kthread+0xf8/0x130
[1015432.935614] ? mem_cgroup_shrink_node+0x180/0x180
[1015432.935615] ? kthread_bind+0x10/0x10
[1015432.935616] ? ret_from_fork+0x22/0x40
I've run extensive memory tests and everything looks fine. The hosts are currently on XCP-ng 8.2.1 (with the latest updates). This appears to be a recent issue; I had not seen it prior to upgrading to XCP-ng 8.2.x.
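If anyone wants to check whether their own dom0 is affected, something like the following should do; the watermark value at the end is an illustrative assumption, not a confirmed fix for this trace:

# Count occurrences of the failure in the kernel ring buffer.
dmesg | grep -c 'page allocation failure'

# Inspect the memory reserve kept for atomic (GFP_ATOMIC) allocations.
sysctl vm.min_free_kbytes

# Raising the reserve is a common mitigation for atomic allocation failures
# in a NIC RX path; 131072 is an assumed test value, not a confirmed fix.
# sysctl -w vm.min_free_kbytes=131072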
-
RE: Live migrate of Rocky Linux 8.8 VM crashes/reboots VM
@bleader Thank you very much for the quick discovery of this, impressive work! I'm glad I could help!
-
RE: Live migrate of Rocky Linux 8.8 VM crashes/reboots VM
@stormi Thanks for noticing, my bad
-
RE: Live migrate of Rocky Linux 8.8 VM crashes/reboots VM
I've created a bug report at Rocky Linux: https://bugs.rockylinux.org/view.php?id=3565
Feel free to add to this if I missed any relevant information.
-
RE: Live migrate of Rocky Linux 8.8 VM crashes/reboots VM
@olivierlambert Thank you very much for the quick follow-up.
I've done some testing with a colleague and it looks to be kernel-related. The stock Rocky Linux 8.8 kernel (4.18.0-477.13.1.el8_8.x86_64) causes the reboot. Upgrading to ELRepo's kernel-lt (5.4.245-1.1.el8.elrepo.x86_64) allows the VM to be live-migrated again without a reboot/crash.
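For anyone needing the same workaround, this is roughly how we switched a guest to kernel-lt; it follows ELRepo's standard setup and assumes an EL 8.x guest:

# Import ELRepo's signing key and enable the repository (standard ELRepo setup).
rpm --import https://www.elrepo.org/RPM-GPG-KEY-elrepo.org
dnf install https://www.elrepo.org/elrepo-release-8.el8.elrepo.noarch.rpm

# Install the long-term-support kernel and boot into it.
dnf --enablerepo=elrepo-kernel install kernel-lt
reboot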
-
Live migrate of Rocky Linux 8.8 VM crashes/reboots VM
I've updated a few VMs from Rocky Linux 8.7 to 8.8. Live migrating these 8.8 VMs now causes the VM to reboot instead of completing a regular live migration. This issue doesn't happen with Rocky Linux 8.7.
I've tried updating the xe-tools to the latest version available on GitHub (7.20.2-1), but this didn't solve the issue.
Is this a known issue already? Any idea how to debug this further?
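For reference, this is what I check inside the guest before attempting the migration; the guest tools package name is an assumption and may differ depending on how the tools were installed:

# Run inside the guest before the live migration.
uname -r                    # running kernel, e.g. 4.18.0-477.13.1.el8_8.x86_64
rpm -qa | grep -i xe-guest  # installed guest tools package, if any (name varies)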