[HELP] XCP-ng 4.17.5 dom0 kernel panic — page fault in TCP stack, crashdump attached

dnikola

Hi everyone,

I'm reaching out to ask for help analyzing a recent crash on one of our XCP-ng hosts.
We experienced a dom0 kernel panic caused by a page fault in the TCP stack. I’ve collected and parsed the crash dumps and would appreciate your feedback, recommendations, or confirmation whether this is a known issue.

Environment:
XCP-ng hypervisor version: 4.17.5-13
Dom0 kernel: 4.19.0+1
CPU: Intel
24 physical CPUs (PCPUs)

Crashkernel configured and working via kexec

Crash summary:
The crash occurred due to a page fault inside the TCP stack of the dom0 kernel at virtual address:

0xffff8882b6ed0000 - level 1 page table not present
This triggered an NMI (Non-Maskable Interrupt) and crash dump via kexec.

dom0.log details:
WARN paging error for vaddr 0xffff8882b6ed0000 - level 1 not present

Call Trace:
[ffffffff81071fa5] panic+0x111/0x27c
[ffffffff8102796f] oops_end+0xcf/0xd0
[ffffffff8105da73] no_context+0x1b3/0x3c0
[ffffffff8169885c] tcp_check_space+0x4c/0xf0
[ffffffff8105e33a] __do_page_fault+0xaa/0x4f0
Indicates page fault in tcp_check_space() leading to kernel panic.

xen.log details:
Xen hypervisor correctly triggered NMI crash handling across all PCPUs.
Key stack trace on all CPUs:

kexec_crash_save_cpu()
do_nmi_crash()
Most CPUs were in idle loops (xen_safe_halt), while dom0 VCPU0 crashed on PCPU20.

xen-crashdump-analyser output:
Several warnings like:

WARN Cannot get kernel page table address - VCPU assumed down
indicating missing page tables at crash time for multiple VCPUs.

Confirmed same paging error on the same virtual address:

WARN paging error for vaddr 0xffff8882b6ed0000 - level 1 not present
Dom0 had VCPUs active on PCPU18, PCPU0, PCPU11, PCPU2 at crash time.

Several guest VMs had VCPUs active, suggesting moderate to high workload.

Summary:
It seems to be a page fault on a virtual memory address inside the dom0 TCP stack that caused a panic.
The crashdump shows memory mapping inconsistencies (missing page tables) for various VCPUs after the crash was triggered.
This might suggest a kernel bug, unstable network driver, or potential hardware-related issue like sporadic memory corruption.

Questions:

Has anyone experienced similar page faults in the dom0 TCP stack on 4.19 kernels or XCP-ng 4.17.5?
Are there any known issues with network drivers on this kernel/hypervisor combo?
Would you recommend moving to a newer dom0 kernel or hypervisor build?
Could a memory issue cause this specific kind of page table inconsistency during a kernel panic?
Any advice on additional debug steps or log files I should collect next time?

Download crashdump package

Thank you all in advance for your time and input — really appreciate it!

olivierlambert

It's a lot easier just to read the right section in the dmesg:

[ 604291.140493]  ALERT: BUG: unable to handle kernel paging request at 000000000100001c
[ 604291.140503]   INFO: PGD 10b740067 P4D 10b740067 PUD 107933067 PMD 0 
[ 604291.140509]   WARN: Oops: 0000 [#1] SMP NOPTI
[ 604291.140513]   WARN: CPU: 0 PID: 0 Comm: swapper/0 Tainted: G           O      4.19.0+1 #1
[ 604291.140516]   WARN: Hardware name: Gigabyte Technology Co., Ltd. Z690 UD/Z690 UD, BIOS F6 01/27/2022
[ 604291.140524]   WARN: RIP: e030:rtl8125_rx_interrupt+0x2ae/0x7d0 [r8125]
[ 604291.140527]   WARN: Code: 53 14 74 50 48 8b 4c 24 60 65 48 33 0c 25 28 00 00 00 44 89 e8 0f 85 64 04 00 00 48 83 c4 68 5b 5d 41 5c 41 5d 41 5e 41 5f c3 <45> 8b 75 1c e9 fe fd ff ff f6 85 e9 35 00 00 02 0f 85 68 04 00 00
[ 604291.140534]   WARN: RSP: e02b:ffff8882b6803e40 EFLAGS: 00010246
[ 604291.140537]   WARN: RAX: 0000000000000000 RBX: ffff8882a5aa0068 RCX: 00000000684ce000
[ 604291.140540]   WARN: RDX: 0000000000000000 RSI: 0000000001000000 RDI: 0000000000044211
[ 604291.140543]   WARN: RBP: ffff8882a5a808c0 R08: 00000000000531a1 R09: 000000001458e000
[ 604291.140546]   WARN: R10: 0000000000000000 R11: 0000000000000000 R12: ffff8882a5a80000
[ 604291.140550]   WARN: R13: 0000000001000000 R14: 0000000000000004 R15: 0000000000000000
[ 604291.140558]   WARN: FS:  0000000000000000(0000) GS:ffff8882b6800000(0000) knlGS:0000000000000000
[ 604291.140562]   WARN: CS:  e033 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 604291.140565]   WARN: CR2: 000000000100001c CR3: 000000000723e000 CR4: 0000000000040660
[ 604291.140570]   WARN: Call Trace:
[ 604291.140574]   WARN:  <IRQ>
[ 604291.140580]   WARN:  ? __napi_schedule+0x4a/0x50
[ 604291.140584]   WARN:  ? rtl8125_interrupt_msix+0x62/0xe0 [r8125]
[ 604291.140589]   WARN:  ? __handle_irq_event_percpu+0x4d/0x1a0
[ 604291.140593]   WARN:  rtl8125_poll_msix_rx+0x4a/0x90 [r8125]
[ 604291.140596]   WARN:  net_rx_action+0x2a5/0x3e0
[ 604291.140600]   WARN:  __do_softirq+0xd1/0x28c
[ 604291.140604]   WARN:  irq_exit+0xa8/0xc0
[ 604291.140608]   WARN:  xen_evtchn_do_upcall+0x2c/0x50
[ 604291.140611]   WARN:  xen_do_hypervisor_callback+0x29/0x40
[ 604291.140614]   WARN:  </IRQ>
[ 604291.140617]   WARN: RIP: e030:xen_hypercall_sched_op+0xa/0x20
[ 604291.140620]   WARN: Code: 51 41 53 b8 1c 00 00 00 0f 05 41 5b 59 c3 cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc 51 41 53 b8 1d 00 00 00 0f 05 <41> 5b 59 c3 cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc
[ 604291.140626]   WARN: RSP: e02b:ffffffff82003e58 EFLAGS: 00000246
[ 604291.140629]   WARN: RAX: 0000000000000000 RBX: ffffffff82011740 RCX: ffffffff810013aa
[ 604291.140632]   WARN: RDX: ffffffff8203d250 RSI: 0000000000000000 RDI: 0000000000000001
[ 604291.140635]   WARN: RBP: 0000000000000000 R08: 0000000000000008 R09: 000225a01b5e9188
[ 604291.140639]   WARN: R10: ffffc9004178b930 R11: 0000000000000246 R12: 0000000000000000
[ 604291.140642]   WARN: R13: 0000000000000000 R14: ffffffff82011740 R15: ffffffff82011740
[ 604291.140646]   WARN:  ? xen_hypercall_sched_op+0xa/0x20
[ 604291.140650]   WARN:  ? xen_safe_halt+0xc/0x20
[ 604291.140653]   WARN:  ? default_idle+0x1a/0x140
[ 604291.140656]   WARN:  ? do_idle+0x1ea/0x260
[ 604291.140659]   WARN:  ? cpu_startup_entry+0x6f/0x80
[ 604291.140663]   WARN:  ? start_kernel+0x558/0x578
[ 604291.140666]   WARN:  ? set_init_arg+0x55/0x55
[ 604291.140668]   WARN:  ? xen_start_kernel+0x5a0/0x5aa
[ 604291.140671]   WARN: Modules linked in: tun bnx2fc(O) cnic(O) uio fcoe libfcoe libfc scsi_transport_fc nfnetlink_cttimeout nfnetlink openvswitch nsh nf_nat_ipv6 nf_nat_ipv4 nf_conncount nf_nat 8021q garp mrp stp llc ipt_REJECT nf_reject_ipv4 dm_multipath xt_tcpudp xt_multiport xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c iptable_filter sunrpc nls_iso8859_1 nls_cp437 vfat fat hid_generic crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc usbhid hid dm_mod aesni_intel aes_x86_64 crypto_simd cryptd glue_helper sg video backlight ip_tables x_tables raid1 sd_mod md_mod ahci libahci nvme xhci_pci r8125(O) nvme_core libata xhci_hcd scsi_dh_rdac scsi_dh_hp_sw scsi_dh_emc scsi_dh_alua scsi_mod efivarfs ipv6 crc_ccitt
[ 604291.140714]   WARN: CR2: 000000000100001c
[ 604291.140720]   WARN: ---[ end trace d652c60ff1bf708b ]---
[ 604291.140724]   WARN: RIP: e030:rtl8125_rx_interrupt+0x2ae/0x7d0 [r8125]
[ 604291.140727]   WARN: Code: 53 14 74 50 48 8b 4c 24 60 65 48 33 0c 25 28 00 00 00 44 89 e8 0f 85 64 04 00 00 48 83 c4 68 5b 5d 41 5c 41 5d 41 5e 41 5f c3 <45> 8b 75 1c e9 fe fd ff ff f6 85 e9 35 00 00 02 0f 85 68 04 00 00
[ 604291.140733]   WARN: RSP: e02b:ffff8882b6803e40 EFLAGS: 00010246
[ 604291.140736]   WARN: RAX: 0000000000000000 RBX: ffff8882a5aa0068 RCX: 00000000684ce000
[ 604291.140739]   WARN: RDX: 0000000000000000 RSI: 0000000001000000 RDI: 0000000000044211
[ 604291.140743]   WARN: RBP: ffff8882a5a808c0 R08: 00000000000531a1 R09: 000000001458e000
[ 604291.140746]   WARN: R10: 0000000000000000 R11: 0000000000000000 R12: ffff8882a5a80000
[ 604291.140749]   WARN: R13: 0000000001000000 R14: 0000000000000004 R15: 0000000000000000
[ 604291.140756]   WARN: FS:  0000000000000000(0000) GS:ffff8882b6800000(0000) knlGS:0000000000000000
[ 604291.140760]   WARN: CS:  e033 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 604291.140762]   WARN: CR2: 000000000100001c CR3: 000000000723e000 CR4: 0000000000040660
[ 604291.140768]  EMERG: Kernel panic - not syncing: Fatal exception in interrupt

It's a gaming motherboard with a rather cheap RealTek network interface, and:

Your BIOS is very outdated (the latest seems to be F31d, so you are 3 years behind).
I would try again after updating it. At least we know the culprit is the rtl8125 driver, while handling an interrupt on receiving a packet.

If you still have the problem after a BIOS update, we'll discuss what's next
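
For anyone reading a similar trace, the key fields can be pulled out mechanically. The sketch below is only an illustration: `summarize_oops` and both regular expressions are hypothetical helpers, not part of any XCP-ng tool, and the input is a four-line excerpt of the log quoted above. It extracts the faulting function and module from the RIP line and compares CR2 with R13. The marked bytes in the Code: line, `<45> 8b 75 1c`, appear to decode to a 32-bit load from `[r13+0x1c]`, which would explain why CR2 sits exactly 0x1c above R13.

```python
import re

# Excerpt copied from the dmesg section quoted above.
OOPS = """\
[ 604291.140493]  ALERT: BUG: unable to handle kernel paging request at 000000000100001c
[ 604291.140524]   WARN: RIP: e030:rtl8125_rx_interrupt+0x2ae/0x7d0 [r8125]
[ 604291.140550]   WARN: R13: 0000000001000000 R14: 0000000000000004 R15: 0000000000000000
[ 604291.140565]   WARN: CR2: 000000000100001c CR3: 000000000723e000 CR4: 0000000000040660
"""

# "RIP: <selector>:<function>+0x<offset>/0x<size> [<module>]" (module is optional)
RIP_RE = re.compile(
    r"RIP: \w+:(?P<func>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+(?: \[(?P<module>\w+)\])?"
)
# "<register>: <16 hex digits>", e.g. "R13: 0000000001000000" or "CR2: ..."
REG_RE = re.compile(r"\b(?P<name>R\d{2}|R[A-Z]{2}|CR\d): (?P<value>[0-9a-f]{16})\b")


def summarize_oops(text):
    """Return the faulting function, its module, and the register values found."""
    rip = RIP_RE.search(text)
    regs = {m["name"]: int(m["value"], 16) for m in REG_RE.finditer(text)}
    return {
        "function": rip["func"] if rip else None,
        "module": rip["module"] if rip else None,
        "registers": regs,
    }


summary = summarize_oops(OOPS)
regs = summary["registers"]
print(summary["function"], summary["module"])
print(hex(regs["CR2"] - regs["R13"]))
```

On this excerpt it prints `rtl8125_rx_interrupt r8125` and `0x1c`. R13 holds 0x1000000, which does not look like a kernel pointer (the other pointers in the dump start with ffff8882), so the driver was most likely following a bad pointer in its receive path rather than anything in the TCP stack.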

bleader

@dnikola said in [HELP] XCP-ng 4.17.5 dom0 kernel panic — page fault in TCP stack, crashdump attached:

Has anyone experienced similar page faults in the dom0 TCP stack on 4.19 kernels or XCP-ng 4.17.5?

Not that I know of.

Are there any known issues with network drivers on this kernel/hypervisor combo?

No, but there can be issues with some drivers. You should have specified which NICs and drivers you are using.

Would you recommend moving to a newer dom0 kernel or hypervisor build?

The latest XCP-ng version is 8.3. You didn't specify your version in your post, but you're using the latest version of Xen, so I assume it is an up-to-date 8.3, in which case there is no newer build.

Could a memory issue cause this specific kind of page table inconsistency during a kernel panic?

Yes, it can be a bug in the code, but it absolutely could be a hardware issue.

Any advice on additional debug steps or log files I should collect next time?

I would start by running a memtest on that host to make sure the memory is not having issues.

Do you know if there was a specific VM doing something specific at that time? We had some issues in the past with FreeBSD VMs using wireguard, but it does not look similar, and it should be fixed now.
What kind of guests were running on that host? Linux, Windows, some BSD-based?
If you are running Windows guests, please be sure to read this blog post and follow the guidelines there.

From a quick look, I don't see anything obvious. Follow Olivier's suggestion first; if you still have issues after that, you can share an additional report using xen-bugtool -y. But please be sure to update your BIOS first, check your memory, and then do that.
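
For the NIC and driver details asked about above, the information can be read straight from sysfs in dom0. This is a minimal sketch, assuming the standard Linux sysfs layout (`/sys/class/net/<iface>/device/driver` and `/sys/module/<name>/version`) and assuming the driver name matches the module name, which is not guaranteed for every driver. The function names and the `sysfs_root` parameter are hypothetical, added only so the code can be tried against a test directory.

```python
import os


def nic_driver_info(iface, sysfs_root="/sys"):
    """Return (driver, version) for a network interface, using sysfs only.

    version is None when the module does not export a version file.
    """
    driver_link = os.path.join(sysfs_root, "class", "net", iface, "device", "driver")
    # The "driver" entry is a symlink; its target's last component is the driver name.
    driver = os.path.basename(os.path.realpath(driver_link))
    # Assumption: the kernel module is named like the driver.
    version_path = os.path.join(sysfs_root, "module", driver, "version")
    try:
        with open(version_path) as f:
            version = f.read().strip()
    except OSError:
        version = None
    return driver, version


def list_nics_with_driver(sysfs_root="/sys"):
    """Interfaces that have a device with a bound driver (skips lo, bridges)."""
    net_dir = os.path.join(sysfs_root, "class", "net")
    return sorted(
        name
        for name in os.listdir(net_dir)
        if os.path.exists(os.path.join(net_dir, name, "device", "driver"))
    )
```

On the host, looping over `list_nics_with_driver()` and printing `nic_driver_info(iface)` for each entry gives one driver/version pair per interface, which is the kind of detail worth including in a first post.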

dnikola

Hi @olivierlambert @bleader

thank you both again for the detailed replies and suggestions. I’d like to provide a bit more context about our setup and situation:

Situation Summary:
We’re currently running XCP-ng 8.3 with Xen 4.17.5-13 on a mix of servers, including some older, obsolete hardware.

Interestingly, XCP-ng 8.2 runs without issues on identical hardware configurations — no crashes, even under the same workloads.

On this particular host, we’ve experienced 10 crashes so far, and in almost every case the crash happened while performing delta backups from Xen Orchestra.
This seems to consistently trigger the issue under higher network load.

We’ve already performed full memory tests (memtest86+) on this host, and the results came back clean — no memory errors found.

The servers are currently physically located at a remote site, which makes immediate hands-on intervention difficult.
We’re organizing a visit to the site to update the BIOS and potentially replace the Realtek NIC with a supported Intel NIC as suggested. This intervention will happen as soon as logistically possible.

Question:
Is there anything else you would recommend we check or do remotely in the meantime before our on-site intervention?

And once we're physically on-site, aside from:

Updating the BIOS
Swapping NIC hardware

is there anything else you’d recommend we inspect or collect while we’re there?

I appreciate your help and guidance — and thank you again for pointing us in the right direction so quickly.

One more important question: which guest tools do you recommend for Windows Server 2019, 2022, and Windows 10?
Is 9.4.0 the right one?

olivierlambert

If it worked with 8.2, it's potentially the version of the NIC driver: in general, things can go well with a specific firmware+driver combination, and maybe you ended up with a different combo that's not great. So first, updating the BIOS/firmware of the machine AND the NICs is likely the best next move (or swapping for a better NIC).

For the tools, @dinhngtu can provide guidance

dinhngtu

@dnikola Here are our driver recommendations:

For non-prod environments (homelab, test VMs, whenever possible): use the new XCP-ng drivers in testsign mode. We'd really appreciate having people to test the driver/guest agent and provide us with feedback.
For prod environments: use XenServer drivers. (9.4.1 or later to avoid the recent vulnerability)

dnikola

Hi, thanks for your kind reply.

Here is one more crash log from a different server: identical hardware, identical problems.
What I noticed before this crash is that the server got stuck creating delta backups at 02:00 AM.
It had a few pending tasks returned by the xo task-list command and was not accessible from XO or the admin XCP-ng Center. After a XAPI toolstack restart over SSH, the connection was restored, but now I see that the server restarted after that.

olivierlambert

Another system which is also a consumer grade motherboard, right?

Hardware name: ASUS System Product Name/PRIME Z790-P, BIOS 1663 08/08/2024

Your BIOS is also outdated.

What NIC are you using in there?

dnikola

@olivierlambert

Your BIOS is also outdated.

Yes, same BIOS

What NIC are you using in there?

Same motherboard NIC, and there is one more NIC card used just for the SIP trunk.

There is one more server which has fewer problems (because of a slow ISP, backups have been temporarily disabled). Crashes are not as frequent there, but they do happen... without crash log files... especially the toolstack...

Regarding the NIC, a local seller has this 2.5 Gbps card: https://www.cudy.com/en-eu/products/pe25-1-0
Would it be better?

olivierlambert

It's unrelated, are you really building a production infrastructure with non-server grade hardware? There's a good reason people don't use consumer-grade hardware: it's not meant for it, it works for basic usage but you can easily encounter buggy BIOS, firmware, ACPI tables and so on. It doesn't have the QA process done on server-grade hardware.

It's a LOT better to purchase refurb stuff (even a cheap refurb 10G Intel NIC will be 1000 times better than any RealTek crap you can purchase brand new).

dnikola

Thanks for letting me know something that I already know.
But from time to time the situation is what it is, and we need to adapt to it (lack of HW, lack of budget, etc.).

dnikola

@olivierlambert please let me know one or two models of NIC card. I will purchase them over eBay, because the local seller would not be able to deliver them, and keep them just in case for the future debugging process.

For the last X years, until 8.3, we could put XCP on any damn hardware and never had any problem... This is our experience.

olivierlambert

As I said, you were lucky: with consumer-grade hardware, "not crashing" is a question of hitting the right combination of driver version and firmware version (even on server-grade hardware it could happen, but it's simply less likely).

That's why I would advise updating all the firmware first. The next step is to play with an alternative kernel driver for the NIC (if there's one); you could even start there if you like.

At least, you know how to trigger it (maybe a simple iperf would be enough) so you can test various drivers and/or firmware versions.
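
If iperf is not at hand, sustained TCP traffic can be generated with a few lines of Python. This is a rough stand-in for illustration only, not a replacement for iperf: all function names are hypothetical, and the `loopback_demo` helper runs over 127.0.0.1, so it exercises no NIC at all. Since the trace above points at the receive path (`rtl8125_rx_interrupt`), a real test would presumably run the sink on the affected host and the sender on another machine, so that the traffic arrives through the Realtek NIC.

```python
import socket
import threading
import time


def run_sink(server_sock, result):
    """Accept one connection and count the bytes received until EOF."""
    conn, _ = server_sock.accept()
    total = 0
    with conn:
        while True:
            chunk = conn.recv(65536)
            if not chunk:
                break
            total += len(chunk)
    result["bytes"] = total


def send_load(host, port, seconds, chunk_size=65536):
    """Send zero-filled chunks to host:port for roughly `seconds`; return bytes sent."""
    payload = bytes(chunk_size)
    sent = 0
    with socket.create_connection((host, port)) as sock:
        deadline = time.monotonic() + seconds
        while time.monotonic() < deadline:
            sock.sendall(payload)
            sent += chunk_size
    return sent


def loopback_demo(seconds=0.2):
    """Run sink and sender in one process over 127.0.0.1 (exercises no NIC)."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    server.listen(1)
    result = {}
    sink = threading.Thread(target=run_sink, args=(server, result))
    sink.start()
    sent = send_load("127.0.0.1", server.getsockname()[1], seconds)
    sink.join()
    server.close()
    return sent, result["bytes"]
```

Calling `loopback_demo()` returns the bytes sent and the bytes received, which should match. Running the same load for a fixed duration against each driver or firmware version gives a repeatable way to compare them.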

TeddyAstie

cc @andrew

It looks like an issue with https://github.com/xcp-ng-rpms/r8125-module, though I am not completely sure what is going on, or why the page table suddenly becomes invalid.

Andrew

@TeddyAstie It could be! The crash does look like a r8125 driver issue.

It's older Realtek code that has been working well on XCP systems. The newer code currently released by Realtek has new issues, so there's no quick direct update... I have not seen the current r8125 XCP driver cause crashes.

I would point my finger back at this specific system and some odd condition that the driver does not handle correctly.

As this is vendor code, there is no upstream Linux testing.... so, no non-XCP problem reports.

olivierlambert

Yeah, I would avoid those shitty cards as much as possible. Is there any more recent driver anyway?

Andrew

@olivierlambert There are new versions of the r812x drivers. They don't compile cleanly for XCP. The r8127 driver was withdrawn and the r8125/r8126 was split into two different drivers. Realtek never publishes release notes.

I'll have to test the new driver and see if it's worth trying. I don't know if an update would solve this panic issue, as there are lots of undocumented code changes.

The forum has been quiet about new r8125 issues, so in general the current driver has been working well enough. Just two issues I remember, including this one.

Realtek has also released new hardware revisions of the r812x chips that need new driver support and are only recently supported by their vendor driver and in upstream Linux 6.15 (not even 6.12 LTS yet).

As for the r8127, it looks like it could be a desktop game changer as it's a small cheap low power 10G chip. But like the others, its release is delayed and it does not have driver support yet (or test samples).

olivierlambert

That would be interesting if we have a test driver solving this very issue here, but I wouldn't expect too much either. Those drivers are as bad as the chips.

Andrew

@dnikola Please make sure your motherboard firmware is up to date (BIOS F30e). There are a LOT of stability issues with Intel CPUs for that board and old BIOS.

If you still have r8125 crashes, then try a newer r8125 alt version (9.016.00) from my download page and see if it works better. I gave it a quick test and it installs and works, but YMMV... You can always uninstall it.

Andrew

@dnikola As for the other card you listed, no, it's still an 8125 card. The single-port 10G card (from the same site) uses an AQC113 chipset; you'll need to install the atlantic-module-alt to support it. If you must have 2.5G, then the Intel i225/i226 card is the other choice (not from that site).