XCP-ng

    [HELP] XCP-ng 4.17.5 dom0 kernel panic — page fault in TCP stack, crashdump attached

      dnikola

      Hi everyone,

      I'm reaching out to ask for help analyzing a recent crash on one of our XCP-ng hosts.
      We experienced a dom0 kernel panic caused by a page fault in the TCP stack. I’ve collected and parsed the crash dumps and would appreciate your feedback, recommendations, or confirmation of whether this is a known issue.

      📌 Environment:
      XCP-ng hypervisor version: 4.17.5-13
      Dom0 kernel: 4.19.0+1
      CPU: Intel
      24 physical CPUs (PCPUs)

      Crashkernel configured and working via kexec

      📊 Crash summary:
      The crash occurred due to a page fault inside the TCP stack of the dom0 kernel at virtual address:

      0xffff8882b6ed0000 - level 1 page table not present
      This triggered an NMI (Non-Maskable Interrupt) and crash dump via kexec.

      📑 dom0.log details:
      WARN paging error for vaddr 0xffff8882b6ed0000 - level 1 not present

      Call Trace:
      [ffffffff81071fa5] panic+0x111/0x27c
      [ffffffff8102796f] oops_end+0xcf/0xd0
      [ffffffff8105da73] no_context+0x1b3/0x3c0
      [ffffffff8169885c] tcp_check_space+0x4c/0xf0
      [ffffffff8105e33a] __do_page_fault+0xaa/0x4f0
      This indicates a page fault in tcp_check_space() leading to a kernel panic.
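A minimal sketch of how frame lines in this format could be split into address, symbol, offset and function size, so the symbols can be compared against the RIP reported in the kernel's own oops message. The regular expression and the `parse_frames` helper are assumptions made for illustration and are not part of xen-crashdump-analyser. A symbol appearing in such a list is a code address found on the stack; on its own it does not identify the faulting instruction.

```python
import re

# Frame format as it appears in the dom0.log excerpt above:
#   [ffffffff8169885c] tcp_check_space+0x4c/0xf0
FRAME_RE = re.compile(
    r"\[(?P<addr>[0-9a-f]{16})\]\s+(?P<sym>[\w.]+)"
    r"\+0x(?P<off>[0-9a-f]+)/0x(?P<size>[0-9a-f]+)"
)

def parse_frames(text):
    """Return (address, symbol, offset, size) tuples in log order."""
    frames = []
    for m in FRAME_RE.finditer(text):
        frames.append(
            (int(m["addr"], 16), m["sym"], int(m["off"], 16), int(m["size"], 16))
        )
    return frames

# The five frames quoted in the post.
TRACE = """
[ffffffff81071fa5] panic+0x111/0x27c
[ffffffff8102796f] oops_end+0xcf/0xd0
[ffffffff8105da73] no_context+0x1b3/0x3c0
[ffffffff8169885c] tcp_check_space+0x4c/0xf0
[ffffffff8105e33a] __do_page_fault+0xaa/0x4f0
"""

frames = parse_frames(TRACE)
for addr, sym, off, size in frames:
    # addr - off is the start of the function that contains addr
    print(f"{sym:20s} start=0x{addr - off:x} offset=0x{off:x} size=0x{size:x}")
```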

      📑 xen.log details:
      Xen hypervisor correctly triggered NMI crash handling across all PCPUs.
      Key stack trace on all CPUs:

      kexec_crash_save_cpu()
      do_nmi_crash()
      Most CPUs were in idle loops (xen_safe_halt), while dom0 VCPU0 crashed on PCPU20.

      📑 xen-crashdump-analyser output:
      Several warnings like:

      WARN Cannot get kernel page table address - VCPU assumed down
      indicating missing page tables at crash time for multiple VCPUs.

      Confirmed same paging error on the same virtual address:

      WARN paging error for vaddr 0xffff8882b6ed0000 - level 1 not present
      Dom0 had VCPUs active on PCPU18, PCPU0, PCPU11, PCPU2 at crash time.

      Several guest VMs had VCPUs active, suggesting moderate to high workload.

      📌 Summary:
      It seems to be a page fault on a virtual memory address inside the dom0 TCP stack that caused a panic.
      The crashdump shows memory mapping inconsistencies (missing page tables) for various VCPUs after the crash was triggered.
      This might suggest a kernel bug, unstable network driver, or potential hardware-related issue like sporadic memory corruption.

      📌 Questions:

      • Has anyone experienced similar page faults in the dom0 TCP stack on 4.19 kernels or XCP-ng 4.17.5?
      • Are there any known issues with network drivers on this kernel/hypervisor combo?
      • Would you recommend moving to a newer dom0 kernel or hypervisor build?
      • Could a memory issue cause this specific kind of page table inconsistency during a kernel panic?
      • Any advice on additional debug steps or log files I should collect next time?

      📦 Download crashdump package

      Thank you all in advance for your time and input — really appreciate it!

        olivierlambert Vates 🪐 Co-Founder CEO

        It's a lot easier just to read the right section in the dmesg:

        [ 604291.140493]  ALERT: BUG: unable to handle kernel paging request at 000000000100001c
        [ 604291.140503]   INFO: PGD 10b740067 P4D 10b740067 PUD 107933067 PMD 0 
        [ 604291.140509]   WARN: Oops: 0000 [#1] SMP NOPTI
        [ 604291.140513]   WARN: CPU: 0 PID: 0 Comm: swapper/0 Tainted: G           O      4.19.0+1 #1
        [ 604291.140516]   WARN: Hardware name: Gigabyte Technology Co., Ltd. Z690 UD/Z690 UD, BIOS F6 01/27/2022
        [ 604291.140524]   WARN: RIP: e030:rtl8125_rx_interrupt+0x2ae/0x7d0 [r8125]
        [ 604291.140527]   WARN: Code: 53 14 74 50 48 8b 4c 24 60 65 48 33 0c 25 28 00 00 00 44 89 e8 0f 85 64 04 00 00 48 83 c4 68 5b 5d 41 5c 41 5d 41 5e 41 5f c3 <45> 8b 75 1c e9 fe fd ff ff f6 85 e9 35 00 00 02 0f 85 68 04 00 00
        [ 604291.140534]   WARN: RSP: e02b:ffff8882b6803e40 EFLAGS: 00010246
        [ 604291.140537]   WARN: RAX: 0000000000000000 RBX: ffff8882a5aa0068 RCX: 00000000684ce000
        [ 604291.140540]   WARN: RDX: 0000000000000000 RSI: 0000000001000000 RDI: 0000000000044211
        [ 604291.140543]   WARN: RBP: ffff8882a5a808c0 R08: 00000000000531a1 R09: 000000001458e000
        [ 604291.140546]   WARN: R10: 0000000000000000 R11: 0000000000000000 R12: ffff8882a5a80000
        [ 604291.140550]   WARN: R13: 0000000001000000 R14: 0000000000000004 R15: 0000000000000000
        [ 604291.140558]   WARN: FS:  0000000000000000(0000) GS:ffff8882b6800000(0000) knlGS:0000000000000000
        [ 604291.140562]   WARN: CS:  e033 DS: 0000 ES: 0000 CR0: 0000000080050033
        [ 604291.140565]   WARN: CR2: 000000000100001c CR3: 000000000723e000 CR4: 0000000000040660
        [ 604291.140570]   WARN: Call Trace:
        [ 604291.140574]   WARN:  <IRQ>
        [ 604291.140580]   WARN:  ? __napi_schedule+0x4a/0x50
        [ 604291.140584]   WARN:  ? rtl8125_interrupt_msix+0x62/0xe0 [r8125]
        [ 604291.140589]   WARN:  ? __handle_irq_event_percpu+0x4d/0x1a0
        [ 604291.140593]   WARN:  rtl8125_poll_msix_rx+0x4a/0x90 [r8125]
        [ 604291.140596]   WARN:  net_rx_action+0x2a5/0x3e0
        [ 604291.140600]   WARN:  __do_softirq+0xd1/0x28c
        [ 604291.140604]   WARN:  irq_exit+0xa8/0xc0
        [ 604291.140608]   WARN:  xen_evtchn_do_upcall+0x2c/0x50
        [ 604291.140611]   WARN:  xen_do_hypervisor_callback+0x29/0x40
        [ 604291.140614]   WARN:  </IRQ>
        [ 604291.140617]   WARN: RIP: e030:xen_hypercall_sched_op+0xa/0x20
        [ 604291.140620]   WARN: Code: 51 41 53 b8 1c 00 00 00 0f 05 41 5b 59 c3 cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc 51 41 53 b8 1d 00 00 00 0f 05 <41> 5b 59 c3 cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc
        [ 604291.140626]   WARN: RSP: e02b:ffffffff82003e58 EFLAGS: 00000246
        [ 604291.140629]   WARN: RAX: 0000000000000000 RBX: ffffffff82011740 RCX: ffffffff810013aa
        [ 604291.140632]   WARN: RDX: ffffffff8203d250 RSI: 0000000000000000 RDI: 0000000000000001
        [ 604291.140635]   WARN: RBP: 0000000000000000 R08: 0000000000000008 R09: 000225a01b5e9188
        [ 604291.140639]   WARN: R10: ffffc9004178b930 R11: 0000000000000246 R12: 0000000000000000
        [ 604291.140642]   WARN: R13: 0000000000000000 R14: ffffffff82011740 R15: ffffffff82011740
        [ 604291.140646]   WARN:  ? xen_hypercall_sched_op+0xa/0x20
        [ 604291.140650]   WARN:  ? xen_safe_halt+0xc/0x20
        [ 604291.140653]   WARN:  ? default_idle+0x1a/0x140
        [ 604291.140656]   WARN:  ? do_idle+0x1ea/0x260
        [ 604291.140659]   WARN:  ? cpu_startup_entry+0x6f/0x80
        [ 604291.140663]   WARN:  ? start_kernel+0x558/0x578
        [ 604291.140666]   WARN:  ? set_init_arg+0x55/0x55
        [ 604291.140668]   WARN:  ? xen_start_kernel+0x5a0/0x5aa
        [ 604291.140671]   WARN: Modules linked in: tun bnx2fc(O) cnic(O) uio fcoe libfcoe libfc scsi_transport_fc nfnetlink_cttimeout nfnetlink openvswitch nsh nf_nat_ipv6 nf_nat_ipv4 nf_conncount nf_nat 8021q garp mrp stp llc ipt_REJECT nf_reject_ipv4 dm_multipath xt_tcpudp xt_multiport xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c iptable_filter sunrpc nls_iso8859_1 nls_cp437 vfat fat hid_generic crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc usbhid hid dm_mod aesni_intel aes_x86_64 crypto_simd cryptd glue_helper sg video backlight ip_tables x_tables raid1 sd_mod md_mod ahci libahci nvme xhci_pci r8125(O) nvme_core libata xhci_hcd scsi_dh_rdac scsi_dh_hp_sw scsi_dh_emc scsi_dh_alua scsi_mod efivarfs ipv6 crc_ccitt
        [ 604291.140714]   WARN: CR2: 000000000100001c
        [ 604291.140720]   WARN: ---[ end trace d652c60ff1bf708b ]---
        [ 604291.140724]   WARN: RIP: e030:rtl8125_rx_interrupt+0x2ae/0x7d0 [r8125]
        [ 604291.140727]   WARN: Code: 53 14 74 50 48 8b 4c 24 60 65 48 33 0c 25 28 00 00 00 44 89 e8 0f 85 64 04 00 00 48 83 c4 68 5b 5d 41 5c 41 5d 41 5e 41 5f c3 <45> 8b 75 1c e9 fe fd ff ff f6 85 e9 35 00 00 02 0f 85 68 04 00 00
        [ 604291.140733]   WARN: RSP: e02b:ffff8882b6803e40 EFLAGS: 00010246
        [ 604291.140736]   WARN: RAX: 0000000000000000 RBX: ffff8882a5aa0068 RCX: 00000000684ce000
        [ 604291.140739]   WARN: RDX: 0000000000000000 RSI: 0000000001000000 RDI: 0000000000044211
        [ 604291.140743]   WARN: RBP: ffff8882a5a808c0 R08: 00000000000531a1 R09: 000000001458e000
        [ 604291.140746]   WARN: R10: 0000000000000000 R11: 0000000000000000 R12: ffff8882a5a80000
        [ 604291.140749]   WARN: R13: 0000000001000000 R14: 0000000000000004 R15: 0000000000000000
        [ 604291.140756]   WARN: FS:  0000000000000000(0000) GS:ffff8882b6800000(0000) knlGS:0000000000000000
        [ 604291.140760]   WARN: CS:  e033 DS: 0000 ES: 0000 CR0: 0000000080050033
        [ 604291.140762]   WARN: CR2: 000000000100001c CR3: 000000000723e000 CR4: 0000000000040660
        [ 604291.140768]  EMERG: Kernel panic - not syncing: Fatal exception in interrupt
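One way to read the fault address in this oops is to decode the instruction bytes marked with `<45>` in the `Code:` line and compare the result with the register dump. The sketch below handles only this one encoding and is not a general x86 decoder; `decode_mov_disp8` is a hypothetical helper written for this illustration. If the decoding is right, the faulting instruction is a 32-bit load from R13 + 0x1c, and R13 holds 0x1000000, which does not look like a kernel pointer (the valid pointers in the same dump start with ffff8882). That would be consistent with a bad pointer inside the driver's receive path, though it says nothing about how the pointer went bad.

```python
# Register values and the marked instruction bytes, copied from the oops above.
CR2 = 0x000000000100001C
R13 = 0x0000000001000000
FAULTING_BYTES = bytes.fromhex("458b751c")  # "<45> 8b 75 1c" in the Code: line

def decode_mov_disp8(code):
    """Decode the single form used here: REX + 8B /r with an 8-bit displacement.

    Returns (destination register number, base register number, displacement).
    """
    rex, opcode, modrm, disp8 = code
    if rex & 0xF0 != 0x40 or opcode != 0x8B:
        raise ValueError("not a REX-prefixed MOV r, r/m")
    mod, reg, rm = modrm >> 6, (modrm >> 3) & 7, modrm & 7
    if mod != 1 or rm == 4:
        raise ValueError("only the [base + disp8] form without SIB is handled")
    reg |= (rex & 0x4) << 1  # REX.R supplies bit 3 of the destination register
    rm |= (rex & 0x1) << 3   # REX.B supplies bit 3 of the base register
    if disp8 >= 0x80:        # disp8 is a signed byte
        disp8 -= 0x100
    return reg, rm, disp8

dest, base, disp = decode_mov_disp8(FAULTING_BYTES)
print(f"load into r{dest}d from [r{base} + {disp:#x}]")
print(f"r13 + disp = {R13 + disp:#x}, CR2 = {CR2:#x}, match = {R13 + disp == CR2}")
```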
        

        It's a gaming motherboard with a rather cheap Realtek network interface and:

        1. Your BIOS is very outdated (latest seems to be F31d, you are 3 years late).
        2. I would try again after updating it. At least the culprit is clear: the crash is in the rtl8125 driver, in the interrupt path that handles a received packet.

        If you still have the problem after a BIOS update, we'll discuss what's next 🙂
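As a rough illustration of reading "the right section" of such a log, the sketch below pulls out the faulting symbol, the module it belongs to, the CR2 address and the modules flagged `(O)`, the kernel's marker for out-of-tree modules. `summarize_oops` and its regular expressions are assumptions written for this example, and the `Modules linked in` line in the sample is shortened from the one in the post.

```python
import re

def summarize_oops(log):
    """Extract the faulting symbol, its module, CR2 and out-of-tree modules."""
    rip = re.search(
        r"RIP: \w+:(?P<sym>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+(?: \[(?P<mod>\w+)\])?",
        log,
    )
    cr2 = re.search(r"CR2: ([0-9a-f]{16})", log)
    mods = re.search(r"Modules linked in: (.*)", log)
    return {
        "symbol": rip["sym"] if rip else None,
        "module": rip["mod"] if rip else None,  # None means core kernel code
        "cr2": int(cr2.group(1), 16) if cr2 else None,
        "out_of_tree": re.findall(r"(\w+)\(O\)", mods.group(1)) if mods else [],
    }

# Shortened excerpt of the oops quoted in the post.
SAMPLE = """\
WARN: RIP: e030:rtl8125_rx_interrupt+0x2ae/0x7d0 [r8125]
WARN: CR2: 000000000100001c CR3: 000000000723e000 CR4: 0000000000040660
WARN: Modules linked in: tun bnx2fc(O) cnic(O) uio nvme xhci_pci r8125(O) nvme_core
"""

summary = summarize_oops(SAMPLE)
print(summary["symbol"], summary["module"], hex(summary["cr2"]))
print("out-of-tree:", summary["out_of_tree"])
```

Here the first `RIP:` line names `rtl8125_rx_interrupt` in the `r8125` module, and `r8125` also appears among the `(O)` modules.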

          bleader Vates 🪐 XCP-ng Team

          @dnikola said in [HELP] XCP-ng 4.17.5 dom0 kernel panic — page fault in TCP stack, crashdump attached:

          Has anyone experienced similar page faults in the dom0 TCP stack on 4.19 kernels or XCP-ng 4.17.5?

          Not that I know of.

          Are there any known issues with network drivers on this kernel/hypervisor combo?

          No, but there can be issues with some drivers. You should have specified which NICs and drivers you are using.
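One possible way to collect that information on a Linux dom0 is to read sysfs, where each physical interface under `/sys/class/net` has a `device/driver` symlink pointing at the bound driver. This is a sketch under that assumption; the `nic_drivers` helper and its `sys_root` parameter are hypothetical names chosen for the example (the parameter only exists so the function can be pointed at a test directory).

```python
import os

def nic_drivers(sys_root="/sys"):
    """Map each network interface to the kernel driver bound to it."""
    result = {}
    net_dir = os.path.join(sys_root, "class", "net")
    if not os.path.isdir(net_dir):
        return result  # not a Linux sysfs layout; nothing to report
    for iface in sorted(os.listdir(net_dir)):
        driver_link = os.path.join(net_dir, iface, "device", "driver")
        # Virtual interfaces (lo, bridges, ...) have no device/driver link.
        if os.path.islink(driver_link):
            result[iface] = os.path.basename(os.path.realpath(driver_link))
    return result

for iface, driver in nic_drivers().items():
    print(f"{iface}: {driver}")
```

Including this interface-to-driver list in the first post gives readers the NIC and driver details without having to dig through the crash dump.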

          Would you recommend moving to a newer dom0 kernel or hypervisor build?

          The latest XCP-ng version is 8.3. You didn't specify yours in your post, but you're using the latest version of Xen, so I assume it is an up-to-date 8.3, in which case there is no newer build.

          Could a memory issue cause this specific kind of page table inconsistency during a kernel panic?

          Yes, it can be a bug in the code, but it absolutely could be a hardware issue.

          Any advice on additional debug steps or log files I should collect next time?

          I would start by running a memtest on that host to make sure the memory is not having issues.

          Do you know if there was a specific VM doing something specific at that time? We had some issues in the past with FreeBSD VMs using WireGuard, but it does not look similar, and it should be fixed now.
          What kind of guests were running on that host? Linux, Windows, some BSD-based?
          If running Windows guests, please be sure to have read this blog post and to comply with the guidelines there.

          From a quick look, I don't see anything obvious. Follow Olivier's suggestion first; if you still have issues after that, you can share an additional report using xen-bugtool -y. But please be sure to update your BIOS first, check your memory, and then do that.
