XCP-ng

    [HELP] XCP-ng 4.17.5 dom0 kernel panic — page fault in TCP stack, crashdump attached

    23 Posts 6 Posters 321 Views 4 Watching
    • D Offline
      dnikola
      last edited by

      Hi @olivierlambert @bleader

      thank you both again for the detailed replies and suggestions. I’d like to provide a bit more context about our setup and situation:

      📌 Situation Summary:
      We’re currently running XCP-ng 8.3 with Xen 4.17.5-13 on a mix of servers, including some older, obsolete hardware.

      Interestingly, XCP-ng 8.2 runs without issues on identical hardware configurations — no crashes, even under the same workloads.

      On this particular host, we’ve experienced 10 crashes so far, and in almost every case the crash happened while performing delta backups from Xen Orchestra.
      This seems to consistently trigger the issue under higher network load.

      We’ve already performed full memory tests (memtest86+) on this host, and the results came back clean — no memory errors found.

      The servers are currently physically located at a remote site, which makes immediate hands-on intervention difficult.
      We’re organizing a visit to the site to update the BIOS and potentially replace the Realtek NIC with a supported Intel NIC as suggested. This intervention will happen as soon as logistically possible.

      📌 Question:
      Is there anything else you would recommend we check or do remotely in the meantime before our on-site intervention?

      And once we're physically on-site, aside from:

      Updating the BIOS
      Swapping NIC hardware

      is there anything else you’d recommend we inspect or collect while we’re there?

      I appreciate your help and guidance — and thank you again for pointing us in the right direction so quickly.

      One more important question: which guest tools do you recommend for Windows Server 2019, Windows Server 2022, and Windows 10?
      Is 9.4.0 the right one?

      • olivierlambertO Offline
        olivierlambert Vates 🪐 Co-Founder CEO
        last edited by

        If it worked with 8.2, the NIC driver version is a likely suspect. In general, things go well only with specific firmware + driver combinations, and you may have ended up with a combination that doesn't work well. So updating the BIOS/firmware of the machine AND the NICs is likely the best next move (or swapping in a better NIC).

        For the tools, @dinhngtu can provide guidance

        • D Offline
          dinhngtu Vates 🪐 XCP-ng Team @dnikola
          last edited by

          @dnikola Here are our driver recommendations:

          • For non-prod environments (homelab, test VMs, whenever possible): use the new XCP-ng drivers in testsign mode. We'd really appreciate having people to test the driver/guest agent and provide us with feedback.
          • For prod environments: use XenServer drivers. (9.4.1 or later to avoid the recent vulnerability)
          • D Offline
            dnikola
            last edited by

            Hi, thanks for your kind reply.

            Here is one more crash log from a different server: identical hardware, identical problems.
            What I noticed before this crash: the server got stuck creating delta backups at 02:00 AM.
            It had a few pending tasks listed by the xe task-list command and was not accessible from XO or XCP-ng Center. After a XAPI toolstack restart over SSH the connection was restored, but I now see that the server rebooted after that.

            • olivierlambertO Offline
              olivierlambert Vates 🪐 Co-Founder CEO
              last edited by

              Another system which is also a consumer grade motherboard, right?

              Hardware name: ASUS System Product Name/PRIME Z790-P, BIOS 1663 08/08/2024

              Your BIOS is also outdated.

              What NIC are you using in there?

              • D Offline
                dnikola @olivierlambert
                last edited by dnikola

                @olivierlambert

                Your BIOS is also outdated.

                Yes, same BIOS

                What NIC are you using in there?

                The same onboard (motherboard) NIC, and there is one more NIC used only for the SIP trunk.

                There is one more server that has fewer problems (because of a slow ISP, backups have been temporarily disabled). Crashes there are less frequent, but they do happen, without crash log files, especially toolstack crashes.
                62825806-75f2-41b3-8b46-908f016ca816-image.png

                Regarding the NIC, a local seller has this 2.5 Gbps card: https://www.cudy.com/en-eu/products/pe25-1-0
                Would it be better?

                • olivierlambertO Offline
                  olivierlambert Vates 🪐 Co-Founder CEO
                  last edited by olivierlambert

                  It's unrelated, are you really building a production infrastructure with non-server grade hardware? There's a good reason people don't use consumer-grade hardware: it's not meant for it, it works for basic usage but you can easily encounter buggy BIOS, firmware, ACPI tables and so on. It doesn't have the QA process done on server-grade hardware.

                  It's a LOT better to purchase refurb stuff (even a cheap refurb 10G Intel NIC will be 1000 times better than any RealTek crap you can purchase brand new).

                  • D Offline
                    dnikola @dnikola
                    last edited by

                    Thanks for letting me know something that I already know 🙂
                    But sometimes the situation is what it is, and we have to adapt to it (lack of hardware, lack of budget, etc.).

                    • D Offline
                      dnikola @dnikola
                      last edited by

                      @olivierlambert please let me know one or two NIC models. I will purchase them on eBay, because the local seller would not be able to deliver them, and keep them on hand for future debugging.

                      For the last X years, up until 8.3, we could put XCP-ng on any damn hardware and never had any problem... This is our experience.

                      • olivierlambertO Offline
                        olivierlambert Vates 🪐 Co-Founder CEO
                        last edited by

                        As I said, you were lucky: on consumer-grade hardware, "not crashing" comes down to hitting a working combination of driver version and firmware version (even on server-grade hardware it could happen, but it's simply less likely).

                        That's why I would advise updating all the firmware first. The next step is to try an alternative kernel driver for the NIC (if there is one); you could even start there if you like.

                        At least, you know how to trigger it (maybe a simple iperf would be enough) so you can test various drivers and/or firmware versions.
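
A minimal sketch of scripting such an iperf run, so the same load can be repeated after each driver or firmware change. It assumes iperf3 is installed on the machine running the script and that an `iperf3 -s` server is already listening on another machine; the address, duration, and stream count are placeholders, and `build_iperf3_cmd` / `run_load` are hypothetical helper names.

```python
import subprocess


def build_iperf3_cmd(server, seconds=600, streams=4, reverse=False):
    """Build an iperf3 client command line as an argument list.

    -c: server to connect to, -t: test duration in seconds,
    -P: number of parallel streams, -R: reverse direction (server sends).
    """
    cmd = ["iperf3", "-c", server, "-t", str(seconds), "-P", str(streams)]
    if reverse:
        cmd.append("-R")
    return cmd


def run_load(server, rounds=3, **kwargs):
    """Run the same iperf3 test several times and return the exit codes.

    A non-zero exit code (or the host dropping off the network) marks
    the round in which the problem was triggered.
    """
    exit_codes = []
    for _ in range(rounds):
        proc = subprocess.run(
            build_iperf3_cmd(server, **kwargs),
            capture_output=True,
            text=True,
        )
        exit_codes.append(proc.returncode)
    return exit_codes
```

Calling `run_load("192.0.2.10", rounds=5, seconds=300, reverse=True)` would run five reverse-direction tests against a placeholder address. Whether this load is enough to reproduce the panic on a given host is not something the sketch can guarantee.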

                        • TeddyAstieT Offline
                          TeddyAstie Vates 🪐 XCP-ng Team Xen Guru
                          last edited by

                          cc @andrew

                          It looks like an issue with https://github.com/xcp-ng-rpms/r8125-module, though I am not completely sure what is going on or why the page table suddenly becomes invalid.

                          • A Online
                            Andrew Top contributor @TeddyAstie
                            last edited by

                            @TeddyAstie It could be! The crash does look like an r8125 driver issue.

                            It's older Realtek code that has been working well on XCP systems. The newer current Realtek released code has new issues, so there's no quick direct update... I have not seen the current r8125 XCP driver cause crashes.

                            I would point my finger back at this specific system and some odd condition that the driver does not handle correctly.

                            As this is vendor code, there is no upstream Linux testing.... so, no non-XCP problem reports.

                            • olivierlambertO Offline
                              olivierlambert Vates 🪐 Co-Founder CEO
                              last edited by

                              Yeah, I would avoid those shitty cards as much as possible. Is there a more recent driver anyway?

                              • A Online
                                Andrew Top contributor @olivierlambert
                                last edited by

                                @olivierlambert There are new versions of the r812x drivers. They don't compile cleanly for XCP. The r8127 driver was withdrawn and the r8125/r8126 was split into two different drivers. Realtek never publishes release notes.

                                I'll have to test the new driver and see if it's worth trying. I don't know if an update would solve this panic issue, as there are lots of undocumented code changes.

                                The forum has been quiet about new r8125 issues, so in general the current driver has been working well enough. Just two issues I remember, including this one.

                                Realtek has also released new hardware revisions of the r812x chips that need new driver support and are only recently supported by their vendor driver and in upstream Linux 6.15 (not even 6.12 LTS yet).

                                As for the r8127, it looks like it could be a desktop game changer as it's a small cheap low power 10G chip. But like the others, its release is delayed and it does not have driver support yet (or test samples).

                                • olivierlambertO Offline
                                  olivierlambert Vates 🪐 Co-Founder CEO
                                  last edited by

                                  It would be interesting to have a test driver that solves this very issue, but I wouldn't expect too much either 😕 Those drivers are as bad as the chips 😢

                                  • A Online
                                    Andrew Top contributor @dnikola
                                    last edited by

                                    @dnikola Please make sure your motherboard firmware is up to date (BIOS F30e). There are a LOT of stability issues with Intel CPUs for that board and old BIOS.

                                    If you still have r8125 crashes, then try a newer r8125 alt version (9.016.00) from my download page and see if it works better. I gave it a quick test and it installs and works, but YMMV... You can always uninstall it.

                                    • A Online
                                      Andrew Top contributor @dnikola
                                      last edited by

                                      @dnikola As for the other card you listed, no, it's still an 8125 card. The single-port 10G card (from the same site) uses an AQC113 chipset; you'll need to install atlantic-module-alt to support it. If you must have 2.5G, then an Intel i225/i226 card is the other choice (not from that site).

                                      • D Offline
                                        dnikola @Andrew
                                        last edited by dnikola

                                        Hi, thanks for your kind reply.

                                        Let me share what I have noticed across our hosts:
                                        3 servers with the same hardware, the same (latest) XCP-ng version, and the same motherboard BIOS
                                        1 server with different hardware and the same (latest) XCP-ng version

                                        Server C
                                        Crashes are not frequent, but they do happen roughly every 10 days, and the toolstack restarts a few times within those 10 days

                                        5 VMs: Server 2022 x3, Server 2019, Win 10 Pro
                                        Delta backup disabled, but the host is connected to XO

                                        Server D
                                        2 VMs: Server 2019, Server 2022
                                        Delta backup enabled; everything has been running fine from the first day, without a single restart or toolstack crash

                                        Server K
                                        5 VMs: Server 2019, Server 2022, Win 10 Pro, Win 7, Linux
                                        Causes the most problems: it works for 10 days, then restarts 10 times in 2 days. The restarts were triggered after delta backups, so I have disabled delta backups and disabled sending metrics from the server

                                        Server P
                                        5 VMs: Server 2019, Server 2022 x2, Win 10 Pro, Win 7
                                        Crashes are not frequent, but they do happen roughly every 10 days, and the toolstack restarts a few times within those 10 days
                                        Delta backup enabled and working fine, but a few restarts occurred during that time

                                        Reviewing dom0.log from server K, the most affected one, we noticed:

                                        Multiple segfaults in xcp-rrdd throughout runtime:

                                        INFO: xcp-rrdd[xxx]: segfault at ...
                                        

                                        The RRD polling is active and seems unstable on this host.

                                        Frequent link down/up events from the r8125 driver:

                                        INFO: r8125: eth0: link down
                                        INFO: r8125: eth0: link up
                                        

                                        (known issue on Xen hypervisors with Realtek drivers)

                                        And eventually, identical kernel panics as before:

                                        CRIT: kernel BUG at drivers/xen/events/events_base.c:1601!
                                        Kernel panic - not syncing: Fatal exception in interrupt
                                        Always same stack trace, same event channel handling failure.
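
A minimal sketch for tallying these three kinds of events in a saved copy of the log, so the hosts can be compared before and after a driver or BIOS change. The patterns are taken from the excerpts quoted above; the exact line layout on a given host is an assumption, and `summarize` is a hypothetical helper name.

```python
import re
import sys
from collections import Counter

# Patterns based on the log excerpts quoted in this post (assumed layout).
PATTERNS = {
    "rrdd_segfault": re.compile(r"xcp-rrdd\[[^\]]*\]: segfault"),
    "link_down": re.compile(r"r8125: \S+: link down"),
    "link_up": re.compile(r"r8125: \S+: link up"),
    "kernel_bug": re.compile(r"kernel BUG at "),
    "kernel_panic": re.compile(r"Kernel panic - not syncing"),
}


def summarize(lines):
    """Count how many lines match each pattern in PATTERNS."""
    counts = Counter()
    for line in lines:
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python summarize_log.py <path to a copy of the log>
    with open(sys.argv[1], errors="replace") as handle:
        for name, count in sorted(summarize(handle).items()):
            print(f"{name}: {count}")
```

A high `link_down` count shortly before each `kernel_bug` would support the NIC driver theory, although the counts alone do not prove a causal link.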
                                        

                                        📌 Actions planned:
                                        BIOS update on-site (currently on v1663 / Aug 2024 — latest is 1854)
                                        Evaluate replacing the Realtek NIC with an Intel one
                                        Problem is that the server is at a remote location, and we’re organizing an on-site intervention ASAP.

                                        📌 In the meantime:
                                        Can I safely disable xcp-rrdd service to reduce polling activity?
                                        I know it powers the RRD stats in XO and XenCenter, but we can live without the graphs for now.

                                        Is there anything else advisable to disable / adjust until we get on-site?
                                        (delta backups are already paused on this)

                                        The VM involved during the latest crash was a FreePBX virtual machine running management agent version 8.4.
                                        Is there a newer agent package available for CentOS/AlmaLinux 8/9 guests I should apply?

                                        📌 Question:

                                        • Would disabling xcp-rrdd mitigate dom0 instability short-term?
                                        • Is there any way to tune RRD polling frequency instead of disabling entirely?
                                        • Anything else to collect before the next crash (besides xen-bugtool -y) you’d recommend?

                                        I also noticed that my FreePBX VM (UUID: 6c725208-c266-a106-da10-50e9ec66b41e) repeatedly triggers an event processing loop via xenopsd-xc and xapi, visible both in dom0.log and xapi.log.

                                        Example from logs:

                                        Received an event on managed VM 6c725208-c266-a106-da10-50e9ec66b41e
                                        Queue.push ["VM_check_state","6c725208-c266-a106-da10-50e9ec66b41e"]
                                        Queue.pop returned ["VM_check_state","6c725208-c266-a106-da10-50e9ec66b41e"]
                                        VM 6c725208-c266-a106-da10-50e9ec66b41e is not requesting any attention
                                        

                                        This repeats every minute, without an actual task being created (confirmed via xe task-list showing no pending tasks).
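
To check whether this VM really stands out, a small sketch like the following can count `VM_check_state` pushes per VM UUID in a saved copy of the log. The regular expression is built from the lines quoted above; the surrounding line layout (timestamps, prefixes) is an assumption, and `check_state_counts` is a hypothetical helper name.

```python
import re
from collections import Counter

# Matches the Queue.push lines quoted above and captures the VM UUID.
PUSH_RE = re.compile(r'Queue\.push \["VM_check_state","([0-9a-f-]{36})"\]')


def check_state_counts(lines):
    """Return a Counter mapping VM UUID -> number of VM_check_state pushes."""
    counts = Counter()
    for line in lines:
        match = PUSH_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

Running `check_state_counts(open(path))` on the same time window for each host, and comparing the count for this UUID with the other VMs, would show whether the once-a-minute rate is unusual or simply the normal periodic check.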

                                        Notably:

                                        • This behavior persists even after disabling RRD polling and delta backups
                                        • The VM shows an orange activity indicator in XCP-ng Admin Center, as if a task is ongoing
                                        • Previously this has caused a dom0 crash and reboot
                                        • Given the log pattern and event storm, it seems likely that either:
                                        • A stale or looping event is being triggered by the guest agent or hypervisor integration
                                        • Or xenopsd/xapi state machine isn't properly clearing or marking the VM state after these checks

                                        I'd appreciate advice on:

                                        • How to safely clear/reset the VM state without restarting dom0
                                        • Whether updating the management agent inside the FreePBX guest (currently xcp-ng-agent 8.4) to a newer version might resolve this
                                          (If a newer one is available for RHEL7/FreePBX)

                                        Part of the log from the time this happened

                                        Thanks in advance — we’re pushing for the hardware fixes but would appreciate advice for short-term stability in the meantime.

                                        • D Offline
                                          dnikola @olivierlambert
                                          last edited by

                                          @Andrew said in [HELP] XCP-ng 4.17.5 dom0 kernel panic — page fault in TCP stack, crashdump attached:

                                          @dnikola Please make sure your motherboard firmware is up to date (BIOS F30e). There are a LOT of stability issues with Intel CPUs for that board and old BIOS.

                                          If you still have r8125 crashes, then try a newer r8125 alt version (9.016.00) from my download page and see if it works better. I gave it a quick test and it installs and works, but YMMV... You can always uninstall it.

                                          OK, that will be done and I will report back!
                                          Is it possible to get a quick guide on what has to be done, and in which order, for a correct install/uninstall process?

                                          @Andrew said in [HELP] XCP-ng 4.17.5 dom0 kernel panic — page fault in TCP stack, crashdump attached:

                                          @dnikola As for the other card you listed, no, it's still an 8125 card. The single-port 10G card (from the same site) uses an AQC113 chipset; you'll need to install atlantic-module-alt to support it. If you must have 2.5G, then an Intel i225/i226 card is the other choice (not from that site).

                                          I appreciate all your help so far, thank you. The only 2.5G NIC currently available locally is the one with the 8125, so it was meant as "first aid", but that didn't work out. Since I'll likely need to order a replacement online (it is not possible to find one in our country), could you kindly recommend a reliable source or a specific NIC model that you would personally suggest (eBay or wherever)? Our ISP link is 2.5 Gbps, so I prefer a 2.5G+ card.

                                          Of course, this would be just an informal recommendation — I fully respect your experience and advice, and I completely understand it wouldn’t imply any obligation or responsibility on your part for any potential purchase issues or problems later.

                                          My second option is replacing the motherboard with one that has an Intel NIC (a local wholesaler has a few models in stock), which may be the fastest option:

                                          • ASUS PRIME Z790-A WIFI
                                          • MSI PRO Z790-P WIFI
                                          • GIGABYTE Z790 AERO G rev. 1.x
                                          • MSI Z790 GAMING PLUS WIFI

                                          Thanks again in advance — any tip would be much appreciated.

                                          • A Online
                                            Andrew Top contributor @dnikola
                                            last edited by

                                            @dnikola The AQC113 10G card (from your vendor) also supports 2.5G with the driver loaded.
