Hi, thanks for your kind reply.
Let me share what I have noticed:
3 servers with the same hardware, the same latest XCP-ng version, and the same motherboard BIOS.
1 server with different hardware, same latest XCP-ng version.
Server C
Crashes are not frequent, but they do happen roughly every 10 days, and the toolstack restarts a few times within that period.
5 VMs: Server 2022 x3, Server 2019, Win 10 Pro
Delta backups disabled, but connected to XO.
Server D
2 VMs: Server 2019, Server 2022
Delta backups enabled; everything has been running fine from day one, without a single restart or toolstack crash.
Server K
5 VMs: Server 2019, Server 2022, Win 10 Pro, Win 7, Linux
This one causes the most problems: it works for 10 days, then restarts 10 times in 2 days. The restarts were triggered after delta backups, so I have disabled delta backups and disabled sending metrics from this server.
Server P
5 VMs: Server 2019, Server 2022 x2, Win 10 Pro, Win 7
Crashes are not frequent here either, but they do happen roughly every 10 days, with a few toolstack restarts in that window.
Delta backups enabled; it works fine overall, but a few restarts occurred during that time.
From reviewing dom0.log on Server K, as the most affected host, we have noticed:
Multiple segfaults in xcp-rrdd throughout runtime:
INFO: xcp-rrdd[xxx]: segfault at ...
The RRD polling is active and seems unstable on this host.
Frequent link down/up events from the r8125 driver:
INFO: r8125: eth0: link down
INFO: r8125: eth0: link up
(a known issue on Xen hypervisors with Realtek drivers)
And eventually, kernel panics identical to the earlier ones:
CRIT: kernel BUG at drivers/xen/events/events_base.c:1601!
Kernel panic - not syncing: Fatal exception in interrupt
Always the same stack trace, the same event channel handling failure.
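For reference, this is roughly how I pulled those entries out of the logs (paths are what I assume are the standard XCP-ng dom0 locations, matching what ends up in the exported dom0.log):

# Assumed standard dom0 log locations; adjust if your export differs.
grep -i "segfault" /var/log/kern.log /var/log/daemon.log | grep -i "xcp-rrdd"
grep -E "r8125.*link (down|up)" /var/log/kern.log
grep -iE "kernel BUG|Kernel panic" /var/log/kern.log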
Actions planned:
BIOS update on-site (currently on v1663 / Aug 2024 — latest is 1854)
Evaluate replacing the Realtek NIC with an Intel one
The problem is that the server is at a remote location, and we’re organizing an on-site intervention ASAP.
In the meantime:
Can I safely disable the xcp-rrdd service to reduce polling activity? (A sketch of what I was planning to run is just below these questions.)
I know it powers the RRD stats in XO and XenCenter, but we can live without the graphs for now.
Is there anything else advisable to disable / adjust until we get on-site?
(delta backups are already paused on this host)
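For the first question, this is roughly what I had in mind; the service names are my assumption based on a default XCP-ng 8.x dom0, so please correct me if stopping these breaks something xapi depends on:

# List whatever RRD-related units actually exist on this host first.
systemctl list-units 'xcp-rrdd*'
# Then stop the plugin collectors and the main daemon (names assumed).
systemctl stop xcp-rrdd-iostat xcp-rrdd-squeezed xcp-rrdd-xenpm
systemctl stop xcp-rrdd
# And keep them off until the on-site visit:
systemctl disable xcp-rrdd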
The VM involved during the latest crash was a FreePBX virtual machine running management agent version 8.4.
Is there a newer agent package available for CentOS/AlmaLinux 8/9 guests I should apply?
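For context, this is how I'd check what is actually installed inside the guest (package and service names are assumptions based on the usual xe-guest-utilities tooling, not something I've confirmed for this template):

# Run inside the FreePBX guest: list any guest-tools package and the
# agent service state (names assumed; they may differ per distro).
rpm -qa | grep -iE "xe-guest|xcp|guest-tools"
systemctl status xe-linux-distribution 2>/dev/null || true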
Questions:
- Would disabling xcp-rrdd mitigate dom0 instability short-term?
- Is there any way to tune the RRD polling frequency instead of disabling it entirely?
- Anything else you’d recommend collecting before the next crash (besides xen-bugtool -y)?
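On the last point, this is what I was planning to capture remotely before the next crash, in addition to xen-bugtool -y; the debug-key dump is my own idea given the event channel BUG, so tell me if it's not useful:

# Save the Xen ring buffer and an event channel dump while dom0 is
# still up, to compare against the state after the next panic.
xl dmesg > /root/xl-dmesg-$(date +%F).log
xl debug-keys e    # 'e' asks Xen to dump event channel info
xl dmesg > /root/xl-dmesg-evtchn-$(date +%F).log
xl info > /root/xl-info-$(date +%F).log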
I also noticed that my FreePBX VM (UUID: 6c725208-c266-a106-da10-50e9ec66b41e) repeatedly triggers an event processing loop via xenopsd-xc and xapi, visible both in dom0.log and xapi.log.
Example from logs:
Received an event on managed VM 6c725208-c266-a106-da10-50e9ec66b41e
Queue.push ["VM_check_state","6c725208-c266-a106-da10-50e9ec66b41e"]
Queue.pop returned ["VM_check_state","6c725208-c266-a106-da10-50e9ec66b41e"]
VM 6c725208-c266-a106-da10-50e9ec66b41e is not requesting any attention
This repeats every minute, without an actual task being created (confirmed via xe task-list showing no pending tasks).
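For the record, this is roughly how I confirmed it (the log path and grep pattern are just what matched on my host):

# Count how often the check-state loop fires for this VM in xapi's log
# (assumed to be /var/log/xensource.log on the dom0), and confirm that
# no task or pending operation exists for it.
grep -c "VM_check_state.*6c725208-c266-a106-da10-50e9ec66b41e" /var/log/xensource.log
xe task-list
xe vm-list uuid=6c725208-c266-a106-da10-50e9ec66b41e params=name-label,power-state,current-operations,allowed-operations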
Notably:
- This behavior persists even after disabling RRD polling and delta backups
- The VM shows an orange activity indicator in XCP-ng Admin Center, as if a task is ongoing
- Previously this has caused a dom0 crash and reboot
- Given the log pattern and event storm, it seems likely that either:
- A stale or looping event is being triggered by the guest agent or hypervisor integration
- Or the xenopsd/xapi state machine isn't properly clearing or marking the VM state after these checks
I'd appreciate advice on:
- How to safely clear/reset the VM state without restarting dom0 (would a plain toolstack restart, as sketched below, be the right tool here?)
- Whether updating the management agent inside the FreePBX guest (currently xcp-ng-agent 8.4) to a newer version might resolve this
(if a newer one is available for RHEL 7 / FreePBX)
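The only non-disruptive reset I'm aware of is a toolstack restart in dom0, which, as I understand it, does not touch running VMs; if there is a more targeted way to clear just this VM's state, I'd prefer that:

# Restart xapi/xenopsd without affecting running VMs, then watch
# whether the check-state loop for the FreePBX VM comes back.
xe-toolstack-restart
xe task-list
tail -f /var/log/xensource.log | grep 6c725208-c266-a106-da10-50e9ec66b41e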
Part of the log from the time this was happening:
Thanks in advance — we’re pushing for the hardware fixes but would appreciate advice for short-term stability in the meantime.