Environment: XCP-ng 8.2 with a few Debian VMs that use 100% of the host CPUs. Two SSDs in RAID 1.
Has anyone seen errors like these in the XCP-ng kern.log?
May 3 10:22:27 xcpfuz135 kernel: [760888.661906] tapdisk[18876]: segfault at 7f2fc3ca9630 ip 00007f2fc3ca9630 sp 00007ffd030e3888 error 14
May 3 10:22:27 xcpfuz135 kernel: [760888.661918] Code: Bad RIP value.
May 3 11:58:13 xcpfuz135 kernel: [766634.433681] xcp-rrdd-xenpm[1492]: segfault at 7f316d5187d8 ip 00007f316d5187d8 sp 00007fff780db5b8 error 15
May 3 11:58:13 xcpfuz135 kernel: [766634.433687] Code: 00 00 02 04 00 00 00 00 00 00 01 00 00 00 00 00 00 00 00 04 00 00 00 00 00 00 d8 87 51 6d 31 7f 00 00 00 14 00 00 00 00 00 00 <08> 88 51 6d 31 7f 00 00 20 88 51 6d 31 7f 00 00 c8 88 51 6d 31 7f
May 3 12:25:52 xcpfuz135 kernel: [768293.718628] tapdisk[21047]: segfault at 7f625725d630 ip 00007f625725d630 sp 00007ffe541135e8 error 14 in zero[7fb8921de000+202000]
May 3 12:25:52 xcpfuz135 kernel: [768293.718646] Code: Bad RIP value.
May 3 13:20:43 xcpfuz135 kernel: [771584.713117] xapi4: port 2(vif2.0) entered disabled state
May 3 13:20:43 xcpfuz135 kernel: [771585.015102] xapi4: port 2(vif2.0) entered disabled state
May 3 13:20:43 xcpfuz135 kernel: [771585.015571] device vif2.0 left promiscuous mode
May 3 13:20:43 xcpfuz135 kernel: [771585.015588] xapi4: port 2(vif2.0) entered disabled state
May 3 13:21:44 xcpfuz135 kernel: [771645.993918] block tdc: sector-size: 512/512 capacity: 209715200
May 3 13:21:45 xcpfuz135 kernel: [771646.097319] block tdg: sector-size: 512/512 capacity: 2097152
May 3 13:21:45 xcpfuz135 kernel: [771646.098822] block tdh: sector-size: 512/512 capacity: 62914560
May 3 13:21:45 xcpfuz135 kernel: [771646.755689] xapi4: port 2(vif4.0) entered blocking state
May 3 13:21:45 xcpfuz135 kernel: [771646.755693] xapi4: port 2(vif4.0) entered disabled state
May 3 13:21:45 xcpfuz135 kernel: [771646.755853] device vif4.0 entered promiscuous mode
May 3 13:21:45 xcpfuz135 kernel: [771646.971247] xapi4: port 3(tap4.0) entered blocking state
May 3 13:21:45 xcpfuz135 kernel: [771646.971251] xapi4: port 3(tap4.0) entered disabled state
May 3 13:21:45 xcpfuz135 kernel: [771646.971373] device tap4.0 entered promiscuous mode
May 3 13:21:45 xcpfuz135 kernel: [771646.978912] xapi4: port 3(tap4.0) entered blocking state
May 3 13:21:45 xcpfuz135 kernel: [771646.978914] xapi4: port 3(tap4.0) entered forwarding state
May 3 13:22:07 xcpfuz135 kernel: [771668.627952] xapi4: port 3(tap4.0) entered disabled state
May 3 13:22:07 xcpfuz135 kernel: [771668.628380] device tap4.0 left promiscuous mode
May 3 13:22:07 xcpfuz135 kernel: [771668.628415] xapi4: port 3(tap4.0) entered disabled state
May 3 13:22:19 xcpfuz135 kernel: [771680.603894] vif vif-4-0 vif4.0: Guest Rx ready
May 3 13:22:19 xcpfuz135 kernel: [771680.604048] xapi4: port 2(vif4.0) entered blocking state
May 3 13:22:19 xcpfuz135 kernel: [771680.604050] xapi4: port 2(vif4.0) entered forwarding state
May 3 15:49:08 xcpfuz135 kernel: [780489.913642] tapdisk[26502]: segfault at 7f2fc3ca9630 ip 00007f2fc3ca9630 sp 00007ffe201f1ac8 error 14 in zero (deleted)[7f3f31b69000+20000]
May 3 15:49:08 xcpfuz135 kernel: [780489.913660] Code: Bad RIP value.
May 3 15:54:57 xcpfuz135 kernel: [780838.914536] tapdisk[19705]: segfault at 7fff1e11cc50 ip 00007fff1e11cc50 sp 00007ffc52bbdf08 error 14
May 3 15:54:57 xcpfuz135 kernel: [780838.914549] Code: Bad RIP value.
May 3 15:55:01 xcpfuz135 kernel: [780842.957102] tapdisk[26429]: segfault at 7f0668b77630 ip 00007f0668b77630 sp 00007ffc2dfedfb8 error 14 in zero[7f14475ad000+202000]
May 3 15:55:01 xcpfuz135 kernel: [780842.957123] Code: Bad RIP value.
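For anyone reading the segfault lines above: as far as I understand, the kernel prints the "error" field in hex, and on x86 it is the page-fault error code, so each bit has a meaning. Below is a minimal sketch that decodes it. The names parse_segfault and decode_error are my own, and the bit table is my reading of the x86 page-fault error code, not something taken from XCP-ng documentation.

```python
import re

# x86 page-fault error code bits: (mask, meaning if set, meaning if clear).
# Assumption: the kernel prints the code in hex, so "error 14" is 0x14.
PF_BITS = [
    (0x01, "protection violation", "page not present"),
    (0x02, "write access", "read access"),
    (0x04, "user mode", "kernel mode"),
    (0x08, "reserved bit set in page table", None),
    (0x10, "instruction fetch", None),
]

SEGFAULT_RE = re.compile(
    r"(?P<proc>[\w.-]+)\[(?P<pid>\d+)\]: segfault at (?P<addr>[0-9a-f]+) "
    r"ip (?P<ip>[0-9a-f]+) sp (?P<sp>[0-9a-f]+) error (?P<err>[0-9a-f]+)"
)


def decode_error(code):
    """Return a list of human-readable flags for a page-fault error code."""
    flags = []
    for mask, if_set, if_clear in PF_BITS:
        if code & mask:
            flags.append(if_set)
        elif if_clear:
            flags.append(if_clear)
    return flags


def parse_segfault(line):
    """Parse one kern.log segfault line; return None if it does not match."""
    m = SEGFAULT_RE.search(line)
    if m is None:
        return None
    return {
        "proc": m["proc"],
        "pid": int(m["pid"]),
        "addr": int(m["addr"], 16),
        "ip": int(m["ip"], 16),
        "flags": decode_error(int(m["err"], 16)),
    }
```

Read this way, the tapdisk faults (0x14) are user-mode instruction fetches from an unmapped page, and the xcp-rrdd-xenpm fault (0x15) is an instruction fetch from a mapped page that is not executable. In every case the faulting address equals the instruction pointer, so the processes appear to have jumped to a bad address rather than dereferenced a bad data pointer.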
There are also problems inside the VM instances: they are up, but system commands (ps aux, top, take your pick) no longer work. That started after the segfaults above. Some logs from an instance:
[12561.047902] systemd[1]: systemd-journald.service: Killing process 75421 (systemd-journal) with signal SIGKILL.
[12561.062143] systemd[1]: systemd-journald.service: Killing process 75425 (systemd-journal) with signal SIGKILL.
[12561.076665] systemd[1]: systemd-journald.service: Killing process 75428 (systemd-journal) with signal SIGKILL.
[12561.090558] systemd[1]: systemd-journald.service: Killing process 75430 (systemd-journal) with signal SIGKILL.
[12651.228990] systemd[1]: systemd-journald.service: Processes still around after final SIGKILL. Entering failed mode.
[12651.244675] systemd[1]: systemd-journald.service: Failed with result 'timeout'.
[12651.255652] systemd[1]: systemd-journald.service: Unit process 859 (systemd-journal) remains running after unit stopped.
[12651.270724] systemd[1]: systemd-journald.service: Unit process 75415 (systemd-journal) remains running after unit stopped.
[12651.285416] systemd[1]: systemd-journald.service: Unit process 75418 (systemd-journal) remains running after unit stopped.
[12651.299654] systemd[1]: systemd-journald.service: Unit process 75421 (systemd-journal) remains running after unit stopped.
[12651.313796] systemd[1]: systemd-journald.service: Unit process 75425 (systemd-journal) remains running after unit stopped.
[12651.327886] systemd[1]: systemd-journald.service: Unit process 75428 (systemd-journal) remains running after unit stopped.
[12651.341851] systemd[1]: systemd-journald.service: Unit process 75430 (systemd-journal) remains running after unit stopped.
[12651.355831] systemd[1]: systemd-journald.service: Unit process 75437 (systemd-journal) remains running after unit stopped.
[12651.370338] systemd[1]: Failed to start Journal Service.
[12651.378681] systemd[1]: systemd-journald.service: Scheduled restart job, restart counter is at 8.
[12651.390940] systemd[1]: Stopped Journal Service.
[12651.397801] systemd[1]: systemd-journald.service: Found left-over process 859 (systemd-journal) in control group while starting unit. Ignoring.
[12651.414744] systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
[12651.429011] systemd[1]: systemd-journald.service: Found left-over process 75415 (systemd-journal) in control group while starting unit. Ignoring.
[12651.446003] systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
[12651.460405] systemd[1]: systemd-journald.service: Found left-over process 75418 (systemd-journal) in control group while starting unit. Ignoring.
[12651.477450] systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
[12651.492410] systemd[1]: systemd-journald.service: Found left-over process 75421 (systemd-journal) in control group while starting unit. Ignoring.
[12651.509964] systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
[12651.524694] systemd[1]: systemd-journald.service: Found left-over process 75425 (systemd-journal) in control group while starting unit. Ignoring.
[12651.541827] systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
[12651.556536] systemd[1]: systemd-journald.service: Found left-over process 75428 (systemd-journal) in control group while starting unit. Ignoring.
[12651.575905] systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
[12651.592435] systemd[1]: systemd-journald.service: Found left-over process 75430 (systemd-journal) in control group while starting unit. Ignoring.
[12651.612706] systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
[12651.629075] systemd[1]: systemd-journald.service: Found left-over process 75437 (systemd-journal) in control group while starting unit. Ignoring.
[12651.647205] systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
[12651.893002] systemd[1]: Starting Journal Service...
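Processes that are still around after SIGKILL, as in the guest log above, are usually stuck in uninterruptible sleep (state D), typically waiting on I/O. Assuming that is the case here, a minimal sketch for listing them without ps is to read /proc/<pid>/stat directly. The function names are my own, and I have not verified that /proc stays readable on the affected VMs.

```python
import os


def parse_stat(stat_line):
    """Parse the start of a /proc/<pid>/stat line: 'pid (comm) state ...'.

    comm may itself contain spaces or ')', so split at the last ')'.
    """
    lparen = stat_line.index("(")
    rparen = stat_line.rindex(")")
    pid = int(stat_line[:lparen])
    comm = stat_line[lparen + 1:rparen]
    state = stat_line[rparen + 1:].split()[0]
    return pid, comm, state


def blocked_processes(proc_root="/proc"):
    """Return (pid, comm) for every process currently in state D."""
    blocked = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as /proc/meminfo
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                pid, comm, state = parse_stat(f.read())
        except (OSError, ValueError):
            continue  # process exited or entry could not be parsed
        if state == "D":
            blocked.append((pid, comm))
    return blocked


if __name__ == "__main__":
    for pid, comm in blocked_processes():
        print(pid, comm)
```

If the systemd-journal PIDs from the log show up in this list, that would point at the guest's block device not completing I/O, which would fit with tapdisk crashing on the host.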
Reinstalling the whole environment (XCP-ng + VMs) lets it run for several days, and then the issue comes back. We suspect a hardware issue, but diagnostics show no such problem.