Weird kern.log errors

dredknight

@Danp yes, all servers came in one batch and had the same hardware and software installed at the time.
By all, I mean 36 in total, with the same hardware and the same workload. So it is just a 1/36 failure rate, which to me pretty much means a hardware issue, yet no diagnostic report shows one, which is quite peculiar.

olivierlambert

And you can reproduce the issue only on this one?

dredknight

Yes, only on this one. It happens within 2-5 days after a complete reset.
From the user perspective we know the following.

The server has 2 VMs running at 80-100% CPU utilization (each VM has 64 cores assigned; the server has 2 EPYC CPUs for a total of 128 cores).
After the issue occurs, one of the VMs becomes unresponsive and the other is kind of fine: you can log in, but no commands can be executed on it. For example, you type "top" or "df -h", press Enter, and it stays like that indefinitely with no output.

One tip I got, though a bit far-fetched, is that it could be "cosmic ray" behaviour. I don't know about that, but so far nothing else can be tracked down.

olivierlambert

When you say "host", are you talking about the physical host or a VM?

dredknight

@olivierlambert host = physical server. I fixed it in the last post.

olivierlambert

That's indeed a clue if it happens ONLY on this machine, if you run the same VMs on others without triggering any problem, and if you have the exact same versions of XCP-ng/BIOS/firmware and the same hardware.

dredknight

@olivierlambert said in Weird kern.log errors:

That's indeed a clue if it happens ONLY on this machine, if you run the same VMs on others without triggering any problem, and if you have the exact same versions of XCP-ng/BIOS/firmware and the same hardware.

All servers spin up VMs from the same image (using CloudStack on top), and the workload pushed to them is the same.
If only one VM failed, it would make sense that the VM is at fault, but in the current case both VMs on that server break at the same time.

All servers have 2 SSDs in RAID 1. In the last iteration we did not use RAID and placed each VM on a separate SSD, just in case the issue came from there. The problem still appeared in the same way as before.

We are thinking of other ways to run tests at the moment. Will keep you updated :).

olivierlambert

Please do, I'm eager to know the root cause

bleader

As it stands, it doesn't ring a bell for me.

What I see from the first log is the segfault on blktap and in xcp-rrdd-xenpm, likely that was while writing to a disk. In all cases, it is a xen_mc_flush() call.

Given it happens on a single machine, I would venture it could be related to the disk controller or the disk itself. You could have a look at dmidecode to see if the controllers are the same as on the other machines (sometimes there are small discrepancies between supposedly identical hardware), and check the drives with smartctl for any health issues. But especially as you were on RAID 1 originally, I doubt a problem with the drives themselves would lead to such an issue...
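To make the "supposedly identical hardware" comparison less error-prone, here is a minimal Python sketch. It assumes you have already saved the dmidecode output of a healthy server and of the faulty one as text; the function name `diff_reports` is hypothetical, and the sketch only compares text, it does not run dmidecode itself.

```python
import difflib


def diff_reports(reference: str, suspect: str) -> list[str]:
    """Return only the lines that differ between two text reports.

    Lines prefixed with "- " exist only in the reference report,
    lines prefixed with "+ " exist only in the suspect report.
    """
    delta = difflib.ndiff(reference.splitlines(), suspect.splitlines())
    # ndiff also emits unchanged lines ("  ") and hint lines ("? ");
    # keep only the real differences.
    return [line for line in delta if line.startswith(("- ", "+ "))]
```

Expect some noise: fields such as serial numbers and UUIDs differ on every machine, so only differences in model, firmware, or revision fields are worth a closer look.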

andSmv

Yeah, the HW problem seems to be a good guess.

The track we can follow here is the xen_mc_flush kernel function, which raises a warning when a multicall (a hypercall wrapper) fails. The interesting thing here would be to take a look at the Xen traces. You can type xl dmesg in dom0 to see if Xen tells you something more (if it isn't happy for some reason).
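If the logs are long, a small filter helps to see how often the markers mentioned in this thread show up. This is a minimal Python sketch, assuming the kern.log content has been read into a string; the name `summarize_hits` is hypothetical and the pattern list is only a starting point to adjust.

```python
import re
from collections import Counter

# Markers taken from this thread; extend the pattern as needed.
PATTERN = re.compile(r"xen_mc_flush|segfault", re.IGNORECASE)


def summarize_hits(log_text: str) -> Counter:
    """Count occurrences of each marker in the given log text."""
    counts: Counter = Counter()
    for line in log_text.splitlines():
        for match in PATTERN.findall(line):
            # Normalize case so "Segfault" and "segfault" count together.
            counts[match.lower()] += 1
    return counts
```

Running it on the same log from a healthy server gives a baseline: if the counts there are zero, the warnings are specific to the faulty machine.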

dredknight

FYI - it was a CPU issue.
We swapped the CPUs between servers and the problem moved with them.

Thanks for the tips everyone!

olivierlambert

Ahh great news! CPU issues are really tricky