Hey all,
I am building a home lab and will be glad to test the new XCP with Cloudstack on top. Followed the repo!
@olivierlambert said in Weird kern.log errors:
That's indeed a clue if it happens ONLY on this machine if you run the same VMs on others without triggering any problem, and if you have exact same versions of XCP-ng/BIOS/firmware and the same hardware
All VMs are spun up from the same image (using CloudStack on top), and the workload pushed to them is identical.
If a single VM were the issue, it would make sense that the VM is at fault, but in this case both VMs on that server break at the same time.
All servers have 2 SSDs in RAID 1. In the last iteration we dropped the RAID and placed one VM on a different SSD, just in case the issue came from there. The problem still appeared in the same way as before.
We are thinking of other ways to run tests at the moment. Will keep you updated :).
@Linus-S here is something you can do:
qemu-img convert -O vpc xoa-disk.qcow2 xoa-disk.vhd
@Ascar B) is your way to go. VMs with PCI passthrough will fail migration (A) because the passthrough device itself cannot be migrated.
@olivierlambert I was just trying to describe it better. If that is the case then all is good.
@olivierlambert the additional IP is used when one has a dedicated network for iSCSI traffic. Here are some examples:
Case 1
imagine you have a host with 3 interfaces:
Case 2
Same as the previous one, except the user decides to set up the iSCSI network IP on the management interface.
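For Case 1, the dedicated storage IP would typically go on the spare NIC rather than on the management interface. A rough sketch with the xe CLI; the device name, UUID placeholder, and addresses are all made up for illustration:

```shell
# Find the PIF backing the dedicated iSCSI NIC (eth2 is an assumption)
xe pif-list device=eth2 params=uuid,device,IP

# Assign it a static IP on the storage subnet (example addresses)
xe pif-reconfigure-ip uuid=<pif-uuid> mode=static \
  IP=10.0.10.11 netmask=255.255.255.0
```

This keeps iSCSI traffic off the management network, which is the whole point of Case 1.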
@creoleMalady one more suggestion for the sake of an alternative solution. If you insist on managing multiple servers individually rather than in a pool (even though running them in a pool would equalize the CPU features to the oldest CPU available), you can use CloudStack as a management platform to handle the templates and networking for you. It dynamically transfers templates and assigns networks on demand as you provision VMs on the hosts.
The disadvantage is that you need to learn CloudStack, which can be a lot of overhead on its own if you don't want to go too deep.
FYI - it was a CPU issue.
We changed the CPUs between servers and it moved with them.
Thanks for the tips everyone!
@olivierlambert host = physical server. I fixed it in the last post.
Yes, it happens only on this one, within 2-5 days of a complete reset.
From the user's perspective, we know the following.
The server has 2 VMs running at 80-100% CPU utilization (each VM has 64 cores assigned; the server has 2 EPYC CPUs for a total of 128 cores).
After the issue occurs, one of the VMs becomes unresponsive and the other is only partially usable: you can log in, but no commands execute. For example, you type "top" or "df -h", press Enter, and it hangs indefinitely with no output.
One tip I got, though a bit far-fetched, is that it could be "cosmic ray" behaviour. I don't know about that, but so far nothing else can be traced.
@Danp yes all servers came in one batch and had the same hardware and software installed at the time.
By all, I mean 36 in total with the same hardware and the same workload. So it is just a 1/36 failure rate, which pretty much means a hardware issue to me, and yet there is no diagnostic report of it, which is quite peculiar.