VMs with around 24GB+ of memory crash on migration.

Kevin87

Hi all,

When I migrate a VM from node to node using shared storage, and the VM has more than 24GB of memory, it always crashes.
I made a kdump, which produced a vmcore-dmesg file.

These are the lines:
vmcore-dmesg.txt

Does anyone know why this gets triggered? The VMs use static memory with the same min/max limit, so it is not trying to lower the memory before the migration.
Also, the VM is not really busy, and it is transferred over a 10Gbit network in less than a minute.

Kind regards,

Kevin

Danp

Hi Kevin,

Are you running Xen or XCP-ng? Which OS is being used in the VM?

Dan

Kevin87

Hi Danp,

I am using XCP-ng 8.2 with the latest updates.
The VM OS is in most cases CentOS 7.9 with guest tools installed.

olivierlambert

xenwatch: page allocation failure: order:5, mode:0xc0d0

Not a good sign

@andSmv can you take a look when you can?

andSmv

Hmmm, there are two problems here (a page allocation failure warning and a NULL pointer BUG) in the context of the xenwatch kernel thread, and basically both of them happen when configuring the Xen network frontend/backend communications.

Normally this isn't related to the memory footprint of the VM, but rather to the Xen frontend/backend xenbus communication framework. Do the bugs disappear when you reduce the memory size of the VM while all other parameters and the environment stay the same?

Kevin87

I have not tested what happens if I reduce the memory of that VM, because the VM needs that amount of memory. I do know that we have around 190 VMs and it only happens with VMs that have a lot of memory.

andSmv

Obviously, it can't be excluded that the issue is related to the memory footprint. Moreover, the first warning "complains" about a memory allocation failure. (I suppose that the "receiver" node has enough memory to host the VM.)

Normally Xen has no limitation on live-migrating a 24GB VM, so it's difficult to say what the issue is here. But clearly there's a possibility that this is a bug in Xen/toolstack... Memory fragmentation on the "receiver" node can be an issue too.

You can probably run some different configurations to try to pinpoint this issue.
Maybe for a start, try to migrate a VM when no other VMs are running on the "receiver" node. Also try to migrate a VM with no network connections (as the issue seems to be related to network backend status changes).
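
To put the "order:5" in the xenwatch warning and the fragmentation idea into numbers: in Linux, an order-n allocation asks for 2^n physically contiguous pages, so order 5 means 32 contiguous pages. Below is a minimal Python sketch that prints that size and counts free blocks of order 5 or higher from /proc/buddyinfo. It assumes 4 KiB pages, and it only inspects the buddy allocator of the Linux kernel it runs on (guest or dom0), not the Xen hypervisor heap. The function names are made up for illustration.

```python
from pathlib import Path

PAGE_SIZE = 4096  # assumption: 4 KiB pages, the usual x86-64 default


def parse_buddyinfo(text):
    """Return {(node, zone): [free block count for order 0, 1, 2, ...]}."""
    zones = {}
    for line in text.splitlines():
        parts = line.split()
        # Expected shape: "Node 0, zone Normal 12 7 3 ..."
        if len(parts) < 5 or parts[0] != "Node" or parts[2] != "zone":
            continue
        node = parts[1].rstrip(",")
        zones[(node, parts[3])] = [int(n) for n in parts[4:]]
    return zones


def free_blocks_at_or_above(counts, order):
    """Count free blocks large enough to satisfy an allocation of `order`."""
    return sum(counts[order:])


if __name__ == "__main__":
    order = 5  # taken from "page allocation failure: order:5"
    pages = 2 ** order
    print(f"order:{order} = {pages} contiguous pages = {pages * PAGE_SIZE // 1024} KiB")

    path = Path("/proc/buddyinfo")
    if path.exists():
        for (node, zone), counts in parse_buddyinfo(path.read_text()).items():
            usable = free_blocks_at_or_above(counts, order)
            print(f"node {node} zone {zone}: {usable} free blocks of order >= {order}")
```

If the count for the Normal zone is at or near zero on the machine that logged the warning, that would be consistent with fragmentation, though it would not prove that fragmentation causes the crash.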

Kevin87

Dear,

There is indeed enough memory on the receiving node. We have nodes with 1TB of memory, and currently they are loaded with around 500GB each. I'll try to reproduce it with a cloned production server, so I can reproduce it a few times with and without network, and to an empty receiving node. I'll keep you updated.