domU hang on Xen v4.7
We are currently experiencing a problem with random freezing on a Paravirt Dom_U running on a Xen Hypervisor V4.7 at several of our customers (8 reported sites, with 15 instances). The freeze happens at random, and is not frequent – so far, our attempts to reproduce it in house/on demand have been unsuccessful.
The frozen Dom_U manifests no stress or errors when it freezes and appears to be unrelated to load. The Dom_0 (SuSE Leap 42.2) is only running two Dom_U, the problem PV Linux server (also Leap 42.2) and an HVM Windows Server 2012.
The Dom_0 shows saturated CPU usage for the frozen Dom_U, but there is no console response, no network response and no disk IO.
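One way to put numbers on the "saturated CPU but no response" symptom is to sample the domain's state flags and accumulated CPU time from the Dom_0. This is only a rough sketch: dom_state is a hypothetical helper name, and it assumes the usual xl list column layout (Name, ID, Mem, VCPUs, State, Time(s)) and a domain name without spaces.

```shell
#!/bin/sh
# Sketch only: dom_state is a hypothetical helper, not part of the xl toolstack.
# XL can be overridden if xl is not on the PATH.
XL="${XL:-xl}"

dom_state() {
    # "xl list <domain>" prints a header line followed by one line for the domain.
    # Column 5 is the state flags (r = running, b = blocked, ...),
    # column 6 is the accumulated CPU time in seconds.
    "$XL" list "$1" | awk 'NR == 2 { print $5, $6 }'
}

# Example use on the Dom_0 (domain name is a placeholder):
#   dom_state my-pv-guest; sleep 10; dom_state my-pv-guest
```

If the CPU time keeps climbing between two samples while the guest answers nothing, the vCPUs are most likely spinning rather than blocked, which is worth noting alongside any other data you collect.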
We suspect the issue is grant table related, but xl debug-keys g presents unreadable/unclear information in the xl dmesg output when compared to later platforms where we are running Xen 4.12.
The freezes are very sporadic/random, and we have increased the default grant tables to 256 on one of the symptomatic sites. It has not frozen since, but it has only been a week. This is inconclusive: we are unable to inspect the grant table usage using debug-keys g on 4.7, so we have no concrete evidence that it was even the problem.
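For reference, this is roughly how the grant table dump can be captured to a file for later comparison. It is a sketch, not a fix for the unreadable 4.7 output: dump_gnttab and the OUTDIR path are made-up names for illustration, and it has to run as root on the Dom_0.

```shell
#!/bin/sh
# Sketch only: dump_gnttab and OUTDIR are hypothetical names chosen for illustration.
XL="${XL:-xl}"
OUTDIR="${OUTDIR:-/var/tmp/gnttab-dumps}"

dump_gnttab() {
    mkdir -p "$OUTDIR" || return 1
    out="$OUTDIR/gnttab-$(date +%Y%m%dT%H%M%S).log"
    # The 'g' debug key asks Xen to print grant table usage into its console ring.
    "$XL" debug-keys g >/dev/null || return 1
    # xl dmesg reads that ring back; save it so dumps can be compared over time.
    "$XL" dmesg > "$out" || return 1
    echo "$out"
}

# Only take a dump when the xl tool is actually available.
if command -v "$XL" >/dev/null 2>&1; then
    dump_gnttab
fi
```

Taking a dump at regular intervals (from cron, for example) would at least leave a last known snapshot from before a freeze, whatever the 4.7 output turns out to contain.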
Is there any way on V4.7 to get accurate information about a Dom_U's grant table usage, and what other factors might be causing this type of hang/freeze, please?
Currently, due to issues with deployment options, we're not able to upgrade all impacted sites to a newer version of Xen or migrate them to an alternative product such as XCP-ng, even though that might be the best choice. We're looking to fix the sites we've got out there currently if possible, and any assistance the community can offer would be greatly appreciated.
Did you try to run those freezing Guest VMs in HVM?
Are you using CPU/Memory hot plug?
Can you try a screen session where you can keep xl console DOM-ID on, so that when the guest freezes you would see logs there, if any?
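A possible way to set that up, as a sketch: start_console_log and the xen-console-... session name are made up for illustration. It relies on screen's -L (log window output) and -dmS (start detached with a session name) options; the exact log file name and location depend on your screen version and configuration.

```shell
#!/bin/sh
# Sketch only: start_console_log is a hypothetical helper name.
SCREEN="${SCREEN:-screen}"
XL="${XL:-xl}"

start_console_log() {
    dom="$1"
    [ -n "$dom" ] || { echo "usage: start_console_log DOMAIN" >&2; return 2; }
    # -L   log the window's output (by default to screenlog.0 in the current directory)
    # -dmS start a detached session with the given name
    "$SCREEN" -L -dmS "xen-console-$dom" "$XL" console "$dom"
}

# Example use on the Dom_0 (domain ID is a placeholder):
#   start_console_log 5
#   screen -r xen-console-5    # reattach later to look at the live console
```

After a freeze, the log file should hold whatever the guest last printed on its console, even if nobody was attached at the time.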
You can also run xen in debug mode to let it print more erroneous information in
There is nothing in any logs at all, on either the Dom_0 or the Dom_U, at the time of the crash. The journal on the Dom_U stops updating at the instant of the freeze and only resumes after the Dom_U has been destroyed and restarted.
We are in the process of moving sites to HVM, that is an ongoing part of our plan, 1 down...
We've not tried running a screen session, but that is a very good idea, and we'll do that - thanks! I don't believe we've tried running in debug mode either, so will pursue that too.
Thanks for the advice, I appreciate it - I know it isn't anything even remotely related to your work, but as an XCP-ng user personally, I wasn't sure where else to try.