We are currently experiencing random freezes of a paravirtualized (PV) Dom_U running on Xen Hypervisor V4.7 at several of our customers (8 reported sites, with 15 instances). The freezes are infrequent, and so far our attempts to reproduce them in house, on demand, have been unsuccessful.
The frozen Dom_U shows no sign of stress and logs no errors when it freezes, and the freeze appears to be unrelated to load. The Dom_0 (SuSE Leap 42.2) is running only two Dom_Us: the problem PV Linux server (also Leap 42.2) and an HVM Windows Server 2012.
The Dom_0 shows saturated CPU usage for the frozen Dom_U, but the guest gives no console or network response and performs no disk I/O.
We suspect the issue is grant table related, but on 4.7 xl debug-keys g writes unreadable/unclear information to the xl dmesg output, compared with later platforms where we are running Xen 4.12.
The freezes are very sporadic, and we have increased the default grant tables value to 256 on one of the symptomatic sites. It has not frozen since, but it has only been a week, so this is inconclusive: because we are unable to inspect grant table usage with xl debug-keys g on 4.7, we have no concrete evidence that grant tables were even the problem.
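One way the grant table dump could be separated from the rest of the hypervisor log is sketched below. It only uses the xl dmesg and xl debug-keys g commands mentioned above. The helper name capture_gnttab_dump and the snapshot file names are hypothetical, and the sketch assumes it runs as root in the Dom_0 and that the console ring does not wrap between the two snapshots. It does not make the 4.7 output any easier to interpret; it only isolates the lines the debug key added.

```shell
#!/bin/sh
# Sketch: print only the lines that `xl debug-keys g` adds to `xl dmesg`.
# Assumptions: run as root in the Dom_0; the hypervisor console ring does
# not wrap between the two snapshots (otherwise the diff is misleading).
# capture_gnttab_dump is a hypothetical helper name, not part of Xen.

capture_gnttab_dump() {
    out_dir="${1:-.}"                        # directory for the snapshots
    before="$out_dir/xl-dmesg-before.txt"
    after="$out_dir/xl-dmesg-after.txt"

    xl dmesg > "$before" || return 1         # log before the dump
    xl debug-keys g      || return 1         # ask Xen to dump grant table usage
    sleep 1                                  # let the hypervisor finish printing
    xl dmesg > "$after"  || return 1         # log after the dump

    # Keep only the lines present in the second snapshot but not the first.
    diff "$before" "$after" | sed -n 's/^> //p'
}
```

For example, capture_gnttab_dump /tmp > /tmp/gnttab-dump.txt would leave both snapshots and the isolated dump in /tmp, so that dumps taken at different times (or on 4.7 and 4.12 hosts) can be compared side by side.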
Is there any way on V4.7 to get accurate information about a Dom_U's grant table usage? And what other factors might be causing this type of hang/freeze?
Currently, due to issues with deployment options, we are not able to upgrade all impacted sites to a newer version of Xen or migrate them to an alternative product such as XCP-ng, even though that might be the best choice. We are looking to fix the sites we already have deployed if possible, and any assistance the community can offer would be greatly appreciated.