CentOS 8 VM reboots under IO load

VoipDude

Hello everyone,

I've got a strange issue on my XCP-ng installation at home.

Short story:
In short, a CentOS 8 VM that I have keeps crashing whenever I try to perform updates or give it some IO load (more on this later). I seem to have the same issue with a Windows Server 2019 VM which also crashes sometimes while performing updates (but am unable to reproduce this consistently). A brand new CO8 VM does not have this issue neither other older CO VMs (5 and 6) on the same host.

Symptoms:
When I say crashing, from the XO console for the CO8 VM it simply looks like a reboot. I lose SSH and I see the VM rebooting, no error messages nor anything. There is nothing in the VM's logs either following the crash. I tried finding logs in XCP-ng and either they're empty or I'm not looking in the right place.

The VM is not running any services/workload either. The crash is reproducible whenever I run a yum update or an IO benchmark (such as fio)

Troubleshooting that I did:
At first, I suspected RAM issues. I ran memtest86 overnight with no errors reported. I even changed out the RAM from another working computer, but the same issues happened. So I do not suspect this is bad RAM.

The XCP-ng install is running in Software RAID1 (configured by the XCP-ng installer) over a pair of RAID edition SATA disks. I do not suspect the disks are bad, as other VMs have no issues neither are there any errors reported in logs.

I then thought perhaps this was a bug with XCP-ng 8.1, so I upgraded to 8.2 recently. Same issue persists with no difference.

I also doubt that it's a hardware problem in general as my other CentOS 5 and CentOS 6 VMs run rock solid, no matter how hard I hit the IO. A brand new CO8 VM was also able to complete IO benchmarks without crashing.

Nothing crazy was done on the crashing CO8 VM either, it's running the stock CO kernel. I even uninstalled the xen guest tools in case it was an issue with them.

My question is as follows:

What could I do to to troubleshoot this further? Does XCP-ng have a log that could give me clues?
I'm happy that this is happening in my home testing environment and not production, but I'd like to resolve this issue and gain insight into how I could troubleshoot this if I'm ever faced with such a thing in prod.

Thanks!

olivierlambert

Hi!

Have you checked https://xcp-ng.org/docs/troubleshooting.html ?

fred974

@voipdude Hi, I have a very similar issue with Centos 8 rebooting when copying data.

Did you managed to fix it?

Thank you

pescobar

I am experiencing the same problem with a Centos8 stream VM. Did you find a solution?

olivierlambert

Anything in the logs? Eg xl dmesg or dmesg when the VM crashes?

pescobar

@olivierlambert I was experiencing the same problem on a centos8 host. I could always reproduce the crash by triggering an rsync of a 10GB folder. I was also getting these lines in /var/log/xensource.log

Nov 16 17:46:30 bm-ve-srv02 xenopsd-xc: [debug|bm-ve-srv02|39 |Async.VM.clean_shutdown R:2f7f9c937513|xenops] Device.Generic.hard_shutdown_request frontend (domid=37 | kind=vif | devid=1); backend (domid=0 | kind=vif | devid=1)

I could workaround it by setting a fixed value for memory as suggested in post https://xcp-ng.org/forum/topic/4176/vm-keep-rebooting

In the "Advanced" tab for the VM I had "Memory Limits >> Dynamic 2GB/16GB"

I have changed it to "Memory Limits >> Dynamic 16GB/16GB" and the machine doesn't crash anymore when I trigger the rsync.

olivierlambert

Probably a problem with dynamic memory allocation and free memory space available/used by something else. Did you have anything in xl dmesg?

pescobar

@olivierlambert indeed, I had useful information in xl dmesg

(XEN) [4940621.304592] p2m_pod_demand_populate: Dom34 out of PoD memory! (tot=2097181 ents=2097120 dom0)
(XEN) [4940621.304599] domain_crash called from p2m_pod_demand_populate+0x76a/0xb02
(XEN) [4940846.207706] p2m_pod_demand_populate: Dom35 out of PoD memory! (tot=2097182 ents=2097120 dom35)
(XEN) [4940846.207716] domain_crash called from p2m_pod_demand_populate+0x76a/0xb02
(XEN) [4940846.207718] Domain 35 (vcpu#3) crashed on cpu#12:
(XEN) [4940846.207721] ----[ Xen-4.7.6-6.9.xcpng  x86_64  debug=n   Not tainted ]----
(XEN) [4940846.207723] CPU:    12
(XEN) [4940846.207725] RIP:    0010:[<ffffffff91f6a639>]
(XEN) [4940846.207726] RFLAGS: 0000000000010206   CONTEXT: hvm guest (d35v3)
(XEN) [4940846.207729] rax: 0000000000000400   rbx: 0000000001933000   rcx: 0000000000000c00
(XEN) [4940846.207731] rdx: 0000000000000c00   rsi: 0000000000000000   rdi: ffff950686600400
(XEN) [4940846.207733] rbp: 0000000000000400   rsp: ffffabd642883b98   r8:  0000000000001000
(XEN) [4940846.207734] r9:  ffff950686600400   r10: 0000000000000000   r11: 0000000000001000
(XEN) [4940846.207736] r12: ffffabd642883cd0   r13: ffff950630d5d1f0   r14: 0000000000000000
(XEN) [4940846.207737] r15: ffffeb95cc198000   cr0: 0000000080050033   cr4: 00000000007706e0
(XEN) [4940846.207739] cr3: 000000028fa52002   cr2: 0000560144fb4fa0
(XEN) [4940846.207740] fsb: 00007f58e207fb80   gsb: ffff95077d6c0000   gss: 0000000000000000
(XEN) [4940846.207742] ds: 0000   es: 0000   fs: 0000   gs: 0000   ss: 0018   cs: 0010
(XEN) [4940846.208059] p2m_pod_demand_populate: Dom35 out of PoD memory! (tot=2097182 ents=2097120 dom35)
(XEN) [4940846.208067] domain_crash called from p2m_pod_demand_populate+0x76a/0xb02
(XEN) [4940846.208380] p2m_pod_demand_populate: Dom35 out of PoD memory! (tot=2097182 ents=2097120 dom35)
(XEN) [4940846.208383] domain_crash called from p2m_pod_demand_populate+0x76a/0xb02
(XEN) [4940846.208689] p2m_pod_demand_populate: Dom35 out of PoD memory! (tot=2097182 ents=2097120 dom35)
(XEN) [4940846.208691] domain_crash called from p2m_pod_demand_populate+0x76a/0xb02
(XEN) [4941014.518002] p2m_pod_demand_populate: Dom36 out of PoD memory! (tot=2097182 ents=2097120 dom36)
(XEN) [4941014.518009] domain_crash called from p2m_pod_demand_populate+0x76a/0xb02
(XEN) [4941014.518011] Domain 36 (vcpu#1) crashed on cpu#8:
(XEN) [4941014.518014] ----[ Xen-4.7.6-6.9.xcpng  x86_64  debug=n   Not tainted ]----
(XEN) [4941014.518016] CPU:    8
(XEN) [4941014.518018] RIP:    0010:[<ffffffffae56a639>]
(XEN) [4941014.518019] RFLAGS: 0000000000010206   CONTEXT: hvm guest (d36v1)
(XEN) [4941014.518022] rax: 0000000000000400   rbx: 0000000000833000   rcx: 0000000000000c00
(XEN) [4941014.518024] rdx: 0000000000000c00   rsi: 0000000000000000   rdi: ffff9309f3efe400
(XEN) [4941014.518026] rbp: 0000000000000400   rsp: ffffa9cc42ec7b98   r8:  0000000000001000
(XEN) [4941014.518027] r9:  ffff9309f3efe400   r10: 0000000000000000   r11: 0000000000001000
(XEN) [4941014.518029] r12: ffffa9cc42ec7cd0   r13: ffff9309c48073f0   r14: 0000000000000000
(XEN) [4941014.518031] r15: ffffe851cccfbf80   cr0: 0000000080050033   cr4: 00000000007706e0
(XEN) [4941014.518032] cr3: 000000010a57e001   cr2: 0000558ea0189328
(XEN) [4941014.518034] fsb: 00007fd2b9c1cb80   gsb: ffff930abd640000   gss: 0000000000000000
(XEN) [4941014.518035] ds: 0000   es: 0000   fs: 0000   gs: 0000   ss: 0018   cs: 0010
(XEN) [4941014.518373] p2m_pod_demand_populate: Dom36 out of PoD memory! (tot=2097182 ents=2097120 dom36)
(XEN) [4941014.518376] domain_crash called from p2m_pod_demand_populate+0x76a/0xb02
(XEN) [4941014.518703] p2m_pod_demand_populate: Dom36 out of PoD memory! (tot=2097182 ents=2097120 dom36)
(XEN) [4941014.518705] domain_crash called from p2m_pod_demand_populate+0x76a/0xb02
(XEN) [4941015.091088] p2m_pod_demand_populate: Dom36 out of PoD memory! (tot=2097181 ents=2097120 dom0)
(XEN) [4941015.091098] domain_crash called from p2m_pod_demand_populate+0x76a/0xb02
(XEN) [4941015.091453] p2m_pod_demand_populate: Dom36 out of PoD memory! (tot=2097181 ents=2097120 dom0)
(XEN) [4941015.091456] domain_crash called from p2m_pod_demand_populate+0x76a/0xb02
(XEN) [4941019.236252] p2m_pod_demand_populate: Dom36 out of PoD memory! (tot=2097181 ents=2097120 dom0)
(XEN) [4941019.236262] domain_crash called from p2m_pod_demand_populate+0x76a/0xb02
(XEN) [4941019.237006] p2m_pod_demand_populate: Dom36 out of PoD memory! (tot=2097181 ents=2097120 dom0)
(XEN) [4941019.237013] domain_crash called from p2m_pod_demand_populate+0x76a/0xb02
(XEN) [4941019.237301] p2m_pod_demand_populate: Dom36 out of PoD memory! (tot=2097181 ents=2097120 dom0)
(XEN) [4941019.237303] domain_crash called from p2m_pod_demand_populate+0x76a/0xb02

thanks for you help.

olivierlambert

That's pretty clear. Your host didn't have enough "populate on demand" memory, used for dynamic memory usage. So the domain crashed when trying to get more memory in live.

pescobar

@olivierlambert that's weird because if I go to XOA >> hosts to check the information for this hypervisor it says:

RAM: 178 GiB used on 256 GiB (78 GiB free)

I am running

XCP-ng 7.6.0 (GPLv2)

stormi

There might be an issue somewhere in the way dynamic memory is handled, but I'm afraid it would be a lot of work to debug and we're not likely to do it for XCP-ng 7.6.

Alternatively, maybe at some point the host used all the available RAM and released it since?

pescobar

@stormi indeed I don't think it's worth the time debugging the issue in such an old version of xcp-ng, specially when there is a workaround by setting a fixed amount of ram.

We should upgrade this host anyway and I will report back in case we still experience similar issue with the latest stable version.

VoipDude

Hello guys,

Glad to hear that my thread had traction and others helped with troubleshooting

My issue still keeps on happening and I now just left that Win Server 2019 VM that keeps crashing nightly when it tries to auto-apply Windows updates.
xl dmesg shows that it's out of memory:

[14:59 xenhome ~]# xl dmesg
m_pod_demand_populate: Dom18 out of PoD memory! (tot=2097695 ents=524256 dom0)
(XEN) [4145112.313876] domain_crash called from p2m_pod_demand_populate+0x751/0xa40
(XEN) [4145112.317876] p2m_pod_demand_populate: Dom18 out of PoD memory! (tot=2097695 ents=524256 dom0)
(XEN) [4145112.317879] domain_crash called from p2m_pod_demand_populate+0x751/0xa40
(XEN) [4145112.320228] p2m_pod_demand_populate: Dom18 out of PoD memory! (tot=2097695 ents=524256 dom0)

However, this host should have more than enough RAM. Here is a screenshot of the RAM graph from XO for the last week:
Screen Shot 2021-11-30 at 3.04.59 PM.png

The windows VM in question has a 2GB/8GB dynamic allocation, but the graph shows the 8GB always in use:
Screen Shot 2021-11-30 at 3.06.53 PM.png

And unlike @pescobar, I am running the latest version of XCP-NG here:

[15:03 xenhome ~]# cat /etc/redhat-release 
XCP-ng release 8.2.0 (xenenterprise)

I'm glad to hear that not doing dynamic solved the issue for pescobar, but now I want to get to the bottom of this because maybe this bug might impact someone in prod.

Let me know what other info I could provide so that we can troubleshoot this further.

Thanks!

olivierlambert

There's not enough memory for the ballooning driver to grow, and this cause a domain crash.

Going to the bottom of this is not simple I'm afraid.