Alert: Control Domain Memory Usage
Hi, I have some strange recurring memory usage problems after upgrading to XCP-NG 8.0.
Sometimes, after about a month of uptime, the control domain memory usage on some hosts grows until it is full, but there is no visible process eating up the RAM.
If I don't reboot the server, I see those errors being reported for a week or so before something crashes on the host. Mostly it's ovs-vswitchd, which of course breaks network connectivity.
This happens on several servers in several pools that ran for months or even years on XenServer versions from 7.0 upwards. It even happened on servers that had been running XCP-NG 7.4 for hundreds of days without problems.
This server is in maintenance mode now. Here is the output of top, sorted by %MEM:
top - 17:42:23 up 49 days, 13:38,  1 user,  load average: 1.07, 1.14, 1.33
Tasks: 345 total,   1 running, 198 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.7 us,  0.2 sy,  0.0 ni, 99.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.1 st
KiB Mem :  7498176 total,    96436 free,  7149172 used,   252568 buff/cache
KiB Swap:  1048572 total,   895996 free,   152576 used.   126440 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 2010 root      10 -10 1302552 155772   9756 S   1.0  2.1   1043:23 ovs-vswitchd
 2651 root      20   0  589132  28908   4540 S  11.3  0.4 227:39.22 xapi
 1525 root      20   0  288660  25012   1564 S   0.0  0.3 655:22.71 xcp-rrdd
 2663 root      20   0  214868  16268   6432 S   0.0  0.2  48:49.51 python
 2650 root      20   0  496804  15624   5912 S   0.0  0.2 162:33.50 xenopsd-xc
 1179 root      20   0  158980  15392   6876 S   0.0  0.2  21:29.70 message-switch
 1601 root      20   0   62480  10408   3848 S   0.0  0.1 118:33.49 xcp-rrdd-xenpm
 1597 root      20   0  119764  10292   3660 S   0.0  0.1 270:19.65 xcp-rrdd-iostat
 1516 root      20   0   76588   9796   3248 S   0.0  0.1 159:55.67 oxenstored
See free -m:
[17:54 xs03 ~]# free -m
              total        used        free      shared  buff/cache   available
Mem:           7322        6992          83           0         247         114
Swap:          1023         148         875
[17:54 xs03 ~]# cat /proc/meminfo
MemTotal:        7498176 kB
MemFree:           85376 kB
MemAvailable:     117788 kB
Buffers:           12300 kB
Cached:            97024 kB
SwapCached:        23740 kB
Active:           131240 kB
Inactive:         118996 kB
Active(anon):      79400 kB
Inactive(anon):    75292 kB
Active(file):      51840 kB
Inactive(file):    43704 kB
Unevictable:      168644 kB
Mlocked:          168644 kB
SwapTotal:       1048572 kB
SwapFree:         897276 kB
Dirty:                24 kB
Writeback:             0 kB
AnonPages:        305640 kB
Mapped:            56028 kB
Shmem:               376 kB
Slab:             144436 kB
SReclaimable:      38100 kB
SUnreclaim:       106336 kB
KernelStack:       11520 kB
PageTables:        14236 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:     4797660 kB
Committed_AS:    3936576 kB
VmallocTotal:   34359738367 kB
VmallocUsed:           0 kB
VmallocChunk:          0 kB
Percpu:            12160 kB
HardwareCorrupted:     0 kB
CmaTotal:              0 kB
CmaFree:               0 kB
DirectMap4k:     7897600 kB
DirectMap2M:           0 kB
How can I find out what is using the RAM?
htop will be better to spot things. You can alternatively use Netdata.
Thanks for your quick response.
Yes, htop is much prettier, but it doesn't see more than top, at least in this case.
I think Netdata will also just see the same things?
- Can you restart the toolstack and see if it changes the memory footprint? (see the sketch after this list)
- Anything suspicious in dmesg?
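If it helps, this is roughly what I mean (assuming a stock XCP-NG dom0; the toolstack restart leaves running VMs untouched):

xe-toolstack-restart    # restarts xapi and related services, not the VMs
dmesg | tail -n 50      # check for repeated kernel messages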
- A toolstack restart didn't change much.
- Good idea: indeed there are some entries in dmesg:
[4286099.036105] Status code returned 0xc000006d STATUS_LOGON_FAILURE
[4286099.036116] CIFS VFS: Send error in SessSetup = -13
With "mount" i can see two old cifs iso libaries, which have allready been detached from the pool.
I was able to unmount one of them.
The other one is busy:
umount: /run/sr-mount/43310179-cf17-1942-7845-59bf55cb3550: target is busy.
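Normally something like the following should show the process holding the mount (a sketch, untested here; with an unresponsive CIFS server these commands can hang as well):

fuser -vm /run/sr-mount/43310179-cf17-1942-7845-59bf55cb3550    # processes using the mount
lsof +f -- /run/sr-mount/43310179-cf17-1942-7845-59bf55cb3550   # open files on that filesystem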
I will have a look into it a little later and report.
I was not able to find what was keeping the target busy.
I did a lazy unmount ("umount -l"); since then I have no new dmesg lines, but the memory isn't freed.
So I'm not sure if this is/was the reason for the problem. Other servers in the pool have the same old mount but do not use this much RAM. I will check CIFS mounts the next time I see this problem.
I'm still looking for a way to find out what is using this memory when there is no visible process using it, even though this may be a common Linux question.
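One approach I'm trying (a rough sketch, not verified on these hosts): sum the RSS of all processes and compare it with what the kernel reports as used. If the gap is large, the memory is held by the kernel itself (e.g. slab, where CIFS structures would live) rather than by any process:

ps -eo rss --no-headers | awk '{s += $1} END {print "total process RSS:", s, "kB"}'
grep -E '^(Slab|SReclaimable|SUnreclaim|Unevictable)' /proc/meminfo
slabtop -o -s c | head -n 15    # largest kernel slab caches first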
I had a look at another server in another pool this morning, which also had memory alerts.
Guess what: there was an unresponsive CIFS ISO share too.
The share looked like it was online from XCP-NG Center, but the ISOs it contains were not shown.
I could not detach it from XCP-NG Center, but after a few tries I could do a pbd-unplug.
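For reference, roughly the commands involved (the UUIDs below are placeholders, not values from this pool):

xe pbd-list sr-uuid=<iso-sr-uuid> params=uuid,host-uuid,currently-attached
xe pbd-unplug uuid=<pbd-uuid>
xe sr-forget uuid=<iso-sr-uuid>    # optional: forget the SR once all PBDs are unplugged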
This share was on a small NAS, which may have been restarted while the pool was running.
So maybe something happens when you have an unresponsive CIFS ISO share that eats up your RAM.
If this is really the reason, I think this behaviour changed in newer versions of XCP-NG/XenServer. I had dozens of servers where the ISO store credentials changed and/or the CIFS server was restarted, without this problem.
We're seeing the same kind of development on two servers. Nothing interesting in dmesg or anywhere else.
Will have to resize for now I guess.