Apr 7 16:07:09 node-01 SMGC:  text = session.xenapi.host.call_plugin(slave, "on-slave", "multi", args)
It seems the issue is here. Are all the slaves in your pool healthy?
Question 2 - can I force coalesce by rescanning the disk? If I rescan the disk, is there any chance of this causing an outage?
This automatically kicks off GC, in addition to its regular schedule. Rescanning won't cause an outage. Do check /var/log/SMlog for any obvious error that prevents GC from working.
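To make the SMlog check concrete, here is a minimal Python sketch that pulls out GC lines that look like errors. The SMGC tag comes from the log line quoted above; the keyword list and the helper names (find_gc_errors, scan_smlog) are my own assumptions for illustration, not part of any XCP-ng tool, so adjust them to what your log actually contains.

```python
import re

# Assumed keywords that usually indicate a problem; not an official list.
ERROR_PATTERN = re.compile(r"exception|traceback|error|failed", re.IGNORECASE)


def find_gc_errors(lines):
    """Return the SMGC lines that match ERROR_PATTERN, without trailing newlines."""
    return [
        line.rstrip("\n")
        for line in lines
        if "SMGC" in line and ERROR_PATTERN.search(line)
    ]


def scan_smlog(path="/var/log/SMlog"):
    """Read the log at `path` and return suspicious GC lines (hypothetical helper)."""
    with open(path, errors="replace") as log:
        return find_gc_errors(log)
```

On a host you could run it as `for hit in scan_smlog(): print(hit)`. An empty result only means none of the assumed keywords matched, so it is still worth reading the log around the time of the last rescan.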
@borzel Thanks. That rules out my suspicion about those patches. We are still trying to reproduce this issue, so far without success. We would really appreciate it if you or someone from the community could arrange a test host that shows this problem. A test host is needed because we will have to replace the kernel multiple times to observe changes.
Another test users can do is to take iSCSI out of the equation: run some workloads on local disks (with backups) and check the control domain memory usage.
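For tracking control domain memory over time, a small Python sketch like the one below could be run periodically in dom0. It reads /proc/meminfo and reports total minus available memory; the function names are hypothetical, and "MemTotal - MemAvailable" is only one reasonable definition of "used", chosen here as an assumption.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo content into a dict of field name -> integer value."""
    info = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            # Most fields are in kB; a few (e.g. HugePages_Total) are plain counts.
            info[name.strip()] = int(parts[0])
    return info


def used_kib(info):
    """Memory in use, taken as MemTotal minus MemAvailable (an assumed definition)."""
    return info["MemTotal"] - info["MemAvailable"]


def sample_dom0_memory(path="/proc/meminfo"):
    """Return the current used memory in kB (hypothetical helper)."""
    with open(path) as f:
        return used_kib(parse_meminfo(f.read()))
```

Logging the value of `sample_dom0_memory()` every few minutes during the local-disk workload, and again during the iSCSI workload, gives two curves to compare. A leak should show up as steady growth that does not come back down.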
@borzel I wish comparing kernel-alt and the base kernel made this easy to catch... I'm sure that the tapdisk IO code is the same in kernel and kernel-alt.
The 2 patches mentioned earlier are present in the base kernel of XCP-ng 8.2 as well as in kernel-alt 4.19.142. They are also present in the XCP-ng 8.1 base kernel, but not in the XCP-ng 8.1 kernel-alt.
Can you confirm your kernel-alt version?
@borzel Between 4.19.19-6.0.9 and 4.19.19-6.0.10, the following two patches were added.
Both are well vetted and seem stable, with no further changes made to them. Was anything else updated along with the kernel?
@appollonius I think it will depend on the hardware settings. See if you have an option to change the mode in the motherboard settings.
I don't have physical hardware access to validate this, as I mostly use nested virtualization, but XCP-ng's /boot has the required files to support both UEFI and BIOS mode.
Maybe @stormi can confirm.
Yes, all drivers are stock kernel modules for kernel-alt. It would be interesting to see the behavior with override disabled. I think we can try both: first check whether the downgraded kernel shows the same symptoms, and then try disabling the updated drivers.
@olivierlambert @delaf What we know about kmemleak so far is that it only scans for and reports unreferenced objects. If a kernel module or the kernel itself is still holding (referencing) the memory, it may not show up. We are evaluating other options to find this.
kernel-alt is closer to upstream, so either this issue is known and fixed upstream, or it might have been introduced by kernel updates.
The oldest kernel available is 4.19.19-220.127.116.11.xcpng8.1. Is it possible to install it and see if the issue repeats?
After the system has been running for some time, the user can run
# echo scan > /sys/kernel/debug/kmemleak
and then
# cat /sys/kernel/debug/kmemleak
to see if there are any unreferenced objects floating in memory.
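If you want to repeat the scan at intervals and watch how the number of reports changes, the two commands can be wrapped in a short Python sketch. It assumes a kernel built with kmemleak support, debugfs mounted at /sys/kernel/debug, and root access; the function names are hypothetical. Counting relies on each report entry starting with "unreferenced object", which is how kmemleak formats its output.

```python
KMEMLEAK = "/sys/kernel/debug/kmemleak"


def kmemleak_scan(path=KMEMLEAK):
    """Trigger an immediate scan (same as: echo scan > /sys/kernel/debug/kmemleak)."""
    with open(path, "w") as f:
        f.write("scan\n")


def kmemleak_report(path=KMEMLEAK):
    """Return the current report (same as: cat /sys/kernel/debug/kmemleak)."""
    with open(path) as f:
        return f.read()


def count_unreferenced(report):
    """Count report entries; each one starts with 'unreferenced object'."""
    return sum(
        1 for line in report.splitlines()
        if line.startswith("unreferenced object")
    )
```

A typical use would be to call `kmemleak_scan()`, wait a little, and then print `count_unreferenced(kmemleak_report())`. A count that keeps growing between scans is the interesting signal, while a count of zero does not rule out a leak in memory that is still referenced.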
CephFS is working nicely, but the update deleted my previous secret in /etc and I had to reinstall the extra packages and recreate the SR and then obviously move the virtual disks back across and refresh
Were you not able to attach the pre-existing SR on CephFS? Depending on that, I'll take a look at the documentation or the driver.
@vegarnilsen Thanks, you got the correct one.
Can you share the output of # modinfo bnx2x?
One of them will be loaded, in the above order, depending on which is present.
@vegarnilsen Ok, that was helpful.
Can you try installing broadcom-bnxt-en-alt.x86_64 and report your observations? You will need to reboot.