@chrispage1 said in Auto coalescing of disks:
Apr 7 16:07:09 node-01 SMGC: [20480] text = session.xenapi.host.call_plugin(slave, "on-slave", "multi", args)
It seems the issue is here. Are all the slaves in your pool healthy?
@chrispage1 said in Auto coalescing of disks:
Question 2 - can I force coalesce by rescanning the disk? If I rescan the disk, is there any chance of this causing an outage?
A rescan automatically kicks off GC outside its regular schedule. Rescanning won't cause an outage. Do check /var/log/SMlog for any obvious error that keeps GC from working.
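A quick way to do that check is to filter the log down to GC lines that mention a failure. This is only a sketch: the helper name smlog_gc_errors is made up for illustration, and the keywords error, exception and fail are an assumption about what a problem looks like in SMlog, so adjust them to what you actually see.

```shell
# Hypothetical helper: show the most recent SMGC lines that mention a failure.
# $1 = log file (defaults to /var/log/SMlog), $2 = number of matches to show.
smlog_gc_errors() {
    log="${1:-/var/log/SMlog}"
    count="${2:-20}"
    # Keep only garbage collector lines, then only those that look like errors.
    grep -E 'SMGC' "$log" | grep -i -E 'error|exception|fail' | tail -n "$count"
}

# Usage in dom0:
#   smlog_gc_errors
#   smlog_gc_errors /var/log/SMlog 50
```

If this prints nothing, GC either ran cleanly or logged the problem with different wording, so reading the surrounding SMGC lines by hand is still worthwhile.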
@borzel Thanks. That rules out my suspicion about those patches. We are still trying to reproduce this issue, so far without success. We would really appreciate it if you or someone from the community could arrange a test host that shows this problem. We need a test host because we will have to replace the kernel multiple times to observe changes.
Another test users can run is to take iSCSI out of the equation: run some workloads on local disks (with backups) and check control domain memory usage.
@borzel Also, I need to be able to reproduce the issue in the lab. If you can give more details about how you do backups, maybe I can simulate something.
@borzel I wish comparing kernel-alt and the base kernel made this easy to catch... I'm sure that the tapdisk IO code is the same in kernel and kernel-alt.
The 2 patches mentioned earlier are also present in the base kernel of xcp-ng 8.2 as well as in kernel-alt 4.19.142. They are present in the xcp-ng 8.1 base kernel too, but not in the xcp-ng 8.1 kernel-alt.
Can you confirm your kernel-alt version?
@borzel Between 4.19.19-6.0.9 and 4.19.19-6.0.10, the following two patches were added:
0001-block-cleanup-__blkdev_issue_discard.patch
0001-block-fix-32-bit-overflow-in-__blkdev_issue_discard.patch
Both are well vetted and seem stable, with no further changes made to them. Was anything else updated along with the kernel?
@borzel How frequently do you restart VMs? And what's the last dom-id? # xl list
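To pull just the last dom-id out of that output, something like the following should do. It is a sketch that assumes the usual xl list column layout (Name, ID, Mem, VCPUs, State, Time(s)); the function name last_domid is made up for illustration.

```shell
# Print the highest domain ID found in `xl list` output read from stdin.
# The ID is read as the 5th field from the end of each row, so a domain
# name containing spaces does not shift the column.
last_domid() {
    awk 'NR > 1 { id = $(NF - 4) + 0; if (id > max) max = id } END { print max + 0 }'
}

# Usage in dom0:
#   xl list | last_domid
```

Since dom-ids only increase until the host reboots, a high number gives a rough idea of how many domains have been started since boot.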
@trent234 I think the system should run headless and Xen should be able to hide that device from dom0. Can you confirm that the BIOS is set to not use any display device?
@appollonius You can set the nomodeset option at the end of the line which starts with module2 /boot/vmlinuz-4.19-xen.
For another user it did not make any difference.
@delaf said in Alert: Control Domain Memory Usage:
one server (268) with 4.19.19-6.0.10.1.xcpng8.1: no more problem!
Yeah, we need to be sure that this is a stable kernel and that the memory leak was introduced somewhere after it.
@appollonius I don't think there is any harm. You will either be able to boot or not. You can always switch back to the old settings.
@appollonius I think it will depend on hardware settings. See if you have an option to change the mode in the motherboard settings.
I don't have physical hardware access to validate this, as I mostly use nested virtualization, but XCP-ng /boot has the required files to support both UEFI and BIOS mode.
Maybe @stormi can confirm.
Yes, all drivers are stock kernel modules for kernel-alt. It would be interesting to see the behavior with updates and override disabled. I think we can try both: first check whether the downgraded kernel shows the same symptoms, then disable the update drivers.
@olivierlambert @delaf what we know from kmemleak so far is that it will only scan and report unreferenced objects. If any kernel module or the kernel itself is still holding (referencing) the memory, it may not show up. We are evaluating other options to find this.
kernel-alt is more closely related to upstream, so either this issue is known and fixed upstream or it might have been introduced by kernel updates.
The oldest kernel available is 4.19.19-6.0.10.1.xcpng8.1. Is it possible to install it and see if the issue repeats?
@delaf and others, you can download and install and update from link which should work fine.
After the system has been running for some time, you can run # echo scan > /sys/kernel/debug/kmemleak and then # cat /sys/kernel/debug/kmemleak to see if there are any unreferenced objects floating in memory.
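The report can get long, so a count is handy for comparing runs over time. A minimal sketch, assuming each leak record in the report starts with a line beginning "unreferenced object"; the function name kmemleak_count is made up for illustration.

```shell
# Count leak records in a kmemleak report.
# $1 = report file (defaults to /sys/kernel/debug/kmemleak).
kmemleak_count() {
    report="${1:-/sys/kernel/debug/kmemleak}"
    # grep -c prints the number of matching lines, 0 if there are none.
    grep -c '^unreferenced object' "$report"
}

# Usage in dom0, as root, on a kernel with kmemleak enabled:
#   echo scan > /sys/kernel/debug/kmemleak
#   kmemleak_count
#   cat /sys/kernel/debug/kmemleak > /root/kmemleak-$(date +%F).txt   # keep a copy to share
```

A count that stays at 0 while dom0 memory keeps growing would fit the case where the memory is still referenced somewhere and kmemleak cannot see it.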
@jmccoy555 said in XCP-ng 8.2.0 RC now available!:
CephFS is working nicely, but the update deleted my previous secret in /etc and I had to reinstall the extra packages and recreate the SR and then obviously move the virtual disks back across and refresh
Were you not able to attach the pre-existing SR on CephFS? Depending on the answer, I'll take a look at the documentation or the driver.
@vegarnilsen Thanks, you got the correct one.
Can you share # modinfo bnx2x?
We have:
/lib/modules/4.19.0+1/kernel/drivers/net/ethernet/broadcom/bnx2x/bnx2x.ko with 1.712.30-0
/lib/modules/4.19.0+1/updates/bnx2x.ko with 1.714.24
/lib/modules/4.19.0+1/override/bnx2x.ko with 1.715.0
One of them will be loaded, in the above order, depending on its presence.
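To see which of these files exist on a given host and what version each one carries, a small loop over the paths works. This is a sketch: module_versions is a made-up helper name, and it only reports what is on disk, not which copy gets loaded.

```shell
# For each module file given as an argument, print its path and version,
# or "absent" if the file does not exist on this host.
module_versions() {
    for ko in "$@"; do
        if [ -e "$ko" ]; then
            printf '%s\t%s\n' "$ko" "$(modinfo -F version "$ko")"
        else
            printf '%s\t%s\n' "$ko" "absent"
        fi
    done
}

# Usage in dom0:
#   module_versions \
#     /lib/modules/4.19.0+1/kernel/drivers/net/ethernet/broadcom/bnx2x/bnx2x.ko \
#     /lib/modules/4.19.0+1/updates/bnx2x.ko \
#     /lib/modules/4.19.0+1/override/bnx2x.ko
#   modinfo -n bnx2x    # file that modprobe would resolve for this module name
```

Comparing the modinfo -n result with the list shows which of the copies is actually picked on that host.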
@vegarnilsen Ok, that was helpful.
Can you try installing broadcom-bnxt-en-alt.x86_64 and report the observations? You would need a reboot.
@appollonius I don't think it is a driver issue (POST boot). We may only learn more from the serial output.
Are you able to boot as-is in BIOS mode without reinstall?