log_fs_usage / /var/log directory on pool master filling up constantly

MajorP93

I applied

xe host-param-set other-config:auto-scan-interval=120 uuid=<Host UUID>

on my pool master as suggested by @flakpyro and it had a direct impact on the frequency of SR.scan tasks popping up and the amount of log output!
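
For anyone who wants to try the same change, here is a minimal sketch of how it could be applied and read back from dom0. The function name `set_scan_interval` is made up for illustration. It assumes `xe` is on the PATH and that the setting belongs on the pool master, as described above.

```shell
# Sketch: set auto-scan-interval (seconds) on the pool master and read it back.
# set_scan_interval is a hypothetical helper name, not part of XCP-ng.
set_scan_interval() {
    local interval="$1"
    local master
    # Look up the UUID of the pool master host
    master="$(xe pool-list params=master --minimal)" || return 1
    # Store the interval in the host's other-config map
    xe host-param-set uuid="$master" other-config:auto-scan-interval="$interval" || return 1
    # Read the key back to confirm what was stored
    xe host-param-get uuid="$master" param-name=other-config param-key=auto-scan-interval
}
```

Calling `set_scan_interval 120` should print the stored value. This only confirms the key was written, not that the scan frequency actually changed, so keep watching the logs afterwards.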

I set up Graylog and remote syslog on my XCP-ng pool after posting the first message of this thread, and in the image pasted below you can clearly see the effect of "auto-scan-interval" on the logging output.

I will keep monitoring this but it seems to improve things quite substantially!

Since it appears that multiple users are affected by this it may be a good idea to change the default value within XCP-ng and/or add this to official documentation.

Pilow

@MajorP93 said in log_fs_usage / /var/log directory on pool master filling up constantly:

will keep monitoring this but it seems to improve things quite substantially!

Since it appears that multiple users are affected by this it may be a good idea to change the default value within XCP-ng and/or add this to official documentation.

Nice, but these SR scans have a purpose (when you create or extend an SR, to discover VDIs and ISOs, ...).
As for the legitimacy of reducing the scan frequency and its impact on logs, that should be better documented, yeah.

xe host-param-set other-config:auto-scan-interval=120 uuid=<Host UUID>

I never saw this command line in the documentation; perhaps it should be there, with full warnings?

MajorP93

@Pilow correct me if I'm wrong, but I think day-to-day operations like VM start/stop, SR attach, VDI create, etc. perform explicit storage calls anyway, so they should not depend strongly on this periodic SR.scan. That is why I considered applying this change safe.

Pilow

@MajorP93 I guess so. If someone from the Vates team can tell us why the scan runs so frequently, perhaps that will enlighten us.

bvitnik

@Pilow agreed. This shouldn't be the norm. auto-scan-interval=120 is not going to be good for everyone. The majority of people probably don't have any problem with the default value, even in larger deployments.

On the other hand, the real cause of the issue is still elusive.

denis.grilli

@bvitnik Not really.
What is elusive here is whether we can safely reduce the auto-scan frequency and why it is set so frequent by default. That the auto scan causes the increase in logs is quite clear from MajorP93's test.
The auto scan logs a lot of lines for each disk, and when you have something like 400 - 500 disks and you scan them every 30 seconds, you definitely get a lot of logs.
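
To put rough numbers on that, here is a back-of-the-envelope sketch. The value of 5 log lines per disk per scan is a pure placeholder, not something measured in this thread, so only the ratio between the two results is meaningful.

```shell
# Rough estimate of log lines per day produced by periodic SR scans.
# lines_per_disk is a placeholder assumption, not a measured value.
scan_log_lines_per_day() {
    local disks="$1" interval_s="$2" lines_per_disk="$3"
    local scans_per_day=$(( 86400 / interval_s ))
    echo $(( scans_per_day * disks * lines_per_disk ))
}

scan_log_lines_per_day 500 30 5     # 30-second interval -> 7200000
scan_log_lines_per_day 500 120 5    # auto-scan-interval=120 -> 1800000
```

Whatever the real per-disk figure is, going from 30 to 120 seconds should cut the scan-related volume to about a quarter, as long as it scales linearly with the number of scans.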

I think the log partition is quite small, to be honest, but the logging is also very chatty.

bvitnik

@denis.grilli I understand... but my experience is that, even with the default scanning interval, the logs become a problem only when you get into the range of tens of SRs and thousands of disks. MajorP93's infra is quite small, so I believe there is something additional that is spamming the logs... or there is some additional trigger for SR scan.

Update: maybe the default value changed in recent versions?

MajorP93

Well, I am not entirely sure, but if the effect of SR.scan on logging is amplified by the size of the virtual disks as well (in addition to the number of virtual disks), that might be the cause. I have a few virtual machines that have a) many disks (up to 9) and b) large disks.
I know it is rather bad design to run VMs this way (in my case these are file servers). I understand that using a NAS and mounting a share is better in this case, but I had to migrate these VMs from the old environment and keep them running the way they are.
That is the only thing I can think of that could result in SR.scan having this big an impact in my pool.

Pilow

@MajorP93 throw in multiple garbage collections during snap/desnap of backups on a XOSTOR SR, and these SR scans really get in the way

MajorP93

Another thing that I noticed: despite enabling remote syslog (to Graylog) for all XCP-ng hosts in the pool, /var/log still fills up to 100%.
Adding remote syslog does not seem to change the usage of /var/log at all.

Official XCP-ng documentation states otherwise here: https://docs.xcp-ng.org/installation/install-xcp-ng/#installation-on-usb-drives

The linked part of the documentation indicates that configuring remote syslog can be a possible solution for /var/log space constraints, which does not seem to be the case.
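
To check what is actually taking up the space locally, something like the following sketch could be run in dom0. The helper name `largest_logs` is made up for illustration and it only uses standard `find`, `du`, `sort` and `tail`.

```shell
# List the N largest files under a log directory (sizes in KiB, biggest last).
# largest_logs is a hypothetical helper name.
largest_logs() {
    local dir="${1:-/var/log}" count="${2:-10}"
    # -xdev stays on one filesystem; -type f limits output to regular files
    find "$dir" -xdev -type f -exec du -k {} + 2>/dev/null | sort -n | tail -n "$count"
}
```

Running `df -h /var/log` followed by `largest_logs /var/log 10` before and after enabling remote syslog would show whether the local files shrink at all.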

I feel like logging could use some investigation by Vates in general.

gumbo2k

Q1: Can I disable the auto-scan completely for a single SR?
That particular SR only contains block devices (disks) that I pass through directly to VMs. Those disks are > 2 TB and I'd love to put them in standby when they are not in use. One of them is only used as a backup target within that VM, and I've created a service that runs hdparm -S 150 /dev/sdc in Dom0. Putting the disk in standby works, but the rescan wakes up that disk.

Q2: If I can't disable rescans for individual SRs, can I set the rescan interval to 86400 (once per day) on the whole host? Is there any negative effect in a homelab setting?

Q3: Does a manual "Rescan" on the SR trigger the same cleanup jobs (e.g. coalesce of snapshots) that the periodic scan does?

denis.grilli

@gumbo2k I am not an expert, so I cannot give you a definite answer, but I would not disable the SR scan completely. I feel like that could cause problems at some point.

Anyway, I don't think touching the scan is necessary anymore.

Vates has released many new updates for both XOA and XCP-ng, and now the scan is very fast.

If you haven't done so already, I would suggest upgrading your XCP-ng and re-testing, because now everything runs smoothly and fast.

gumbo2k

@denis.grilli The problem is not the performance of the scan ... the problem is that this storage device only consists of block devices (disks) that should go into standby mode when not used ... but I think I've found a code line that checks if other-config for an SR contains auto-scan: false... I think ...
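
If that reading of the code is right, the key could be set per SR roughly as in the sketch below. This is unverified: the key name `auto-scan` and the value `false` come only from the code line mentioned above, and whether it is honoured or supported has not been confirmed in this thread. The helper name `disable_sr_autoscan` is made up for illustration.

```shell
# UNVERIFIED sketch: set other-config:auto-scan=false on a single SR.
# Whether the periodic scan honours this key is an assumption.
disable_sr_autoscan() {
    local sr_uuid="$1"
    xe sr-param-set uuid="$sr_uuid" other-config:auto-scan=false || return 1
    # Print the SR's other-config map so the stored value can be inspected
    xe sr-param-get uuid="$sr_uuid" param-name=other-config
}
```

It would be called as `disable_sr_autoscan <SR UUID>`, ideally on a test SR first.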

poddingue

The sr.scan-driven SMlog growth angle that gumbo2k surfaced is a real lead; there's some context in the storage-related log files reference, but the docs don't go as far as "here's how to throttle it safely on a pool where the underlying disks should spin down."

Soft ping to @Team-Storage and @Team-Hypervisor-Kernel: could one of you weigh in on whether other-config:auto-scan=false on the SR is the supported way to reduce scan pressure, or if there's a better lever? I don't want to send anyone down a path that breaks an SR. Apologies if this has already been answered somewhere I haven't seen.