Backups (Config & VMs) Fail Following Updates
-
I am having an issue whereby any time I update XO my backups will start to fail for one of a few reasons. They generally fall into one of the following:
- ENOENT: no such file or directory
- Lock file is already being held
- EEXIST: file already exists
The above, other than the lock file error, all reference /run/xo-server/mounts followed by the relevant UUID paths. The lock file error reports: the writer IncrementalRemoteWriter has failed the step writer.beforeBackup() with error Lock file is already being held.
I am running XO in a Docker container via Docker Compose, but the run directory is not mounted as a volume. My backups are performed to an NFS remote.
Is there a procedure to rectify this, i.e. is a path being cached (one that needs clearing) and the backups are referencing it, even though it's no longer present or has been modified after an update? Is there a procedure I should follow pre/post-update to account for this? I appreciate I may be running XO in a non-standard way and this may simply be a quirk of that, but it generally runs fine until I update. If I re-run the failed backup it does tend to succeed, but it will then fail again on the next scheduled run.
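For context, my Compose file looks roughly like the sketch below. The image name, port mapping and volume paths are illustrative rather than my exact config, so treat it as an approximation; the point is that XO's data directories are persisted, but /run/xo-server/mounts is not.

```yaml
services:
  xen-orchestra:
    # Image name/tag is what I believe the Ronivay build is published as;
    # check Docker Hub for the exact name.
    image: ronivay/xen-orchestra:latest
    restart: unless-stopped
    ports:
      - "8080:80"
    # Extra privileges so the container can mount NFS/SMB remotes itself;
    # the exact flags the image needs are listed in its README.
    cap_add:
      - SYS_ADMIN
    security_opt:
      - apparmor:unconfined
    volumes:
      # XO's own data and Redis are persisted across recreations...
      - xo-data:/var/lib/xo-server
      - xo-redis:/var/lib/redis
      # ...but /run/xo-server/mounts is NOT a volume, so the remote
      # mountpoints only exist inside the running container.

volumes:
  xo-data:
  xo-redis:
```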
Also, to add: my backup retention is set to 3 but I am seeing 6 restore points. It seems each entry is duplicated (same size, date and time), both the key and the differencing entries. I have verified on my remote that there are actually only 3.
-
@DustyArmstrong Hi Dusty,
Same error here, and I'm running XO in Docker too. I just updated the Docker container image and found that I had two backup jobs running to the same remote. I changed it to one job, and set the concurrency (under Settings, Advanced) to 2. I have not had that error since.
Hope it helps.
Mark
-
@magran17 Hey Mark, thanks for the info.
I've also made those changes, with concurrency set to 2. I've rebuilt my backup from scratch, so I'm hoping it goes OK from here! My issue with the double restore points was my own fault: I had an SMB and an NFS remote running together (I had switched to NFS as it's much quicker).
I'm not sure what will happen when I update the image again. Out of interest, which Docker image do you run? Ronivay or Ezka77?
-
Update: this seems to happen every time I reboot the server or, in particular, update XO. I get the same 3 errors and have to rebuild my backup schedules from scratch each time. Once rebuilt they run perfectly until the next time I update. It may be because I run it in Docker, I'm not sure, but I'd love to understand what causes this and if there's any way to rectify without the rebuild. I don't really understand it and would appreciate any insight.
I get the following 3 problems every time.
EEXIST - this happens on my configuration backups.
Error: EEXIST: file already exists, open '/run/xo-server/mounts/f5bb7b65-ddea-496b-b193-878f19ba137c/xo-config-backups/d166d7fa-5101-4aff-9e9d-11fb58ec1694/20240819T140003Z/data.json'
ENOENT - this also happens on my configuration backups, on the same job.
Error: ENOENT: no such file or directory, rmdir '/run/xo-server/mounts/f5bb7b65-ddea-496b-b193-878f19ba137c/xo-pool-metadata-backups/d166d7fa-5101-4aff-9e9d-11fb58ec1694/ff3e6fa0-6552-e96a-989c-fc8db748d984/20240729T140002Z'
LOCKFILE HELD - this happens on my VM incremental backups. This log is from a prior run a while ago, but I expect my next run will do the same, as I've rebooted.
>> the writer IncrementalRemoteWriter has failed the step writer.beforeBackup() with error Lock file is already being held. It won't be used anymore in this job execution.
>> Retry the VM backup due to an error: the writer IncrementalRemoteWriter has failed the step writer.beforeBackup() with error Lock file is already being held. It won't be used anymore in this job execution.
>> Start: 2024-06-29 01:01
>> End: 2024-06-29 01:41
>> Duration: 41 minutes
>> Error: Lock file is already being held
I only have one schedule for config and one schedule for VMs. The files for the config backup don't change, I don't reboot or anything mid-backup, but it seems to totally break the chain. For the VMs, I only have one backup schedule so there should never be another job running which has the lockfile held. Something about restarting the container causes an issue - it feels like something is being cached here but the cache isn't flushed on restart so it leaves some sort of zombified file(s) behind.
-
@DustyArmstrong
Hi Dusty, I set mine to a concurrency of 1, then set the config backup to run half an hour before the VM backup, and have had no further issues.
Just a guess, but space the start times out so the jobs never overlap, and set concurrency to 1.
I use the Ronivay Docker image, with stock defaults except for allowing NFS mounts. I use Portainer to manage Docker, and can send you my template file (same as a Docker Compose file).
All the best,
Mark
-
@magran17 thanks Mark.
My config backup runs on a Tuesday and my VMs on Friday night, so the latter ran last night. It did fail at first with the lockfile error, as expected, but was then successful on the retry. My concurrency is currently set to 2; I did have it on 1 originally, but it doesn't seem to make a difference.
I use Ronivay's image too. It seems to work, but yeah, it's just these three random errors that I can only get rid of by blowing away all my backups/schedules and starting the chain(s) again.
I'm not really sure why it happens; I can only assume rebooting/updating breaks some sort of cache in the way I have it set up. I am running it in a very unintended way (a Raspberry Pi 4 on ARM64, using binfmt emulation of x86), so I can't really expect perfection. It's slightly slow, but it works super well other than this!
-
An update, if anyone ever comes across this via a search engine.
It turns out it was my container's timezone. The image defaults to plain UTC with no timezone set, so I believe that when it wrote files to my network storage it introduced a discrepancy: my network share was recording the file metadata in real local time, so when it came time to do another backup, the file times XO expected were different, making it think they were "stale" or still being "held".
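If it helps anyone, the fix on my side was simply giving the container a timezone, e.g. via a TZ environment variable in the Compose file. The snippet below is only a sketch: Europe/London is a placeholder for your own zone, and I'm assuming the image honours the standard TZ variable (if it doesn't, bind-mounting the host's localtime is a common alternative).

```yaml
services:
  xen-orchestra:
    image: ronivay/xen-orchestra:latest
    environment:
      # Placeholder zone - set your actual timezone so timestamps written
      # to the NFS remote line up with what XO expects on the next run.
      - TZ=Europe/London
    # Alternative: share the host's timezone data read-only.
    # volumes:
    #   - /etc/localtime:/etc/localtime:ro
```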
I have now run both the scheduled metadata and VM backups without any errors.
In summary: make sure your time, date and timezone are set correctly!