Parent VHD Missing Errors During SMB Backup
-
I've been investigating this for some time without finding a solution, so I'm hoping someone can point me in the right direction or tell me whether they've seen the same thing. I'm also trying to reproduce the issue in my lab, but so far I haven't managed to.
In a production setup I have quite a few VMs that back up nightly to a very fast TrueNAS CORE server. The backups generally work well, but every once in a while a VM backup reports as failed with the errors below. It's almost always a single VM, and after 3 or 4 additional backups the error goes away (despite retention being 7 and the full backup interval being 30 days). Also, if I wipe that VM's directory from the TrueNAS box, its next backup succeeds.
- VHD Check Error
- Parent VHD is Missing
- Under the remote logs: VHD Check Error
- EBUSY: resource busy or locked, unlink (VHD path)
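For anyone unfamiliar with the last error: EBUSY on unlink means the delete was refused because something still had the file open or locked at that moment. If the lock is short-lived, retrying after a pause can be enough. The sketch below only illustrates that idea in Python; it is not how Xen Orchestra handles the error, and `unlink_with_retry` and its parameters are made-up names.

```python
import errno
import os
import time


def unlink_with_retry(path, attempts=5, delay=1.0,
                      _unlink=os.unlink, _sleep=time.sleep):
    """Delete `path`, retrying only when the OS reports EBUSY.

    Returns the number of the attempt that succeeded. Any other error,
    or EBUSY on the final attempt, is re-raised to the caller.
    `_unlink` and `_sleep` exist only so the function can be tested
    without touching a real share.
    """
    for attempt in range(1, attempts + 1):
        try:
            _unlink(path)
            return attempt
        except OSError as exc:
            if exc.errno != errno.EBUSY or attempt == attempts:
                raise
            # Wait a little longer after each failed attempt.
            _sleep(delay * attempt)
```

A retry like this only hides a transient lock. It would not help if the file stays locked for hours, which is why the timing of the TrueNAS job matters.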
It's worth noting that these errors always seem to come up while the TrueNAS machine is backing up its directory to a cloud provider. That would make sense if TrueNAS were working with the VHD that XCP-ng was trying to access. However, TrueNAS is set up to snapshot first, and my understanding is that TrueNAS then ONLY touches the snapshot during the backup process, so the file shouldn't be locked. I may be wrong, but long ago I did NOT have TrueNAS set to snapshot before cloud backups and I got this same EBUSY error ALL the time; the issue (mostly) went away when I enabled "snapshot first".
For reference, this Reddit post talks about the "snapshot first" feature: https://www.reddit.com/r/freenas/comments/gpz701/clarity_on_take_snapshot_for_cloud_sync_tasks/
In short, it appears TrueNAS should be snapshotting the directory, then backing up that snapshot, then removing it, so that "live" data isn't affected or written to during the backup.
Also, my TrueNAS machine starts its backups BEFORE XO does, so the snapshot shouldn't be taken at the same time XO tries to access the directory. The backup of this directory usually takes several hours, so the snapshot isn't being deleted while XO backs up either.
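One way to double-check that timing assumption against the logs is to test whether the snapshot creation or deletion falls inside the XO backup window. This is a rough sketch: the function name is made up and all the times are hypothetical placeholders, not values from my setup.

```python
from datetime import datetime


def events_inside_window(window_start, window_end, events):
    """Return the names of events whose timestamp lies in the window (inclusive)."""
    return sorted(
        name for name, moment in events.items()
        if window_start <= moment <= window_end
    )


# Hypothetical times, for illustration only.
xo_start = datetime(2024, 1, 1, 1, 0)
xo_end = datetime(2024, 1, 1, 3, 0)
truenas_events = {
    "snapshot created": datetime(2024, 1, 1, 0, 0),
    "snapshot deleted": datetime(2024, 1, 1, 5, 0),
}

print(events_inside_window(xo_start, xo_end, truenas_events))  # prints []
```

An empty list means neither snapshot event happened while XO was running, which is the situation described above.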
It's entirely possible this is more of a TrueNAS issue than an XCP-ng/XO thing, but wanted to post about it.
Anyone else seen this with large SMB VM backups?
I'll keep trying to replicate in my lab too and report back if I can duplicate the issue.
This isn't urgent (which is why I'm just posting and not filing a ticket haha) since I have the same VMs backed up directly to a cloud provider, so it isn't a data resilience issue.
-
@planedrop Another interesting note: my backup list for this VM doesn't show the key backups anymore, but TrueNAS definitely still has them.
The VHD file that was locked or busy DOES exist in the TrueNAS directory though.
I have tried force restarting these backups, but the same error usually happens even at times when TrueNAS isn't snapshotting or backing up.
-
@planedrop
I usually don't use SMB for remotes. I prefer NFS for stability reasons.
But in the past, on customer systems, we occasionally had the problem that the SMB client and server processes stalled for a moment (both sides showed it, so they may have been stalling each other), which can cause file locks not to be set or released correctly. Maybe it's something like this in your case too.
Catching that is a bit of a pain in the ***, because you need to do process memory tracking on both sides to see if and when they stall for a short moment. I'm not sure whether there is a way under TrueNAS to check if a file has a lock set. If there is, it might already be enough to remove the lock on the file (if that's possible via the CLI) to make it visible to the XCP-ng host again.
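On checking for a lock: TrueNAS serves SMB through Samba, and Samba ships `smbstatus`, whose `-L` option lists the files it currently holds locks on. A minimal sketch of filtering that output for one file could look like the following. It assumes `smbstatus` is on the PATH and is run with enough privileges (usually root), and that each lock is printed on a line containing the file name. The function names are made up for illustration.

```python
import subprocess


def filter_lock_lines(smbstatus_output, name_fragment):
    """Return the lines of `smbstatus -L` output that mention name_fragment."""
    return [
        line.strip()
        for line in smbstatus_output.splitlines()
        if name_fragment in line
    ]


def locks_for(name_fragment):
    """Run `smbstatus -L` and keep only the lines mentioning name_fragment."""
    result = subprocess.run(
        ["smbstatus", "-L"], capture_output=True, text=True, check=True
    )
    return filter_lock_lines(result.stdout, name_fragment)
```

Calling `locks_for("name-of-the-busy-file.vhd")` on the TrueNAS box right after a failed backup should show whether Samba still thinks a client has that VHD open. If I remember the column layout correctly, the first column of each lock line is the PID of the process holding it, but I haven't verified that on TrueNAS CORE.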