planedrop

@jasonnix I've done extensive testing with this myself. First and foremost, Veeam is the one that would have to support it, not vice versa.

Second, it would be best to use XO for the backups. It's much more fluid and is fully integrated. I've been doing this for some time and it's been excellent in multiple production setups.

I have also tested using Veeam via agents within the VMs themselves (this was just for test purposes, I'd still not really recommend it) and it worked exactly as expected.

Using XO for this is still better though: it's generally faster, easier to set up, more reliable, and recovering from backups is much faster and easier.

If you are considering this as a comparison to VMware, it's worth noting that it's not really a positive thing that VMware requires you to buy a separate product entirely in order to handle backups.

planedrop

Wanted to post a quick update, it's been over a week now and the backups have been 100% successful.

Figured as much, but thought it was worth at least coming back here and confirming.

planedrop

I can confirm this is the case for me too, not a huge deal, but would be kinda nice if it could keep track of the name.

planedrop

@olivierlambert Yup, I've had exactly that a few times, usually on used boards.

@R2rho if possible, however annoying, I would also take the CPU out and use a flashlight to check for bent pins on the motherboard.

planedrop

IIRC the templates help define some of the UEFI specs and things like that. Generally speaking though, using a template similar to what you're deploying, even if it's not the same version (e.g. the Ubuntu 20.04 template for Ubuntu 23.10), should be functional. At least in my experience this has never created an issue.

planedrop

@olivierlambert I will give this a shot and report back. It may be a day or so, one of the backups is still running (very large VM over S3 so takes a while) but once it's done I will go back and see if the failures go away.

planedrop

This is a new one, just updated XOA to 5.107.2 and now my backups are no longer working.

I have support and can put in a ticket, but figured it's better to try here first.

On the backups to Backblaze I am getting the error "Fail to connect to any Nbd client", and on my SMB backups I just get a "Footer1 !== footer2" error.

What's important here is that it's only about half my VMs, and this is a single host setup, so the NBD client issues don't really make sense to me, unless I'm misunderstanding something about NBD.

Anyone else seeing issues with backups after this update?

Also not seeing any consistent pattern. It's not specific to Windows VMs, for example; it seems random.

planedrop

@olivierlambert issue added, it's my first time doing it on GH so apologies if I missed anything or put it in the wrong category.

https://github.com/vatesfr/xen-orchestra/issues/7893

planedrop created this issue in vatesfr/xen-orchestra: "Add Email Notification For License Issues To Prevent Backup Failure" #7893 (closed)

planedrop

This may be something for me to put a ticket in for, but I wanted to try and post here and do it publicly first since it could benefit others.

One of the environments I am managing has consistent backup failures and I haven't been able to get to the root cause of them. This post will probably be long with lots of details. The short of it is that I think it's only happening to large VMs, but I can't figure out why. The majority fail on "clean VM directory" and show missing VHDs or missing parent VHDs.

To start, this setup has 2 backups that run for all VMs on a nightly basis, one is uploaded to Backblaze and another is sent over SMB to a TrueNAS machine.

I have a similar setup in my lab at home, and it's not failed once, never ever. But all my VMs there are under 100GB, while this other environment has some that are more than 2TB, which is why I am starting to think that is the root cause.

XOA version is at 5.93.1, so not 100% up to date (will update shortly), but this has been an ongoing issue for months now so I don't think it's a version specific thing.

Backup Schedules

First I wanted to explain my schedules in detail, then I will go into the errors we are seeing.

Both schedules back up the same number of VMs, 2 of which are slightly over 2TB in size (several VHDs).

Backblaze Backup

This one is set up to run every night
Concurrency of 2
Timeout of 72 hours (since they are large I set the timeout very big, but usually this finishes within a few hours, sometimes taking like 10)
Full Backup Interval is 15
NBD is enabled and set as 4 per disk
Speed is limited to 500MiB/s (this is never hit though)
Snapshot is normal
Schedule is set to run every weekday at 5PM with a retention of 14 and force full backup disabled
Worth noting the B2 bucket settings:
- Lifecycle is set to keep only the last version of the file (plan is to adjust this more later)
- Object lock is enabled but no default set, so nothing should be getting locked

SMB NAS Backup

Concurrency of 1
Full Backup Interval of 30
NBD is disabled, number of connections is 1
Snapshot mode is normal
Schedule is set to run every weekday at 8PM with a retention of 7
This NAS does do backups of this VM directory (an additional backup I run) but those start at 7PM and I have it set to snapshot the dataset before backing it up, so in theory anything XCP-ng is touching shouldn't be messed with
- I've been able to confirm TrueNAS's "snapshot first" feature (which runs before the backup starts) takes a snapshot, backs up the data of that snapshot, then deletes the snapshot. This whole thing is to prevent file locking on a directory that has other things accessing it

I know the backup retention periods etc. are a bit odd here. If we think that could be causing an issue I'm happy to adjust them; I was planning on reworking retention sometime soon anyway. But as far as I can tell it shouldn't cause a major problem.

The Errors

Backblaze

Several VMs, including smaller ones are seeing this issue, which maybe means my thoughts about this being a large VM specific issue are wrong?
It always happens during the clean VM directory process
Last log I have is 3 VMs with the below:
- UUID is Duplicated
- Orphan Merge State
- Parent VHD is missing (several times for each VM)
- Unexpected number of entries in backup cache
- Some VHDs linked to the backup are missing
On all of these the Backblaze "transfer" section of the logs is green and successful, but the clean VM directory step is not, so it seems the merge is failing
Retrying VMs will sometimes work but other times will just fail again
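
To make the "Parent VHD is missing" symptom a bit more concrete, here is a minimal Python sketch of the kind of check involved. It is only an illustration: the `parent_of` mapping and the file names are hypothetical, it does not read real VHD headers, and it is not how XO's clean VM directory step is actually implemented.

```python
from typing import Dict, List, Optional


def missing_parents(parent_of: Dict[str, Optional[str]]) -> List[str]:
    """Return the VHDs whose recorded parent is not in the directory listing.

    `parent_of` maps each VHD file name found in a backup directory to the
    name of its parent, or None for a key (full) disk. Hypothetical input:
    in practice this information would come from the VHD headers.
    """
    present = set(parent_of)
    return sorted(
        child
        for child, parent in parent_of.items()
        if parent is not None and parent not in present
    )


# Hypothetical directory: c.vhd points at b.vhd, which is no longer present.
listing = {
    "a.vhd": None,      # key disk, no parent
    "c.vhd": "b.vhd",   # incremental whose parent is gone
    "d.vhd": "c.vhd",   # incremental with an existing parent
}
print(missing_parents(listing))
```

Note that `d.vhd` is not reported even though its chain is broken further up; a fuller check would walk each chain all the way back to a key disk.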

SMB

Only seems to happen with big VMs, they will work fine for a while (several weeks) then start erroring out
The only fix I've found is to wipe the entire VMs directory on the NAS so the backup starts fresh
The error is always parent VHD is missing (with a path to a VHD that as far as I can tell exists)
Then followed by an "EBUSY: resource busy or locked, unlink (vhd path)"
It's always a VHD that starts with a period, so ".2024**********.vhd"
Checking the NAS via shell and the file definitely exists and has the same permissions on it as everything else in the directory
Now another super interesting thing is, if I go to the VM Restore page, select the one that failed SMB, it will show no original key backup like so (top/most recent to bottom):
- Incremental
- Incremental
- Incremental
- Incremental
- Key
- Incremental
- Incremental

So as you can see, no original Key for the last 2 incrementals
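
For anyone wanting to spot this pattern quickly, a small Python sketch under stated assumptions: the input is just the list of entry types as shown on the restore page, newest first, and the function name is made up for illustration. It flags incrementals that have no older key backup to build on.

```python
from typing import List


def orphan_incrementals(entries_newest_first: List[str]) -> List[int]:
    """Return positions (in the newest-first list) of incrementals
    that have no key backup anywhere older than them."""
    orphans = []
    seen_key = False
    # Walk from the oldest entry (end of the list) to the newest.
    for idx in range(len(entries_newest_first) - 1, -1, -1):
        kind = entries_newest_first[idx].strip().lower()
        if kind == "key":
            seen_key = True
        elif kind == "incremental" and not seen_key:
            orphans.append(idx)
    return sorted(orphans)


# The restore list from above, top (most recent) to bottom (oldest).
restore_list = [
    "Incremental",
    "Incremental",
    "Incremental",
    "Incremental",
    "Key",
    "Incremental",
    "Incremental",
]
print(orphan_incrementals(restore_list))  # [5, 6]: the two oldest entries
```

This only looks at ordering. It says nothing about whether the files behind each entry are actually intact.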

Any ideas as to what could be causing this? I'm thinking they might be 2 entirely separate issues, it's just odd that they're both happening.

I will do what I can to troubleshoot this directly as well and update this post with anything else I find.

planedrop

I can confirm NFS is great on XCP-ng and would definitely encourage you to go that direction. TBH FC and iSCSI are a tad outdated. There are still good use cases for them but NFS is the thing I'd always aim for in this setup.

And like @Danp said, if it's thin provisioned, then no it won't be using double the space.

planedrop

@ravenet Yeah another night of successful backups so I think going back to Stable did fix the issue. 2 for 2 on that now.

planedrop

@ravenet All of my errors seemed related to NBD access, so if the concurrency setting was being ignored, that might be the source of the issue I was seeing.

I'll watch my lab as well and see if the concurrency is being respected or not on the latest from the sources build.

Glad to see you were on 8.3, so not related to me being on 8.2.

planedrop

@olivierlambert Gotcha. I'll see if I can get this issue to replicate in my lab at all but so far my backups have been smooth over there.

I'll try to re-create more similar backup jobs in the lab as well, maybe it's a specific setting or something on my jobs.

planedrop

@olivierlambert Happy to help in any way that I can as well!

Notably, I am not seeing any issues doing backups to SMB or S3 with my lab at home which is on the latest. My lab is XCP-ng 8.3 though, rather than 8.2 like this production setup (which will be getting upgraded to 8.3 now that it's LTS), so maybe something specific with the new backup code and 8.2?

planedrop

@ravenet @olivierlambert yeah going back to 5.106 seems to have resolved the issue. I want to give it one more day before saying 100% that it did, but all VMs in both my backup jobs last night finished properly.

planedrop

@olivierlambert Good question, I am on Latest by mistake in this environment actually.

Is it safe to roll back to Stable channel even though I am already on latest?

planedrop

Still seeing this issue. I'm trying to pinpoint it but haven't had any luck. It seems like each VM has about a 50/50 chance of failing or succeeding, but the logs don't really lead me to anything and there's no consistent reason that I can find for why it would be happening.

This is only since going to 5.107.2 as well, wasn't happening on the previous version (which I unfortunately don't recall the version number of).

planedrop

@olivierlambert This is great, thanks for letting us know! I'll give this a shot in my lab as soon as I can.