Backup automatic retry

Andrew

@olivierlambert @julien-f
Will you please add an option to automatically retry backups for individual VMs when they fail? It should be at least a checkbox on the whole job or a number of retries (0 means don't retry).

This is important for continuous replication and off-site (e.g. S3) delta backups that may have intermittent problems due to WAN or service provider issues. Or the VM may just be busy at the time (migrating or something).

I don't want the whole job repeated, just a retry of the individual VMs that failed to complete correctly. They should be requeued at the end as part of the same job. You don't want to retry a failed VM right away, as the order of the VMs may be causing the issue; the job should continue with the next VM before trying the failed ones again at the end.
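A minimal sketch of the requeue-at-the-end behaviour described above. This is not XO code: `run_job`, `backup_vm` and `max_retries` are hypothetical names used only for illustration, and `backup_vm` is assumed to raise an exception when a VM backup fails.

```python
from collections import deque
from typing import Callable, Dict, Iterable, List


def run_job(vm_ids: Iterable[str],
            backup_vm: Callable[[str], None],
            max_retries: int = 0) -> Dict[str, List[str]]:
    """Back up each VM once, then requeue failures at the end of the same job.

    max_retries=0 means failed VMs are not retried.
    """
    # Each queue entry is (VM id, number of retries already used).
    queue = deque((vm_id, 0) for vm_id in vm_ids)
    succeeded: List[str] = []
    failed: List[str] = []
    while queue:
        vm_id, retries = queue.popleft()
        try:
            backup_vm(vm_id)  # hypothetical callable, raises on failure
        except Exception:
            if retries < max_retries:
                # Continue with the next VM first; retry this one later.
                queue.append((vm_id, retries + 1))
            else:
                failed.append(vm_id)
        else:
            succeeded.append(vm_id)
    return {"succeeded": succeeded, "failed": failed}
```

With `max_retries=1`, a job over VMs a, b, c where b fails once would run a, b, c and then b again, all within the same job run.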

Also, since backup jobs can take a while to run, it's nice that XO does not start the same job again while it's already running. However, I see this as a skipped job rather than a failure. It would be good if the skipped job was not marked as failed (because it's not) but just as skipped.
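As a rough illustration of reporting "skipped" separately from "failed", here is a small sketch. `RunStatus` and `Job` are hypothetical names, not XO's actual job model; the only assumption is that a run which finds the previous run still in progress returns its own status instead of an error.

```python
import threading
from enum import Enum
from typing import Callable


class RunStatus(Enum):
    SUCCESS = "success"
    FAILURE = "failure"
    SKIPPED = "skipped"


class Job:
    def __init__(self, task: Callable[[], None]) -> None:
        self._task = task
        self._lock = threading.Lock()

    def run(self) -> RunStatus:
        # Non-blocking acquire: if a previous run still holds the lock,
        # report the run as skipped instead of failed.
        if not self._lock.acquire(blocking=False):
            return RunStatus.SKIPPED
        try:
            self._task()
            return RunStatus.SUCCESS
        except Exception:
            return RunStatus.FAILURE
        finally:
            self._lock.release()
```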

Thanks.

olivierlambert

This request is familiar to me. Maybe we "just" allowed resuming failed VMs with a button, but not the option to retry automatically. Do you remember, @julien-f?

gsrfan01

There's an old issue on GitHub referencing this: https://github.com/vatesfr/xen-orchestra/issues/2139

I have some intermittent issues where the backups will fail for a few VMs in a job but work fine on the retry. It would be nice to be able to set a retry count for the backup job. This way, if a couple of VMs run into an issue, they get auto-retried for backup consistency. Emails could then be sent only if a VM still fails after the configured number of retries.

Linked issue (created by julien-f in vatesfr/xen-orchestra, closed): [Backup | Job] Configurable number of retries #2139

olivierlambert

Adding @florent in the loop

florent

@olivierlambert I am not sure it would be trivial to restart a delta VM backup: since the job did not really terminate, we did not go through the clean VM phase at the end.

Restarting the job may lead to broken VHDs being used as if they were fine.
A snapshot and backup that are out of sync will lead to more full backups than needed and more pressure on the backup.
Restarting the full VM backup when only one VDI (or part of a VDI) failed seems suboptimal.

I would rather harden each backup phase and retry them individually if needed than try to restart the whole VM.

@gsrfan01 can you give us more detail on the phase that fails? clean / snapshot / transfer / merge?
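The per-phase retry idea could be sketched roughly as follows. The phase names come from the question above; `run_phase`, `backup_vm`, `attempts` and `delay_s` are hypothetical, and the sketch assumes each phase is safe to repeat on its own, which the thread does not establish for every phase.

```python
import time
from typing import Callable, Dict, Sequence, Tuple


def run_phase(name: str, action: Callable[[], None], attempts: int = 3,
              delay_s: float = 0.0) -> int:
    """Run one phase, retrying only that phase. Returns the attempts used."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            action()
            return attempt
        except Exception as exc:
            if attempt == attempts:
                raise RuntimeError(
                    f"phase {name!r} failed after {attempts} attempts") from exc
            time.sleep(delay_s * attempt)  # linear back-off before retrying
    raise AssertionError("unreachable")


def backup_vm(phases: Sequence[Tuple[str, Callable[[], None]]],
              attempts: int = 3) -> Dict[str, int]:
    """Run phases in order; a phase that keeps failing aborts the VM backup."""
    return {name: run_phase(name, action, attempts) for name, action in phases}
```

Here a transfer that fails twice and then succeeds does not re-run the snapshot phase; only the transfer itself is repeated.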

gsrfan01

@florent Looks like they're all during the transfer phase. Usually it's this error:

EINVAL: invalid argument, open '/run/xo-proxy/mounts/fbd663b9-57f1-4610-ae32-3f1d69fea68d/xo-vm-backups/f92970cd-523f-3915-7254-5f0b4a37d713/vdis/1d5ddf64-2165-4d26-b05c-330546b48ee0/69786d14-81ea-446c-8870-e43dfedae766/20220622T050722Z.vhd'

I did have an HTTP timeout last night for 8 / 12 VMs, but 4 backed up just fine, which seems odd since the VM was up the whole time.

This is backing up over a proxy to a TrueNAS system local to the proxy, but my home backups, which also run over a proxy to an unRAID system, are running fine.

I could be off the mark here, loving XCP-NG and XO so far though!

Andrew

@florent @olivierlambert Looking at failed S3 backup problems, the retry should be easy. Currently the job report has a button to restart all failed VM backups in that job. Can you just add an option to automatically restart failed VM backups after job completion? It can even be a separate job (just as clicking the button is). It would just "click" the restart failed VMs button once (or a configurable number of times) after a few minutes (or a configurable delay). Or it could start the new job as pending or delayed and start it after a few minutes, allowing time to cancel it.
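A small sketch of the delayed, cancellable restart described above, using Python's standard `threading.Timer`. `schedule_retry` and `restart_failed` are hypothetical names; `restart_failed` stands in for whatever the existing "restart failed VMs" button triggers, which is not shown in this thread.

```python
import threading
from typing import Callable, List, Optional, Sequence


def schedule_retry(failed_vm_ids: Sequence[str],
                   restart_failed: Callable[[List[str]], None],
                   delay_s: float = 300.0) -> Optional[threading.Timer]:
    """Schedule one automatic restart of the failed VMs after delay_s seconds.

    Returns the pending timer so it can be cancelled, or None if nothing failed.
    """
    if not failed_vm_ids:
        return None
    timer = threading.Timer(delay_s, restart_failed, args=(list(failed_vm_ids),))
    timer.daemon = True  # a pending retry should not keep the process alive
    timer.start()
    return timer
```

Calling `cancel()` on the returned timer before the delay expires drops the pending retry, which matches the "start as pending, with time to cancel" variant.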