Overlapping backup schedules - healthcheck vms lead to "UUID_INVALID"
-
Hello,
I have two backup jobs that I offset to prevent them from running at the same time, but sometimes they take longer than usual and end up overlapping, which causes problems when using "smart mode" to match the VMs to back up by their tags.
I've noticed that sometimes, if the health-check VM from backup job A is being "restored" while the other job is running, I will get the
UUID_INVALID
error for a single VM that doesn't exist. I suspect that backup job B is attempting to back up the healthcheck VM because it has matching tags, but the healthcheck VM is deleted after the check is complete, which triggers the error I'm seeing. Obviously, I could make efforts to avoid the two backup jobs running at the same time, but I'm hoping that there may be some sort of tag applied to a healthcheck VM indicating that it is being used for a health check, which would allow me to configure "smart mode" to exclude those VMs.
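To illustrate the hypothesis, here is a toy simulation of the suspected race (the VM names and simplified inventory are made up; only the UUID_INVALID error name comes from the actual report):

```python
# Toy simulation of the suspected race: job B snapshots the VM list
# while a health-check VM exists, then the VM is deleted before job B
# reaches it, so the later lookup by UUID fails.

def run_backup(inventory, vm_uuids):
    """Back up each VM from the snapshot list; report per-VM results."""
    results = {}
    for uuid in vm_uuids:
        if uuid not in inventory:           # VM vanished after the snapshot
            results[uuid] = "UUID_INVALID"
        else:
            results[uuid] = "ok"
    return results

# Job A restores a temporary health-check VM carrying the same tag
# as the real VMs.
inventory = {"vm-1": ["prod"], "hc-tmp": ["prod"]}

# Job B's smart mode matches by tag and snapshots the list *now*.
snapshot = [uuid for uuid, tags in inventory.items() if "prod" in tags]

# Job A's health check completes and the temporary VM is deleted.
del inventory["hc-tmp"]

# By the time job B reaches hc-tmp, the UUID no longer resolves.
print(run_backup(inventory, snapshot))
```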
If this isn't already a feature, I would like to vote for it being added -- the tag could be something like
xo-backup-healthcheck
which would be fitting with similar tags. Any other advice or suggestions are appreciated as well.
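In other words, the exclusion I'm hoping for would amount to something like this simplified sketch of smart-mode matching (the tag name is my suggestion, not an existing XO tag, and this is a simplification of what smart mode actually does):

```python
HEALTHCHECK_TAG = "xo-backup-healthcheck"  # suggested, not an existing XO tag

def smart_mode_match(vm_tags, include_tags, exclude_tags=(HEALTHCHECK_TAG,)):
    """Simplified smart-mode check: select VMs by matching tag, but
    always skip VMs carrying the (hypothetical) health-check marker."""
    tags = set(vm_tags)
    return bool(tags & set(include_tags)) and not (tags & set(exclude_tags))

print(smart_mode_match(["prod"], ["prod"]))                   # True
print(smart_mode_match(["prod", HEALTHCHECK_TAG], ["prod"]))  # False
```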
Thanks again!
-
Hi!
Thanks for the feedback, asking @florent @MathieuRA or @julien-f about this
-
@techjeff We will try to reproduce, but that is a very detailed bug report, with a credible hypothesis and a well-defined fix
thank you
-
@florent thank you! Please let me know if you would like any more information or further assistance from me. As of yet, the scenario I described is just a theory, as I wanted to get feedback on whether it is a reasonable hypothesis before attempting to conclusively replicate it.
Also, I realized the other day that I had a typo in my remote syslog host address (for who knows how long -- apparently I don't check my logs often, which I'm calling a sign of a reliable setup), so I don't have logs beyond the backup report, which doesn't give much more information than the UUID of a VM that doesn't exist anymore.
In any case, now that my logging is fixed, if I see this happen again, I'll try to gather more details and share them.
-
@techjeff A cron or similar scheduled job could check if more than one backup is running, kill one, and keep checking for the other to finish before restarting it.
That, or even easier, in a script running for example via cron/systemd, stack them so that one runs at a given time and, when it finishes, the script goes on to the second backup job.
-
Thanks for the suggestion, @tjkreidl.
I'm not sure what commands I would run with this cron/systemd job/service.
I assume I would need to use the XO API to determine the list of running backups and then kill the second if the first is still running. The issue I see with your suggestion is that my backup log would end up with many failures, when I currently get just one, if any.
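For what it's worth, the stacking idea would boil down to something like this sketch -- in real use `run_job` would wrap an xo-cli or XO API call, but I haven't verified the actual command names, so the runner here is an injectable stand-in:

```python
def run_jobs_sequentially(job_ids, run_job):
    """Run backup jobs strictly one after another instead of on
    independent schedules, so they can never overlap.

    `run_job` is a callable that starts a job and blocks until it
    finishes; in real use it would wrap an xo-cli/XO API call (exact
    command names unverified, hence the injectable stand-in).
    """
    results = []
    for job_id in job_ids:
        results.append((job_id, run_job(job_id)))  # blocks until done
    return results

# Offline demonstration with a fake runner that records the order:
order = []
def fake_run(job_id):
    order.append(job_id)
    return "success"

results = run_jobs_sequentially(["delta-job", "cr-job"], fake_run)
print(results)  # each job reported its outcome
print(order)    # jobs ran strictly in order, never overlapping
```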
While this home-lab thing is a hobby and a platform for learning, I have a feeling that your suggestion would require me to invest time into learning how to build, and then building, a Rube Goldberg machine that I would end up depending on -- or I could let the seemingly amenable devs work on my low-hanging suggested improvement to their relatively new feature: backup health checks. I suppose I could also look into submitting a pull request.
Regardless, these backups don't hold anything critical per se; only the feeling of satisfaction I get from maintaining moderately resilient backups (I can't afford "3-2-1", but I can afford "2") and getting that sweet notification from my xcp-ng hosted internal mail server that the backup was successful. TBH, I could lose "everything" and not really lose anything, because I still have the knowledge and experience, and it would give me an excuse to practice setting things up again from scratch.
Also, solutions like adding/upgrading hardware to speed up the backups are not options at this point in life due to financial, electrical, and space limitations. As it stands, all of my hardware is 5-10+ years old and second-hand (probably 3rd-, 4th-hand or more in some cases -- several pieces were so poorly valued that they were donated to Goodwill several years ago), and I have only a single 20A 120V circuit breaker powering all of the lights and outlets in the upstairs of the apartment -- the joys of being an American millennial who graduated high school with little familial wealth just before the Great Recession and has never managed to get a degree.
The neat thing is that these computers give me computational power and learning potential while heating our apartment, instead of my running a heater, which only consumes money. I really need to move the computational heaters downstairs for more effective heating... one of these days!
TL;DR - After some consideration I don't think your suggestion fits my use case, but it did provide for a good thought experiment!
-
@techjeff Are your backups fired off manually then, or how? Are they command-line driven? If both of these are true, using crontab is easy as pie and can be mastered in 15 minutes!
-
@tjkreidl I'm using the "Backup" feature of Xen Orchestra, which was previously called Backup-ng, IIRC. A person creates a backup job that determines the type of backup, i.e. Delta, Continuous Replication, etc. (those are the old names, though the terminology is changing with XO6/XOLite); the destination "remote" for the backup; either a discrete list of VMs to back up or "smart mode", which is dynamic based on pools to in-/exclude and VM tags to in-/exclude; and lastly a schedule, which has an option to perform a health check (XO restores the backed-up VM to the SR of your choice, waits for it to boot successfully, then deletes the restored VM, since it was only temporary and not needed). The schedule displays the equivalent cron syntax, but I'm not sure whether that is implemented by cron or just displayed that way as a convenience.
AFAIK, the backup tool is a higher-level abstraction built on top of XAPI, but with additional niceties, like health checks in this particular case.
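To summarize the fields described above in one place, a rough model of such a job might look like this (the field names are mine for illustration, not XO's actual schema):

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class BackupJob:
    """Rough model of an XO backup job as described above
    (field names are illustrative, not XO's internal schema)."""
    mode: str                         # e.g. "delta" or "continuous-replication"
    remote: str                       # destination remote for the backup data
    smart_mode: bool = False          # dynamic VM selection by pool/tag
    tags_include: List[str] = field(default_factory=list)
    tags_exclude: List[str] = field(default_factory=list)
    schedule_cron: str = "0 0 * * *"  # the UI shows the equivalent cron syntax
    health_check_sr: Optional[str] = None  # SR to restore into, if enabled

# Roughly my first job: a midnight Delta backup with a health check.
job_a = BackupJob(mode="delta", remote="nfs-remote", smart_mode=True,
                  tags_include=["prod"], health_check_sr="local-sr")
print(job_a.health_check_sr is not None)  # health check is enabled
```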
My two overlapping jobs both use "smart mode" to determine the list of VMs to back up based on the tags assigned to the VMs, and they both perform health checks. The first is a Delta backup that starts at midnight and usually completes fairly quickly, but sometimes it runs past 2am, when my other backup job (Continuous Replication to the local storage of one of my xcp-ng hosts) starts.
The issue I'm encountering is that sometimes the second backup begins before the first has finished, while a healthcheck VM is in the middle of booting, which results in the second backup job including that healthcheck VM in its list of VMs to back up. Later, by the time the second job gets around to actually backing up the healthcheck VM, that VM will have been deleted (its health check complete), but the second job doesn't know that, so when it starts making XAPI calls against that healthcheck VM's UUID, XAPI responds that no VM exists with that UUID and XO reports the
UUID_INVALID
error for that particular VM in the backup. Thankfully the backup job is smart enough to recognize that only that VM failed, and it continues with the other VMs.
-
@techjeff Unfortunately, I do not have direct experience with that tool. I would hope it would have some way of independently staggering the startup times when they kick off.
If there is no other option, requesting a way to deal with this as an added feature seems like the best recourse.