Backing up the VM that is running Xen Orchestra

techjeff

Thanks again for an amazing solution that I can use at home for learning purposes despite my lack of a corporate budget for support!

I recently updated some of my backup jobs to include the VM that runs my Xen Orchestra that was built from sources.

Assuming everything would work fine (this is all at home and I'm learning, so it's not mission critical), I waited a few days to check back in on the status of my backups. Lo and behold, all of the backup jobs that included my Xen Orchestra VM were being marked as interrupted. It appears that taking the initial snapshot of the Xen Orchestra VM is breaking the backup job, which in retrospect makes sense: if you tell XAPI to take a snapshot of XO, execution pauses momentarily, and this seems to bork the backup job that XO is managing.

So, this leads me to ask: what is the recommended way to back up my XO VM?

I have briefly brainstormed the following idea and I would like to know if it makes sense or if there is a better way that I'm not aware of.

1. Create an XO job that boots up a duplicate XO VM (let's call it XO2) roughly 10 minutes before XO2's scheduled backup of only my primary XO VM (let's call it XO1).
2. As per the schedule of XO2's backup job, it will back up XO1 (I will schedule it to ensure that XO1 will not be running any backup jobs at that time).
3. Some time later (long after the backup of XO1 should be complete), shut down the XO2 VM since it is no longer needed.

Any suggestions or feedback are greatly appreciated.

Thank you.

Danp

I back up my XO VM along with all my other VMs, and I haven't encountered this issue. What type of backup are you performing?

techjeff

@Danp thanks for the quick reply.

I have a delta backup at midnight every day, with a full backup every 7 deltas. I also have a daily backup of pool metadata that was previously scheduled at midnight as well, but I decided to move it to 11:55pm as a test to see whether it is the culprit (it rarely takes longer than 1 minute to complete).

I also have three weekly continuous replication tasks, one for each ~~remote~~ SR, on Monday, Tuesday, and Wednesday respectively. I decided to create them as separate jobs because I want the VMs to be named according to the ~~remote~~ SR to which they've been replicated. In the past I had them all in the same job, which created multiple VMs with nearly identical names and made it difficult to quickly ascertain where each VM lived.

These continuous replication jobs were previously running at 2am since the daily delta backups have completed properly in the past within 2 hours, but this morning I rescheduled them to run at 3am to give another hour for the weekly full backup from the daily deltas to complete.

The weekly continuous replication and the daily delta backups contain mostly the same VMs, though I believe the daily delta also contains some that are not replicated weekly.

Andrew

@techjeff I also run backups that include my XO source VM (Debian 11) that is running the backup. It all works fine for me. I use CR hourly and delta backup to S3 every night. I also use full backups to NFS. Everything works fine, including concurrent backup types on the same VMs.

I have had issues backing up a VM and migrating it concurrently.

Do you have the Xen Guest Tools correctly installed (on all VMs)?

FYI, current XO Source master has a problem with CR for some people.

techjeff

@Andrew thanks for the information. I'll have to check the versions of the guest tools installed on the VMs that were marked as interrupted. Some of my VMs are on Debian 10 and some are on Debian 11, some are running Windows Server 2019 and others are running Windows 10. The Windows VMs are using the XCP-ng guest tools, not those provided by Citrix.

Interesting details that I didn't note before:

- It isn't always the same VMs that get interrupted,
- Sometimes Debian VMs are interrupted and sometimes Windows VMs are interrupted, and
- Manually restarting the backup jobs for those that failed has been successful, but I have been coming back in the morning to find them interrupted.

I will also look for patterns among the VMs that are interrupted, gather the Debian OS and guest tools versions as well, and report what seems interesting or noteworthy.

Do you know where I can confirm what is the latest version of guest tools available for a given guest OS?

Lastly, I did see in another post that there was an issue with CR. I will look deeper into that thread as well.

My XO is currently at commit 4bf81 which was committed "last week" as far as GitHub will tell me on my phone.

Andrew

@techjeff You don't need the latest tools, but it would be good to keep them updated. The General tab for each VM lists the agent version detected (or none).

If you have random VMs interrupted then it would seem to be more of a network problem between the XO VM and the hosts.

techjeff

@Andrew @Danp - I discovered my boneheaded oversight

I have been developing a one-shot systemd service with a timer for automatically renewing my short-lived XO certificate daily within a prescribed window of time, using my local Step CA instance's ACME provisioner. It has been a fun project, and I now have mostly automatic renewal of short-lived certificates using what is essentially a private Let's Encrypt server backed by the CA that I generated.

I just looked at the logs and realized that I completely overlooked the fact that step-renewer.service runs at midnight plus a random delay of between 0 and 5 minutes. This service gets a new certificate and then restarts xo-server.service, and that is what has been interrupting my backups.

Facepalm

With mixed feelings, I can confidently report that this issue is entirely self-induced. Thanks for the assistance narrowing down the cause!
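
For anyone hitting the same thing, the scheduling conflict can be checked with a few lines of Python. This is only a rough sketch using the times mentioned in this thread: the function name `windows_overlap`, the arbitrary date, and the 23:00 alternative are illustrative assumptions, not part of XO or systemd.

```python
from datetime import datetime, timedelta


def windows_overlap(start_a, end_a, start_b, end_b):
    """Return True if intervals [start_a, end_a) and [start_b, end_b) intersect."""
    return start_a < end_b and start_b < end_a


# The date is arbitrary; only the times of day matter here.
midnight = datetime(2023, 1, 1, 0, 0, 0)

# step-renewer.service fires at midnight plus a random delay of 0 to 5 minutes,
# then restarts xo-server.service.
restart_earliest = midnight
restart_latest = midnight + timedelta(minutes=5)

# The delta backup job starts at midnight; 2 hours is the upper bound
# mentioned in the thread for how long it has taken.
backup_start = midnight
backup_end = midnight + timedelta(hours=2)

conflict = windows_overlap(restart_earliest, restart_latest, backup_start, backup_end)
print("midnight renewal overlaps backup window:", conflict)

# Hypothetical alternative: run the renewal at 23:00 the previous evening.
alt_earliest = midnight - timedelta(hours=1)
alt_latest = alt_earliest + timedelta(minutes=5)

alt_conflict = windows_overlap(alt_earliest, alt_latest, backup_start, backup_end)
print("23:00 renewal overlaps backup window:", alt_conflict)
```

Any renewal time works as long as the restart of xo-server.service cannot land inside a window where a backup job may still be running.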

Andrew

@techjeff Thanks for the update. Reporting problems and solutions can help someone else!

techjeff

@Andrew of course -- this is how FOSS thrives. I am now back to my original configuration where I am running my CR jobs at 2am, and I have removed the tag that excluded my XO VM from the mix, because I definitely want backups of that!

To be clear, it is entirely valid to use XO to back up itself, and this makes me happy.

olivierlambert

Ah, the issue makes sense now. No worries about asking here, happy to see it works for you now.