XOA/XO from Sources S3 backup feature usage/status

JamesG

@florent I can get behind a few more options on the encryption side of S3. I also think that's a really important feature when using some sort of cloud-based storage. My concern is mainly how to deal with the keys and DR. What do I do if XO and the VM farm are destroyed? How do I rebuild the farm and recover the data?

The current S3 remote form says to input a 32-character key... But is that an actual key, or is it a passphrase used to generate a key? What pieces do I need to back up and safeguard in order to recover the data from the S3 storage? This feature isn't really documented and it doesn't seem to be fully fleshed out yet. If there's something I can do to help, I'd be glad to.

Caveats from the S3 remote encryption:

All the files of the remote except encryption.json are encrypted, which means you can only activate encryption or change the key on an empty remote.
You won't be able to get your data back if you lose the encryption key. The encryption key is saved in the XO config backup, so those config backups should be secured correctly. Be careful: if you saved the config backup on an encrypted remote, you won't be able to access it without the remote encryption key.
The size of the backup is not updated when using encryption.

planedrop

I've been testing S3 (specifically Backblaze B2 through its S3-compatible layer) for a while now, and here's what I'll say.

It's definitely reliable and a solid option for backing up small VMs. The big issue right now is that it's super slow, especially when doing delta backup merges (in some cases something as simple as a 20GB merge would take 4-6 hours). It's been pretty reliable though, and for small VMs it is very usable.

The other thing I'll note is that I've had very odd issues with non-delta (full) backups, where the backup just times out at around 43GB no matter what, even though the initial delta (when doing delta backups) finishes just fine.

We have a few VMs with multiple 2TiB VHDs, so they are very large, but it'd still be nice to be able to back them up directly. The workaround we use right now is backing up to a NAS and having that NAS back up to Backblaze. The only issue is that recovery time in a worst-case scenario would be a bit longer than just downloading the VMs straight from Backblaze.

I'm going to keep testing, but for now it just hasn't been viable for large VMs. I was seeing around 20-30 megabits per second, whereas TrueNAS can push to B2 at a full 2 gigabits per second.

But I do agree with @florent that it's been stable and reliable enough to go out of beta. I just hope we can see some speed improvements in the future.

JamesG

@planedrop Mr. Lambert will be by shortly to chastise you for having VMs with 2TiB VHDs.

I don't know if I would consider something that has occasional failures necessarily out of beta either.

I ran a couple of test backups from XO to B2 and saw 30-40Mbps... But that's the cap of that site's current upstream (I'm shopping for better bandwidth, but that's what the site has for now), so it didn't trigger any alarms for me. If that's a speed limitation in the XO implementation, that would be a potential problem as well. Have you tested this against an actual AWS S3 bucket and observed the same speed issue from XO?

I've thought about building a TrueNAS as a backup target as I know it can replicate to B2 fairly well. I just saw the XO S3 integration and thought that might be a usable option without having to add more hardware.

Thanks for your feedback planedrop!

Andrew

I have been using S3 backup for a while, and the last big XO code update for S3 backup cleared up all the issues I was having. I see about 60-70Mbit/s for delta backups to Wasabi S3. File restore from S3 backups was a great feature addition.

You can also use S3 backup locally with a MinIO server (Linux or TrueNAS).

planedrop

@JamesG I've mentioned it many times before and never been "chastised" for it. In fact we have a few VMs with more than one 2TiB VHD, spanned within Windows so we can have a disk bigger than 2TiB, and it works great.

As for occasional failures, I'm not 100% sure whether that was related to my setup. It was ONLY with full backups, and each VM failed at 43-something GB with an HTTP timeout. It was very odd, but deltas of the same VMs would run for many hours without any issues, so I'm not quite sure what is going on.

I have not tried with an actual AWS S3 bucket, but I don't really think that'll make a huge difference. B2 is very, very fast (like I mentioned, TrueNAS is dumping to it at 2 gigabit without any issues). Maybe I'll give it a shot at some point just in case, though.

And yeah, we have to have TrueNAS at this site for other reasons, so it worked out fairly well, but it definitely makes a restore a sort of "double restore" where we would have to re-download to TrueNAS and then connect XOA to it. But IMO it's important to have a local backup location anyway rather than just a remote site, so I'm doing both even if S3/B2 gets super fast at some point.

planedrop

@Andrew Interesting, I've never been able to get backups at those speeds. Even then, with how big some of our stuff is, we'd need more like 1000Mbit/s to get things done in a reasonable amount of time.
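
To put those rates in perspective, here is a rough back-of-the-envelope sketch in Python. It assumes a single 2 TiB disk that is transferred in full, with no compression and a perfectly sustained link rate; none of that was measured in this thread, and `transfer_hours` is a hypothetical helper, not anything from XO.

```python
TIB = 2**40  # bytes in one tebibyte


def transfer_hours(size_bytes: int, rate_mbit_per_s: float) -> float:
    """Hours needed to move size_bytes at a sustained rate given in Mbit/s."""
    bits = size_bytes * 8
    seconds = bits / (rate_mbit_per_s * 1_000_000)
    return seconds / 3600


# Rates mentioned in the thread: ~25 Mbit/s (B2 via XO), ~65 Mbit/s (Wasabi),
# and the 1000 Mbit/s target.
for rate in (25, 65, 1000):
    hours = transfer_hours(2 * TIB, rate)
    print(f"{rate:>5} Mbit/s: {hours:7.1f} h ({hours / 24:.1f} days)")
```

Under those assumptions, one 2 TiB disk takes roughly eight days at 25 Mbit/s and just under five hours at 1000 Mbit/s.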

Also, have you let your delta chains do any merging yet? That is where I saw the most issues. I don't mind if backups take a while, but once we hit the retention count, things got REALLY bad lol.

I'll give a better example though:

Set backup retention to, say, 4, then do 4 backups. On the 5th run it will back up and then merge the 1st delta into the full backup. This was the process that was taking super long to complete for us.

For example, one VM set up this way had a delta of only 316MiB, but the oldest remaining delta that needed to be merged into the full was quite a bit larger (I think around 10-20GB). The actual backup only took a few seconds to dump that 316MiB, but then the merge process lasted over an hour. This also results in a huge number of transactions on the B2 bucket, which does raise costs a bit (though that's a minor complaint, since it only made things slightly more expensive).
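
The retention-then-merge behaviour described above can be sketched as a tiny simulation. This is only a model of the description in this post, not XO's actual code: `simulate` and its log messages are made up for illustration, and each restore point is represented by its run number.

```python
def simulate(retention: int, runs: int):
    """Model a delta chain: chain[0] is the full, the rest are deltas.

    Whenever the chain grows past `retention`, the oldest delta is merged
    into the full, so the chain shrinks back to `retention` restore points.
    """
    chain = []  # restore points, oldest first
    log = []
    for run in range(1, runs + 1):
        chain.append(run)
        log.append(f"run {run}: transfer")
        while len(chain) > retention:
            old_full = chain.pop(0)
            # chain[0] is now the oldest delta; after the merge it is the full
            log.append(f"run {run}: merge delta {chain[0]} into full {old_full}")
    return chain, log


chain, log = simulate(retention=4, runs=5)
for line in log:
    print(line)
print("remaining restore points:", chain)
```

Note that the merge triggered on run 5 involves the data from runs 1 and 2, not the small delta that was just transferred, which matches the 316MiB example above.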

I am on the latest XOA, but maybe I am missing something here? Maybe @florent has some ideas? If I could back up directly from XOA, that'd be great, rather than doing the tiered thing I mentioned in my reply above.

JamesG

@planedrop Olivier is of the mindset that the VM/server should really be just the OS/application and that large data should live on some other storage. This keeps your VMs light and agile, so backups and migrations are "fast". Let the storage subsystem do the heavy lifting for the storage. It's a more cloud-centric way of thinking than what we've traditionally done, and keeping the data stores separate from the actual VM/server makes upgrading servers/apps potentially less painful.

Back on topic... I can try S3 to B2 and AWS from another site with more bandwidth, see how that goes, and report back.

For Andrew... When you say the "last big XO update", what are you referring to exactly? Which version specifically? I ran an update on the local XOfS instance on-site over the weekend, so presumably that system has the update.

Thanks!!!

planedrop

@JamesG Oh yes, that's something I totally get and don't disagree with at all, and we do function that way for some of our larger data needs (we have PB-level data requirements for some apps). But for a few specific setups we need larger local disks on the machines for some legacy applications, etc. This should change in the future, but for now it's a requirement of ours. As of now we have quite a few large VMs, and only one of them could have its data offloaded to an SMB share or something of the sort, but I just haven't taken the time to do it for that one VM.

I'd definitely be interested in your results with B2; maybe there is something wonky with my setup. The important thing, though, is to run deltas and let them get to the point of merging into the original full (so let them run past the retention number), since that was the area that was super slow for us.

Andrew

@JamesG I have been running S3 delta backups to Wasabi for about 18 months as an off-site backup. I keep about 2 weeks of backups and only do a true full backup once a quarter. I also keep an hourly CR and have other normal inband OS backups too. This mostly meets the normal 3-2-1 backup standard.

The last "big" S3 backup update for XO was about 6 months ago and helped solve a lot of merge issues and orphaned backup data that was no longer needed. It added better verification and cleanup of S3 data. That update caused a lot of dead data cleanup and I had to delete some stuff manually to get it back on track. Since then S3 backups have been working consistently for me.

JamesG

@Andrew Thanks for that added detail.

Your success with Wasabi is encouraging. Perhaps planedrop's performance issues with Backblaze B2 are related to a specific interaction between Backblaze's S3 implementation and XO.

Things to test:

XO to AWS
XO to Wasabi
XO to BackBlaze

Theoretically, the performance should be the same to all S3 endpoints.

planedrop

@JamesG I'll do some more testing with B2 as well to see if I can improve things at all, and I'll try to do a better job of logging exactly what happened and how long each step took.

I also still need to test the writeBlockConcurrency setting that @florent has mentioned in the past, though I'm still having trouble finding the right place in the config to put it.

olivierlambert

Indeed, as everyone here has said, performance can vary depending on your provider, and even on the provider's own internal load. Merging mostly consists of file renames and deletions, so if your provider isn't fast at those operations, that might explain the poor merge speed.

florent

I'll try to answer the questions.

@JamesG the 32 characters are the key itself. You need to keep this key; without it you won't be able to restore anything. If the key is wrong, you'll get an error message when connecting to the remote.
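
Since the 32 characters are used as the key directly, it makes sense to generate them randomly rather than type something memorable. A minimal Python sketch follows; it assumes the remote form accepts any 32-character string, and the alphanumeric alphabet and the `generate_remote_key` name are illustrative choices, not something XO prescribes.

```python
import secrets
import string

# Assumption: letters and digits are accepted by the remote form.
ALPHABET = string.ascii_letters + string.digits


def generate_remote_key(length: int = 32) -> str:
    """Return a random key of `length` characters from a CSPRNG."""
    return "".join(secrets.choice(ALPHABET) for _ in range(length))


key = generate_remote_key()
print(len(key))  # store the key itself in a password manager, not in the bucket
```

Whatever you generate, keep a copy somewhere that does not depend on XO or on the encrypted remote, since that is exactly what you will be missing in a disaster recovery.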

I am glad to see that S3 is now stable enough for you all. To improve it further, we are planning to implement retries wherever possible. This is already implemented when reading the backup over NBD, so please give it a try.

There is also a test branch ( https://github.com/vatesfr/xen-orchestra/pull/6840 ) that should improve concurrency handling when reading over NBD and writing to S3.
For now, you should test it on a separate job with a big VM.

The merge is mostly a copy and delete of all the blocks (1-2MB each) that are not used anymore. This step could probably be sped up further, but we prioritized reliability here, since it can break the full backup chain if something goes wrong (and thanks @andrew for your time on this feature). You can set mergeBlockConcurrency to a higher value (like 8) in the backups section of your config file.

You can also increase writeBlockConcurrency in the backups section of the config file to speed up transfers, especially when coupled with the test branch.

The easiest way to ensure your config file is not reset on each update is to create it at ~/.config/xo-server/config.toml (use the home directory of the user running XO).
If you have an XOA, use /etc/xo-server/config.toml.

[backups]
writeBlockConcurrency = 32
mergeBlockConcurrency = 8
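
For anyone wondering what a block concurrency setting means in practice, here is a generic Python sketch of bounded concurrency. It is not XO's implementation (XO is written in JavaScript): `process_blocks` and `fake_copy` are hypothetical names, and the 10 ms sleep simply stands in for one S3 request. The idea is only that at most N block operations are in flight at once.

```python
import asyncio


async def process_blocks(block_ids, concurrency, handle_block):
    """Run handle_block on every block, with at most `concurrency` in flight."""
    semaphore = asyncio.Semaphore(concurrency)

    async def worker(block_id):
        async with semaphore:
            return await handle_block(block_id)

    return await asyncio.gather(*(worker(b) for b in block_ids))


async def demo():
    in_flight = 0
    peak = 0

    async def fake_copy(block_id):
        # Stand-in for copying one block on the remote.
        nonlocal in_flight, peak
        in_flight += 1
        peak = max(peak, in_flight)
        await asyncio.sleep(0.01)
        in_flight -= 1
        return block_id

    results = await process_blocks(range(40), 8, fake_copy)
    return results, peak


results, peak = asyncio.run(demo())
print("blocks processed:", len(results), "peak in flight:", peak)
```

A higher limit helps most when each request is dominated by latency rather than bandwidth, which is usually the case for many small operations against an object store.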

fbeauchamp opened this pull request in vatesfr/xen-orchestra

draft feat(@xen-orchestra/backups): use parallel reading whith NBD + blocks to only one remote #6840

planedrop

@florent I did some testing with this and wanted to let you know how long it's taking. I haven't tried the test branch yet, but I am using NBD.

I had a VM with a 25 GiB delta that needed to be merged on Backblaze B2, over a 200Mbps upload connection. The upload of the new snapshot only took a few seconds (it was about 100 KiB), but the merge of the previous 25 GiB delta took 2.5 hours to complete. Does this seem normal?

JamesG

Just to make sure I'm understanding the backup side of XO...

Backup retention is how many backups will be kept on the remote, and any backed up data that's older than the retention number should be removed automatically by the backup process?

For example, a "full backup" schedule that runs daily with a retention of 2, should only ever have two backups on the remote?

If XO cleans up behind itself...What exactly is it keying off of to determine what "old" files to delete?

I ask because it doesn't look like XOfS is doing any house-keeping on S3 storage (specifically BackBlaze B2).

For example, I started a daily full backup schedule with a retention of 2 on 5-21-23. As of today, all backups were still in the bucket. Before the job ran, I manually removed everything from the bucket that had file dates up to 20230526*. After the job completed, I checked and I still had backups from 20230527* on to the expected 20230531* for today. I changed retention to 3 and set "delete before backup" and executed again, but I just ended up with another 20230531* backup set. I did notice that the files themselves were coded with the day of the backup, but that the actual date on the files was within the past two days...Even if the file was an older file.

Example:

20230527T040007Z.json (2) * 14.1 KB 05/29/2023 00:04
20230527T040007Z.json.checksum (hidden) 0 bytes 05/29/2023 00:04
20230527T040007Z.xva (2) * 995.5 MB 05/29/2023 00:04
20230527T040007Z.xva.checksum (2) * 36.0 bytes 05/29/2023 00:04
20230528T040008Z.json (2) * 14.1 KB 05/30/2023 00:06
20230528T040008Z.json.checksum (hidden) 0 bytes 05/30/2023 00:06
20230528T040008Z.xva (2) * 1.0 GB 05/30/2023 00:06
20230528T040008Z.xva.checksum (2) * 36.0 bytes 05/30/2023 00:06

This could just be a BackBlaze specific thing that they're doing. As you can see though, the file names indicate the date/time XO created them, but the BackBlaze file (system?) date is two days later. If XO is looking at the remote filesystem date, then this makes sense why those older backups are still retained. However if XO is looking at the filenames it creates, then I would expect it to have cleared off the older backups.

This also raises a question... If retention is set, is it the number of copies, or the number of scheduled cycles? If copies, then presumably manually executing a daily backup a couple of times in a row would clean up the previous two days of backups. If cycles... then presumably a retention of "2" for daily backups would mean it keeps all backups less than two days old. If the retention is "8" for an hourly backup, then any backups older than 8 hours would be cleared off.

The cycles method based on remote file system dates makes more sense to me and is what I would suspect XO is doing. In my case with BB, it would just appear that something strange is happening on their file system that is throwing the dates off.
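
One way to compare the two dates is to parse the timestamp embedded in the file name. The trailing Z in names such as 20230527T040007Z marks the time as UTC. The sketch below is Python and assumes the bucket listing is also shown in UTC, which may well not be true of the Backblaze web UI; `backup_time` is a hypothetical helper.

```python
from datetime import datetime, timezone


def backup_time(filename: str) -> datetime:
    """Parse the UTC timestamp that prefixes a backup file name."""
    stamp = filename.split(".", 1)[0]
    return datetime.strptime(stamp, "%Y%m%dT%H%M%SZ").replace(tzinfo=timezone.utc)


# File name -> date shown in the bucket listing (copied from the example above).
listing = {
    "20230527T040007Z.xva": "05/29/2023 00:04",
    "20230528T040008Z.xva": "05/30/2023 00:06",
}

for name, shown in listing.items():
    # Assumption: the listing date is UTC; adjust if the UI uses local time.
    listed = datetime.strptime(shown, "%m/%d/%Y %H:%M").replace(tzinfo=timezone.utc)
    print(name, "listed", listed - backup_time(name), "after the name's timestamp")
```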

florent

@planedrop

The merge duration depends on the size of the VHD being merged, so it depends on the size of the two oldest backups, not the latest one.
Merging is quite expensive: this is the price we pay for not transferring all the data every time and not letting storage usage grow indefinitely.
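
As a rough sanity check on the 25 GiB in 2.5 hours figure, the numbers can be worked out in a few lines of Python. This assumes 2 MiB blocks (the upper end of the 1-2MB mentioned earlier in the thread) and that every block of the delta is touched once; both are assumptions, and `merge_stats` is a hypothetical helper.

```python
GIB = 2**30
MIB = 2**20


def merge_stats(delta_bytes: int, seconds: float, block_bytes: int = 2 * MIB):
    """Block count and effective rates for a merge of delta_bytes in `seconds`."""
    blocks = -(-delta_bytes // block_bytes)  # ceiling division
    return {
        "blocks": blocks,
        "blocks_per_second": blocks / seconds,
        "effective_mbit_per_s": delta_bytes * 8 / seconds / 1_000_000,
    }


stats = merge_stats(25 * GIB, 2.5 * 3600)
for name, value in stats.items():
    print(f"{name}: {value:.2f}")
```

Under those assumptions the merge handled 12,800 blocks at about 1.4 blocks per second, or roughly 24 Mbit/s of effective throughput. That is far below the 200Mbps link, which suggests per-request latency on the remote matters more than bandwidth here.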

florent

@JamesG

To clean up backups, it looks at the date only to sort which ones are older. Retention is the number of backups kept (in this case, full backups); it does not depend on their age. For example, if you disable the backup job and re-enable it later, it will only clean up the oldest ones.
For each full backup it creates a metadata file (mainly information on the backup job, for a full backup), an XVA file (which contains the VM data), and a checksum to ensure the file is not corrupted.
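
Count-based retention as described here can be sketched in Python as follows. This is an illustration, not XO's code: `split_by_retention` is a hypothetical helper, and sorting by the timestamp in the file name is an assumption about which date is used for ordering.

```python
def split_by_retention(filenames, retention: int):
    """Return (kept, deleted) backup ids, keeping the `retention` newest."""
    if retention < 1:
        raise ValueError("retention must be at least 1")
    # Group the .json / .xva / .checksum files of one backup under one id.
    backup_ids = sorted({name.split(".", 1)[0] for name in filenames})
    cut = max(len(backup_ids) - retention, 0)
    return backup_ids[cut:], backup_ids[:cut]


# Five daily full backups, three files each, as in the listing earlier.
names = [
    f"202305{day}T040007Z{ext}"
    for day in range(27, 32)
    for ext in (".json", ".xva", ".xva.checksum")
]

kept, deleted = split_by_retention(names, retention=2)
print("kept:   ", kept)
print("deleted:", deleted)
```

Age never enters the calculation: only the position in the sorted list decides which backups are removed.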

JamesG

@florent

Attempting to confirm what's expected vs what's observed....

If retention is the number of backups kept, regardless of the date, then if I had a retention of 2, and ran 5 consecutive backups, only the last two backups should remain on the remote?

florent

@JamesG yes, exactly