How to implement the perfect backup strategy with Xen Orchestra?

iverlaek

The whole purpose of a backup is to be able to restore data after an accident.
It's important to be able to get back to the previous state as quickly as possible, without worrying about incomplete restores or losing important data.

For me that means:

You need to know which backup was made "before the accident happened", that is: you go back chronologically until the issue is gone.
You need to be sure you are able to fully restore the data (that is: no missing drives, inconsistencies, corrupt data or other 'surprises').
If a physical backup storage device is defective, an alternative copy on another storage device needs to be available too.
In case of a hacker attack, a third (offline) copy should be available with full long- and short-term data.

A couple of weeks ago my home lab network was struck by lightning. It made me realize how important backups are. I lost network cards, switches, several disks and my Synology NAS.
Although I was able to restore my systems (thanks to my insurance I could buy new hardware), I was left wondering how to create the optimal backup strategy in Xen Orchestra.

Can you give me hints on how to implement "the perfect backup strategy"?
Do you have a script, blog, video or similar explaining how to implement such a strategy?

for me the optimal backup scheme would be:

Hourly backups during working hours of the most important VMs, with a retention of ..8? (total number of working hours)
All VMs need to be backed up daily, weekly, monthly and yearly. (What's the retention for every period per storage medium?)
All backups need to be split evenly across multiple storage devices.
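To make the scheme above a bit more concrete, here is a minimal Python sketch of how such a tiered retention could decide which backups to keep. It is not how Xen Orchestra works internally. The retention counts in RETENTION and the function names are my own assumptions for illustration, and the right counts are exactly the open question above.

```python
from datetime import datetime, timedelta

# Hypothetical retention counts per period; adjust to your storage and needs.
RETENTION = {"hourly": 8, "daily": 7, "weekly": 4, "monthly": 12, "yearly": 3}


def bucket_key(ts, period):
    """Return a key that is identical for all backups in the same period."""
    if period == "hourly":
        return (ts.year, ts.month, ts.day, ts.hour)
    if period == "daily":
        return (ts.year, ts.month, ts.day)
    if period == "weekly":
        iso = ts.isocalendar()
        return (iso[0], iso[1])  # ISO year and ISO week number
    if period == "monthly":
        return (ts.year, ts.month)
    if period == "yearly":
        return (ts.year,)
    raise ValueError(f"unknown period: {period}")


def backups_to_keep(timestamps, retention=RETENTION):
    """Keep the newest backup of each of the last N buckets, for every period."""
    keep = set()
    newest_first = sorted(timestamps, reverse=True)
    for period, count in retention.items():
        seen = []
        for ts in newest_first:
            key = bucket_key(ts, period)
            if key not in seen:
                seen.append(key)
                keep.add(ts)  # newest backup in this bucket
            if len(seen) == count:
                break
    return keep
```

Everything not returned by backups_to_keep would be a candidate for deletion. A backup can satisfy several periods at once (the newest hourly backup is also the newest daily one), so the total kept is usually lower than the sum of the counts.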

It would be great if such a backup plan could be defined with a couple of clicks from within Xen Orchestra.
It would make things so much easier for all of us. New VMs should automatically be added to such a backup plan.

The current implementation is flexible but requires many steps to create such a backup scheme.
All basic ingredients are already available: full, delta, continuous replication and rolling backups. Therefore it should be easy to add a screen to the UI that implements the perfect backup strategy automatically.

Right now it's a lot of work to create a similar 'optimal' backup scheme.
Mistakes are easily made, because a backup can be scheduled to start while another backup is still running. This causes error messages or even faults (file open, cannot delete etc.).
IMO it's better to have all parameters defined within one single job.
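As an illustration of the overlap problem, here is a small Python sketch that checks a set of planned jobs for overlapping time windows. The job names, start times and durations are made up, and the durations would have to be estimated from earlier runs. It says nothing about how Xen Orchestra schedules jobs itself.

```python
from datetime import datetime, timedelta
from itertools import combinations


def find_overlaps(jobs):
    """Return pairs of job names whose run windows overlap.

    jobs maps a job name to (start: datetime, estimated duration: timedelta).
    """
    overlaps = []
    for (a, (start_a, dur_a)), (b, (start_b, dur_b)) in combinations(
        sorted(jobs.items()), 2
    ):
        # Two windows overlap when each one starts before the other one ends.
        if start_a < start_b + dur_b and start_b < start_a + dur_a:
            overlaps.append((a, b))
    return overlaps


if __name__ == "__main__":
    day = datetime(2024, 1, 1)
    jobs = {  # hypothetical schedule
        "hourly": (day.replace(hour=12), timedelta(minutes=30)),
        "daily": (day.replace(hour=12, minute=15), timedelta(hours=2)),
        "weekly": (day.replace(hour=20), timedelta(hours=1)),
    }
    print(find_overlaps(jobs))
```

With a single job holding all parameters, a check like this could run once when the plan is saved instead of being left to the user.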

a few additional feature requests or ideas:

Please improve the visibility of the backup date and time. It isn't clearly visible when trying to restore systems; you have to search for it.
All running servers should back up their own VMs simultaneously (it seems to me this is currently not the case; maybe that's what the proxy systems are intended for).
All backups should land on all of the available backup servers.
(It's possible now, but the copies are all taken at the same time, so you end up with identical copies. If they were taken at different times instead, a retention of 4 on 3 servers would give 3x4=12 different copies, so less chance of being left with a corrupt backup.)
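The 3x4=12 idea can be checked with a small simulation. This is only a sketch: the remote names are hypothetical, and rotating the target per run is my own assumption about how the copies could be spread, not an existing Xen Orchestra option.

```python
from collections import deque


def simulate(runs, remotes, retention, rotate):
    """Simulate backup runs and return what each remote still holds.

    rotate=True sends each run to one remote in turn (round robin);
    rotate=False sends every run to all remotes at the same time.
    """
    stored = {r: deque(maxlen=retention) for r in remotes}  # oldest drops out
    for run in range(runs):
        targets = [remotes[run % len(remotes)]] if rotate else remotes
        for remote in targets:
            stored[remote].append(run)
    return stored


def distinct_restore_points(stored):
    """Count how many different backup runs are still restorable somewhere."""
    return len({run for kept in stored.values() for run in kept})


if __name__ == "__main__":
    remotes = ["nas1", "nas2", "nas3"]  # hypothetical remote names
    print(distinct_restore_points(simulate(30, remotes, 4, rotate=True)))
    print(distinct_restore_points(simulate(30, remotes, 4, rotate=False)))
```

The trade-off is that with rotation each restore point exists on only one device, so you gain history but lose the identical second copy of any single point in time.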

expand WOL functionality:

Power the NFS servers on/off on demand (example: turn on the monthly server, make the backup and turn it off again).
Power on/off additional sleeping servers to speed up processing of backups, for compressing and/or merging VMs faster.
Turned off systems are also much better protected against hacking attacks.
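For the power-on half, a Wake-on-LAN magic packet is simple enough to send from a script before a job starts. Below is a minimal Python sketch. The MAC address is a placeholder, and the broadcast address and UDP port 9 are common defaults that I am assuming here. Note that WOL can only wake a machine; powering it off again needs something else (for example a shutdown command over SSH), which is not shown.

```python
import socket


def build_magic_packet(mac):
    """Build a WOL magic packet: 6 bytes of 0xFF followed by the MAC 16 times."""
    mac_bytes = bytes.fromhex(mac.replace(":", "").replace("-", ""))
    if len(mac_bytes) != 6:
        raise ValueError(f"invalid MAC address: {mac}")
    return b"\xff" * 6 + mac_bytes * 16


def send_wol(mac, broadcast="255.255.255.255", port=9):
    """Broadcast the magic packet over UDP (address and port are assumptions)."""
    packet = build_magic_packet(mac)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
        sock.sendto(packet, (broadcast, port))
```

Usage would be something like send_wol("00:11:22:33:44:55") with the real MAC of the NFS server, followed by a wait until the NFS share responds before the backup is started.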

Documentation request:
Please improve the backup documentation with images of the flow of data during a backup and tips on how to optimize the configuration for the fastest possible backups. In short: what are the bottlenecks?

(I have read it's a stream, so is XOA involved, or is it the XenServer host itself doing the hard work? What networks are used when doing a backup? Is it the same as the migration network? etc.)
It's not clear to me how the data flows. It takes seconds to make a snapshot, but I've seen a backup of a single VM take up to 4 hours to complete. (At that moment I didn't see a performance peak on any of the servers or the VMs, and the network wasn't spiking either.) I've never seen my network saturated. It's usually more or less idle, with a couple of short spikes.
How should the network be connected/defined, assuming there is a slower and a faster network available with failover etc.? Because of these slow speeds I am looking for answers as to why things are slower than expected.
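To put a number on "slower than expected", the average rate of a backup can be compared with the theoretical link speed. A small Python sketch of that arithmetic follows. The 500 GB VM size is a made-up example (I did not record the real size), and the 125 MB/s per Gbit/s figure ignores protocol overhead.

```python
def effective_throughput_mb_s(size_gb, hours):
    """Average transfer rate in MB/s (decimal units: 1 GB = 1000 MB)."""
    return size_gb * 1000 / (hours * 3600)


def link_utilization(throughput_mb_s, link_gbit_s):
    """Fraction of the theoretical link capacity used (1 Gbit/s = 125 MB/s)."""
    return throughput_mb_s / (link_gbit_s * 125)


if __name__ == "__main__":
    rate = effective_throughput_mb_s(size_gb=500, hours=4)  # hypothetical VM size
    print(f"average rate: {rate:.1f} MB/s")
    print(f"share of 1 Gbit/s link: {link_utilization(rate, 1):.1%}")
    print(f"share of 10 Gbit/s link: {link_utilization(rate, 10):.1%}")
```

For this example that works out to roughly 35 MB/s, well under a third of a single gigabit link, which would match a network that looks mostly idle and suggests the bottleneck is somewhere other than the wire.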

question:
Does an XCP-ng benchmark exist?
I would like to rank my own system and its configuration against other systems, so I can see whether someone else has configured things more efficiently and learn how to improve my own setup.

fyi.
Hardware: My home lab network consists of 8 nodes/servers:

DELL C6100 with 4 nodes (each node has dual Xeon 6-core processors), all with 64 GB RAM (these systems are usually turned off to conserve power)
ASUS dual Xeon 4-core system with 192 GB RAM
ASUS single Intel 8-core i9900 with 64 GB RAM
two mini-ITX fanless systems with a dual-core i5 and i3 processor and 32 GB RAM
Each node has a 2TB SSD disk for local storage of snapshots, so it takes only a couple of seconds to make a snapshot.
All systems are connected via dual gigabit Ethernet and dual SFP+ 10Gb Fiber
3 Synology systems (2 * DS1812+ and 1 * DS1819+) providing NFS storage; the main NFS server, the DS1819+, is also connected via SFP+ 10Gb fiber. All have SSD read/write cache.
The NFS systems are at 2 different locations, as are the two mini systems.