Backup and the replication - Functioning/Scale

fcgo

Hello everyone,

I am trying to better understand the load/possible bottlenecks implied by the backup and the replication process.

Is there any documentation explaining in depth how it works? The docs I could find are quite high-level on this subject.
For example, regarding incremental backups, is the compression done by the source host or by XOA?
Is the algorithm zstd or brotli?
And how do you scale up if you have more than 1000 VMs? Do you distribute the backup jobs between multiple XOAs, or upgrade XOA resources or dom0 resources?
Do XO proxies have fewer backup and replication capabilities than XOA, and do they need to "lean" on XOA for specific functions?

I know this is a lot of questions.
Keep up the great work.
Thank you from Paris

Pilow

@fcgo hello there,

From my experience, XO proxies are just stripped-down XOAs, with no limitation on backup functionality.
I agree with you: in-depth information on how the backups are done is lacking...

The only place to find compression information is in Disaster Recovery jobs.

We upgraded the CPU and RAM of our XOA, but offloaded all backup tasks to proxies.
If you have 1000+ VMs, I think you already have at least 16 GB of RAM for dom0 (or you have 500 hosts with 2 VMs each and stick with the default...).

Assigning jobs to proxies is fairly manual (for example, Veeam can manage a pool of proxies and pick the least busy one, or the chosen one(s), in the pool).

If you implement proxies, you will have to assign them to a job AND subsequently assign them to remotes too! You need a good plan of what you want done: a remote locked to a proxy is not seen by other proxies, so you sometimes need to create the SAME remote twice to attach it to two proxies, and then make sure the two never run in parallel.

Planning the backups of 1000 VMs is quite something.

olivierlambert

Adding @florent and @bastien-nollet to the loop so they can read these ideas on better reports and automated proxy assignment.

florent

@fcgo

compression:
On incremental: only if you use the "block" mode in the remote settings. It is done by XOA (or the proxy) using brotli with the BROTLI_MIN_QUALITY setting. The goal is to compress the empty parts of the data blocks, which are exported as zeroes.
On full: done by the host, with gzip or zstd.
Generally the limiting factor is the individual export speed of the XAPI; you scale by increasing concurrency. Increasing dom0 resources is good here.
Latency from the host to the backup runner kills performance: proxies are great for placing the runner as close as possible to your source hosts.
As Pilow said, proxies are dumbed-down XO (same backup code), but they lack the scheduling and configuration parts. XO is the one launching the
Each backup job uses its own process (so CPU and memory) on XO and proxies, so that's another way to scale.
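
To illustrate why a cheap, fast compression setting is enough for blocks that are mostly zeroes, here is a minimal Python sketch. It uses zlib level 1 from the standard library as a stand-in for brotli with BROTLI_MIN_QUALITY (a swapped-in codec, not what XO actually runs); the 2 MiB block size and the function name are assumptions for illustration only.

```python
import os
import zlib

BLOCK_SIZE = 2 * 1024 * 1024  # assumed block size, for illustration only


def compress_block(block: bytes) -> bytes:
    """Compress one data block with the cheapest zlib level.

    zlib level 1 stands in for brotli's minimum quality here.
    """
    return zlib.compress(block, 1)


empty_block = bytes(BLOCK_SIZE)        # a block exported as zeroes
random_block = os.urandom(BLOCK_SIZE)  # worst case: incompressible data

empty_ratio = len(compress_block(empty_block)) / BLOCK_SIZE
random_ratio = len(compress_block(random_block)) / BLOCK_SIZE

print(f"empty block:  {empty_ratio:.4%} of original size")
print(f"random block: {random_ratio:.4%} of original size")
```

The empty block shrinks to a tiny fraction of its size even at the lowest effort level, while the random block barely changes, which is why spending more CPU on a higher quality level would bring little benefit for this purpose.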

The biggest backup job I have seen in the wild (during a support case) was a few hundred VMs. Maybe there are bigger ones working without issue.
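
As a rough sketch of the "scale by increasing concurrency" idea mentioned above: a bounded worker pool keeps a fixed number of exports in flight at once. This is not XO code; `run_exports`, the VM names, and the sleep that stands in for a slow export are all hypothetical.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def run_exports(vm_names, concurrency, export_seconds=0.01):
    """Export VMs with at most `concurrency` exports in flight.

    Returns (exported names in input order, peak simultaneous exports).
    """
    lock = threading.Lock()
    state = {"in_flight": 0, "peak": 0}

    def export_vm(name):
        with lock:
            state["in_flight"] += 1
            state["peak"] = max(state["peak"], state["in_flight"])
        time.sleep(export_seconds)  # placeholder for the real, slow export
        with lock:
            state["in_flight"] -= 1
        return name

    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        done = list(pool.map(export_vm, vm_names))
    return done, state["peak"]


vms = [f"vm-{i}" for i in range(20)]
done, peak = run_exports(vms, concurrency=4)
print(len(done), "exports finished, peak in flight:", peak)
```

Raising `concurrency` shortens the total run time only as long as the hosts, the network, and the remote can keep up, which matches the point that per-export speed is usually the limiting factor.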

fcgo

@florent @pilow
Thank you all, it is clearer.

Can backup copies between sites flow from one proxy directly to another proxy, or do they need to flow from one proxy to XOA and then to the other proxy?
In the latter case, could XOA become a bottleneck?

Thank you

Pilow

@fcgo when you add an XO proxy to a job, the data flows from the XO proxy to the remote.
I think XOA is not involved: I checked the network bandwidth, and XOA was idle.

The XO proxy reads from the source remote and writes to the destination remote.

fcgo

@Pilow I was asking in the case where the sites are only allowed to communicate via HTTPS, meaning NFS remote at site 1 is accessible through proxy 1, and NFS remote at site 2 is accessible through proxy 2 (XOA being on site 1).

Pilow

@fcgo this is where you reach a limit...
in an XOA backup copy job configuration, you select the proxy, and it drives source AND destination

you cannot have a proxy for SOURCE and a proxy for DESTINATION

Pilow

But each remote (SOURCE and DESTINATION) can be attached to the same proxy, so that one proxy does both the reading and the writing.

You then have to manage your network so that this proxy can reach each remote (this is where static routing or VPNs come in...).
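
A small Python sketch of that constraint, assuming a hypothetical mapping of proxies to the remotes they can reach (the names `proxy1`, `nfs-site1`, etc. are made up, and this is not an XO API): a copy job is only feasible if a single proxy can see both remotes.

```python
def proxies_for_copy_job(reachable, source_remote, destination_remote):
    """Return proxies that can reach both remotes of a backup copy job.

    `reachable` maps a proxy name to the set of remotes it can access.
    """
    return sorted(
        proxy
        for proxy, remotes in reachable.items()
        if source_remote in remotes and destination_remote in remotes
    )


# Two isolated sites: each proxy only sees its local NFS remote.
isolated = {"proxy1": {"nfs-site1"}, "proxy2": {"nfs-site2"}}
print(proxies_for_copy_job(isolated, "nfs-site1", "nfs-site2"))  # → []

# After routing/VPN work, proxy1 can also reach the remote on site 2.
routed = {"proxy1": {"nfs-site1", "nfs-site2"}, "proxy2": {"nfs-site2"}}
print(proxies_for_copy_job(routed, "nfs-site1", "nfs-site2"))  # → ['proxy1']
```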

fcgo

@florent can you confirm this for replication jobs?

Thank you

florent

@fcgo replication goes from XAPI (= pool/hosts) to XAPI, and backup goes from XAPI to the remote. As of now, you can't chain proxies.

XAPI calls use HTTPS, which means that replication across sites uses HTTPS. If you do a backup and the proxy running it is on the same site as your NFS, then the XAPI <-> proxy path is HTTPS and the proxy <-> NFS path is local.

That is one of the strengths (and complexities) of the XCP-ng infrastructure: everything goes through an API (the XAPI), and nobody accesses host/pool data directly from the outside.

fcgo

@florent Hi,

Thank you.

And during a backup, which host is working hard during the VDI export to XOA (or to the proxy)? Is it the host running the backed-up VM, or is the master also doing some work (other than control channel calls)? Which host does the network flow go through?

florent

@fcgo if the storage is shared, the export is done by one of the hosts of the pool.
If the storage is not shared, the export is done by the host that has the storage. The same applies to the host receiving the data.

The command channel, as you said, is always from the master to XOA (and possibly the XO proxy).

So for a replication :

[source SR] => source host =https export call=> xoa / xo-proxy =https import call=> target host => [target SR]

If XOA is running on the host doing the export, the transfer does not use the physical network.
The network used between the host and the SR depends on the type of storage.

fcgo

@florent "the export is done by one of the host of the pool"
How is this host selected? Is it the master, the one hosting the VM, or the least busy one in the case of shared storage?

fcgo

@florent did you have time to check my last question about how the host doing the export is selected?

Br,

florent

@fcgo AFAIK the host is selected at random. Maybe @andriy.sultanov knows more.

andriy.sultanov

@florent @fcgo the host that was queried (with host_ip/export_raw_vdi) will do the export itself if the SR is shared, and will otherwise redirect the query to the host that can see the SR. So it's up to XO or other orchestrators to decide whether they want to distribute the load here.
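
The selection rule described above can be sketched in a few lines of Python. This is only a model of the behaviour as stated in this thread, not XAPI code; `exporting_host` and the host names are hypothetical.

```python
def exporting_host(queried_host, hosts_seeing_sr):
    """Pick the host that serves a VDI export.

    `hosts_seeing_sr` lists the hosts attached to the SR: every pool
    host for a shared SR, a single host for a local SR.
    """
    if queried_host in hosts_seeing_sr:
        return queried_host    # the queried host exports directly
    return hosts_seeing_sr[0]  # otherwise redirect to a host that sees the SR


# Shared SR: whichever host the orchestrator queries does the work.
print(exporting_host("host-b", ["host-a", "host-b", "host-c"]))  # → host-b

# Local SR on host-c: a query sent to host-b is redirected to host-c.
print(exporting_host("host-b", ["host-c"]))  # → host-c
```

With a shared SR, the orchestrator therefore controls the load distribution simply by choosing which host it sends each export query to.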

Vanny

In some third-party backup solutions (for example Vinchin and similar enterprise platforms), proxy pool management and job distribution are fully automated, which helps when managing hundreds or thousands of VMs across sites.

olivierlambert

@vanny you can say you are working for Vinchin; I prefer it when that's clearly stated.

florent

Thanks Andriy.
We use round robin when using NBD, but to be fair, it does not change performance much in most cases. The concurrency setting (multiple connections to the same file) helps when there is high latency between XO and the host.

So, @fcgo, if you have thousands of VMs, you should enable NBD: it will consume fewer resources on dom0 and XO, and the load will be spread across all the possible hosts.
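
For readers unfamiliar with round robin, here is a minimal Python sketch of spreading transfers across hosts in turn. The per-VDI granularity, the function name, and the host and VDI names are assumptions for illustration; the thread does not say at which level XO applies it.

```python
from itertools import cycle


def assign_round_robin(vdis, hosts):
    """Map each VDI to a host, cycling through the hosts in order."""
    if not hosts:
        raise ValueError("at least one host is required")
    host_cycle = cycle(hosts)
    return {vdi: next(host_cycle) for vdi in vdis}


plan = assign_round_robin(
    ["vdi-1", "vdi-2", "vdi-3", "vdi-4", "vdi-5"],
    ["host-a", "host-b"],
)
for vdi, host in plan.items():
    print(vdi, "->", host)
```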