Continuous Replication Doesn't Finish

moterpent

I'm wondering if anyone else has had an issue with continuous replication (CR) not finishing? Starting sometime around October, CR jobs that had been running unchanged for 1-2 years started having issues. Ultimately one, and on rare occasions two, VMs will never finish their transfer (see screenshot). They remain in the transfer state indefinitely, or until the xoa process is restarted.

As a consequence, all future runs fail as the job believes, rightly so, that the previous job is still in progress.

There doesn't seem to be any pattern to which VM it fails on. I've tried re-basing each VM by removing all CR VMs for a given VM, making sure no files for said VM exist on the SR, and then re-creating the initial full. Regardless, sooner or later CR will time out again.

FWIW, I also run delta backups to the same SR and have no such problems. It's only with CR and only for the past few months.

I know I've seen a related post to this, but I'm struggling to find it again. I'll keep searching and add it to this thread once located.

I'm running from source and will typically pull the latest version 1-2 times per week, looking for updates that may resolve the issue. Currently on the following versions.

Xen Orchestra, commit b0846

xo-server 5.106.1

xo-web 5.107.0

Andrew

@moterpent Do you ever run a full CR (not just a delta)? Do you have enough disk space on both source and destination?

I run hourly CR deltas and do a full CR every week. Sometimes they leave snapshots but they always finish.

I'm using current XCP 8.2.1 and current XO source.

moterpent

@Andrew

Thanks for the response. In my case, the CR and the delta backups run as separate jobs, on different schedules, with different retention. It's my understanding that CR is delta by nature and that there's not really an option to do otherwise. Essentially an initial full, subsequent deltas, and then coalesce depending on retention.

How are you automating a full base once per week under CR?

Yes, there is plenty of drive space available (1.2TB) and a delta is only MB to a couple GB.

Andrew

@moterpent Not true. You want CR to be a delta in most cases so the transfer is kept small and can be run more often. My thought is that it's a good idea to do a full CR data transfer occasionally so any delta errors get corrected. I don't think it's a good idea to do only delta updates forever. But it's more of a feeling than something with proof.

The option for a full is in the job schedule: it forces a full after X number of deltas. If your remote CR system is on a WAN, it could take a while to do a full update.

I also have delta backups (not CR) that run from a different job but only once a day. They are normally deltas but I run a full sometimes on them too...

moterpent

@Andrew

Thanks for clarifying again. I had missed that.

I see the option in the schedule, but it's only a checkbox. My guess is you have to combine it with the Settings -> Advanced -> Full Backup Interval input? It's not clear from the docs if those two things are related or separate.

I have mixed feelings about the deltas. Someone clearly put thought into the deltas-forever case: with bandwidth constraints, you may need to create an initial seed for a remote site and then only ever send deltas. In my case this is a remote NFS store, but it's a low-latency (~5ms), high-throughput (1Gbps) connection, so it's not too onerous to do a full periodically.

At the end of the day, I've ticked the "Force Full Backup" checkbox and entered the same value as the retention in the advanced settings. We'll see how it goes.

I'm still puzzled by why this is a recent issue and didn't use to exist. Also, it would be nice if there were some graceful timeout that would bring the replication to a close when no data has been transferred for a prolonged period of time.
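
To make the idea of a graceful timeout concrete, here is a minimal sketch of an inactivity watchdog. It is not how Xen Orchestra handles transfers; `StallWatchdog`, `report_progress` and `is_stalled` are hypothetical names used only for illustration, and the 600-second idle limit is an arbitrary assumption.

```python
import time


class StallWatchdog:
    """Remembers when data last moved and flags a stall after an idle timeout.

    Hypothetical helper for illustration only; not part of Xen Orchestra.
    """

    def __init__(self, idle_timeout_s, clock=time.monotonic):
        self.idle_timeout_s = idle_timeout_s
        self._clock = clock              # injectable so the logic can be tested
        self._last_progress = clock()    # the start of the transfer counts as progress

    def report_progress(self, n_bytes):
        # Only an actual transfer of data resets the idle timer.
        if n_bytes > 0:
            self._last_progress = self._clock()

    def is_stalled(self):
        # Stalled once the idle time reaches the configured limit.
        return self._clock() - self._last_progress >= self.idle_timeout_s


# Demo with a fake clock so nothing actually has to wait.
fake_now = [0.0]
watchdog = StallWatchdog(idle_timeout_s=600, clock=lambda: fake_now[0])
watchdog.report_progress(4096)   # some data arrives at t=0
fake_now[0] = 1200.0             # twenty minutes pass with no data
if watchdog.is_stalled():
    print("no data for too long: abort the transfer and mark the run as failed")
```

In practice the `is_stalled()` check would have to run on a timer that is independent of the transfer itself, since a hung transfer never hands control back to the code that is waiting on it.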

Gheppy

You can "reset" the deltas on a CR in two ways (I use the first one):

Set the full backup interval under Advanced settings (bottom left). The reset can be predicted as follows:
Reset = full backup interval / number of deltas per day.
So for a full backup interval of 21 and 4 deltas per day:
Reset = 21 / 4 = 5 remainder 1, so the reset will be on day 6, on the first backup.

Or in the schedule, via Replication retention.

Force full backup only forces a full backup on the next run.
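
As a rough check of the arithmetic above, here is a small sketch that turns a full backup interval and a number of runs per day into the day and run on which the reset lands. The function name `next_full_run` is made up for this example, and it assumes (following the 21 / 4 example) that the run whose number equals the interval is the one that becomes a full; XO's actual counting may differ by one.

```python
def next_full_run(full_backup_interval, runs_per_day):
    """Return (day, run_of_day), both 1-based, for the run that becomes a full.

    Assumption: run number `full_backup_interval` is the full, as in the
    21 / 4 example. This is not taken from the Xen Orchestra source.
    """
    if full_backup_interval < 1 or runs_per_day < 1:
        raise ValueError("both values must be positive integers")
    # Completed days before the full, plus one for the day it falls on.
    day = (full_backup_interval - 1) // runs_per_day + 1
    # Position of that run within its day.
    run_of_day = (full_backup_interval - 1) % runs_per_day + 1
    return day, run_of_day


day, run_of_day = next_full_run(21, 4)
print(f"full on day {day}, run {run_of_day} of that day")
```

With an interval of 21 and 4 runs per day this gives day 6, first run, which matches the example. With an interval of 20 it gives day 5, fourth run, which is the case where the division leaves no remainder.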