Best posts made by shorian | XCP-ng and XO forum

shorian

@nraynaud Bizarre, isn't it? I'm so very grateful for your efforts.

Some more news - it seems that one big challenge is around concurrency - things improve dramatically if concurrency is set to 1. As soon as something else is running in parallel, we run into the socket failures. I'm expanding things to try your change on another box to see if the outcomes are different - but in summary what I'm seeing so far:

Concurrency = 1 - works fine the first time, then fails occasionally (roughly 20% of the time?) thereafter
Concurrency > 1 - almost impossible to get it to run; sometimes one or two VMs back up OK, but not predictably and never the entire group
Anything fails - impossible to get a clean run again until the S3 target has been cleaned out entirely

So it appears that there may be a lock occurring somewhere when more than one stream is running, and additionally some kind of conflict when a run has terminated prematurely and the target is therefore not in its expected state on the next run.
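For anyone wondering what "concurrency = 1" amounts to in practice, here is a minimal sketch of the general idea using a semaphore. This is not how XO implements it; `run_backups` and `fake_backup` are hypothetical names invented for illustration, and the "backup" is just a short sleep.

```python
import asyncio


async def run_backups(vm_names, backup_one, concurrency=1):
    """Run backup_one(name) for every VM, never more than `concurrency` at once."""
    limit = asyncio.Semaphore(concurrency)
    results = {}

    async def guarded(name):
        async with limit:  # wait here until a slot is free
            try:
                results[name] = await backup_one(name)
            except Exception as exc:  # record the failure and carry on
                results[name] = exc

    await asyncio.gather(*(guarded(name) for name in vm_names))
    return results


# Hypothetical stand-in for a real backup call; it only tracks peak parallelism.
active = 0
peak = 0


async def fake_backup(name):
    global active, peak
    active += 1
    peak = max(peak, active)
    await asyncio.sleep(0.01)
    active -= 1
    return "ok"


outcome = asyncio.run(
    run_backups(["vm-a", "vm-b", "vm-c"], fake_backup, concurrency=1)
)
```

With `concurrency=1` the streams run strictly one after another, which is the only setting that behaved reliably in the tests described above.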

shorian

Correct - through the CLI one could see that the old VM that had been migrated from Host1 to Host2 was still showing under Host1, whilst the migrated copy was showing as running on Host2. Once the Host1 remnant of the migration was removed that cleared things and XO correctly reported the VM as running on Host2 with its disks attached.

TLDR - There were no other conflicts. XO showed what appeared to be the only version, sitting halted on Host1, but through the CLI one could see both the halted copy on Host1 and the running copy on Host2. Somehow the running version did not show in XO until the remnant was removed.
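If it helps anyone spot the same situation, the check described above can be sketched roughly as follows. The records are hand-written and the field names (`name`, `host`, `state`) are assumptions for illustration, not actual CLI output; you would need to fill them in from whatever your CLI listing shows.

```python
from collections import defaultdict


def find_migration_remnants(vm_records):
    """Return names that appear halted on one host and running on another."""
    by_name = defaultdict(list)
    for rec in vm_records:
        by_name[rec["name"]].append(rec)

    remnants = {}
    for name, recs in by_name.items():
        running = {r["host"] for r in recs if r["state"] == "running"}
        halted = {r["host"] for r in recs if r["state"] == "halted"}
        stale = sorted(halted - running)
        if running and stale:
            remnants[name] = {"running_on": sorted(running), "halted_on": stale}
    return remnants


# Hypothetical records mirroring the situation described in this post.
records = [
    {"name": "db01", "host": "Host1", "state": "halted"},
    {"name": "db01", "host": "Host2", "state": "running"},
    {"name": "web01", "host": "Host2", "state": "running"},
]
remnants = find_migration_remnants(records)
```

Here `db01` is flagged because a halted copy is listed under Host1 while the running copy is on Host2; `web01` is left alone.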

Thanks for your help @olivierlambert

shorian

Always

shorian

@florent Can't wait!

shorian

OK, spent the weekend having backups running continuously across a number of boxes.

Good news - the fix seems to have solved things, provided one only ever uses concurrency “1” and there are no conflicting or overlapping backups.

Restores are working fine for me too.

In short @nraynaud - it’s a substantial improvement and for me makes this now usable. A huge thank you.

shorian

@dustinb Concur 100%; my current focus is on confirming the error doesn't recur and understanding the change in what I'm seeing compared with previous backups. Before this goes into production, I 100% agree it should be tested for restores. I shall do so myself once I've got confidence that the symptom has been resolved.

shorian

I'll try a fresh install over the next couple of days and see if it reoccurs. Looking at the other boxes, I have the same error on one of the other hosts, but it's not across all of them despite identical installs and hardware.

Thanks for your efforts and help. Shame there wasn't an easy answer, but let's see if it reoccurs after a completely fresh install.

shorian

@stormi I confess we're now encountering the same error message on nearly all our backups, including CR to a local host. We started from a fresh install and cleaned SRs; to rule out memory as the culprit we have upped the memory for Dom0 to 16 GB (128 GB machine), and XO is running with 16 GB of memory, of which 12 GB is allocated to node.

We've got the same problem occurring across all our hosts. Over 90% of backups error out with VDI_IO_ERROR; however (weirdly), looking at the target end, I'd say that 75% of the backups 'seem' to complete successfully. I need to restore a couple to find out for sure, but I confess I've been concentrating on finding out what triggers the error rather than whether it is misleading.

I've gone through the logs in detail and unfortunately nothing jumps out. I'm going to take time to extract the relevant sections from them all to see if you can see something that I can't, but apart from lots of memory messages from squeezed there aren't any obvious errors.

Bizarrely, SMlog is pretty clean - it's almost as if it receives a termination signal from somewhere rather than erroring out of its own accord. For example, tapdisk shuts down with "Closing after 0 errors" and no further explanation. I have found some talk that tapdisk can trigger a target reset after excessive I/O activity, but I've not managed to prove that yet.

I'll keep digging into things; in short, it's not something only experienced by @fachex, but I haven't yet recreated it in XOA - it's on my todo list.

(If you want me to download all the logs and send them across directly, or to do anything under the covers of XOA, please do let me know. I'm dipping into this when I get time, so it's not a continuous effort I'm afraid.)

shorian

@zevgeny @olivierlambert For anyone that comes across this under a different guise - we found exactly the same issue and couldn't work out why some VMs were migrating just fine and others were not.

It turned out that we are running a Continuous Replication task which means that there is a UUID conflict when moving the relevant VM across.

So we have Primary Host A & Secondary Host B. The VM is running on the Primary, and we use CR to keep a copy on the Secondary. However, when we want to upgrade the Primary without causing downtime on the VM, we attempt to migrate the VM to the Secondary but it fails, as we already have the CR entity on that host.

I can see why this occurs - we are trying to create two versions of the same VM on the same host. However, I'd have thought that the use case was fairly common: a Primary server replicates to a Secondary server, and one wants to move the running VM to avoid downtime without having to delete the replicas in case there are any issues. With a copied VM the UUID conflict does not occur, but with a migrate it does.

(In our case, on each host we have a large SATA array for backups that we replicate to, with live VMs running off the SSD array. Hence there is still value in keeping the replica - it protects against disk failure or corruption despite being on the same host for a short period whilst we upgrade the Primary.)

Discussion - should one be permitted to update the UUID for replicas (perhaps under advanced settings in Backups?) to avoid these conflicts, or would it cause more widespread issues? Or is our use case unique?

Thanks!

shorian

Update to my earlier post - We found the connection timeout issue was solved by allocating more memory to XO. Even though above I said that the memory didn't appear to be a problem, it turned out that Debian was swapping out so as to keep a chunk of free memory available, so we mistakenly assumed that not using all the memory meant we had sufficient. However being memory restricted combined with a slow disk meant that the swap was growing faster than it was being processed.
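In case it saves someone else the same mistake, here is a rough sketch of the check we should have done: look at swap in use and `MemAvailable` rather than just free memory. It parses `/proc/meminfo`-style text; the sample figures are made up for illustration (they are not from our box), and `parse_meminfo` and `memory_pressure` are hypothetical helper names.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values (mostly kB)."""
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts:
            values[key.strip()] = int(parts[0])
    return values


def memory_pressure(text):
    """Summarise swap usage, which free memory alone does not reveal."""
    info = parse_meminfo(text)
    swap_used = info["SwapTotal"] - info["SwapFree"]
    return {
        "swap_used_kb": swap_used,
        "available_kb": info["MemAvailable"],
        "swap_in_use": swap_used > 0,
    }


# Made-up sample in the /proc/meminfo layout, for illustration only.
sample = """MemTotal:        4039452 kB
MemFree:          612340 kB
MemAvailable:     701224 kB
SwapTotal:       2097148 kB
SwapFree:         912300 kB
"""
report = memory_pressure(sample)
```

On a live Linux VM you could pass in the contents of `/proc/meminfo` instead of the sample. A large and growing `swap_used_kb` alongside apparently free memory is the pattern that caught us out.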

Substantially increasing the XO VM memory (from 4 GB to 16 GB) seems to have solved the timeout issues (so yes, the root cause was user error), and we're now finding that the S3 API to B2 (a lot cheaper than Amazon) is working really well for us.

Well done to the XO dev team - the S3 API has completely changed how we use backups and freed up a lot of infrastructure that we previously had to dedicate to this; thank you.

shorian

For anyone else who comes across the same issue, we had this occur with XCP-ng 8.2 and XO-server 5.71.2, and isolated it to the use of Zstd compression. It was solved by reducing concurrency and assigning a couple of extra vCPUs.

In response to @olivierlambert's point above - it is 100% reproducible if you have an under-resourced XO and a fast target remote, so that the bottleneck becomes the CPU rather than iowait or memory acting as a throttle. In our instance, though, it only occurred for VMs of a reasonable size (>50 GB) containing complex databases with lots of incompressible data.
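To give a feel for what "incompressible" means here, a small sketch that compares compression ratios for repetitive versus random data. It uses zlib purely because it ships with Python's standard library; it is a stand-in, not Zstd, so the absolute numbers will differ, and `compression_ratio` is a hypothetical helper name.

```python
import os
import zlib


def compression_ratio(data, level=6):
    """Compressed size divided by original size (lower means more compressible)."""
    return len(zlib.compress(data, level)) / len(data)


# Highly repetitive data, loosely resembling a text dump.
repetitive = b"INSERT INTO t VALUES (1, 'aaaa');\n" * 4096
# Random bytes as a stand-in for already-compressed or encrypted content.
random_like = os.urandom(len(repetitive))

ratio_repetitive = compression_ratio(repetitive)
ratio_random = compression_ratio(random_like)
```

The random buffer comes out at roughly its original size, so the compressor spends CPU time on it for no gain, which matches the kind of data we saw the problem with.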
