nvoss

We went the route of copying all the VMs, deleting the old ones, and starting up the new copies. Snapshots were all working after the machines were created.

This weekend my fulls ran and 4 of the machines continue to have VDI chain issues. The other 4 back up correctly now.

Any other thoughts we might pursue? The machines still failing don't seem to have anything in common: Windows 2022, Windows 11, and Linux are all in the mix, and all 3 of my hosts are in play. Two of the failing VMs are on one host and the other two are split between the other two hosts. All hosts are fully up-to-date patch-wise, as is XO.

The machines DO back up right now if we use the force restart button in the XO backup interface, at least for now. That worked before for a few months and then also stopped working. When we do the force restart, it does create a snapshot at that time. During the failed scheduled backup, the snapshot is not created.

Somehow, despite these being brand new copied VMs, I see the same things in the SMlog file: what looks like an undo leaf-coalesce but no other real specifics. Attached is a segment of that file where at least 3 of those undo coalesce errors show.

Anything else we can troubleshoot with?

Thanks,
Nick

SMLogExerpt.txt

nvoss

@olivierlambert Unfortunately, even with the VM shut down, the migration from NAS to local storage fails with the same SR_BACKEND_FAILURE_109 error.

At this point we're trying to copy the VHD, destroy the VM, and re-create it with the copied VHD on at least one machine.

But it definitely appears to be more than one VM with the problem, unless the one machine causing the problem is preventing coalescing on all the other machines too somehow. The previously identified machine is the one I'm starting with.

nvoss

@dthenot the plot thickens!

If we try to migrate storage on one of these machines we get SR_BACKEND_FAILURE_109 with snapshot chain being too long. So coalescing is definitely not working right.

Do we have decent options to clean it up? The machines themselves are working fine. Manual snapshots are fine, but even force run of the backup job results in these failures too.

  "result": {
    "code": "SR_BACKEND_FAILURE_109",
    "params": [
      "",
      "The snapshot chain is too long",
      ""
    ],
    "task": {
      "uuid": "71dd7f49-e92c-6e46-8f4c-07bdd39a4795",
      "name_label": "Async.VDI.pool_migrate",
      "name_description": "",
      "allowed_operations": [],
      "current_operations": {},
      "created": "20250612T15:27:02Z",
      "finished": "20250612T15:27:14Z",
      "status": "failure",
      "resident_on": "OpaqueRef:c1462347-d4ae-5392-bd01-3a5c165ed80c",
      "progress": 1,
      "type": "<none/>",
      "result": "",
      "error_info": [
        "SR_BACKEND_FAILURE_109",
        "",
        "The snapshot chain is too long",
        ""
      ],
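
For anyone who wants to pull the error code and message out of a saved log like this, here is a minimal Python sketch. It assumes the full log is valid JSON and that the `result` object has the `code` and `params` fields shown in the excerpt above; `summarize_failure` is just an illustrative name, not part of XO.

```python
def summarize_failure(result):
    """Return (code, message) from a result dict shaped like the excerpt above.

    Assumption: "params" is a list of strings in which the only non-empty
    entry is the human-readable message, as in the excerpt.
    """
    code = result.get("code", "")
    # Take the first non-empty string in params; fall back to "" if none.
    message = next((p for p in result.get("params", []) if p), "")
    return code, message
```

Calling it on the `result` object from the excerpt should give back the SR_BACKEND_FAILURE_109 code together with the "snapshot chain is too long" text.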

nvoss

@dthenot

I was able to track down that particular UUID to the machine in question and we took it out of the backup routine. Unfortunately the issue persists and the backups continue to fail their regularly scheduled runs (in this case the delta).

There again though if we use the force restart command in the XO GUI on the backup job then it runs fine -- both full and delta are able to take snapshots, record backup data, and transmit said backup to both remotes.

I'm at a bit of a loss where to go next. My best guess is to try to migrate storage on all the VMs and migrate it back and see if that fixes it, but since removing the one VM didn't fix the issue I'm afraid it may be all of them that are impacting the backup job.

nvoss

@dthenot sorry, I'm not sure why the "force restart" button works for both our full and our delta backups when the regular scheduled backup jobs don't. Doing the force restart lets the job run fully each time, regardless of which specific machine may have the bad/corrupt disk. That's the orange button.

And a manual snapshot works on all machines I believe too?

Is there a smooth way to track that VHD disk GUID back to its machine in the interface?

nvoss

@dthenot every one of our VMs reports this same error on a scheduled backup. Does that mean every one has this problem?

I'm not sure how it would've happened? It seems like the problem started after doing a rolling update to the 3 hosts about 2 months back.

I'm also not super clear on what the batmap is -- just a shade out of my depth!

Appreciate all the suggestions though. Happy to try stuff. Migrating the VD to local storage and back to the NAS, etc?

What would make the force restart work when the regular scheduled runs don't?

nvoss

@dthenot sure, here you go!

nvoss

@dthenot when I grep looking for coalesce I don't see any errors. Everything is the undo message.

Looking at the line labeled 3680769, which in this case corresponds with one of those undos, I see lock opens, a variety of what look like successful mounts, and subsequent snapshot activity, then the undo at the end. After the undo message I see something not super helpful.

Attached is that entire region. Below an excerpt.

It's definitely confusing as to why a force restart of the job works when the regular run doesn't.

Errored Coalesce.txt

nvoss

@olivierlambert Sure I can try.

I can confirm now though that both my full and my delta jobs fail on every single VM with the "Job canceled to protect the VDI chain" error.

If we do a standard restart then it fails the same way. If we use the "force restart" option then it does work properly and backups seem to finish without issue.

The remote configuration is brand new, with encrypted remotes and the multiple data block option selected. The backup job itself is not new; it's been in place for about a year. The job uses VM tags to determine which VMs to back up. The full is a weekly run with 6 retained backups, and it goes to both the external and local remotes. The delta only goes to the local Synology and is set with 14 retained backups.

The storage for the VMs is on a Synology NAS. The VMs live on one of 3 hosts with similar vintage hardware.

Per the backup troubleshooting article:
cat /var/log/SMlog | grep -i exception : no results
cat /var/log/SMlog | grep -i error : no results
grep -i coales /var/log/SMlog : lots of messages that say "UNDO LEAF-COALESCE"

The host I ran those commands on is the one which houses the Xen Orchestra VM (whose backup also fails).
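
In case it helps anyone repeat the same check across several hosts, here is a rough Python sketch that tallies the three searches in one pass. It assumes the log lives at /var/log/SMlog as in the commands above; the function names are made up for illustration, and the counts are only as meaningful as the greps they mirror.

```python
import re
from collections import Counter

# Patterns mirror the three greps above; case-insensitive like `grep -i`.
PATTERNS = {
    "exception": re.compile(r"exception", re.IGNORECASE),
    "error": re.compile(r"error", re.IGNORECASE),
    "coales": re.compile(r"coales", re.IGNORECASE),
}


def tally_smlog(lines):
    """Count how many lines match each pattern (one line may match several)."""
    counts = Counter({name: 0 for name in PATTERNS})
    for line in lines:
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
    return dict(counts)


def tally_file(path="/var/log/SMlog"):
    """Run the tally over a log file; the default path is an assumption."""
    # errors="replace" so a stray non-UTF-8 byte does not abort the scan.
    with open(path, errors="replace") as handle:
        return tally_smlog(handle)
```

Running `print(tally_file())` on each host gives one small dictionary per host that is easy to compare side by side.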

The Synology backup remote has 10TB assigned to it with 8.7TB free. The VDI disk volume has 5.4TB of 10TB free.

Status on the hosts patch-wise shows 6 patches are needed currently, though they were up-to-date last week.

XO is on commit 9ed55.

Other specifics I can provide?

Thanks!
Nick

nvoss

Hi All!

We were having problems with our backup remotes not working (on-site Synology, off-site Wasabi) with VDI chain issues. We checked the logs per the referenced article and didn't find anything obvious. We noticed we were using an older encryption method with the encrypted remotes, so we decided to purge the remotes, set them up fresh, and off we go.

Fast forward a week. Full backups went ok on a forced full. We'll see this weekend if it goes well automatically. However deltas continue to fail when run on schedule with the new remotes. I figured for sure that new backups, new snapshots, etc. wouldn't have a coalescing issue. What we've found so far though is "force run" results in a successful backup for the deltas too.

At a bit of a loss on troubleshooting. Anyone else seeing this? Both remotes are encrypted.

Thanks!
Nick
