XCP-ng

    Posts by nvoss

    • RE: VDI Chain on Deltas

      @olivierlambert @dthenot

      We went the route of copying all the VMs, deleting the old ones, and starting up the new versions. Snapshots were all working after the machines were created.

      This weekend my fulls ran and 4 of the machines continue to have VDI chain issues. The other 4 back up correctly now.

      Any other thoughts we might pursue? The machines still failing don't seem to have anything consistent about them: Windows Server 2022, Windows 11, and Linux are all in the mix. All 3 of my hosts are in play -- 2 of the failing machines are on one host and the other 2 are split across the remaining two hosts. All hosts are fully up to date patch-wise, as is XO.

      The machines DO back up right now if we use the force restart button in the XO backup interface -- for now. That worked for a few months previously and then also stopped working. When we do the force restart, it does create a snapshot at that time; during the failed scheduled backup, the snapshot is not created.

      Somehow, despite these being brand-new copied VMs, I see the same things in the SMlog file (attached is a segment of that file where at least 3 of those undo-coalesce entries show up): it looks like an undo leaf-coalesce, but with no other real specifics.

      Anything else we can troubleshoot with?

      Thanks,
      Nick

      SMLogExerpt.txt

      posted in Backup
    • RE: VDI Chain on Deltas

      @olivierlambert Unfortunately, even with the VM shut down, the migration from NAS to local storage fails with the same SR_BACKEND_FAILURE_109 error.

      At this point we're trying to copy the VHD, destroy the VM, and re-create it with the copied VHD, on at least one machine.

      But it definitely appears that more than one VM has the problem, unless the one machine causing it is somehow preventing coalescing on all the other machines too. The previously identified machine is the one I'm starting with.

      posted in Backup
    • RE: VDI Chain on Deltas

      @dthenot the plot thickens!

      If we try to migrate storage on one of these machines we get SR_BACKEND_FAILURE_109 with snapshot chain being too long. So coalescing is definitely not working right.

      Do we have decent options to clean it up? The machines themselves are working fine. Manual snapshots are fine, but even a force run of the backup job results in these failures too.


        "result": {
          "code": "SR_BACKEND_FAILURE_109",
          "params": [
            "",
            "The snapshot chain is too long",
            ""
          ],
          "task": {
            "uuid": "71dd7f49-e92c-6e46-8f4c-07bdd39a4795",
            "name_label": "Async.VDI.pool_migrate",
            "name_description": "",
            "allowed_operations": [],
            "current_operations": {},
            "created": "20250612T15:27:02Z",
            "finished": "20250612T15:27:14Z",
            "status": "failure",
            "resident_on": "OpaqueRef:c1462347-d4ae-5392-bd01-3a5c165ed80c",
            "progress": 1,
            "type": "<none/>",
            "result": "",
            "error_info": [
              "SR_BACKEND_FAILURE_109",
              "",
              "The snapshot chain is too long",
              ""
            ],
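
      (For reference, the next thing I plan to check on the host is the actual chain depth of the affected VDIs, and then prod the GC by rescanning the SR. This is only a rough sketch on my part -- it assumes a file-based NFS SR mounted under /run/sr-mount/<sr-uuid>, and <sr-uuid> / <vdi-uuid> are placeholders rather than our real UUIDs.)

        # Print the chain depth and the parent of a given VHD (file-based SR assumed)
        vhd-util query -d -n /run/sr-mount/<sr-uuid>/<vdi-uuid>.vhd
        vhd-util query -p -n /run/sr-mount/<sr-uuid>/<vdi-uuid>.vhd

        # Rescan the SR, which should also kick off the garbage collector / coalesce
        xe sr-scan uuid=<sr-uuid>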
      
      posted in Backup
    • RE: VDI Chain on Deltas

      @dthenot

      I was able to track that particular UUID down to the machine in question, and we took it out of the backup routine. Unfortunately the issue persists and the backups continue to fail on their regularly scheduled runs (in this case the delta).

      Once again, though, if we use the force restart command on the backup job in the XO GUI, it runs fine -- both full and delta are able to take snapshots, record backup data, and transmit that backup to both remotes.

      I'm at a bit of a loss as to where to go next. My best guess is to migrate storage on all the VMs and then migrate it back to see if that fixes it (a rough sketch of the per-VDI commands is below), but since removing the one VM didn't fix the issue, I'm afraid all of them may be impacting the backup job.
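
      If I do go that route, I believe the per-VDI CLI equivalent is roughly the following (just a sketch on my part -- <vdi-uuid> and the SR UUIDs are placeholders, and as far as I know the VM needs to be running for a live pool migrate):

        # Move one VDI to another SR, then back again (placeholder UUIDs)
        xe vdi-pool-migrate uuid=<vdi-uuid> sr-uuid=<local-sr-uuid>
        xe vdi-pool-migrate uuid=<vdi-uuid> sr-uuid=<original-nas-sr-uuid>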

      posted in Backup
    • RE: VDI Chain on Deltas

      @dthenot sorry, I'm not sure why the "force restart" button option (the orange button) works for both our full and our delta backups when the regularly scheduled backup jobs don't -- doing the force restart lets the job run fully each time, regardless of the specific machine that may have the bad/corrupt disk.

      And a manual snapshot works on all machines I believe too?

      Is there a smooth way to track that VHD disk GUID back to its machine in the interface?
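
      (From the CLI, at least, I think this would map a VDI UUID back to its VM via the attached VBDs -- sketch only, with <vdi-uuid> as a placeholder:)

        # List the VBD(s) referencing the VDI and show which VM they belong to
        xe vbd-list vdi-uuid=<vdi-uuid> params=vm-name-label,vm-uuid,device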

      posted in Backup
    • RE: VDI Chain on Deltas

      @dthenot every one of our VMs reports this same error on a scheduled backup. Does that mean every one of them has this problem?

      I'm not sure how it would've happened. It seems like the problem started after a rolling update of the 3 hosts about 2 months back.

      I'm also not super clear on what the batmap is 🙂 -- just a shade out of my depth!

      Appreciate all the suggestions though. Happy to try stuff -- migrating the VDI to local storage and back to the NAS, etc.?

      What would make the force restart work when the regularly scheduled runs don't?

      posted in Backup
    • RE: VDI Chain on Deltas

      @dthenot sure, here you go!

      0b424abf-68ef-4fb3-a8d3-b81001f0f314-image.png

      posted in Backup
    • RE: VDI Chain on Deltas

      @dthenot when I grep for coalesce I don't see any errors. Everything is the undo message.

      Looking at the line labeled 3680769, which in this case corresponds to one of those undos, I see lock opens, a variety of what look like successful mounts, and subsequent snapshot activity, then at the end the undo. After the undo message I don't see anything particularly helpful.

      Attached is that entire region; below is an excerpt.

      887f1018-e6f0-4722-a6e9-324c08ecd9a2-image.png

      It's definitely confusing why a force run of the job works when the regular run doesn't.

      Errored Coalesce.txt

      posted in Backup
    • RE: VDI Chain on Deltas

      @olivierlambert Sure I can try.

      I can confirm now, though, that both my full and my delta jobs fail on every single VM with the "Job canceled to protect the VDI chain" error.

      30ff134e-b88f-40a6-9d81-d46e806f12e3-image.png

      If we do a standard restart then it fails the same way. If we use the "force restart" option then it does work properly and backups seem to finish without issue.

      The remote configuration is brand new, with encrypted remotes and the multiple-data-block option selected. The backup job itself is not new; it's been in place for about a year. The job uses VM tags to determine which VMs to back up. The full is a weekly run with 6 retained backups and goes to both the external and the local remote. The delta only goes to the local Synology and is set to 14 retained backups.

      The storage for the VMs is on a Synology NAS. The VMs live on one of 3 hosts with similar vintage hardware.

      Per the backup troubleshooting article:
      cat /var/log/SMlog | grep -i exception : no results
      cat /var/log/SMlog | grep -i error : no results
      grep -i coales /var/log/SMlog : lots of messages that say "UNDO LEAF-COALESCE"

      b73ccb49-d5db-4da6-8137-48e5a5f98245-image.png
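
      If more context around those undo messages would help, I can pull the surrounding lines with a plain grep (nothing fancy, just context flags):

        # Show 5 lines before and 20 lines after each undo message in the current SMlog
        grep -i -B 5 -A 20 "UNDO LEAF-COALESCE" /var/log/SMlog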

      The host I ran those commands on is the one which houses the Xen Orchestra VM (whose backup also fails).

      The Synology backup remote has 10TB assigned to it with 8.7TB free. The VDI disk volume has 5.4TB of 10TB free.

      Status on the hosts patch-wise shows 6 patches are needed currently, though they were up-to-date last week.

      XO is on commit 9ed55.

      Other specifics I can provide?

      Thanks!
      Nick

      posted in Backup
    • VDI Chain on Deltas

      Hi All!

      We were having problems with our backup remotes (on-site Synology, off-site Wasabi) failing with VDI chain issues. We checked the logs per the referenced article and didn't find anything obvious. We noticed we were using an older encryption method on the encrypted remotes, so we decided to purge the remotes, set them up fresh, and off we went.

      Fast forward a week: full backups went OK on a forced full, and we'll see this weekend whether they go well automatically. However, deltas continue to fail when run on schedule with the new remotes. I figured for sure that new backups, new snapshots, etc. wouldn't have a coalescing issue. What we've found so far, though, is that "force run" results in a successful backup for the deltas too.

      At a bit of a loss on troubleshooting. Anyone else seeing this? Both remotes are encrypted.

      Thanks!
      Nick

      posted in Backup
    • RE: Backup Fail: Trying to add data in unsupported state

      @olivierlambert @florent

      Of note in our case: we use S3-compatible Wasabi as the remote in one case and a Synology NAS as our local remote in the other. Both of those remotes fail with the unsupported-state error when the backups are encrypted.

      In the same encrypted job I have the following machines, with their backup sizes and durations:

      VM1 - 31.55GB - 47 mins
      VM2 - 14.51GB - 22 mins
      VM3 - 30.28GB - 48 mins
      VM4 - 45.33GB - 24 mins
      VM5 - FAIL - 1hr 27 min
      VM6 - 2.14GB - 4 mins
      VM7 - FAIL - 1hr 28 min
      VM8 - 35.95GB - 1hr 5 min

      The two machines erroring have thin-provisioned disks whose sizes are:
      VM5 -- 128GB and 100GB, which are 10.94GB and 86MB on disk
      VM7 -- 123GB and 128GB, which are 11.09GB and 10.3MB on disk

      At first I thought it was related to size, or perhaps to duration. But what's causing the extra duration for machines of these sizes? Something about activity on the Windows VMs?

      Or perhaps it's related to having multiple disks on the Windows machines?

      posted in Backup
    • RE: Backup Fail: Trying to add data in unsupported state

      @olivierlambert yeah, my experience is also that deltas run without error, though what exactly they're backing up without a full in the remote is pretty questionable. I assume it's a delta off of the snapshot full, where the snapshot completes without issue and it's just the copy to the encrypted remote that's failing.

      These are definitely my larger VMs -- >100GB total disk.

      posted in Backup
    • RE: Backup Fail: Trying to add data in unsupported state

      @olivierlambert I'm using XO built from sources. Current commit level: 88c7c. I pulled those updates down ~2 days ago I think.

      posted in Backup
    • RE: Backup Fail: Trying to add data in unsupported state

      @olivierlambert that's my experience as well, but it's inconsistent in that I have about 6 other VMs on the same backup that work fine to the encrypted repos. In this case I have two remotes -- a local Synology and a remote S3-compatible Wasabi. Both show the same failure, but only with these 2 VMs.

      posted in Backup
    • RE: Backup Fail: Trying to add data in unsupported state

      @olivierlambert @florent happy to provide logs or troubleshoot any way that makes sense. The issue definitely seemed to crop up after setting up the encrypted repos, but that's not to say it's the culprit by any stretch -- it could easily be coincidental.

      posted in Backup
    • Backup Fail: Trying to add data in unsupported state

      We've got a host of machines backed up via tag selection. Most of them are fine, but during a full backup I have two that fail every time. The error is "Trying to add data in unsupported state".

      These machines happen to have more than 1 disk associated with them, whereas all of our other VMs have only a single OS drive. That's the only potential difference I've spotted.

      The stack trace in the JSON log has some more info, but I'm not sure where to start with it:

      Error: Trying to add data in unsupported state
          at Cipheriv.update (node:internal/crypto/cipher:181:29)
          at /opt/xo/xo-builds/xen-orchestra-202408260910/@xen-orchestra/fs/dist/_encryptor.js:52:22
          at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
          at async pumpToNode (node:internal/streams/pipeline:135:22)
      

      The machines are a Windows Server 2022 VM and a Windows 11 VM.

      The final error is "all targets have failed, step: writer.run()".

      I've tried running the backup on one of these machines while it was shut down, and it hit the same ultimate error. The Windows management tools are installed on both machines.

      Anyone seen this before and had success at resolution?

      Thanks!
      Nick

      ETA: these machines are set to be on an encrypted backup. Other machines on the same backup work fine. CBT is off.

      posted in Backup