Backup Fail: Trying to add data in unsupported state

nmadunich

@olivierlambert Yes, I get the same result on both the stable and latest channels.

@florent The ones that fail do seem to be some of my larger VMs. The Windows 10 VM that I have been testing with has about 88.3 GB used according to the OS.

All of my VMs are thin provisioned and our NetApp storage uses de-duplication, so the size of the VHD on my storage is significantly smaller; in this case it was about 3 GB.

As a test I created a new storage volume without thin provisioning and de-duplication. I migrated the disk to the new volume and the VHD is 103 GB. I also removed de-duplication and compression on my remote. I tried the backup again and it failed with the same error.

I do see some errors from the xensource.log around the time it fails and I attached those here.

xensource errors.txt

I am editing my post after looking at the log file @Delgado posted; mine are slightly different, so I added mine for comparison. At some point during my testing the error also changed slightly and started stating "VDI must be free or attached to exactly one VM". It appears that after a failed backup it's not cleaning up the snapshots.

2024-08-29T21_32_42.161Z - backup NG.json.txt

2024-08-28T15_05_12.613Z - backup NG.json.txt

Delgado

Hello,

My VMs are about 150 GB each. I was using compression when I backed up the VM to the remote before mirroring it to the S3 bucket. I did end up changing to delta backups and the error did go away, but I can create another normal backup and mirror it to the bucket again to see if I get the same results.

olivierlambert

That's interesting. So it's only with full backups (XVAs) then.

nvoss

@olivierlambert Yeah, my experience is also that deltas run without error. Though what exactly they're backing up without a full in the remote is pretty questionable. I assume it's a delta off of the snapshot full, where the snapshot is completed without issue and it's just the copy to the encrypted remote that's failing.

These are definitely my larger VMs: more than 100 GB total disk.

daniel.grimm

Hi,

Same problem here.
It's an encrypted S3 remote to Backblaze.
Full mirror backup with selected VMs.
A small VM like Xen Orchestra works. As soon as a large VM is added (approx. 500 GB), the error occurs after about 3 hours.
Tried several times.

Xen Orchestra built from source.

olivierlambert

And with or without backup compression?

daniel.grimm

Sorry, I forgot:
with zstd compression.

olivierlambert

Can you try without it and report?

daniel.grimm

Yes.

I am making a non-compressed backup now.
Then I will try to mirror it with the mirroring job.

Report follows, but the uncompressed backup and upload will need some time.

daniel.grimm

So, same error after 3 hours of uploading/mirroring an uncompressed backup to the encrypted Backblaze remote.

transfer
Start: 2024-09-17 07:27
End: 2024-09-17 10:33
Duration: 3 hours
Error: Trying to add data in unsupported state

olivierlambert

I have the feeling it might be related to Backblaze and a potential timeout or something

daniel.grimm

Before this error, I had the following error:

transfer
        Start: 2024-09-11 15:14
        End: 2024-09-11 16:07
        Duration: an hour
        Error: no tomes available
    Start: 2024-09-11 15:14
    End: 2024-09-11 16:07
    Duration: an hour
    Error: no tomes available

Start: 2024-09-11 15:14
End: 2024-09-11 16:07
Duration: an hour
Error: no tomes available
Type: full

I was able to fix this by giving the Xen Orchestra VM more RAM.
I thought these errors were triggered by some kind of timeout.

When the current error first occurred, I doubled the RAM again. Unfortunately, that didn't help.

olivierlambert

Error: no tomes available

Never heard of this before.

olivierlambert

It seems to come from Backblaze, e.g. https://github.com/mastodon/mastodon/issues/30030

Sadly, I'm not sure this is something we are able to fix on our side.

olivierlambert

It might be related to Backblaze being overloaded at some point. Our advice:

reduce backup concurrency
reduce block concurrency during upload (writeblockConcurrency) and merge (mergeBlockConcurrency) in the config.toml

nvoss

@olivierlambert @florent

Of note from our side: we use Wasabi (S3-compatible) as the remote in one case and a Synology NAS as our local remote in the other. Both of those remotes fail with the unsupported state error when the backups are encrypted.

In the same encrypted job I have the following machines which have a backup size and duration of:

VM1 - 31.55GB - 47 mins
VM2 - 14.51GB - 22 mins
VM3 - 30.28GB - 48 mins
VM4 - 45.33GB - 24 mins
VM5 - FAIL - 1hr 27 min
VM6 - 2.14GB - 4 mins
VM7 - FAIL - 1hr 28 min
VM8 - 35.95GB - 1hr 5 min

The two machines that fail have thin provisioned disks whose sizes are:
VM5: 128 GB and 100 GB, which are 10.94 GB and 86 MB on disk
VM7: 123 GB and 128 GB, which are 11.09 GB and 10.3 MB on disk

At first I thought it was size related or perhaps duration. But what's causing that extra duration for machines of these sizes? Something about activity on the Windows VMs?

Or perhaps that it was related to having multiple disks on Windows machines?
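One rough way to check whether duration simply tracks backup size is to compute the average throughput of the VMs that succeeded. The snippet below is only a sketch: the sizes and durations are copied from the list above, "1hr 5 min" is written as 65 minutes, and the `throughput` helper is a hypothetical name introduced for illustration. The failed VMs are left out because no backup size was reported for them.

```javascript
// Sizes (GB) and durations (minutes) copied from the job results listed above.
// Failed VMs (VM5, VM7) are omitted: no backup size was reported for them.
const succeeded = [
  { vm: 'VM1', sizeGB: 31.55, minutes: 47 },
  { vm: 'VM2', sizeGB: 14.51, minutes: 22 },
  { vm: 'VM3', sizeGB: 30.28, minutes: 48 },
  { vm: 'VM4', sizeGB: 45.33, minutes: 24 },
  { vm: 'VM6', sizeGB: 2.14, minutes: 4 },
  { vm: 'VM8', sizeGB: 35.95, minutes: 65 },
]

// Hypothetical helper: average throughput in GB per minute.
function throughput({ sizeGB, minutes }) {
  return sizeGB / minutes
}

for (const entry of succeeded) {
  console.log(entry.vm, throughput(entry).toFixed(2), 'GB/min')
}
```

If the rates differ a lot between VMs of similar size, duration on its own is probably not a reliable indicator of what makes VM5 and VM7 fail.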

olivierlambert

It might be a different problem (zstd compression failing on the host) vs a problem with the S3 provider. That's why I'd like to separate the two things.

daniel.grimm

@olivierlambert said in Backup Fail: Trying to add data in unsupported state:

It might be related to BackBlaze being overloaded at some point. Our advice:

reduce backup concurrency

reduce block concurrency during upload (writeblockConcurrency) and merge (mergeBlockConcurrency) in the config.toml

Yesterday I reduced writeblockConcurrency to 12 and started the backup.
Same error. I will try some other values.

Here is the error message from the orchestra.log file:

2024-09-18T09:50:27.217Z xo:backups:worker INFO starting backup
2024-09-18T12:57:16.979Z xo:backups:worker WARN possibly unhandled rejection {
  error: Error: Trying to add data in unsupported state
      at Cipheriv.update (node:internal/crypto/cipher:186:29)
      at /root/git-down/xen-orchestra/@xen-orchestra/fs/dist/_encryptor.js:52:22
      at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
      at async pumpToNode (node:internal/streams/pipeline:135:22)
}
2024-09-18T12:57:21.817Z xo:backups:AbstractVmRunner WARN writer step failed {
  error: Error: Trying to add data in unsupported state
      at Cipheriv.update (node:internal/crypto/cipher:186:29)
      at /root/git-down/xen-orchestra/@xen-orchestra/fs/dist/_encryptor.js:52:22
      at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
      at async pumpToNode (node:internal/streams/pipeline:135:22),
  step: 'writer.run()',
  writer: 'FullRemoteWriter'
}
2024-09-18T12:57:22.065Z xo:backups:worker INFO backup has ended
2024-09-18T12:57:22.076Z xo:backups:worker INFO process will exit {
  duration: 11214858233,
  exitCode: 0,
  resourceUsage: {
    userCPUTime: 1092931109,
    systemCPUTime: 108325008,
    maxRSS: 404280,
    sharedMemorySize: 0,
    unsharedDataSize: 0,
    unsharedStackSize: 0,
    minorPageFault: 2966382,
    majorPageFault: 2,
    swappedOut: 0,
    fsRead: 134218296,
    fsWrite: 0,
    ipcSent: 0,
    ipcReceived: 0,
    signalsCount: 0,
    voluntaryContextSwitches: 2662776,
    involuntaryContextSwitches: 1238267
  },
  summary: { duration: '3h', cpuUsage: '11%', memoryUsage: '394.8 MiB' }
}
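The stack trace shows the message being raised by `Cipheriv.update` in Node's crypto module, called from the encryptor. As a point of reference only, here is a minimal sketch of one known way to get a message like this from Node.js in isolation: calling `update()` on a cipher after `final()` has already been called. This is not a diagnosis of the backup failure, and other conditions inside the cipher may lead to the same message. The algorithm, key, IV and the `updateAfterFinal` helper are arbitrary choices for illustration and are not taken from Xen Orchestra.

```javascript
const crypto = require('node:crypto')

// Hypothetical helper (not part of Xen Orchestra): returns the message of the
// error thrown when update() is called on an already finalized cipher.
function updateAfterFinal() {
  // Algorithm, key and IV are arbitrary choices for this illustration.
  const key = crypto.randomBytes(32)
  const iv = crypto.randomBytes(12)
  const cipher = crypto.createCipheriv('aes-256-gcm', key, iv)

  cipher.update(Buffer.from('first chunk'))
  cipher.final() // after this call the cipher no longer accepts data

  try {
    cipher.update(Buffer.from('late chunk'))
    return null // only reached if update() unexpectedly succeeds
  } catch (error) {
    return error.message
  }
}

console.log(updateAfterFinal())
```

Comparing the printed message with the one in the log can help confirm that the failure originates in the cipher state rather than in the S3 client.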

daniel.grimm

I have now tested several times with several different values, but I get the same result with every attempt. The error occurs after about 3 hours.

And I don't think it's a Backblaze bug.

For testing purposes, I installed a local Minio server and added it as an encrypted remote in Xen Orchestra.
The same error occurs there, every time after about 12 to 13 minutes.

My test job (full mirroring with selected VMs) contains 2 VM backups. The small one (XO, about 7 GB) is mirrored correctly on the first try to both remotes (Minio and Backblaze).
The error occurs after about the same amount of time every time when mirroring the large VM (I tried various large VM backups).

Then I created another Minio remote (another bucket) without encryption and ran the same backup mirror job to the unencrypted remote.
This time, it went through without any errors.

So it must be a bug related to S3 remotes, large VMs, full mirroring and encryption!

olivierlambert

I'd love to see if you have the same error with AWS S3, because that would tremendously help to debug.