VDI Chain on Deltas
-
@olivierlambert Sure I can try.
I can confirm now that both my full and my delta jobs fail with every single VM on the "Job canceled to protect the VDI chain" error.
If we do a standard restart then it fails the same way. If we use the "force restart" option then it does work properly and backups seem to finish without issue.
The remote configuration is brand new: encrypted remotes with the multiple data block option selected. The backup job itself is not new; it's been in place for about a year and uses VM tags to determine which VMs to back up. The full runs weekly with 6 retained backups and goes to both the external and the local remote. The delta only goes to the local Synology and keeps 14 retained backups.
The storage for the VMs is on a Synology NAS. The VMs live on one of 3 hosts with similar vintage hardware.
Per the backup troubleshooting article:
cat /var/log/SMlog | grep -i exception : no results
cat /var/log/SMlog | grep -i error : no results
grep -i coales /var/log/SMlog : lots of messages that say "UNDO LEAF-COALESCE"
The host I ran those commands on is the one which houses the Xen Orchestra VM (whose backup also fails).
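In case it's useful, something like this should pull the lines just above each undo so whatever triggered it is visible (the context count is a guess, adjust as needed):
grep -i -B 40 "UNDO LEAF-COALESCE" /var/log/SMlog | less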
The Synology backup remote has 10TB assigned to it with 8.7TB free. The VDI disk volume has 5.4TB of 10TB free.
Status on the hosts patch-wise shows 6 patches are needed currently, though they were up-to-date last week.
XO is on commit 9ed55.
Other specifics I can provide?
Thanks!
Nick -
@nvoss Hello. The
UNDO LEAF-COALESCE
message usually has its cause listed in the error just above it. Could you share that part please? -
@dthenot when I grep for coalesce I don't see any errors; everything is the undo message.
Looking at the line labeled 3680769, which corresponds with one of those undos, I see lock opens, a variety of what look like successful mounts, and subsequent snapshot activity, then the undo at the end. After the undo message I see something not super helpful.
Attached is that entire region. Below is an excerpt.
It's definitely confusing why a force on the job works when the regular run doesn't.
-
@nvoss Could you try to run
vhd-util check -n /var/run/sr-mount/f23aacc2-d566-7dc6-c9b0-bc56c749e056/3a3e915f-c903-4434-a2f0-cfc89bbe96bf.vhd
? -
@dthenot sure, here you go!
-
@nvoss The VHD is reported as corrupted in its BATMAP. You can try to repair it with
vhd-util repair
but it'll likely not work.
I have seen people recover from this kind of error by doing a vdi-copy.
You could try a VM copy, or a VDI copy and then link the new VDI to the VM again, and see if it's alright.
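Roughly, the xe/vhd-util route could look like this (the path and UUIDs are the ones from the check above, the bracketed values are placeholders; shut the VM down first, and there is no guarantee the repair works):
# try the repair first; it may well fail on a bad BATMAP
vhd-util repair -n /var/run/sr-mount/f23aacc2-d566-7dc6-c9b0-bc56c749e056/3a3e915f-c903-4434-a2f0-cfc89bbe96bf.vhd
# otherwise, copy the VDI and re-attach the copy:
# find the VBD linking the VM to the corrupted VDI (note its device and bootable values)
xe vbd-list vdi-uuid=3a3e915f-c903-4434-a2f0-cfc89bbe96bf
# copy the VDI (prints the new VDI UUID); the target SR can be the same one or another
xe vdi-copy uuid=3a3e915f-c903-4434-a2f0-cfc89bbe96bf sr-uuid=<target-sr-uuid>
# detach the old disk and attach the copy in its place
xe vbd-destroy uuid=<old-vbd-uuid>
xe vbd-create vm-uuid=<vm-uuid> vdi-uuid=<new-vdi-uuid> device=<device> bootable=<true|false> mode=RW type=Disk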
The corrupted VDI is blocking the garbage collector, so the chains are long, and that's the error you see on the XO side.
You might need to remove the chain by hand to resolve the issue. -
@dthenot every one of our VMs reports this same error on a scheduled backup. Does that mean every one has this problem?
I'm not sure how it would've happened. The problem seems to have started after a rolling update of the 3 hosts about 2 months back.
I'm also not super clear on what the batmap is
-- just a shade out of my depth!
Appreciate all the suggestions though. Happy to try stuff. Migrating the VDI to local storage and back to the NAS, etc.?
What would make the force restart work when the scheduled regular runs don't?
-
@nvoss No, the GC is blocked because only one VDI is corrupted, the one we ran the check on.
All the other VDIs are on long chains because they couldn't coalesce.
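If you want to double-check that it really is just the one, you could run the same check on every VHD in the SR, something like this (read-only; the path is the SR mount from the check above):
for vhd in /var/run/sr-mount/f23aacc2-d566-7dc6-c9b0-bc56c749e056/*.vhd; do
    echo "== $vhd"
    vhd-util check -n "$vhd"
done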
Sorry, the BATMAP is the block allocation table map; it's the VHD metadata that tracks which blocks exist locally.
Migrating the VDI might indeed work, I can't really be sure. -
@nvoss said in VDI Chain on Deltas:
What would make the force restart work when the scheduled regular runs dont?
I'm not sure what you mean.
The backup needs to do a snapshot to have a point of comparison before exporting data.
This snapshot creates a new level of VHD that will need to be coalesced, but XO limits the number of VHDs in a chain, so the job fails.
This is caused by the fact that the garbage collector can't run, because it can't edit the corrupted VDI.
Since there is a corrupted VDI, it stops running so it doesn't create more problems on the VDI chains.
Sometimes corruption means we don't know if a VHD has a parent, for example, and in that case we can't know what the chain looks like, i.e. which VHDs belong to which chain in the SR (Storage Repository).
VDI: Virtual Disk Image in this context.
VHD: the format of VDI we use at the moment in XCP-ng.
After removing the corrupted VDI (maybe the migration process will do it automatically, maybe you'll have to do it by hand), you can run a
sr-scan
on the SR and it will launch the GC again.
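Assuming the SR UUID is the one from the vhd-util path above, that would be:
xe sr-scan uuid=f23aacc2-d566-7dc6-c9b0-bc56c749e056
-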
@dthenot sorry, I'm not sure why the "force restart" button (the orange one) lets both our full and our delta backups run fully each time, regardless of which specific machine may have the bad/corrupt disk, while the regular scheduled jobs fail.
And a manual snapshot works on all machines I believe too?
Is there a smooth way to track that VHD disk UUID back to its machine in the interface?
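From the CLI I'd guess something like this would map the VDI back to its VM (the UUID being the .vhd filename from the check above), but an XO-side way would be nicer:
xe vbd-list vdi-uuid=3a3e915f-c903-4434-a2f0-cfc89bbe96bf params=vm-name-label,vm-uuid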