CBT: the thread to centralize your feedback
-
Another test on a different pool seems to yield the same result:
Create a VM and add it to a backup job using CBT with snapshot deletion. Run the backup job to generate the .cbtlog file.
After first backup run:
[08:22 xcpng-test-01 45e457aa-16f8-41e0-d03d-8201e69638be]# cbt-util get -c -n dfa26980-edc4-4127-a032-cfd99226a5b8.cbtlog
adde7aaf-6b13-498a-b0e3-f756a57b2e78
Next, take a snapshot of the VM using Xen Orchestra from the Snapshot tab and check the CBT log file again. It now references the newly created snapshot:
[08:27 xcpng-test-01 45e457aa-16f8-41e0-d03d-8201e69638be]# cbt-util get -c -n dfa26980-edc4-4127-a032-cfd99226a5b8.cbtlog
994174ef-c579-44e6-bc61-240fb996867e
Remove the manually created snapshot and check the CBT log file once more. It has been corrupted (the reference is now all zeros):
[08:27 xcpng-test-01 45e457aa-16f8-41e0-d03d-8201e69638be]# cbt-util get -c -n dfa26980-edc4-4127-a032-cfd99226a5b8.cbtlog
00000000-0000-0000-0000-000000000000
So far I can make this happen on two different pools. It would be helpful if anyone else could confirm this.
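For anyone who wants to run the same check on their own pool, here is a rough Python sketch of the comparison I am doing by hand. It assumes `cbt-util get -c -n <file>` prints only the UUID on stdout, as in the output above. The helper names (`read_child_uuid`, `is_cbt_reset`) are made up for illustration and are not part of any XCP-ng tooling.

```python
import subprocess
import uuid

# The all-zero UUID seen in the .cbtlog after the manual snapshot is removed.
NULL_UUID = uuid.UUID(int=0)


def read_child_uuid(cbtlog_path: str) -> str:
    """Run the same cbt-util command used in the steps above.

    Must be run on the host, with a path to the .cbtlog file in the SR.
    Assumption: stdout contains only the UUID.
    """
    result = subprocess.run(
        ["cbt-util", "get", "-c", "-n", cbtlog_path],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def is_cbt_reset(output: str) -> bool:
    """Return True if the UUID printed by cbt-util is all zeros."""
    return uuid.UUID(output.strip()) == NULL_UUID
```

Calling `is_cbt_reset(read_child_uuid("<uuid>.cbtlog"))` before and after deleting the manual snapshot should show whether the reference was reset on your setup.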
-
3rd update. This appears to happen on our test pool (TrueNAS NFS), our DR pool (Pure Storage NFS) and our production pool (Pure Storage NFS).
After more testing today, this seems to occur on shared NFS SRs where multiple hosts are connected. Using local EXT storage, I do not see this behaviour.
If there's any debugging I could enable to help get to the bottom of this, let me know. Or if someone else can confirm this happens to them, we can rule out something in my environments.
-
Do you think it confirms what we said with @psafont? (Sorry, I don't have time right now to review all your excellent feedback in detail, so I'm going straight to the point.) If yes, then it's up to us to get a PR in XAPI to fix it.
-
@olivierlambert This seems to be something different, as I don't need to migrate a VM for this to happen. Simply have a VM with CBT enabled and a .cbtlog file present for it, then create a regular snapshot of the VM. Upon deleting that manually created snapshot, the CBT data is reset.
It happens to me on shared NFS SRs. I have not been able to make it happen on a local EXT SR, but I now have this happening across 3 different pools using NFS SRs. There is a ton of info in my posts in an attempt to be as detailed as possible! Happy to help any way I can!
-
Hmm indeed, so it could be that the CBT reset code is catching far too many cases where it shouldn't (so it might still be related to the initial problem).
-
@olivierlambert That could for sure be the case. For now we are using NBD without "Purge snapshot data" enabled, and it's been very reliable, but we are hoping to keep chipping away at these issues so we can one day enable this on our production VMs. If there is any testing you need me to do, just let me know, as we have a decent test environment set up where we prototype these things before deploying for real.