CBT: the thread to centralize your feedback
-
@flakpyro good that the root cause has been found, hope it is something that can be fixed by the XOA team.
I can confirm that the CBT backups are running as smoothly as they did on VMware! Good job all!
-
@flakpyro we are discussing with XAPI upstream; it won't be trivial, but I think we know where to work. A CBT reset shouldn't occur if the VDI doesn't change SRs, but I think any call to VM migrate send resets it. And since there's no way to pool migrate on a dedicated network…
-
@olivierlambert I don't think the first was ever meant to support that. Without knowing how much effort it will take, I'm leaning towards the second option: not resetting the CBT.
-
Got it, we'll try to provide a PR that adds a check to NOT reset CBT if there's no VDI migration involved, and the problem will be "solved".
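To illustrate the idea only: the real change would land in XAPI itself (which is OCaml), so the sketch below is purely hypothetical pseudologic in Python with made-up names, not the actual patch.

# Hypothetical sketch, not the actual XAPI code.
def should_reset_cbt(source_sr, destination_sr):
    # Only invalidate CBT when the VDI actually moves to a different SR.
    # A VM.migrate_send used purely to pick a migration network, where every
    # VDI stays on its current SR, should leave the CBT metadata alone.
    return destination_sr is not None and destination_sr != source_sr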
-
MASSIVE EDIT AFTER FURTHER TESTING
So I have found another issue in my testing with CBT.
If I have a VM running CBT backups with snapshot deletion enabled, and I remove the pool setting that specifies a migration network, everything appears fine and the CBT data does not reset due to a migration.
However, if I take a manual snapshot of a VM and then remove that snapshot, I find the CBT data sometimes resets itself:
SM log shows:
[15:53 xcpng-test-01 45e457aa-16f8-41e0-d03d-8201e69638be]# grep -A 5 -B 5 -i exception /var/log/SMlog
Jan 28 11:55:00 xcpng-test-01 SMGC: [2041921] In cleanup
Jan 28 11:55:00 xcpng-test-01 SMGC: [2041921] SR 9330 ('Syn-TestLab-DS1') (0 VDIs in 0 VHD trees): no changes
Jan 28 11:55:00 xcpng-test-01 SM: [2041921] lock: closed /var/lock/sm/93308f90-1fcd-873b-292f-4a34dde2bfea/running
Jan 28 11:55:00 xcpng-test-01 SM: [2041921] lock: closed /var/lock/sm/93308f90-1fcd-873b-292f-4a34dde2bfea/gc_active
Jan 28 11:55:00 xcpng-test-01 SM: [2041921] lock: closed /var/lock/sm/93308f90-1fcd-873b-292f-4a34dde2bfea/sr
Jan 28 11:55:09 xcpng-test-01 SM: [2041073] ***** sr_scan: EXCEPTION <class 'util.CommandException'>, Input/output error
Jan 28 11:55:09 xcpng-test-01 SM: [2041073] File "/opt/xensource/sm/SRCommand.py", line 111, in run
Jan 28 11:55:09 xcpng-test-01 SM: [2041073] return self._run_locked(sr)
Jan 28 11:55:09 xcpng-test-01 SM: [2041073] File "/opt/xensource/sm/SRCommand.py", line 161, in _run_locked
Jan 28 11:55:09 xcpng-test-01 SM: [2041073] rv = self._run(sr, target)
Jan 28 11:55:09 xcpng-test-01 SM: [2041073] File "/opt/xensource/sm/SRCommand.py", line 370, in _run
Jan 28 11:55:09 xcpng-test-01 SM: [2041073] return sr.scan(self.params['sr_uuid'])
Jan 28 11:55:09 xcpng-test-01 SM: [2041073] File "/opt/xensource/sm/ISOSR", line 594, in scan
Jan 28 11:55:09 xcpng-test-01 SM: [2041073] if not util.isdir(self.path):
Jan 28 11:55:09 xcpng-test-01 SM: [2041073] File "/opt/xensource/sm/util.py", line 542, in isdir
Jan 28 11:55:09 xcpng-test-01 SM: [2041073] raise CommandException(errno.EIO, "os.stat(%s)" % path, "failed")
Jan 28 11:55:09 xcpng-test-01 SM: [2041073]
Jan 28 11:55:09 xcpng-test-01 SM: [2041073] Raising exception [40, The SR scan failed [opterr=Command os.stat(/var/run/sr-mount/d00054f9-e6a2-162f-f734-1c6c02541722) failed (failed): Input/output error]]
Jan 28 11:55:09 xcpng-test-01 SM: [2041073] ***** ISO: EXCEPTION <class 'xs_errors.SROSError'>, The SR scan failed [opterr=Command os.stat(/var/run/sr-mount/d00054f9-e6a2-162f-f734-1c6c02541722) failed (failed): Input/output error]
Jan 28 11:55:09 xcpng-test-01 SM: [2041073] File "/opt/xensource/sm/SRCommand.py", line 385, in run
Jan 28 11:55:09 xcpng-test-01 SM: [2041073] ret = cmd.run(sr)
Jan 28 11:55:09 xcpng-test-01 SM: [2041073] File "/opt/xensource/sm/SRCommand.py", line 121, in run
Jan 28 11:55:09 xcpng-test-01 SM: [2041073] raise xs_errors.XenError(excType, opterr=msg)
Jan 28 11:55:09 xcpng-test-01 SM: [2041073]
--
Jan 28 14:41:58 xcpng-test-01 SM: [2181235] lock: opening lock file /var/lock/sm/58242a5a-0a6f-4e4e-bada-8331ed32eae4/cbtlog
Jan 28 14:41:58 xcpng-test-01 SM: [2181235] lock: acquired /var/lock/sm/58242a5a-0a6f-4e4e-bada-8331ed32eae4/cbtlog
Jan 28 14:41:58 xcpng-test-01 SM: [2181235] ['/usr/sbin/cbt-util', 'get', '-n', '/var/run/sr-mount/45e457aa-16f8-41e0-d03d-8201e69638be/58242a5a-0a6f-4e4e-bada-8331ed32eae4.cbtlog', '-c']
Jan 28 14:41:58 xcpng-test-01 SM: [2181235] pread SUCCESS
Jan 28 14:41:58 xcpng-test-01 SM: [2181235] lock: released /var/lock/sm/58242a5a-0a6f-4e4e-bada-8331ed32eae4/cbtlog
Jan 28 14:41:58 xcpng-test-01 SM: [2181235] Raising exception [460, Failed to calculate changed blocks for given VDIs. [opterr=Source and target VDI are unrelated]]
Jan 28 14:41:58 xcpng-test-01 SM: [2181235] ***** generic exception: vdi_list_changed_blocks: EXCEPTION <class 'xs_errors.SROSError'>, Failed to calculate changed blocks for given VDIs. [opterr=Source and target VDI are unrelated]
Jan 28 14:41:58 xcpng-test-01 SM: [2181235] File "/opt/xensource/sm/SRCommand.py", line 111, in run
Jan 28 14:41:58 xcpng-test-01 SM: [2181235] return self._run_locked(sr)
Jan 28 14:41:58 xcpng-test-01 SM: [2181235] File "/opt/xensource/sm/SRCommand.py", line 161, in _run_locked
Jan 28 14:41:58 xcpng-test-01 SM: [2181235] rv = self._run(sr, target)
Jan 28 14:41:58 xcpng-test-01 SM: [2181235] File "/opt/xensource/sm/SRCommand.py", line 326, in _run
Jan 28 14:41:58 xcpng-test-01 SM: [2181235] return target.list_changed_blocks()
Jan 28 14:41:58 xcpng-test-01 SM: [2181235] File "/opt/xensource/sm/VDI.py", line 759, in list_changed_blocks
Jan 28 14:41:58 xcpng-test-01 SM: [2181235] "Source and target VDI are unrelated")
Jan 28 14:41:58 xcpng-test-01 SM: [2181235]
Jan 28 14:41:58 xcpng-test-01 SM: [2181235] ***** NFS VHD: EXCEPTION <class 'xs_errors.SROSError'>, Failed to calculate changed blocks for given VDIs. [opterr=Source and target VDI are unrelated]
Jan 28 14:41:58 xcpng-test-01 SM: [2181235] File "/opt/xensource/sm/SRCommand.py", line 385, in run
Jan 28 14:41:58 xcpng-test-01 SM: [2181235] ret = cmd.run(sr)
Jan 28 14:41:58 xcpng-test-01 SM: [2181235] File "/opt/xensource/sm/SRCommand.py", line 111, in run
Jan 28 14:41:58 xcpng-test-01 SM: [2181235] return self._run_locked(sr)
Jan 28 14:41:58 xcpng-test-01 SM: [2181235] File "/opt/xensource/sm/SRCommand.py", line 161, in _run_locked
--
Jan 28 15:53:18 xcpng-test-01 SMGC: [2250532] Removed leaf-coalesce from fe6e3edd(100.000G/7.483M?)
Jan 28 15:53:18 xcpng-test-01 SMGC: [2250532] *~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*
Jan 28 15:53:18 xcpng-test-01 SMGC: [2250532] ***********************
Jan 28 15:53:18 xcpng-test-01 SMGC: [2250532] * E X C E P T I O N *
Jan 28 15:53:18 xcpng-test-01 SMGC: [2250532] ***********************
Jan 28 15:53:18 xcpng-test-01 SMGC: [2250532] leaf-coalesce: EXCEPTION <class 'util.SMException'>, VDI fe6e3edd-4d63-4005-b0f3-932f5f34e036 could not be coalesced
Jan 28 15:53:18 xcpng-test-01 SMGC: [2250532] File "/opt/xensource/sm/cleanup.py", line 2098, in coalesceLeaf
Jan 28 15:53:18 xcpng-test-01 SMGC: [2250532] self._coalesceLeaf(vdi)
Jan 28 15:53:18 xcpng-test-01 SMGC: [2250532] File "/opt/xensource/sm/cleanup.py", line 2380, in _coalesceLeaf
Jan 28 15:53:18 xcpng-test-01 SMGC: [2250532] .format(uuid=vdi.uuid))
Jan 28 15:53:18 xcpng-test-01 SMGC: [2250532]
I have been able to once again reproduce this multiple times.
Steps to reproduce:
-
Set up a backup job, enable CBT with snapshot removal, and run the first full backup.
-
Take a manual snapshot of the VM. After a few minutes, remove the snapshot and let the GC run to completion.
-
Run the same backup job again. For me this usually results in an unexpected full backup, with the exceptions above being dumped to the SM log.
-
Afterwards, all subsequent backups go back to being deltas and CBT works fine again, unless I take another manual snapshot. (A scripted version of the snapshot step is sketched below.)
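In case it helps anyone reproduce this, here is a rough sketch of the manual snapshot create/delete step using the XenAPI Python bindings. The host, credentials and VM name are placeholders for my test setup, and the snapshot is removed by destroying its disk VDIs and then the snapshot object, which is my understanding of what deleting a snapshot from XO amounts to.

import XenAPI

# Placeholder host and credentials for my test pool.
session = XenAPI.Session("https://xcpng-test-01")
session.xenapi.login_with_password("root", "password")

try:
    vm = session.xenapi.VM.get_by_name_label("cbt-test-vm")[0]  # placeholder VM name

    # Take a manual snapshot of the VM (roughly what the Snapshot tab triggers).
    snap = session.xenapi.VM.snapshot(vm, "manual-cbt-test")

    # ... wait a few minutes, then remove the snapshot: destroy its disk VDIs
    # first, then the snapshot object itself, and let the GC run to completion
    # before running the next backup.
    for vbd in session.xenapi.VM.get_VBDs(snap):
        if session.xenapi.VBD.get_type(vbd) == "Disk":
            session.xenapi.VDI.destroy(session.xenapi.VBD.get_VDI(vbd))
    session.xenapi.VM.destroy(snap)
finally:
    session.xenapi.session.logout()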
Is anyone else able to reproduce this?
Edit 2: Here is an example of what I am running into.
After the initial backup job runs:
[23:27 xcpng-test-01 45e457aa-16f8-41e0-d03d-8201e69638be]# cbt-util get -c -n 2be6b6ec-9308-4e63-9975-19259108eba2.cbtlog
adde7aaf-6b13-498a-b0e3-f756a57b2e78
After taking a manual snapshot, the CBT log reference changes as expected:
[23:27 xcpng-test-01 45e457aa-16f8-41e0-d03d-8201e69638be]# cbt-util get -c -n 2be6b6ec-9308-4e63-9975-19259108eba2.cbtlog
b6e33794-120a-4a95-b035-af64c6605ee2
After removing the manual snapshot:
[23:29 xcpng-test-01 45e457aa-16f8-41e0-d03d-8201e69638be]# cbt-util get -c -n 2be6b6ec-9308-4e63-9975-19259108eba2.cbtlog
00000000-0000-0000-0000-000000000000
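To avoid checking each file by hand, something like the small helper below can scan an SR mount and flag any CBT log whose UUID (as reported by the same cbt-util command I used above) has gone to all zeroes. This is just a sketch; the SR mount path is from my test pool.

import glob
import os
import subprocess

# SR mount path from my test pool; adjust for yours.
SR_MOUNT = "/var/run/sr-mount/45e457aa-16f8-41e0-d03d-8201e69638be"
ZERO_UUID = "00000000-0000-0000-0000-000000000000"

for log in sorted(glob.glob(os.path.join(SR_MOUNT, "*.cbtlog"))):
    # Same command as above: print the UUID recorded in the CBT log.
    uuid = subprocess.check_output(
        ["cbt-util", "get", "-c", "-n", log], universal_newlines=True
    ).strip()
    state = "RESET" if uuid == ZERO_UUID else "ok"
    print("%s: %s (%s)" % (os.path.basename(log), uuid, state))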
-
Another test on a different pool seems to yield the same result:
Create a VM and add it to a backup job using CBT with snapshot deletion. Run the backup job to generate the .cbtlog file.
After the first backup run:
[08:22 xcpng-test-01 45e457aa-16f8-41e0-d03d-8201e69638be]# cbt-util get -c -n dfa26980-edc4-4127-a032-cfd99226a5b8.cbtlog
adde7aaf-6b13-498a-b0e3-f756a57b2e78
Next, take a snapshot of the VM from the Snapshot tab in Xen Orchestra and check the CBT log file again; it now references the newly created snapshot:
[08:27 xcpng-test-01 45e457aa-16f8-41e0-d03d-8201e69638be]# cbt-util get -c -n dfa26980-edc4-4127-a032-cfd99226a5b8.cbtlog
994174ef-c579-44e6-bc61-240fb996867e
Remove the manually created snapshot, check the CBT log file again, and find that it has been corrupted:
[08:27 xcpng-test-01 45e457aa-16f8-41e0-d03d-8201e69638be]# cbt-util get -c -n dfa26980-edc4-4127-a032-cfd99226a5b8.cbtlog
00000000-0000-0000-0000-000000000000
So far I can make this happen on two different pools. It would be helpful if anyone else could confirm this.
-
3rd update: this appears to happen on our test pool using NFS (TrueNAS NFS), our DR pool (Pure Storage NFS), and our production pool (Pure Storage NFS).
Testing more today, this seems to occur on shared NFS SRs where multiple hosts are connected; using local EXT storage I do not see this behaviour.
Let me know if there is any debug logging I could enable to help get to the bottom of this. Alternatively, if someone else can confirm this happens to them, we can rule out something specific to my environments.
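In case it helps anyone narrow this down on their side, here is a quick way to list the SR types in a pool and whether they are shared, again via the XenAPI Python bindings; host and credentials are placeholders.

import XenAPI

session = XenAPI.Session("https://xcpng-test-01")  # placeholder host
session.xenapi.login_with_password("root", "password")  # placeholder credentials

try:
    for ref, sr in session.xenapi.SR.get_all_records().items():
        # Only the disk SR types are interesting here (skip ISO/udev SRs).
        if sr["type"] in ("nfs", "ext", "lvm", "lvmoiscsi", "lvmohba"):
            print("%s: type=%s shared=%s" % (sr["name_label"], sr["type"], sr["shared"]))
finally:
    session.xenapi.session.logout()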
-
Do you think it confirms what we discussed with @psafont? (Sorry, I don't have time right now to review all your excellent feedback in detail, so I'm going straight to the point.) If yes, then it's up to us to get a PR into XAPI to fix it.
-
@olivierlambert This seems to be something different, as I don't need to migrate a VM for this to happen. Simply have a VM with CBT enabled and a .cbtlog file present for it, then create a regular snapshot of the VM. Upon deleting that manually created snapshot, the CBT data is reset.
It happens to me on shared NFS SRs; I have not been able to make it happen on a local EXT SR, but I have this happening across 3 different pools using NFS SRs now. There is a ton of info in my posts, in an attempt to be as detailed as possible. Happy to help any way I can!
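One way I might be able to confirm the chain is broken without waiting for the next backup run would be to ask XAPI for the changed blocks directly, which is the call that raises the 460 error in the SM log above. This is only a sketch: host, credentials and the two VDI UUIDs are placeholders (the CBT reference VDI from the previous backup and the snapshot VDI taken afterwards).

import XenAPI

session = XenAPI.Session("https://xcpng-test-01")  # placeholder host
session.xenapi.login_with_password("root", "password")  # placeholder credentials

try:
    # Placeholder UUIDs: the CBT reference VDI from the previous backup
    # and the snapshot VDI taken for the next run.
    vdi_from = session.xenapi.VDI.get_by_uuid("<previous-reference-vdi-uuid>")
    vdi_to = session.xenapi.VDI.get_by_uuid("<new-snapshot-vdi-uuid>")

    # Returns a base64-encoded bitmap while the CBT chain is intact; raises
    # "Source and target VDI are unrelated" once the CBT log has been reset.
    bitmap = session.xenapi.VDI.list_changed_blocks(vdi_from, vdi_to)
    print("CBT chain intact, bitmap length: %d" % len(bitmap))
except XenAPI.Failure as err:
    print("CBT chain broken: %s" % err.details)
finally:
    session.xenapi.session.logout()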
-
Hmm, indeed. It could be that the CBT reset code is catching far too many cases where it shouldn't (so it might still be related to the initial problem).
-
@olivierlambert that could for sure be the case. We are using NBD now without "Purge snapshot data" enabled for now, and it has been very reliable, but I'm hoping to keep chipping away at these issues so we can one day enable this on our production VMs. If there is any testing you need me to do, just let me know, as we have a decent test environment set up where we prototype these things before deploying for real.