CBT: the thread to centralize your feedback

flakpyro

@olivierlambert So today I installed the latest round of updates on the test pool, which moved all the VMs back and forth during a rolling pool update. I then let everything sit for a couple of hours, then ran the backup job, and this time it did not throw any errors. So that's even more confusing.

Perhaps it's because I am kicking off a backup job immediately after migrating the VMs? As a test I am going to move them around again now, wait an hour, then attempt to run the job.

Edit: Waiting did not seem to help. Running the job manually again resulted in a full backup being run again, with the same messages:
Can't do delta with this vdi, transfer will be a full
Can't do delta, will try to get a full stream

flakpyro

I also double-checked that the VM UUID does not change after the migration.

SMLog output on the test pool looks the same as on the production pool after a manual VM migration:

Nov 15 13:59:40 xcpng-test-01 SM: [277865] lock: opening lock file /var/lock/sm/8b0ee29e-7cbe-4e15-bd13-330a974fde2a/cbtlog
Nov 15 13:59:40 xcpng-test-01 SM: [277865] lock: acquired /var/lock/sm/8b0ee29e-7cbe-4e15-bd13-330a974fde2a/cbtlog
Nov 15 13:59:40 xcpng-test-01 SM: [277865] ['/usr/sbin/cbt-util', 'get', '-n', '/var/run/sr-mount/45e457aa-16f8-41e0-d03d-8201e69638be/8b0ee29e-7cbe-4e15-bd13-330a974fde2a.cbtlog', '-c']
Nov 15 13:59:40 xcpng-test-01 SM: [277865]   pread SUCCESS
Nov 15 13:59:40 xcpng-test-01 SM: [277865] lock: released /var/lock/sm/8b0ee29e-7cbe-4e15-bd13-330a974fde2a/cbtlog
Nov 15 13:59:40 xcpng-test-01 SM: [277865] Raising exception [460, Failed to calculate changed blocks for given VDIs. [opterr=Source and target VDI are unrelated]]
Nov 15 13:59:40 xcpng-test-01 SM: [277865] ***** generic exception: vdi_list_changed_blocks: EXCEPTION <class 'xs_errors.SROSError'>, Failed to calculate changed blocks for given VDIs. [opterr=Source and target VDI are unrelated]
Nov 15 13:59:40 xcpng-test-01 SM: [277865]   File "/opt/xensource/sm/SRCommand.py", line 111, in run
Nov 15 13:59:40 xcpng-test-01 SM: [277865]     return self._run_locked(sr)
Nov 15 13:59:40 xcpng-test-01 SM: [277865]   File "/opt/xensource/sm/SRCommand.py", line 161, in _run_locked
Nov 15 13:59:40 xcpng-test-01 SM: [277865]     rv = self._run(sr, target)
Nov 15 13:59:40 xcpng-test-01 SM: [277865]   File "/opt/xensource/sm/SRCommand.py", line 326, in _run
Nov 15 13:59:40 xcpng-test-01 SM: [277865]     return target.list_changed_blocks()
Nov 15 13:59:40 xcpng-test-01 SM: [277865]   File "/opt/xensource/sm/VDI.py", line 757, in list_changed_blocks
Nov 15 13:59:40 xcpng-test-01 SM: [277865]     "Source and target VDI are unrelated")
Nov 15 13:59:40 xcpng-test-01 SM: [277865]
Nov 15 13:59:40 xcpng-test-01 SM: [277865] ***** NFS VHD: EXCEPTION <class 'xs_errors.SROSError'>, Failed to calculate changed blocks for given VDIs. [opterr=Source and target VDI are unrelated]
Nov 15 13:59:40 xcpng-test-01 SM: [277865]   File "/opt/xensource/sm/SRCommand.py", line 385, in run
Nov 15 13:59:40 xcpng-test-01 SM: [277865]     ret = cmd.run(sr)
Nov 15 13:59:40 xcpng-test-01 SM: [277865]   File "/opt/xensource/sm/SRCommand.py", line 111, in run
Nov 15 13:59:40 xcpng-test-01 SM: [277865]     return self._run_locked(sr)
Nov 15 13:59:40 xcpng-test-01 SM: [277865]   File "/opt/xensource/sm/SRCommand.py", line 161, in _run_locked
--
Nov 15 13:59:45 xcpng-test-01 SM: [278274] lock: opening lock file /var/lock/sm/fa7929aa-a39c-437d-9787-5218e9bcbc1a/cbtlog
Nov 15 13:59:45 xcpng-test-01 SM: [278274] lock: acquired /var/lock/sm/fa7929aa-a39c-437d-9787-5218e9bcbc1a/cbtlog
Nov 15 13:59:45 xcpng-test-01 SM: [278274] ['/usr/sbin/cbt-util', 'get', '-n', '/var/run/sr-mount/45e457aa-16f8-41e0-d03d-8201e69638be/fa7929aa-a39c-437d-9787-5218e9bcbc1a.cbtlog', '-c']
Nov 15 13:59:45 xcpng-test-01 SM: [278274]   pread SUCCESS
Nov 15 13:59:45 xcpng-test-01 SM: [278274] lock: released /var/lock/sm/fa7929aa-a39c-437d-9787-5218e9bcbc1a/cbtlog
Nov 15 13:59:45 xcpng-test-01 SM: [278274] Raising exception [460, Failed to calculate changed blocks for given VDIs. [opterr=Source and target VDI are unrelated]]
Nov 15 13:59:45 xcpng-test-01 SM: [278274] ***** generic exception: vdi_list_changed_blocks: EXCEPTION <class 'xs_errors.SROSError'>, Failed to calculate changed blocks for given VDIs. [opterr=Source and target VDI are unrelated]
Nov 15 13:59:45 xcpng-test-01 SM: [278274]   File "/opt/xensource/sm/SRCommand.py", line 111, in run
Nov 15 13:59:45 xcpng-test-01 SM: [278274]     return self._run_locked(sr)
Nov 15 13:59:45 xcpng-test-01 SM: [278274]   File "/opt/xensource/sm/SRCommand.py", line 161, in _run_locked
Nov 15 13:59:45 xcpng-test-01 SM: [278274]     rv = self._run(sr, target)
Nov 15 13:59:45 xcpng-test-01 SM: [278274]   File "/opt/xensource/sm/SRCommand.py", line 326, in _run
Nov 15 13:59:45 xcpng-test-01 SM: [278274]     return target.list_changed_blocks()
Nov 15 13:59:45 xcpng-test-01 SM: [278274]   File "/opt/xensource/sm/VDI.py", line 757, in list_changed_blocks
Nov 15 13:59:45 xcpng-test-01 SM: [278274]     "Source and target VDI are unrelated")
Nov 15 13:59:45 xcpng-test-01 SM: [278274]
Nov 15 13:59:45 xcpng-test-01 SM: [278274] ***** NFS VHD: EXCEPTION <class 'xs_errors.SROSError'>, Failed to calculate changed blocks for given VDIs. [opterr=Source and target VDI are unrelated]
Nov 15 13:59:45 xcpng-test-01 SM: [278274]   File "/opt/xensource/sm/SRCommand.py", line 385, in run
Nov 15 13:59:45 xcpng-test-01 SM: [278274]     ret = cmd.run(sr)
Nov 15 13:59:45 xcpng-test-01 SM: [278274]   File "/opt/xensource/sm/SRCommand.py", line 111, in run
Nov 15 13:59:45 xcpng-test-01 SM: [278274]     return self._run_locked(sr)
Nov 15 13:59:45 xcpng-test-01 SM: [278274]   File "/opt/xensource/sm/SRCommand.py", line 161, in _run_locked
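
For anyone wanting to pull the affected VDIs out of a longer log, here is a minimal Python sketch. It assumes the SM log lines keep the exact shape shown above (a `cbt-util get -n ... -c` call followed, in the same SM process ID, by exception 460 with "Source and target VDI are unrelated"). The function name `unrelated_vdis` and both regexes are illustrative assumptions, not part of SM or XO.

```python
import re

# Matches the cbt-util call SM logs before reading a CBT log file, e.g.
# ['/usr/sbin/cbt-util', 'get', '-n', '/var/run/sr-mount/<sr>/<vdi>.cbtlog', '-c']
CBT_GET = re.compile(
    r"SM: \[(\d+)\] \['/usr/sbin/cbt-util', 'get', '-n', "
    r"'[^']*/([0-9a-f-]{36})\.cbtlog'"
)
# Matches the exception raised when the CBT chain between two VDIs is broken.
UNRELATED = re.compile(
    r"SM: \[(\d+)\] Raising exception \[460, "
    r".*Source and target VDI are unrelated"
)


def unrelated_vdis(lines):
    """Return cbtlog UUIDs whose lookup was followed, in the same SM
    process, by the 'Source and target VDI are unrelated' exception."""
    last_cbtlog = {}  # SM process ID -> UUID of the last cbtlog it read
    failed = []       # UUIDs in order of first failure, without duplicates
    for line in lines:
        match = CBT_GET.search(line)
        if match:
            last_cbtlog[match.group(1)] = match.group(2)
            continue
        match = UNRELATED.search(line)
        if match and match.group(1) in last_cbtlog:
            uuid = last_cbtlog[match.group(1)]
            if uuid not in failed:
                failed.append(uuid)
    return failed
```

It takes any iterable of lines, so an open file handle on the host's SM log works as input (the path is usually `/var/log/SMlog`, but treat that as an assumption for your install). On the excerpt above it should report the two cbtlog UUIDs that appear in the `cbt-util` calls.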

flakpyro

@olivierlambert I'm making progress getting to the bottom of this, thanks to some documentation from XenServer about using cbt-util.

You can use the cbt-util utility, which helps establish chain relationship. If the VDI snapshots are not linked by changed block metadata, you get errors like “SR_BACKEND_FAILURE_460”, “Failed to calculate changed blocks for given VDIs”, and “Source and target VDI are unrelated”.

Example usage of cbt-util:

 cbt-util get -c -n <name of cbt log file>

The -c option prints the child log file UUID.

I cleared all CBT snapshots from my test VMs and ran a full backup on each VM. I then checked that the CBT chain was consistent using cbt-util; the output was:

[14:22 xcpng-test-01 45e457aa-16f8-41e0-d03d-8201e69638be]# cbt-util get -c -n 867063fc-4d86-420a-9ad2-dfe1749ecbc1.cbtlog 
1950d6a3-c6a9-4b0c-b79f-068dd44479cc

After the backup was complete, I migrated the VM to the second host in the pool and ran the same command from both hosts:

[14:26 xcpng-test-01 45e457aa-16f8-41e0-d03d-8201e69638be]# cbt-util get -c -n 867063fc-4d86-420a-9ad2-dfe1749ecbc1.cbtlog 
00000000-0000-0000-0000-000000000000

And from the second host:

[14:26 xcpng-test-02 45e457aa-16f8-41e0-d03d-8201e69638be]# cbt-util get -c -n 867063fc-4d86-420a-9ad2-dfe1749ecbc1.cbtlog 
00000000-0000-0000-0000-000000000000

That clearly is the problem right there. The question is, what is causing that to happen?

After running another full, the zeroed-out cbtlog file is removed and a new one is created, which works fine until the VM is migrated again:

[14:39 xcpng-test-01 45e457aa-16f8-41e0-d03d-8201e69638be]# cbt-util get -c -n 1eefb7bf-9dc3-4830-8352-441a77412576.cbtlog 
1950d6a3-c6a9-4b0c-b79f-068dd44479cc
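
The manual check above can be scripted. This is a minimal Python sketch that uses only the `cbt-util get -c -n` invocation shown in this post and treats an all-zero child UUID as a broken link, as I do above. The helper names `child_uuid` and `chain_is_broken` are hypothetical, and the sketch assumes `cbt-util` is on the PATH of the host it runs on.

```python
import subprocess

# Child UUID that cbt-util printed after the migration in the sessions above.
ZERO_UUID = "00000000-0000-0000-0000-000000000000"


def chain_is_broken(cbt_util_output):
    """Return True if `cbt-util get -c` printed an all-zero child UUID."""
    return cbt_util_output.strip() == ZERO_UUID


def child_uuid(cbtlog_path):
    """Run `cbt-util get -c -n <path>` and return the child UUID it prints.

    Raises subprocess.CalledProcessError if cbt-util exits with an error.
    """
    result = subprocess.run(
        ["cbt-util", "get", "-c", "-n", cbtlog_path],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout.strip()
```

Calling `chain_is_broken(child_uuid(path))` on a `.cbtlog` file before and after a migration would reproduce the comparison above. The parsing is kept separate from the subprocess call so it can be checked without a host.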

rtjdamen

@flakpyro I can't reproduce this on our end. After a migration within the pool on the same storage pool, the CBT is preserved. When I migrate to a different storage pool, the CBT is reset.

flakpyro

@rtjdamen Interesting, is this with iSCSI (block) or with an NFS SR?

rtjdamen

@flakpyro both scenarios

flakpyro

@rtjdamen Hmm, very strange.

The only thing I can think of is that this may be due to the fact that these VMs were imported from VMware.

Next week I can try creating a brand-new NFSv3 SR (since NFSv4 has caused issues in the past) as well as a new clean-install VM that was not imported from VMware, and see if the issue persists.

flakpyro

This is a completely different 5-host pool backed by a Pure Storage array with SRs mounted via NFSv3. Migrating a VM between hosts results in the same issue.

Before migration:
[01:41 xcpng-prd-03 b04d9910-8671-750f-050e-8b55c64fbede]# cbt-util get -c -n 83035854-b5a9-4f7e-869f-abe43ddc658d.cbtlog 
e28065ff-342f-4eae-a910-b91842dd39ca

After migration:
[01:41 xcpng-prd-03 b04d9910-8671-750f-050e-8b55c64fbede]# cbt-util get -c -n 83035854-b5a9-4f7e-869f-abe43ddc658d.cbtlog 
00000000-0000-0000-0000-000000000000

I don't think I have anything "custom" running that would be causing this, so I have no idea why this is happening, but it's happening on multiple pools for us.

rtjdamen

@flakpyro Is there any difference between migrating with the VM powered on and powered off?

rtjdamen

@flakpyro I have just tested live and offline migration on our end; both kept the CBT alive. Tested on both iSCSI and NFS.

flakpyro

@rtjdamen

Looks like it does this if the VM is powered off as well. I'm really not sure what else to try, since this is happening on 2 different pools for us.

I may end up needing to submit a ticket with Vates for them to get to the bottom of it.

rtjdamen

@flakpyro Are you running the latest XCP-ng version, 8.2 or 8.3?

flakpyro

@rtjdamen Both pools are on 8.3 with all the latest updates.
I did find this PR on GitHub (#8127, "fix(backups): handle slow enable cbt", opened by fbeauchamp, now closed) and wonder if it may be related: https://github.com/vatesfr/xen-orchestra/pull/8127 but I'm not sure why it would only happen after a migration...

rtjdamen

@flakpyro We are still on 8.2, so maybe there is some difference there.

olivierlambert

Thanks for the feedback @flakpyro, it shows it's not an XO issue. Something is not preserving CBT in your case where it should, and IDK why. But clearly, you have a way to test it easily, which is progress.

flakpyro

@olivierlambert So I guess the next thing we need to do is have someone also running 8.3 test this using an NFS SR?

florent

@flakpyro said in CBT: the thread to centralize your feedback:

This is a completely different 5-host pool backed by a Pure Storage array with SRs mounted via NFSv3. Migrating a VM between hosts results in the same issue.
Before migration:
[01:41 xcpng-prd-03 b04d9910-8671-750f-050e-8b55c64fbede]# cbt-util get -c -n 83035854-b5a9-4f7e-869f-abe43ddc658d.cbtlog 
e28065ff-342f-4eae-a910-b91842dd39ca

After migration:
[01:41 xcpng-prd-03 b04d9910-8671-750f-050e-8b55c64fbede]# cbt-util get -c -n 83035854-b5a9-4f7e-869f-abe43ddc658d.cbtlog 
00000000-0000-0000-0000-000000000000
I don't think I have anything "custom" running that would be causing this, so I have no idea why this is happening, but it's happening on multiple pools for us.

This is a very interesting clue, and we will investigate it with Damien.

There are a lot of edge cases that can happen (a lying network/drive/...), and most of the time XCP/XAPI are self-healing, but sometimes XO has to do a little work to clean up. The CBT should be reset correctly after a storage migration.
We'll add the async call to enable/disable CBT, since it could lead to a bogus state, and maybe a more in-depth cleaning of CBT after a "VDI not related" error.

flakpyro

@florent Thanks for looking into this, as we'd love to be able to use this feature. If you need me to test anything or provide any additional logs/info about our environment, let me know!

flakpyro

@florent Testing a storage migration, I do see CBT get disabled and reset during the process, which is expected! I do notice it leaves the .cbtlog file on the old SR after the storage migration is complete, but that's easy enough to clean up manually.

The issue I posted above, however, is just a VM migration from host to host on a shared NFS SR; the SR the VM is on is not changing.

Rhodderz

We appear to have a similar issue to @flakpyro.
We don't have NFS storage; we are using iSCSI from Dell SC5020s.
We had backups with NBD and CBT enabled.
We updated one of our pools to the latest (stable branch) yesterday to try to get rid of the iSCSI disconnecting bug, which meant all the VMs were shuffled around and migrated.
This morning the majority of the VMs failed the backup with "can't create a stream from a metadata VDI, fall back to a base".
Quick searching brought me here, and following what flak did, I found that one of the cbtlogs for one of the failed VMs is also zeroed, as shown below:

[09:40 xcp101 VG_XenStorage-6c2ec0ce-01ba-6975-741c-e2e86bc45e21]# cbt-util get -c -n cc2f2443-eb13-4eeb-951b-5faa3c7b8c55.cbtlog
00000000-0000-0000-0000-000000000000

We have enterprise support, with a ticket already open about NBD being slow (it was on 1 NBD connection) and a support tunnel open, which I will update as well.
Hopefully that gives you another point of reference to check from.

Is it possible to force a clean fresh start for the backups, similar to Veeam's "Active Full"?