XCP-ng

    CBT: the thread to centralize your feedback

      olivierlambert (Vates đŸȘ Co-Founder & CEO):

      I have no idea why you are the only one to have this issue, which is why it's weird 😄

        flakpyro @olivierlambert:

        @olivierlambert So today I installed the latest round of updates on the test pool, which moved all the VMs back and forth during a rolling pool update. I then let everything sit for a couple of hours and ran the backup job, and this time it did not throw any errors. So that's even more confusing.

        Perhaps it's because I am kicking off a backup job immediately after migrating the VMs? As a test I am going to move them around again now, wait an hour, then attempt to run the job.

        Edit: Waiting did not seem to help. Running the job manually again resulted in another full being run, with the same messages:
        Can't do delta with this vdi, transfer will be a full
        Can't do delta, will try to get a full stream

          flakpyro @flakpyro:

          SMLog output on the test pool looks the same as on the production pool after a manual VM migration. I also double-checked that the VM UUID does not change after the migration (a short xe sketch for checking the VDI UUIDs follows the log below):

          Nov 15 13:59:40 xcpng-test-01 SM: [277865] lock: opening lock file /var/lock/sm/8b0ee29e-7cbe-4e15-bd13-330a974fde2a/cbtlog
          Nov 15 13:59:40 xcpng-test-01 SM: [277865] lock: acquired /var/lock/sm/8b0ee29e-7cbe-4e15-bd13-330a974fde2a/cbtlog
          Nov 15 13:59:40 xcpng-test-01 SM: [277865] ['/usr/sbin/cbt-util', 'get', '-n', '/var/run/sr-mount/45e457aa-16f8-41e0-d03d-8201e69638be/8b0ee29e-7cbe-4e15-bd13-330a974fde2a.cbtlog', '-c']
          Nov 15 13:59:40 xcpng-test-01 SM: [277865]   pread SUCCESS
          Nov 15 13:59:40 xcpng-test-01 SM: [277865] lock: released /var/lock/sm/8b0ee29e-7cbe-4e15-bd13-330a974fde2a/cbtlog
          Nov 15 13:59:40 xcpng-test-01 SM: [277865] Raising exception [460, Failed to calculate changed blocks for given VDIs. [opterr=Source and target VDI are unrelated]]
          Nov 15 13:59:40 xcpng-test-01 SM: [277865] ***** generic exception: vdi_list_changed_blocks: EXCEPTION <class 'xs_errors.SROSError'>, Failed to calculate changed blocks for given VDIs. [opterr=Source and target VDI are unrelated]
          Nov 15 13:59:40 xcpng-test-01 SM: [277865]   File "/opt/xensource/sm/SRCommand.py", line 111, in run
          Nov 15 13:59:40 xcpng-test-01 SM: [277865]     return self._run_locked(sr)
          Nov 15 13:59:40 xcpng-test-01 SM: [277865]   File "/opt/xensource/sm/SRCommand.py", line 161, in _run_locked
          Nov 15 13:59:40 xcpng-test-01 SM: [277865]     rv = self._run(sr, target)
          Nov 15 13:59:40 xcpng-test-01 SM: [277865]   File "/opt/xensource/sm/SRCommand.py", line 326, in _run
          Nov 15 13:59:40 xcpng-test-01 SM: [277865]     return target.list_changed_blocks()
          Nov 15 13:59:40 xcpng-test-01 SM: [277865]   File "/opt/xensource/sm/VDI.py", line 757, in list_changed_blocks
          Nov 15 13:59:40 xcpng-test-01 SM: [277865]     "Source and target VDI are unrelated")
          Nov 15 13:59:40 xcpng-test-01 SM: [277865]
          Nov 15 13:59:40 xcpng-test-01 SM: [277865] ***** NFS VHD: EXCEPTION <class 'xs_errors.SROSError'>, Failed to calculate changed blocks for given VDIs. [opterr=Source and target VDI are unrelated]
          Nov 15 13:59:40 xcpng-test-01 SM: [277865]   File "/opt/xensource/sm/SRCommand.py", line 385, in run
          Nov 15 13:59:40 xcpng-test-01 SM: [277865]     ret = cmd.run(sr)
          Nov 15 13:59:40 xcpng-test-01 SM: [277865]   File "/opt/xensource/sm/SRCommand.py", line 111, in run
          Nov 15 13:59:40 xcpng-test-01 SM: [277865]     return self._run_locked(sr)
          Nov 15 13:59:40 xcpng-test-01 SM: [277865]   File "/opt/xensource/sm/SRCommand.py", line 161, in _run_locked
          --
          Nov 15 13:59:45 xcpng-test-01 SM: [278274] lock: opening lock file /var/lock/sm/fa7929aa-a39c-437d-9787-5218e9bcbc1a/cbtlog
          Nov 15 13:59:45 xcpng-test-01 SM: [278274] lock: acquired /var/lock/sm/fa7929aa-a39c-437d-9787-5218e9bcbc1a/cbtlog
          Nov 15 13:59:45 xcpng-test-01 SM: [278274] ['/usr/sbin/cbt-util', 'get', '-n', '/var/run/sr-mount/45e457aa-16f8-41e0-d03d-8201e69638be/fa7929aa-a39c-437d-9787-5218e9bcbc1a.cbtlog', '-c']
          Nov 15 13:59:45 xcpng-test-01 SM: [278274]   pread SUCCESS
          Nov 15 13:59:45 xcpng-test-01 SM: [278274] lock: released /var/lock/sm/fa7929aa-a39c-437d-9787-5218e9bcbc1a/cbtlog
          Nov 15 13:59:45 xcpng-test-01 SM: [278274] Raising exception [460, Failed to calculate changed blocks for given VDIs. [opterr=Source and target VDI are unrelated]]
          Nov 15 13:59:45 xcpng-test-01 SM: [278274] ***** generic exception: vdi_list_changed_blocks: EXCEPTION <class 'xs_errors.SROSError'>, Failed to calculate changed blocks for given VDIs. [opterr=Source and target VDI are unrelated]
          Nov 15 13:59:45 xcpng-test-01 SM: [278274]   File "/opt/xensource/sm/SRCommand.py", line 111, in run
          Nov 15 13:59:45 xcpng-test-01 SM: [278274]     return self._run_locked(sr)
          Nov 15 13:59:45 xcpng-test-01 SM: [278274]   File "/opt/xensource/sm/SRCommand.py", line 161, in _run_locked
          Nov 15 13:59:45 xcpng-test-01 SM: [278274]     rv = self._run(sr, target)
          Nov 15 13:59:45 xcpng-test-01 SM: [278274]   File "/opt/xensource/sm/SRCommand.py", line 326, in _run
          Nov 15 13:59:45 xcpng-test-01 SM: [278274]     return target.list_changed_blocks()
          Nov 15 13:59:45 xcpng-test-01 SM: [278274]   File "/opt/xensource/sm/VDI.py", line 757, in list_changed_blocks
          Nov 15 13:59:45 xcpng-test-01 SM: [278274]     "Source and target VDI are unrelated")
          Nov 15 13:59:45 xcpng-test-01 SM: [278274]
          Nov 15 13:59:45 xcpng-test-01 SM: [278274] ***** NFS VHD: EXCEPTION <class 'xs_errors.SROSError'>, Failed to calculate changed blocks for given VDIs. [opterr=Source and target VDI are unrelated]
          Nov 15 13:59:45 xcpng-test-01 SM: [278274]   File "/opt/xensource/sm/SRCommand.py", line 385, in run
          Nov 15 13:59:45 xcpng-test-01 SM: [278274]     ret = cmd.run(sr)
          Nov 15 13:59:45 xcpng-test-01 SM: [278274]   File "/opt/xensource/sm/SRCommand.py", line 111, in run
          Nov 15 13:59:45 xcpng-test-01 SM: [278274]     return self._run_locked(sr)
          Nov 15 13:59:45 xcpng-test-01 SM: [278274]   File "/opt/xensource/sm/SRCommand.py", line 161, in _run_locked
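
          As a side note on the UUID check above: one way to confirm that the VM's disks keep the same identities across the migration is to list the VDI UUIDs with xe before and after the move and compare. This is only a minimal sketch; "test-vm" is a placeholder name label and it assumes the standard xe CLI on a pool member with a unique VM name:

              # Print the VDI UUIDs of all disks attached to the VM; run before and after
              # the migration and diff the two outputs ("test-vm" is a placeholder and
              # must match exactly one VM).
              VM_UUID=$(xe vm-list name-label=test-vm --minimal)
              xe vbd-list vm-uuid="$VM_UUID" type=Disk params=vdi-uuid --minimal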
          
          
            flakpyro @flakpyro:

            @olivierlambert I'm making progress getting to the bottom of this, thanks to some documentation from XenServer about using cbt-util.

            You can use the cbt-util utility, which helps establish the chain relationship. If the VDI snapshots are not linked by changed block metadata, you get errors like “SR_BACKEND_FAILURE_460”, “Failed to calculate changed blocks for given VDIs”, and “Source and target VDI are unrelated”.

            Example usage of cbt-util:

                cbt-util get -c -n <name of cbt log file>

            The -c option prints the child log file UUID.
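
            To walk all the CBT chains on one SR at once, a rough sketch along these lines can help. It assumes cbt-util also accepts -p to print the parent log UUID (not shown in the quoted documentation) and that the cbtlog files sit in the SR mount point:

                # Rough sketch: dump parent/child UUIDs for every cbtlog on one SR.
                # Assumes -p (parent) exists alongside the documented -c (child) option;
                # <SR-UUID> is a placeholder.
                cd /var/run/sr-mount/<SR-UUID>
                for log in *.cbtlog; do
                    echo "== $log"
                    echo "  parent: $(cbt-util get -p -n "$log")"
                    echo "  child:  $(cbt-util get -c -n "$log")"
                done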
            

            I cleared all CBT snapshots from my test VMs and ran a full backup on each VM, then made sure the CBT chain was consistent using cbt-util. The output was:

            [14:22 xcpng-test-01 45e457aa-16f8-41e0-d03d-8201e69638be]# cbt-util get -c -n 867063fc-4d86-420a-9ad2-dfe1749ecbc1.cbtlog 
            1950d6a3-c6a9-4b0c-b79f-068dd44479cc
            
            

            After the backup was complete I migrated the VM to the second host in the pool and ran the same command from both hosts:

            [14:26 xcpng-test-01 45e457aa-16f8-41e0-d03d-8201e69638be]# cbt-util get -c -n 867063fc-4d86-420a-9ad2-dfe1749ecbc1.cbtlog 
            00000000-0000-0000-0000-000000000000
            
            

            And from the second host:

            [14:26 xcpng-test-02 45e457aa-16f8-41e0-d03d-8201e69638be]# cbt-util get -c -n 867063fc-4d86-420a-9ad2-dfe1749ecbc1.cbtlog 
            00000000-0000-0000-0000-000000000000
            
            

            That is clearly the problem right there; the question is what is causing it to happen.

            After running another full backup, the zeroed-out cbtlog file is removed and a new one is created, which works fine until the VM is migrated again:

            [14:39 xcpng-test-01 45e457aa-16f8-41e0-d03d-8201e69638be]# cbt-util get -c -n 1eefb7bf-9dc3-4830-8352-441a77412576.cbtlog 
            1950d6a3-c6a9-4b0c-b79f-068dd44479cc
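
            For anyone wanting to check an SR in bulk, a quick sketch along these lines would flag any cbtlog whose child UUID has been zeroed out as above (the SR path is the one from this pool; adjust as needed):

                # Rough sketch: list cbtlog files whose child UUID has been zeroed out.
                SR=/var/run/sr-mount/45e457aa-16f8-41e0-d03d-8201e69638be
                for log in "$SR"/*.cbtlog; do
                    child=$(cbt-util get -c -n "$log")
                    [ "$child" = "00000000-0000-0000-0000-000000000000" ] && echo "zeroed child: $log"
                done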
            
            
              rtjdamen @flakpyro:

              @flakpyro I can't reproduce this on our end: after migration within the pool on the same storage, the CBT is preserved. When I migrate to a different storage pool, the CBT is reset.

                flakpyro @rtjdamen:

                @rtjdamen Interesting. Is this with an iSCSI (block) SR or with an NFS SR?

                  rtjdamen @flakpyro:

                  @flakpyro Both scenarios.

                    flakpyro @rtjdamen:

                    @rtjdamen Hmm very strange.

                    The only thing I can think of is that this may be due to the fact that these VMs were imported from VMware.

                    Next week I can try creating a brand new NFSv3 SR (since NFSv4 has caused issues in the past) as well as a cleanly installed VM that was not imported from VMware, and see if the issue persists.
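
                    For reference, creating such an SR could look roughly like the sketch below; the server address and export path are placeholders, and it assumes the usual xe NFS SR parameters:

                        # Rough sketch of creating an NFSv3 SR; server and path are placeholders.
                        xe sr-create type=nfs shared=true content-type=user \
                            name-label="nfs3-test-sr" \
                            device-config:server=nas.example.com \
                            device-config:serverpath=/export/xcp-test \
                            device-config:nfsversion=3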

                      flakpyro @flakpyro:

                      This is a completely different five-host pool backed by a Pure Storage array with SRs mounted via NFSv3; migrating a VM between hosts results in the same issue.

                      Before migration:
                      [01:41 xcpng-prd-03 b04d9910-8671-750f-050e-8b55c64fbede]# cbt-util get -c -n 83035854-b5a9-4f7e-869f-abe43ddc658d.cbtlog 
                      e28065ff-342f-4eae-a910-b91842dd39ca
                      
                      After migration:
                      [01:41 xcpng-prd-03 b04d9910-8671-750f-050e-8b55c64fbede]# cbt-util get -c -n 83035854-b5a9-4f7e-869f-abe43ddc658d.cbtlog 
                      00000000-0000-0000-0000-000000000000
                      

                      I don't think I have anything "custom" running that would be causing this, so I have no idea why this is happening, but it's happening on multiple pools for us.

                        rtjdamen @flakpyro:

                        @flakpyro Is there any difference between migrating with the VM powered on and powered off?

                          rtjdamen @flakpyro:

                          @flakpyro I have just tested both live and offline migration on our end; both kept CBT alive. Tested on both iSCSI and NFS.

                            flakpyro @rtjdamen:

                            Looks like it does this if the VM is powered off as well. I'm really not sure what else to try, since this is happening on two different pools for us.

                            I may end up submitting a ticket with Vates so they can get to the bottom of it.

                              rtjdamen @flakpyro:

                              @flakpyro Are you running the latest XCP-ng version, 8.2 or 8.3?

                                flakpyro @rtjdamen:

                                @rtjdamen Both pools are on 8.3 with all the latest updates.
                                I did find this PR on GitHub and wonder if it may be related: https://github.com/vatesfr/xen-orchestra/pull/8127 but I'm not sure why it would only happen after a migration...

                                fbeauchamp opened this pull request in vatesfr/xen-orchestra: fix(backups): handle slow enable cbt #8127 (open)

                                  rtjdamen @flakpyro:

                                  @flakpyro We are still on 8.2, so maybe there is some difference there.

                                    olivierlambert (Vates đŸȘ Co-Founder & CEO):

                                    Thanks for the feedback @flakpyro, and it shows it's not an XO issue. Something is resetting CBT in your case when it shouldn't be, and I don't know why. But clearly, you have a way to test it easily, which is progress 🙂

                                      flakpyro @olivierlambert:

                                      @olivierlambert So I guess the next thing we need is to have someone else running 8.3 test this using an NFS SR?

                                        florent (Vates đŸȘ XO Team) @flakpyro:

                                        @flakpyro said in CBT: the thread to centralize your feedback:

                                        This is a completely different five-host pool backed by a Pure Storage array with SRs mounted via NFSv3; migrating a VM between hosts results in the same issue.

                                        Before migration:
                                        [01:41 xcpng-prd-03 b04d9910-8671-750f-050e-8b55c64fbede]# cbt-util get -c -n 83035854-b5a9-4f7e-869f-abe43ddc658d.cbtlog 
                                        e28065ff-342f-4eae-a910-b91842dd39ca
                                        
                                        After migration:
                                        [01:41 xcpng-prd-03 b04d9910-8671-750f-050e-8b55c64fbede]# cbt-util get -c -n 83035854-b5a9-4f7e-869f-abe43ddc658d.cbtlog 
                                        00000000-0000-0000-0000-000000000000
                                        

                                        I don't think I have anything "custom" running that would be causing this, so I have no idea why this is happening, but it's happening on multiple pools for us.

                                        This is a very interesting clue, and we will investigate it with Damien.

                                        There are a lot of edge cases that can happen (a lying network/drive/...), and most of the time XCP-ng/XAPI are self-healing, but sometimes XO has to do a little work to clean up. The CBT should be reset correctly after a storage migration.
                                        We'll make the enable/disable CBT call asynchronous, since the current behaviour can lead to a bogus state, and maybe add a more in-depth cleanup of CBT after a "VDI not related" error.
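
                                        In the meantime, one way to reset CBT manually on an affected VDI is to toggle it with the xe CLI (xe vdi-disable-cbt / vdi-enable-cbt are standard XAPI commands); the next backup of that VDI will then have to be a full, since the old chain is gone. A minimal sketch, with a placeholder VDI UUID:

                                            # Minimal sketch: toggle CBT on one VDI; <vdi-uuid> is a placeholder.
                                            # The next delta backup of this VDI will be promoted to a full,
                                            # because the previous CBT chain no longer exists.
                                            xe vdi-disable-cbt uuid=<vdi-uuid>
                                            xe vdi-enable-cbt uuid=<vdi-uuid>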

                                          flakpyro @florent:

                                          @florent Thanks for checking into this, as we'd love to be able to use this feature. If you need me to test anything or provide any additional logs/info about our environment, let me know!

                                            flakpyro @florent:

                                            @florent Testing a storage migration, I do see CBT get disabled and reset during the process, which is expected! I do notice it leaves the .cbtlog file on the old SR after the storage migration is complete, but that's easy enough to clean up manually (a rough sketch of spotting the leftovers is at the end of this post).

                                            The issue I posted above, however, is with a plain VM migration from host to host on a shared NFS SR; the SR the VM is on does not change.
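
                                            As for the leftover .cbtlog files, a cautious sketch for spotting them could look like this. It assumes the old SR is file-based and mounted under /var/run/sr-mount/<SR-UUID>, and that a cbtlog with no matching VHD on that SR is a leftover; it only prints candidates, so nothing is deleted without a manual check:

                                                # Rough sketch: list cbtlog files on an SR that have no matching VHD.
                                                # <old-SR-UUID> is a placeholder; review the output before removing anything.
                                                SR=/var/run/sr-mount/<old-SR-UUID>
                                                for log in "$SR"/*.cbtlog; do
                                                    vdi=$(basename "$log" .cbtlog)
                                                    [ -e "$SR/$vdi.vhd" ] || echo "possible leftover: $log"
                                                done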
