CBT: the thread to centralize your feedback

rtjdamen

@flakpyro are you running the latest XCP-ng version, 8.2 or 8.3?

flakpyro

@rtjdamen Both pools are on 8.3 with all the latest updates.
I did find this PR on GitHub and wonder if it may be related: https://github.com/vatesfr/xen-orchestra/pull/8127, but I'm not sure why it would only happen after a migration.

Linked pull request in vatesfr/xen-orchestra, opened by fbeauchamp (closed): fix(backups): handle slow enable cbt #8127

rtjdamen

@flakpyro we are still on 8.2, so maybe there is some difference there.

olivierlambert

Thanks for the feedback @flakpyro and it shows it's not an XO issue. There's something not preserving CBT in your case where it shouldn't, and IDK why. But clearly, you have a way to test it easily, which is progress

flakpyro

@olivierlambert So I guess the next thing we need to do is have someone else who is also running 8.3 test this using an NFS SR?

florent

@flakpyro said in CBT: the thread to centralize your feedback:

This is a completely different 5 host pool backed by a Pure storage array with SRs mounted via NFSv3, migrating a VM between hosts results in the same issue.
Before migration:
[01:41 xcpng-prd-03 b04d9910-8671-750f-050e-8b55c64fbede]# cbt-util get -c -n 83035854-b5a9-4f7e-869f-abe43ddc658d.cbtlog 
e28065ff-342f-4eae-a910-b91842dd39ca

After migration
[01:41 xcpng-prd-03 b04d9910-8671-750f-050e-8b55c64fbede]# cbt-util get -c -n 83035854-b5a9-4f7e-869f-abe43ddc658d.cbtlog 
00000000-0000-0000-0000-000000000000
I don't think I have anything "custom" running that would be causing this, so I have no idea why this is happening, but it's happening on multiple pools for us.

This is a very interesting clue, and we will investigate it with Damien.

There are a lot of edge cases that can happen (a lying network/drive/...),
and most of the time XCP-ng/XAPI are self-healing, but sometimes XO has to do a little work to clean up. The CBT should be reset correctly after a storage migration.
We'll add the async call to enable/disable CBT, since it could lead to a bogus state, and maybe a more in-depth cleanup of CBT after a "VDI not related" error.

flakpyro

@florent thanks for checking into this as we'd love to be able to use this feature. If you need me to test anything or provide any additional logs/info about our environment let me know!

flakpyro

@florent Testing a storage migration, I do see CBT get disabled and reset during the process, which is expected! I do notice it leaves the .cbtlog file on the old SR after the storage migration is complete, but that's easy enough to clean up manually.

The issue I posted above, however, is just a VM migration from host to host on a shared NFS SR; the SR the VM is on is not changing.

Rhodderz

We appear to have a similar issue to @flakpyro.
We don't have NFS storage but use iSCSI from Dell SC5020s.
We had backups with NBD and CBT enabled.
We updated one of our pools to the latest (stable branch) yesterday to try and get rid of the iSCSI disconnecting bug, which meant all the VMs were shuffled around and migrated.
This morning the majority of the VMs failed the backup with "can't create a stream from a metadata VDI, fall back to a base".
Quick searching brought me here, and following what flak did, I found that one of the cbtlogs for one of the failed VMs is also zeroed, as shown below:

[09:40 xcp101 VG_XenStorage-6c2ec0ce-01ba-6975-741c-e2e86bc45e21]# cbt-util get -c -n cc2f2443-eb13-4eeb-951b-5faa3c7b8c55.cbtlog
00000000-0000-0000-0000-000000000000

We have enterprise support, with a ticket already open about NBD being slow (we were on 1 NBD connection) and a support tunnel open, and I will update that ticket as well.
Hopefully that gives you another point of reference to check from.
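
To check more than one log at a time, something like the rough Python sketch below might help. It only wraps the same `cbt-util get -c -n` call shown above and flags any log that reports the all-zero UUID. The function names (`read_cbt_reference`, `find_zeroed_cbt_logs`) are made up for this example, and it assumes the `.cbtlog` files are readable from the directory it is run in, as in the prompt above.

```python
import subprocess
from pathlib import Path

# The all-zero UUID seen in the "zeroed" logs in this thread.
NIL_UUID = "00000000-0000-0000-0000-000000000000"


def read_cbt_reference(log_path):
    """Run `cbt-util get -c -n <log>` and return the UUID it prints."""
    result = subprocess.run(
        ["cbt-util", "get", "-c", "-n", str(log_path)],
        stdout=subprocess.PIPE,
        universal_newlines=True,
        check=True,
    )
    return result.stdout.strip()


def find_zeroed_cbt_logs(log_paths, reader=read_cbt_reference):
    """Return the file names of the logs whose reported UUID is all zeros.

    `reader` is injectable so the logic can be tried without cbt-util.
    """
    zeroed = []
    for log_path in log_paths:
        if reader(log_path) == NIL_UUID:
            zeroed.append(Path(log_path).name)
    return zeroed


# Hypothetical usage from inside the SR directory:
#   print(find_zeroed_cbt_logs(sorted(Path(".").glob("*.cbtlog"))))
```

A zeroed log found this way only tells you which VDIs are affected; it does not say why the reference was lost.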

Is it possible to force a clean fresh start for the backups similar to Veeam "Active Full"?

Forza

@Rhodderz said in CBT: the thread to centralize your feedback:

Is it possible to force a clean fresh start for the backups similar to Veeam "Active Full"?

Perhaps delete the snapshots for each vm. When backup job starts, it should be a 'full' backup.

rtjdamen

@Rhodderz are you also on 8.3?

Rhodderz

@rtjdamen Having had a look: I assumed we were on 8.3, as we updated yesterday and there are no available patches, but we are on 8.2.1.

NAME="XCP-ng"
VERSION="8.2.1"
ID="xenenterprise"
ID_LIKE="centos rhel fedora"
VERSION_ID="8.2.1"
PRETTY_NAME="XCP-ng 8.2.1"

release/yangtze/master/58

Apologies, I forgot to check that and (wrongly) assumed.
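
One way to avoid that kind of assumption is to read the version fields directly. The sketch below parses `KEY="value"` lines like the ones pasted above. It assumes they came from an os-release style file such as `/etc/os-release` (the post does not say which file was read), and `parse_release_fields` is just an illustrative name.

```python
def parse_release_fields(text):
    """Parse KEY="value" lines into a dict, ignoring anything else."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blanks, comments, and lines without a KEY=value shape.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        fields[key.strip()] = value.strip().strip('"')
    return fields


if __name__ == "__main__":
    # Sample text copied from the output pasted in this thread.
    sample = '''NAME="XCP-ng"
VERSION="8.2.1"
VERSION_ID="8.2.1"
PRETTY_NAME="XCP-ng 8.2.1"
'''
    print(parse_release_fields(sample).get("VERSION_ID"))
```

On a host, the same function could be fed the contents of the real file instead of the sample string.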

Rhodderz

@Forza Tested this on a VM and it seems I still get the same error, sadly.

rtjdamen

@Rhodderz OK, so it’s not only related to v8.3 as we were assuming, or something else is going on in your env. What happens if you use a normal backup without CBT?

Rhodderz

@rtjdamen Just trying that now.
However, it seems that if I disable CBT on the VM, the backup (I'm trying a new backup job for this testing) just re-enables it.
It seems that, depending on the job, I can have NBD+CBT or neither.
Annoyingly, we would like NBD to run to speed up backups, as they take quite some time.

EDIT:
To add, the new test backup for the VM that failed before actually finished successfully.
Just manually rerunning it on the main job now.
If it works there, the temporary workaround could be to just disable CBT and let the backup job re-enable it.

EDIT EDIT:
Re-running the backup on the VM in the original job still failed with the same error.
Testing with the new job, configured the same way (NBD connections set to 2, purge snapshots after), still passes fine.
So I am guessing CBT is job-dependent and not VM-dependent?
Which would explain why a new job on the same VM to the same place works fine?

flakpyro

For our production pool I have CBT + NBD enabled, but I have "Purge snapshot data when using CBT" disabled. This results in successful backups, but the snapshot is retained. I assume it then ends up using that snapshot for the following delta backups.

Rhodderz

@flakpyro ah, I will try that once the proxy for that pool is back.
We upgraded XOA from the stable channel to latest, as we had another issue with NBD (causing some machines to go RO) which is apparently resolved in that release.
Once that's fixed I will try again to see if the above update and/or disabling "Purge snapshot" works as a workaround.

We have purge enabled (and would like it left enabled) as we use iSCSI (Dell SC5020s), so everything is a little fat, especially with some clients.
I shall update tomorrow on what happens.

flakpyro

@Rhodderz I agree. We are using NFS, so snapshots are thin at least, but we would love to be able to delete the snapshots after a backup run as well. Hopefully in time we can get this working!

Rhodderz

To add an update and not leave this on a cliffhanger:
We have since updated our XOA to the latest channel to attempt to fix an NBD issue.
This move broke a proxy of ours, but all the backups are now going through the XOA, and the backups have not had an issue since.
So either the new NBD fixes, the backups running only through the XOA, or something somewhere else resolved this problem for now.

We will be enabling the same in our other pool soon so will update if we have the same issues there.

flakpyro

Sadly, the latest XOA release from today does not resolve my strange CBT issue. Before migration:

[08:32 xcpng-test-01 45e457aa-16f8-41e0-d03d-8201e69638be]#  cbt-util get -c -n 4d7f0341-bbce-4957-a4c4-d603725a807a.cbtlog 
1950d6a3-c6a9-4b0c-b79f-068dd44479cc
After Migration from Host 01 to Host 02 (Shared NFS SR):
[08:33 xcpng-test-01 45e457aa-16f8-41e0-d03d-8201e69638be]#  cbt-util get -c -n 4d7f0341-bbce-4957-a4c4-d603725a807a.cbtlog 
00000000-0000-0000-0000-000000000000
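
For anyone who wants to repeat this check around a test migration, here is a minimal Python sketch of the comparison. It takes the text printed by `cbt-util get -c -n <file>.cbtlog` before and after the migration and reports whether a real UUID turned into the all-zero one. The helper names are hypothetical, and it assumes the last non-empty line of the output is the UUID, as in the transcripts above.

```python
import uuid

# The all-zero UUID that shows up after the migration.
NIL_UUID = uuid.UUID(int=0)


def parse_reported_uuid(cbt_util_output):
    """Parse the UUID from the last non-empty line of cbt-util output."""
    lines = [ln.strip() for ln in cbt_util_output.splitlines() if ln.strip()]
    if not lines:
        raise ValueError("empty cbt-util output")
    return uuid.UUID(lines[-1])


def cbt_reference_was_reset(before_output, after_output):
    """True if a non-zero UUID before became the all-zero UUID after."""
    before = parse_reported_uuid(before_output)
    after = parse_reported_uuid(after_output)
    return before != NIL_UUID and after == NIL_UUID


if __name__ == "__main__":
    # Values copied from the two outputs above.
    before = "1950d6a3-c6a9-4b0c-b79f-068dd44479cc\n"
    after = "00000000-0000-0000-0000-000000000000\n"
    print(cbt_reference_was_reset(before, after))  # True
```

This only detects the symptom; it makes no claim about what zeroes the log during the migration.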