CBT: the thread to centralize your feedback

Anonabhar

I think I may have a bit of a similar problem here. About a week ago, I did an update to the broken version of XO and it threw the same error as is in the subject line here. I reverted and everything was OK, but then I started to get unhealthy VDI warnings on my backups.

I tried to rescan the SR, and I could see in the SMLog that it believed another GC was running, so it would abort. Rebooting the host was the only way to force the coalesce to complete; however, as soon as the next incremental backup ran, it hit the same problem (the GC thinking another one is running and not doing any work).

I then did a full power off of the host, rebooted, let all the VMs sit in a "powered off" state, rescanned the SR and let it coalesce. Once everything was idle, I deleted all snapshots and waited for the coalesce to finish. Only then did I restart the VMs. Now a few VMs have immediately come up as 'unhealthy' and once again the GC will not run, thinking there is another GC working.

I'm kind of running out of ideas 8-) Does anyone know what might be stuck or what I need to look for to find out?

Just a side note here: I noticed that all the VMs I am having problems with have CBT enabled.

I have one VM that is snapshot-only, and even when the coalesce is stuck, I can delete snapshots off this non-CBT VM and the coalesce process runs (then throws an exception when it gets to the VMs that have CBT enabled).

Is there a way to disable CBT?
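
To confirm the pattern described above (only CBT-enabled VMs getting stuck), the CBT state of each disk on a VM can be read from the host. This is a minimal sketch, not a supported tool: the function name cbt_status_for_vm is made up for illustration, and the XE variable only exists so the loop can be dry-run against a stub instead of the real xe CLI.

```shell
#!/usr/bin/env bash
# Sketch: print the CBT state of every disk attached to one VM.
# XE defaults to the real xe CLI; point it at a stub to dry-run the logic.
XE="${XE:-xe}"

# cbt_status_for_vm is a hypothetical helper name, not part of xe.
cbt_status_for_vm() {
    vm_uuid="$1"
    # --minimal returns the values as one comma-separated line
    vdis="$("$XE" vbd-list vm-uuid="$vm_uuid" params=vdi-uuid --minimal)"
    for vdi in $(printf '%s' "$vdis" | tr ',' ' '); do
        # Empty drives report a placeholder instead of a UUID; skip those
        case "$vdi" in
            *-*-*-*-*) ;;
            *) continue ;;
        esac
        printf '%s cbt-enabled=%s\n' "$vdi" \
            "$("$XE" vdi-param-get uuid="$vdi" param-name=cbt-enabled)"
    done
}
```

Calling cbt_status_for_vm with a VM UUID on the host prints one line per disk, which makes it easy to compare a healthy VM with an unhealthy one.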

florent

Hello everybody,
Thanks for your feedback. Here is a work branch with CBT enabled: https://github.com/vatesfr/xen-orchestra/pull/7792 . The branch name is fix_cbt.

It fixes:

snapshot retention with full backups
off-by-one error in retention length
parent locator error
"can't destructure undefined" error
VDIs no longer leak attached to dom0 (in our lab)
progress is back on the export task

Please test it if you can, and don't hesitate to provide feedback.

Regards,

Florent

fbeauchamp opened this pull request in vatesfr/xen-orchestra

draft fix(backups): CBT omnibus #7792

Anonabhar

For those who may be stuck like I was: I have finally undone the coalesce nightmare the previous CBT version caused.

Note: I am using XCP-ng 8.3 Beta, fully patched.

What I had to do was:

1. Shut down every VM and delete every snapshot.
2. Find every VDI that has CBT enabled and disable it. I did this with a simple bash loop (not the best, I know):

# Pull the UUID out of the "uuid ( RO)" lines of xe's default output
for i in $(xe vdi-list cbt-enabled=true | grep "^uuid ( RO)" | cut -d " " -f 20)
do
     echo "$i"
     xe vdi-disable-cbt uuid="$i"
done

3. Reboot the server.
4. Create a snapshot on any VM and immediately delete it. (If you just do a rescan, it says that the GC is running when it is not, but for whatever reason deleting a snapshot seems to kick off the GC regardless.)
5. Keep an eye on the SMLog and look for exceptions. I tend to do something like this (it will sleep for 5 minutes, so don't get anxious):

tail -f /var/log/SMLog | grep SMGC

6. When it finishes, check XO to see if there are any remaining uncoalesced disks and repeat from step 4.

It took about 5 iterations of the above to finally clean up all the stuck uncoalesced leaves, but it eventually did it. The key, for me, was making sure the VMs were not running and turning CBT off.
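
When watching the log over several iterations, it can help to narrow the SMGC output down to exception lines only. A minimal sketch, assuming the garbage collector tags its lines with SMGC (as in the command above) and that failures contain the word "exception" in some casing; the helper name smgc_exceptions is made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch: show only garbage-collector (SMGC) log lines that mention an exception.
# Reads the given file, or stdin when no file is passed.
# Assumption: exception lines contain the word "exception" (any casing).
smgc_exceptions() {
    # --line-buffered keeps output flowing when fed from "tail -f"
    grep --line-buffered 'SMGC' "$@" | grep --line-buffered -i 'exception'
}

# Typical use on the host (log path taken from the post above):
#   smgc_exceptions /var/log/SMLog
#   tail -f /var/log/SMLog | smgc_exceptions
```

An empty result after a GC pass is a reasonable hint that the pass went through cleanly, although it is worth still checking the unhealthy VDI list in XO.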

rtjdamen

@florent Hi Florent, I would love to help you test this in our lab. I have XO from sources running there, but I see no CBT options. Do I need to download it in a specific way?

Delgado

@florent I'll be more than happy to help. I will get my homelab instance upgraded to that branch and report back with any issues.

olivierlambert

@rtjdamen You need to switch to the fix_cbt branch (git checkout fix_cbt) and rebuild.
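
For an XO-from-sources install, the switch and rebuild could look roughly like the sketch below. The checkout path in XO_DIR is hypothetical, and the yarn and yarn build steps are assumed from the usual from-sources build procedure rather than stated in this thread. With DRY_RUN=1 the commands are only printed, which is a cheap way to review them first.

```shell
#!/usr/bin/env bash
# Sketch: move an XO-from-sources checkout onto the fix_cbt branch and rebuild.
# XO_DIR is a hypothetical checkout location; adjust it to your install.
XO_DIR="${XO_DIR:-$HOME/xen-orchestra}"

# With DRY_RUN=1 each command is printed instead of executed.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Fetch, switch branch, install dependencies, rebuild; stops at the first failure.
switch_to_fix_cbt() {
    run cd "$XO_DIR" &&
    run git fetch origin &&
    run git checkout fix_cbt &&
    run yarn &&
    run yarn build
}
```

After a successful build, xo-server still has to be restarted in whatever way your install normally does it.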

rtjdamen

@olivierlambert Thank you, found it. I will run some backups with one or two VMs to start with and will report the results.

rtjdamen

This seems to be working fine. Once the backup is complete, we'll execute the vdi_data_destroy command, right? Currently, it doesn't appear obvious that this is a CBT metadata-only snapshot. Is there a way to make this more visible?

olivierlambert

You mean in the VM view/snapshot tab? You are seeing the VM snapshot, not the VDI snapshot, so I wonder if this VM snapshot can be reverted while being CBT metadata only, and if not, we must make it clear in the UI, yes!

Delgado

I enabled CBT on the disks and NBD + CBT in my delta backup, and so far so good. I plan on letting another backup run overnight. I also ran a full backup and it removed the snapshot like it's supposed to.

rtjdamen

@olivierlambert Yes indeed, this currently shows up like a normal snapshot. I think it should be shown as a metadata-only snapshot.

rtjdamen

@florent I have been watching the backup process, and in the end I only see vdi.destroy happening, not vdi.data_destroy. Is this correct? Are we handling this last step correctly, or does data remain on the snapshot at this time?

florent

dataDestroy will be enable-able (not sure if it's really a word) today. In the meantime, the latest commits in the fix_cbt branch add an additional check on dom0 connect and more error handling.

Please note that the metadata snapshot won't be visible in the UI, since it's not a VM snapshot but only the metadata of the VDI snapshots.
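
Since the metadata-only snapshots are not shown in the UI, one way to look at them is from the host CLI. The sketch below lists the snapshot VDIs of a given disk together with their type field. It assumes that a snapshot whose data has been destroyed reports a CBT metadata type instead of a regular user disk type; the exact string xe prints is not confirmed here, so the code only prints it. The helper name snapshot_types_for_vdi is made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch: list the snapshot VDIs of one disk together with their reported type.
# XE defaults to the real xe CLI; point it at a stub to dry-run the logic.
XE="${XE:-xe}"

# snapshot_types_for_vdi is a hypothetical helper name, not part of xe.
snapshot_types_for_vdi() {
    vdi_uuid="$1"
    # --minimal returns the snapshot UUIDs as one comma-separated line
    snaps="$("$XE" vdi-list snapshot-of="$vdi_uuid" params=uuid --minimal)"
    for snap in $(printf '%s' "$snaps" | tr ',' ' '); do
        # Assumption: the type field differs once the snapshot data is destroyed
        printf '%s type=%s\n' "$snap" \
            "$("$XE" vdi-param-get uuid="$snap" param-name=type)"
    done
}
```

Running it before and after a backup with data destroy enabled should show whether the last snapshot changed type or was removed entirely.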

rtjdamen

@florent OK, so currently the data remains? When do you think this addition will be ready for testing? I am interested because we saw some issues with this on NFS, and I am curious whether it will make a difference with this code.

@olivierlambert I now understand that in general there is no difference for coalesce as long as the data destroy is not done. So you were right on that part, and it’s safe to push it this way!

olivierlambert

Yes, that's why we'll be able to offer a safe route for people not using data destroy, while letting people who want to explore it opt in.

florent

@rtjdamen It's still fresh, but on the other hand, the worst that can happen is falling back to a full backup. So for now I would not use it on the biggest VMs (multi-terabyte).
We are sure that it will be a game changer on thick provisioning (because a snapshot costs the full virtual size) or on fast-changing VMs, where coalescing an older snapshot is a major hurdle.

If everything goes well, it will be on stable by the end of July, and we'll probably enable it by default on new backups in the near future.

Tristis Oris

can't commit, too small for ticket.

typo

preferNbdInformation:
    'A network accessible by XO or the proxy must have NBD enabled,. Storage must support Change Block Tracking (CBT) to ue it in a backup',

enabled,.
to ue

Tristis Oris

Updated to the fix_cbt branch.

CR NBD backup works.
Delta NBD backup works.
Each was only run once, so we can't be sure yet.

No broken tasks are generated.

Still confused about why the CBT toggle is enabled on some VMs.
Two similar VMs in the same pool, same storage, same Ubuntu version: one has it enabled automatically, the other does not.

rtjdamen

@florent I did some testing with the data_destroy branch in my lab. It seems to work as required; the snapshot is indeed hidden when it is CBT only.

What I am not sure is correct: once the data destroy action is done, I would expect a snapshot to show up for coalesce, but it does not. Is it too small and removed so quickly that it is never visible in XOA? On larger VMs in our production, I can see these snapshots showing up for coalesce. Or, when you do vdi.data_destroy, does it try to coalesce directly, without garbage collection afterwards?