CBT: the thread to centralize your feedback

• olivierlambert Vates 🪐 Co-Founder CEO

So either we need to update pool_migrate to allow a migration network parameter, OR we need to be sure that migrate_send does NOT reset the CBT data if the storage does not change. @psafont Do you have a preference on which one we should do upstream?
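
For reference, this is roughly what the two code paths look like through the Python XenAPI bindings (a sketch only; the VM/host/network names and credentials are placeholders): VM.pool_migrate takes no network parameter, so choosing a migration network currently means going through Host.migrate_receive plus VM.migrate_send, which is the path that resets CBT.

    import XenAPI

    # Run on a host in the pool; the usual local-socket login idiom.
    session = XenAPI.xapi_local()
    session.xenapi.login_with_password("root", "")

    vm = session.xenapi.VM.get_by_name_label("my-vm")[0]      # placeholder
    host = session.xenapi.host.get_by_name_label("host2")[0]  # placeholder

    # Path 1: intra-pool live migration. No way to choose the network here.
    session.xenapi.VM.pool_migrate(vm, host, {"live": "true"})

    # Path 2: the only way to pick a migration network, even inside a pool.
    # This is the call that currently resets the CBT data.
    net = session.xenapi.network.get_by_name_label("migration-net")[0]
    dest = session.xenapi.host.migrate_receive(host, net, {})
    # vdi_map and vif_map left empty: nothing changes SR or network.
    # (Newer XAPI versions also accept a trailing vgpu_map argument.)
    session.xenapi.VM.migrate_send(vm, dest, True, {}, {}, {})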

• flakpyro @olivierlambert

As an update, we spun up our DR pool yesterday: a fresh install of XCP-ng 8.3 on all hosts in the pool. Testing migrations and backups with CBT enabled shows the same behaviour we experienced on the other pools. Removing the default migration network allows CBT to work properly; however, specifying a default migration network causes CBT to be reset after a VM migration. So I think this is pretty reproducible 🙂 at least when using a file-based SR like NFS.

• rtjdamen @flakpyro

@flakpyro good that the root cause has been found; hope it is something that can be fixed by the XOA team.

I can confirm that the CBT backups are running as smoothly as they did on VMware! Good job, all!

• olivierlambert Vates 🪐 Co-Founder CEO

@flakpyro we are discussing it with XAPI upstream; it won't be trivial, but I think we know where to work. A CBT reset shouldn't occur if the VDI doesn't change SR, but I think any call to VM.migrate_send is resetting it. And since there's no way to pool-migrate on a dedicated network…

• psafont @olivierlambert

@olivierlambert I don't think the first was ever meant to support that. Without knowing how much effort it will be, I'm leaning towards the second option: not resetting the CBT.

• olivierlambert Vates 🪐 Co-Founder CEO

Got it, we'll try to provide a PR that adds a check to NOT reset CBT if there's no VDI migration involved, and the problem will be "solved" 😄
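
Purely as an illustration of the intended check (the real change would live in XAPI's OCaml migration code, so all names here are hypothetical):

    # Hypothetical sketch: only reset CBT for VDIs whose SR actually changes
    # during migrate_send, instead of unconditionally on every call.
    def vdis_needing_cbt_reset(vdi_sr_map):
        """vdi_sr_map: {vdi: (source_sr, destination_sr)} -- illustrative."""
        return [vdi for vdi, (src, dst) in vdi_sr_map.items() if src != dst]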

• flakpyro @olivierlambert

@olivierlambert

MASSIVE EDIT AFTER FURTHER TESTING

So I have another one from my testing with CBT.

If I have a VM running CBT backups with snapshot deletion enabled, and I remove the pool setting that specifies a migration network, everything appears fine and the CBT data is not reset by a migration.

However, if I take a manual snapshot of a VM and remove the snapshot afterwards, I find the CBT data sometimes resets itself:

The SM log shows:

                  [15:53 xcpng-test-01 45e457aa-16f8-41e0-d03d-8201e69638be]# grep -A 5 -B 5 -i exception /var/log/SMlog
                  Jan 28 11:55:00 xcpng-test-01 SMGC: [2041921] In cleanup
                  Jan 28 11:55:00 xcpng-test-01 SMGC: [2041921] SR 9330 ('Syn-TestLab-DS1') (0 VDIs in 0 VHD trees): no changes
                  Jan 28 11:55:00 xcpng-test-01 SM: [2041921] lock: closed /var/lock/sm/93308f90-1fcd-873b-292f-4a34dde2bfea/running
                  Jan 28 11:55:00 xcpng-test-01 SM: [2041921] lock: closed /var/lock/sm/93308f90-1fcd-873b-292f-4a34dde2bfea/gc_active
                  Jan 28 11:55:00 xcpng-test-01 SM: [2041921] lock: closed /var/lock/sm/93308f90-1fcd-873b-292f-4a34dde2bfea/sr
                  Jan 28 11:55:09 xcpng-test-01 SM: [2041073] ***** sr_scan: EXCEPTION <class 'util.CommandException'>, Input/output error
                  Jan 28 11:55:09 xcpng-test-01 SM: [2041073]   File "/opt/xensource/sm/SRCommand.py", line 111, in run
                  Jan 28 11:55:09 xcpng-test-01 SM: [2041073]     return self._run_locked(sr)
                  Jan 28 11:55:09 xcpng-test-01 SM: [2041073]   File "/opt/xensource/sm/SRCommand.py", line 161, in _run_locked
                  Jan 28 11:55:09 xcpng-test-01 SM: [2041073]     rv = self._run(sr, target)
                  Jan 28 11:55:09 xcpng-test-01 SM: [2041073]   File "/opt/xensource/sm/SRCommand.py", line 370, in _run
                  Jan 28 11:55:09 xcpng-test-01 SM: [2041073]     return sr.scan(self.params['sr_uuid'])
                  Jan 28 11:55:09 xcpng-test-01 SM: [2041073]   File "/opt/xensource/sm/ISOSR", line 594, in scan
                  Jan 28 11:55:09 xcpng-test-01 SM: [2041073]     if not util.isdir(self.path):
                  Jan 28 11:55:09 xcpng-test-01 SM: [2041073]   File "/opt/xensource/sm/util.py", line 542, in isdir
                  Jan 28 11:55:09 xcpng-test-01 SM: [2041073]     raise CommandException(errno.EIO, "os.stat(%s)" % path, "failed")
                  Jan 28 11:55:09 xcpng-test-01 SM: [2041073]
                  Jan 28 11:55:09 xcpng-test-01 SM: [2041073] Raising exception [40, The SR scan failed  [opterr=Command os.stat(/var/run/sr-mount/d00054f9-e6a2-162f-f734-1c6c02541722) failed (failed): Input/output error]]
                  Jan 28 11:55:09 xcpng-test-01 SM: [2041073] ***** ISO: EXCEPTION <class 'xs_errors.SROSError'>, The SR scan failed  [opterr=Command os.stat(/var/run/sr-mount/d00054f9-e6a2-162f-f734-1c6c02541722) failed (failed): Input/output error]
                  Jan 28 11:55:09 xcpng-test-01 SM: [2041073]   File "/opt/xensource/sm/SRCommand.py", line 385, in run
                  Jan 28 11:55:09 xcpng-test-01 SM: [2041073]     ret = cmd.run(sr)
                  Jan 28 11:55:09 xcpng-test-01 SM: [2041073]   File "/opt/xensource/sm/SRCommand.py", line 121, in run
                  Jan 28 11:55:09 xcpng-test-01 SM: [2041073]     raise xs_errors.XenError(excType, opterr=msg)
                  Jan 28 11:55:09 xcpng-test-01 SM: [2041073]
                  --
                  Jan 28 14:41:58 xcpng-test-01 SM: [2181235] lock: opening lock file /var/lock/sm/58242a5a-0a6f-4e4e-bada-8331ed32eae4/cbtlog
                  Jan 28 14:41:58 xcpng-test-01 SM: [2181235] lock: acquired /var/lock/sm/58242a5a-0a6f-4e4e-bada-8331ed32eae4/cbtlog
                  Jan 28 14:41:58 xcpng-test-01 SM: [2181235] ['/usr/sbin/cbt-util', 'get', '-n', '/var/run/sr-mount/45e457aa-16f8-41e0-d03d-8201e69638be/58242a5a-0a6f-4e4e-bada-8331ed32eae4.cbtlog', '-c']
                  Jan 28 14:41:58 xcpng-test-01 SM: [2181235]   pread SUCCESS
                  Jan 28 14:41:58 xcpng-test-01 SM: [2181235] lock: released /var/lock/sm/58242a5a-0a6f-4e4e-bada-8331ed32eae4/cbtlog
                  Jan 28 14:41:58 xcpng-test-01 SM: [2181235] Raising exception [460, Failed to calculate changed blocks for given VDIs. [opterr=Source and target VDI are unrelated]]
                  Jan 28 14:41:58 xcpng-test-01 SM: [2181235] ***** generic exception: vdi_list_changed_blocks: EXCEPTION <class 'xs_errors.SROSError'>, Failed to calculate changed blocks for given VDIs. [opterr=Source and target VDI are unrelated]
                  Jan 28 14:41:58 xcpng-test-01 SM: [2181235]   File "/opt/xensource/sm/SRCommand.py", line 111, in run
                  Jan 28 14:41:58 xcpng-test-01 SM: [2181235]     return self._run_locked(sr)
                  Jan 28 14:41:58 xcpng-test-01 SM: [2181235]   File "/opt/xensource/sm/SRCommand.py", line 161, in _run_locked
                  Jan 28 14:41:58 xcpng-test-01 SM: [2181235]     rv = self._run(sr, target)
                  Jan 28 14:41:58 xcpng-test-01 SM: [2181235]   File "/opt/xensource/sm/SRCommand.py", line 326, in _run
                  Jan 28 14:41:58 xcpng-test-01 SM: [2181235]     return target.list_changed_blocks()
                  Jan 28 14:41:58 xcpng-test-01 SM: [2181235]   File "/opt/xensource/sm/VDI.py", line 759, in list_changed_blocks
                  Jan 28 14:41:58 xcpng-test-01 SM: [2181235]     "Source and target VDI are unrelated")
                  Jan 28 14:41:58 xcpng-test-01 SM: [2181235]
                  Jan 28 14:41:58 xcpng-test-01 SM: [2181235] ***** NFS VHD: EXCEPTION <class 'xs_errors.SROSError'>, Failed to calculate changed blocks for given VDIs. [opterr=Source and target VDI are unrelated]
                  Jan 28 14:41:58 xcpng-test-01 SM: [2181235]   File "/opt/xensource/sm/SRCommand.py", line 385, in run
                  Jan 28 14:41:58 xcpng-test-01 SM: [2181235]     ret = cmd.run(sr)
                  Jan 28 14:41:58 xcpng-test-01 SM: [2181235]   File "/opt/xensource/sm/SRCommand.py", line 111, in run
                  Jan 28 14:41:58 xcpng-test-01 SM: [2181235]     return self._run_locked(sr)
                  Jan 28 14:41:58 xcpng-test-01 SM: [2181235]   File "/opt/xensource/sm/SRCommand.py", line 161, in _run_locked
                  --
                  Jan 28 15:53:18 xcpng-test-01 SMGC: [2250532] Removed leaf-coalesce from fe6e3edd(100.000G/7.483M?)
                  Jan 28 15:53:18 xcpng-test-01 SMGC: [2250532] *~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*
                  Jan 28 15:53:18 xcpng-test-01 SMGC: [2250532]          ***********************
                  Jan 28 15:53:18 xcpng-test-01 SMGC: [2250532]          *  E X C E P T I O N  *
                  Jan 28 15:53:18 xcpng-test-01 SMGC: [2250532]          ***********************
                  Jan 28 15:53:18 xcpng-test-01 SMGC: [2250532] leaf-coalesce: EXCEPTION <class 'util.SMException'>, VDI fe6e3edd-4d63-4005-b0f3-932f5f34e036 could not be coalesced
                  Jan 28 15:53:18 xcpng-test-01 SMGC: [2250532]   File "/opt/xensource/sm/cleanup.py", line 2098, in coalesceLeaf
                  Jan 28 15:53:18 xcpng-test-01 SMGC: [2250532]     self._coalesceLeaf(vdi)
                  Jan 28 15:53:18 xcpng-test-01 SMGC: [2250532]   File "/opt/xensource/sm/cleanup.py", line 2380, in _coalesceLeaf
                  Jan 28 15:53:18 xcpng-test-01 SMGC: [2250532]     .format(uuid=vdi.uuid))
                  Jan 28 15:53:18 xcpng-test-01 SMGC: [2250532]
                  
                  

I have once again been able to reproduce this multiple times.

Steps to reproduce:

1. Set up a backup job, enable CBT with snapshot removal, and run the initial full backup.

2. Take a manual snapshot of the VM. After a few minutes, remove the snapshot and let GC run to completion.

3. Run the same backup job again. For me, this usually results in a full backup, with the above being dumped to the SM log.

4. Afterwards, all backups go back to being deltas and CBT works fine again, unless I take another manual snapshot.

Is anyone else able to reproduce this?

Edit 2: Here is an example of what I am running into.
After the initial backup job runs:

                  [23:27 xcpng-test-01 45e457aa-16f8-41e0-d03d-8201e69638be]#   cbt-util get -c -n 2be6b6ec-9308-4e63-9975-19259108eba2.cbtlog 
                  adde7aaf-6b13-498a-b0e3-f756a57b2e78
                  

After taking a manual snapshot, the CBT log reference changes as expected:

                  [23:27 xcpng-test-01 45e457aa-16f8-41e0-d03d-8201e69638be]#   cbt-util get -c -n 2be6b6ec-9308-4e63-9975-19259108eba2.cbtlog 
                  b6e33794-120a-4a95-b035-af64c6605ee2
                  

After removing the manual snapshot:

                  [23:29 xcpng-test-01 45e457aa-16f8-41e0-d03d-8201e69638be]#   cbt-util get -c -n 2be6b6ec-9308-4e63-9975-19259108eba2.cbtlog 
                  00000000-0000-0000-0000-000000000000
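
Since that zeroed UUID is the tell-tale sign of a reset, a small loop over the SR's .cbtlog files makes the check quick to repeat (a sketch; the SR mount path below is from my example and needs adjusting to your SR uuid):

    #!/usr/bin/env python3
    # Flag .cbtlog files whose reference (as printed by `cbt-util get -c`)
    # has been zeroed out.
    import glob
    import subprocess

    SR_MOUNT = "/var/run/sr-mount/45e457aa-16f8-41e0-d03d-8201e69638be"
    ZERO_UUID = "00000000-0000-0000-0000-000000000000"

    for log in sorted(glob.glob(SR_MOUNT + "/*.cbtlog")):
        result = subprocess.run(["cbt-util", "get", "-c", "-n", log],
                                capture_output=True, text=True, check=True)
        ref = result.stdout.strip()
        print("RESET" if ref == ZERO_UUID else "ok   ", log, ref)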
                  
• flakpyro @flakpyro

Another test on a different pool seems to yield the same result:

Create a VM and add it to a backup job using CBT with snapshot deletion. Run the backup job to generate the .cbtlog file.

After the first backup run:

                    [08:22 xcpng-test-01 45e457aa-16f8-41e0-d03d-8201e69638be]#   cbt-util get -c -n dfa26980-edc4-4127-a032-cfd99226a5b8.cbtlog 
                    adde7aaf-6b13-498a-b0e3-f756a57b2e78
                    

Next, take a snapshot of the VM from Xen Orchestra's Snapshot tab and check the CBT log file again; it now references the newly created snapshot:

                    [08:27 xcpng-test-01 45e457aa-16f8-41e0-d03d-8201e69638be]#   cbt-util get -c -n dfa26980-edc4-4127-a032-cfd99226a5b8.cbtlog 
                    994174ef-c579-44e6-bc61-240fb996867e
                    

Remove the manually created snapshot, check the CBT log file, and find that it has been reset:

                    [08:27 xcpng-test-01 45e457aa-16f8-41e0-d03d-8201e69638be]#   cbt-util get -c -n dfa26980-edc4-4127-a032-cfd99226a5b8.cbtlog 
                    00000000-0000-0000-0000-000000000000
                    

So far I can make this happen on two different pools. It would be helpful if anyone else could confirm this. 🙂

• flakpyro @flakpyro

3rd update: this appears to happen on our test pool using NFS (TrueNAS NFS), our DR pool (Pure Storage NFS), and our production pool (Pure Storage NFS).

Testing more today, this seems to occur on shared NFS SRs where multiple hosts are connected; using local EXT storage I do not see this behaviour.

Is there any debugging I could enable to help get to the bottom of this? Or, if someone else can confirm this happens to them, we can rule out something in my environments.

• olivierlambert Vates 🪐 Co-Founder CEO

Do you think it confirms what we said with @psafont? (Sorry, I don't have time right now to review all your excellent feedback in detail, so I'm going straight to the point.) If yes, then it's up to us to get a PR into XAPI to fix it 🙂

• flakpyro @olivierlambert

@olivierlambert This seems to be something different, as I don't need to migrate a VM for this to happen. Simply have a VM with CBT enabled and a .cbtlog file present for it, then create a regular snapshot of the VM. Upon deleting that manually created snapshot, the CBT data gets reset.

It happens to me on shared NFS SRs. I have not been able to make it happen on a local EXT SR, but I have this happening across 3 different pools using NFS SRs now. There is a ton of info in my posts, in an attempt to be as detailed as possible! Happy to help any way I can!
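
In case it helps others reproduce this without clicking through XO, the snapshot-then-delete sequence can be scripted via the Python XenAPI bindings (a sketch; I am assuming snapshot deletion works the way XO does it, destroying the snapshot's disk VDIs and then the snapshot record):

    import time
    import XenAPI

    session = XenAPI.xapi_local()
    session.xenapi.login_with_password("root", "")

    vm = session.xenapi.VM.get_by_name_label("cbt-test-vm")[0]  # placeholder

    # 1. Take a plain disk-only snapshot of a VM that has CBT enabled.
    snap = session.xenapi.VM.snapshot(vm, "cbt-repro-snap")
    time.sleep(300)  # leave it around for a few minutes, as in the manual test

    # 2. Delete it: destroy the snapshot's disk VDIs, then the snapshot itself.
    for vbd in session.xenapi.VM.get_VBDs(snap):
        if session.xenapi.VBD.get_type(vbd) == "Disk":
            session.xenapi.VDI.destroy(session.xenapi.VBD.get_VDI(vbd))
    session.xenapi.VM.destroy(snap)

    # 3. Once GC completes, check the .cbtlog with cbt-util as shown above.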

• olivierlambert Vates 🪐 Co-Founder CEO

Hmm, indeed. It could be that the CBT reset code is catching far too many cases where it shouldn't 🤔 (so it might still be related to the initial problem).

• flakpyro @olivierlambert

@olivierlambert that could for sure be the case. We are using NBD without "Purge snapshot data" enabled for now, and it has been very reliable, but we're hoping to keep chipping away at these issues so we can one day enable it on our production VMs. 🙂 If there is any testing you need me to do, just let me know, as we have a decent test environment set up where we prototype these things before deploying for real.

• rtjdamen

Good morning all! Hope the Xen winter meetup was great for those who joined!

We migrated from 8.2.1 to XCP-ng 8.3 this week. On 8.2.1 we had everything working like a charm, with no CBT issues anymore. On 8.3, however, we seem to run into some frustrating things whose root cause I can't really put a finger on, so I think it's important to share them with you all so we can see where it goes wrong and how we could fix it.

The first one is related to NBD/XAPI: in some rare cases, one of the VDIs from a VM gets stuck on the NBD host. I had 3 overnight after restarting the dedicated hosts, but now I see new ones on different servers. We have set the NBD limit to 1 connection, but strangely, in one case I could see there were 4 NBD connections, or 4 VDIs connected (not sure, but 3 were disconnected, 1 left). Could it be that the software sometimes doesn't respect the limit and sets up multiple connections?

The second issue I found is that sometimes CBT is broken on a VM. In that case you can run a full backup over and over again, but it will not work until you disable CBT and enable it again, forcing it to set up a new CBT chain; otherwise it remains broken. Would it be an option to let the software handle this automatically? Be aware that I found some cases where disabling CBT on a running VM caused a map_duplicate_key error. This only happened when I disabled CBT manually on that VDI; when I did an online storage migration (which does the same thing, if I am correct) it worked without this issue. Not sure what the difference is, but if the same error occurs when doing it automatically from the software, you could end up destroying your VM ;-).
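
For what it's worth, the manual disable/re-enable cycle maps to the VDI.disable_cbt and VDI.enable_cbt API calls (xe vdi-disable-cbt / xe vdi-enable-cbt on the CLI). A sketch for all disks of one VM, keeping in mind the map_duplicate_key caveat above about doing this on a running VM:

    import XenAPI

    session = XenAPI.xapi_local()
    session.xenapi.login_with_password("root", "")

    vm = session.xenapi.VM.get_by_name_label("broken-cbt-vm")[0]  # placeholder

    # Force a fresh CBT chain on each disk. Caution: per the report above,
    # doing this on a running VM has produced map_duplicate_key errors.
    for vbd in session.xenapi.VM.get_VBDs(vm):
        if session.xenapi.VBD.get_type(vbd) != "Disk":
            continue
        vdi = session.xenapi.VBD.get_VDI(vbd)
        if session.xenapi.VDI.get_cbt_enabled(vdi):
            session.xenapi.VDI.disable_cbt(vdi)
        session.xenapi.VDI.enable_cbt(vdi)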

Hope someone can point me in the right direction on these subjects! I will also open a ticket with support to investigate, but I wanted to share my new experiences here. Is anyone seeing the same issues?

• rtjdamen @rtjdamen

Hi all,

Are there other users running 8.3 combined with CBT backups? We keep running into VDIs that stay connected to the control domain. We are investigating these with Vates, as it looks like the problem is within XAPI or tapdisk; it would be helpful to know whether others are running into the same issue.

As far as I understand from Vates, it looks like the problem we face is somewhere within XAPI, but we need more logging to understand it better. If anyone is running this with the same issues, or without, please let us know so we can compare.

Cheers!

• Andrw0830 @rtjdamen

@rtjdamen I can confirm that in our environment, with XCP-ng 8.3 and XO from sources (from January's end-of-month release), we have this exact issue. To confirm what happens: if you define a default migration network, then during a VM migration CBT gets disabled, and the next backup displays the "unable to do CBT on this VDI" error. This happens on a shared NFS SR within the same pool.

• flakpyro @Andrw0830

@Andrw0830 Can you also confirm whether taking a regular snapshot from XO, then deleting it some time later, causes CBT to reset as well, as I ran into above?

• rtjdamen @Andrw0830

@Andrw0830 so just to confirm, you also have random VDIs that stay connected to the control domain?

The issue with the migration network is a known one, I understand from Vates, and is currently being worked on.

• Andrw0830 @flakpyro

@flakpyro I usually don't keep snapshots on my VMs; I just let the backup job create snapshots as part of the NBD/CBT backups and purge them afterwards. I've tested creating a standard snapshot (without memory), deleting it, and doing a backup, and it was able to use the old NBD and CBT chain without error.

@rtjdamen I just have the migration network issue, so I thought you were referring to that; I'm not sure about the control domain part. Is there a way to check or know whether we are affected?

• rtjdamen @Andrw0830

@Andrw0830 yeah, sure: on the dashboard you have the Health tab, where you find "VDIs attached to control domain". We have random VDIs that stay attached, though it seems random, so we are investigating whether this is an infrastructure thing or a bug in XAPI. We did not face this on 8.2.

We do not have issues with CBT after creating and then removing snapshots.
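
For anyone who wants to check this from a script rather than XO's Health tab, something like the following lists disk VDIs currently attached to a control domain (Python XenAPI sketch):

    import XenAPI

    session = XenAPI.xapi_local()
    session.xenapi.login_with_password("root", "")

    for vm_ref, vm in session.xenapi.VM.get_all_records().items():
        if not vm["is_control_domain"]:
            continue
        for vbd_ref in vm["VBDs"]:
            vbd = session.xenapi.VBD.get_record(vbd_ref)
            if (vbd["type"] == "Disk" and vbd["currently_attached"]
                    and vbd["VDI"] != "OpaqueRef:NULL"):
                vdi = session.xenapi.VDI.get_record(vbd["VDI"])
                print(vm["name_label"], vdi["uuid"], vdi["name_label"])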
