XCP-ng

    CBT: the thread to centralize your feedback

    • flakpyro @olivierlambert

      @olivierlambert
      I'm on it! However, after searching the XCP-ng docs as well as the XenServer docs, I can't seem to find how to specify a migration network using xe from the CLI. Are you able to provide the flag I need to use?

      • olivierlambert Vates πŸͺ Co-Founder CEO

        I don't remember the command, but @MathieuRA should be able to tell you which call we make to the XAPI when we add a migration network.

        • flakpyro @olivierlambert

          @olivierlambert @MathieuRA once you are able to provide the xe migrate flag to specify a migration network, I will test this ASAP. I think we're really close to getting to the bottom of this issue! πŸ™‚

          • rtjdamen

            Hi All,

            First of all, best wishes to you all for 2025! I have just deployed the latest build to do some testing on the one remaining issue we have with CBT backups: we were still facing full backups on some VMs. This is expected to happen because CBT is not activated fast enough on some VDIs. There is a fix in this build that should address it, and I will update this post once a few test runs have completed to let you know whether it resolves the issue.

            Robin

            • olivierlambert Vates πŸͺ Co-Founder CEO

              Happy new year and thank you very much for the feedback provided in here πŸ™‚

              • rtjdamen @olivierlambert

                @olivierlambert my pleasure, good to be a part of it.

                Good news, this bug seems to be resolved!

                Hope we can fix the migration bug as well!

                • flakpyro @rtjdamen

                  I think we have a pretty good idea of the cause now. It seems to be related to having a migration network specified at the pool level.

                  I think we are closer than ever to having this worked out, which should help a lot of us using a dedicated migration network (as was best practice in VMware land). What are the next steps we need to take?

                  • olivierlambert Vates πŸͺ Co-Founder CEO

                    We need to wait for the largest part of the team to get back from vacation on Monday πŸ˜‰

                    • MathieuRA Vates πŸͺ XO Team @flakpyro

                      Hi @flakpyro πŸ™‚
                      You can do xe help vm-migrate to see all available parameters and a small description.

                      BTW, in XO, if a network is specified for the migration, we call vm.migrate_send; otherwise vm.pool_migrate.
                      vm.migrate_send also migrates the VM's VIFs and VDIs.
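                      To make that concrete, here is a minimal Python sketch (using the XenAPI bindings) of the two call paths described above. The credentials, VM/host names and the empty maps are illustrative assumptions, not XO's actual code:

                      import XenAPI

                      # Illustrative only: connect to the pool master (placeholder credentials).
                      session = XenAPI.Session("https://pool-master")
                      session.xenapi.login_with_password("root", "password", "1.0", "example")
                      x = session.xenapi

                      vm = x.VM.get_by_name_label("my-vm")[0]            # hypothetical VM name
                      dest_host = x.host.get_by_name_label("host-b")[0]  # hypothetical host name

                      migration_network = None  # or a network ref if a migration network is set

                      if migration_network:
                          # A network is specified: ask the destination host for a "receive"
                          # token on that network, then call VM.migrate_send.
                          dest = x.host.migrate_receive(dest_host, migration_network, {})
                          # Empty vdi_map/vif_map for brevity; a real caller may need to map
                          # each VDI to an SR and each VIF to a network explicitly.
                          x.VM.migrate_send(vm, dest, True, {}, {}, {})
                      else:
                          # No network specified: plain intra-pool live migration.
                          x.VM.pool_migrate(vm, dest_host, {"live": "true"})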

                      Questions for the XCP team:

                      • What happens if you do a vm.migrate_send but the destination SR for the VDIs is the same?
                      • Is there a way to call vm.pool_migrate using a specific network?
                      • flakpyro @MathieuRA

                        @MathieuRA

                        Thanks for the tip!

                        Looking at the output:

                        command name            : vm-migrate
                                reqd params     : 
                                optional params : live, host, host-uuid, remote-master, remote-username, remote-password, remote-network, force, copy, compress, vif:, vdi:, <vm-selectors>
                        

                        It does not appear there is a way for me to specify a migration network using the vm-migrate command?

                        It sounds to me like vm.migrate_send is causing CBT to be reset while vm.pool_migrate leaves it intact? The difference being a migration that is known to stay within a pool versus one that could potentially migrate a VM anywhere?

                        • olivierlambert Vates πŸͺ Co-Founder CEO

                          Adding @psafont to the loop, because it seems we should either use pool_migrate with a migration network and/or have migrate_send not reset the CBT.

                          • psafont @olivierlambert

                            @olivierlambert xe vm-migrate uses migrate_send when storage or network needs to be changed; otherwise vm.pool_migrate is used. Selecting a new network is done through the vif parameter. This parameter is a map in the form of vif:<VIF_UUID>=<NEW_NETWORK_UUID> vif:<VIF_UUID2>=<NEW_NETWORK_UUID2> (and so on).

                            So I'm not so sure that a network migration can happen when using pool_migrate.
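                            For reference, the XAPI-level counterpart of that vif: map is the vif_map argument of VM.migrate_send, a map of VIF ref to network ref. A small sketch, reusing x, vm and dest_host from the hypothetical snippet earlier in the thread:

                            # Build a vif_map that moves every VIF of the VM onto a new network.
                            # new_network is an assumed network ref; reusing it as the network
                            # passed to migrate_receive is only for brevity here.
                            new_network = x.network.get_by_name_label("new-network")[0]
                            vif_map = {vif: new_network for vif in x.VM.get_VIFs(vm)}

                            dest = x.host.migrate_receive(dest_host, new_network, {})
                            x.VM.migrate_send(vm, dest, True, {}, vif_map, {})  # empty vdi_map: storage stays put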

                            • flakpyro @psafont

                              @psafont

                              So in the case where CBT is being reset, the network of the VM is not actually being changed during migration. The VM is moving from Host A to Host B within the same pool, using NFS shared storage, which is also not changing. However, when "Default Migration Network" is set in the pool's Advanced tab, CBT data is reset. When a default migration network is not set, the CBT data remains intact.

                              It seems like migrate_send will always reset CBT data during a migration, even if it's within the same pool on shared storage, and that it is used whenever a default migration network is specified in XO's Pool > Advanced tab. Meanwhile, vm.pool_migrate will not reset CBT but is only used when a default migration network is NOT set there. I'm not sure how we work around that, short of not using a dedicated migration network?

                              • olivierlambert Vates πŸͺ Co-Founder CEO

                                So either we need to update pool_migrate to allow a migration network parameter, OR we need to be sure that migrate_send does NOT reset the CBT data if the storage does not change. @psafont do you have a preference on which one we should do upstream?
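                                Purely as an illustration of the second option (xapi itself is OCaml, so this is pseudologic, not the actual implementation): the idea is to reset a VDI's CBT metadata only when the migration actually moves it to a different SR.

                                def cbt_reset_needed(vdi_source_sr, vdi_destination_sr):
                                    # Only a real cross-SR move should invalidate the changed-block
                                    # tracking chain; a same-SR live migration should leave the
                                    # CBT log untouched.
                                    return vdi_source_sr != vdi_destination_sr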

                                • flakpyro @olivierlambert

                                  As an update, we just spun up our DR pool yesterday, a fresh install of XCP-ng 8.3 on all hosts in the pool. Testing migrations and backups with CBT enabled shows the same behaviour we experience on the other pools. Removing the default migration network allows CBT to work properly; however, specifying a default migration network causes CBT to be reset after a VM migration. So I think this is pretty reproducible πŸ™‚ at least using a file-based SR like NFS.

                                  • rtjdamen @flakpyro

                                    @flakpyro good that the root cause is found; hope it is something that can be fixed by the XOA team.

                                    I can confirm that the CBT backups are running as smoothly as they did on VMware! Good job all!

                                    • olivierlambert Vates πŸͺ Co-Founder CEO

                                      @flakpyro we are discussing with XAPI upstream; it won't be trivial, but I think we know where to work. A CBT reset shouldn't occur if the VDI does not change SR, but I think any call to VM.migrate_send is resetting it. And since there's no way to pool_migrate on a dedicated network…

                                      • psafont @olivierlambert

                                        @olivierlambert I don't think the first option was ever meant to support that. Without knowing how much effort it will be, I'm leaning towards the second option: not resetting the CBT.

                                        • olivierlambert Vates πŸͺ Co-Founder CEO

                                          Got it, we'll try to provide a PR that adds a check to NOT reset CBT if there's no VDI migration involved and the problem will be "solved" πŸ˜„

                                           • flakpyro @olivierlambert

                                            @olivierlambert

                                            MASSIVE EDIT AFTER FURTHER TESTING

                                             So I have another one from my testing with CBT.

                                             If I have a VM running with CBT backups with snapshot deletion enabled, and I remove the pool setting that specifies a migration network, everything appears fine and CBT data will not reset due to a migration.

                                             However, if I take a manual snapshot of a VM and remove the snapshot afterwards, I find the CBT data sometimes resets itself:

                                            SM log shows:

                                            [15:53 xcpng-test-01 45e457aa-16f8-41e0-d03d-8201e69638be]# grep -A 5 -B 5 -i exception /var/log/SMlog
                                            Jan 28 11:55:00 xcpng-test-01 SMGC: [2041921] In cleanup
                                            Jan 28 11:55:00 xcpng-test-01 SMGC: [2041921] SR 9330 ('Syn-TestLab-DS1') (0 VDIs in 0 VHD trees): no changes
                                            Jan 28 11:55:00 xcpng-test-01 SM: [2041921] lock: closed /var/lock/sm/93308f90-1fcd-873b-292f-4a34dde2bfea/running
                                            Jan 28 11:55:00 xcpng-test-01 SM: [2041921] lock: closed /var/lock/sm/93308f90-1fcd-873b-292f-4a34dde2bfea/gc_active
                                            Jan 28 11:55:00 xcpng-test-01 SM: [2041921] lock: closed /var/lock/sm/93308f90-1fcd-873b-292f-4a34dde2bfea/sr
                                            Jan 28 11:55:09 xcpng-test-01 SM: [2041073] ***** sr_scan: EXCEPTION <class 'util.CommandException'>, Input/output error
                                            Jan 28 11:55:09 xcpng-test-01 SM: [2041073]   File "/opt/xensource/sm/SRCommand.py", line 111, in run
                                            Jan 28 11:55:09 xcpng-test-01 SM: [2041073]     return self._run_locked(sr)
                                            Jan 28 11:55:09 xcpng-test-01 SM: [2041073]   File "/opt/xensource/sm/SRCommand.py", line 161, in _run_locked
                                            Jan 28 11:55:09 xcpng-test-01 SM: [2041073]     rv = self._run(sr, target)
                                            Jan 28 11:55:09 xcpng-test-01 SM: [2041073]   File "/opt/xensource/sm/SRCommand.py", line 370, in _run
                                            Jan 28 11:55:09 xcpng-test-01 SM: [2041073]     return sr.scan(self.params['sr_uuid'])
                                            Jan 28 11:55:09 xcpng-test-01 SM: [2041073]   File "/opt/xensource/sm/ISOSR", line 594, in scan
                                            Jan 28 11:55:09 xcpng-test-01 SM: [2041073]     if not util.isdir(self.path):
                                            Jan 28 11:55:09 xcpng-test-01 SM: [2041073]   File "/opt/xensource/sm/util.py", line 542, in isdir
                                            Jan 28 11:55:09 xcpng-test-01 SM: [2041073]     raise CommandException(errno.EIO, "os.stat(%s)" % path, "failed")
                                            Jan 28 11:55:09 xcpng-test-01 SM: [2041073]
                                            Jan 28 11:55:09 xcpng-test-01 SM: [2041073] Raising exception [40, The SR scan failed  [opterr=Command os.stat(/var/run/sr-mount/d00054f9-e6a2-162f-f734-1c6c02541722) failed (failed): Input/output error]]
                                            Jan 28 11:55:09 xcpng-test-01 SM: [2041073] ***** ISO: EXCEPTION <class 'xs_errors.SROSError'>, The SR scan failed  [opterr=Command os.stat(/var/run/sr-mount/d00054f9-e6a2-162f-f734-1c6c02541722) failed (failed): Input/output error]
                                            Jan 28 11:55:09 xcpng-test-01 SM: [2041073]   File "/opt/xensource/sm/SRCommand.py", line 385, in run
                                            Jan 28 11:55:09 xcpng-test-01 SM: [2041073]     ret = cmd.run(sr)
                                            Jan 28 11:55:09 xcpng-test-01 SM: [2041073]   File "/opt/xensource/sm/SRCommand.py", line 121, in run
                                            Jan 28 11:55:09 xcpng-test-01 SM: [2041073]     raise xs_errors.XenError(excType, opterr=msg)
                                            Jan 28 11:55:09 xcpng-test-01 SM: [2041073]
                                            --
                                            Jan 28 14:41:58 xcpng-test-01 SM: [2181235] lock: opening lock file /var/lock/sm/58242a5a-0a6f-4e4e-bada-8331ed32eae4/cbtlog
                                            Jan 28 14:41:58 xcpng-test-01 SM: [2181235] lock: acquired /var/lock/sm/58242a5a-0a6f-4e4e-bada-8331ed32eae4/cbtlog
                                            Jan 28 14:41:58 xcpng-test-01 SM: [2181235] ['/usr/sbin/cbt-util', 'get', '-n', '/var/run/sr-mount/45e457aa-16f8-41e0-d03d-8201e69638be/58242a5a-0a6f-4e4e-bada-8331ed32eae4.cbtlog', '-c']
                                            Jan 28 14:41:58 xcpng-test-01 SM: [2181235]   pread SUCCESS
                                            Jan 28 14:41:58 xcpng-test-01 SM: [2181235] lock: released /var/lock/sm/58242a5a-0a6f-4e4e-bada-8331ed32eae4/cbtlog
                                            Jan 28 14:41:58 xcpng-test-01 SM: [2181235] Raising exception [460, Failed to calculate changed blocks for given VDIs. [opterr=Source and target VDI are unrelated]]
                                            Jan 28 14:41:58 xcpng-test-01 SM: [2181235] ***** generic exception: vdi_list_changed_blocks: EXCEPTION <class 'xs_errors.SROSError'>, Failed to calculate changed blocks for given VDIs. [opterr=Source and target VDI are unrelated]
                                            Jan 28 14:41:58 xcpng-test-01 SM: [2181235]   File "/opt/xensource/sm/SRCommand.py", line 111, in run
                                            Jan 28 14:41:58 xcpng-test-01 SM: [2181235]     return self._run_locked(sr)
                                            Jan 28 14:41:58 xcpng-test-01 SM: [2181235]   File "/opt/xensource/sm/SRCommand.py", line 161, in _run_locked
                                            Jan 28 14:41:58 xcpng-test-01 SM: [2181235]     rv = self._run(sr, target)
                                            Jan 28 14:41:58 xcpng-test-01 SM: [2181235]   File "/opt/xensource/sm/SRCommand.py", line 326, in _run
                                            Jan 28 14:41:58 xcpng-test-01 SM: [2181235]     return target.list_changed_blocks()
                                            Jan 28 14:41:58 xcpng-test-01 SM: [2181235]   File "/opt/xensource/sm/VDI.py", line 759, in list_changed_blocks
                                            Jan 28 14:41:58 xcpng-test-01 SM: [2181235]     "Source and target VDI are unrelated")
                                            Jan 28 14:41:58 xcpng-test-01 SM: [2181235]
                                            Jan 28 14:41:58 xcpng-test-01 SM: [2181235] ***** NFS VHD: EXCEPTION <class 'xs_errors.SROSError'>, Failed to calculate changed blocks for given VDIs. [opterr=Source and target VDI are unrelated]
                                            Jan 28 14:41:58 xcpng-test-01 SM: [2181235]   File "/opt/xensource/sm/SRCommand.py", line 385, in run
                                            Jan 28 14:41:58 xcpng-test-01 SM: [2181235]     ret = cmd.run(sr)
                                            Jan 28 14:41:58 xcpng-test-01 SM: [2181235]   File "/opt/xensource/sm/SRCommand.py", line 111, in run
                                            Jan 28 14:41:58 xcpng-test-01 SM: [2181235]     return self._run_locked(sr)
                                            Jan 28 14:41:58 xcpng-test-01 SM: [2181235]   File "/opt/xensource/sm/SRCommand.py", line 161, in _run_locked
                                            --
                                            Jan 28 15:53:18 xcpng-test-01 SMGC: [2250532] Removed leaf-coalesce from fe6e3edd(100.000G/7.483M?)
                                            Jan 28 15:53:18 xcpng-test-01 SMGC: [2250532] *~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*
                                            Jan 28 15:53:18 xcpng-test-01 SMGC: [2250532]          ***********************
                                            Jan 28 15:53:18 xcpng-test-01 SMGC: [2250532]          *  E X C E P T I O N  *
                                            Jan 28 15:53:18 xcpng-test-01 SMGC: [2250532]          ***********************
                                            Jan 28 15:53:18 xcpng-test-01 SMGC: [2250532] leaf-coalesce: EXCEPTION <class 'util.SMException'>, VDI fe6e3edd-4d63-4005-b0f3-932f5f34e036 could not be coalesced
                                            Jan 28 15:53:18 xcpng-test-01 SMGC: [2250532]   File "/opt/xensource/sm/cleanup.py", line 2098, in coalesceLeaf
                                            Jan 28 15:53:18 xcpng-test-01 SMGC: [2250532]     self._coalesceLeaf(vdi)
                                            Jan 28 15:53:18 xcpng-test-01 SMGC: [2250532]   File "/opt/xensource/sm/cleanup.py", line 2380, in _coalesceLeaf
                                            Jan 28 15:53:18 xcpng-test-01 SMGC: [2250532]     .format(uuid=vdi.uuid))
                                            Jan 28 15:53:18 xcpng-test-01 SMGC: [2250532]
                                            
                                            

                                            I have been able to once again reproduce this multiple times.

                                            Steps to reproduce:

                                             1. Set up a backup job, enable CBT with snapshot removal, and run your first initial full backup.

                                             2. Take a manual snapshot of the VM. After a few minutes, remove the snapshot and let GC run to completion (one way to script this step is sketched below).

                                             3. Run the same backup job again. For me, anyway, this usually results in a full backup, with the above being dumped to the SM log.

                                             4. Afterwards, all subsequent backups go back to being delta and CBT works fine again, unless I take another manual snapshot.

                                            Is anyone else able to reproduce this?
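                                             One way to script step 2 via XenAPI, as a rough sketch: this reuses the hypothetical session (x, vm) from the snippets earlier in the thread, and it deletes the snapshot's disks outright; XO's own "purge snapshot data" path with CBT may instead use VDI.data_destroy to keep the CBT metadata.

                                             import time

                                             # Take a manual snapshot of the VM (the name is arbitrary).
                                             snap = x.VM.snapshot(vm, "cbt-repro-test")

                                             time.sleep(300)  # "after a few minutes"

                                             # Remove the snapshot: destroy its disk VDIs, then the snapshot
                                             # record itself, and let the garbage collector coalesce afterwards.
                                             for vbd in x.VM.get_VBDs(snap):
                                                 if x.VBD.get_type(vbd) == "Disk":
                                                     x.VDI.destroy(x.VBD.get_VDI(vbd))
                                             x.VM.destroy(snap)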

                                             Edit 2: Here is an example of what I am running into.
                                             After the initial backup job runs:

                                            [23:27 xcpng-test-01 45e457aa-16f8-41e0-d03d-8201e69638be]#   cbt-util get -c -n 2be6b6ec-9308-4e63-9975-19259108eba2.cbtlog 
                                            adde7aaf-6b13-498a-b0e3-f756a57b2e78
                                            

                                            After taking a manual snapshot, the CBT log reference changes as expected:

                                            [23:27 xcpng-test-01 45e457aa-16f8-41e0-d03d-8201e69638be]#   cbt-util get -c -n 2be6b6ec-9308-4e63-9975-19259108eba2.cbtlog 
                                            b6e33794-120a-4a95-b035-af64c6605ee2
                                            

                                            After removing the manual snapshot:

                                            [23:29 xcpng-test-01 45e457aa-16f8-41e0-d03d-8201e69638be]#   cbt-util get -c -n 2be6b6ec-9308-4e63-9975-19259108eba2.cbtlog 
                                            00000000-0000-0000-0000-000000000000
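                                             To spot this condition across a whole SR, here is a rough helper (assumed to run in dom0, against the SR mount path used in the outputs above) that wraps the same cbt-util call and flags any CBT log whose reference has been zeroed out:

                                             import glob
                                             import os
                                             import subprocess

                                             # Assumption: run from dom0; SR_MOUNT is the SR path shown above.
                                             SR_MOUNT = "/var/run/sr-mount/45e457aa-16f8-41e0-d03d-8201e69638be"

                                             for log in sorted(glob.glob(os.path.join(SR_MOUNT, "*.cbtlog"))):
                                                 ref = subprocess.check_output(
                                                     ["cbt-util", "get", "-c", "-n", log]).decode().strip()
                                                 state = "RESET" if ref == "00000000-0000-0000-0000-000000000000" else "ok"
                                                 print("%s: %s [%s]" % (os.path.basename(log), ref, state))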
                                            