Try removing the Xen tools and running the update, then reinstalling the tools afterwards. I have had luck doing this when upgrading Windows Server from 2022 to 2025; it may apply here as well.
-
RE: Win11 VM update 23H2 -> 24H2 fail
-
RE: CBT: the thread to centralize your feedback
@olivierlambert That could for sure be the case. We are using NBD without "Purge snapshot data" enabled for now, and it's been very reliable, but I'm hoping to keep chipping away at these issues so we can one day enable this on our production VMs.
If there is any testing you need me to do, just let me know, as we have a decent test environment set up where we prototype these things before deploying for real.
-
RE: CBT: the thread to centralize your feedback
@olivierlambert This seems to be something different, as I don't need to migrate a VM for this to happen. Simply have a VM with CBT enabled and a .cbtlog file present for it, then create a regular snapshot of the VM. Upon deleting that manually created snapshot, the CBT data is reset.
It happens to me on shared NFS SRs; I have not been able to make it happen on a local EXT SR, but I have this happening across 3 different pools using NFS SRs now. There is a ton of info in my posts in an attempt to be as detailed as possible! Happy to help any way I can!
-
RE: CBT: the thread to centralize your feedback
3rd update: this appears to happen on our test pool using NFS (TrueNAS NFS), our DR pool (Pure Storage NFS) and our production pool (Pure Storage NFS).
Testing more today, this seems to occur on shared NFS SRs where multiple hosts are connected; using local EXT storage I do not see this behaviour.
Is there any debug I could enable to help get to the bottom of this? Or, if someone else can also confirm this happens to them, we can rule out something in my environments.
-
RE: CBT: the thread to centralize your feedback
Another test on a different pool seems to yield the same result:
Create a VM and add it to a backup job using CBT with snapshot deletion. Run the backup job to generate the .cbtlog file.
After the first backup run:
[08:22 xcpng-test-01 45e457aa-16f8-41e0-d03d-8201e69638be]# cbt-util get -c -n dfa26980-edc4-4127-a032-cfd99226a5b8.cbtlog
adde7aaf-6b13-498a-b0e3-f756a57b2e78
Next, take a snapshot of the VM from the Snapshots tab in Xen Orchestra and check the CBT log file again; it now references the newly created snapshot:
[08:27 xcpng-test-01 45e457aa-16f8-41e0-d03d-8201e69638be]# cbt-util get -c -n dfa26980-edc4-4127-a032-cfd99226a5b8.cbtlog
994174ef-c579-44e6-bc61-240fb996867e
Remove the manually created snapshot, then check the CBT log file and find that it has been corrupted:
[08:27 xcpng-test-01 45e457aa-16f8-41e0-d03d-8201e69638be]# cbt-util get -c -n dfa26980-edc4-4127-a032-cfd99226a5b8.cbtlog
00000000-0000-0000-0000-000000000000
So far I can make this happen on two different pools. It would be helpful if anyone else could confirm this; a rough check sequence is sketched below.
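For anyone who wants to try it, here is a sketch of the check sequence, with placeholder values for the SR mount path and the .cbtlog file name (substitute your own):
cd /var/run/sr-mount/<sr-uuid>
cbt-util get -c -n <vdi-uuid>.cbtlog   # note the UUID printed here
# take a snapshot of the VM in Xen Orchestra, then re-check:
cbt-util get -c -n <vdi-uuid>.cbtlog   # should now reference the new snapshot
# remove the snapshot, wait for the GC to finish, then re-check:
cbt-util get -c -n <vdi-uuid>.cbtlog   # all zeros means the CBT data was reset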
-
RE: CBT: the thread to centralize your feedback
MASSIVE EDIT AFTER FURTHER TESTING
So I have another one from my testing with CBT.
If I have a VM running CBT backups with snapshot deletion enabled, and I remove the pool setting that specifies a migration network, everything appears fine and the CBT data does not reset due to a migration.
However, if I take a manual snapshot of a VM and remove the snapshot afterwards, I find the CBT data sometimes resets itself:
SM log shows:
[15:53 xcpng-test-01 45e457aa-16f8-41e0-d03d-8201e69638be]# grep -A 5 -B 5 -i exception /var/log/SMlog
Jan 28 11:55:00 xcpng-test-01 SMGC: [2041921] In cleanup
Jan 28 11:55:00 xcpng-test-01 SMGC: [2041921] SR 9330 ('Syn-TestLab-DS1') (0 VDIs in 0 VHD trees): no changes
Jan 28 11:55:00 xcpng-test-01 SM: [2041921] lock: closed /var/lock/sm/93308f90-1fcd-873b-292f-4a34dde2bfea/running
Jan 28 11:55:00 xcpng-test-01 SM: [2041921] lock: closed /var/lock/sm/93308f90-1fcd-873b-292f-4a34dde2bfea/gc_active
Jan 28 11:55:00 xcpng-test-01 SM: [2041921] lock: closed /var/lock/sm/93308f90-1fcd-873b-292f-4a34dde2bfea/sr
Jan 28 11:55:09 xcpng-test-01 SM: [2041073] ***** sr_scan: EXCEPTION <class 'util.CommandException'>, Input/output error
Jan 28 11:55:09 xcpng-test-01 SM: [2041073] File "/opt/xensource/sm/SRCommand.py", line 111, in run
Jan 28 11:55:09 xcpng-test-01 SM: [2041073] return self._run_locked(sr)
Jan 28 11:55:09 xcpng-test-01 SM: [2041073] File "/opt/xensource/sm/SRCommand.py", line 161, in _run_locked
Jan 28 11:55:09 xcpng-test-01 SM: [2041073] rv = self._run(sr, target)
Jan 28 11:55:09 xcpng-test-01 SM: [2041073] File "/opt/xensource/sm/SRCommand.py", line 370, in _run
Jan 28 11:55:09 xcpng-test-01 SM: [2041073] return sr.scan(self.params['sr_uuid'])
Jan 28 11:55:09 xcpng-test-01 SM: [2041073] File "/opt/xensource/sm/ISOSR", line 594, in scan
Jan 28 11:55:09 xcpng-test-01 SM: [2041073] if not util.isdir(self.path):
Jan 28 11:55:09 xcpng-test-01 SM: [2041073] File "/opt/xensource/sm/util.py", line 542, in isdir
Jan 28 11:55:09 xcpng-test-01 SM: [2041073] raise CommandException(errno.EIO, "os.stat(%s)" % path, "failed")
Jan 28 11:55:09 xcpng-test-01 SM: [2041073]
Jan 28 11:55:09 xcpng-test-01 SM: [2041073] Raising exception [40, The SR scan failed [opterr=Command os.stat(/var/run/sr-mount/d00054f9-e6a2-162f-f734-1c6c02541722) failed (failed): Input/output error]]
Jan 28 11:55:09 xcpng-test-01 SM: [2041073] ***** ISO: EXCEPTION <class 'xs_errors.SROSError'>, The SR scan failed [opterr=Command os.stat(/var/run/sr-mount/d00054f9-e6a2-162f-f734-1c6c02541722) failed (failed): Input/output error]
Jan 28 11:55:09 xcpng-test-01 SM: [2041073] File "/opt/xensource/sm/SRCommand.py", line 385, in run
Jan 28 11:55:09 xcpng-test-01 SM: [2041073] ret = cmd.run(sr)
Jan 28 11:55:09 xcpng-test-01 SM: [2041073] File "/opt/xensource/sm/SRCommand.py", line 121, in run
Jan 28 11:55:09 xcpng-test-01 SM: [2041073] raise xs_errors.XenError(excType, opterr=msg)
Jan 28 11:55:09 xcpng-test-01 SM: [2041073]
--
Jan 28 14:41:58 xcpng-test-01 SM: [2181235] lock: opening lock file /var/lock/sm/58242a5a-0a6f-4e4e-bada-8331ed32eae4/cbtlog
Jan 28 14:41:58 xcpng-test-01 SM: [2181235] lock: acquired /var/lock/sm/58242a5a-0a6f-4e4e-bada-8331ed32eae4/cbtlog
Jan 28 14:41:58 xcpng-test-01 SM: [2181235] ['/usr/sbin/cbt-util', 'get', '-n', '/var/run/sr-mount/45e457aa-16f8-41e0-d03d-8201e69638be/58242a5a-0a6f-4e4e-bada-8331ed32eae4.cbtlog', '-c']
Jan 28 14:41:58 xcpng-test-01 SM: [2181235] pread SUCCESS
Jan 28 14:41:58 xcpng-test-01 SM: [2181235] lock: released /var/lock/sm/58242a5a-0a6f-4e4e-bada-8331ed32eae4/cbtlog
Jan 28 14:41:58 xcpng-test-01 SM: [2181235] Raising exception [460, Failed to calculate changed blocks for given VDIs. [opterr=Source and target VDI are unrelated]]
Jan 28 14:41:58 xcpng-test-01 SM: [2181235] ***** generic exception: vdi_list_changed_blocks: EXCEPTION <class 'xs_errors.SROSError'>, Failed to calculate changed blocks for given VDIs. [opterr=Source and target VDI are unrelated]
Jan 28 14:41:58 xcpng-test-01 SM: [2181235] File "/opt/xensource/sm/SRCommand.py", line 111, in run
Jan 28 14:41:58 xcpng-test-01 SM: [2181235] return self._run_locked(sr)
Jan 28 14:41:58 xcpng-test-01 SM: [2181235] File "/opt/xensource/sm/SRCommand.py", line 161, in _run_locked
Jan 28 14:41:58 xcpng-test-01 SM: [2181235] rv = self._run(sr, target)
Jan 28 14:41:58 xcpng-test-01 SM: [2181235] File "/opt/xensource/sm/SRCommand.py", line 326, in _run
Jan 28 14:41:58 xcpng-test-01 SM: [2181235] return target.list_changed_blocks()
Jan 28 14:41:58 xcpng-test-01 SM: [2181235] File "/opt/xensource/sm/VDI.py", line 759, in list_changed_blocks
Jan 28 14:41:58 xcpng-test-01 SM: [2181235] "Source and target VDI are unrelated")
Jan 28 14:41:58 xcpng-test-01 SM: [2181235]
Jan 28 14:41:58 xcpng-test-01 SM: [2181235] ***** NFS VHD: EXCEPTION <class 'xs_errors.SROSError'>, Failed to calculate changed blocks for given VDIs. [opterr=Source and target VDI are unrelated]
Jan 28 14:41:58 xcpng-test-01 SM: [2181235] File "/opt/xensource/sm/SRCommand.py", line 385, in run
Jan 28 14:41:58 xcpng-test-01 SM: [2181235] ret = cmd.run(sr)
Jan 28 14:41:58 xcpng-test-01 SM: [2181235] File "/opt/xensource/sm/SRCommand.py", line 111, in run
Jan 28 14:41:58 xcpng-test-01 SM: [2181235] return self._run_locked(sr)
Jan 28 14:41:58 xcpng-test-01 SM: [2181235] File "/opt/xensource/sm/SRCommand.py", line 161, in _run_locked
--
Jan 28 15:53:18 xcpng-test-01 SMGC: [2250532] Removed leaf-coalesce from fe6e3edd(100.000G/7.483M?)
Jan 28 15:53:18 xcpng-test-01 SMGC: [2250532] *~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*
Jan 28 15:53:18 xcpng-test-01 SMGC: [2250532] ***********************
Jan 28 15:53:18 xcpng-test-01 SMGC: [2250532] * E X C E P T I O N *
Jan 28 15:53:18 xcpng-test-01 SMGC: [2250532] ***********************
Jan 28 15:53:18 xcpng-test-01 SMGC: [2250532] leaf-coalesce: EXCEPTION <class 'util.SMException'>, VDI fe6e3edd-4d63-4005-b0f3-932f5f34e036 could not be coalesced
Jan 28 15:53:18 xcpng-test-01 SMGC: [2250532] File "/opt/xensource/sm/cleanup.py", line 2098, in coalesceLeaf
Jan 28 15:53:18 xcpng-test-01 SMGC: [2250532] self._coalesceLeaf(vdi)
Jan 28 15:53:18 xcpng-test-01 SMGC: [2250532] File "/opt/xensource/sm/cleanup.py", line 2380, in _coalesceLeaf
Jan 28 15:53:18 xcpng-test-01 SMGC: [2250532] .format(uuid=vdi.uuid))
Jan 28 15:53:18 xcpng-test-01 SMGC: [2250532]
I have been able to once again reproduce this multiple times.
Steps to reproduce:
-
Set up a backup job, enable CBT with snapshot removal, and run your initial full backup.
-
Take a manual snapshot of the VM. After a few minutes, remove the snapshot and let the GC run to completion.
-
Run the same backup job again. For me, this usually results in a full backup, with the above being dumped to the SM log.
-
Afterwards, all backups go back to being deltas and CBT works fine again, unless I take another manual snapshot.
Is anyone else able to reproduce this?
Edit 2: Here is an example of what I am running into.
After the initial backup job runs:
[23:27 xcpng-test-01 45e457aa-16f8-41e0-d03d-8201e69638be]# cbt-util get -c -n 2be6b6ec-9308-4e63-9975-19259108eba2.cbtlog
adde7aaf-6b13-498a-b0e3-f756a57b2e78
After taking a manual snapshot, the CBT log reference changes as expected:
[23:27 xcpng-test-01 45e457aa-16f8-41e0-d03d-8201e69638be]# cbt-util get -c -n 2be6b6ec-9308-4e63-9975-19259108eba2.cbtlog
b6e33794-120a-4a95-b035-af64c6605ee2
After removing the manual snapshot:
[23:29 xcpng-test-01 45e457aa-16f8-41e0-d03d-8201e69638be]# cbt-util get -c -n 2be6b6ec-9308-4e63-9975-19259108eba2.cbtlog
00000000-0000-0000-0000-000000000000
-
-
RE: XCP-ng 8.3 updates announcements and testing
@gduperrey Installed on the same 2 hosts as the last batch of test updates released in December.
No issues to report so far; I ran a backup job afterwards without issue.
-
RE: Manual snapshots retention
@olivierlambert I wonder if it would be beneficial to show all snapshots older than 30 days, not just the ones not created by an automated process. For example, what happens if an XO backup job runs and creates a snapshot, but the process is interrupted and the job fails? Will the next run of the job clean up the previous failed job's snapshot, or is there a chance a snapshot could be left behind?
-
RE: SR Garbage Collection running permanently
I have run into this numerous times. It's one of the reasons I have not switched to "Purge snapshot data when using CBT" on all my jobs yet.
I hope the fixes in testing solve the issue; what has been fixing it for me in the meantime is modifying the following:
Edit /opt/xensource/sm/cleanup.py and change LIVE_LEAF_COALESCE_MAX_SIZE and LIVE_LEAF_COALESCE_TIMEOUT to the following values:
LIVE_LEAF_COALESCE_MAX_SIZE = 1024 * 1024 * 1024  # bytes
LIVE_LEAF_COALESCE_TIMEOUT = 300  # seconds
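To see whether the GC is still tripping over coalesce timeouts after this change, a quick look at the SM log is usually enough; a hedged example (standard log location on XCP-ng, adjust the patterns to taste):
grep -i "leaf-coalesce" /var/log/SMlog | tail -n 20
grep -i "EXCEPTION" /var/log/SMlog | tail -n 20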
-
RE: Manual snapshots retention
I'm not sure what the best criteria would be for listing the snapshots on the health check page. Perhaps if the snapshot is over 30 days old? There is already "Too many snapshots" (not sure what counts as too many?) and "Orphaned VMs snapshot"; perhaps simply another section called "Old snapshots"? A rough CLI version of the idea is sketched below.
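This is only a sketch of the idea from the CLI, not an XO feature; the 30-day cutoff is arbitrary and the script assumes the standard xe snapshot fields (snapshot-time sorts correctly as a plain string in this format):
#!/bin/bash
# list snapshots older than 30 days (sketch)
CUTOFF=$(date -u -d '30 days ago' +%Y%m%dT%H:%M:%SZ)
for uuid in $(xe snapshot-list --minimal | tr ',' ' '); do
  ts=$(xe snapshot-param-get uuid=$uuid param-name=snapshot-time)
  name=$(xe snapshot-param-get uuid=$uuid param-name=name-label)
  # ISO-style timestamps compare correctly as strings
  [ "$ts" \< "$CUTOFF" ] && echo "$ts  $uuid  $name"
done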
-
RE: Manual snapshots retention
I have often thought this might be something worth having in the health check area. Coming from VMware, we used to frequently run a tool called "RVTools" that could show you old snapshots that may have been forgotten about, or snapshots that were created by a failed backup job and not removed properly. It would be useful to be able to find snapshots like that and remove them before they become a problem.
-
RE: CBT: the thread to centralize your feedback
As an update, we just spun up our DR pool yesterday, a fresh install of XCP-ng 8.3 on all hosts in the pool. Testing migrations and backups with CBT enabled shows the same behaviour we experience on the other pools: removing the default migration network allows CBT to work properly, however specifying a default migration network causes CBT to be reset after a VM migration. So I think this is pretty reproducible, at least using a file-based SR like NFS.
-
RE: Bonded interface viewing support in XO
@olivierlambert Not to dig up an old thread but was this ever added? I was looking around and wasn't able to find it anywhere.
-
RE: How to migrate XOA itself?
@DustinB Are there any downsides to having two XOA instances pointing at the same pool? Since the config itself is stored at the pool level, I'm guessing there's no downside?
I.e. a primary XOA running in the core DC and a secondary XOA running at your DR site. Is it just a matter of adding the pool on the secondary XOA so it downloads the existing config, or did you need to do a full export/import?
-
RE: How to migrate XOA itself?
@manilx When I did this I used the xe CLI; you can also use XCP-ng Center, but with 8.3 you'll need to download a beta version linked in the XCP-ng Center forum thread.
xe vm-migrate uuid=UUID_OF_XOA_VM remote-master=new_pool_master_IP remote-username=root remote-password=PASSWORD host-uuid=destination_host_uuid vdi:vdi_uuid=destination_sr_uuid vif:source_vif_uuid=destination_network_uuid
Docs: https://docs.xenserver.com/en-us/citrix-hypervisor/command-line-interface.html#vm-migrate
-
RE: CBT: the thread to centralize your feedback
So in the case where CBT is being reset, the network of the VM is not actually being changed during migration. The VM is moving from Host A to Host B within the same pool, using NFS shared storage, which is also not changing. However, when "Default Migration Network" is set in the pool's Advanced tab, CBT data is reset. When a default migration network is not set, the CBT data remains intact.
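One hedged way to confirm which code path is being hit is to grep the XAPI log on the pool master right after a test migration (standard log location; the exact line format varies by version):
grep -E "VM\.(pool_migrate|migrate_send)" /var/log/xensource.log | tail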
It seems like migrate_send will always reset CBT data during a migration, even if it's within the same pool on shared storage, and that this call is used when a default migration network is specified in XO's Pool - Advanced tab. Meanwhile, vm.pool_migrate will not reset CBT, but is only used when a default migration network is NOT set in XO's Pool - Advanced tab. I'm not sure how we work around that, short of not using a dedicated migration network?
-
RE: CBT: the thread to centralize your feedback
Thanks for the tip!
Looking at the output:
command name    : vm-migrate
reqd params     :
optional params : live, host, host-uuid, remote-master, remote-username, remote-password, remote-network, force, copy, compress, vif:, vdi:, <vm-selectors>
It does not appear there is a way for me to specify a migration network using the vm-migrate command?
It sounds to me like vm.migrate_send is causing CBT to be reset while vm.pool_migrate is leaving it intact? The difference being a migration that is known to stay within a pool vs. one that could potentially be migrating a VM anywhere?
-
RE: CBT: the thread to centralize your feedback
I think we have a pretty good idea of the cause now; it seems to be related to having a migration network specified at the pool level.
I think we are closer than ever to having this worked out, and it should help a lot of us using a dedicated migration network (as was best practice in VMware land). What are the next steps we need to take?
-
RE: CBT: the thread to centralize your feedback
@olivierlambert @MathieuRA Once you are able to provide me the xe migrate flag to specify a migration network, I will test this ASAP. I think we're really close to getting to the bottom of this issue!
-
RE: Replicating a Back Repository using ZFS send/Rsync
@olivierlambert Makes sense! I would schedule the replication to occur a couple of hours after the backup runs are complete, to ensure it's a replica of all the data and not a partial one. A rough sketch of what I mean is below.
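A minimal sketch, assuming the backup repository lives on a ZFS dataset (tank/xo-backups, dr-host and the 04:00 schedule are placeholders) and that the backup jobs finish a couple of hours before it runs from cron:
#!/bin/sh
# replicate-xo-backups.sh -- snapshot the backup dataset and send it to the DR box
set -eu
DS=tank/xo-backups
DEST=root@dr-host
TODAY=$(date +%F)
YESTERDAY=$(date -d yesterday +%F)

zfs snapshot "$DS@$TODAY"
if zfs list -t snapshot "$DS@$YESTERDAY" >/dev/null 2>&1; then
  # incremental send against yesterday's snapshot
  zfs send -i "$DS@$YESTERDAY" "$DS@$TODAY" | ssh "$DEST" zfs recv -F "$DS"
else
  # first run: full send
  zfs send "$DS@$TODAY" | ssh "$DEST" zfs recv -F "$DS"
fi
# e.g. from cron on the repository host: 0 4 * * * root /usr/local/sbin/replicate-xo-backups.sh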