Replication is leaving VDIs attached to Control Domain, again

florent

@Andrew is it always the same VM/disk?

Andrew

@florent Different random ones.

Andrew

@florent With CR running and NBD set to 2 connections, I see two exports and one import per disk. The import is never the one that gets stuck, and when it does happen, only one of the two exports is stuck (never both).

I have updated XCP 8.3 to the new January 2026 patch and XO to current master and will keep an eye on it again.

florent

@Andrew And the CR is completing correctly?

Andrew

@florent Yes.

Andrew

@florent Delta backup is also leaving old snapshots on some VMs. It should only have one (current) snapshot for the nightly backup. This is an issue on 1 of 100 VMs.

XCP (Jan 2026 update) and XO (91c5d) are current.

Andrew

@florent I rebuilt my XCP hosting environment (everything is faster and bigger, and stuffed into one rack)... and this issue is now worse.

The main changes in this new setup are 2x40Gb networking, a faster NVMe NFS NAS, faster pool servers, more memory, and a much faster CR destination machine with ZFS.

Running XCP 8.3 (March 2026 updates) and XO (master a2e33).

With the NBD connection count set to 2, replication leaves many VDIs attached to the Control Domain every day. Changing it to 1 seems to resolve the issue (no more VDIs stuck on the Control Domain).

poddingue

Thanks for the detailed report and the NBD=2 vs NBD=1 correlation, Andrew, that's a genuinely useful clue.
From what I understand, a VDI staying attached to dom0 like this tends to point at the storage side on the XCP-ng host rather than XO's replication logic itself, so it's probably worth a look from @Team-Storage.
To give them something concrete to chase, would you be able to share your SR type/storage backend, plus the SMlog (and kern.log) from around the time a VDI gets left stuck?
With those details, the storage folks would have a much better starting point.
Thanks again for staying on top of it.

Andrew

@poddingue Since setting NBD=1, I have not seen the problem.

The SR is NFS over dual 40G Ethernet to a TrueNAS SCALE 25.10 server using all-NVMe SSDs, so storage performance is as good as I can make it.

I'll have to enable NBD=2 again to see if it still happens and whether I can find the relevant part of the logs. As this is a random problem, I can't recreate it in a normal test environment.