VDIs attached to the control domain can't be forgotten because they are attached to the control domain
-
I had backups fail earlier this week. After more than 24 hours they were still in the "started" state. I restarted the XO (from source) VM and the backups are listed as "Interrupted". I wasn't surprised by this. Since then, two of my VMs fail the backup with "VDI must be free or attached to exactly one VM". I was also not terribly surprised by that.
Looking at the health info I can see that the drives for those two VMs are listed under "VDIs attached to Control Domain". Since those backups failed the other day, I know it's safe to forget them. I click the forget button and they remain. I check the logs and the log says:
OPERATION_NOT_ALLOWED(VBD '<guid>' still attached to '<guid for id 0>')
I SSHed into the host and ran list_domains, and that's how I confirmed that the second GUID is for id 0, which I assume is the control domain since I see it frequently referred to as dom0, right?
That means I need to detach/forget the VDI from dom0, but I can't because it's attached to dom0. At least that's what it feels like is happening.
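In case it helps anyone searching later, my understanding of the CLI equivalent of what I'm trying is roughly this (the UUIDs are placeholders; I haven't verified these steps myself, so treat it as a sketch):
# On the XCP-ng host; <vdi-uuid> comes from the XO health page, <vbd-uuid> from vbd-list
list_domains                      # domain id 0 is dom0, the control domain
xe vbd-list vdi-uuid=<vdi-uuid>   # find the VBD linking the VDI to dom0
xe vbd-unplug uuid=<vbd-uuid>     # may fail with OPERATION_NOT_ALLOWED while something still holds the VBD
xe vbd-destroy uuid=<vbd-uuid>    # only once the unplug succeeds
xe vdi-forget uuid=<vdi-uuid>     # after that, the forget should go through (or use the XO button)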
I tried creating a snapshot for one of the VMs then deleting it to see if that would fix it. I had seen in another forum post that someone fixed it that way. Didn't work.
Must I reboot the host to fix this? This is my primary host and I don't have shared storage, so that's kind of a pain. I'll do it if that'll fix my backups, but I don't want to take such an extreme measure if there's another reasonable way to fix it.
-
@CodeMercenary Great question! Similar situation here. Sorry you didn't get a response.
What did you end up doing?
-
Hi,
It's really hard to answer because it could be many things. Without digging in, the easiest way is to reboot the Dom0. Sometimes you can have a stuck process holding a volume open (e.g. tapdisk). But the backup code is adding more and more fail-safes to avoid getting to that point, so please be sure to always run the latest commit.
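If you want to check for that before rebooting, something like this on the host should show whether a tapdisk is still holding a volume (just an illustration, your paths and output will differ):
# on the affected host
tap-ctl list               # running tapdisk instances: pid, minor, state, VHD path
ps -ef | grep [t]apdisk    # cross-check for tapdisk processes that should no longer be there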
-
@olivierlambert Thanks, my comment wasn't meant as a criticism. I appreciate everything you share so freely, both in software and in the forums! I was just wondering how another user got on.
I have benefitted immensely from Xen on bare metal over the years, and while some people push towards KVM, I found the import process to XCP-ng very smooth once I got python2 built on the Debian 12 dom0s that have served so well.
rdiff-backup was a really solid backup method for us, using LVM snapshots and iSCSI, so it's just taking some time to get used to the GUI and the new CLI, and to the fact that the CLI can help resolve things stuck in the GUI. Your forums and personal posts are gold.
Automated boot testing after a backup is fantastic.
Thanks for everything!
-
Don't worry, happy to help and to make sure everything works great for everyone (which is a challenge, knowing there are as many unique setups as there are users).
-
The last few weeks I keep getting this problem. The only way to detach the VDI from dom0 has been to reboot the host.
xe vbd-unplug uuid=...
didn't work in that case.
-
@andrewperry Sorry for the delay in responding; I think you posted while I was on vacation. I ended up rebooting the host and have not had the problem return since. Uh oh, I hope I didn't just jinx myself.
-
@CodeMercenary We have seen this multiple times before. If there is an issue with the backup job, or a backup fails unexpectedly, the NBD connection for that host is not closed correctly and it holds the snapshot. The only way we've found so far is rebooting the host to release this lock. However, I understand from Vates support that when this is the case there is a workaround to release the VDI by carefully killing the specific xapi-nbd process on that host. I don't know how this works, and I don't think it's something you should do yourself, but maybe support can assist you with it.
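Just to give an idea of what support would be looking at (I haven't done this myself, and the UUID is a placeholder), inspecting the host would look something like this, purely to identify the process rather than kill it:
# on the host that still holds the lock; look, don't kill, without support's guidance
ps -ef | grep [x]api-nbd                                  # list the xapi-nbd processes
xe vbd-list vdi-uuid=<vdi-uuid> currently-attached=true   # which VBD still references the stuck VDI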
-
@rtjdamen Good to know, thank you. I'm not likely to try that without fully understanding what I'm doing; I will just continue to reboot the host if this happens again. It did happen again a week or two ago. My infrastructure isn't so complex that it's impossible to reboot the host, it's just annoying because I have to stay late so nobody is using the servers. I've also had servers fail to boot after a restart that should have been trivial, so I'm always a bit nervous. That was years ago and not when running XCP-ng, but it left an emotional scar. I just had the same thing happen with my UNRAID server last Friday; I had to clear the BIOS settings to get it to boot again.
-
@CodeMercenary Haha, I know what you mean, restarting things always brings other issues. I'm happy a workaround is available to get the NBD connections released. I asked Vates if this is something that could be implemented in XOA in the future; that would make everybody's life a lot easier!
-
@rtjdamen I have become suspicious that my backups might not be as messed up as I thought. Yesterday I noticed that my backups were again in the "started" state long after they would normally be completed. Because some of those backups go to the UNRAID server I mentioned, I decided to do some digging. That UNRAID server does have incoming network activity, so I suspect the backups are working, just very slowly. I checked again this morning and one of the backups completed after 16 hours, the other after 21 hours. In this case I think the issue is that UNRAID is using the 1Gb adapter instead of the 10Gb adapter.
In the future I'm going to be more careful about deciding that a backup is stuck. I'd like to figure out if there's a way to get more insight into what is happening in a backup, like an ongoing percentage complete and a data transfer speed or total. It would be nice not to have to watch the receiving side for traffic and assume that's the backup, especially since on some of my backup targets it isn't as easy to see the incoming traffic.
Now I have to dare to reboot the UNRAID server again to see if I can get it to use the right network connection. It must have gotten out of whack when I reset the BIOS, and I need to get it back in whack.
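For next time, I think I can confirm which interface the backup traffic is actually landing on by sampling the byte counters on the UNRAID box (eth0 here is just a placeholder for whichever NIC I'm checking):
# run on the backup target; repeat for each candidate interface
cat /sys/class/net/eth0/statistics/rx_bytes   # note the value
sleep 10
cat /sys/class/net/eth0/statistics/rx_bytes   # (new value - old value) / 10 = bytes per second on that NIC
ip -s link                                    # or compare RX counters across all interfaces at once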
-
@CodeMercenary I have seen this as well. One of our backup repos was almost out of space, which decreased the backup speed; we thought it was an issue on the XOA side but it was just the repo itself. So it is good to check both. The 1Gb part makes sense.