VDI Won't Coalesce (shows orphaned but isn't)
-
To add some additional detail:
Once XOA stops showing the VDI being coalesced, the cycle starts again: first it shows the VDI has a depth of 1 that needs to be coalesced, then that changes to 2 a few minutes later, then it fails again and the loop restarts.
-
So you have a problem with coalesce on the host
-
@olivierlambert Any idea what that problem is? It works perfectly for all other VMs on this host; there's never been a single issue with coalesce, so I'm not sure why it's happening with this one.
Or does this look like it should be a host-wide issue?
-
@olivierlambert You don't say!
-
So I was able to confirm that other VMs are definitely coalescing just fine.
While digging through the logs for that, I noticed something; a little background might help.
Originally this VDI was 160GB when transferred as a VHD from Hyper-V. I then migrated it to another SR and back to the local one, then resized it to 180GB.
The SR VDI Chain shows this:
Aug 27 15:00:51 xcp-ng-1 SMGC: [14113] *2d12ea03(160.000G/142.467G)
Aug 27 15:00:51 xcp-ng-1 SMGC: [14113] *156a132d(180.000G/41.425G)
Aug 27 15:00:51 xcp-ng-1 SMGC: [14113] fb2d9abb(180.000G/10.383M)

Which leads me to wonder if the resize caused some issues or something. Additionally I did have a 160GB orphaned VDI from the other SR which I deleted from the Health page.
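For reference, here's a minimal sketch that pulls the sizes out of those SMGC lines. The uuid(virtual size/space actually used) reading is my assumption about the log format, not something confirmed here:

```shell
#!/bin/sh
# Sketch: extract the per-VDI sizes from the SMGC chain lines quoted above.
# Assumed format (not confirmed in this thread): uuid(virtual size/space the
# VHD actually occupies), with '*' marking a hidden parent in the chain.
chain=$(
grep -o '[*]*[0-9a-f]\{8\}([^)]*)' <<'EOF' | sed 's/[*(]/ /g; s/)//; s,/, ,'
Aug 27 15:00:51 xcp-ng-1 SMGC: [14113] *2d12ea03(160.000G/142.467G)
Aug 27 15:00:51 xcp-ng-1 SMGC: [14113] *156a132d(180.000G/41.425G)
Aug 27 15:00:51 xcp-ng-1 SMGC: [14113] fb2d9abb(180.000G/10.383M)
EOF
)
echo "$chain"
```

On the host you could feed it real lines with `grep SMGC /var/log/SMlog` instead of the inline sample.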
-
@planedrop said in VDI Won't Coalesce (shows orphaned but isn't):
So I was able to confirm that other VMs for sure are coalescing just fine.
While I was digging through the logs for that, I noticed something, a little background might help.
Originally this VDI was 160GB when transferred as VHD from Hyper-V, I then migrated to another SR and back to the local one, then resized it to 180GB.
The SR VDI Chain shows this:
Aug 27 15:00:51 xcp-ng-1 SMGC: [14113] *2d12ea03(160.000G/142.467G)
Aug 27 15:00:51 xcp-ng-1 SMGC: [14113] *156a132d(180.000G/41.425G)
Aug 27 15:00:51 xcp-ng-1 SMGC: [14113] fb2d9abb(180.000G/10.383M)

Which leads me to wonder if the resize caused some issues or something. Additionally I did have a 160GB orphaned VDI from the other SR which I deleted from the Health page.
If someone can clarify what the above means when it comes to the VDI chain, that'd be awesome. Take the 41.425G entry, for example: I'm not sure what it means. Is this indicating that the space originally used was 142.467G (which is correct), and that after the increase to 180GB it's only using 41.425G? Or is that 41.425G a reference of some sort?
Thanks again for any help, this one is really tripping me up.
-
As long as you have "exception" displayed in the SMlog, you have coalesce issues on that SR. Could be the SR itself or a broken VDI.
You could check the problematic VHD and its parents with vhd-util to see if there are header or footer issues. Alternatively, you can migrate it to another SR, check whether coalesce is back on track, then migrate it back.
-
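A sketch of those vhd-util checks, assuming a file-based SR; the path below is a placeholder of mine, not from this thread:

```shell
#!/bin/sh
# Sketch of the vhd-util checks suggested above, assuming a file-based SR.
# Substitute your real SR and VDI UUIDs, and run it on the XCP-ng host itself.
VHD="${1:-/var/run/sr-mount/SR_UUID/VDI_UUID.vhd}"

if command -v vhd-util >/dev/null 2>&1; then
    vhd-util check -n "$VHD"        # validate header/footer integrity
    vhd-util query -n "$VHD" -v     # report the virtual size
    vhd-util query -n "$VHD" -p     # report the parent VHD, if chained
else
    echo "vhd-util not found; run this on the XCP-ng host"
fi
```

Repeating the `check` for each parent reported by `-p` would walk the whole chain.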
@olivierlambert I'm trying the migration option right now; if that doesn't work I'll do some digging with vhd-util. I don't think there is an issue with the SR as a whole, though, as I've had a very large number of successful snapshots and coalesces on other VDIs on this SR, and none of them ever came up with exceptions or anything like that, so I'd guess a broken VDI.
I'll report back my findings and go from there, if I don't have it figured out this weekend I'll submit an official support ticket about it.
Thanks for the help here!!
-
@olivierlambert So I may have figured out what happened, wanted to see if this sounds possible.
I think I mistakenly snapshotted this VDI after moving it to another SR, then moved it back without first deleting that snapshot, THEN resized the VDI.
So I don't think it was able to merge the snapshots.
After moving it to that other SR and then back to our main SR, it hasn't tried to coalesce at all, and I'm not seeing any exceptions in the SMlog.
Going to boot this VM back up and see whether the issue comes back, but it's been an entire day now with no exceptions, and it was having them about every 30 minutes.
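For anyone wanting to watch for this themselves, a minimal sketch of counting exceptions in the log; the inline sample just reuses the chain lines quoted earlier so the sketch runs anywhere:

```shell
#!/bin/sh
# Sketch: count coalesce exceptions in a log window. On the host the real
# file is /var/log/SMlog; the sample here reuses lines quoted in this thread.
sample=$(mktemp)
cat > "$sample" <<'EOF'
Aug 27 15:00:51 xcp-ng-1 SMGC: [14113] *2d12ea03(160.000G/142.467G)
Aug 27 15:00:51 xcp-ng-1 SMGC: [14113] *156a132d(180.000G/41.425G)
Aug 27 15:00:51 xcp-ng-1 SMGC: [14113] fb2d9abb(180.000G/10.383M)
EOF

# Case-insensitive count; grep -c still prints 0 when nothing matches.
hits=$(grep -ci 'exception' "$sample" || true)
echo "exceptions: $hits"
if [ "$hits" -eq 0 ]; then
    echo "no exceptions in this window, coalesce looks healthy"
fi
rm -f "$sample"
```

On the actual host you'd run the same grep against /var/log/SMlog (and its rotated copies) while the GC is looping.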
-
So another odd thing I'm seeing with this VDI: it's showing the size incorrectly. It shows 180GB of 180GB used (on a thin-provisioned SR; both the old and new ones are), even though the VM is only using 140GB of that 180GB.
Something definitely went wrong with this VDI during transfer, just not sure what.
I will say that I increased the VDI size again and now it displays more accurately, showing 180GB of 185GB used (both in XOA and with vhd-util). It's almost behaving as if it was at one point on a thick-provisioned SR or something.
Just to avoid issues, I'm tempted to create a fresh VHD, copy the data over to it, then delete this one.