VDI Won't Coalesce (shows orphaned but isn't)
-
Check the logs (SMlog in particular) -- https://xcp-ng.org/docs/troubleshooting.html#log-files
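If it helps, here's a minimal sketch for pulling just the coalesce-related lines out of that log. The marker strings and the sample lines are my own assumptions; on a real host you'd read `/var/log/SMlog` instead of the inline sample:

```python
# Sketch: filter coalesce-related errors out of SMlog.
# Shown against a small inline sample so it runs anywhere; on an
# XCP-ng host, read /var/log/SMlog instead.
sample = """\
Aug 27 14:14:29 xcp-ng-1 SMGC: [17466] coalesce: EXCEPTION <class 'util.CommandException'>, Invalid argument
Aug 27 14:14:29 xcp-ng-1 SMGC: [17466] Coalesce failed, skipping
Aug 27 14:14:30 xcp-ng-1 SM: [17500] lock: acquired
"""

def coalesce_errors(text):
    # Markers chosen from the messages SMGC logs on a failed coalesce.
    markers = ("EXCEPTION", "Coalesce failed")
    return [line for line in text.splitlines()
            if any(m in line for m in markers)]

for line in coalesce_errors(sample):
    print(line)
```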
-
@danp I'll give this a look.
After a little more testing, it appears the coalesce is NOT failing but rather happening when I don't expect it to. After waiting a while, the coalesce finished and is no longer listed in the SR (and the original VDI is the correct combined size, as it should be).
I wonder if some kind of snapshot is happening without me realizing it.
As I'm typing this another coalesce process showed up so I'll keep digging.
-
So this is what I'm seeing related to this VDI:
Aug 27 14:14:29 xcp-ng-1 SMGC: [17466] coalesce: EXCEPTION <class 'util.CommandException'>, Invalid argument
Aug 27 14:14:29 xcp-ng-1 SMGC: [17466]   File "/opt/xensource/sm/cleanup.py", line 1753, in coalesce
Aug 27 14:14:29 xcp-ng-1 SMGC: [17466]     self._coalesce(vdi)
Aug 27 14:14:29 xcp-ng-1 SMGC: [17466]   File "/opt/xensource/sm/cleanup.py", line 1942, in _coalesce
Aug 27 14:14:29 xcp-ng-1 SMGC: [17466]     vdi._doCoalesce()
Aug 27 14:14:29 xcp-ng-1 SMGC: [17466]   File "/opt/xensource/sm/cleanup.py", line 766, in _doCoalesce
Aug 27 14:14:29 xcp-ng-1 SMGC: [17466]     self.parent._increaseSizeVirt(self.sizeVirt)
Aug 27 14:14:29 xcp-ng-1 SMGC: [17466]   File "/opt/xensource/sm/cleanup.py", line 969, in _increaseSizeVirt
Aug 27 14:14:29 xcp-ng-1 SMGC: [17466]     self._setSizeVirt(size)
Aug 27 14:14:29 xcp-ng-1 SMGC: [17466]   File "/opt/xensource/sm/cleanup.py", line 984, in _setSizeVirt
Aug 27 14:14:29 xcp-ng-1 SMGC: [17466]     vhdutil.setSizeVirt(self.path, size, jFile)
Aug 27 14:14:29 xcp-ng-1 SMGC: [17466]   File "/opt/xensource/sm/vhdutil.py", line 237, in setSizeVirt
Aug 27 14:14:29 xcp-ng-1 SMGC: [17466]     ioretry(cmd)
Aug 27 14:14:29 xcp-ng-1 SMGC: [17466]   File "/opt/xensource/sm/vhdutil.py", line 102, in ioretry
Aug 27 14:14:29 xcp-ng-1 SMGC: [17466]     errlist = [errno.EIO, errno.EAGAIN])
Aug 27 14:14:29 xcp-ng-1 SMGC: [17466]   File "/opt/xensource/sm/util.py", line 330, in ioretry
Aug 27 14:14:29 xcp-ng-1 SMGC: [17466]     return f()
Aug 27 14:14:29 xcp-ng-1 SMGC: [17466]   File "/opt/xensource/sm/vhdutil.py", line 101, in <lambda>
Aug 27 14:14:29 xcp-ng-1 SMGC: [17466]     return util.ioretry(lambda: util.pread2(cmd),
Aug 27 14:14:29 xcp-ng-1 SMGC: [17466]   File "/opt/xensource/sm/util.py", line 227, in pread2
Aug 27 14:14:29 xcp-ng-1 SMGC: [17466]     return pread(cmdlist, quiet = quiet)
Aug 27 14:14:29 xcp-ng-1 SMGC: [17466]   File "/opt/xensource/sm/util.py", line 190, in pread
Aug 27 14:14:29 xcp-ng-1 SMGC: [17466]     raise CommandException(rc, str(cmdlist), stderr.strip())
Aug 27 14:14:29 xcp-ng-1 SMGC: [17466]
Aug 27 14:14:29 xcp-ng-1 SMGC: [17466] *~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*
Aug 27 14:14:29 xcp-ng-1 SMGC: [17466] Coalesce failed, skipping
Is this maybe an issue with the resizing I tried on this VDI (and appeared successful) when I did the migration?
-
To add some additional detail:
Once XOA stops showing the VDI being coalesced, the cycle starts again: first it shows this VDI with a depth of 1 that needs to be coalesced, then that changes to 2 a few minutes later, then the coalesce fails again and the loop restarts.
-
So you have a coalesce problem on the host.
-
@olivierlambert Any idea what that problem is? Coalesce works 100% perfectly for all other VMs on this host, never a single issue, so I'm not sure why it's happening with this one.
Or does this look like it should be a host wide issue?
-
@olivierlambert You don't say!
-
So I was able to confirm that other VMs for sure are coalescing just fine.
While I was digging through the logs for that, I noticed something; a little background might help.
Originally this VDI was 160GB when transferred as a VHD from Hyper-V. I then migrated it to another SR and back to the local one, then resized it to 180GB.
The SR VDI Chain shows this:
Aug 27 15:00:51 xcp-ng-1 SMGC: [14113] *2d12ea03(160.000G/142.467G)
Aug 27 15:00:51 xcp-ng-1 SMGC: [14113] *156a132d(180.000G/41.425G)
Aug 27 15:00:51 xcp-ng-1 SMGC: [14113] fb2d9abb(180.000G/10.383M)
Which leads me to wonder if the resize caused some issues or something. Additionally I did have a 160GB orphaned VDI from the other SR which I deleted from the Health page.
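To sanity-check my reading of those chain lines, here's a small sketch. I'm assuming (not confirmed) that each line is uuid(virtual size / space actually allocated in that VHD), with a leading `*` marking non-leaf nodes:

```python
# Sketch: parse SMGC chain lines of the form "*uuid(virtual/allocated)".
# Assumption: the first figure is the VDI's virtual size, the second the
# space actually allocated in that VHD (so a fresh leaf can show only
# a few MB, like 10.383M).
import re

def parse_chain_line(line):
    m = re.search(r'\*?([0-9a-f]+)\(([\d.]+)([GM])/([\d.]+)([GM])\)', line)
    if not m:
        return None
    uuid, virt, vu, alloc, au = m.groups()
    def to_gib(value, unit):
        return float(value) if unit == 'G' else float(value) / 1024
    return uuid, to_gib(virt, vu), to_gib(alloc, au)

for raw in [
    "*2d12ea03(160.000G/142.467G)",
    "*156a132d(180.000G/41.425G)",
    "fb2d9abb(180.000G/10.383M)",
]:
    uuid, virt, alloc = parse_chain_line(raw)
    print(f"{uuid}: virtual {virt:.3f} GiB, allocated {alloc:.3f} GiB")
```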
-
@planedrop said in VDI Won't Coalesce (shows orphaned but isn't):
So I was able to confirm that other VMs for sure are coalescing just fine.
While I was digging through the logs for that, I noticed something, a little background might help.
Originally this VDI was 160GB when transferred as VHD from Hyper-V, I then migrated to another SR and back to the local one, then resized it to 180GB.
The SR VDI Chain shows this:
Aug 27 15:00:51 xcp-ng-1 SMGC: [14113] *2d12ea03(160.000G/142.467G)
Aug 27 15:00:51 xcp-ng-1 SMGC: [14113] *156a132d(180.000G/41.425G)
Aug 27 15:00:51 xcp-ng-1 SMGC: [14113] fb2d9abb(180.000G/10.383M)
Which leads me to wonder if the resize caused some issues or something. Additionally I did have a 160GB orphaned VDI from the other SR which I deleted from the Health page.
If someone can clarify what the above means when it comes to the VDI chain, that'd be awesome. The 41.425G one in particular: I'm not sure what it means. Is this indicating that the original size used was 142.467G (which is correct) and that after the increase to 180GB it's only using 41.425G? Or is that 41.425G a reference of some sort?
Thanks again for any help, this one is really tripping me up.
-
As long as you have "exception" displayed in the SMlog, you have coalesce issues on that SR. Could be the SR itself or a broken VDI.
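A sketch of what that check could look like, wrapped in Python so the chain walk is explicit. The path below is a hypothetical placeholder (file-based SRs mount under `/run/sr-mount/<sr-uuid>` as I understand it; LVM-based SRs store VDIs as logical volumes instead), and the flags are from the standard vhd-util tool as I know it:

```python
# Sketch only: build the vhd-util commands to validate a VHD and find
# its parent. Paths are hypothetical placeholders; substitute your SR
# mount point and VDI UUID on a real XCP-ng host.
import subprocess

def check_cmds(vhd_path):
    return [
        ["vhd-util", "check", "-n", vhd_path],        # header/footer integrity
        ["vhd-util", "query", "-n", vhd_path, "-p"],  # parent path, to walk the chain
    ]

def run_checks(vhd_path, dry_run=True):
    for cmd in check_cmds(vhd_path):
        print(" ".join(cmd))
        if not dry_run:
            # Only run this where vhd-util actually exists (an XCP-ng host):
            subprocess.run(cmd, check=True)

run_checks("/run/sr-mount/<sr-uuid>/fb2d9abb.vhd")
```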
You could check the problematic VHD and its parents with vhd-util to see if there are header or footer issues. Alternatively, you can migrate it to another SR, check whether coalesce is back on track, then migrate it back.
-
@olivierlambert I'm trying the migration option right now; if that doesn't work I'll do some digging with vhd-util. I don't think there is an issue with the SR as a whole though, as I've had a very large number of successful snapshots and coalesces on other VDIs on this SR, and none of them ever came back with exceptions or anything like that, so I'd guess a broken VDI.
I'll report back my findings and go from there, if I don't have it figured out this weekend I'll submit an official support ticket about it.
Thanks for the help here!!
-
@olivierlambert So I may have figured out what happened, wanted to see if this sounds possible.
I think I mistakenly snapshotted this VDI after moving it to another SR, then moved it back without first deleting that snapshot, THEN resized the VDI.
So I don't think it was able to merge the snapshots.
After moving it to that other SR and then back to our main SR, it hasn't tried to coalesce at all and I'm not seeing any exceptions in the SMLog.
Going to boot this VM back up and see if the issue comes back later, but it's been an entire day now with no exceptions, and it was having them about every 30 minutes.
-
So another odd thing I'm seeing with this VDI: it's showing the size incorrectly. It shows 180GB of 180GB used (on a thin-provisioned SR; both the old and new ones are), however the VM is only using 140GB of that 180GB.
Something definitely went wrong with this VDI during transfer, just not sure what.
I will say that I increased the VDI size again and now it displays more accurately, showing 180GB of 185GB used (both in XOA and with vhd-util). It's almost behaving as if it was at one point on a thick-provisioned SR or something.
Just to avoid issues, I'm tempted to create a fresh VHD, copy the data to that, then delete this one.