Snapshot chain too long in NFS SR

alejandro-anv

I'm using XCP-ng with an SR on a NAS over NFS.

I enabled scheduled snapshots to take one snapshot per day and keep 7 of them. It seems this produced a long snapshot chain. I disabled the schedule, but now I can't create new snapshots, even after deleting all of them.

I'm using vhd-util to check the problem and try to solve it. I see the chain depth is 30, so that may be the problem:

# vhd-util query -vsfd -p -n 09f010db-b4d1-4bea-9f2d-9cb8816241ca.vhd
153600
161300521472
/run/sr-mount/10410bc3-b762-0b99-6a0b-e61b091de848/7f834393-f763-4481-8188-c499afe53d9c.vhd
hidden: 0
chain depth: 30
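
A minimal sketch, assuming the output format shown above: the depth check can be wrapped in small shell helpers so it is easy to repeat for each VHD. The names `chain_depth` and `parent_path` are made up for illustration; they only parse the text that vhd-util prints and assume the parent is reported as an absolute path on its own line.

```shell
# Hypothetical helpers: both read the output of
#   vhd-util query -vsfd -p -n <file>.vhd
# on stdin and rely only on the line formats shown above.

# Print the number after "chain depth:" (nothing if the line is missing).
chain_depth() {
    awk -F': *' '/^chain depth:/ { print $2 }'
}

# Print the parent VHD path, assumed to be the line that starts with "/".
parent_path() {
    awk '/^\// { print }'
}
```

For example, from the SR mount directory: `vhd-util query -vsfd -p -n 09f010db-b4d1-4bea-9f2d-9cb8816241ca.vhd | chain_depth`.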

The documentation (https://xcp-ng.org/docs/storage.html#coalesce) says coalescing happens when a snapshot is removed, but I tried to coalesce the disk manually, so I ran:

# vhd-util coalesce -p -n 7f834393-f763-4481-8188-c499afe53d9c.vhd

It takes a long time, but afterwards the problem persists and the chain depth is still 30. Maybe vhd-util coalesce does not work on NFS, or something like that? Maybe I'm selecting the wrong file? (I chose the only one that is not hidden, thinking it's the end of the chain.)

I read that a "quick" solution is to copy the machine to another SR, but it's in use, so I can't stop it to make the copy, and I can't take a snapshot and copy that while the original machine is running, because snapshots report an error.

Any suggestions, please?

Darkbeldin

@alejandro-anv

You should take a look in XOA, because you probably have VDIs to coalesce stuck there.
After that, looking at the SMlog of your host would be the next step, to see what's causing the coalesce to be stuck.

alejandro-anv

@Darkbeldin said in Snapshot chain too long in NFS SR:

You should take a look in XOA, because you probably have VDIs to coalesce stuck there.

I'm checking XOA, but I see nothing that gives me an idea. I can't even see the VDI UUID in XOA.

After that, looking at the SMlog of your host would be the next step, to see what's causing the coalesce to be stuck.

Maybe this is related to the problem?

Apr 20 10:23:05 toad SMGC: [20485] *~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*
Apr 20 10:23:05 toad SMGC: [20485]          ***********************
Apr 20 10:23:05 toad SMGC: [20485]          *  E X C E P T I O N  *
Apr 20 10:23:05 toad SMGC: [20485]          ***********************
Apr 20 10:23:05 toad SMGC: [20485] gc: EXCEPTION <class 'util.SMException'>, os.unlink(/var/run/sr-mount/10410bc3-b762-0b99-6a0b-e61b091de848/088f49c5-07c0-483f-ae13-d4a9306e9c8b.vhd) failed
Apr 20 10:23:05 toad SMGC: [20485]   File "/opt/xensource/sm/cleanup.py", line 3354, in gc
Apr 20 10:23:05 toad SMGC: [20485]     _gc(None, srUuid, dryRun)
Apr 20 10:23:05 toad SMGC: [20485]   File "/opt/xensource/sm/cleanup.py", line 3239, in _gc
Apr 20 10:23:05 toad SMGC: [20485]     _gcLoop(sr, dryRun)
Apr 20 10:23:05 toad SMGC: [20485]   File "/opt/xensource/sm/cleanup.py", line 3205, in _gcLoop
Apr 20 10:23:05 toad SMGC: [20485]     sr.garbageCollect(dryRun)
Apr 20 10:23:05 toad SMGC: [20485]   File "/opt/xensource/sm/cleanup.py", line 1794, in garbageCollect
Apr 20 10:23:05 toad SMGC: [20485]     self.deleteVDIs(vdiList)
Apr 20 10:23:05 toad SMGC: [20485]   File "/opt/xensource/sm/cleanup.py", line 2370, in deleteVDIs
Apr 20 10:23:05 toad SMGC: [20485]     SR.deleteVDIs(self, vdiList)
Apr 20 10:23:05 toad SMGC: [20485]   File "/opt/xensource/sm/cleanup.py", line 1808, in deleteVDIs
Apr 20 10:23:05 toad SMGC: [20485]     self.deleteVDI(vdi)
Apr 20 10:23:05 toad SMGC: [20485]   File "/opt/xensource/sm/cleanup.py", line 2466, in deleteVDI
Apr 20 10:23:05 toad SMGC: [20485]     SR.deleteVDI(self, vdi)
Apr 20 10:23:05 toad SMGC: [20485]   File "/opt/xensource/sm/cleanup.py", line 1817, in deleteVDI
Apr 20 10:23:05 toad SMGC: [20485]     vdi.delete()
Apr 20 10:23:05 toad SMGC: [20485]   File "/opt/xensource/sm/cleanup.py", line 1093, in delete
Apr 20 10:23:05 toad SMGC: [20485]     raise util.SMException("os.unlink(%s) failed" % self.path)
Apr 20 10:23:05 toad SMGC: [20485]
Apr 20 10:23:05 toad SMGC: [20485] *~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*
Apr 20 10:23:05 toad SMGC: [20485] * * * * * SR 10410bc3-b762-0b99-6a0b-e61b091de848: ERROR
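
A small sketch for pulling the affected file out of a trace like this, assuming the `os.unlink(<path>) failed` wording shown in the log above. `failed_unlinks` is a made-up helper name; it only reads log text on stdin.

```shell
# Hypothetical helper: print each distinct path that the garbage collector
# reported as "os.unlink(<path>) failed". Only absolute paths are matched,
# so the Python source line containing "os.unlink(%s) failed" is skipped.
failed_unlinks() {
    sed -n 's|.*os\.unlink(\(/[^)]*\)) failed.*|\1|p' | sort -u
}
```

Running `failed_unlinks < /var/log/SMlog` lists the files the garbage collector could not delete; checking each one with `ls -l` shows whether it still exists.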

Darkbeldin

@alejandro-anv
Most likely you have a stuck mount on your SR that is blocking the coalesce process.
You should try to find out why this mount is stuck.

alejandro-anv

@Darkbeldin said in Snapshot chain too long in NFS SR:

@alejandro-anv
Most likely you have a stuck mount on your SR that is blocking the coalesce process.
You should try to find out why this mount is stuck.

Sorry, but I don't understand what you mean by a stuck mount. Do you mean it could be a problem with the SR mount? I've checked the mount point, and I can run ls and get information about the files without problems. I had network problems with this SR earlier, but right now it's working.

Darkbeldin

@alejandro-anv From the first log line, it seems the coalesce process is trying to unmount

/var/run/sr-mount/10410bc3-b762-0b99-6a0b-e61b091de848/088f49c5-07c0-483f-ae13-d4a9306e9c8b.vhd) failed

But I could be wrong.

alejandro-anv

@Darkbeldin said in Snapshot chain too long in NFS SR:

@alejandro-anv From the first log line, it seems the coalesce process is trying to unmount

/var/run/sr-mount/10410bc3-b762-0b99-6a0b-e61b091de848/088f49c5-07c0-483f-ae13-d4a9306e9c8b.vhd) failed

But I could be wrong.

I see this file no longer exists...

ls: cannot access /var/run/sr-mount/10410bc3-b762-0b99-6a0b-e61b091de848/088f49c5-07c0-483f-ae13-d4a9306e9c8b.vhd: No such file or directory

Darkbeldin

@alejandro-anv You can try restarting the toolstack on your host first to see if it helps.

alejandro-anv

More investigation on this (maybe it helps to find the cause of the problem).

I manually ran vhd-util coalesce --debug -p -n xxxxxxx.vhd:

# vhd-util coalesce --debug -p -n 1a1a7f05-220b-43cd-9337-2e6ebd5a43ef.vhd

I checked the file descriptors of the process and saw it was using some files. Mainly, it keeps open /run/sr-mount/10410bc3-b762-0b99-6a0b-e61b091de848/1a1a7f05-220b-43cd-9337-2e6ebd5a43ef.vhd (which is the target of the coalesce process) and /run/sr-mount/10410bc3-b762-0b99-6a0b-e61b091de848/2d2fb989-6353-42fa-be07-c6e4d6c87cd6.vhd, but it also opens and closes other files (probably the components of the chain).

It ends without error, but both files (of the same size) are kept, and the original one keeps reporting a chain depth of 30. The second file is marked as hidden and shows a depth of 29.

In SMlog I see this:

Apr 20 11:34:42 peach SM: [15258] lock: opening lock file /var/lock/sm/1a1a7f05-220b-43cd-9337-2e6ebd5a43ef/vdi
Apr 20 11:34:42 peach SM: [15258] lock: acquired /var/lock/sm/1a1a7f05-220b-43cd-9337-2e6ebd5a43ef/vdi
Apr 20 11:34:42 peach SM: [15258] Pause for 1a1a7f05-220b-43cd-9337-2e6ebd5a43ef
Apr 20 11:34:42 peach SM: [15258] Calling tap pause with minor 12
Apr 20 11:34:42 peach SM: [15258] ['/usr/sbin/tap-ctl', 'pause', '-p', '32637', '-m', '12']
Apr 20 11:34:42 peach SM: [15258]  = 0
Apr 20 11:34:42 peach SM: [15258] lock: released /var/lock/sm/1a1a7f05-220b-43cd-9337-2e6ebd5a43ef/vdi
Apr 20 11:34:46 peach SM: [15284] lock: opening lock file /var/lock/sm/1a1a7f05-220b-43cd-9337-2e6ebd5a43ef/vdi
Apr 20 11:34:46 peach SM: [15284] lock: acquired /var/lock/sm/1a1a7f05-220b-43cd-9337-2e6ebd5a43ef/vdi
Apr 20 11:34:46 peach SM: [15284] Unpause for 1a1a7f05-220b-43cd-9337-2e6ebd5a43ef
Apr 20 11:34:46 peach SM: [15284] Realpath: /var/run/sr-mount/10410bc3-b762-0b99-6a0b-e61b091de848/1a1a7f05-220b-43cd-9337-2e6ebd5a43ef.vhd
Apr 20 11:34:46 peach SM: [15284] lock: opening lock file /var/lock/sm/10410bc3-b762-0b99-6a0b-e61b091de848/sr
Apr 20 11:34:46 peach SM: [15284] ['/usr/sbin/td-util', 'query', 'vhd', '-vpfb', '/var/run/sr-mount/10410bc3-b762-0b99-6a0b-e61b091de848/1a1a7f05-220b-43cd-9337-2e6ebd5a43ef.vhd']
Apr 20 11:34:46 peach SM: [15284]   pread SUCCESS
Apr 20 11:34:46 peach SM: [15284] Calling tap unpause with minor 12
Apr 20 11:34:46 peach SM: [15284] ['/usr/sbin/tap-ctl', 'unpause', '-p', '32637', '-m', '12', '-a', 'vhd:/var/run/sr-mount/10410bc3-b762-0b99-6a0b-e61b091de848/1a1a7f05-220b-43cd-9337-2e6ebd5a43ef.vhd']
Apr 20 11:34:48 peach SM: [15284]  = 0
Apr 20 11:34:48 peach SM: [15284] lock: released /var/lock/sm/1a1a7f05-220b-43cd-9337-2e6ebd5a43ef/vdi

But when I check, I see:

# vhd-util query -vsfd -p -n 1a1a7f05-220b-43cd-9337-2e6ebd5a43ef.vhd
20480
9451459072
/run/sr-mount/10410bc3-b762-0b99-6a0b-e61b091de848/2d2fb989-6353-42fa-be07-c6e4d6c87cd6.vhd
hidden: 0
chain depth: 30

Maybe it's a bug? It looks like the coalesce process finishes but does nothing...

alejandro-anv

@Darkbeldin I already restarted the toolstack. It didn't help.

Darkbeldin

@alejandro-anv Do you have enough free space on your SR?

alejandro-anv

@Darkbeldin said in Snapshot chain too long in NFS SR:

Do you have enough free space on your SR?

Yes. The SR is a NAS with 16 TB, and it has 4 TB available (76% used).

alejandro-anv

@Darkbeldin said in Snapshot chain too long in NFS SR:

You should take a look in XOA, because you probably have VDIs to coalesce stuck there.

How do I check this?

Darkbeldin

@alejandro-anv Go to your SR's Advanced tab; you should see the VDIs that need coalescing, if there are any. You can also check the Dashboard > Health panel if you're on a recent version.

alejandro-anv

@Darkbeldin said in Snapshot chain too long in NFS SR:

@alejandro-anv Go to your SR's Advanced tab; you should see the VDIs that need coalescing, if there are any. You can also check the Dashboard > Health panel if you're on a recent version.

Yes. It shows VDIs to coalesce in ALL the SRs I have. The question is why it's not doing it by itself, and how I can force it to happen...

Darkbeldin

@alejandro-anv You can try to rescan the SR from XOA, but most likely something is stuck and blocking your coalesce. Unfortunately, this is not easy to solve. Sometimes you can copy the VM that is causing the issue to create a new chain, but with your error I'm not sure what the real issue behind it is.

Anonabhar

@alejandro-anv I have had problems like this on my NFS and iSCSI SRs previously.

One thing that might help is to shut down the VMs before doing a rescan. If my memory serves me, it then does a different kind of coalesce (online vs. offline), which has a better chance of success.

Also, be aware that there is a 5-minute delay between the rescan and the time it actually starts to do work on the drives.

When I have a problem like this, I normally ssh into the pool master and tail the log files to watch it work. For example:

tail -f /var/log/SMlog | grep SMGC

It's boring, but it gives me a bit of comfort between prayers 8-)
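
To cut the noise a little, here is a sketch of a narrower filter, assuming the `SMGC`, `EXCEPTION` and `ERROR` markers seen in the log excerpts earlier in this thread. `smgc_errors` is a made-up name, not an existing tool.

```shell
# Hypothetical filter: keep only SMGC lines that mention EXCEPTION or ERROR.
# fflush() pushes each match out immediately, so it also works behind tail -f.
smgc_errors() {
    awk '/SMGC/ && /EXCEPTION|ERROR/ { print; fflush() }'
}
```

It can replace the grep in the command above: `tail -f /var/log/SMlog | smgc_errors`.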