    Latest posts made by alejandro-anv

    • RE: Snapshot chain too long in NFS SR

      @Darkbeldin said in Snapshot chain too long in NFS SR:

      @alejandro-anv Go to your SR tab; under Advanced you should see the coalesces needed, if you have any. You can also check Dashboard > Health panel if you're on a recent version.

      Yes. It shows VDIs to coalesce in ALL the SRs I have. The problem is why it's not doing it by itself, and how I can force it to be done...
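
      A rescan of the SR is one way to nudge the garbage collector into re-evaluating the coalesce queue. A minimal sketch, assuming the stock SMlog location (<sr-uuid> is a placeholder):

      # Trigger a storage rescan, which also kicks off a GC/coalesce pass.
      xe sr-scan uuid=<sr-uuid>
      # Then watch the coalesce activity live:
      tail -f /var/log/SMlog | grep -i coalesce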

      posted in Compute
    • RE: Snapshot chain too long in NFS SR

      @Darkbeldin said in Snapshot chain too long in NFS SR:

      You should take a look into XOA, because you probably have VDIs to coalesce stuck there.

      How do I check this?

      posted in Compute
    • RE: Is there something like VMRC (VMware Remote Console) for XCP-ng?

      @tomg said in Is there something like VMRC (VMware Remote Console) for XCP-ng?:

      If you don't want to use mgmt software like Xen Orchestra, you could just connect to the VNC console of the remote VM. This used to work via TCP ports, but as of a few years back qemu started using UNIX sockets by default for the VNC console.

      You can see the option being passed to qemu at run time:

      -vnc unix:/var/run/xen/vnc-3,lock-key-sync=off

      where 3 is the domain ID of the domU.

      SSH supports UNIX socket forwarding, but I had a hard time getting it to work with qemu's VNC socket; it could be a umask issue, but I didn't look further into it. Socat works well, the only caveat being that you need to install it on the Xen host. Here is a write-up on how to use it:

      https://www.nico.schottelius.org/blog/tunneling-qemu-kvm-unix-socket-via-ssh/

      HTH

      I think this is the right way. It should not be difficult to make a script that determines the socket associated with a VM and launches ssh with the right parameters to tunnel it.
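
      A minimal sketch of such a script, assuming socat is installed on the host and that the socket path follows the /var/run/xen/vnc-<domid> pattern shown above (the host name and local port are placeholders):

      #!/bin/bash
      # Hypothetical helper: forward a VM's qemu VNC UNIX socket to a local TCP port.
      # Usage: ./vnc-tunnel.sh <vm-name-label> [local-port]
      VM_NAME="$1"
      PORT="${2:-5901}"
      HOST="root@xcp-ng-host"   # assumption: adjust to your host

      # Look up the running VM's domain ID, which names the VNC socket.
      DOMID=$(ssh "$HOST" "xe vm-list name-label=\"$VM_NAME\" params=dom-id --minimal")

      # socat relays a TCP port on the host to the UNIX socket;
      # ssh -L brings that port back to the local machine.
      ssh -L "$PORT:localhost:$PORT" "$HOST" \
        "socat TCP-LISTEN:$PORT,bind=localhost,reuseaddr,fork UNIX-CONNECT:/var/run/xen/vnc-$DOMID"
      # Then point a VNC client at localhost:<local-port>.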

      posted in Compute
    • RE: Snapshot chain too long in NFS SR

      @Darkbeldin said in Snapshot chain too long in NFS SR:

      Do you have enough free space on your SR?

      Yes. The SR is a NAS with 16 TB, and it has 4 TB available (76% used).
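
      For reference, the SR's reported size and usage can be double-checked from the CLI (a sketch; the SR name is a placeholder, values are in bytes):

      xe sr-list name-label="<your NFS SR>" params=uuid,physical-size,physical-utilisation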

      posted in Compute
    • RE: Snapshot chain too long in NFS SR

      @Darkbeldin I already did that: I restarted the toolstack. It didn't help.
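
      (For reference, the toolstack restart in question, run on the host; running VMs are not affected:)

      xe-toolstack-restart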

      posted in Compute
    • RE: Snapshot chain too long in NFS SR

      More investigation about this (maybe it helps find the cause of the problem).

      I manually ran vhd-util coalesce --debug -p -n xxxxxxx.vhd:

      # vhd-util coalesce --debug -p -n 1a1a7f05-220b-43cd-9337-2e6ebd5a43ef.vhd
      

      I checked the file descriptors of the process, and I see it's using some files. Mainly, it keeps open /run/sr-mount/10410bc3-b762-0b99-6a0b-e61b091de848/1a1a7f05-220b-43cd-9337-2e6ebd5a43ef.vhd (which is the target of the coalesce process) and /run/sr-mount/10410bc3-b762-0b99-6a0b-e61b091de848/2d2fb989-6353-42fa-be07-c6e4d6c87cd6.vhd, but it also opens and closes other files (probably the members of the chain).
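
      (To reproduce the file-descriptor check: a quick sketch, assuming the coalesce is the only vhd-util process running:)

      # List the files the running coalesce holds open.
      PID=$(pgrep -f 'vhd-util coalesce')
      ls -l /proc/"$PID"/fd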

      It ends without error, but both files (of the same size) are kept, and the original one keeps reporting a chain depth of 30. The second file is marked as hidden and shows a depth of 29.

      In SMlog I see this:

      Apr 20 11:34:42 peach SM: [15258] lock: opening lock file /var/lock/sm/1a1a7f05-220b-43cd-9337-2e6ebd5a43ef/vdi
      Apr 20 11:34:42 peach SM: [15258] lock: acquired /var/lock/sm/1a1a7f05-220b-43cd-9337-2e6ebd5a43ef/vdi
      Apr 20 11:34:42 peach SM: [15258] Pause for 1a1a7f05-220b-43cd-9337-2e6ebd5a43ef
      Apr 20 11:34:42 peach SM: [15258] Calling tap pause with minor 12
      Apr 20 11:34:42 peach SM: [15258] ['/usr/sbin/tap-ctl', 'pause', '-p', '32637', '-m', '12']
      Apr 20 11:34:42 peach SM: [15258]  = 0
      Apr 20 11:34:42 peach SM: [15258] lock: released /var/lock/sm/1a1a7f05-220b-43cd-9337-2e6ebd5a43ef/vdi
      Apr 20 11:34:46 peach SM: [15284] lock: opening lock file /var/lock/sm/1a1a7f05-220b-43cd-9337-2e6ebd5a43ef/vdi
      Apr 20 11:34:46 peach SM: [15284] lock: acquired /var/lock/sm/1a1a7f05-220b-43cd-9337-2e6ebd5a43ef/vdi
      Apr 20 11:34:46 peach SM: [15284] Unpause for 1a1a7f05-220b-43cd-9337-2e6ebd5a43ef
      Apr 20 11:34:46 peach SM: [15284] Realpath: /var/run/sr-mount/10410bc3-b762-0b99-6a0b-e61b091de848/1a1a7f05-220b-43cd-9337-2e6ebd5a43ef.vhd
      Apr 20 11:34:46 peach SM: [15284] lock: opening lock file /var/lock/sm/10410bc3-b762-0b99-6a0b-e61b091de848/sr
      Apr 20 11:34:46 peach SM: [15284] ['/usr/sbin/td-util', 'query', 'vhd', '-vpfb', '/var/run/sr-mount/10410bc3-b762-0b99-6a0b-e61b091de848/1a1a7f05-220b-43cd-9337-2e6ebd5a43ef.vhd']
      Apr 20 11:34:46 peach SM: [15284]   pread SUCCESS
      Apr 20 11:34:46 peach SM: [15284] Calling tap unpause with minor 12
      Apr 20 11:34:46 peach SM: [15284] ['/usr/sbin/tap-ctl', 'unpause', '-p', '32637', '-m', '12', '-a', 'vhd:/var/run/sr-mount/10410bc3-b762-0b99-6a0b-e61b091de848/1a1a7f05-220b-43cd-9337-2e6ebd5a43ef.vhd']
      Apr 20 11:34:48 peach SM: [15284]  = 0
      Apr 20 11:34:48 peach SM: [15284] lock: released /var/lock/sm/1a1a7f05-220b-43cd-9337-2e6ebd5a43ef/vdi
      

      But when I check, I see:

      # vhd-util query -vsfd -p -n 1a1a7f05-220b-43cd-9337-2e6ebd5a43ef.vhd
      20480
      9451459072
      /run/sr-mount/10410bc3-b762-0b99-6a0b-e61b091de848/2d2fb989-6353-42fa-be07-c6e4d6c87cd6.vhd
      hidden: 0
      chain depth: 30
      

      Maybe it's a bug? It looks like the coalesce process ends but does nothing...

      posted in Compute
    • RE: Snapshot chain too long in NFS SR

      @Darkbeldin said in Snapshot chain too long in NFS SR:

      @alejandro-anv From the first log line, it seems the coalesce process failed trying to unlink

      /var/run/sr-mount/10410bc3-b762-0b99-6a0b-e61b091de848/088f49c5-07c0-483f-ae13-d4a9306e9c8b.vhd

      But I can be wrong.

      I see this file no longer exists...

      ls: cannot access /var/run/sr-mount/10410bc3-b762-0b99-6a0b-e61b091de848/088f49c5-07c0-483f-ae13-d4a9306e9c8b.vhd: No such file or directory
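
      One thing worth checking is whether some process still holds the deleted file open, since a deleted-but-open file can keep the GC failing. A sketch, assuming lsof is available on the host (yum install lsof if not):

      # Deleted-but-open files show up with "(deleted)" in the lsof output.
      lsof 2>/dev/null | grep 088f49c5-07c0-483f-ae13-d4a9306e9c8b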
      posted in Compute
    • RE: Snapshot chain too long in NFS SR

      @Darkbeldin said in Snapshot chain too long in NFS SR:

      @alejandro-anv
      Most likely you have a mount stuck on your SR that blocks the coalesce process.
      You should try to see why this mount is stuck.

      Sorry, but I don't understand what you mean by a stuck mount. Do you mean there could be a problem with the SR mount? I've checked the mount point, and I can ls and get info about files without problems. I had network problems with this SR, but right now it's working.
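
      Some quick health checks for the SR's NFS mount (a sketch; the SR UUID is a placeholder):

      mount | grep sr-mount                  # confirm the SR is mounted and note the options
      timeout 5 ls /run/sr-mount/<sr-uuid>   # a hang here suggests a stale NFS handle
      nfsstat -c                             # look for retransmissions / timeouts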

      posted in Compute
    • RE: Snapshot chain too long in NFS SR

      @Darkbeldin said in Snapshot chain too long in NFS SR:

      You should take a look into XOA, because you probably have VDIs to coalesce stuck there.

      I'm checking XOA, but I see nothing that gives me an idea. I can't even see the VDI UUID in XOA.

      After that, taking a look at the SMlog of your host would be the next step to see what's causing the coalesce to be stuck.
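
      (A quick way to pull such exception blocks out of SMlog, assuming the stock log location:)

      grep -B2 -A20 'E X C E P T I O N' /var/log/SMlog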

      Maybe this is related to the problem?

      Apr 20 10:23:05 toad SMGC: [20485] *~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*
      Apr 20 10:23:05 toad SMGC: [20485]          ***********************
      Apr 20 10:23:05 toad SMGC: [20485]          *  E X C E P T I O N  *
      Apr 20 10:23:05 toad SMGC: [20485]          ***********************
      Apr 20 10:23:05 toad SMGC: [20485] gc: EXCEPTION <class 'util.SMException'>, os.unlink(/var/run/sr-mount/10410bc3-b762-0b99-6a0b-e61b091de848/088f49c5-07c0-483f-ae13-d4a9306e9c8b.vhd) failed
      Apr 20 10:23:05 toad SMGC: [20485]   File "/opt/xensource/sm/cleanup.py", line 3354, in gc
      Apr 20 10:23:05 toad SMGC: [20485]     _gc(None, srUuid, dryRun)
      Apr 20 10:23:05 toad SMGC: [20485]   File "/opt/xensource/sm/cleanup.py", line 3239, in _gc
      Apr 20 10:23:05 toad SMGC: [20485]     _gcLoop(sr, dryRun)
      Apr 20 10:23:05 toad SMGC: [20485]   File "/opt/xensource/sm/cleanup.py", line 3205, in _gcLoop
      Apr 20 10:23:05 toad SMGC: [20485]     sr.garbageCollect(dryRun)
      Apr 20 10:23:05 toad SMGC: [20485]   File "/opt/xensource/sm/cleanup.py", line 1794, in garbageCollect
      Apr 20 10:23:05 toad SMGC: [20485]     self.deleteVDIs(vdiList)
      Apr 20 10:23:05 toad SMGC: [20485]   File "/opt/xensource/sm/cleanup.py", line 2370, in deleteVDIs
      Apr 20 10:23:05 toad SMGC: [20485]     SR.deleteVDIs(self, vdiList)
      Apr 20 10:23:05 toad SMGC: [20485]   File "/opt/xensource/sm/cleanup.py", line 1808, in deleteVDIs
      Apr 20 10:23:05 toad SMGC: [20485]     self.deleteVDI(vdi)
      Apr 20 10:23:05 toad SMGC: [20485]   File "/opt/xensource/sm/cleanup.py", line 2466, in deleteVDI
      Apr 20 10:23:05 toad SMGC: [20485]     SR.deleteVDI(self, vdi)
      Apr 20 10:23:05 toad SMGC: [20485]   File "/opt/xensource/sm/cleanup.py", line 1817, in deleteVDI
      Apr 20 10:23:05 toad SMGC: [20485]     vdi.delete()
      Apr 20 10:23:05 toad SMGC: [20485]   File "/opt/xensource/sm/cleanup.py", line 1093, in delete
      Apr 20 10:23:05 toad SMGC: [20485]     raise util.SMException("os.unlink(%s) failed" % self.path)
      Apr 20 10:23:05 toad SMGC: [20485]
      Apr 20 10:23:05 toad SMGC: [20485] *~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*
      Apr 20 10:23:05 toad SMGC: [20485] * * * * * SR 10410bc3-b762-0b99-6a0b-e61b091de848: ERROR
      
      posted in Compute
    • Snapshot chain too long in NFS SR

      I'm using XCP-ng with an SR on a NAS via NFS.

      I activated scheduled snapshots to take one snapshot per day and keep 7 of them. It seems this produced a long snapshot chain. I disabled it, but now I can't create new snapshots, even after deleting all of them.

      I'm using vhd-util to check the problem and try to solve it. I see the chain depth is 30, so that could be the problem:

      # vhd-util query -vsfd -p -n 09f010db-b4d1-4bea-9f2d-9cb8816241ca.vhd
      153600
      161300521472
      /run/sr-mount/10410bc3-b762-0b99-6a0b-e61b091de848/7f834393-f763-4481-8188-c499afe53d9c.vhd
      hidden: 0
      chain depth: 30
      

      The documentation (https://xcp-ng.org/docs/storage.html#coalesce) says coalescing is done when a snapshot is removed, but I wanted to coalesce the disk manually. So I ran:

      # vhd-util coalesce -p -n 7f834393-f763-4481-8188-c499afe53d9c.vhd
      

      It takes a long time, but afterwards the problem persists and the chain depth is still 30. Maybe vhd-util coalesce does not work on NFS or something like that? Or maybe I'm selecting the wrong file? (I chose the only one that is not hidden, thinking it's the end of the chain.)
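
      To rule out picking the wrong file, here is a minimal sketch that walks the parent chain from the non-hidden leaf and prints every member, using only the vhd-util query calls shown above (SR path and leaf name taken from this post):

      #!/bin/bash
      # Walk a VHD parent chain from a leaf and print each member.
      cd /run/sr-mount/10410bc3-b762-0b99-6a0b-e61b091de848
      VHD="09f010db-b4d1-4bea-9f2d-9cb8816241ca.vhd"   # the non-hidden leaf
      while [ -n "$VHD" ]; do
          vhd-util query -d -n "$VHD"        # prints "chain depth: N"
          echo "$VHD"
          # -p prints the parent's path; the chain root has no parent.
          PARENT=$(vhd-util query -p -n "$VHD")
          case "$PARENT" in
              */*.vhd) VHD=$(basename "$PARENT") ;;
              *)       VHD="" ;;             # reached the root of the chain
          esac
      done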

      I read that a "quick" solution is to copy the machine to another SR, but it's in use, so I can't stop it to make the copy; and I can't take a snapshot and copy that while the original machine is running, because snapshotting reports an error.

      Any suggestions, please?

      posted in Compute