    Snapshot chain too long in NFS SR

      alejandro-anv

      I'm using xcp-ng with an SR on a NAS via NFS.

      I activated scheduled snapshots to take one snapshot per day and keep 7 of them. It seems this produced a long snapshot chain. I disabled it, but now I can't create new snapshots, even after deleting all of them.

      I'm using vhd-util to check the problem and try to solve it. I see the chain depth is 30, so that could be the problem:

      # vhd-util query -vsfd -p -n 09f010db-b4d1-4bea-9f2d-9cb8816241ca.vhd
      153600
      161300521472
      /run/sr-mount/10410bc3-b762-0b99-6a0b-e61b091de848/7f834393-f763-4481-8188-c499afe53d9c.vhd
      hidden: 0
      chain depth: 30
      

      The documentation (https://xcp-ng.org/docs/storage.html#coalesce) says coalescing is done when a snapshot is removed, but I tried to coalesce the disk manually. Then I ran

      # vhd-util coalesce -p -n 7f834393-f763-4481-8188-c499afe53d9c.vhd
      

      It takes a long time, but afterwards the problem persists and the chain depth is still 30. Maybe vhd-util coalesce does not work on NFS or something like that? Maybe I'm selecting the wrong file? (I chose the only one that is not hidden, thinking it's the end of the chain.)
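
      For reference, a rough sketch of how the whole chain can be walked (just reusing the same vhd-util query flags as above, with the SR path from my setup):

      # print parent, hidden flag and chain depth for every VHD in the SR mount,
      # to see which file really is the head of the chain
      cd /run/sr-mount/10410bc3-b762-0b99-6a0b-e61b091de848
      for f in *.vhd; do
          echo "== $f"
          vhd-util query -n "$f" -p -f -d
      done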

      I read that a "quick" solution is to copy the machine to another SR, but it's in use, so I can't stop it to make the copy, and I can't take a snapshot to copy while the original machine is running, because taking snapshots reports an error.

      Any suggestions, please?

        Darkbeldin Vates 🪐 Pro Support Team @alejandro-anv

        @alejandro-anv

        You should take a look in XOA, because you probably have VDIs to coalesce stuck there.
        After that, taking a look at the SMlog of your host would be the next step, to see what's causing the coalesce to be stuck.
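
        For example, something like this on the pool master (just a generic sketch; adjust the paths if needed):

        # show recent garbage-collector / coalesce activity and any exceptions
        grep SMGC /var/log/SMlog | tail -n 100
        grep -i exception /var/log/SMlog | tail -n 20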

          alejandro-anv @Darkbeldin

          @Darkbeldin said in Snapshot chain too long in NFS SR:

          You should take a look in XOA, because you probably have VDIs to coalesce stuck there.

          I'm checking XOA but I see nothing that gives me an idea. I can't even see the VDI UUID in XOA.

          After that, taking a look at the SMlog of your host would be the next step, to see what's causing the coalesce to be stuck.

          Maybe this is related to the problem?

          Apr 20 10:23:05 toad SMGC: [20485] *~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*
          Apr 20 10:23:05 toad SMGC: [20485]          ***********************
          Apr 20 10:23:05 toad SMGC: [20485]          *  E X C E P T I O N  *
          Apr 20 10:23:05 toad SMGC: [20485]          ***********************
          Apr 20 10:23:05 toad SMGC: [20485] gc: EXCEPTION <class 'util.SMException'>, os.unlink(/var/run/sr-mount/10410bc3-b762-0b99-6a0b-e61b091de848/088f49c5-07c0-483f-ae13-d4a9306e9c8b.vhd) failed
          Apr 20 10:23:05 toad SMGC: [20485]   File "/opt/xensource/sm/cleanup.py", line 3354, in gc
          Apr 20 10:23:05 toad SMGC: [20485]     _gc(None, srUuid, dryRun)
          Apr 20 10:23:05 toad SMGC: [20485]   File "/opt/xensource/sm/cleanup.py", line 3239, in _gc
          Apr 20 10:23:05 toad SMGC: [20485]     _gcLoop(sr, dryRun)
          Apr 20 10:23:05 toad SMGC: [20485]   File "/opt/xensource/sm/cleanup.py", line 3205, in _gcLoop
          Apr 20 10:23:05 toad SMGC: [20485]     sr.garbageCollect(dryRun)
          Apr 20 10:23:05 toad SMGC: [20485]   File "/opt/xensource/sm/cleanup.py", line 1794, in garbageCollect
          Apr 20 10:23:05 toad SMGC: [20485]     self.deleteVDIs(vdiList)
          Apr 20 10:23:05 toad SMGC: [20485]   File "/opt/xensource/sm/cleanup.py", line 2370, in deleteVDIs
          Apr 20 10:23:05 toad SMGC: [20485]     SR.deleteVDIs(self, vdiList)
          Apr 20 10:23:05 toad SMGC: [20485]   File "/opt/xensource/sm/cleanup.py", line 1808, in deleteVDIs
          Apr 20 10:23:05 toad SMGC: [20485]     self.deleteVDI(vdi)
          Apr 20 10:23:05 toad SMGC: [20485]   File "/opt/xensource/sm/cleanup.py", line 2466, in deleteVDI
          Apr 20 10:23:05 toad SMGC: [20485]     SR.deleteVDI(self, vdi)
          Apr 20 10:23:05 toad SMGC: [20485]   File "/opt/xensource/sm/cleanup.py", line 1817, in deleteVDI
          Apr 20 10:23:05 toad SMGC: [20485]     vdi.delete()
          Apr 20 10:23:05 toad SMGC: [20485]   File "/opt/xensource/sm/cleanup.py", line 1093, in delete
          Apr 20 10:23:05 toad SMGC: [20485]     raise util.SMException("os.unlink(%s) failed" % self.path)
          Apr 20 10:23:05 toad SMGC: [20485]
          Apr 20 10:23:05 toad SMGC: [20485] *~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*
          Apr 20 10:23:05 toad SMGC: [20485] * * * * * SR 10410bc3-b762-0b99-6a0b-e61b091de848: ERROR
          
            Darkbeldin Vates 🪐 Pro Support Team @alejandro-anv

            @alejandro-anv
            Most likely you have a mount stuck on your SR that blocks the coalesce process.
            You should try to see why this mount is stuck.
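
            Something like this (a rough sketch using the SR UUID from your output) can show whether the NFS mount itself is still responsive:

            # check that the SR is mounted and that listing it does not hang
            mount | grep 10410bc3-b762-0b99-6a0b-e61b091de848
            time ls /run/sr-mount/10410bc3-b762-0b99-6a0b-e61b091de848 > /dev/null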

              alejandro-anv

              @Darkbeldin said in Snapshot chain too long in NFS SR:

              @alejandro-anv
              Most likely you have a mount stuck on your SR that blocks the coalesce process.
              You should try to see why this mount is stuck.

              Sorry, but I don't understand what you mean by a stuck mount. Do you mean it could be a problem with the SR mount? I've checked the mount point and I can ls and get info about files without problems. I had network problems with this SR earlier, but right now it's working.

                Darkbeldin Vates 🪐 Pro Support Team @alejandro-anv

                @alejandro-anv From the first log line it seems the coalesce process failed trying to unlink

                /var/run/sr-mount/10410bc3-b762-0b99-6a0b-e61b091de848/088f49c5-07c0-483f-ae13-d4a9306e9c8b.vhd

                But I could be wrong.

                  alejandro-anv @Darkbeldin

                  @Darkbeldin said in Snapshot chain too long in NFS SR:

                  @alejandro-anv From the first log line it seems the coalesce process failed trying to unlink

                  /var/run/sr-mount/10410bc3-b762-0b99-6a0b-e61b091de848/088f49c5-07c0-483f-ae13-d4a9306e9c8b.vhd

                  But I could be wrong.

                  I see this file no longer exists...

                  ls: cannot access /var/run/sr-mount/10410bc3-b762-0b99-6a0b-e61b091de848/088f49c5-07c0-483f-ae13-d4a9306e9c8b.vhd: No such file or directory
                    Darkbeldin Vates 🪐 Pro Support Team @alejandro-anv

                    @alejandro-anv You can try restarting the toolstack on your host first to see if it helps.
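
                    If it helps, on the host that would be something like (a sketch; it restarts the management toolstack but does not touch running VMs):

                    xe-toolstack-restart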

                      alejandro-anv @alejandro-anv

                      More investigation about this (maybe it helps in finding the cause of the problem).

                      I manually ran vhd-util coalesce --debug -p -n xxxxxxx.vhd:

                      # vhd-util coalesce --debug -p -n 1a1a7f05-220b-43cd-9337-2e6ebd5a43ef.vhd
                      

                      I checked the file descriptors of the process and I see it's using some files. Mainly, it keeps open /run/sr-mount/10410bc3-b762-0b99-6a0b-e61b091de848/1a1a7f05-220b-43cd-9337-2e6ebd5a43ef.vhd (which is the target of the coalesce process) and /run/sr-mount/10410bc3-b762-0b99-6a0b-e61b091de848/2d2fb989-6353-42fa-be07-c6e4d6c87cd6.vhd, but it also opens and closes other files (probably the components of the chain).
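
                      (For reference, roughly how I watched the open file descriptors; just standard /proc inspection, the pgrep pattern is simply the command I ran:)

                      # find the running coalesce process and list its open file descriptors
                      pid=$(pgrep -f 'vhd-util coalesce')
                      ls -l /proc/$pid/fd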

                      It ends without error, but both files (of the same size) are kept, and the original one keeps reporting a chain depth of 30. The second file is marked as hidden and shows a depth of 29.

                      In SMlog I see this:

                      Apr 20 11:34:42 peach SM: [15258] lock: opening lock file /var/lock/sm/1a1a7f05-220b-43cd-9337-2e6ebd5a43ef/vdi
                      Apr 20 11:34:42 peach SM: [15258] lock: acquired /var/lock/sm/1a1a7f05-220b-43cd-9337-2e6ebd5a43ef/vdi
                      Apr 20 11:34:42 peach SM: [15258] Pause for 1a1a7f05-220b-43cd-9337-2e6ebd5a43ef
                      Apr 20 11:34:42 peach SM: [15258] Calling tap pause with minor 12
                      Apr 20 11:34:42 peach SM: [15258] ['/usr/sbin/tap-ctl', 'pause', '-p', '32637', '-m', '12']
                      Apr 20 11:34:42 peach SM: [15258]  = 0
                      Apr 20 11:34:42 peach SM: [15258] lock: released /var/lock/sm/1a1a7f05-220b-43cd-9337-2e6ebd5a43ef/vdi
                      Apr 20 11:34:46 peach SM: [15284] lock: opening lock file /var/lock/sm/1a1a7f05-220b-43cd-9337-2e6ebd5a43ef/vdi
                      Apr 20 11:34:46 peach SM: [15284] lock: acquired /var/lock/sm/1a1a7f05-220b-43cd-9337-2e6ebd5a43ef/vdi
                      Apr 20 11:34:46 peach SM: [15284] Unpause for 1a1a7f05-220b-43cd-9337-2e6ebd5a43ef
                      Apr 20 11:34:46 peach SM: [15284] Realpath: /var/run/sr-mount/10410bc3-b762-0b99-6a0b-e61b091de848/1a1a7f05-220b-43cd-9337-2e6ebd5a43ef.vhd
                      Apr 20 11:34:46 peach SM: [15284] lock: opening lock file /var/lock/sm/10410bc3-b762-0b99-6a0b-e61b091de848/sr
                      Apr 20 11:34:46 peach SM: [15284] ['/usr/sbin/td-util', 'query', 'vhd', '-vpfb', '/var/run/sr-mount/10410bc3-b762-0b99-6a0b-e61b091de848/1a1a7f05-220b-43cd-9337-2e6ebd5a43ef.vhd']
                      Apr 20 11:34:46 peach SM: [15284]   pread SUCCESS
                      Apr 20 11:34:46 peach SM: [15284] Calling tap unpause with minor 12
                      Apr 20 11:34:46 peach SM: [15284] ['/usr/sbin/tap-ctl', 'unpause', '-p', '32637', '-m', '12', '-a', 'vhd:/var/run/sr-mount/10410bc3-b762-0b99-6a0b-e61b091de848/1a1a7f05-220b-43cd-9337-2e6ebd5a43ef.vhd']
                      Apr 20 11:34:48 peach SM: [15284]  = 0
                      Apr 20 11:34:48 peach SM: [15284] lock: released /var/lock/sm/1a1a7f05-220b-43cd-9337-2e6ebd5a43ef/vdi
                      

                      But when I check, I see:

                      # vhd-util query -vsfd -p -n 1a1a7f05-220b-43cd-9337-2e6ebd5a43ef.vhd
                      20480
                      9451459072
                      /run/sr-mount/10410bc3-b762-0b99-6a0b-e61b091de848/2d2fb989-6353-42fa-be07-c6e4d6c87cd6.vhd
                      hidden: 0
                      chain depth: 30
                      

                      Maybe it's a bug? It looks like the coalesce process ends but does nothing...

                        alejandro-anv @Darkbeldin

                        @Darkbeldin I already did that: I restarted the toolstack. It didn't help.

                          Darkbeldin Vates 🪐 Pro Support Team @alejandro-anv

                          @alejandro-anv Do you have enough free space on your SR?

                            alejandro-anv @Darkbeldin

                            @Darkbeldin said in Snapshot chain too long in NFS SR:

                            Do you have enough free space on your SR?

                            Yes. The SR is a NAS with 16 TB and it has 4 TB available (76% used).

                              alejandro-anv @Darkbeldin

                              @Darkbeldin said in Snapshot chain too long in NFS SR:

                              You should take a look in XOA, because you probably have VDIs to coalesce stuck there.

                              How do I check this?

                                Darkbeldin Vates 🪐 Pro Support Team @alejandro-anv

                                @alejandro-anv Go to your SR tab; under Advanced you should see the VDIs that need to be coalesced, if there are any. You can also check the Dashboard > Health panel if you're on a recent version.

                                  alejandro-anv @Darkbeldin

                                  @Darkbeldin said in Snapshot chain too long in NFS SR:

                                  @alejandro-anv Go to your SR tab; under Advanced you should see the VDIs that need to be coalesced, if there are any. You can also check the Dashboard > Health panel if you're on a recent version.

                                  Yes. It shows VDIs to coalesce in ALL the SRs I have. The question is why it's not doing it by itself, and how I can force it to be done...

                                    Darkbeldin Vates 🪐 Pro Support Team @alejandro-anv

                                    @alejandro-anv You can try to rescan the SR from XOA, but most likely you have something stuck that is blocking your coalesce. Unfortunately this is not always easy to solve; sometimes you can copy the VM that's causing the issue to create a new chain, but with your error I'm not sure what the real issue behind it is.
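
                                    From the CLI the rescan would be something like (a sketch, using the SR UUID from your logs):

                                    # trigger an SR scan, which also kicks off the GC / coalesce pass
                                    xe sr-scan uuid=10410bc3-b762-0b99-6a0b-e61b091de848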

                                      Anonabhar @alejandro-anv

                                      @alejandro-anv I have had problems like this on my NFS and iSCSI SRs previously.

                                      One thing that might help is to shut down the VMs before doing a re-scan. If my memory serves me, it does a different kind of coalesce (online vs. offline) and this has a better chance of success.

                                      Also, be aware that the delay between the re-scan and the time it actually starts to do work on the drives is about 5 minutes.

                                      When I have a problem like this, I normally SSH into the pool master and tail the log file to watch it work, i.e.:

                                      tail -f /var/log/SMlog | grep SMGC
                                      

                                      It's boring... but it gives me a bit of comfort between prayers 8-)
