XCP-ng

    Snapshot chain too long in NFS SR

    • Darkbeldin Vates 🪐 Pro Support Team @alejandro-anv

      @alejandro-anv

      You should take a look in XOA, because you probably have VDIs waiting to coalesce that are stuck.
      After that, the next step would be to look at the SMlog on your host to see what is causing the coalesce to get stuck.

      • alejandro-anv @Darkbeldin

        @Darkbeldin said in Snapshot chain too long in NFS SR:

        You should take a look in XOA, because you probably have VDIs waiting to coalesce that are stuck.

        I'm checking XOA, but I see nothing that gives me a clue. I can't even see the VDI UUID in XOA.

        After that, the next step would be to look at the SMlog on your host to see what is causing the coalesce to get stuck.

        Maybe this is related to the problem?

        Apr 20 10:23:05 toad SMGC: [20485] *~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*
        Apr 20 10:23:05 toad SMGC: [20485]          ***********************
        Apr 20 10:23:05 toad SMGC: [20485]          *  E X C E P T I O N  *
        Apr 20 10:23:05 toad SMGC: [20485]          ***********************
        Apr 20 10:23:05 toad SMGC: [20485] gc: EXCEPTION <class 'util.SMException'>, os.unlink(/var/run/sr-mount/10410bc3-b762-0b99-6a0b-e61b091de848/088f49c5-07c0-483f-ae13-d4a9306e9c8b.vhd) failed
        Apr 20 10:23:05 toad SMGC: [20485]   File "/opt/xensource/sm/cleanup.py", line 3354, in gc
        Apr 20 10:23:05 toad SMGC: [20485]     _gc(None, srUuid, dryRun)
        Apr 20 10:23:05 toad SMGC: [20485]   File "/opt/xensource/sm/cleanup.py", line 3239, in _gc
        Apr 20 10:23:05 toad SMGC: [20485]     _gcLoop(sr, dryRun)
        Apr 20 10:23:05 toad SMGC: [20485]   File "/opt/xensource/sm/cleanup.py", line 3205, in _gcLoop
        Apr 20 10:23:05 toad SMGC: [20485]     sr.garbageCollect(dryRun)
        Apr 20 10:23:05 toad SMGC: [20485]   File "/opt/xensource/sm/cleanup.py", line 1794, in garbageCollect
        Apr 20 10:23:05 toad SMGC: [20485]     self.deleteVDIs(vdiList)
        Apr 20 10:23:05 toad SMGC: [20485]   File "/opt/xensource/sm/cleanup.py", line 2370, in deleteVDIs
        Apr 20 10:23:05 toad SMGC: [20485]     SR.deleteVDIs(self, vdiList)
        Apr 20 10:23:05 toad SMGC: [20485]   File "/opt/xensource/sm/cleanup.py", line 1808, in deleteVDIs
        Apr 20 10:23:05 toad SMGC: [20485]     self.deleteVDI(vdi)
        Apr 20 10:23:05 toad SMGC: [20485]   File "/opt/xensource/sm/cleanup.py", line 2466, in deleteVDI
        Apr 20 10:23:05 toad SMGC: [20485]     SR.deleteVDI(self, vdi)
        Apr 20 10:23:05 toad SMGC: [20485]   File "/opt/xensource/sm/cleanup.py", line 1817, in deleteVDI
        Apr 20 10:23:05 toad SMGC: [20485]     vdi.delete()
        Apr 20 10:23:05 toad SMGC: [20485]   File "/opt/xensource/sm/cleanup.py", line 1093, in delete
        Apr 20 10:23:05 toad SMGC: [20485]     raise util.SMException("os.unlink(%s) failed" % self.path)
        Apr 20 10:23:05 toad SMGC: [20485]
        Apr 20 10:23:05 toad SMGC: [20485] *~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*
        Apr 20 10:23:05 toad SMGC: [20485] * * * * * SR 10410bc3-b762-0b99-6a0b-e61b091de848: ERROR
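For anyone following along, the exception summary can be pulled out of a long SMlog without reading the whole traceback. The snippet below is a minimal sketch in Python, not part of the SM tooling: `find_gc_exceptions` and its regular expression are hypothetical names written against the log format shown above, and they assume every summary line looks like the `gc: EXCEPTION <class ...>` line in this post.

```python
import re

# Matches the one-line summary the garbage collector logs before the traceback.
# Assumption: the layout is "<date> <host> SMGC: [<pid>] gc: EXCEPTION <class '...'>, <message>".
EXCEPTION_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s[\d:]{8})\s(?P<host>\S+)\sSMGC:\s\[(?P<pid>\d+)\]\s"
    r"gc: EXCEPTION <class '(?P<cls>[^']+)'>, (?P<msg>.*)$"
)


def find_gc_exceptions(lines):
    """Return one dict per 'gc: EXCEPTION' line found in an iterable of SMlog lines."""
    found = []
    for line in lines:
        match = EXCEPTION_RE.match(line.strip())
        if match:
            found.append(match.groupdict())
    return found


# Three lines copied from the log in this post; only the second one is a summary line.
sample = [
    "Apr 20 10:23:05 toad SMGC: [20485]          *  E X C E P T I O N  *",
    "Apr 20 10:23:05 toad SMGC: [20485] gc: EXCEPTION <class 'util.SMException'>, "
    "os.unlink(/var/run/sr-mount/10410bc3-b762-0b99-6a0b-e61b091de848/"
    "088f49c5-07c0-483f-ae13-d4a9306e9c8b.vhd) failed",
    'Apr 20 10:23:05 toad SMGC: [20485]   File "/opt/xensource/sm/cleanup.py", line 3354, in gc',
]

for hit in find_gc_exceptions(sample):
    print(hit["stamp"], hit["host"], hit["cls"], hit["msg"])
```

On a host, the `sample` list could be swapped for an open handle on the SMlog file, since the helper only needs an iterable of lines.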
        
        • Darkbeldin Vates 🪐 Pro Support Team @alejandro-anv

          @alejandro-anv
          Most likely you have a stuck mount on your SR that is blocking the coalesce process.
          You should try to find out why this mount is stuck.

          • alejandro-anv

            @Darkbeldin said in Snapshot chain too long in NFS SR:

            @alejandro-anv
            Most likely you have a stuck mount on your SR that is blocking the coalesce process.
            You should try to find out why this mount is stuck.

            Sorry, but I don't understand what you mean by a stuck mount. Do you mean it could be a problem with the SR mount? I've checked the mount point, and I can run ls and get information about the files without problems. I had network problems with this SR earlier, but right now it is working.

            • Darkbeldin Vates 🪐 Pro Support Team @alejandro-anv

              @alejandro-anv From the first log line, it seems the coalesce process failed while trying to unmount

              /var/run/sr-mount/10410bc3-b762-0b99-6a0b-e61b091de848/088f49c5-07c0-483f-ae13-d4a9306e9c8b.vhd

              But I could be wrong.

              • alejandro-anv @Darkbeldin

                @Darkbeldin said in Snapshot chain too long in NFS SR:

                @alejandro-anv From the first log line, it seems the coalesce process failed while trying to unmount

                /var/run/sr-mount/10410bc3-b762-0b99-6a0b-e61b091de848/088f49c5-07c0-483f-ae13-d4a9306e9c8b.vhd

                But I could be wrong.

                I see this file no longer exists...

                ls: cannot access /var/run/sr-mount/10410bc3-b762-0b99-6a0b-e61b091de848/088f49c5-07c0-483f-ae13-d4a9306e9c8b.vhd: No such file or directory
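One possible reading of the traceback, offered as an assumption rather than a diagnosis: `os.unlink` deletes a file (it does not unmount anything), and it raises an `OSError` when the file is already gone or the server refuses the delete. The sketch below mimics the pattern visible in the traceback (`raise util.SMException("os.unlink(%s) failed" % self.path)`). The `SMException` class and the `delete_vhd` helper are hypothetical stand-ins written for this illustration, not the actual `cleanup.py` code.

```python
import os
import tempfile


class SMException(Exception):
    """Hypothetical stand-in for util.SMException from the SM package."""


def delete_vhd(path):
    """Delete a file, converting any OSError into the message format seen in SMlog."""
    try:
        os.unlink(path)
    except OSError as exc:
        raise SMException("os.unlink(%s) failed" % path) from exc


# Try to delete a file that does not exist; a temporary directory keeps this harmless.
with tempfile.TemporaryDirectory() as tmp:
    missing = os.path.join(tmp, "088f49c5-07c0-483f-ae13-d4a9306e9c8b.vhd")
    try:
        delete_vhd(missing)
    except SMException as err:
        print(err, "| underlying error:", type(err.__cause__).__name__)
```

If that reading is right, a VHD that disappeared between the scan and the delete would be enough to make one GC pass abort, which fits the `ls` output above. Whether that is what happened here is not something the log alone confirms.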
                • Darkbeldin Vates 🪐 Pro Support Team @alejandro-anv

                  @alejandro-anv You can try restarting the toolstack on your host first to see if it helps.

                  • alejandro-anv @alejandro-anv

                    More investigation on this (maybe it helps to find the cause of the problem):

                    I manually ran vhd-util coalesce --debug -p -n xxxxxxx.vhd:

                    # vhd-util coalesce --debug -p -n 1a1a7f05-220b-43cd-9337-2e6ebd5a43ef.vhd
                    

                    I checked the file descriptors of the process and saw that it is using several files. Mainly, it keeps open /run/sr-mount/10410bc3-b762-0b99-6a0b-e61b091de848/1a1a7f05-220b-43cd-9337-2e6ebd5a43ef.vhd (the target of the coalesce) and /run/sr-mount/10410bc3-b762-0b99-6a0b-e61b091de848/2d2fb989-6353-42fa-be07-c6e4d6c87cd6.vhd, but it also opens and closes other files (probably the other components of the chain).

                    It ends without error, but both files (of the same size) are kept, and the original one still reports a chain depth of 30. The second file is marked as hidden and shows a depth of 29.
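The file-descriptor check described above can be repeated on any Linux host by reading `/proc/<pid>/fd`. This is only a sketch of that idea: `open_files` is a hypothetical helper, the PID would have to be looked up separately (for example with `pgrep vhd-util`), and reading another process's descriptors normally requires root.

```python
import os


def open_files(pid, suffix=".vhd"):
    """Return the sorted paths a process holds open that end with `suffix` (Linux /proc only)."""
    fd_dir = "/proc/%d/fd" % pid
    paths = set()
    for fd in os.listdir(fd_dir):
        try:
            # Each entry is a symlink pointing to the file, socket or pipe behind the descriptor.
            target = os.readlink(os.path.join(fd_dir, fd))
        except OSError:
            continue  # the descriptor was closed while we were looking
        if target.endswith(suffix):
            paths.add(target)
    return sorted(paths)
```

Because files that are opened and closed quickly will only show up occasionally, calling `open_files(pid)` in a loop gives a better picture of which members of the chain are touched.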

                    In SMlog I see this:

                    Apr 20 11:34:42 peach SM: [15258] lock: opening lock file /var/lock/sm/1a1a7f05-220b-43cd-9337-2e6ebd5a43ef/vdi
                    Apr 20 11:34:42 peach SM: [15258] lock: acquired /var/lock/sm/1a1a7f05-220b-43cd-9337-2e6ebd5a43ef/vdi
                    Apr 20 11:34:42 peach SM: [15258] Pause for 1a1a7f05-220b-43cd-9337-2e6ebd5a43ef
                    Apr 20 11:34:42 peach SM: [15258] Calling tap pause with minor 12
                    Apr 20 11:34:42 peach SM: [15258] ['/usr/sbin/tap-ctl', 'pause', '-p', '32637', '-m', '12']
                    Apr 20 11:34:42 peach SM: [15258]  = 0
                    Apr 20 11:34:42 peach SM: [15258] lock: released /var/lock/sm/1a1a7f05-220b-43cd-9337-2e6ebd5a43ef/vdi
                    Apr 20 11:34:46 peach SM: [15284] lock: opening lock file /var/lock/sm/1a1a7f05-220b-43cd-9337-2e6ebd5a43ef/vdi
                    Apr 20 11:34:46 peach SM: [15284] lock: acquired /var/lock/sm/1a1a7f05-220b-43cd-9337-2e6ebd5a43ef/vdi
                    Apr 20 11:34:46 peach SM: [15284] Unpause for 1a1a7f05-220b-43cd-9337-2e6ebd5a43ef
                    Apr 20 11:34:46 peach SM: [15284] Realpath: /var/run/sr-mount/10410bc3-b762-0b99-6a0b-e61b091de848/1a1a7f05-220b-43cd-9337-2e6ebd5a43ef.vhd
                    Apr 20 11:34:46 peach SM: [15284] lock: opening lock file /var/lock/sm/10410bc3-b762-0b99-6a0b-e61b091de848/sr
                    Apr 20 11:34:46 peach SM: [15284] ['/usr/sbin/td-util', 'query', 'vhd', '-vpfb', '/var/run/sr-mount/10410bc3-b762-0b99-6a0b-e61b091de848/1a1a7f05-220b-43cd-9337-2e6ebd5a43ef.vhd']
                    Apr 20 11:34:46 peach SM: [15284]   pread SUCCESS
                    Apr 20 11:34:46 peach SM: [15284] Calling tap unpause with minor 12
                    Apr 20 11:34:46 peach SM: [15284] ['/usr/sbin/tap-ctl', 'unpause', '-p', '32637', '-m', '12', '-a', 'vhd:/var/run/sr-mount/10410bc3-b762-0b99-6a0b-e61b091de848/1a1a7f05-220b-43cd-9337-2e6ebd5a43ef.vhd']
                    Apr 20 11:34:48 peach SM: [15284]  = 0
                    Apr 20 11:34:48 peach SM: [15284] lock: released /var/lock/sm/1a1a7f05-220b-43cd-9337-2e6ebd5a43ef/vdi
                    

                    But when I check, I see:

                    # vhd-util query -vsfd -p -n 1a1a7f05-220b-43cd-9337-2e6ebd5a43ef.vhd
                    20480
                    9451459072
                    /run/sr-mount/10410bc3-b762-0b99-6a0b-e61b091de848/2d2fb989-6353-42fa-be07-c6e4d6c87cd6.vhd
                    hidden: 0
                    chain depth: 30
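To compare several VHDs in a chain, the query output above can be turned into named fields. The sketch below is written only against the five lines shown in this post. It assumes the first unlabelled number is the virtual size in MiB, the second is the physical size in bytes, and the third unlabelled line is the parent path. The helper names and the `max_depth=30` default are assumptions too: the thread only shows that a depth of 30 goes together with the "chain too long" error.

```python
def parse_vhd_query(output):
    """Parse the output of `vhd-util query -vsfd -p -n <file>` into a dict."""
    info = {}
    plain = []  # values printed without a "key:" label, kept in output order
    for raw in output.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("hidden:"):
            info["hidden"] = line.split(":", 1)[1].strip() == "1"
        elif line.startswith("chain depth:"):
            info["chain_depth"] = int(line.split(":", 1)[1])
        else:
            plain.append(line)
    # Assumed order of the unlabelled lines: virtual size (MiB), physical size (bytes), parent.
    info["virtual_size_mib"] = int(plain[0])
    info["physical_size_bytes"] = int(plain[1])
    info["parent"] = plain[2]
    return info


def chain_is_full(info, max_depth=30):
    """True when the chain depth has reached the assumed maximum (30 is an assumption)."""
    return info["chain_depth"] >= max_depth


# Output copied from the post above.
sample_output = """20480
9451459072
/run/sr-mount/10410bc3-b762-0b99-6a0b-e61b091de848/2d2fb989-6353-42fa-be07-c6e4d6c87cd6.vhd
hidden: 0
chain depth: 30
"""

info = parse_vhd_query(sample_output)
print(info["chain_depth"], info["hidden"], chain_is_full(info))
```

Running the same query on the parent and comparing `chain_depth` before and after a coalesce attempt would show whether the chain actually got shorter.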
                    

                    Maybe it's a bug? It looks like the coalesce process ends but does nothing...

                    • alejandro-anv @Darkbeldin

                      @Darkbeldin I already restarted the toolstack. It didn't help.

                      • Darkbeldin Vates 🪐 Pro Support Team @alejandro-anv

                        @alejandro-anv Do you have enough free space on your SR?

                        • alejandro-anv @Darkbeldin

                          @Darkbeldin said in Snapshot chain too long in NFS SR:

                          Do you have enough free space on your SR?

                          Yes. The SR is a NAS with 16 TB, and it has 4 TB available (76% used).
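The free-space figure can also be read from the host side. This is a small sketch using Python's standard `shutil.disk_usage`; `usage_percent` is a hypothetical helper, and the path passed to it would be the SR mount point (the `/var/run/sr-mount/<SR UUID>` directory seen in the logs).

```python
import shutil


def usage_percent(path):
    """Return (used_percent, free_bytes) for the filesystem that contains `path`."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total, usage.free


# Example on the current directory; on a host, pass the SR mount point instead.
percent, free = usage_percent(".")
print("%.1f%% used, %d bytes free" % (percent, free))
```

The percentage may differ slightly from what `df` or the NAS reports, because tools count reserved blocks differently.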

                          • alejandro-anv @Darkbeldin

                            @Darkbeldin said in Snapshot chain too long in NFS SR:

                            You should take a look in XOA, because you probably have VDIs waiting to coalesce that are stuck.

                            How do I check this?

                            • Darkbeldin Vates 🪐 Pro Support Team @alejandro-anv

                              @alejandro-anv Go to your SR's Advanced tab; you should see the VDIs to coalesce there, if there are any. You can also check the Dashboard > Health panel if you're on a recent version.

                              • alejandro-anv @Darkbeldin

                                @Darkbeldin said in Snapshot chain too long in NFS SR:

                                @alejandro-anv Go to your SR's Advanced tab; you should see the VDIs to coalesce there, if there are any. You can also check the Dashboard > Health panel if you're on a recent version.

                                Yes. It shows VDIs to coalesce in ALL the SRs I have. The question is why it isn't doing it by itself, and how I can force it to happen...

                                • Darkbeldin Vates 🪐 Pro Support Team @alejandro-anv

                                  @alejandro-anv You can try to rescan the SR from XOA, but most likely something is stuck and blocking your coalesce. Unfortunately, this is not easy to solve. Sometimes you can copy the VM that is causing the issue to create a new chain, but with your error I'm not sure what the real issue behind it is.

                                  • Anonabhar @alejandro-anv

                                    @alejandro-anv I have had problems like this on my NFS and iSCSI SRs previously.

                                    One thing that might help is to shut down the VMs before doing a rescan. If my memory serves me, it then does a different kind of coalesce (online vs. offline), and this has a better chance of success.

                                    Also, be aware that there is a 5-minute delay between the time you rescan and the time it actually starts working on the drives.

                                    When I have a problem like this, I normally ssh into the pool master and tail the log files to watch it work, i.e.:

                                    tail -f /var/log/SMlog | grep SMGC
                                    

                                    It's boring... but it gives me a bit of comfort between prayers 8-)
