XCP-ng

    Snapshot chain too long in NFS SR

    Compute
    • DarkbeldinD Offline
      Darkbeldin Vates 🪐 Pro Support Team @alejandro-anv
      last edited by Darkbeldin

      @alejandro-anv From the first log line, it seems that the coalesce process's attempt to unmount

      /var/run/sr-mount/10410bc3-b762-0b99-6a0b-e61b091de848/088f49c5-07c0-483f-ae13-d4a9306e9c8b.vhd

      failed. But I could be wrong.

      • A Offline
        alejandro-anv @Darkbeldin
        last edited by

        @Darkbeldin said in Snapshot chain too long in NFS SR:

        @alejandro-anv From the first log line, it seems that the coalesce process's attempt to unmount

        /var/run/sr-mount/10410bc3-b762-0b99-6a0b-e61b091de848/088f49c5-07c0-483f-ae13-d4a9306e9c8b.vhd

        failed. But I could be wrong.

        I see this file no longer exists...

        ls: cannot access /var/run/sr-mount/10410bc3-b762-0b99-6a0b-e61b091de848/088f49c5-07c0-483f-ae13-d4a9306e9c8b.vhd: No such file or directory
        • DarkbeldinD Offline
          Darkbeldin Vates 🪐 Pro Support Team @alejandro-anv
          last edited by

          @alejandro-anv You can try restarting the toolstack on your host first to see if it helps.

          • A Offline
            alejandro-anv @alejandro-anv
            last edited by alejandro-anv

            More investigation on this (maybe it helps find the cause of the problem).

            I manually ran vhd-util coalesce --debug -p -n xxxxxxx.vhd:

            # vhd-util coalesce --debug -p -n 1a1a7f05-220b-43cd-9337-2e6ebd5a43ef.vhd
            

            I checked the file descriptors of the process and saw that it uses several files. Mainly, it keeps open /run/sr-mount/10410bc3-b762-0b99-6a0b-e61b091de848/1a1a7f05-220b-43cd-9337-2e6ebd5a43ef.vhd (which is the target of the coalesce process) and /run/sr-mount/10410bc3-b762-0b99-6a0b-e61b091de848/2d2fb989-6353-42fa-be07-c6e4d6c87cd6.vhd, but it also opens and closes other files (probably the other components of the chain).
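
For anyone who wants to repeat the file-descriptor check, here is a minimal sketch. It assumes a Linux host where /proc/<pid>/fd is readable; `open_vhds` is a hypothetical helper name, not something shipped with XCP-ng.

```shell
#!/bin/sh
# List the .vhd files that a process currently holds open, by reading
# the symlinks under /proc/<pid>/fd.
# open_vhds is a hypothetical helper name used only for this sketch.
open_vhds() {
    local pid="$1" fd target
    for fd in /proc/"$pid"/fd/*; do
        # Each entry is a symlink to the open file; skip entries that
        # disappear while we are scanning.
        target=$(readlink "$fd" 2>/dev/null) || continue
        case "$target" in
            *.vhd) printf '%s\n' "$target" ;;
        esac
    done
}
```

Usage would be something like `open_vhds <pid of the vhd-util process>`. Files that are opened and closed quickly will only show up if the scan happens to catch them, so it may need to be run repeatedly.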

            It ends without error, but both files (of the same size) are kept, and the original one still reports a chain depth of 30. The second file is marked as hidden and shows a depth of 29.

            In SMlog I see this:

            Apr 20 11:34:42 peach SM: [15258] lock: opening lock file /var/lock/sm/1a1a7f05-220b-43cd-9337-2e6ebd5a43ef/vdi
            Apr 20 11:34:42 peach SM: [15258] lock: acquired /var/lock/sm/1a1a7f05-220b-43cd-9337-2e6ebd5a43ef/vdi
            Apr 20 11:34:42 peach SM: [15258] Pause for 1a1a7f05-220b-43cd-9337-2e6ebd5a43ef
            Apr 20 11:34:42 peach SM: [15258] Calling tap pause with minor 12
            Apr 20 11:34:42 peach SM: [15258] ['/usr/sbin/tap-ctl', 'pause', '-p', '32637', '-m', '12']
            Apr 20 11:34:42 peach SM: [15258]  = 0
            Apr 20 11:34:42 peach SM: [15258] lock: released /var/lock/sm/1a1a7f05-220b-43cd-9337-2e6ebd5a43ef/vdi
            Apr 20 11:34:46 peach SM: [15284] lock: opening lock file /var/lock/sm/1a1a7f05-220b-43cd-9337-2e6ebd5a43ef/vdi
            Apr 20 11:34:46 peach SM: [15284] lock: acquired /var/lock/sm/1a1a7f05-220b-43cd-9337-2e6ebd5a43ef/vdi
            Apr 20 11:34:46 peach SM: [15284] Unpause for 1a1a7f05-220b-43cd-9337-2e6ebd5a43ef
            Apr 20 11:34:46 peach SM: [15284] Realpath: /var/run/sr-mount/10410bc3-b762-0b99-6a0b-e61b091de848/1a1a7f05-220b-43cd-9337-2e6ebd5a43ef.vhd
            Apr 20 11:34:46 peach SM: [15284] lock: opening lock file /var/lock/sm/10410bc3-b762-0b99-6a0b-e61b091de848/sr
            Apr 20 11:34:46 peach SM: [15284] ['/usr/sbin/td-util', 'query', 'vhd', '-vpfb', '/var/run/sr-mount/10410bc3-b762-0b99-6a0b-e61b091de848/1a1a7f05-220b-43cd-9337-2e6ebd5a43ef.vhd']
            Apr 20 11:34:46 peach SM: [15284]   pread SUCCESS
            Apr 20 11:34:46 peach SM: [15284] Calling tap unpause with minor 12
            Apr 20 11:34:46 peach SM: [15284] ['/usr/sbin/tap-ctl', 'unpause', '-p', '32637', '-m', '12', '-a', 'vhd:/var/run/sr-mount/10410bc3-b762-0b99-6a0b-e61b091de848/1a1a7f05-220b-43cd-9337-2e6ebd5a43ef.vhd']
            Apr 20 11:34:48 peach SM: [15284]  = 0
            Apr 20 11:34:48 peach SM: [15284] lock: released /var/lock/sm/1a1a7f05-220b-43cd-9337-2e6ebd5a43ef/vdi
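
An excerpt like the one above can be hard to scan, so here is a rough sketch that reduces it to the lock acquire/release events. It assumes the SMlog layout seen in these lines (timestamp, host, `SM:`, bracketed PID, message); `lock_events` is a hypothetical helper name.

```shell
#!/bin/sh
# Read SMlog lines on stdin and print "time pid action lockfile" for
# every "lock: acquired" / "lock: released" entry.
# lock_events is a hypothetical helper name used only for this sketch.
lock_events() {
    awk '$7 == "lock:" && ($8 == "acquired" || $8 == "released") {
        pid = substr($6, 2, length($6) - 2)   # strip the [ ] around the PID
        print $3, pid, $8, $9
    }'
}
```

For example, `lock_events < /var/log/SMlog` shows at a glance which PID held which lock and when.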
            

            But when I check, I see:

            # vhd-util query -vsfd -p -n 1a1a7f05-220b-43cd-9337-2e6ebd5a43ef.vhd
            20480
            9451459072
            /run/sr-mount/10410bc3-b762-0b99-6a0b-e61b091de848/2d2fb989-6353-42fa-be07-c6e4d6c87cd6.vhd
            hidden: 0
            chain depth: 30
            

            Maybe it's a bug? It looks like the coalesce process finishes but does nothing...
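
To compare the query output across several VHDs, it can help to label the fields. This is only a sketch: it assumes the five lines always come in the order shown above, which I read as virtual size, physical size, parent, hidden flag and chain depth. `parse_vhd_query` is a hypothetical helper name.

```shell
#!/bin/sh
# Label the five-line output of `vhd-util query -vsfd -p -n <file>`,
# read from stdin. The field order is an assumption taken from the
# output shown in this thread; units are not interpreted here.
# parse_vhd_query is a hypothetical helper name used only for this sketch.
parse_vhd_query() {
    local virt phys parent hidden depth
    IFS= read -r virt
    IFS= read -r phys
    IFS= read -r parent
    IFS= read -r hidden
    IFS= read -r depth
    printf 'virtual_size=%s\n' "$virt"
    printf 'physical_size=%s\n' "$phys"
    printf 'parent=%s\n' "$parent"
    # Drop the "hidden: " and "chain depth: " prefixes, keeping the values.
    printf 'hidden=%s\n' "${hidden#hidden: }"
    printf 'chain_depth=%s\n' "${depth#chain depth: }"
}
```

It would be used as `vhd-util query -vsfd -p -n 1a1a7f05-220b-43cd-9337-2e6ebd5a43ef.vhd | parse_vhd_query`, for instance before and after a coalesce attempt to see whether the chain depth changed.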

            • A Offline
              alejandro-anv @Darkbeldin
              last edited by alejandro-anv

              @Darkbeldin I already restarted the toolstack. It didn't help.

              • DarkbeldinD Offline
                Darkbeldin Vates 🪐 Pro Support Team @alejandro-anv
                last edited by

                @alejandro-anv Do you have enough free space on your SR?

                • A Offline
                  alejandro-anv @Darkbeldin
                  last edited by

                  @Darkbeldin said in Snapshot chain too long in NFS SR:

                  Do you have enough free space on your SR?

                  Yes. The SR is a NAS with 16 TB, and it has 4 TB available (76% used).

                  • A Offline
                    alejandro-anv @Darkbeldin
                    last edited by

                    @Darkbeldin said in Snapshot chain too long in NFS SR:

                    You should take a look into XOA because you probably have VDI to coalesce stuck there.

                    How do I check this?

                    • DarkbeldinD Offline
                      Darkbeldin Vates 🪐 Pro Support Team @alejandro-anv
                      last edited by

                      @alejandro-anv Go to your SR's Advanced tab; you should see the VDIs that need coalescing, if there are any. You can also check the Dashboard > Health panel if you're on a recent version.

                      • A Offline
                        alejandro-anv @Darkbeldin
                        last edited by alejandro-anv

                        @Darkbeldin said in Snapshot chain too long in NFS SR:

                        @alejandro-anv Go to your SR's Advanced tab; you should see the VDIs that need coalescing, if there are any. You can also check the Dashboard > Health panel if you're on a recent version.

                        Yes. It shows VDIs to coalesce in ALL the SRs I have. The question is why it isn't coalescing them by itself and how I can force it to...

                        • DarkbeldinD Offline
                          Darkbeldin Vates 🪐 Pro Support Team @alejandro-anv
                          last edited by

                          @alejandro-anv You can try to rescan the SR from XOA, but most likely something is stuck and blocking your coalesce. Unfortunately, this is not always easy to solve. Sometimes you can copy the VM that is causing the issue to create a new chain, but with your error I'm not sure what the real issue behind it is.

                          • AnonabharA Offline
                            Anonabhar @alejandro-anv
                            last edited by

                            @alejandro-anv I have had problems like this on my NFS and iSCSI SRs previously.

                            One thing that might help is to shut down the VMs before doing a re-scan. If my memory serves me, it then does a different kind of coalesce (online vs. offline), and this has a better chance of success.

                            Also, be aware that there is a 5-minute delay between the time you re-scan and the time it actually starts to do work on the drives.

                            When I have a problem like this, I normally ssh into the pool master and tail the log files to watch it work, i.e.:

                            tail -f /var/log/SMlog | grep SMGC
                            

                            It's boring... but it gives me a bit of comfort between prayers 8-)
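
If the SMGC output gets too noisy, the same idea can be narrowed to lines that also mention a particular string, such as part of a VDI UUID. This is a sketch only: `smgc_filter` is a hypothetical helper name, and it assumes the garbage collector's lines contain the literal text SMGC, as the grep above does.

```shell
#!/bin/sh
# Read log lines on stdin and print those that contain "SMGC" and,
# if an argument is given, also contain that literal string.
# smgc_filter is a hypothetical helper name used only for this sketch.
smgc_filter() {
    awk -v pat="$1" '
        index($0, "SMGC") && (pat == "" || index($0, pat)) {
            print
            fflush()   # flush each line so output keeps up with tail -f
        }'
}
```

For example: `tail -f /var/log/SMlog | smgc_filter 1a1a7f05`. Using a single awk process with an explicit flush avoids the buffering delay that chaining two greps behind `tail -f` can cause.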
