    Snapshot chain too long in NFS SR

      Darkbeldin Vates 🪐 Pro Support Team @alejandro-anv

      @alejandro-anv
      Most likely you have a stuck mount on your SR that is blocking the coalesce process.
      You should try to find out why this mount is stuck.

        alejandro-anv

        @Darkbeldin said in Snapshot chain too long in NFS SR:

        @alejandro-anv
        Most likely you have a stuck mount on your SR that is blocking the coalesce process.
        You should try to find out why this mount is stuck.

        Sorry, but I don't understand what you mean by a stuck mount. Do you mean it could be a problem with the SR mount? I've checked the mount point, and I can ls and get info about files without problems. I had network problems with this SR earlier, but right now it's working.

          Darkbeldin Vates 🪐 Pro Support Team @alejandro-anv

          @alejandro-anv From the first log line, it seems that the coalesce process tried to unmount the following file and failed:

          /var/run/sr-mount/10410bc3-b762-0b99-6a0b-e61b091de848/088f49c5-07c0-483f-ae13-d4a9306e9c8b.vhd

          But I could be wrong.

            alejandro-anv @Darkbeldin

            @Darkbeldin said in Snapshot chain too long in NFS SR:

            @alejandro-anv From the first log line, it seems that the coalesce process tried to unmount the following file and failed:

            /var/run/sr-mount/10410bc3-b762-0b99-6a0b-e61b091de848/088f49c5-07c0-483f-ae13-d4a9306e9c8b.vhd

            But I could be wrong.

            I see this file no longer exists...

            ls: cannot access /var/run/sr-mount/10410bc3-b762-0b99-6a0b-e61b091de848/088f49c5-07c0-483f-ae13-d4a9306e9c8b.vhd: No such file or directory
              Darkbeldin Vates 🪐 Pro Support Team @alejandro-anv

              @alejandro-anv You can try restarting the toolstack on your host first to see if it helps.

                alejandro-anv @alejandro-anv

                More investigation on this (maybe it helps to find the cause of the problem):

                I manually ran vhd-util coalesce --debug -p -n xxxxxxx.vhd:

                # vhd-util coalesce --debug -p -n 1a1a7f05-220b-43cd-9337-2e6ebd5a43ef.vhd
                

                I checked the file descriptors of the process and saw that it is using several files. Mainly, it keeps open /run/sr-mount/10410bc3-b762-0b99-6a0b-e61b091de848/1a1a7f05-220b-43cd-9337-2e6ebd5a43ef.vhd (which is the target of the coalesce process) and /run/sr-mount/10410bc3-b762-0b99-6a0b-e61b091de848/2d2fb989-6353-42fa-be07-c6e4d6c87cd6.vhd, but it also opens and closes other files (probably the components of the chain).

                It ends without error, but both files (of the same size) are kept, and the original one keeps reporting a chain depth of 30. The second file is marked as hidden and shows a depth of 29.

                In SMlog I see this:

                Apr 20 11:34:42 peach SM: [15258] lock: opening lock file /var/lock/sm/1a1a7f05-220b-43cd-9337-2e6ebd5a43ef/vdi
                Apr 20 11:34:42 peach SM: [15258] lock: acquired /var/lock/sm/1a1a7f05-220b-43cd-9337-2e6ebd5a43ef/vdi
                Apr 20 11:34:42 peach SM: [15258] Pause for 1a1a7f05-220b-43cd-9337-2e6ebd5a43ef
                Apr 20 11:34:42 peach SM: [15258] Calling tap pause with minor 12
                Apr 20 11:34:42 peach SM: [15258] ['/usr/sbin/tap-ctl', 'pause', '-p', '32637', '-m', '12']
                Apr 20 11:34:42 peach SM: [15258]  = 0
                Apr 20 11:34:42 peach SM: [15258] lock: released /var/lock/sm/1a1a7f05-220b-43cd-9337-2e6ebd5a43ef/vdi
                Apr 20 11:34:46 peach SM: [15284] lock: opening lock file /var/lock/sm/1a1a7f05-220b-43cd-9337-2e6ebd5a43ef/vdi
                Apr 20 11:34:46 peach SM: [15284] lock: acquired /var/lock/sm/1a1a7f05-220b-43cd-9337-2e6ebd5a43ef/vdi
                Apr 20 11:34:46 peach SM: [15284] Unpause for 1a1a7f05-220b-43cd-9337-2e6ebd5a43ef
                Apr 20 11:34:46 peach SM: [15284] Realpath: /var/run/sr-mount/10410bc3-b762-0b99-6a0b-e61b091de848/1a1a7f05-220b-43cd-9337-2e6ebd5a43ef.vhd
                Apr 20 11:34:46 peach SM: [15284] lock: opening lock file /var/lock/sm/10410bc3-b762-0b99-6a0b-e61b091de848/sr
                Apr 20 11:34:46 peach SM: [15284] ['/usr/sbin/td-util', 'query', 'vhd', '-vpfb', '/var/run/sr-mount/10410bc3-b762-0b99-6a0b-e61b091de848/1a1a7f05-220b-43cd-9337-2e6ebd5a43ef.vhd']
                Apr 20 11:34:46 peach SM: [15284]   pread SUCCESS
                Apr 20 11:34:46 peach SM: [15284] Calling tap unpause with minor 12
                Apr 20 11:34:46 peach SM: [15284] ['/usr/sbin/tap-ctl', 'unpause', '-p', '32637', '-m', '12', '-a', 'vhd:/var/run/sr-mount/10410bc3-b762-0b99-6a0b-e61b091de848/1a1a7f05-220b-43cd-9337-2e6ebd5a43ef.vhd']
                Apr 20 11:34:48 peach SM: [15284]  = 0
                Apr 20 11:34:48 peach SM: [15284] lock: released /var/lock/sm/1a1a7f05-220b-43cd-9337-2e6ebd5a43ef/vdi
                

                But when I check, I see:

                # vhd-util query -vsfd -p -n 1a1a7f05-220b-43cd-9337-2e6ebd5a43ef.vhd
                20480
                9451459072
                /run/sr-mount/10410bc3-b762-0b99-6a0b-e61b091de848/2d2fb989-6353-42fa-be07-c6e4d6c87cd6.vhd
                hidden: 0
                chain depth: 30
                

                Maybe it's a bug? It looks like the coalesce process ends but does nothing...
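To compare the chain depth before and after a manual coalesce, the relevant fields can be pulled out of the query output instead of reading it by eye. This is only a minimal sketch: the function names chain_depth and hidden_flag are made up for illustration, and they assume the "hidden: N" and "chain depth: N" line format shown in the output above.

```shell
# Read `vhd-util query` output on stdin and print the value of the
# "chain depth:" line (assumes the output format shown in the post).
chain_depth() {
    awk -F': *' '/^chain depth:/ { print $2 }'
}

# Read the same output on stdin and print the value of the "hidden:" line.
hidden_flag() {
    awk -F': *' '/^hidden:/ { print $2 }'
}
```

For example, piping the query from the post into it (vhd-util query -vsfd -p -n 1a1a7f05-220b-43cd-9337-2e6ebd5a43ef.vhd | chain_depth) before and after the coalesce run would show directly whether the depth changed.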

                  alejandro-anv @Darkbeldin

                  @Darkbeldin I already restarted the toolstack. It didn't help.

                    Darkbeldin Vates 🪐 Pro Support Team @alejandro-anv

                    @alejandro-anv Do you have enough free space on your SR?

                      alejandro-anv @Darkbeldin

                      @Darkbeldin said in Snapshot chain too long in NFS SR:

                      Do you have enough free space on your SR?

                      Yes. The SR is a NAS with 16 TB, and it has 4 TB available (76% used).
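The same figure can be checked from the host, since the NFS SR is mounted under /var/run/sr-mount/. A minimal sketch, assuming POSIX-format df output (df -P); the function name used_pct is hypothetical:

```shell
# Read `df -P <path>` output on stdin and print the used-space percentage
# of the first filesystem line, without the trailing % sign.
used_pct() {
    awk 'NR == 2 { sub(/%$/, "", $5); print $5 }'
}
```

Usage would look like: df -P /var/run/sr-mount/10410bc3-b762-0b99-6a0b-e61b091de848 | used_pct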

                        alejandro-anv @Darkbeldin

                        @Darkbeldin said in Snapshot chain too long in NFS SR:

                        You should take a look into XOA because you probably have VDI to coalesce stuck there.

                        How do I check this?

                          Darkbeldin Vates 🪐 Pro Support Team @alejandro-anv

                          @alejandro-anv Go to the Advanced tab of your SR; you should see the VDIs to coalesce there, if there are any. You can also check the Dashboard > Health panel if you're on a recent version.

                            alejandro-anv @Darkbeldin

                            @Darkbeldin said in Snapshot chain too long in NFS SR:

                            @alejandro-anv Go to the Advanced tab of your SR; you should see the VDIs to coalesce there, if there are any. You can also check the Dashboard > Health panel if you're on a recent version.

                            Yes. It shows VDIs to coalesce in ALL the SRs I have. The question is why it's not doing it by itself, and how I can force it...

                              Darkbeldin Vates 🪐 Pro Support Team @alejandro-anv

                              @alejandro-anv You can try to rescan the SR from XOA, but most likely something is stuck and blocking your coalesce. Unfortunately, this is not always easy to solve. Sometimes you can copy the VM that is causing the issue to create a new chain, but with your error I'm not sure what the real issue behind it is.

                                Anonabhar @alejandro-anv

                                @alejandro-anv I have had problems like this on my NFS and iSCSI SRs previously.

                                One thing that might help is to shut down the VMs before doing a re-scan. If my memory serves me, it then does a different kind of coalesce (online vs. offline), and this has a better chance of success.

                                Also, be aware that it takes 5 minutes from the time you re-scan until it actually starts to do work on the drives.

                                When I have a problem like this, I normally ssh into the pool master and tail the log files to watch it work. For example:

                                tail -f /var/log/SMlog | grep SMGC
                                

                                It's boring... but it gives me a bit of comfort between prayers 8-)
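When several VDIs are waiting to coalesce, the SMGC output can be narrowed further to a single VDI or SR UUID. A minimal sketch of such a filter; the function name smgc_lines is hypothetical, and it only assumes that garbage-collector lines contain the string SMGC, as in the grep above:

```shell
# Filter SMlog text read on stdin down to garbage-collector (SMGC) lines.
# With an argument, keep only the SMGC lines that also contain that literal
# string (for example a VDI or SR UUID).
smgc_lines() {
    awk -v id="${1:-}" '/SMGC/ && (id == "" || index($0, id) > 0)'
}
```

Usage would look like: tail -f /var/log/SMlog | smgc_lines 1a1a7f05-220b-43cd-9337-2e6ebd5a43ef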
