XCP-ng

    Snapshot chain too long in NFS SR

      alejandro-anv

      @Darkbeldin said in Snapshot chain too long in NFS SR:

      @alejandro-anv
      Most likely you have a stuck mount on your SR that is blocking the coalesce process.
      You should try to find out why this mount is stuck.

      Sorry, but I don't understand what you mean by a stuck mount. Do you mean it could be a problem with the SR mount? I've checked the mount point, and I can run ls and get information about the files without problems. I had network problems with this SR earlier, but it's working right now.

        Darkbeldin Vates 🪐 Pro Support Team @alejandro-anv

        @alejandro-anv From the first log line, it seems the coalesce process tried to unmount the following file and failed:

        /var/run/sr-mount/10410bc3-b762-0b99-6a0b-e61b091de848/088f49c5-07c0-483f-ae13-d4a9306e9c8b.vhd

        But I could be wrong.

          alejandro-anv @Darkbeldin

          @Darkbeldin said in Snapshot chain too long in NFS SR:

          @alejandro-anv From the first log line, it seems the coalesce process tried to unmount the following file and failed:

          /var/run/sr-mount/10410bc3-b762-0b99-6a0b-e61b091de848/088f49c5-07c0-483f-ae13-d4a9306e9c8b.vhd

          But I could be wrong.

          I see this file no longer exists...

          ls: cannot access /var/run/sr-mount/10410bc3-b762-0b99-6a0b-e61b091de848/088f49c5-07c0-483f-ae13-d4a9306e9c8b.vhd: No such file or directory
            Darkbeldin Vates 🪐 Pro Support Team @alejandro-anv

            @alejandro-anv You can try restarting the toolstack on your host first to see if it helps.

              alejandro-anv @alejandro-anv

              More investigation on this (maybe it helps find the cause of the problem).

              I manually ran vhd-util coalesce --debug -p -n xxxxxxx.vhd:

              # vhd-util coalesce --debug -p -n 1a1a7f05-220b-43cd-9337-2e6ebd5a43ef.vhd
              

              I checked the file descriptors of the process and saw that it uses several files. Mainly, it keeps open /run/sr-mount/10410bc3-b762-0b99-6a0b-e61b091de848/1a1a7f05-220b-43cd-9337-2e6ebd5a43ef.vhd (the target of the coalesce process) and /run/sr-mount/10410bc3-b762-0b99-6a0b-e61b091de848/2d2fb989-6353-42fa-be07-c6e4d6c87cd6.vhd, but it also opens and closes other files (probably the other components of the chain).
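For anyone who wants to repeat the file descriptor check, here is a minimal sketch that reads the symlinks under /proc/<pid>/fd on Linux. The helper name open_vhds is made up for illustration and is not part of vhd-util or XCP-ng; it only catches files that are open at the moment it runs, so short-lived opens of other chain members may be missed.

```shell
# Hypothetical helper: list the .vhd files a process currently holds open.
# Assumes a Linux host with /proc mounted and permission to read the
# target process's fd directory (typically root on the XCP-ng host).
open_vhds() {
    pid="$1"
    for fd in /proc/"$pid"/fd/*; do
        # Each entry is a symlink to the open file; skip unreadable ones.
        target=$(readlink "$fd" 2>/dev/null) || continue
        case "$target" in
            *.vhd) printf '%s\n' "$target" ;;
        esac
    done
}
```

Usage would be something like `open_vhds 12345`, where 12345 stands in for the PID of the running vhd-util process.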

              It ends without error, but both files (of the same size) are kept, and the original one still reports a chain depth of 30. The second file is marked as hidden and shows a depth of 29.

              In SMlog I see this:

              Apr 20 11:34:42 peach SM: [15258] lock: opening lock file /var/lock/sm/1a1a7f05-220b-43cd-9337-2e6ebd5a43ef/vdi
              Apr 20 11:34:42 peach SM: [15258] lock: acquired /var/lock/sm/1a1a7f05-220b-43cd-9337-2e6ebd5a43ef/vdi
              Apr 20 11:34:42 peach SM: [15258] Pause for 1a1a7f05-220b-43cd-9337-2e6ebd5a43ef
              Apr 20 11:34:42 peach SM: [15258] Calling tap pause with minor 12
              Apr 20 11:34:42 peach SM: [15258] ['/usr/sbin/tap-ctl', 'pause', '-p', '32637', '-m', '12']
              Apr 20 11:34:42 peach SM: [15258]  = 0
              Apr 20 11:34:42 peach SM: [15258] lock: released /var/lock/sm/1a1a7f05-220b-43cd-9337-2e6ebd5a43ef/vdi
              Apr 20 11:34:46 peach SM: [15284] lock: opening lock file /var/lock/sm/1a1a7f05-220b-43cd-9337-2e6ebd5a43ef/vdi
              Apr 20 11:34:46 peach SM: [15284] lock: acquired /var/lock/sm/1a1a7f05-220b-43cd-9337-2e6ebd5a43ef/vdi
              Apr 20 11:34:46 peach SM: [15284] Unpause for 1a1a7f05-220b-43cd-9337-2e6ebd5a43ef
              Apr 20 11:34:46 peach SM: [15284] Realpath: /var/run/sr-mount/10410bc3-b762-0b99-6a0b-e61b091de848/1a1a7f05-220b-43cd-9337-2e6ebd5a43ef.vhd
              Apr 20 11:34:46 peach SM: [15284] lock: opening lock file /var/lock/sm/10410bc3-b762-0b99-6a0b-e61b091de848/sr
              Apr 20 11:34:46 peach SM: [15284] ['/usr/sbin/td-util', 'query', 'vhd', '-vpfb', '/var/run/sr-mount/10410bc3-b762-0b99-6a0b-e61b091de848/1a1a7f05-220b-43cd-9337-2e6ebd5a43ef.vhd']
              Apr 20 11:34:46 peach SM: [15284]   pread SUCCESS
              Apr 20 11:34:46 peach SM: [15284] Calling tap unpause with minor 12
              Apr 20 11:34:46 peach SM: [15284] ['/usr/sbin/tap-ctl', 'unpause', '-p', '32637', '-m', '12', '-a', 'vhd:/var/run/sr-mount/10410bc3-b762-0b99-6a0b-e61b091de848/1a1a7f05-220b-43cd-9337-2e6ebd5a43ef.vhd']
              Apr 20 11:34:48 peach SM: [15284]  = 0
              Apr 20 11:34:48 peach SM: [15284] lock: released /var/lock/sm/1a1a7f05-220b-43cd-9337-2e6ebd5a43ef/vdi
              

              But when I check, I see:

              # vhd-util query -vsfd -p -n 1a1a7f05-220b-43cd-9337-2e6ebd5a43ef.vhd
              20480
              9451459072
              /run/sr-mount/10410bc3-b762-0b99-6a0b-e61b091de848/2d2fb989-6353-42fa-be07-c6e4d6c87cd6.vhd
              hidden: 0
              chain depth: 30
              

              Maybe it's a bug? It looks like the coalesce process finishes but does nothing...
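To compare chain depths before and after a coalesce attempt, a small sketch like the following might help. It relies on the "chain depth: N" line shown in the query output above; the function names chain_depth and list_depths are hypothetical, and it assumes that vhd-util query accepts -d on its own (the post only shows it combined with other flags) and that vhd-util is on PATH.

```shell
# Extract N from a "chain depth: N" line of vhd-util query output (stdin).
chain_depth() {
    sed -n 's/^chain depth: \([0-9][0-9]*\)$/\1/p'
}

# Hypothetical helper: print "<depth> <file>" for every VHD in a directory,
# deepest chain first. Prints "?" when the depth could not be read.
list_depths() {
    for f in "$1"/*.vhd; do
        [ -e "$f" ] || continue   # directory contains no .vhd files
        depth=$(vhd-util query -d -n "$f" 2>/dev/null | chain_depth)
        printf '%s %s\n' "${depth:-?}" "$f"
    done | sort -rn
}
```

It would be called with the SR mount point, for example `list_depths /run/sr-mount/10410bc3-b762-0b99-6a0b-e61b091de848`.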

                alejandro-anv @Darkbeldin

                @Darkbeldin I already restarted the toolstack. It didn't help.

                  Darkbeldin Vates 🪐 Pro Support Team @alejandro-anv

                  @alejandro-anv Do you have enough free space on your SR?

                    alejandro-anv @Darkbeldin

                    @Darkbeldin said in Snapshot chain too long in NFS SR:

                    Do you have enough free space on your SR?

                    Yes. The SR is a NAS with 16 TB, of which 4 TB is available (76% used).

                      alejandro-anv @Darkbeldin

                      @Darkbeldin said in Snapshot chain too long in NFS SR:

                      You should take a look in XOA, because you probably have VDIs to coalesce stuck there.

                      How do I check this?

                        Darkbeldin Vates 🪐 Pro Support Team @alejandro-anv

                        @alejandro-anv Go to your SR and open the Advanced tab; it lists the VDIs to coalesce, if there are any. You can also check the Dashboard > Health panel if you're on a recent version.

                          alejandro-anv @Darkbeldin

                          @Darkbeldin said in Snapshot chain too long in NFS SR:

                          @alejandro-anv Go to your SR and open the Advanced tab; it lists the VDIs to coalesce, if there are any. You can also check the Dashboard > Health panel if you're on a recent version.

                          Yes. It shows VDIs to coalesce in ALL of my SRs. The question is why it isn't coalescing by itself and how I can force it...

                            Darkbeldin Vates 🪐 Pro Support Team @alejandro-anv

                            @alejandro-anv You can try rescanning the SR from XOA, but most likely something is stuck and blocking your coalesce. Unfortunately, this is not always easy to solve. Sometimes you can copy the VM that is causing the issue to create a new chain, but with your error I'm not sure what the real issue behind it is.

                              Anonabhar @alejandro-anv

                              @alejandro-anv I have had problems like this on my NFS and iSCSI SRs previously.

                              One thing that might help is to shut down the VMs before doing a rescan. If my memory serves me, it then does a different kind of coalesce (online vs. offline), and this has a better chance of success.

                              Also, be aware that about 5 minutes pass between the time you rescan and the time it actually starts working on the drives.

                              When I have a problem like this, I normally SSH into the pool master and tail the log files to watch it work. For example:

                              tail -f /var/log/SMlog | grep SMGC
                              

                              It's boring, but it gives me a bit of comfort between prayers 8-)
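If watching the whole SMGC stream is too noisy, one option is to narrow it to lines that look like problems. The sketch below is only an illustration: the function name smgc_problems is made up, and the keyword list (error, exception, fail) is a guess rather than an official list of SMGC messages, so it may miss or over-match lines on a real host.

```shell
# Hypothetical filter: read log lines on stdin and keep only SMGC lines
# that mention a likely problem. The keywords are an assumption.
smgc_problems() {
    grep 'SMGC' | grep -i -E 'error|exception|fail'
}
```

It reads from standard input, so it can be run over the existing log with `smgc_problems < /var/log/SMlog`.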
