    Issue with SR and coalesce

• lucasljorge

I was looking through the forum for a solution, but I couldn't find one...

A few days ago, the SR here at our company filled up completely, which broke our scheduled backup jobs. We found the problem and added space to the storage so the jobs could run again, but now we are getting a "Job canceled to protect the VDI chain" error. There is also a list of VDIs to coalesce that only keeps growing and is now at 160. Some VMs are getting backed up, but the majority fail with that "job canceled to protect the VDI chain" error.

Searching the logs, I can't find any coalesce running on XCP-ng. I read that when you rescan the SR, XCP-ng runs the coalesce automatically, but I can't find the process running.

What can I do to run all the backup jobs successfully again? Do we have to stop all backup jobs and wait for the coalesce to finish? How can we track this?
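
For reference, this is roughly what I have been checking on the pool master so far, with nothing showing up:

    # look for coalesce/GC activity in the storage manager log
    grep -i coalesce /var/log/SMlog | tail -n 50

    # check whether the garbage collector (cleanup.py, which performs the coalesce) is running
    ps -ef | grep "[c]leanup.py"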

      Thanks in advance

• olivierlambert (Vates 🪐 Co-Founder CEO)

        Hi,

1. If you have pro support, please open a ticket; that would make it easier to assist you 🙂
2. We have no info on your XCP-ng version, whether it's up to date, etc.
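
For reference, the version and pending updates can be checked on a host with something like:

    # show the XCP-ng version reported by each host in the pool
    xe host-list params=name-label,software-version

    # check whether any updates are pending on this host
    yum check-update
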
• lucasljorge @olivierlambert

          @olivierlambert
We don't have pro support 😞
          Our version is 8.2

• olivierlambert (Vates 🪐 Co-Founder CEO)

            If you don't have support, it's likely not too critical then 🙂

            When you search the logs, have you checked the SMlog for coalesce exceptions or errors?
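
For example, something along these lines on the pool master:

    # look for GC exceptions in the storage manager log
    grep -A 5 -B 5 -i exception /var/log/SMlog

    # look for coalesce activity itself
    grep -i coalesce /var/log/SMlog | tail -n 50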

• lucasljorge @olivierlambert

              @olivierlambert
No, there are no coalesce errors in SMlog.

The only error in the log is:

SMGC: [23240] * * * * * SR 3a5b6e28-15e0-a173-d61e-cf98335bc2b9: ERROR
Feb 19 12:34:39 SMGC: [2234] gc: EXCEPTION <class 'XenAPI.Failure'>, ['XENAPI_PLUGIN_FAILURE', 'multi', 'CommandException', 'Input/output error']

• olivierlambert (Vates 🪐 Co-Founder CEO)

                Have you tried to restart the hosts?

• lucasljorge @olivierlambert

                  @olivierlambert
Not yet. Will restarting force a coalesce?

• lucasljorge @olivierlambert

@olivierlambert I restarted the master; nothing happened.

• olivierlambert (Vates 🪐 Co-Founder CEO)

It's really hard to tell. Have you also restarted all the other pool members?

• Byte_Smarter

We are having this exact same issue, and I have posted in the Discord server to no avail.

                        Mar  5 10:05:57 ops-xen2 SMGC: [25218] *~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*
                        Mar  5 10:05:57 ops-xen2 SMGC: [25218]          ***********************
                        Mar  5 10:05:57 ops-xen2 SMGC: [25218]          *  E X C E P T I O N  *
                        Mar  5 10:05:57 ops-xen2 SMGC: [25218]          ***********************
                        Mar  5 10:05:57 ops-xen2 SMGC: [25218] gc: EXCEPTION <class 'XenAPI.Failure'>, ['XENAPI_PLUGIN_FAILURE', 'multi', 'CommandException', 'Input/output error']
                        Mar  5 10:05:57 ops-xen2 SMGC: [25218]   File "/opt/xensource/sm/cleanup.py", line 2961, in gc
                        Mar  5 10:05:57 ops-xen2 SMGC: [25218]     _gc(None, srUuid, dryRun)
                        Mar  5 10:05:57 ops-xen2 SMGC: [25218]   File "/opt/xensource/sm/cleanup.py", line 2846, in _gc
                        Mar  5 10:05:57 ops-xen2 SMGC: [25218]     _gcLoop(sr, dryRun)
                        Mar  5 10:05:57 ops-xen2 SMGC: [25218]   File "/opt/xensource/sm/cleanup.py", line 2813, in _gcLoop
                        Mar  5 10:05:57 ops-xen2 SMGC: [25218]     sr.garbageCollect(dryRun)
                        Mar  5 10:05:57 ops-xen2 SMGC: [25218]   File "/opt/xensource/sm/cleanup.py", line 1651, in garbageCollect
                        Mar  5 10:05:57 ops-xen2 SMGC: [25218]     self.deleteVDIs(vdiList)
                        Mar  5 10:05:57 ops-xen2 SMGC: [25218]   File "/opt/xensource/sm/cleanup.py", line 1665, in deleteVDIs
                        Mar  5 10:05:57 ops-xen2 SMGC: [25218]     self.deleteVDI(vdi)
                        Mar  5 10:05:57 ops-xen2 SMGC: [25218]   File "/opt/xensource/sm/cleanup.py", line 2426, in deleteVDI
                        Mar  5 10:05:57 ops-xen2 SMGC: [25218]     self._checkSlaves(vdi)
                        Mar  5 10:05:57 ops-xen2 SMGC: [25218]   File "/opt/xensource/sm/cleanup.py", line 2650, in _checkSlaves
                        Mar  5 10:05:57 ops-xen2 SMGC: [25218]     self.xapi.ensureInactive(hostRef, args)
                        Mar  5 10:05:57 ops-xen2 SMGC: [25218]   File "/opt/xensource/sm/cleanup.py", line 332, in ensureInactive
                        Mar  5 10:05:57 ops-xen2 SMGC: [25218]     hostRef, self.PLUGIN_ON_SLAVE, "multi", args)
                        Mar  5 10:05:57 ops-xen2 SMGC: [25218]   File "/usr/lib/python2.7/site-packages/XenAPI.py", line 264, in __call__
                        Mar  5 10:05:57 ops-xen2 SMGC: [25218]     return self.__send(self.__name, args)
                        Mar  5 10:05:57 ops-xen2 SMGC: [25218]   File "/usr/lib/python2.7/site-packages/XenAPI.py", line 160, in xenapi_request
                        Mar  5 10:05:57 ops-xen2 SMGC: [25218]     result = _parse_result(getattr(self, methodname)(*full_params))
                        Mar  5 10:05:57 ops-xen2 SMGC: [25218]   File "/usr/lib/python2.7/site-packages/XenAPI.py", line 238, in _parse_result
                        Mar  5 10:05:57 ops-xen2 SMGC: [25218]     raise Failure(result['ErrorDescription'])
                        Mar  5 10:05:57 ops-xen2 SMGC: [25218]
                        Mar  5 10:05:57 ops-xen2 SMGC: [25218] *~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*
                        
• Byte_Smarter

This seems to be a Python code error? Could this be a bug in the GC script?

• Byte_Smarter

[screenshot]

Just updated a day ago.

All of the backups that are failing have no existing snapshots.

[screenshot]

This seems to be because each one has 3 base copies (one has 4), since it's not coalescing.
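
One way to see what is sitting on the SR, including those base copies (the SR UUID below is a placeholder), is:

    # list every VDI on the SR; unmerged parents show up as "base copy" entries
    xe vdi-list sr-uuid=<sr-uuid> params=uuid,name-label,is-a-snapshot,managed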

Output of grep -A 5 -B 5 -i exception /var/log/SMlog:
[screenshot]

• lucasljorge @olivierlambert

                              @olivierlambert said in Issue with SR and coalesce:

It's really hard to tell. Have you also restarted all the other pool members?

It was resolved after I migrated from that SR to another one (or to the XCP-ng host's local storage). But we have scheduled backup jobs running every day, and I'm noticing that the number of VDIs to coalesce is growing again on these storages.

                              @Byte_Smarter said in Issue with SR and coalesce:

This seems to be a Python code error? Could this be a bug in the GC script?

                              I think so

• dthenot (Vates 🪐 XCP-ng Team)

Hi, this XAPI plugin, multi, is called on another host but is failing with an I/O error.
It does a few things on a host related to LVM handling.
It's failing on one of them, so you should look at the host that has the error to get the full error in that host's SMlog.
The plugin itself is located in /etc/xapi.d/plugins/on-slave; it's the function named multi.
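
A quick way to spot which member host is failing (host names below are placeholders for the other pool members) could be something like:

    # run from the pool master, against every other host in the pool
    for h in host2 host3; do
        echo "== $h =="
        ssh root@$h 'grep -B 5 -A 20 -i "Input/output error" /var/log/SMlog | tail -n 60'
    done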

• nikade (Top contributor) @dthenot

As @dthenot replied, you need to check /var/log/SMlog on all of your hosts to see which one it is failing on and why.
If the storage filled up before this started to happen, my guess is that something got corrupted; if that's the case, you might have to clean up manually.

I've had this situation once and got help from XOA support: they had to manually clean up some old snapshots, and after doing so we triggered a new coalescale (a rescan of the storage), which was able to clean up the queue.
Until that's finished I wouldn't run any backups, since that might cause more problems and also slow down the coalescale process.
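
For reference, the rescan can be triggered from the pool master like this (SR name and UUID are placeholders):

    # find the UUID of the affected SR
    xe sr-list name-label="Your SR" params=uuid,name-label

    # rescan it, which also kicks off the garbage collector / coalesce
    xe sr-scan uuid=<sr-uuid>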

• Byte_Smarter @nikade

                                    @nikade

In our case there are 0 snapshots, as these are backups restored onto a new SR, so I am not sure what there is to coalesce.

                                    We get this when running: grep -i "coalescale" /var/log/SMlog

[screenshot]

We get this when running grep -A 5 -B 5 -i exception /var/log/SMlog (as advised on Discord to find issues with coalescing):

[screenshot]

As for all the hosts, ONLY the master, which is 'ops-xen2', is showing the log errors posted earlier.

In theory the only snapshots these should have are the ones taken during the backup process, but it never creates them and skips the VMs we want backed up; those are the issue here.

The SR has several TB of free space, uses iSCSI, and runs on its own storage network, and no VMs running on that SR are showing any issues with SR IOPS, so I don't think this is an SR-related issue (I could be wrong here).

[screenshot]

As for what was posted about the multi plugin, while informative, what should I be looking for in the entirety of SMlog?
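
From the traceback above, the strings that seem worth filtering on would be the SMGC tag and the I/O error itself, e.g.:

    # garbage collector activity and errors on this host (SMGC lines come from the GC)
    grep SMGC /var/log/SMlog | tail -n 200

    # the specific failure seen in the traceback
    grep -B 5 -A 5 "Input/output error" /var/log/SMlog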

These hosts are running version 8.1.0 and have been running for a decent time, so that could be an issue here; we are looking at possibly moving all of the affected VMs to a totally new host/pool to see if the hosts are the issue.

There are 10 VMs in this pool. 5 of them are on an older NAS that we want to replace; we have not moved any more over, as this issue makes us reluctant to finish moving them all until we feel secure. The other 5 are on the same pool, in that new SR I mentioned, and both speed and IOPS seem fine, with no errors to be seen on the SR.

If we shut down the 5 VMs that are currently an issue, their backups run just fine, but the backups cannot run while those VMs are running.

I don't mind doing the work to fix the issues; I am just still trying to find out what the issue I need to fix actually is. I would really appreciate it if someone from XCP-ng had 15 minutes to go over the logs with us and maybe point us in some direction, as what we are seeing does not seem to fall under the issues covered in the documentation or on Discord.

• ph7 @Byte_Smarter

                                      @Byte_Smarter said in Issue with SR and coalesce:

                                      We get this when running: grep -i "coalescale" /var/log/SMlog

                                      Maybe try with coalesce instead

• Byte_Smarter @ph7

@ph7 [screenshot]

Same result.

It was copied from:

[screenshot]

• Byte_Smarter @ph7

                                          @ph7

I also ran it on the other hosts in the pool:

[screenshot]

[screenshot]

I am assuming this means it's not even attempting to coalesce?

• ph7 @Byte_Smarter

                                            @Byte_Smarter
Sorry, I can't help you 😞
