Issue with SR and coalesce
-
If you don't have a support subscription, then it's likely not too critical.
When you search the logs, have you checked the SMlog for coalesce exceptions or errors?
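For example, something like this (assuming the default /var/log/SMlog path on XCP-ng; the fallback sample line is only there so the snippet runs anywhere):

```shell
# Check the SM log for coalesce/GC failures.
# Assumption: /var/log/SMlog is the default SM log path on XCP-ng.
LOG="${SMLOG_PATH:-/var/log/SMlog}"
# Fall back to a sample line if SMlog is not readable here, so the snippet still runs.
if [ ! -r "$LOG" ]; then
    LOG="$(mktemp)"
    printf '%s\n' "SMGC: [2234] gc: EXCEPTION XenAPI.Failure" > "$LOG"
fi
grep -i -B 5 -A 5 'exception\|coalesce' "$LOG" || echo "no GC errors logged"
```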
-
@olivierlambert
No, there are no coalesce errors in SMlog. The only error in the log is:
SMGC: [23240] * * * * * SR 3a5b6e28-15e0-a173-d61e-cf98335bc2b9: ERROR
Feb 19 12:34:39 SMGC: [2234] gc: EXCEPTION <class 'XenAPI.Failure'>, ['XENAPI_PLUGIN_FAILURE', 'multi', 'CommandException', 'Input/output error']
-
Have you tried to restart the hosts?
-
@olivierlambert
Not yet. Will restarting force a coalesce?
-
@olivierlambert I restarted the master; nothing happened.
-
It's really hard to tell. Have you also restarted all the other pool members?
-
We are having this exact same issue and I have posted in the Discord server to no avail
Mar 5 10:05:57 ops-xen2 SMGC: [25218] *~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*
Mar 5 10:05:57 ops-xen2 SMGC: [25218] ***********************
Mar 5 10:05:57 ops-xen2 SMGC: [25218] *  E X C E P T I O N  *
Mar 5 10:05:57 ops-xen2 SMGC: [25218] ***********************
Mar 5 10:05:57 ops-xen2 SMGC: [25218] gc: EXCEPTION <class 'XenAPI.Failure'>, ['XENAPI_PLUGIN_FAILURE', 'multi', 'CommandException', 'Input/output error']
Mar 5 10:05:57 ops-xen2 SMGC: [25218]   File "/opt/xensource/sm/cleanup.py", line 2961, in gc
Mar 5 10:05:57 ops-xen2 SMGC: [25218]     _gc(None, srUuid, dryRun)
Mar 5 10:05:57 ops-xen2 SMGC: [25218]   File "/opt/xensource/sm/cleanup.py", line 2846, in _gc
Mar 5 10:05:57 ops-xen2 SMGC: [25218]     _gcLoop(sr, dryRun)
Mar 5 10:05:57 ops-xen2 SMGC: [25218]   File "/opt/xensource/sm/cleanup.py", line 2813, in _gcLoop
Mar 5 10:05:57 ops-xen2 SMGC: [25218]     sr.garbageCollect(dryRun)
Mar 5 10:05:57 ops-xen2 SMGC: [25218]   File "/opt/xensource/sm/cleanup.py", line 1651, in garbageCollect
Mar 5 10:05:57 ops-xen2 SMGC: [25218]     self.deleteVDIs(vdiList)
Mar 5 10:05:57 ops-xen2 SMGC: [25218]   File "/opt/xensource/sm/cleanup.py", line 1665, in deleteVDIs
Mar 5 10:05:57 ops-xen2 SMGC: [25218]     self.deleteVDI(vdi)
Mar 5 10:05:57 ops-xen2 SMGC: [25218]   File "/opt/xensource/sm/cleanup.py", line 2426, in deleteVDI
Mar 5 10:05:57 ops-xen2 SMGC: [25218]     self._checkSlaves(vdi)
Mar 5 10:05:57 ops-xen2 SMGC: [25218]   File "/opt/xensource/sm/cleanup.py", line 2650, in _checkSlaves
Mar 5 10:05:57 ops-xen2 SMGC: [25218]     self.xapi.ensureInactive(hostRef, args)
Mar 5 10:05:57 ops-xen2 SMGC: [25218]   File "/opt/xensource/sm/cleanup.py", line 332, in ensureInactive
Mar 5 10:05:57 ops-xen2 SMGC: [25218]     hostRef, self.PLUGIN_ON_SLAVE, "multi", args)
Mar 5 10:05:57 ops-xen2 SMGC: [25218]   File "/usr/lib/python2.7/site-packages/XenAPI.py", line 264, in __call__
Mar 5 10:05:57 ops-xen2 SMGC: [25218]     return self.__send(self.__name, args)
Mar 5 10:05:57 ops-xen2 SMGC: [25218]   File "/usr/lib/python2.7/site-packages/XenAPI.py", line 160, in xenapi_request
Mar 5 10:05:57 ops-xen2 SMGC: [25218]     result = _parse_result(getattr(self, methodname)(*full_params))
Mar 5 10:05:57 ops-xen2 SMGC: [25218]   File "/usr/lib/python2.7/site-packages/XenAPI.py", line 238, in _parse_result
Mar 5 10:05:57 ops-xen2 SMGC: [25218]     raise Failure(result['ErrorDescription'])
Mar 5 10:05:57 ops-xen2 SMGC: [25218]
Mar 5 10:05:57 ops-xen2 SMGC: [25218] *~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*
-
This seems to be a Python code error? Could this be a bug in the GC script?
-
Just updated a day ago.
All of the backups that are failing have no existing snapshots.
This seems to be because each one has 3 (one has 4) base copies, as it's not coalescing.
Output of grep -A 5 -B 5 -i exception /var/log/SMlog
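Since grep's fixed context windows can chop up the longer reports, here's a throwaway sketch (purely hypothetical, not part of the sm tooling) that prints whole banner-delimited exception reports instead; SMGC wraps each one between lines of repeated "*~" characters:

```shell
# Extract complete SMGC exception reports (everything between *~*~ banner
# lines) from an SMlog. Assumption: /var/log/SMlog is the default path;
# a sample file is generated if it is not readable here, so this runs anywhere.
LOG="${SMLOG_PATH:-/var/log/SMlog}"
if [ ! -r "$LOG" ]; then
    LOG="$(mktemp)"
    cat > "$LOG" <<'EOF'
SMGC: [25218] *~*~*~*~*~*~*~*~*
SMGC: [25218] gc: EXCEPTION <class 'XenAPI.Failure'>
SMGC: [25218]   File "/opt/xensource/sm/cleanup.py", line 2961, in gc
SMGC: [25218] *~*~*~*~*~*~*~*~*
EOF
fi
# sed prints every line from an opening banner through the closing banner.
REPORTS="$(sed -n '/\*~\*~/,/\*~\*~/p' "$LOG")"
printf '%s\n' "$REPORTS"
```

On a real host you'd compare these reports across all pool members to see where the plugin call is actually failing.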
-
@olivierlambert said in Issue with SR and coalesce:
It's really hard to tell, have you restarted also all the other pool members?
It resolved after I migrated the VMs off that SR to another one (or to XCP-ng host local storage). But we have scheduled backup jobs running every day, and I'm noticing the number of VDIs to coalesce is growing again on these storage repositories.
@Byte_Smarter said in Issue with SR and coalesce:
This seems to be a Python code error ? Could this be a bug in the GC script ?
I think so
-
Hi, this XAPI plugin "multi" is called on another host but is failing with an I/O error.
It does a few things on a host related to LVM handling.
It's failing on one of them; you should look at the host that has the error to get the full error message in that host's SMlog.
The plugin itself is located in /etc/xapi.d/plugins/on-slave; it's the function named "multi".
-
As replied by @dthenot you need to check /var/log/SMlog on all of your hosts to see which one it is failing on and why.
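A rough way to sweep all members at once (just a sketch: it assumes the xe CLI on the pool master and working root SSH to each host, and the grep pattern is only a guess at what to look for):

```shell
# Sweep every pool member's SMlog for the failing plugin call, from the master.
# Assumptions: xe CLI available, root SSH to all hosts.
PATTERN='on-slave\|EXCEPTION'
if command -v xe >/dev/null 2>&1; then
    for addr in $(xe host-list params=address --minimal | tr ',' ' '); do
        echo "== $addr =="
        ssh "root@$addr" "grep -B 2 -A 10 '$PATTERN' /var/log/SMlog | tail -n 40"
    done
else
    echo "xe not found; run this from the pool master"
fi
```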
If the storage filled up before this started to happen, my guess is that something is corrupted; if that's the case, you might have to clean up manually. I've had this situation once and got help from XOA support: they had to manually clean up some old snapshots, and after doing so we triggered a new coalesce (by rescanning the storage), which was able to clean up the queue.
Until that's finished I wouldn't run any backups, since that might cause more problems and would also slow down the coalesce process.
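For reference, rescanning the SR is what re-triggers the GC pass after a manual cleanup; a guarded sketch (the UUID here is just the one quoted earlier in this thread, substitute your own):

```shell
# Rescan an SR to kick off a new GC/coalesce pass.
# SR_UUID is a placeholder taken from this thread; use your affected SR's UUID.
SR_UUID="3a5b6e28-15e0-a173-d61e-cf98335bc2b9"
if command -v xe >/dev/null 2>&1; then
    xe sr-scan uuid="$SR_UUID"   # the rescan restarts the GC on that SR
else
    echo "xe not found; run this on an XCP-ng pool host"
fi
```

You can then watch /var/log/SMlog to confirm the coalesce queue starts shrinking.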