How to fix XOSTOR
-
Hi,
For an unknown reason, my master server crashed, and I managed to restore the service by following this guide.
My VMs are working again, but my XOSTOR storage is no longer working.
Do I need to re-enable HA before I can start using XOSTOR again, with:
xe pool-ha-enable heartbeat-sr-uuids=<UUID>
tail -n 500 /var/log/SMlog -f
returns the following:
Apr 27 13:23:36 uk SM: [30709] lock: opening lock file /var/lock/sm/a20ee08c-40d0-9818-084f-282bbca1f217/sr
Apr 27 13:23:36 uk SM: [30709] lock: acquired /var/lock/sm/a20ee08c-40d0-9818-084f-282bbca1f217/sr
Apr 27 13:23:36 uk SM: [30709] sr_attach {'sr_uuid': 'a20ee08c-40d0-9818-084f-282bbca1f217', 'subtask_of': 'DummyRef:|87739718-a444-4fa0-899e-73b9387541fa|SR.attach', 'args': [], 'host_ref': 'OpaqueRef:359a920d-7bb1-4088-8b3e-42254f111f51', 'session_ref': 'OpaqueRef:c8f6a286-72a9-476e-aaf6-f59a2651662f', 'device_config': {'group-name': 'linstor_group/thin_device', 'redundancy': '3', 'hosts': 'uk.dc1.xcp-ng-hyper1,uk.dc1.xcp-ng-hyper2,uk.dc1.xcp-ng-hyper3,uk.dc1.xcp-ng-hyper4', 'SRmaster': 'false', 'provisioning': 'thin'}, 'command': 'sr_attach', 'sr_ref': 'OpaqueRef:f62acb08-116b-42e4-90df-e7d2153ed610', 'local_cache_sr': '28b8eb58-a6a2-c2fa-ad1e-b339b531330f'}
Apr 27 13:23:36 uk SMGC: [30709] === SR a20ee08c-40d0-9818-084f-282bbca1f217: abort ===
Apr 27 13:23:36 uk SM: [30709] lock: opening lock file /var/lock/sm/a20ee08c-40d0-9818-084f-282bbca1f217/running
Apr 27 13:23:36 uk SM: [30709] lock: opening lock file /var/lock/sm/a20ee08c-40d0-9818-084f-282bbca1f217/gc_active
Apr 27 13:23:36 uk SM: [30709] lock: tried lock /var/lock/sm/a20ee08c-40d0-9818-084f-282bbca1f217/gc_active, acquired: True (exists: True)
Apr 27 13:23:36 uk SMGC: [30709] abort: releasing the process lock
Apr 27 13:23:36 uk SM: [30709] lock: released /var/lock/sm/a20ee08c-40d0-9818-084f-282bbca1f217/gc_active
Apr 27 13:23:36 uk SM: [30709] lock: acquired /var/lock/sm/a20ee08c-40d0-9818-084f-282bbca1f217/running
Apr 27 13:23:36 uk SM: [30709] RESET for SR a20ee08c-40d0-9818-084f-282bbca1f217 (master: False)
Apr 27 13:23:36 uk SM: [30709] lock: released /var/lock/sm/a20ee08c-40d0-9818-084f-282bbca1f217/running
Apr 27 13:23:36 uk SM: [30709] lock: opening lock file /var/lock/sm/.nil/lvm
Apr 27 13:23:36 uk SM: [30709] lock: acquired /var/lock/sm/.nil/lvm
Apr 27 13:23:36 uk SM: [30709] ['/sbin/vgchange', '-ay', 'linstor_group']
Apr 27 13:23:37 uk SM: [30709] pread SUCCESS
Apr 27 13:23:37 uk SM: [30709] lock: released /var/lock/sm/.nil/lvm
Apr 27 13:25:21 uk SM: [30709] Got exception: Unable to find controller uri.... Retry number: 0
Apr 27 13:25:47 uk SM: [30709] Got exception: Unable to find controller uri.... Retry number: 1
Apr 27 13:26:13 uk SM: [30709] Got exception: Unable to find controller uri.... Retry number: 2
Apr 27 13:26:46 uk SM: [30709] Got exception: Unable to find controller uri.... Retry number: 3
Apr 27 13:27:19 uk SM: [30709] Got exception: Unable to find controller uri.... Retry number: 4
Apr 27 13:27:59 uk SM: [30709] Got exception: Unable to find controller uri.... Retry number: 5
Apr 27 13:28:58 uk SM: [30709] Got exception: Unable to find controller uri.... Retry number: 6
Apr 27 13:29:26 uk SM: [30709] Got exception: Unable to find controller uri.... Retry number: 7
Apr 27 13:30:22 uk SM: [30709] Got exception: Unable to find controller uri.... Retry number: 8
Apr 27 13:30:48 uk SM: [30709] Got exception: Unable to find controller uri.... Retry number: 9
Apr 27 13:31:18 uk SM: [30709] Raising exception [47, The SR is not available [opterr=Unable to find controller uri...]]
Apr 27 13:31:18 uk SM: [30709] lock: released /var/lock/sm/a20ee08c-40d0-9818-084f-282bbca1f217/sr
Apr 27 13:31:18 uk SM: [30709] ***** generic exception: sr_attach: EXCEPTION <class 'SR.SROSError'>, The SR is not available [opterr=Unable to find controller uri...]
Apr 27 13:31:18 uk SM: [30709] File "/opt/xensource/sm/SRCommand.py", line 110, in run
Apr 27 13:31:18 uk SM: [30709] return self._run_locked(sr)
Apr 27 13:31:18 uk SM: [30709] File "/opt/xensource/sm/SRCommand.py", line 159, in _run_locked
Apr 27 13:31:18 uk SM: [30709] rv = self._run(sr, target)
Apr 27 13:31:18 uk SM: [30709] File "/opt/xensource/sm/SRCommand.py", line 352, in _run
Apr 27 13:31:18 uk SM: [30709] return sr.attach(sr_uuid)
Apr 27 13:31:18 uk SM: [30709] File "/opt/xensource/sm/LinstorSR", line 634, in wrap
Apr 27 13:31:18 uk SM: [30709] return load(self, *args, **kwargs)
Apr 27 13:31:18 uk SM: [30709] File "/opt/xensource/sm/LinstorSR", line 560, in load
Apr 27 13:31:18 uk SM: [30709] raise xs_errors.XenError('SRUnavailable', opterr=str(e))
Apr 27 13:31:18 uk SM: [30709]
Apr 27 13:31:18 uk SM: [30709] ***** LINSTOR resources on XCP-ng: EXCEPTION <class 'SR.SROSError'>, The SR is not available [opterr=Unable to find controller uri...]
Apr 27 13:31:18 uk SM: [30709] File "/opt/xensource/sm/SRCommand.py", line 378, in run
Apr 27 13:31:18 uk SM: [30709] ret = cmd.run(sr)
Apr 27 13:31:18 uk SM: [30709] File "/opt/xensource/sm/SRCommand.py", line 110, in run
Apr 27 13:31:18 uk SM: [30709] return self._run_locked(sr)
Apr 27 13:31:18 uk SM: [30709] File "/opt/xensource/sm/SRCommand.py", line 159, in _run_locked
Apr 27 13:31:18 uk SM: [30709] rv = self._run(sr, target)
Apr 27 13:31:18 uk SM: [30709] File "/opt/xensource/sm/SRCommand.py", line 352, in _run
Apr 27 13:31:18 uk SM: [30709] return sr.attach(sr_uuid)
Apr 27 13:31:18 uk SM: [30709] File "/opt/xensource/sm/LinstorSR", line 634, in wrap
Apr 27 13:31:18 uk SM: [30709] return load(self, *args, **kwargs)
Apr 27 13:31:18 uk SM: [30709] File "/opt/xensource/sm/LinstorSR", line 560, in load
Apr 27 13:31:18 uk SM: [30709] raise xs_errors.XenError('SRUnavailable', opterr=str(e))
Apr 27 13:31:18 uk SM: [30709]
Apr 27 13:35:45 uk SM: [8730] lock: opening lock file /var/lock/sm/a20ee08c-40d0-9818-084f-282bbca1f217/sr
Apr 27 13:35:45 uk SM: [8730] lock: acquired /var/lock/sm/a20ee08c-40d0-9818-084f-282bbca1f217/sr
Apr 27 13:35:45 uk SM: [8730] sr_attach {'sr_uuid': 'a20ee08c-40d0-9818-084f-282bbca1f217', 'subtask_of': 'DummyRef:|f4bcf18b-9c34-42b7-a933-47cb88c9066e|SR.attach', 'args': [], 'host_ref': 'OpaqueRef:359a920d-7bb1-4088-8b3e-42254f111f51', 'session_ref': 'OpaqueRef:2ec5ec53-65f6-4ab3-a626-bdd87e9df0e4', 'device_config': {'group-name': 'linstor_group/thin_device', 'redundancy': '3', 'hosts': 'uk.dc1.xcp-ng-hyper1,uk.dc1.xcp-ng-hyper2,uk.dc1.xcp-ng-hyper3,uk.dc1.xcp-ng-hyper4', 'SRmaster': 'false', 'provisioning': 'thin'}, 'command': 'sr_attach', 'sr_ref': 'OpaqueRef:f62acb08-116b-42e4-90df-e7d2153ed610', 'local_cache_sr': '28b8eb58-a6a2-c2fa-ad1e-b339b531330f'}
Apr 27 13:35:45 uk SMGC: [8730] === SR a20ee08c-40d0-9818-084f-282bbca1f217: abort ===
Apr 27 13:35:45 uk SM: [8730] lock: opening lock file /var/lock/sm/a20ee08c-40d0-9818-084f-282bbca1f217/running
Apr 27 13:35:45 uk SM: [8730] lock: opening lock file /var/lock/sm/a20ee08c-40d0-9818-084f-282bbca1f217/gc_active
Apr 27 13:35:45 uk SM: [8730] lock: tried lock /var/lock/sm/a20ee08c-40d0-9818-084f-282bbca1f217/gc_active, acquired: True (exists: True)
Apr 27 13:35:45 uk SMGC: [8730] abort: releasing the process lock
Apr 27 13:35:45 uk SM: [8730] lock: released /var/lock/sm/a20ee08c-40d0-9818-084f-282bbca1f217/gc_active
Apr 27 13:35:45 uk SM: [8730] lock: acquired /var/lock/sm/a20ee08c-40d0-9818-084f-282bbca1f217/running
Apr 27 13:35:45 uk SM: [8730] RESET for SR a20ee08c-40d0-9818-084f-282bbca1f217 (master: False)
Apr 27 13:35:45 uk SM: [8730] lock: released /var/lock/sm/a20ee08c-40d0-9818-084f-282bbca1f217/running
Apr 27 13:35:45 uk SM: [8730] lock: opening lock file /var/lock/sm/.nil/lvm
Apr 27 13:35:45 uk SM: [8730] lock: acquired /var/lock/sm/.nil/lvm
Apr 27 13:35:45 uk SM: [8730] ['/sbin/vgchange', '-ay', 'linstor_group']
Apr 27 13:35:46 uk SM: [8730] pread SUCCESS
Apr 27 13:35:46 uk SM: [8730] lock: released /var/lock/sm/.nil/lvm
Apr 27 13:38:07 uk SM: [8730] Got exception: Unable to find controller uri.... Retry number: 0
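Most of that output is routine lock handling. As a rough sketch, the lines that matter could be pulled out with a small filter like the one below. The function name `sm_controller_errors` is made up for illustration, and the two patterns are just the messages visible in the log above.

```shell
# Print only the controller lookup retries and the resulting SR errors.
# sm_controller_errors is a hypothetical helper name, not part of XCP-ng.
sm_controller_errors() {
    # Default to the SM log path used in this thread; allow another file as $1.
    log_file="${1:-/var/log/SMlog}"
    grep -E 'Unable to find controller uri|Raising exception' "$log_file"
}
```

Calling `sm_controller_errors` with no argument reads `/var/log/SMlog`; pass a different path to check a rotated log.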
-
@ronan-a Could you please help me?
-
@fred974 It looks like the logs say it can't find the controller. Can you confirm whether it is running on any of your hosts?
I'm no expert, but it's a start.
-
@brodiecyber Thank you for your help. I ran the command but got this message
[16:28 uk ~]# xe host-call-plugin host-uuid=5a1e10ec-4f1a-469d-d5h7-adb8535741ca plugin=linstor-manager fn=has-ControllerRunning
Error code: UNKNOWN_XENAPI_PLUGIN_FUNCTION
Error parameters: has-ControllerRunning
-
@brodiecyber I found the correct command:
[16:45 uk ~]# xe host-call-plugin host-uuid=5a1e10ec-4f1a-469d-d5h7-adb8535741ca plugin=linstor-manager fn=hasControllerRunning
False
linstor resource list
[16:49 uk ~]# linstor resource list
Error: Unable to connect to linstor://localhost:3370: [Errno 99] Cannot assign requested address
-
So we know it's not running on that host, since the command targets a single host's UUID.
Have you run it on all your XCP-ng servers to confirm whether the controller is running on any of them? Perhaps check that the LINSTOR controller is available on any host.
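A rough sketch of how that check could be looped over every host in the pool instead of run by hand, assuming the `hasControllerRunning` plugin function shown above. The function name `check_controllers` and the `XE` override variable are made up for illustration.

```shell
# XE can be overridden to point at a different xe binary or a stub.
XE="${XE:-xe}"

# Ask each host in the pool whether it is running the LINSTOR controller.
# check_controllers is a hypothetical helper name.
check_controllers() {
    # host-list --minimal prints the host UUIDs as one comma-separated line.
    for uuid in $($XE host-list --minimal | tr ',' ' '); do
        result=$($XE host-call-plugin host-uuid="$uuid" plugin=linstor-manager fn=hasControllerRunning)
        echo "$uuid: $result"
    done
}
```

Running `check_controllers` on the pool master should print one line per host; a healthy pool would be expected to show `True` on exactly one of them.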
-
@brodiecyber Yes, it says 'False' on all 4 hosts. The controller isn't running on any of them.
-
OK, so that's our problem: the LINSTOR controller is not running on any node, so XOSTOR has no way to initialize. I'm going to do some reading and see what comes up.
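In the meantime, one thing that might be worth checking on each host is the state of the controller's systemd unit. This is only a sketch and assumes the unit is named `linstor-controller`, as in upstream LINSTOR packaging; the function name and the `SYSTEMCTL` override are made up for illustration.

```shell
# SYSTEMCTL can be overridden to point at a stub.
SYSTEMCTL="${SYSTEMCTL:-systemctl}"

# Report the state of the (assumed) linstor-controller unit on this host.
# controller_unit_state is a hypothetical helper name.
controller_unit_state() {
    # is-active prints a state such as "active", "inactive" or "failed";
    # it exits non-zero when the unit is not active, hence the || true.
    state=$($SYSTEMCTL is-active linstor-controller 2>/dev/null) || true
    echo "linstor-controller on $(hostname): ${state:-unknown}"
}
```

If the state comes back as `failed`, `journalctl -u linstor-controller` on that host may show why it did not start.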
-
Maybe also post in the XOSTOR thread
A controller failure should be on the radar for possible failure scenarios.
https://xcp-ng.org/forum/topic/5361/xostor-hyperconvergence-preview
-
Also see if this script is available on the last host that had the controller, which we can assume is the host that failed the first time. Maybe it has some data on why the controller isn't initializing.
-
@brodiecyber I can see it exists in
/opt/xensource/bin/linstor-kv-tool
but I have no idea how to use it.
-