XCP-ng

    How to fix XOSTOR

    Xen Orchestra
    • fred974

      Hi,

      For an unknown reason my master server crashed, and I managed to restore service by following this guide.
      My VMs are working again, but my XOSTOR storage is no longer working.
      Do I need to re-enable HA before I can start using XOSTOR again, with:
      xe pool-ha-enable heartbeat-sr-uuids=<UUID>

      tail -n 500 /var/log/SMlog -f returns the following:

      Apr 27 13:23:36 uk SM: [30709] lock: opening lock file /var/lock/sm/a20ee08c-40d0-9818-084f-282bbca1f217/sr
      Apr 27 13:23:36 uk SM: [30709] lock: acquired /var/lock/sm/a20ee08c-40d0-9818-084f-282bbca1f217/sr
      Apr 27 13:23:36 uk SM: [30709] sr_attach {'sr_uuid': 'a20ee08c-40d0-9818-084f-282bbca1f217', 'subtask_of': 'DummyRef:|87739718-a444-4fa0-899e-73b9387541fa|SR.attach', 'args': [], 'host_ref': 'OpaqueRef:359a920d-7bb1-4088-8b3e-42254f111f51', 'session_ref': 'OpaqueRef:c8f6a286-72a9-476e-aaf6-f59a2651662f', 'device_config': {'group-name': 'linstor_group/thin_device', 'redundancy': '3', 'hosts': 'uk.dc1.xcp-ng-hyper1,uk.dc1.xcp-ng-hyper2,uk.dc1.xcp-ng-hyper3,uk.dc1.xcp-ng-hyper4', 'SRmaster': 'false', 'provisioning': 'thin'}, 'command': 'sr_attach', 'sr_ref': 'OpaqueRef:f62acb08-116b-42e4-90df-e7d2153ed610', 'local_cache_sr': '28b8eb58-a6a2-c2fa-ad1e-b339b531330f'}
      Apr 27 13:23:36 uk SMGC: [30709] === SR a20ee08c-40d0-9818-084f-282bbca1f217: abort ===
      Apr 27 13:23:36 uk SM: [30709] lock: opening lock file /var/lock/sm/a20ee08c-40d0-9818-084f-282bbca1f217/running
      Apr 27 13:23:36 uk SM: [30709] lock: opening lock file /var/lock/sm/a20ee08c-40d0-9818-084f-282bbca1f217/gc_active
      Apr 27 13:23:36 uk SM: [30709] lock: tried lock /var/lock/sm/a20ee08c-40d0-9818-084f-282bbca1f217/gc_active, acquired: True (exists: True)
      Apr 27 13:23:36 uk SMGC: [30709] abort: releasing the process lock
      Apr 27 13:23:36 uk SM: [30709] lock: released /var/lock/sm/a20ee08c-40d0-9818-084f-282bbca1f217/gc_active
      Apr 27 13:23:36 uk SM: [30709] lock: acquired /var/lock/sm/a20ee08c-40d0-9818-084f-282bbca1f217/running
      Apr 27 13:23:36 uk SM: [30709] RESET for SR a20ee08c-40d0-9818-084f-282bbca1f217 (master: False)
      Apr 27 13:23:36 uk SM: [30709] lock: released /var/lock/sm/a20ee08c-40d0-9818-084f-282bbca1f217/running
      Apr 27 13:23:36 uk SM: [30709] lock: opening lock file /var/lock/sm/.nil/lvm
      Apr 27 13:23:36 uk SM: [30709] lock: acquired /var/lock/sm/.nil/lvm
      Apr 27 13:23:36 uk SM: [30709] ['/sbin/vgchange', '-ay', 'linstor_group']
      Apr 27 13:23:37 uk SM: [30709]   pread SUCCESS
      Apr 27 13:23:37 uk SM: [30709] lock: released /var/lock/sm/.nil/lvm
      Apr 27 13:25:21 uk SM: [30709] Got exception: Unable to find controller uri.... Retry number: 0
      Apr 27 13:25:47 uk SM: [30709] Got exception: Unable to find controller uri.... Retry number: 1
      Apr 27 13:26:13 uk SM: [30709] Got exception: Unable to find controller uri.... Retry number: 2
      Apr 27 13:26:46 uk SM: [30709] Got exception: Unable to find controller uri.... Retry number: 3
      Apr 27 13:27:19 uk SM: [30709] Got exception: Unable to find controller uri.... Retry number: 4
      Apr 27 13:27:59 uk SM: [30709] Got exception: Unable to find controller uri.... Retry number: 5
      Apr 27 13:28:58 uk SM: [30709] Got exception: Unable to find controller uri.... Retry number: 6
      Apr 27 13:29:26 uk SM: [30709] Got exception: Unable to find controller uri.... Retry number: 7
      Apr 27 13:30:22 uk SM: [30709] Got exception: Unable to find controller uri.... Retry number: 8
      Apr 27 13:30:48 uk SM: [30709] Got exception: Unable to find controller uri.... Retry number: 9
      Apr 27 13:31:18 uk SM: [30709] Raising exception [47, The SR is not available [opterr=Unable to find controller uri...]]
      Apr 27 13:31:18 uk SM: [30709] lock: released /var/lock/sm/a20ee08c-40d0-9818-084f-282bbca1f217/sr
      Apr 27 13:31:18 uk SM: [30709] ***** generic exception: sr_attach: EXCEPTION <class 'SR.SROSError'>, The SR is not available [opterr=Unable to find controller uri...]
      Apr 27 13:31:18 uk SM: [30709]   File "/opt/xensource/sm/SRCommand.py", line 110, in run
      Apr 27 13:31:18 uk SM: [30709]     return self._run_locked(sr)
      Apr 27 13:31:18 uk SM: [30709]   File "/opt/xensource/sm/SRCommand.py", line 159, in _run_locked
      Apr 27 13:31:18 uk SM: [30709]     rv = self._run(sr, target)
      Apr 27 13:31:18 uk SM: [30709]   File "/opt/xensource/sm/SRCommand.py", line 352, in _run
      Apr 27 13:31:18 uk SM: [30709]     return sr.attach(sr_uuid)
      Apr 27 13:31:18 uk SM: [30709]   File "/opt/xensource/sm/LinstorSR", line 634, in wrap
      Apr 27 13:31:18 uk SM: [30709]     return load(self, *args, **kwargs)
      Apr 27 13:31:18 uk SM: [30709]   File "/opt/xensource/sm/LinstorSR", line 560, in load
      Apr 27 13:31:18 uk SM: [30709]     raise xs_errors.XenError('SRUnavailable', opterr=str(e))
      Apr 27 13:31:18 uk SM: [30709]
      Apr 27 13:31:18 uk SM: [30709] ***** LINSTOR resources on XCP-ng: EXCEPTION <class 'SR.SROSError'>, The SR is not available [opterr=Unable to find controller uri...]
      Apr 27 13:31:18 uk SM: [30709]   File "/opt/xensource/sm/SRCommand.py", line 378, in run
      Apr 27 13:31:18 uk SM: [30709]     ret = cmd.run(sr)
      Apr 27 13:31:18 uk SM: [30709]   File "/opt/xensource/sm/SRCommand.py", line 110, in run
      Apr 27 13:31:18 uk SM: [30709]     return self._run_locked(sr)
      Apr 27 13:31:18 uk SM: [30709]   File "/opt/xensource/sm/SRCommand.py", line 159, in _run_locked
      Apr 27 13:31:18 uk SM: [30709]     rv = self._run(sr, target)
      Apr 27 13:31:18 uk SM: [30709]   File "/opt/xensource/sm/SRCommand.py", line 352, in _run
      Apr 27 13:31:18 uk SM: [30709]     return sr.attach(sr_uuid)
      Apr 27 13:31:18 uk SM: [30709]   File "/opt/xensource/sm/LinstorSR", line 634, in wrap
      Apr 27 13:31:18 uk SM: [30709]     return load(self, *args, **kwargs)
      Apr 27 13:31:18 uk SM: [30709]   File "/opt/xensource/sm/LinstorSR", line 560, in load
      Apr 27 13:31:18 uk SM: [30709]     raise xs_errors.XenError('SRUnavailable', opterr=str(e))
      Apr 27 13:31:18 uk SM: [30709]
      Apr 27 13:35:45 uk SM: [8730] lock: opening lock file /var/lock/sm/a20ee08c-40d0-9818-084f-282bbca1f217/sr
      Apr 27 13:35:45 uk SM: [8730] lock: acquired /var/lock/sm/a20ee08c-40d0-9818-084f-282bbca1f217/sr
      Apr 27 13:35:45 uk SM: [8730] sr_attach {'sr_uuid': 'a20ee08c-40d0-9818-084f-282bbca1f217', 'subtask_of': 'DummyRef:|f4bcf18b-9c34-42b7-a933-47cb88c9066e|SR.attach', 'args': [], 'host_ref': 'OpaqueRef:359a920d-7bb1-4088-8b3e-42254f111f51', 'session_ref': 'OpaqueRef:2ec5ec53-65f6-4ab3-a626-bdd87e9df0e4', 'device_config': {'group-name': 'linstor_group/thin_device', 'redundancy': '3', 'hosts': 'uk.dc1.xcp-ng-hyper1,uk.dc1.xcp-ng-hyper2,uk.dc1.xcp-ng-hyper3,uk.dc1.xcp-ng-hyper4', 'SRmaster': 'false', 'provisioning': 'thin'}, 'command': 'sr_attach', 'sr_ref': 'OpaqueRef:f62acb08-116b-42e4-90df-e7d2153ed610', 'local_cache_sr': '28b8eb58-a6a2-c2fa-ad1e-b339b531330f'}
      Apr 27 13:35:45 uk SMGC: [8730] === SR a20ee08c-40d0-9818-084f-282bbca1f217: abort ===
      Apr 27 13:35:45 uk SM: [8730] lock: opening lock file /var/lock/sm/a20ee08c-40d0-9818-084f-282bbca1f217/running
      Apr 27 13:35:45 uk SM: [8730] lock: opening lock file /var/lock/sm/a20ee08c-40d0-9818-084f-282bbca1f217/gc_active
      Apr 27 13:35:45 uk SM: [8730] lock: tried lock /var/lock/sm/a20ee08c-40d0-9818-084f-282bbca1f217/gc_active, acquired: True (exists: True)
      Apr 27 13:35:45 uk SMGC: [8730] abort: releasing the process lock
      Apr 27 13:35:45 uk SM: [8730] lock: released /var/lock/sm/a20ee08c-40d0-9818-084f-282bbca1f217/gc_active
      Apr 27 13:35:45 uk SM: [8730] lock: acquired /var/lock/sm/a20ee08c-40d0-9818-084f-282bbca1f217/running
      Apr 27 13:35:45 uk SM: [8730] RESET for SR a20ee08c-40d0-9818-084f-282bbca1f217 (master: False)
      Apr 27 13:35:45 uk SM: [8730] lock: released /var/lock/sm/a20ee08c-40d0-9818-084f-282bbca1f217/running
      Apr 27 13:35:45 uk SM: [8730] lock: opening lock file /var/lock/sm/.nil/lvm
      Apr 27 13:35:45 uk SM: [8730] lock: acquired /var/lock/sm/.nil/lvm
      Apr 27 13:35:45 uk SM: [8730] ['/sbin/vgchange', '-ay', 'linstor_group']
      Apr 27 13:35:46 uk SM: [8730]   pread SUCCESS
      Apr 27 13:35:46 uk SM: [8730] lock: released /var/lock/sm/.nil/lvm
      Apr 27 13:38:07 uk SM: [8730] Got exception: Unable to find controller uri.... Retry number: 0
      
      • fred974 @fred974

        @ronan-a Could you please help me?

        • brodiecyber @fred974

          @fred974 The logs say it can't find the controller. Can you confirm whether it is running on any of your hosts?

          I'm no expert, but it's a start.

          controller.PNG

          • fred974 @brodiecyber

            @brodiecyber Thank you for your help. I ran the command but got this message:

            [16:28 uk ~]# xe host-call-plugin host-uuid=5a1e10ec-4f1a-469d-d5h7-adb8535741ca plugin=linstor-manager fn=has-ControllerRunning
            Error code: UNKNOWN_XENAPI_PLUGIN_FUNCTION
            Error parameters: has-ControllerRunning
            
            • fred974 @fred974

              @brodiecyber I found the correct command:

              [16:45 uk ~]# xe host-call-plugin host-uuid=5a1e10ec-4f1a-469d-d5h7-adb8535741ca plugin=linstor-manager fn=hasControllerRunning
              False
              

              Running linstor resource list gives:

              [16:49 uk ~]# linstor resource list
              Error: Unable to connect to linstor://localhost:3370: [Errno 99] Cannot assign requested address
              
              • brodiecyber @fred974

                @fred974

                So we know it's not running on that host, since the command targets a single host's UUID.

                Have you run it on all your XCP-ng servers to confirm whether the controller is running on any of them? Perhaps check that the LINSTOR controller is available on at least one host.
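
A rough way to run that check across every host in the pool, as a sketch only. It assumes Python 3 is available on the host, that xe host-list --minimal prints comma-separated UUIDs, and that the plugin prints True when a controller is running (this thread only shows False). The helper names run_xe and hosts_running_controller are made up for illustration:

```python
import subprocess


def run_xe(args):
    """Run the xe CLI with the given arguments and return its stripped stdout."""
    return subprocess.check_output(["xe"] + list(args), universal_newlines=True).strip()


def hosts_running_controller(run=run_xe):
    """Return the UUIDs of hosts whose linstor-manager plugin reports a running controller.

    `run` is injectable so the logic can be exercised without a real pool.
    """
    uuids = [u for u in run(["host-list", "--minimal"]).split(",") if u]
    running = []
    for uuid in uuids:
        out = run([
            "host-call-plugin",
            "host-uuid=" + uuid,
            "plugin=linstor-manager",
            "fn=hasControllerRunning",
        ])
        # Assumption: the plugin answers with the literal string "True" or "False".
        if out.strip() == "True":
            running.append(uuid)
    return running
```

An empty list back from hosts_running_controller() would mean no host reports a controller, which matches the "Unable to find controller uri" errors in SMlog.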

                • fred974 @brodiecyber

                  @brodiecyber Yes, it says 'False' on all 4 hosts. The controller isn't running on any host.

                  • brodiecyber @fred974

                    @fred974

                    OK, so that's our problem: the LINSTOR controller is not running on any node, so XOSTOR has no way to initialize. I'm going to do some reading and see what comes up.

                    • brodiecyber @fred974

                      @fred974

                      Maybe also post in the XOSTOR thread.
                      A controller failure should be on the radar as a possible failure scenario.
                      https://xcp-ng.org/forum/topic/5361/xostor-hyperconvergence-preview

                      • brodiecyber @fred974

                        @fred974

                        Also see whether this script is available on the last host that had the controller, which we can assume is the host that failed the first time. Maybe it has some data on why the controller isn't initializing.

                        controller 2.PNG

                        • fred974 @brodiecyber

                          @brodiecyber I can see it exists at /opt/xensource/bin/linstor-kv-tool, but I have no idea how to use it.
