@ronan-a
As promised, I have done some more "organised" testing, with a brand-new cluster set up to test XOSTOR. Early simple tests seemed to be OK: I can create, restart, snapshot, move and delete VMs with no problem.
But after putting a VM under load for an hour and then restarting it, I am seeing the same weird behaviour I saw previously with XOSTOR.
Firstly, the VM took far longer than expected to shut down. Then trying to restart the VM fails.
Let me know if you want me to provide logs or run any specific tests.
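If it helps with reproduction: a sustained fio run inside the guest is an easy way to keep a VM busy for an hour, so something like the sketch below is representative (illustrative only, not my exact workload):
# Illustrative only: sustained mixed random I/O inside the guest for ~1 hour.
# Adjust --filename and --size to fit the guest's disk.
fio --name=xostor-load --filename=/root/fio-load.bin --size=10G \
    --rw=randrw --bs=4k --iodepth=32 --ioengine=libaio --direct=1 \
    --time_based --runtime=3600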
vm.start
{
"id": "ade699f2-42f0-8629-35ea-6fcc69de99d7",
"bypassMacAddressesCheck": false,
"force": false
}
{
"code": "SR_BACKEND_FAILURE_1200",
"params": [
"",
"No such Tapdisk(minor=12)",
""
],
"call": {
"method": "VM.start",
"params": [
"OpaqueRef:a60d0553-a2f2-41e6-9df4-fad745fbacc8",
false,
false
]
},
"message": "SR_BACKEND_FAILURE_1200(, No such Tapdisk(minor=12), )",
"name": "XapiError",
"stack": "XapiError: SR_BACKEND_FAILURE_1200(, No such Tapdisk(minor=12), )
at Function.wrap (/opt/xo/xo-builds/xen-orchestra-202206111352/packages/xen-api/src/_XapiError.js:16:12)
at /opt/xo/xo-builds/xen-orchestra-202206111352/packages/xen-api/src/transports/json-rpc.js:37:27
at AsyncResource.runInAsyncScope (async_hooks.js:197:9)
at cb (/opt/xo/xo-builds/xen-orchestra-202206111352/node_modules/bluebird/js/release/util.js:355:42)
at tryCatcher (/opt/xo/xo-builds/xen-orchestra-202206111352/node_modules/bluebird/js/release/util.js:16:23)
at Promise._settlePromiseFromHandler (/opt/xo/xo-builds/xen-orchestra-202206111352/node_modules/bluebird/js/release/promise.js:547:31)
at Promise._settlePromise (/opt/xo/xo-builds/xen-orchestra-202206111352/node_modules/bluebird/js/release/promise.js:604:18)
at Promise._settlePromise0 (/opt/xo/xo-builds/xen-orchestra-202206111352/node_modules/bluebird/js/release/promise.js:649:10)
at Promise._settlePromises (/opt/xo/xo-builds/xen-orchestra-202206111352/node_modules/bluebird/js/release/promise.js:729:18)
at _drainQueueStep (/opt/xo/xo-builds/xen-orchestra-202206111352/node_modules/bluebird/js/release/async.js:93:12)
at _drainQueue (/opt/xo/xo-builds/xen-orchestra-202206111352/node_modules/bluebird/js/release/async.js:86:9)
at Async._drainQueues (/opt/xo/xo-builds/xen-orchestra-202206111352/node_modules/bluebird/js/release/async.js:102:5)
at Immediate.Async.drainQueues [as _onImmediate] (/opt/xo/xo-builds/xen-orchestra-202206111352/node_modules/bluebird/js/release/async.js:15:14)
at processImmediate (internal/timers.js:464:21)
at process.callbackTrampoline (internal/async_hooks.js:130:17)"
}
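For what it's worth, the tapdisk minor that the error refers to can be checked directly on the host with tap-ctl (a rough sketch; I'm going from memory on the exact flags):
# Check whether a tapdisk with minor 12 actually exists on this host
tap-ctl list -m 12
# List all tapdisks with their pid, minor, state and backing path
tap-ctl list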
The LINSTOR volume for this VM appears to be OK (by the way, I hope we eventually get an easy way to match up VMs to LINSTOR volumes):
[16:53 XCPNG31 ~]# linstor volume list | grep xcp-volume-00b34ae3-2ad3-44ea-aa13-d5de1fbf756f
| XCPNG30 | xcp-volume-00b34ae3-2ad3-44ea-aa13-d5de1fbf756f | xcp-sr-linstor_group | 0 | 1001 | /dev/drbd1001 | 40.10 GiB | InUse | UpToDate |
| XCPNG31 | xcp-volume-00b34ae3-2ad3-44ea-aa13-d5de1fbf756f | xcp-sr-linstor_group | 0 | 1001 | /dev/drbd1001 | 40.10 GiB | Unused | UpToDate |
| XCPNG32 | xcp-volume-00b34ae3-2ad3-44ea-aa13-d5de1fbf756f | DfltDisklessStorPool | 0 | 1001 | /dev/drbd1001 | | Unused | Diskless |
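On that matching point: while a VDI is attached, a rough manual approach is to go via the tapdisk backing path (a sketch, assuming that path contains the xcp-volume-... resource name, which I haven't verified on every build):
# Sketch: map a running tapdisk back to its LINSTOR resource.
# Assumes the backing path reported by tap-ctl contains the xcp-volume-... name.
MINOR=12
RES=$(tap-ctl list -m "$MINOR" | grep -o 'xcp-volume-[0-9a-f-]*' | head -n 1)
linstor volume list | grep "$RES"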
[16:59 XCPNG31 ~]# linstor node list
╭──────────────────────────────────────────────────────────╮
┊ Node    ┊ NodeType ┊ Addresses                  ┊ State  ┊
╞══════════════════════════════════════════════════════════╡
┊ XCPNG30 ┊ COMBINED ┊ 192.168.1.30:3366 (PLAIN)  ┊ Online ┊
┊ XCPNG31 ┊ COMBINED ┊ 192.168.1.31:3366 (PLAIN)  ┊ Online ┊
┊ XCPNG32 ┊ COMBINED ┊ 192.168.1.32:3366 (PLAIN)  ┊ Online ┊
╰──────────────────────────────────────────────────────────╯
[17:13 XCPNG31 ~]# linstor storage-pool list
╭──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
┊ StoragePool          ┊ Node    ┊ Driver   ┊ PoolName      ┊ FreeCapacity ┊ TotalCapacity ┊ CanSnapshots ┊ State ┊ SharedName ┊
╞══════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╡
┊ DfltDisklessStorPool ┊ XCPNG30 ┊ DISKLESS ┊               ┊              ┊               ┊ False        ┊ Ok    ┊            ┊
┊ DfltDisklessStorPool ┊ XCPNG31 ┊ DISKLESS ┊               ┊              ┊               ┊ False        ┊ Ok    ┊            ┊
┊ DfltDisklessStorPool ┊ XCPNG32 ┊ DISKLESS ┊               ┊              ┊               ┊ False        ┊ Ok    ┊            ┊
┊ xcp-sr-linstor_group ┊ XCPNG30 ┊ LVM      ┊ linstor_group ┊     3.50 TiB ┊      3.64 TiB ┊ False        ┊ Ok    ┊            ┊
┊ xcp-sr-linstor_group ┊ XCPNG31 ┊ LVM      ┊ linstor_group ┊     3.49 TiB ┊      3.64 TiB ┊ False        ┊ Ok    ┊            ┊
┊ xcp-sr-linstor_group ┊ XCPNG32 ┊ LVM      ┊ linstor_group ┊     3.50 TiB ┊      3.64 TiB ┊ False        ┊ Ok    ┊            ┊
╰──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
I am seeing the following errors in SMLog (bb027c9a-5655-4f93-9090-e76b34b2c90d is the VDI UUID of this VM's disk):
Jul 23 16:48:30 XCPNG30 SM: [22961] lock: opening lock file /var/lock/sm/bb027c9a-5655-4f93-9090-e76b34b2c90d/vdi
Jul 23 16:48:30 XCPNG30 SM: [22961] blktap2.deactivate
Jul 23 16:48:30 XCPNG30 SM: [22961] lock: acquired /var/lock/sm/bb027c9a-5655-4f93-9090-e76b34b2c90d/vdi
Jul 23 16:48:30 XCPNG30 SM: [22961] ['/usr/sbin/tap-ctl', 'close', '-p', '19527', '-m', '2', '-t', '30']
Jul 23 16:49:00 XCPNG30 SM: [22961] = 5
Jul 23 16:49:00 XCPNG30 SM: [22961] ***** BLKTAP2:<function _deactivate_locked at 0x7f6fb33208c0>: EXCEPTION <class 'blktap2.CommandFailure'>, ['/usr/sbin/tap-ctl', 'close', '-p', '19527', '-m', '2', '-t', '30'] failed: status=5, pid=22983, errmsg=Input/output error
Jul 23 16:49:00 XCPNG30 SM: [22961] File "/opt/xensource/sm/blktap2.py", line 85, in wrapper
Jul 23 16:49:00 XCPNG30 SM: [22961] ret = op(self, *args)
Jul 23 16:49:00 XCPNG30 SM: [22961] File "/opt/xensource/sm/blktap2.py", line 1744, in _deactivate_locked
Jul 23 16:49:00 XCPNG30 SM: [22961] self._deactivate(sr_uuid, vdi_uuid, caching_params)
Jul 23 16:49:00 XCPNG30 SM: [22961] File "/opt/xensource/sm/blktap2.py", line 1785, in _deactivate
Jul 23 16:49:00 XCPNG30 SM: [22961] self._tap_deactivate(minor)
Jul 23 16:49:00 XCPNG30 SM: [22961] File "/opt/xensource/sm/blktap2.py", line 1368, in _tap_deactivate
Jul 23 16:49:00 XCPNG30 SM: [22961] tapdisk.shutdown()
Jul 23 16:49:00 XCPNG30 SM: [22961] File "/opt/xensource/sm/blktap2.py", line 880, in shutdown
Jul 23 16:49:00 XCPNG30 SM: [22961] TapCtl.close(self.pid, self.minor, force)
Jul 23 16:49:00 XCPNG30 SM: [22961] File "/opt/xensource/sm/blktap2.py", line 433, in close
Jul 23 16:49:00 XCPNG30 SM: [22961] cls._pread(args)
Jul 23 16:49:00 XCPNG30 SM: [22961] File "/opt/xensource/sm/blktap2.py", line 296, in _pread
Jul 23 16:49:00 XCPNG30 SM: [22961] tapctl._wait(quiet)
Jul 23 16:49:00 XCPNG30 SM: [22961] File "/opt/xensource/sm/blktap2.py", line 285, in _wait
Jul 23 16:49:00 XCPNG30 SM: [22961] raise self.CommandFailure(self.cmd, **info)
Jul 23 16:49:00 XCPNG30 SM: [22961]
Jul 23 16:49:00 XCPNG30 SM: [22961] lock: released /var/lock/sm/bb027c9a-5655-4f93-9090-e76b34b2c90d/vdi
Jul 23 16:49:00 XCPNG30 SM: [22961] call-plugin on 7aaaf4a5-0e43-442e-a9b1-38620c87fd69 (linstor-manager:lockVdi with {'groupName': 'linstor_group', 'srUuid': '141d63f6-d3ed-4a2f-588a-1835f0cea588', 'vdiUuid': 'bb027c9a-5655-4f93-9090-e76b34b2c90d', 'locked': 'False'}) returned: True
Jul 23 16:49:00 XCPNG30 SM: [22961] ***** generic exception: vdi_deactivate: EXCEPTION <class 'blktap2.CommandFailure'>, ['/usr/sbin/tap-ctl', 'close', '-p', '19527', '-m', '2', '-t', '30'] failed: status=5, pid=22983, errmsg=Input/output error
Jul 23 16:49:00 XCPNG30 SM: [22961] File "/opt/xensource/sm/SRCommand.py", line 110, in run
Jul 23 16:49:00 XCPNG30 SM: [22961] return self._run_locked(sr)
Jul 23 16:49:00 XCPNG30 SM: [22961] File "/opt/xensource/sm/SRCommand.py", line 159, in _run_locked
Jul 23 16:49:00 XCPNG30 SM: [22961] rv = self._run(sr, target)
Jul 23 16:49:00 XCPNG30 SM: [22961] File "/opt/xensource/sm/SRCommand.py", line 274, in _run
Jul 23 16:49:00 XCPNG30 SM: [22961] caching_params)
Jul 23 16:49:00 XCPNG30 SM: [22961] File "/opt/xensource/sm/blktap2.py", line 1729, in deactivate
Jul 23 16:49:00 XCPNG30 SM: [22961] if self._deactivate_locked(sr_uuid, vdi_uuid, caching_params):
Jul 23 16:49:00 XCPNG30 SM: [22961] File "/opt/xensource/sm/blktap2.py", line 85, in wrapper
Jul 23 16:49:00 XCPNG30 SM: [22961] ret = op(self, *args)
Jul 23 16:49:00 XCPNG30 SM: [22961] File "/opt/xensource/sm/blktap2.py", line 1744, in _deactivate_locked
Jul 23 16:49:00 XCPNG30 SM: [22961] self._deactivate(sr_uuid, vdi_uuid, caching_params)
Jul 23 16:49:00 XCPNG30 SM: [22961] File "/opt/xensource/sm/blktap2.py", line 1785, in _deactivate
Jul 23 16:49:00 XCPNG30 SM: [22961] self._tap_deactivate(minor)
Jul 23 16:49:00 XCPNG30 SM: [22961] File "/opt/xensource/sm/blktap2.py", line 1368, in _tap_deactivate
Jul 23 16:49:00 XCPNG30 SM: [22961] tapdisk.shutdown()
Jul 23 16:49:00 XCPNG30 SM: [22961] File "/opt/xensource/sm/blktap2.py", line 880, in shutdown
Jul 23 16:49:00 XCPNG30 SM: [22961] TapCtl.close(self.pid, self.minor, force)
Jul 23 16:49:00 XCPNG30 SM: [22961] File "/opt/xensource/sm/blktap2.py", line 433, in close
Jul 23 16:49:00 XCPNG30 SM: [22961] cls._pread(args)
Jul 23 16:49:00 XCPNG30 SM: [22961] File "/opt/xensource/sm/blktap2.py", line 296, in _pread
Jul 23 16:49:00 XCPNG30 SM: [22961] tapctl._wait(quiet)
Jul 23 16:49:00 XCPNG30 SM: [22961] File "/opt/xensource/sm/blktap2.py", line 285, in _wait
Jul 23 16:49:00 XCPNG30 SM: [22961] raise self.CommandFailure(self.cmd, **info)
Jul 23 16:49:00 XCPNG30 SM: [22961]
Jul 23 16:49:00 XCPNG30 SM: [22961] ***** LINSTOR resources on XCP-ng: EXCEPTION <class 'blktap2.CommandFailure'>, ['/usr/sbin/tap-ctl', 'close', '-p', '19527', '-m', '2', '-t', '30'] failed: status=5, pid=22983, errmsg=Input/output error
Jul 23 16:49:00 XCPNG30 SM: [22961] File "/opt/xensource/sm/SRCommand.py", line 378, in run
Jul 23 16:49:00 XCPNG30 SM: [22961] ret = cmd.run(sr)
Jul 23 16:49:00 XCPNG30 SM: [22961] File "/opt/xensource/sm/SRCommand.py", line 110, in run
Jul 23 16:49:00 XCPNG30 SM: [22961] return self._run_locked(sr)
Jul 23 16:49:00 XCPNG30 SM: [22961] File "/opt/xensource/sm/SRCommand.py", line 159, in _run_locked
Jul 23 16:49:00 XCPNG30 SM: [22961] rv = self._run(sr, target)
Jul 23 16:49:00 XCPNG30 SM: [22961] File "/opt/xensource/sm/SRCommand.py", line 274, in _run
Jul 23 16:49:00 XCPNG30 SM: [22961] caching_params)
Jul 23 16:49:00 XCPNG30 SM: [22961] File "/opt/xensource/sm/blktap2.py", line 1729, in deactivate
Jul 23 16:49:00 XCPNG30 SM: [22961] if self._deactivate_locked(sr_uuid, vdi_uuid, caching_params):
Jul 23 16:49:00 XCPNG30 SM: [22961] File "/opt/xensource/sm/blktap2.py", line 85, in wrapper
Jul 23 16:49:00 XCPNG30 SM: [22961] ret = op(self, *args)
Jul 23 16:49:00 XCPNG30 SM: [22961] File "/opt/xensource/sm/blktap2.py", line 1744, in _deactivate_locked
Jul 23 16:49:00 XCPNG30 SM: [22961] self._deactivate(sr_uuid, vdi_uuid, caching_params)
Jul 23 16:49:00 XCPNG30 SM: [22961] File "/opt/xensource/sm/blktap2.py", line 1785, in _deactivate
Jul 23 16:49:00 XCPNG30 SM: [22961] self._tap_deactivate(minor)
Jul 23 16:49:00 XCPNG30 SM: [22961] File "/opt/xensource/sm/blktap2.py", line 1368, in _tap_deactivate
Jul 23 16:49:00 XCPNG30 SM: [22961] tapdisk.shutdown()
Jul 23 16:49:00 XCPNG30 SM: [22961] File "/opt/xensource/sm/blktap2.py", line 880, in shutdown
Jul 23 16:49:00 XCPNG30 SM: [22961] TapCtl.close(self.pid, self.minor, force)
Jul 23 16:49:00 XCPNG30 SM: [22961] File "/opt/xensource/sm/blktap2.py", line 433, in close
Jul 23 16:49:00 XCPNG30 SM: [22961] cls._pread(args)
Jul 23 16:49:00 XCPNG30 SM: [22961] File "/opt/xensource/sm/blktap2.py", line 296, in _pread
Jul 23 16:49:00 XCPNG30 SM: [22961] tapctl._wait(quiet)
Jul 23 16:49:00 XCPNG30 SM: [22961] File "/opt/xensource/sm/blktap2.py", line 285, in _wait
Jul 23 16:49:00 XCPNG30 SM: [22961] raise self.CommandFailure(self.cmd, **info)
Jul 23 16:49:00 XCPNG30 SM: [22961]
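Since tap-ctl close dies with an I/O error, the DRBD device underneath seems worth checking; these are the standard checks I would look at next (nothing XOSTOR-specific assumed):
# DRBD view of the resource backing this VDI
drbdadm status xcp-volume-00b34ae3-2ad3-44ea-aa13-d5de1fbf756f
# Kernel-side errors around the time of the failed close
dmesg | grep -iE 'drbd|tapdisk|i/o error' | tail -n 50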
If I try to start this VM after the above issue, I get this error:
vm.start
{
"id": "ade699f2-42f0-8629-35ea-6fcc69de99d7",
"bypassMacAddressesCheck": false,
"force": false
}
{
"code": "FAILED_TO_START_EMULATOR",
"params": [
"OpaqueRef:a60d0553-a2f2-41e6-9df4-fad745fbacc8",
"domid 12",
"QMP failure at File \"xc/device.ml\", line 3366, characters 71-78"
],
"call": {
"method": "VM.start",
"params": [
"OpaqueRef:a60d0553-a2f2-41e6-9df4-fad745fbacc8",
false,
false
]
},
"message": "FAILED_TO_START_EMULATOR(OpaqueRef:a60d0553-a2f2-41e6-9df4-fad745fbacc8, domid 12, QMP failure at File \"xc/device.ml\", line 3366, characters 71-78)",
"name": "XapiError",
"stack": "XapiError: FAILED_TO_START_EMULATOR(OpaqueRef:a60d0553-a2f2-41e6-9df4-fad745fbacc8, domid 12, QMP failure at File \"xc/device.ml\", line 3366, characters 71-78)
at Function.wrap (/opt/xo/xo-builds/xen-orchestra-202206111352/packages/xen-api/src/_XapiError.js:16:12)
at /opt/xo/xo-builds/xen-orchestra-202206111352/packages/xen-api/src/transports/json-rpc.js:37:27
at AsyncResource.runInAsyncScope (async_hooks.js:197:9)
at cb (/opt/xo/xo-builds/xen-orchestra-202206111352/node_modules/bluebird/js/release/util.js:355:42)
at tryCatcher (/opt/xo/xo-builds/xen-orchestra-202206111352/node_modules/bluebird/js/release/util.js:16:23)
at Promise._settlePromiseFromHandler (/opt/xo/xo-builds/xen-orchestra-202206111352/node_modules/bluebird/js/release/promise.js:547:31)
at Promise._settlePromise (/opt/xo/xo-builds/xen-orchestra-202206111352/node_modules/bluebird/js/release/promise.js:604:18)
at Promise._settlePromise0 (/opt/xo/xo-builds/xen-orchestra-202206111352/node_modules/bluebird/js/release/promise.js:649:10)
at Promise._settlePromises (/opt/xo/xo-builds/xen-orchestra-202206111352/node_modules/bluebird/js/release/promise.js:729:18)
at _drainQueueStep (/opt/xo/xo-builds/xen-orchestra-202206111352/node_modules/bluebird/js/release/async.js:93:12)
at _drainQueue (/opt/xo/xo-builds/xen-orchestra-202206111352/node_modules/bluebird/js/release/async.js:86:9)
at Async._drainQueues (/opt/xo/xo-builds/xen-orchestra-202206111352/node_modules/bluebird/js/release/async.js:102:5)
at Immediate.Async.drainQueues [as _onImmediate] (/opt/xo/xo-builds/xen-orchestra-202206111352/node_modules/bluebird/js/release/async.js:15:14)
at processImmediate (internal/timers.js:464:21)
at process.callbackTrampoline (internal/async_hooks.js:130:17)"
}
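If more detail on the FAILED_TO_START_EMULATOR side is useful, I can pull the host logs around that timestamp, roughly like this (paths from memory, happy to grab whatever else you need):
# On the host that tried to start the VM
grep -i 'qmp\|failed_to_start_emulator' /var/log/xensource.log | tail -n 100
grep -i 'qemu\|tapdisk' /var/log/daemon.log | tail -n 100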
This VM is set up the same way as several other VMs I have running on another cluster, except that cluster uses SRs on NFS mounts.
This is with the latest XCP-ng version; all servers are patched and up to date:
yum update
...
No packages marked for update