can't start vm after host disconnect
-
@olivierlambert
That does nothing on the VM where I disconnected and reconnected the VDI:
VDI 2004692e-68ef-464f-b7e5-1258f7fc3f4a is not marked as attached anywhere, nothing to do
On the other VMs that I didn't touch:
[22:57 white ~]# /opt/xensource/sm/resetvdis.py single 9c126ed8-f33e-48f8-8cd4-dd8b165b9e05
Traceback (most recent call last):
  File "/opt/xensource/sm/resetvdis.py", line 170, in <module>
    reset_vdi(session, vdi_uuid, force)
  File "/opt/xensource/sm/resetvdis.py", line 103, in reset_vdi
    {"vdiUuid": vdi_uuid, "srRef": vdi_rec["SR"]})
  File "/usr/lib/python2.7/site-packages/XenAPI.py", line 264, in __call__
    return self.__send(self.__name, args)
  File "/usr/lib/python2.7/site-packages/XenAPI.py", line 160, in xenapi_request
    result = _parse_result(getattr(self, methodname)(*full_params))
  File "/usr/lib/python2.7/site-packages/XenAPI.py", line 238, in _parse_result
    raise Failure(result['ErrorDescription'])
XenAPI.Failure: ['HOST_OFFLINE', 'OpaqueRef:03e275e3-56df-477d-b940-5ba78247ce2f']
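In case it helps anyone repeating this per VM: a minimal, untested sketch for running the same reset over every disk VDI of one VM (VM_UUID is a placeholder):
for vdi in $(xe vbd-list vm-uuid=VM_UUID type=Disk params=vdi-uuid --minimal | tr ',' ' '); do
    # resetvdis.py only takes one VDI at a time in "single" mode
    /opt/xensource/sm/resetvdis.py single "$vdi"
done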
-
So far, the situation is that we have lost all the virtual machines running on this host. They can't even be copied...
-
@alex821982 @dave-opc How are you two related? Are you reporting the same incident?
Have you checked the logs to identify the source of the issue?
-
@Danp Yes, I'm talking about the same situation; I'm also involved. Can you tell me exactly what to look for, and where?
-
@alex821982 In XOA, you can check Settings > Logs for error-related details. Otherwise, you can review our documentation on XCP-ng log files.
-
The situation has now been resolved. We got this server started again; the power supply had failed. But I would like to understand what happened so we can prevent it in the future, because the server might not have started at all...
As I understand it, this is still abnormal behavior by the system?
Otherwise it turns out that we have backup servers and could quickly restore VM operation on them, but we cannot do so because of this lock, which we could not simply remove...
-
@Danp In XOA, we only see this when starting the VM:
SR_BACKEND_FAILURE_46(, The VDI is not available [opterr=['HOST_OFFLINE', 'OpaqueRef:03e275e3-56df-477d-b940-5ba78247ce2f']], )
-
@alex821982 Have you determined why this host is down? Do you plan to bring it back online or will you be removing it from the pool?
-
@Danp As I wrote, the power supply was broken. After replacing it, the server is working again. It reappeared in the pool and these VMs started normally.
Which log exactly should I check, and what should I look for with this kind of problem?
-
@alex821982 The prior link I posted explains the various log files. xensource.log is likely where you would find errors related to starting a VM.
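For example, a quick way to pull the relevant errors out of it on the host (a sketch; adjust the pattern to the error you are chasing):
grep -E 'HOST_OFFLINE|SR_BACKEND_FAILURE' /var/log/xensource.log | tail -n 50
# rotated logs, if the event is older
zgrep -h HOST_OFFLINE /var/log/xensource.log.*.gz 2>/dev/null | tail -n 50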
-
Jul 16 21:57:04 white xapi: [debug||1603 ||dummytaskhelper] task scan one D:a22cff56bbd3 created by task D:b3479eff4821
Jul 16 21:57:04 white xapi: [debug||1604 ||dummytaskhelper] task scan one D:8c3a9b520d8e created by task D:b3479eff4821
Jul 16 21:57:04 white xapi: [debug||1605 ||dummytaskhelper] task scan one D:28b5060aad6b created by task D:b3479eff4821
Jul 16 21:57:04 white xapi: [debug||1606 /var/lib/xcp/xapi||dummytaskhelper] task dispatch:session.slave_login D:ebd35a8b12eb created by task D:a22cff56bbd3
Jul 16 21:57:04 white xapi: [debug||1608 /var/lib/xcp/xapi||dummytaskhelper] task dispatch:session.slave_login D:1e68b1a3c796 created by task D:8c3a9b520d8e
Jul 16 21:57:04 white xapi: [debug||1607 /var/lib/xcp/xapi||dummytaskhelper] task dispatch:session.slave_login D:6e422ee40904 created by task D:28b5060aad6b
Jul 16 21:57:04 white xapi: [ info||1608 /var/lib/xcp/xapi|session.slave_login D:3e904e130a78|xapi_session] Session.create trackid=4edfde1a7d5928a08283f775b5bc21f0 pool=true uname= originator=xapi is_local_superuser=true auth_user_sid= parent=trackid=9834f5af41c964e225f24279aefe4e49
Jul 16 21:57:04 white xapi: [ info||1607 /var/lib/xcp/xapi|session.slave_login D:b378e197f827|xapi_session] Session.create trackid=c7a80191d498ec0eaf1fb5decfd45439 pool=true uname= originator=xapi is_local_superuser=true auth_user_sid= parent=trackid=9834f5af41c964e225f24279aefe4e49
Jul 16 21:57:04 white xapi: [ info||1606 /var/lib/xcp/xapi|session.slave_login D:9074b1c527f5|xapi_session] Session.create trackid=97b82963d8a60fcfca9cddefdf556ea4 pool=true uname= originator=xapi is_local_superuser=true auth_user_sid= parent=trackid=9834f5af41c964e225f24279aefe4e49
Jul 16 21:57:04 white xapi: [debug||1609 /var/lib/xcp/xapi||dummytaskhelper] task dispatch:pool.get_all D:6a400dd143e4 created by task D:3e904e130a78
Jul 16 21:57:04 white xapi: [debug||1611 /var/lib/xcp/xapi||dummytaskhelper] task dispatch:pool.get_all D:97b66417986f created by task D:9074b1c527f5
Jul 16 21:57:04 white xapi: [debug||1610 /var/lib/xcp/xapi||dummytaskhelper] task dispatch:pool.get_all D:3a5a8af93835 created by task D:b378e197f827
Jul 16 21:57:04 white xapi: [debug||1612 /var/lib/xcp/xapi||dummytaskhelper] task dispatch:SR.scan D:cd88fa12a6fe created by task D:8c3a9b520d8e
Jul 16 21:57:04 white xapi: [ info||1612 /var/lib/xcp/xapi||taskhelper] task SR.scan R:8d59eef4dce3 (uuid:d54184df-d37e-c802-8373-005de8bbc99e) created (trackid=4edfde1a7d5928a08283f775b5bc21f0) by task D:8c3a9b520d8e
Jul 16 21:57:04 white xapi: [debug||1612 /var/lib/xcp/xapi|SR.scan R:8d59eef4dce3|message_forwarding] SR.scan: SR = 'ca73e958-871e-f723-9987-47c7357ab412 ([green] md126)'
Jul 16 21:57:04 white xapi: [debug||1612 /var/lib/xcp/xapi|SR.scan R:8d59eef4dce3|message_forwarding] Marking SR for SR.scan (task=OpaqueRef:8d59eef4-dce3-44b5-892d-1d328b0e23b5)
Jul 16 21:57:04 white xapi: [debug||1613 /var/lib/xcp/xapi||dummytaskhelper] task dispatch:SR.scan D:26c6618493bd created by task D:28b5060aad6b
Jul 16 21:57:04 white xapi: [debug||1614 /var/lib/xcp/xapi||dummytaskhelper] task dispatch:SR.scan D:a969ebb4b56b created by task D:a22cff56bbd3
Jul 16 21:57:04 white xapi: [debug||1612 /var/lib/xcp/xapi|SR.scan R:8d59eef4dce3|message_forwarding] Unmarking SR after SR.scan (task=OpaqueRef:8d59eef4-dce3-44b5-892d-1d328b0e23b5)
Jul 16 21:57:04 white xapi: [ info||1613 /var/lib/xcp/xapi||taskhelper] task SR.scan R:c1bc8f0ca345 (uuid:08314c1e-25a0-b591-3465-0edb4a51d715) created (trackid=c7a80191d498ec0eaf1fb5decfd45439) by task D:28b5060aad6b
Jul 16 21:57:04 white xapi: [debug||1613 /var/lib/xcp/xapi|SR.scan R:c1bc8f0ca345|message_forwarding] SR.scan: SR = '3002955a-cd78-e0d2-c70e-50444dc5b9a3 ([green] md124 HDD)'
Jul 16 21:57:04 white xapi: [ info||1614 /var/lib/xcp/xapi||taskhelper] task SR.scan R:78836b2f9a5a (uuid:23f823bc-a597-24c2-b906-6b9d632b7565) created (trackid=97b82963d8a60fcfca9cddefdf556ea4) by task D:a22cff56bbd3
Jul 16 21:57:04 white xapi: [debug||1614 /var/lib/xcp/xapi|SR.scan R:78836b2f9a5a|message_forwarding] SR.scan: SR = 'b6ce7cd8-50a0-cdf4-7bc4-59359e34e91e ([green] md125)'
Jul 16 21:57:04 white xapi: [debug||1613 /var/lib/xcp/xapi|SR.scan R:c1bc8f0ca345|message_forwarding] Marking SR for SR.scan (task=OpaqueRef:c1bc8f0c-a345-4e35-adc1-da5552510069)
Jul 16 21:57:04 white xapi: [debug||1614 /var/lib/xcp/xapi|SR.scan R:78836b2f9a5a|message_forwarding] Marking SR for SR.scan (task=OpaqueRef:78836b2f-9a5a-4bdf-8efe-d8f4c4f6994f)
Jul 16 21:57:04 white xapi: [error||1612 /var/lib/xcp/xapi||backtrace] SR.scan R:8d59eef4dce3 failed with exception Server_error(HOST_OFFLINE, [ OpaqueRef:03e275e3-56df-477d-b940-5ba78247ce2f ])
Jul 16 21:57:04 white xapi: [error||1612 /var/lib/xcp/xapi||backtrace] Raised Server_error(HOST_OFFLINE, [ OpaqueRef:03e275e3-56df-477d-b940-5ba78247ce2f ])
Jul 16 21:57:04 white xapi: [error||1612 /var/lib/xcp/xapi||backtrace] 1/9 xapi Raised at file ocaml/xapi/message_forwarding.ml, line 124
Jul 16 21:57:04 white xapi: [error||1612 /var/lib/xcp/xapi||backtrace] 2/9 xapi Called from file ocaml/xapi/message_forwarding.ml, line 160
Jul 16 21:57:04 white xapi: [error||1612 /var/lib/xcp/xapi||backtrace] 3/9 xapi Called from file lib/xapi-stdext-pervasives/pervasiveext.ml, line 24
Jul 16 21:57:04 white xapi: [error||1612 /var/lib/xcp/xapi||backtrace] 4/9 xapi Called from file ocaml/xapi/rbac.ml, line 205
Jul 16 21:57:04 white xapi: [error||1612 /var/lib/xcp/xapi||backtrace] 5/9 xapi Called from file ocaml/xapi/server_helpers.ml, line 95
Jul 16 21:57:04 white xapi: [error||1612 /var/lib/xcp/xapi||backtrace] 6/9 xapi Called from file ocaml/xapi/server_helpers.ml, line 113
Jul 16 21:57:04 white xapi: [error||1612 /var/lib/xcp/xapi||backtrace] 7/9 xapi Called from file lib/xapi-stdext-pervasives/pervasiveext.ml, line 24
Jul 16 21:57:04 white xapi: [error||1612 /var/lib/xcp/xapi||backtrace] 8/9 xapi Called from file lib/xapi-stdext-pervasives/pervasiveext.ml, line 35
Jul 16 21:57:04 white xapi: [error||1612 /var/lib/xcp/xapi||backtrace] 9/9 xapi Called from file lib/backtrace.ml, line 177
Jul 16 21:57:04 white xapi: [error||1612 /var/lib/xcp/xapi||backtrace]
Jul 16 21:57:04 white xapi: [debug||1613 /var/lib/xcp/xapi|SR.scan R:c1bc8f0ca345|message_forwarding] Unmarking SR after SR.scan (task=OpaqueRef:c1bc8f0c-a345-4e35-adc1-da5552510069)
Jul 16 21:57:04 white xapi: [debug||1614 /var/lib/xcp/xapi|SR.scan R:78836b2f9a5a|message_forwarding] Unmarking SR after SR.scan (task=OpaqueRef:78836b2f-9a5a-4bdf-8efe-d8f4c4f6994f)
Jul 16 21:57:04 white xapi: [debug||1604 |scan one D:8c3a9b520d8e|helpers] Ignoring exception: HOST_OFFLINE: [ OpaqueRef:03e275e3-56df-477d-b940-5ba78247ce2f ] while scanning SR OpaqueRef:dd7b4611-8f5a-4f09-8f05-8b982af995ab
Jul 16 21:57:04 white xapi: [debug||1615 /var/lib/xcp/xapi||dummytaskhelper] task dispatch:session.logout D:31680f55f572 created by task D:8c3a9b520d8e
Jul 16 21:57:04 white xapi: [ info||1615 /var/lib/xcp/xapi|session.logout D:f2e32cc368b8|xapi_session] Session.destroy trackid=4edfde1a7d5928a08283f775b5bc21f0
Jul 16 21:57:04 white xapi: [error||1613 /var/lib/xcp/xapi||backtrace] SR.scan R:c1bc8f0ca345 failed with exception Server_error(HOST_OFFLINE, [ OpaqueRef:03e275e3-56df-477d-b940-5ba78247ce2f ])
Jul 16 21:57:04 white xapi: [error||1613 /var/lib/xcp/xapi||backtrace] Raised Server_error(HOST_OFFLINE, [ OpaqueRef:03e275e3-56df-477d-b940-5ba78247ce2f ])
Jul 16 21:57:04 white xapi: [error||1613 /var/lib/xcp/xapi||backtrace] 1/9 xapi Raised at file ocaml/xapi/message_forwarding.ml, line 124
Jul 16 21:57:04 white xapi: [error||1613 /var/lib/xcp/xapi||backtrace] 2/9 xapi Called from file ocaml/xapi/message_forwarding.ml, line 160
Jul 16 21:57:04 white xapi: [error||1613 /var/lib/xcp/xapi||backtrace] 3/9 xapi Called from file lib/xapi-stdext-pervasives/pervasiveext.ml, line 24
Jul 16 21:57:04 white xapi: [error||1613 /var/lib/xcp/xapi||backtrace] 4/9 xapi Called from file ocaml/xapi/rbac.ml, line 205
Jul 16 21:57:04 white xapi: [error||1613 /var/lib/xcp/xapi||backtrace] 5/9 xapi Called from file ocaml/xapi/server_helpers.ml, line 95
Jul 16 21:57:04 white xapi: [error||1613 /var/lib/xcp/xapi||backtrace] 6/9 xapi Called from file ocaml/xapi/server_helpers.ml, line 113
Jul 16 21:57:04 white xapi: [error||1613 /var/lib/xcp/xapi||backtrace] 7/9 xapi Called from file lib/xapi-stdext-pervasives/pervasiveext.ml, line 24
Jul 16 21:57:04 white xapi: [error||1613 /var/lib/xcp/xapi||backtrace] 8/9 xapi Called from file lib/xapi-stdext-pervasives/pervasiveext.ml, line 35
Jul 16 21:57:04 white xapi: [error||1613 /var/lib/xcp/xapi||backtrace] 9/9 xapi Called from file lib/backtrace.ml, line 177
Jul 16 21:57:04 white xapi: [error||1613 /var/lib/xcp/xapi||backtrace]
Jul 16 21:57:04 white xapi: [debug||1604 |scan one D:8c3a9b520d8e|xapi_sr] Scan of SR ca73e958-871e-f723-9987-47c7357ab412 complete.
Jul 16 21:57:04 white xapi: [debug||1605 |scan one D:28b5060aad6b|helpers] Ignoring exception: HOST_OFFLINE: [ OpaqueRef:03e275e3-56df-477d-b940-5ba78247ce2f ] while scanning SR OpaqueRef:7bce81eb-c4cc-4d1a-bb24-406a7789ae95
Jul 16 21:57:04 white xapi: [error||1614 /var/lib/xcp/xapi||backtrace] SR.scan R:78836b2f9a5a failed with exception Server_error(HOST_OFFLINE, [ OpaqueRef:03e275e3-56df-477d-b940-5ba78247ce2f ])
Jul 16 21:57:04 white xapi: [error||1614 /var/lib/xcp/xapi||backtrace] Raised Server_error(HOST_OFFLINE, [ OpaqueRef:03e275e3-56df-477d-b940-5ba78247ce2f ])
Jul 16 21:57:04 white xapi: [debug||1616 /var/lib/xcp/xapi||dummytaskhelper] task dispatch:session.logout D:f0d39db694bb created by task D:28b5060aad6b
Jul 16 21:57:04 white xapi: [error||1614 /var/lib/xcp/xapi||backtrace] 1/9 xapi Raised at file ocaml/xapi/message_forwarding.ml, line 124
Jul 16 21:57:04 white xapi: [error||1614 /var/lib/xcp/xapi||backtrace] 2/9 xapi Called from file ocaml/xapi/message_forwarding.ml, line 160
Jul 16 21:57:04 white xapi: [error||1614 /var/lib/xcp/xapi||backtrace] 3/9 xapi Called from file lib/xapi-stdext-pervasives/pervasiveext.ml, line 24
Jul 16 21:57:04 white xapi: [error||1614 /var/lib/xcp/xapi||backtrace] 4/9 xapi Called from file ocaml/xapi/rbac.ml, line 205
Jul 16 21:57:04 white xapi: [error||1614 /var/lib/xcp/xapi||backtrace] 5/9 xapi Called from file ocaml/xapi/server_helpers.ml, line 95
Jul 16 21:57:04 white xapi: [error||1614 /var/lib/xcp/xapi||backtrace] 6/9 xapi Called from file ocaml/xapi/server_helpers.ml, line 113
Jul 16 21:57:04 white xapi: [error||1614 /var/lib/xcp/xapi||backtrace] 7/9 xapi Called from file lib/xapi-stdext-pervasives/pervasiveext.ml, line 24
Jul 16 21:57:04 white xapi: [error||1614 /var/lib/xcp/xapi||backtrace] 8/9 xapi Called from file lib/xapi-stdext-pervasives/pervasiveext.ml, line 35
Jul 16 21:57:04 white xapi: [error||1614 /var/lib/xcp/xapi||backtrace] 9/9 xapi Called from file lib/backtrace.ml, line 177
Jul 16 21:57:04 white xapi: [error||1614 /var/lib/xcp/xapi||backtrace]
Jul 16 21:57:04 white xapi: [ info||1616 /var/lib/xcp/xapi|session.logout D:5b9f353f5804|xapi_session] Session.destroy trackid=c7a80191d498ec0eaf1fb5decfd45439
Jul 16 21:57:04 white xapi: [debug||1605 |scan one D:28b5060aad6b|xapi_sr] Scan of SR 3002955a-cd78-e0d2-c70e-50444dc5b9a3 complete.
Jul 16 21:57:04 white xapi: [debug||1603 |scan one D:a22cff56bbd3|helpers] Ignoring exception: HOST_OFFLINE: [ OpaqueRef:03e275e3-56df-477d-b940-5ba78247ce2f ] while scanning SR OpaqueRef:440cb488-7c08-45d4-89df-74d84a41d072
Jul 16 21:57:04 white xapi: [debug||1617 /var/lib/xcp/xapi||dummytaskhelper] task dispatch:session.logout D:2e22356b283d created by task D:a22cff56bbd3
Jul 16 21:57:04 white xapi: [ info||1617 /var/lib/xcp/xapi|session.logout D:1c028c7b965d|xapi_session] Session.destroy trackid=97b82963d8a60fcfca9cddefdf556ea4
Jul 16 21:57:04 white xapi: [debug||1603 |scan one D:a22cff56bbd3|xapi_sr] Scan of SR b6ce7cd8-50a0-cdf4-7bc4-59359e34e91e complete.
-
@Danp said in can't start vm after host disconnect:
xensource.log
If you search for this error, which also appears in XOA, the log excerpt above is the part that relates to it.
-
@dave-opc I just ran into this disaster too, but a little differently.
Here's what I did:
My pool master's hardware failed (HA not enabled). I could not wait for the hardware replacement (motherboard VRM failure) and had additional resources on site anyway (N+2 hosts). All of the VMs are on shared storage, so I did not need to recover an SR from the failed host.
It was a total mess.... pool master dead, XO VM on the dead host, important VMs still showing as 'running' on the dead host.
I have a second backup XO VM that does not run any tasks but gives me off-pool management access (I have several XOs ready, including one in VirtualBox on Windows). But without a master there was nothing to see or do in the pool.
Then I had to restore pool functions with a new master:
On a different, but running, host in the pool, I forced a new master:
xe pool-emergency-transition-to-master
sleep 10
xe pool-recover-slaves
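A quick sanity check after the promotion (my sketch, not part of the original recovery; these are standard xe fields):
# should print this host's UUID as the pool master
xe pool-list params=master --minimal
# slaves should reconnect and report enabled=true shortly after pool-recover-slaves
xe host-list params=uuid,name-label,enabled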
After there was a new master, I reset the power state for the VMs stuck in limbo on the dead host:
xe vm-list
xe vm-reset-powerstate vm=VM_UUID --force
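To catch every stuck VM in one pass, a loop like this should work (a sketch; DEAD_HOST_UUID is a placeholder):
# force-reset every VM still marked as running on the dead host
for vm in $(xe vm-list resident-on=DEAD_HOST_UUID power-state=running params=uuid --minimal | tr ',' ' '); do
    xe vm-reset-powerstate uuid=$vm --force
done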
Then I had to kick the totally dead host out of the pool:
xe host-list
xe host-declare-dead uuid=DEAD_HOST
xe host-forget uuid=DEAD_HOST
I tried just declaring it dead, but that was not good enough. VMs would not restart because they wanted to start on the dead host, and then they would not start on a new host because they had issues with the SR. The shared SR was still 'attached' to the dead host and could not be removed. Also, backups were still trying to reach the dead host. So I had to forget the dead host and move on.
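For anyone inspecting this state: the stale attachment shows up as a PBD still owned by the dead host (a sketch, with DEAD_HOST as a placeholder UUID):
# each SR is plugged into a host via a PBD; the dead host's PBDs linger
xe pbd-list host-uuid=DEAD_HOST params=uuid,sr-uuid,currently-attached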
I would have liked to rejoin the dead host to the pool, but it will take a few days to revive the server, so it had to be forgotten by the pool. I'll just have to reformat the rebuilt host node and join it as a new pool member.
-
@Andrew Your situation is even worse. When your master disappeared, did you also lose control of the pool? Shouldn't the master role be transferred to another host automatically if the master is unavailable for a long time? Why didn't that happen? I also didn't quite understand: after you removed from the pool the host on which these VMs were running, did the VMs then start on the second host?
In general, these seem like very serious bugs: we have a fault-tolerant system design, but we essentially lose it because of this behavior.
-
Correct, NO master = NO pool management (the VMs keep running).
- If HA is enabled, another master is elected automatically.
- If HA is not enabled, each member waits for the master to return.
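For completeness: if the old master is still reachable, there is also a planned handover that avoids the emergency path (NEW_MASTER_UUID is a placeholder):
xe pool-designate-new-master host-uuid=NEW_MASTER_UUID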
I deleted the dead host (old master) because even when I marked it as dead (from the new master), the VMs from it would not restart and the backups were still trying to communicate with it. Deleting it from the pool seemed the only way, or at least the quickest, to restore functionality.
I'll have to look into HA a little more and its issues. It's simple to turn on, but has a few complications/consequences in normal use...
-
@Andrew said in can't start vm after host disconnect:
I'll have to look into HA a little more and its issues. It's simple to turn on, but has a few complications/consequences in normal use...
And which ones, for example? Can we simply not enable it in the VM settings, so that we keep only the automatic master-transfer functionality? I forgot that that doesn't work if HA is turned off)
But as for the rest of the situation, when the machines hang in limbo and nothing can be done with them: is that still a bug? They should just turn off.
-
@alex821982 Here are the HA docs.
-
You need to think about data coherency. As a human, you know that your server was physically dead (PSU dead). But from the XAPI perspective, what if it was just the management network that died? The VM would continue to run correctly, but there would be no way for XAPI to know it. So if you decide to boot the VM again, it might corrupt the data (having the VM run in two places with the same disk: catastrophic corruption).
That's why, by default, it prevented you from starting the VM: it couldn't contact the host that might still have had the VM running, which would lead to catastrophic corruption.
In HA, there's an extra mechanism (the storage heartbeat) that helps make a better decision (at the cost of auto-fencing hosts that can't reach the HA SR).
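For reference, enabling it is a pool-level switch plus per-VM protection (a sketch with placeholder UUIDs):
# enable HA, using a shared SR for the storage heartbeat
xe pool-ha-enable heartbeat-sr-uuids=SHARED_SR_UUID
# mark a VM as protected so HA restarts it after a host failure
xe vm-param-set uuid=VM_UUID ha-restart-priority=restart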
-
@Andrew
I read this) That's why I wrote that you could avoid enabling HA on each VM and use this function only for the automatic transfer of the master.
Well, okay, we've strayed from the subject; we still have the other problem...
-
@olivierlambert
What about when the system doesn't know that the host is offline, but I know for sure that the host is down, and I need manual control over starting/copying the VM?
Why does --force exist on those commands, if it doesn't help in any way?