@Danp How can I preserve or recover the local SRs of the dead host?
Posts
-
RE: Hosts in a pool have gone offline after reboot
@Danp I didn't do anything. The master host failed on its own and stopped responding to XO.
I've rebooted the host and the hardware all seems fine. The logs suggest that XAPI is not running because the database is missing a column (see above, first comment).
-
RE: Hosts in a pool have gone offline after reboot
@nikade It looks like I cannot get the dead host to rejoin the pool using xe pool-join:
You attempted an operation that was not allowed. reason: Host is already part of a pool
Will I have problems if I try to force it to join with xe pool-join force? A forum post seems to suggest that this may propagate data corruption errors from the dead host to the pool, which is obviously undesirable. So how would I avoid that?
-
RE: Hosts in a pool have gone offline after reboot
@Danp Is there some documentation you would recommend on how to safely forget a host? I'm confronted with dire warnings on how this will permanently destroy the SRs used by the VMs that used to run on the dead host. So, I want to make really sure I won't be doing something wrong here.
Thanks!
-
RE: Hosts in a pool have gone offline after reboot
@Danp So the saga continues:
I designated the sole running host as the new master. It did this happily and in fact also discovered one of the other hosts - the one that was not the old master. So far so good.
I was able to then take a look at the list of VMs, then force any VMs "running" on the dead host (the old master) to be halted. Now the dead host only has the XCP control plane running.
All that is left is to get the dead host forgotten from the pool and then rejoin the pool, right?
-
RE: Hosts in a pool have gone offline after reboot
@Danp I did a yum update and a xe-toolstack-restart on all three hosts, which made no difference. I also tried doing an emergency network reset on just the master, but that made no difference either. I think that XAPI isn't up at all because of the database.
Will a reinstall of XCP work? Some forum entries seem to suggest so, but I'm leery of how fragile this seems to be.
-
RE: Hosts in a pool have gone offline after reboot
No, the pool master is not running. The logs posted are from the machine that was the pool master.
The machine boots but the management interface (console) has no NIC, and no network.
-
RE: Hosts in a pool have gone offline after reboot
@Danp Sorry about the cross post. I realised I might have put it in the wrong section of the forum, as this might not be related to XO management. But that's where I first encountered it, so good enough.
All of the hosts are running the most up to date version, and the patches are all up to date as of right now. I cannot be absolutely certain that the slaves were not rebooted before the master - I was adopting a new slave a week or two ago, which failed at first. So that might have been rebooted first.
-
RE: Hosts in a pool have gone offline after reboot
After my cluster rebooted, my hosts have gone offline and I can't get them back up.
There are three hosts in the pool, and I can only reach a VM that is sitting on one of the three hosts.
I see a few issues in the logs:
xapi-nbd[5695]: main: Failed to log in via xapi's Unix domain socket in 300.000000 seconds
In xensource.log:
Mar 25 13:28:43 pythia xapi: [ warn||0 ||startup] task [starting up database engine] exception: Db_exn.DBCache_NotFound("missing column", "VM", "recomMendations")
Mar 25 13:28:43 pythia xapi: [error||0 ||backtrace] server_init *****a4d4 failed with exception Db_exn.DBCache_NotFound("missing column", "VM", "recomMendations")
Mar 25 13:28:43 pythia xapi: [error||0 ||backtrace] Raised Db_exn.DBCache_NotFound("missing column", "VM", "recomMendations")
Mar 25 13:28:43 pythia xapi: [error||0 ||backtrace] 1/1 xapi Raised at file (Thread 0 has no backtrace table. Was with_backtraces called?, line 0
As far as I can tell, the database has gone and corrupted itself, preventing the XAPI server from starting, which then prevents XO / etc. from running.
Oh sage ones, anyone have an idea on how to fix this?
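For anyone triaging a similar log, here is a minimal sketch of pulling the table and column names out of the DBCache_NotFound lines, so the repeated backtrace entries collapse into one finding. The function name find_missing_columns and the MissingColumn type are made up for illustration; the regular expression only assumes the message format shown in the log excerpt above, and it says nothing about why the column is missing.

```python
import re
from typing import Iterator, NamedTuple


class MissingColumn(NamedTuple):
    """One (table, column) pair reported by xapi as missing."""
    table: str
    column: str


# Matches the exception text exactly as it appears in the excerpt above.
_PATTERN = re.compile(
    r'Db_exn\.DBCache_NotFound\("missing column",\s*"([^"]+)",\s*"([^"]+)"\)'
)


def find_missing_columns(log_text: str) -> Iterator[MissingColumn]:
    """Yield each distinct missing (table, column) pair, in order of first appearance."""
    seen = set()
    for match in _PATTERN.finditer(log_text):
        item = MissingColumn(match.group(1), match.group(2))
        if item not in seen:
            seen.add(item)
            yield item


# One line copied from the log excerpt above.
sample = (
    "Mar 25 13:28:43 pythia xapi: [ warn||0 ||startup] task "
    "[starting up database engine] exception: "
    'Db_exn.DBCache_NotFound("missing column", "VM", "recomMendations")'
)

for mc in find_missing_columns(sample):
    print(f"table={mc.table} column={mc.column}")
```

In practice the text would come from reading the xensource.log file on the affected host rather than from a hard-coded sample.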