@xcprocks said in Restoring a downed host ISNT easy:
So, we had a host go down (OS drive failure). No big deal, right? According to the instructions, you just reinstall XCP on a new drive, jump over into XOA, and do a metadata restore.
Well, not quite.
First, during installation, you really, really must not select any of the disks for creating an SR, because the installer could wipe out an existing SR on that disk.
Second, you have to do the sr-probe and sr-introduce and pbd-create and pbd-plug to get the SRs back.
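For anyone following along, the sequence in that second step can be sketched roughly as below. This is only a sketch under assumptions, not an official procedure: it assumes an NFS SR, and it builds the `xe` command lines without running them so they can be reviewed first. The helper name `reattach_sr_commands`, the UUIDs, the server address, and the export path are all made up for illustration. The real SR UUID comes from the `sr-probe` output, and the PBD UUID is whatever `pbd-create` prints.

```python
import shlex


def reattach_sr_commands(sr_uuid, host_uuid, name_label, server, serverpath):
    """Build (but do not run) the xe commands that re-attach an existing NFS SR.

    Hypothetical helper: every argument is a placeholder supplied by the caller.
    """
    device_config = [
        f"device-config:server={server}",
        f"device-config:serverpath={serverpath}",
    ]
    return [
        # 1. Probe the storage to find the UUID of the SR that already lives on it.
        ["xe", "sr-probe", "type=nfs", *device_config],
        # 2. Re-introduce the SR under its existing UUID (this does not format it).
        ["xe", "sr-introduce", f"uuid={sr_uuid}", "type=nfs",
         f"name-label={name_label}", "content-type=user", "shared=true"],
        # 3. Create the PBD linking the reinstalled host to the SR.
        ["xe", "pbd-create", f"sr-uuid={sr_uuid}", f"host-uuid={host_uuid}",
         *device_config],
        # 4. Plug the PBD, using the UUID printed by step 3.
        ["xe", "pbd-plug", "uuid=PBD_UUID_FROM_STEP_3"],
    ]


if __name__ == "__main__":
    # Placeholder values only; substitute your own before running anything.
    for cmd in reattach_sr_commands(
        "SR_UUID_FROM_PROBE", "NEW_HOST_UUID", "NFS SR", "10.0.0.5", "/export/sr"
    ):
        print(shlex.join(cmd))
```

Printing the commands instead of executing them is deliberate: a wrong `device-config` value is much easier to catch on screen than after the fact.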
Third, you then have to use XOA to restore the metadata, which looks pretty simple according to the directions at https://xen-orchestra.com/docs/metadata_backup.html#performing-a-restore:
"To restore one, simply click the blue restore arrow, choose a backup date to restore, and click OK:"
But this isn't quite true. When we did it, the restore threw an error:
"message": "no such object d7b6f090-cd68-9dec-2e00-803fc90c3593",
"name": "XoError",
Panic mode sets in... It can't find the metadata? We try an earlier backup. Same error. We check the backup NFS share: no, it's there all right.
After a couple of hours scouring the internet and not finding anything, it dawns on us... The object XOA is looking for is the OLD server not a backup directory. It is looking for the server that died and no longer exists. The problem is, when you install the new server, it gets a new ID. But the restore program is looking for the ID of the dead server.
But how do you tell XOA to copy the metadata over to the new server? It assumes that you want to restore it over an existing server. It does not provide a drop-down list to pick where to deploy it.
In an act of desperation, we copied the backup directory to a new location and named it with the ID number of the newly recreated server. Now XOA could restore the metadata and we were able to recover the VMs in the SRs without issue.
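That workaround amounts to a directory copy, which could be sketched like this. It assumes the backup sits in a directory named after the host's ID, as described above; the function name `clone_backup_for_new_host` and any paths or IDs passed to it are hypothetical, and the actual layout on your backup remote may differ, so look before copying.

```python
import shutil
from pathlib import Path


def clone_backup_for_new_host(parent_dir, old_id, new_id):
    """Copy the dead host's backup directory to a sibling named after the new host.

    Assumption: the backup lives in <parent_dir>/<old_id>. The original
    directory is left untouched so the copy can simply be deleted if needed.
    """
    src = Path(parent_dir) / old_id
    dst = Path(parent_dir) / new_id
    if not src.is_dir():
        raise FileNotFoundError(f"no backup directory for {old_id} under {parent_dir}")
    if dst.exists():
        raise FileExistsError(f"{dst} already exists; refusing to overwrite it")
    shutil.copytree(src, dst)
    return dst
```

Copying rather than renaming keeps the original backup intact in case the restore goes wrong.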
This long story is really just a way to highlight the need for better host backup in three ways:
A) The first idea would be to create better instructions. It is nowhere near as easy as the documentation says it is, and it's easy to mess up the first step so badly that you wipe out the contents of an SR. The documentation should spell this out.
B) The second idea is to add to the metadata backup something that reads the states of SR to PBD mappings and provides/saves a script to restore them. This would ease a lot of the difficulty in the actual restoring of a failed OS after a new OS can be installed.
C) The third idea is to provide a dropdown during the restoration of the metadata that allows the user to target a particular machine for the restore operation, instead of blindly assuming you want to restore it over a machine that is dead and gone.
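To make idea B a bit more concrete, here is a rough sketch of what "save a script to restore them" could look like. Nothing here is an existing XOA feature. The record format (`sr_uuid`, `type`, `name_label`, `shared`, `device_config`) and the function name `render_restore_script` are invented for illustration. Because a reinstalled host gets a new ID, the generated script takes the new host's UUID as its first argument.

```python
import shlex


def render_restore_script(mappings):
    """Render a shell script that re-attaches each saved SR to a new host.

    Hypothetical record format: each mapping is a dict with the keys
    sr_uuid, type, name_label, shared (bool) and device_config (dict).
    """
    q = shlex.quote
    lines = ["#!/bin/sh", "set -e", 'HOST_UUID="$1"']
    for m in mappings:
        device_config = " ".join(
            q(f"device-config:{key}={value}")
            for key, value in sorted(m["device_config"].items())
        )
        shared = "true" if m.get("shared") else "false"
        # Re-introduce the SR under the UUID it had before the host died.
        lines.append(
            f"xe sr-introduce uuid={q(m['sr_uuid'])} type={q(m['type'])} "
            f"name-label={q(m['name_label'])} content-type=user shared={shared}"
        )
        # pbd-create prints the new PBD's UUID, which pbd-plug then needs.
        lines.append(
            f"PBD_UUID=$(xe pbd-create sr-uuid={q(m['sr_uuid'])} "
            f'host-uuid="$HOST_UUID" {device_config})'
        )
        lines.append('xe pbd-plug uuid="$PBD_UUID"')
    return "\n".join(lines) + "\n"
```

If something like this were saved alongside the metadata backup, the manual probe/introduce/create/plug work would shrink to reviewing and running one script on the reinstalled host.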
I hope this helps out the next person trying to bring a host back from the dead, and I hope it also helps make XOA a better product.
Thanks for a good description of the restore process.
I was wary of the metadata-backup option. It sounds simple and good to have, but as you said it is in no way a comprehensive restore of a pool.
I'd like to add my own opinion here. I'd want a full pool restore, including networks, re-attaching SRs, and everything else that is needed to quickly get back up and running. A pool restore option should also be available on the boot media. It could look for an NFS/CIFS mount or a USB disk with the backup files on it. This would avoid things like issues with bonded networks not working.