XCP-ng

    Hosts in a pool have gone offline after reboot

    • Aeoran

      After my cluster rebooted, my hosts have gone offline and I can't get them back up.

      There are three hosts in the pool, and I can only reach a VM that is sitting on one of the three hosts.

      I see a few issues in the logs:

      xapi-nbd[5695]: main: Failed to log in via xapi's Unix domain socket in 300.000000 seconds
      

      In xensource.log:

      Mar 25 13:28:43 pythia xapi: [ warn||0 ||startup] task [starting up database engine] exception: Db_exn.DBCache_NotFound("missing column", "VM", "recomMendations")
      Mar 25 13:28:43 pythia xapi: [error||0 ||backtrace] server_init *****a4d4 failed with exception Db_exn.DBCache_NotFound("missing column", "VM", "recomMendations")
      Mar 25 13:28:43 pythia xapi: [error||0 ||backtrace] Raised Db_exn.DBCache_NotFound("missing column", "VM", "recomMendations")
      Mar 25 13:28:43 pythia xapi: [error||0 ||backtrace] 1/1 xapi Raised at file (Thread 0 has no backtrace table. Was with_backtraces called?, line 0
      

      As far as I can tell, the database has gone and corrupted itself, preventing the XAPI server from starting, which then prevents XO / etc. from running.

      Oh sage ones, anyone have an idea on how to fix this?
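
To see at a glance which database exception xapi is hitting at startup, the log excerpt above can be condensed with a small helper. This is only a sketch: the function name `xapi_db_errors` is made up for illustration, and the default path assumes the log lives at /var/log/xensource.log.

```shell
# Sketch: count the distinct Db_exn exceptions recorded in a xapi log.
# xapi_db_errors is a hypothetical helper name, not part of XCP-ng.
xapi_db_errors() {
    # Assumed default location of the log; pass another file as $1 to override.
    local log_file="${1:-/var/log/xensource.log}"
    # -o prints only the matched text, for example
    # Db_exn.DBCache_NotFound("missing column", "VM", "recomMendations")
    grep -o 'Db_exn\.[A-Za-z_]*([^)]*)' "$log_file" | sort | uniq -c
}
```

Calling `xapi_db_errors` with no argument reads the default log; a single repeated exception suggests one root cause rather than widespread damage.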

        • Danp (Pro Support Team)

          No cross posting please; topics merged

          • Danp (Pro Support Team)

            What version of XCP are you running? Did you recently install patches to your pool? If yes, did you make sure to reboot the pool master first?

            • Aeoran @Danp

              @Danp Sorry about the cross post. I realised I might have put it in the wrong section of the forum, as this might not be related to XO management. But that's where I first encountered it, so good enough.

              All of the hosts are running the most up-to-date version, and the patches are all up to date as of right now. I cannot be absolutely certain that the slaves were not rebooted before the master: I was adopting a new slave a week or two ago, which failed at first, so that one might have been rebooted first.

              • Danp (Pro Support Team)

                Is the pool master up and running?

                • Aeoran

                  No, the pool master is not running. The logs posted are from the machine that was the pool master.

                  The machine boots but the management interface (console) has no NIC, and no network.

                  • Danp (Pro Support Team)

                    Have you checked to see if there are any pending updates on the pool master by running yum update?

                    I've never encountered this particular error, but here are some other things you could try --

                    • Emergency network reset, which can be done from the CLI or from within xsconsole

                    • Force one of your slaves to become the new master using xe pool-emergency-transition-to-master. You can read more about this in this thread.

                    I hope that you have good backups of your VMs. 😉
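
The two suggestions above could be sketched roughly as follows. The `run` wrapper and both function names are hypothetical, added only so the commands can be previewed before anything is changed; `xe-reset-networking`, `xe pool-emergency-transition-to-master` and `xe pool-recover-slaves` are the stock tools. This assumes you are on the local console of the host in question, and that the network reset may want a confirmation and a reboot.

```shell
# Sketch only. DRY_RUN defaults to 1, so commands are printed, not executed.
# run, reset_master_networking and promote_this_host are hypothetical names.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# Option 1: on the broken master, reset the management network from the CLI.
reset_master_networking() {
    run xe-reset-networking
}

# Option 2: on a surviving slave, promote it to master, then ask the
# remaining slaves to follow the new master.
promote_this_host() {
    run xe pool-emergency-transition-to-master
    run xe pool-recover-slaves
}
```

Running `promote_this_host` as-is only prints the two commands; `DRY_RUN=0 promote_this_host` would execute them.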

                    • Aeoran @Danp

                      @Danp I did a yum update and an xe-toolstack-restart on all three hosts, but it made no difference.

                      I also tried doing an emergency network reset on just the master, but no difference. I think that XAPI isn't up at all because of the database.

                      Will a reinstall of XCP work? Some forum entries seem to suggest so, but I'm leery of how fragile this seems to be.

                      • Aeoran @Danp

                        @Danp So the saga continues:

                        I designated the sole running host as the new master. It did this happily and in fact also discovered one of the other hosts - the one that was not the old master. So far so good.

                        I was able to then take a look at the list of VMs, then force any VMs "running" on the dead host (the old master) to be halted. Now the dead host only has the XCP control plane running.

                        All that is left is to get the dead host forgotten from the pool and then rejoin the pool, right?
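
For anyone following along, the step of forcing the stale "running" VMs to halted might look like the sketch below, run on the new master. The function name is hypothetical and the host UUID is a placeholder you would look up with `xe host-list`. It assumes the VMs are genuinely not running anywhere; resetting the power state of a VM that is still alive and then starting it elsewhere risks disk corruption.

```shell
# Sketch: mark VMs still recorded as resident on an unreachable host as halted.
# halt_vms_on_dead_host is a hypothetical helper name.
halt_vms_on_dead_host() {
    local host_uuid="$1"
    local vm vm_uuids
    # --minimal returns a comma-separated list of UUIDs; the control
    # domain (dom0) is excluded so only guest VMs are touched.
    vm_uuids=$(xe vm-list resident-on="$host_uuid" is-control-domain=false --minimal)
    local IFS=','
    for vm in $vm_uuids; do
        echo "resetting power state of $vm"
        xe vm-reset-powerstate uuid="$vm" force=true
    done
}
```

Usage would be `halt_vms_on_dead_host <UUID of the dead host>`.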

                        • Danp (Pro Support Team) @Aeoran

                          @Aeoran Yes, that sounds like a good plan.

                          • Aeoran @Danp

                            @Danp Is there some documentation you would recommend on how to safely forget a host? I'm confronted with dire warnings on how this will permanently destroy the SRs used by the VMs that used to run on the dead host. So, I want to make really sure I won't be doing something wrong here.

                            Thanks!

                            • nikade (Top contributor)

                              Shared storage should belong to the pool; only local SRs should be affected when you forget the old master.
                              Just make sure all the slaves know about the new master before doing anything to the old one.
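
One way to check what each host believes about the master is to read its pool.conf, as in the sketch below. The helper name `pool_role` is hypothetical; the sketch assumes the file sits at /etc/xensource/pool.conf and contains either `master` or `slave:<master address>`.

```shell
# Sketch: report which role a host's toolstack is configured for.
# pool_role is a hypothetical helper name.
pool_role() {
    # Assumed default location; pass another file as $1 to override.
    local conf="${1:-/etc/xensource/pool.conf}"
    local content
    content=$(cat "$conf") || return 1
    case "$content" in
        master)  echo "this host is the pool master" ;;
        slave:*) echo "this host follows master ${content#slave:}" ;;
        *)       echo "unrecognised pool.conf content: $content" ;;
    esac
}
```

If a slave still names the old master, `xe pool-recover-slaves` on the new master or `xe pool-emergency-reset-master master-address=<new master>` on that slave are the usual ways to repoint it.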

                              • Aeoran @nikade

                                @nikade It looks like I cannot get the dead host to rejoin the pool using xe pool-join:

                                You attempted an operation that was not allowed.
                                reason: Host is already part of a pool
                                

                                Will I have problems if I try to force it to join with xe pool-join force? A forum post seems to suggest that this may propagate data corruption errors from the dead host to the pool, which is obviously undesirable. So how would I avoid that?

                                • nikade (Top contributor)

                                  Not really sure, I'd ask @olivierlambert to be sure.

                                  • Danp (Pro Support Team)

                                    What actions did you initially perform to remove the host from the pool?

                                    • Aeoran @Danp

                                      @Danp I didn't do anything. The master host failed on its own and stopped responding to XO.

                                      I've rebooted the host and the hardware all seems fine. The logs suggest that XAPI is not running because the database is missing a column (see above, first comment).

                                      • Danp (Pro Support Team)

                                        You probably need to forget the host using xe host-forget uuid=UUID where UUID belongs to the old pool master.

                                        See prior discussion on this topic -- https://xcp-ng.org/forum/topic/6164/remove-a-host-from-a-pool/14
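
A minimal sketch of that step, run on the new master, is shown below. The function name is hypothetical, and it assumes the host's name-label is known and unique in the pool (for example the hostname seen in the logs, if the label was never changed). `xe host-forget` may still ask for confirmation. Remember the earlier caveat that local SRs of the forgotten host are affected.

```shell
# Sketch: look up a host by name-label and forget it from the pool.
# forget_host_by_name is a hypothetical helper name.
forget_host_by_name() {
    local name="$1"
    local uuid
    uuid=$(xe host-list name-label="$name" --minimal)
    if [ -z "$uuid" ]; then
        echo "no host named $name found" >&2
        return 1
    fi
    case "$uuid" in
        *,*)
            # More than one match: refuse rather than guess.
            echo "several hosts are named $name: $uuid" >&2
            return 1
            ;;
    esac
    echo "forgetting host $name ($uuid)"
    xe host-forget uuid="$uuid"
}
```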

                                        • Aeoran @Danp

                                          @Danp How can I preserve or recover the local SRs of the dead host?

                                          • Danp (Pro Support Team) @Aeoran

                                            @Aeoran AFAIK, the XAPI database gets wiped whenever you add a host to a pool or remove it from one. You may be able to restore metadata to the old master once it no longer belongs to the pool, but I can't guarantee that this will work or that it won't produce other issues.

                                            If you don't have backups of the VMs, then you should be able to copy the VHD files to another location by accessing the directory /run/sr-mount/<SR UUID>/ on the old master.
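
Copying the VHD files could be done along these lines. This is a sketch under assumptions: the function name and the `SR_MOUNT_ROOT` override are hypothetical, the SR is assumed to be file-based (for example a local EXT SR, since LVM-based SRs do not expose VHD files in this directory), and the destination needs enough free space. All VHDs are copied because snapshots and their parents are separate files in the same directory.

```shell
# Sketch: copy every VHD of one SR to a destination directory.
# copy_sr_vhds and SR_MOUNT_ROOT are hypothetical names for illustration.
copy_sr_vhds() {
    local sr_uuid="$1" dest="$2"
    local src="${SR_MOUNT_ROOT:-/run/sr-mount}/$sr_uuid"
    local f
    if [ ! -d "$src" ]; then
        echo "SR mount $src not found" >&2
        return 1
    fi
    mkdir -p "$dest" || return 1
    for f in "$src"/*.vhd; do
        [ -e "$f" ] || continue   # nothing matched the glob
        echo "copying $(basename "$f")"
        cp -p "$f" "$dest"/ || return 1
    done
}
```

Usage would be `copy_sr_vhds <SR UUID> /path/to/backup`, with the SR UUID taken from the directory names under /run/sr-mount/.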
