XCP-ng

    can't start vm after host disconnect

    • A Offline
      alex821982 @Danp
      last edited by

      @Danp said in can't start vm after host disconnect:

      xensource.log

      If you search for this error, which also appears in XOA, here is the piece that relates to it.

      • A Offline
        Andrew Top contributor @dave.opc
        last edited by

        @dave-opc I just ran into this disaster too, but a little different.

        Here's what I did:

        My pool master's hardware failed (HA not enabled). I could not wait to replace the hardware (motherboard VRM failure) and had additional resources on site anyway (N+2 hosts). All of the VMs are on shared storage so I did not need to recover a SR from the failed host.

        It was a total mess.... pool master dead, XO VM on the dead host, important VMs still showing as 'running' on the dead host.

        I have a second backup XO VM that does not run any tasks but gives me off-pool management access (I have several XOs ready, including one in VirtualBox on Windows). But without a master there was nothing to see or do in the pool.

        Then I had to restore pool functions with a new master.
        On a different, but running host in the pool, I forced a new master:

        xe pool-emergency-transition-to-master
        sleep 10
        xe pool-recover-slaves
        

        After there was a new master, I reset the power for VMs stuck in limbo on the dead host:

        xe vm-list
        xe vm-reset-powerstate vm=VM_UUID --force
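
When many VMs are stuck on the dead host, resetting them one at a time gets tedious. The following is a minimal sketch, not a tested procedure: the function name `reset_vms_on_dead_host` is hypothetical, and it assumes the `resident-on` and `is-control-domain` filters of `xe vm-list` behave as expected on your version. Run it on the new master, and only if the host is really powered off; otherwise the same VM can end up running in two places.

```shell
# Hypothetical helper (not part of xe): reset the power state of every
# non-dom0 VM that XAPI still lists as resident on a dead host.
# Only safe when the host is physically off.
reset_vms_on_dead_host() {
    dead_host_uuid="$1"
    if [ -z "$dead_host_uuid" ]; then
        echo "usage: reset_vms_on_dead_host DEAD_HOST_UUID" >&2
        return 1
    fi

    # --minimal prints the matching UUIDs on one comma-separated line
    vm_uuids=$(xe vm-list resident-on="$dead_host_uuid" \
        is-control-domain=false params=uuid --minimal)

    # Split the comma-separated list and reset each VM in turn
    for vm_uuid in $(printf '%s' "$vm_uuids" | tr ',' ' '); do
        echo "resetting power state of $vm_uuid"
        xe vm-reset-powerstate uuid="$vm_uuid" --force
    done
}

# Usage (DEAD_HOST is the UUID shown by `xe host-list`):
#   reset_vms_on_dead_host DEAD_HOST
```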
        

        Then I had to kick the totally dead host out of the pool:

        xe host-list
        xe host-declare-dead uuid=DEAD_HOST
        xe host-forget uuid=DEAD_HOST
        

        I tried just declaring it dead but that was not good enough. VMs would not restart because they wanted to start on the dead host, and then they would not start on a new host because they had issues with the SR. The shared SR was still 'attached' to the dead host and could not be removed. Also, backups were still trying to reach the dead host. So, I had to forget the dead host and move on.

        I would have liked to rejoin the dead host to the pool, but it will take a few days to revive the server, so it had to be forgotten by the pool. I'll just have to reformat the rebuilt host node and join it as a new pool member.

        • A Offline
          alex821982 @Andrew
          last edited by

          @Andrew Your situation is even worse. When your master disappeared, did you also lose control of the pool? Shouldn't the master role be transferred to another host automatically if the master is unavailable for a long time? Why doesn't that happen? Also, I didn't quite understand: after you removed from the pool the host these VMs had been running on, did your VMs then start on the second host?

          In general, these seem like very serious bugs: we have a fault-tolerant setup, yet we essentially lose that fault tolerance because of this behavior.

          • A Offline
            Andrew Top contributor @alex821982
            last edited by

            @alex821982

            Correct, NO master = NO pool management (the VMs keep running).

            • If HA is enabled, another master is elected automatically.
            • If HA is not enabled, each member waits for the master to return.
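
To see which of the two cases applies to a pool, you can read the pool's `ha-enabled` field. This is a minimal sketch: the function names are hypothetical, it assumes `xe` is run on a pool member that can reach a master, and the messages only restate the two bullets above.

```shell
# Hypothetical helpers (not part of xe).

# Print "true" or "false" depending on whether HA is enabled on the pool.
pool_ha_enabled() {
    # A host belongs to exactly one pool, so this returns a single UUID
    pool_uuid=$(xe pool-list params=uuid --minimal)
    xe pool-param-get uuid="$pool_uuid" param-name=ha-enabled
}

# Summarize what to expect if the master goes away.
describe_master_failover() {
    if [ "$(pool_ha_enabled)" = "true" ]; then
        echo "HA enabled: another master should be elected automatically"
    else
        echo "HA disabled: members wait for the master to return"
    fi
}

# Usage:
#   describe_master_failover
```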

            I deleted the dead host (old master) because even when I marked it as dead (from the new master) the VMs from it would not restart and the backups were still trying to communicate with it. Deleting it from the pool seemed the only way, or at least the quickest, to restore functionality.

            I'll have to look into HA a little more and its issues. It's simple to turn on, but has a few complications/consequences in normal use...

            • A Offline
              alex821982 @Andrew
              last edited by

              @Andrew said in can't start vm after host disconnect:

              I'll have to look into HA a little more and its issues. It's simple to turn on, but has a few complications/consequences in normal use...

              Which ones, for example? Can we just not enable it in the VM settings, so that we only get the master transfer functionality? I forgot that master transfer doesn't work when HA is turned off.
              As for the rest, when the VMs are left hanging and nothing can be done with them, is that still a bug? They should just turn off.

              • A Offline
                Andrew Top contributor @alex821982
                last edited by

                @alex821982 Here are the HA docs.

                • olivierlambertO Online
                  olivierlambert Vates 🪐 Co-Founder CEO
                  last edited by olivierlambert

                  You need to think about data coherency. As a human, you know that your server was physically dead (PSU dead). But from XAPI's perspective, what if only the management network was dead? The VM would continue to run correctly, but there's no way for XAPI to know it. So if you decide to boot the VM again, it may corrupt the data (having the VM running in two places with the same disk: catastrophic corruption).

                  That's why, by default, it prevented you from starting the VM: it couldn't contact the host that might still have the VM running, and starting it elsewhere could lead to catastrophic corruption.

                  With HA, there's an extra mechanism (storage heartbeat) that helps make a better decision (at the cost of auto-fencing hosts that can't reach the HA SR).

                  • A Offline
                    alex821982 @Andrew
                    last edited by

                    @Andrew
                    I read this. That's why I wrote that you can leave HA disabled on each VM and use this feature only for automatic transfer of the master.
                    Well, okay, you've strayed from the subject; we still have another problem...

                    • D Offline
                      dave.opc @olivierlambert
                      last edited by dave.opc

                      @olivierlambert
                      What about when the system doesn't know that the host is offline, but I know for sure that the host is down, and I need manual control over starting/copying the VM?
                      Why does --force exist in those commands, then, if it doesn't help in any way?

                      • A Offline
                        alex821982 @olivierlambert
                        last edited by

                        @olivierlambert
                        I now understand what the problem was: HA was not enabled. As I understand it, with HA enabled the VMs would have been turned off automatically and not locked, right?
                        The only thing remaining, of course, is the question that dave.opc asked.

                        • A Offline
                          alex821982 @olivierlambert
                          last edited by

                          @olivierlambert
                          I also wanted to ask about HA and automatic reboots of a VM.
                          If HA is enabled on the VM

                          and auto reboot is enabled in the guest system

                          then nothing bad should happen? There will be an attempt to start the VM, but it will reboot and start anyway.

                          • olivierlambertO Online
                            olivierlambert Vates 🪐 Co-Founder CEO
                            last edited by

                            If HA is enabled, then the pool should elect a new master (if the failed host was the master) and restart all the VMs.

                            But be careful with HA: if your SR access is blocked (e.g. a network issue on your SR), all the hosts will auto-fence and reboot.
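
For reference, HA is switched on for the whole pool against a heartbeat SR, and VMs are then marked for restart individually. This is a minimal sketch assuming a shared SR is available for the heartbeat; the function names and the placeholder UUIDs in the usage lines are hypothetical, and the HA docs remain the authority on prerequisites. Keep the fencing warning above in mind before enabling it.

```shell
# Hypothetical wrappers (not part of xe) around the two HA-related commands.

# Enable HA for the pool, using the given shared SR for the storage heartbeat.
enable_pool_ha() {
    if [ -z "$1" ]; then
        echo "usage: enable_pool_ha HEARTBEAT_SR_UUID" >&2
        return 1
    fi
    xe pool-ha-enable heartbeat-sr-uuids="$1"
}

# Mark one VM so that HA restarts it when its host fails.
protect_vm() {
    if [ -z "$1" ]; then
        echo "usage: protect_vm VM_UUID" >&2
        return 1
    fi
    xe vm-param-set uuid="$1" ha-restart-priority=restart
}

# Usage (placeholders, replace with real UUIDs):
#   enable_pool_ha HEARTBEAT_SR_UUID
#   protect_vm VM_UUID
```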

                            • A Offline
                              alex821982
                              last edited by alex821982

                              I still wanted to understand: is there anything wrong with rebooting from inside the VM rather than from XOA? It is written that when HA is enabled, all shutdowns and reboots must be performed from XOA so that the system understands what is happening. That seems strange to me: if the guest tools are installed, XCP-ng should understand everything that happens in the VM itself, shouldn't it?
                              So, is it possible to reboot from inside the VM itself? In my opinion it will just reboot and HA won't even have time to react; even if it does, it will send a command to start the machine, which will boot anyway.
                              I have nightly auto-reboots scheduled on some machines, and they are necessary.

                              • olivierlambertO Online
                                olivierlambert Vates 🪐 Co-Founder CEO
                                last edited by

                                Can you be more specific about what you expect in terms of behavior?

                                • D Offline
                                  dave.opc @olivierlambert
                                  last edited by

                                  @olivierlambert
                                  With HA enabled, if a Windows VM reboots from the Windows system scheduler, is this OK? Will this affect the VM somehow?

                                  • olivierlambertO Online
                                    olivierlambert Vates 🪐 Co-Founder CEO
                                    last edited by

                                    No, from the XCP-ng point of view, the VM is still running without any interruption.
