XCP-ng

    Pool master and slaves cannot communicate with each other but can reach everything else

    • justjosh

      Hi all,

      Overnight our pool got into a weird state where the master seemed to see all slaves as offline.
      Upon investigation, it seems like all nodes are still online and not in emergency mode.
      All nodes still think that they have the same master in the pool.conf file.
      Able to SSH into all nodes including the master and access all parts of the network.
      No issues with connectivity to the iSCSI storage.
      All slaves can ping each other, but not the master.
      VMs that are NOT on the master node seem to be running fine.
      VMs on the master node are behaving oddly (most have no internet connectivity).
      The XAPI service is running on all hosts (although the master shows this extra warning line: "Warning: xapi.service changed on disk. Run 'systemctl daemon-reload' to reload units.").
      XAPI commands seem to hang on the slaves (xe sr-list, xe vm-list, xe host-list).
      Unable to log into the slaves from XCP-ng Center because it prompts to log into the master, and the master sees all slaves as offline.
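
For anyone wanting to confirm the "xe commands hang" symptom without tying up a shell, here is a minimal Python sketch that runs a command with a time limit and reports whether it finished, failed, or hung. The helper name `probe` and the 10-second default are assumptions made for illustration; only the `xe` subcommands themselves come from the list above.

```python
import subprocess


def probe(cmd, timeout_s=10.0):
    """Run cmd and classify the outcome as 'ok', 'failed', 'hung' or 'missing'."""
    try:
        result = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,  # subprocess.run kills the child when this expires
        )
    except subprocess.TimeoutExpired:
        return "hung"
    except FileNotFoundError:
        # The binary is not installed on this machine (e.g. not an XCP-ng host).
        return "missing"
    return "ok" if result.returncode == 0 else "failed"
```

On a slave, something like `probe(["xe", "host-list"])` would be expected to return `"hung"` while the slave cannot reach the master, although that depends on the chosen timeout.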

      What is the cleanest way to gracefully fix this? Maybe transition one of the slaves into the master?

      Thanks!

      • JamuelStarkey

        Not sure you would call this clean or graceful. We hemmed and hawed over the best path (emergency-elect a new master, reboot the master, etc.). We've only seen this once (it hasn't recurred in over 2 years), and we eventually, unfortunately, settled on forcibly restarting the master, as it wouldn't even shut down on its own. Guests on the master had to have their power state forcibly reset after the master came up clean.

        We probably spent 4 hours degraded, not wanting to choose the reboot option since we had running VMs, but the problem cleared after a simple reboot and 10 minutes of hard downtime. One lesson learned: limit the damage a failing or failed master can cause by not running critical VMs on the master.

        • justjosh @JamuelStarkey

          @JamuelStarkey Can I just confirm that you had the same network issue, where communication between the master and the slaves was severed but the master was still connected to the internet? Did you not have to touch the slaves at all?

          • JamuelStarkey @justjosh

            @justjosh We just had to do a toolstack restart on one (out of four) of the slaves. The other three reconnected as soon as the master completed its restart. VMs on the slaves were completely unaffected. The VMs on the master had to have their power state reset, and then they started normally. I think most of the VMs ran an automatic fsck (CentOS 7) and one needed a little help with fsck, but all recovered and nothing was lost.
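
The power-state reset described above can be sketched as a small dry-run helper in Python that only builds and prints the `xe` commands rather than running them. The function names and the placeholder UUIDs are hypothetical; the sketch assumes the master has already been rebooted, because resetting the power state of a VM that is in fact still running somewhere risks it being started twice.

```python
import shlex


def parse_minimal(output):
    """Split the comma-separated UUID list printed by `xe ... --minimal`."""
    return [item.strip() for item in output.split(",") if item.strip()]


def list_resident_vms_cmd(host_uuid):
    """Command listing the guest VMs that XAPI still records on the given host."""
    return ["xe", "vm-list", f"resident-on={host_uuid}",
            "is-control-domain=false", "--minimal"]


def reset_powerstate_cmds(vm_uuids):
    """One forced power-state reset per VM UUID (marks the VM as halted in XAPI)."""
    return [["xe", "vm-reset-powerstate", f"uuid={u}", "--force"] for u in vm_uuids]


def render(cmds):
    """Format the commands as shell lines for review before running them by hand."""
    return "\n".join(shlex.join(c) for c in cmds)
```

For example, `print(render(reset_powerstate_cmds(parse_minimal("aaa,bbb"))))` prints one `xe vm-reset-powerstate` line per VM. The toolstack restart on a stuck slave is a single command, `xe-toolstack-restart`, run on that slave, so it is left out of the helper.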

            • justjosh @JamuelStarkey

              @JamuelStarkey Just want to reconfirm this: when you had the issue, the master was still connected to every single thing on the network, just unable to see the slaves?

              • JamuelStarkey @justjosh

                @justjosh Yes. VMs on the master had intermittent network connectivity. We saw a high load average on the master's dom0; I think the processes responsible were tapdisk, IIRC. Couldn't ping anything from the master or to the master. Everything was normal on the slaves.

                • justjosh @JamuelStarkey

                  Just updating for anyone who has the same issue: we ended up just rebooting the master and, like @JamuelStarkey said, everything automatically fell into place. We did have to exit maintenance mode on the master and replug the PBD, but everything else went back to normal immediately.

                  Still frustrating to experience, and I would really love to know what caused this. If there are any logs I can pull to figure this out, do let me know @olivierlambert.
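
For reference, the two follow-up steps (leaving maintenance mode and replugging the PBD) might look like this from the CLI. This is a hedged Python sketch that only assembles the commands; it assumes "exit maintenance mode" corresponds to `xe host-enable`, and the function names and placeholder UUIDs are made up for illustration.

```python
def unplugged_pbds_cmd(host_uuid):
    """Command listing PBDs on the host that are not currently attached."""
    return ["xe", "pbd-list", f"host-uuid={host_uuid}",
            "currently-attached=false", "--minimal"]


def post_reboot_cmds(host_uuid, unplugged_output):
    """Re-enable the host, then plug every PBD named in the --minimal output."""
    # --minimal prints UUIDs as a single comma-separated line.
    pbd_uuids = [u.strip() for u in unplugged_output.split(",") if u.strip()]
    cmds = [["xe", "host-enable", f"uuid={host_uuid}"]]
    cmds.extend(["xe", "pbd-plug", f"uuid={u}"] for u in pbd_uuids)
    return cmds
```

The output of the first command would be fed into `post_reboot_cmds`, and the resulting list can be reviewed before anything is executed on the master.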
