Pool master and slaves cannot communicate with each other but can reach everything else

justjosh

Hi all,

Overnight our pool went into a weird state where the master seemed to see all slaves as offline.
Upon investigation, it seems like all nodes are still online and not in emergency mode.
All nodes still think that they have the same master in the pool.conf file.
Able to SSH into all nodes including the master and access all parts of the network.
No issues with connectivity to the iSCSI storage.
All slaves can ping each other except the master.
VMs that are NOT on the master node seem to be running fine.
VMs on the master node are behaving weird (most have no internet connectivity).
XAPI service is running on all hosts (although the master has this extra warning line: "Warning: xapi.service changed on disk. Run 'systemctl daemon-reload' to reload units.").
XAPI commands seem to hang on slaves (xe sr-list/vm-list/host-list)
Unable to log into slaves on XCP-ng Center because it prompts to log into master and master sees all slaves as offline.
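
For anyone checking the same symptoms, the pool.conf and hanging-xe observations above can be sketched as a small script. This is a minimal sketch under assumptions: pool.conf is assumed to live at /etc/xensource/pool.conf and to contain either `master` or `slave:<master address>`, and `pool_role`, `xapi_responds` and `POOL_CONF` are hypothetical names made up for illustration.

```shell
# Assumption: pool.conf holds either "master" or "slave:<master address>".
POOL_CONF="${POOL_CONF:-/etc/xensource/pool.conf}"

# Print the role this host believes it has, plus the master address
# if the host is a slave. Optional $1 overrides the file path.
pool_role() {
  conf=$(cat "${1:-$POOL_CONF}")
  case "$conf" in
    master)  echo "role=master" ;;
    slave:*) echo "role=slave master=${conf#slave:}" ;;
    *)       echo "role=unknown" ;;
  esac
}

# Bounded xe call, so a hung xapi does not hang the shell too.
# Returns non-zero if xe fails or does not answer within $1 seconds
# (default 10).
xapi_responds() {
  timeout "${1:-10}" xe host-list --minimal >/dev/null 2>&1
}
```

Running `pool_role` on each host, followed by `xapi_responds || echo "xe timed out or failed"`, gives a quick view of which hosts agree on the master and which ones have a toolstack that no longer answers.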

What is the cleanest way to gracefully fix this? Maybe transition one of the slaves into the master?

Thanks!

JamuelStarkey

Not sure you'd call this clean or graceful. We hemmed and hawed over what the best path was (emergency-elect a new master, reboot the master, etc.). But we've only seen this one time (it hasn't recurred in over 2 years), and eventually we unfortunately settled on forcibly restarting the master, as it wouldn't even shut down on its own. Guests on the master had to have their power state forcibly reset after the master came up clean.

We probably spent 4 hours degraded, not wanting to choose the reboot option since we had running VMs, but the problem was cleared after a simple reboot and 10 minutes of hard downtime. One lesson learned was to limit the damage that a failing/failed master can cause by not running critical VMs on the master.

justjosh

@JamuelStarkey Can I just confirm that you had the same network issues where communication between master and slave was severed but master was still connected to the internet? Did you not have to touch the slaves at all?

JamuelStarkey

@justjosh We just had to do a toolstack restart on one (out of four) of the slaves. The other three just reconnected as soon as the master completed its restart. VMs on the slaves were completely unaffected. The VMs on the master had to have their power state reset and then they started normally. I think most of the VMs ran an automatic fsck (CentOS 7) and one needed a little help with fsck, but all recovered and nothing was lost.
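
The two steps described above (toolstack restart on a slave, power-state reset for VMs that were on the master) roughly correspond to the commands below. This is a minimal sketch, not a record of what was actually typed: `run`, `restart_toolstack`, `reset_vm_powerstate` and the `DRY_RUN` switch are hypothetical names, and the VM UUID is a placeholder. With the default `DRY_RUN=1` the commands are only printed.

```shell
# Print the command when DRY_RUN=1 (the default here), otherwise run it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Restart the toolstack on the host where this is executed.
restart_toolstack() {
  run xe-toolstack-restart
}

# Force the recorded power state of one VM back to halted.
# $1: VM UUID. Only do this when the VM is definitely not running anywhere.
reset_vm_powerstate() {
  run xe vm-reset-powerstate uuid="$1" force=true
}
```

For example, `reset_vm_powerstate <vm-uuid>` prints the command for review, and `DRY_RUN=0 reset_vm_powerstate <vm-uuid>` executes it.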

justjosh

@JamuelStarkey Just want to reconfirm this: when you had the issue, the master was still connected to every single thing on the network, just unable to see the slaves?

JamuelStarkey

@justjosh Yes. VMs on the master had intermittent network connectivity. We saw a high load average in the master's dom0; I think the processes there were tapdisk, IIRC. Couldn't ping anything from the master or to the master. Everything was normal on the slaves.

justjosh

Just updating for anyone who has the same issue: we ended up just rebooting the master and, like @JamuelStarkey said, everything automatically fell into place. We did have to exit maintenance mode on the master and replug the PBD, but everything else went back to normal immediately.
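
For reference, the two manual steps above (exiting maintenance mode and replugging the PBD) can be sketched as follows. This is a sketch under assumptions rather than an exact transcript: `run`, `enable_host`, `plug_pbds` and `DRY_RUN` are hypothetical names, and the UUIDs are placeholders. With the default `DRY_RUN=1` nothing is executed.

```shell
# Print the command when DRY_RUN=1 (the default here), otherwise run it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Take a host out of maintenance mode. $1: host UUID.
enable_host() {
  run xe host-enable uuid="$1"
}

# Plug each PBD in a comma-separated UUID list, which is the format
# that xe prints when called with --minimal. $1: the list.
plug_pbds() {
  echo "$1" | tr ',' '\n' | while read -r pbd; do
    [ -n "$pbd" ] || continue   # skip empty entries
    run xe pbd-plug uuid="$pbd"
  done
}
```

The list of unplugged PBDs could be obtained with something like `xe pbd-list currently-attached=false params=uuid --minimal` and passed to `plug_pbds`; review the dry-run output before setting `DRY_RUN=0`.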

Still frustrating to experience, and I would really love to know what caused this. If there are any logs I can pull to figure this out, do let me know @olivierlambert