@justjosh Yes. VMs on the master had intermittent network connectivity. We saw a high load average in the master's dom0; I think the processes responsible were tapdisk, IIRC. We couldn't ping anything from the master, and nothing could ping the master. Everything was normal on the slaves.
Posts made by JamuelStarkey
-
RE: Pool master and slaves cannot communicate with each other but can reach everything else
@justjosh We just had to do a toolstack restart on one of the four slaves. The other three reconnected as soon as the master completed its restart. VMs on the slaves were completely unaffected. The VMs on the master had to have their power state reset, and then they started normally. I think most of the VMs ran an automatic fsck (CentOS 7), and one needed a little manual help with fsck, but all recovered and nothing was lost.
-
RE: Pool master and slaves cannot communicate with each other but can reach everything else
Not sure you'd call this clean or graceful. We hemmed and hawed over the best path (emergency-elect a new master, reboot the master, etc.). We've only seen this once (it hasn't recurred in over 2 years), and we eventually, unfortunately, settled on forcibly restarting the master, as it wouldn't even shut down on its own. Guests on the master had to have their power state forcibly reset after the master came up clean.
We probably spent 4 hours degraded, not wanting to choose the reboot option since we had running VMs, but the problem was cleared by a simple reboot and 10 minutes of hard downtime. One lesson learned: limit the damage a failing or failed master can cause by not running critical VMs on the master.