Hosts fencing after latest 8.2.1 update
-
On Sunday I applied the latest patches to the pool and everything seemed to go fine for the most part. However, for the last two nights I have had hosts suddenly decide they needed to fence themselves and reboot. On Monday morning it was 2 of them, and this morning it was 4 (it's a 5-node pool).
Did any of the most recent updates make HA a bit more trigger-happy, or is that coincidental?
Interpreting the xha log is still pretty rough for me, so I'm having trouble determining the exact source of the problem - it certainly seems to be a state-file disagreement. I'm thinking maybe the backup task, which kicks off at midnight, is degrading storage performance enough to cause this - but that's only a guess.
Any insight would be appreciated!
xha.log:
Apr 23 00:26:41 CDT 2024 [debug] WD: now - settime = 38038 timeout = 75000.
Apr 23 00:26:41 CDT 2024 [info] WD: watchdog has not been updated at least for half timeout id=1 label=statefile.
Apr 23 00:26:44 CDT 2024 [debug] WD: now - settime = 41040 timeout = 75000.
Apr 23 00:26:44 CDT 2024 [info] WD: watchdog has not been updated at least for half timeout id=1 label=statefile.
Apr 23 00:26:47 CDT 2024 [debug] WD: now - settime = 44042 timeout = 75000.
Apr 23 00:26:47 CDT 2024 [info] WD: watchdog has not been updated at least for half timeout id=1 label=statefile.
Apr 23 00:26:50 CDT 2024 [debug] WD: now - settime = 47044 timeout = 75000.
Apr 23 00:26:50 CDT 2024 [info] WD: watchdog has not been updated at least for half timeout id=1 label=statefile.
Apr 23 00:26:53 CDT 2024 [debug] WD: now - settime = 50047 timeout = 75000.
Apr 23 00:26:53 CDT 2024 [info] WD: watchdog has not been updated at least for half timeout id=1 label=statefile.
Apr 23 00:26:55 CDT 2024 [debug] FH: Start fault handler.
Apr 23 00:26:55 CDT 2024 [debug] FH: waiting for SF from host (0), time since last SF update = 56037.
Apr 23 00:26:55 CDT 2024 [debug] FH: waiting for SF from host (1), time since last SF update = 55969.
Apr 23 00:26:55 CDT 2024 [debug] FH: waiting for SF from host (2), time since last SF update = 55825.
Apr 23 00:26:55 CDT 2024 [debug] FH: waiting for SF from host (3), time since last SF update = 55606.
Apr 23 00:26:55 CDT 2024 [debug] FH: waiting for SF from host (4), time since last SF update = 55390.
Apr 23 00:26:55 CDT 2024 [debug] FH: waiting for SF from host (5), time since last SF update = 55194.
Apr 23 00:26:56 CDT 2024 [debug] WD: now - settime = 53049 timeout = 75000.
Apr 23 00:26:56 CDT 2024 [info] WD: watchdog has not been updated at least for half timeout id=1 label=statefile.
Apr 23 00:26:59 CDT 2024 [debug] WD: now - settime = 56050 timeout = 75000.
Apr 23 00:26:59 CDT 2024 [info] WD: watchdog has not been updated at least for half timeout id=1 label=statefile.
Apr 23 00:26:59 CDT 2024 [debug] SM: SF domain is updated [sfdomain = (_11111)].
Apr 23 00:27:00 CDT 2024 [debug] SM: SF domain is updated [sfdomain = (0_1111)].
Apr 23 00:27:00 CDT 2024 [debug] SM: SF domain is updated [sfdomain = (00_111)].
Apr 23 00:27:00 CDT 2024 [debug] SM: SF domain is updated [sfdomain = (000_11)].
Apr 23 00:27:00 CDT 2024 [debug] SM: SF domain is updated [sfdomain = (0000_1)].
Apr 23 00:27:00 CDT 2024 [debug] SM: SF domain is updated [sfdomain = (00000_)].
Apr 23 00:27:00 CDT 2024 [debug] FH: HB/SF state has become stable.
Apr 23 00:27:00 CDT 2024 [debug] FH: weight value [1] is commited.
Apr 23 00:27:01 CDT 2024 [warn] SC: (script_service_do_query_liveset) reporting "State-File approaching timeout". host[0].time_since_last_update_on_sf=61256.
Apr 23 00:27:01 CDT 2024 [warn] SC: (script_service_do_query_liveset) reporting "State-File approaching timeout". host[1].time_since_last_update_on_sf=61188.
Apr 23 00:27:01 CDT 2024 [warn] SC: (script_service_do_query_liveset) reporting "State-File approaching timeout". host[2].time_since_last_update_on_sf=61044.
Apr 23 00:27:01 CDT 2024 [warn] SC: (script_service_do_query_liveset) reporting "State-File approaching timeout". host[3].time_since_last_update_on_sf=60825.
Apr 23 00:27:01 CDT 2024 [warn] SC: (script_service_do_query_liveset) reporting "State-File approaching timeout". host[4].time_since_last_update_on_sf=60609.
Apr 23 00:27:01 CDT 2024 [warn] SC: (script_service_do_query_liveset) reporting "State-File approaching timeout". host[5].time_since_last_update_on_sf=60413.
Apr 23 00:27:02 CDT 2024 [debug] WD: now - settime = 58829 timeout = 75000.
Apr 23 00:27:02 CDT 2024 [info] WD: watchdog has not been updated at least for half timeout id=1 label=statefile.
Apr 23 00:27:03 CDT 2024 [debug] FH: waiting for consistent view...
Apr 23 00:27:41 CDT 2024 [info] SC: (script_service_do_query_liveset) "State-file approaching timeout" turned FALSE
Apr 23 00:28:03 CDT 2024 [warn] Host (1) and the local host do not agre on the view to the pool membership.
Apr 23 00:28:03 CDT 2024 [warn] local HB domain = (111111)
Apr 23 00:28:03 CDT 2024 [warn] local SF domain = (000000)
Apr 23 00:28:03 CDT 2024 [warn] remote HB domain = (111111)
Apr 23 00:28:03 CDT 2024 [warn] remote SF domain = (011111)
Apr 23 00:28:03 CDT 2024 [warn] remote HB domain on SF = (111111)
Apr 23 00:28:03 CDT 2024 [warn] remote SF domain on HB = (011111)
Apr 23 00:28:03 CDT 2024 [warn] other HB domain = (111111)
Apr 23 00:28:03 CDT 2024 [warn] other SF domain = (000000)
Apr 23 00:28:03 CDT 2024 [warn] other HB domain on SF = (111111)
Apr 23 00:28:03 CDT 2024 [warn] other SF domain on HB = (000000)
Apr 23 00:28:03 CDT 2024 [warn] other HB domain = (111111)
Apr 23 00:28:03 CDT 2024 [warn] other SF domain = (011111)
Apr 23 00:28:03 CDT 2024 [warn] other HB domain on SF = (111111)
Apr 23 00:28:03 CDT 2024 [warn] other SF domain on HB = (011111)
Apr 23 00:28:03 CDT 2024 [warn] other HB domain = (111111)
Apr 23 00:28:03 CDT 2024 [warn] other SF domain = (000000)
Apr 23 00:28:03 CDT 2024 [warn] other HB domain on SF = (111111)
Apr 23 00:28:03 CDT 2024 [warn] other SF domain on HB = (000000)
Apr 23 00:28:03 CDT 2024 [warn] other HB domain = (111111)
Apr 23 00:28:03 CDT 2024 [warn] other SF domain = (000000)
Apr 23 00:28:03 CDT 2024 [warn] other HB domain on SF = (111111)
Apr 23 00:28:03 CDT 2024 [warn] other SF domain on HB = (000000)
Apr 23 00:28:03 CDT 2024 [warn] other HB domain = (111111)
Apr 23 00:28:03 CDT 2024 [warn] other SF domain = (000000)
Apr 23 00:28:03 CDT 2024 [warn] other HB domain on SF = (111111)
Apr 23 00:28:03 CDT 2024 [warn] other SF domain on HB = (000000)
Apr 23 00:28:03 CDT 2024 [warn] other HB domain = (111111)
Apr 23 00:28:03 CDT 2024 [warn] other SF domain = (000000)
Apr 23 00:28:03 CDT 2024 [warn] other HB domain on SF = (111111)
Apr 23 00:28:03 CDT 2024 [warn] other SF domain on HB = (000000)
Apr 23 00:28:03 CDT 2024 [warn] weight[0] = 1
Apr 23 00:28:03 CDT 2024 [warn] weight[1] = 1
Apr 23 00:28:03 CDT 2024 [warn] weight[2] = 1
Apr 23 00:28:03 CDT 2024 [warn] weight[3] = 1
Apr 23 00:28:03 CDT 2024 [warn] weight[4] = 1
Apr 23 00:28:03 CDT 2024 [warn] weight[5] = 1
Apr 23 00:28:03 CDT 2024 [warn] after merger:
Apr 23 00:28:03 CDT 2024 [warn] other HB domain = (111111)
Apr 23 00:28:03 CDT 2024 [warn] other SF domain = (000000)
Apr 23 00:28:03 CDT 2024 [warn] other HB domain on SF = (000000)
Apr 23 00:28:03 CDT 2024 [warn] other SF domain on HB = (000000)
Apr 23 00:28:03 CDT 2024 [warn] other HB domain = (111111)
Apr 23 00:28:03 CDT 2024 [warn] other SF domain = (011111)
Apr 23 00:28:03 CDT 2024 [warn] other HB domain on SF = (000000)
Apr 23 00:28:03 CDT 2024 [warn] other SF domain on HB = (011111)
Apr 23 00:28:03 CDT 2024 [warn] other HB domain = (111111)
Apr 23 00:28:03 CDT 2024 [warn] other SF domain = (000000)
Apr 23 00:28:03 CDT 2024 [warn] other HB domain on SF = (000000)
Apr 23 00:28:03 CDT 2024 [warn] other SF domain on HB = (000000)
Apr 23 00:28:03 CDT 2024 [warn] other HB domain = (111111)
Apr 23 00:28:03 CDT 2024 [warn] other SF domain = (000000)
Apr 23 00:28:03 CDT 2024 [warn] other HB domain on SF = (000000)
Apr 23 00:28:03 CDT 2024 [warn] other SF domain on HB = (000000)
Apr 23 00:28:03 CDT 2024 [warn] other HB domain = (111111)
Apr 23 00:28:03 CDT 2024 [warn] other SF domain = (000000)
Apr 23 00:28:03 CDT 2024 [warn] other HB domain on SF = (000000)
Apr 23 00:28:03 CDT 2024 [warn] other SF domain on HB = (000000)
Apr 23 00:28:03 CDT 2024 [warn] other HB domain = (111111)
Apr 23 00:28:03 CDT 2024 [warn] other SF domain = (000000)
Apr 23 00:28:03 CDT 2024 [warn] other HB domain on SF = (000000)
Apr 23 00:28:03 CDT 2024 [warn] other SF domain on HB = (000000)
Apr 23 00:28:03 CDT 2024 [warn] weight[0] = 1
Apr 23 00:28:03 CDT 2024 [warn] weight[1] = 1
Apr 23 00:28:03 CDT 2024 [warn] weight[2] = 1
Apr 23 00:28:03 CDT 2024 [warn] weight[3] = 1
Apr 23 00:28:03 CDT 2024 [warn] weight[4] = 1
Apr 23 00:28:03 CDT 2024 [warn] weight[5] = 1
Apr 23 00:28:03 CDT 2024 [warn] merged HB domain = (000000)
Apr 23 00:28:03 CDT 2024 [warn] merged SF domain = (000000)
Apr 23 00:28:03 CDT 2024 [debug] FH: All hosts now have consistent view to the pool membership.
Apr 23 00:28:06 CDT 2024 [warn] SM: partition_size[0] = 0 0
Apr 23 00:28:06 CDT 2024 [warn] SM: partition_size[1] = 0 0
Apr 23 00:28:06 CDT 2024 [warn] SM: partition_size[2] = 0 0
Apr 23 00:28:06 CDT 2024 [warn] SM: partition_size[3] = 0 0
Apr 23 00:28:06 CDT 2024 [warn] SM: partition_size[4] = 0 0
Apr 23 00:28:06 CDT 2024 [warn] SM: partition_size[5] = 0 0
Apr 23 00:28:06 CDT 2024 [warn] SM: winner_index = -1
Apr 23 00:28:06 CDT 2024 [debug] FH: I have lost.
Apr 23 00:28:06 CDT 2024 [err] Survival rule failed (1905) FH: Survival Rule is not met for the local host. - Self-Fence.
Apr 23 00:28:06 CDT 2024 [info] watchdog_selffence.
-
Hi,
I would disable HA for now and keep an eye on shared storage connectivity for a bit (e.g. a simple
dmesg
in the Dom0 can display very interesting info). This will reduce the potential disruption until you find the root cause. I'm not aware of any recent change in XHA (pinging @ronan-a just in case)
-
@jasonmap No recent changes regarding HA in our repositories, so it could be a connection issue. As Olivier said, you can get more info using dmesg, or even daemon.log/SMlog (it will depend on the shared SR used).
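For example, something along these lines in Dom0 is a rough starting point (the exact patterns, and which log matters most, will depend on your SR type):
dmesg -T | grep -Ei 'i/o error|timed out|abort|reset|link down'   # storage/iSCSI/FC errors with readable timestamps
grep -Ei 'error|timeout' /var/log/SMlog                           # XAPI storage manager log
grep 'Apr 23 00:2' /var/log/kern.log                              # Dom0 kernel log around the time of the fence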
-
Right, so yeah, I did just that - disabled HA and have been keeping an eye on the logs as well as general performance in the environment. Glad to know my actions align with your recommendations, at least.
There are a couple of hosts that get a lot of these sorts of messages in the kern.log:
Apr 25 08:15:33 oryx kernel: [276757.645457] vif vif-21-3 vif21.3: Guest Rx ready
Apr 25 08:15:54 oryx kernel: [276778.735509] vif vif-22-3 vif22.3: Guest Rx stalled
Apr 25 08:15:54 oryx kernel: [276778.735522] vif vif-21-3 vif21.3: Guest Rx stalled
Apr 25 08:16:04 oryx kernel: [276788.780828] vif vif-21-3 vif21.3: Guest Rx ready
Apr 25 08:16:04 oryx kernel: [276788.780836] vif vif-22-3 vif22.3: Guest Rx ready
Am I wrong to attribute this to issues within specific VMs (i.e. not a hypervisor performance issue)?
I know one of the VMs that causes these is a very old CentOS 5 testing VM that one of my devs uses, and the messages stop when it's powered down.
Is there any way to easily associate those vifs with the actual VMs they are attached to? My google-fu failed me on that.
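The best I've come up with is guessing that vifX.Y maps to Xen domain ID X, device Y, so something like this might do it for vif21.3 (unverified, so treat it as a guess):
xe vm-list dom-id=21 params=name-label   # look up the VM by its current domain ID
xl list                                  # or just list all running domains with their IDs and names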
Other than that, I noticed the NIC firmware is a bit old on the X710-DA2s I use, so I'm going through and upgrading those, with no noticeable change so far. I'm fairly hesitant to re-enable HA without tracking down the root cause.
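When I do re-enable it, I'm tempted to set a longer HA timeout while I keep investigating - if I'm reading the docs right, that would be something like the following (the SR UUID placeholder being my heartbeat SR):
xe pool-ha-enable heartbeat-sr-uuids=<heartbeat-SR-UUID> ha-config:timeout=120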