AIO-DX+ worker: After cable pull test/split brain test, controller-1 remains in a failed state

Bug #1844717 reported by Anujeyan Manokeran
This bug affects 1 person

Affects: StarlingX
Status: Invalid
Importance: Medium
Assigned to: Eric MacDonald
Milestone: —

Bug Description

Brief Description
-----------------
A cable pull test (OAM + cluster + management) was performed on an AIO-DX + worker configuration. The cluster and management networks are provisioned on the same interface, enp24s0f0 (as VLANs over pxeboot0). The active controller, controller-1, rebooted and controller-0 became active. Controller-1 was then rebooted multiple times with the cluster network failure alarm raised. However, the cluster network on controller-1 was pingable from controller-0, so there was no communication issue, yet the alarm was never cleared and controller-1 remained in a failed state.
system host-if-list controller-1
+--------------------------------------+----------+----------+----------+---------+-----------------+---------------+-----------------------+---------------------------+
| uuid | name | class | type | vlan id | ports | uses i/f | used by i/f | attributes |
+--------------------------------------+----------+----------+----------+---------+-----------------+---------------+-----------------------+---------------------------+
| 221e37b9-0cb7-41bc-8e96-dea5f9716ad9 | mgmt0 | platform | vlan | 186 | [] | [u'pxeboot0'] | [] | MTU=1500 |
| 24cd614e-bc22-46f1-802f-66a8be6d9b19 | cluster0 | platform | vlan | 187 | [] | [u'pxeboot0'] | [] | MTU=1500 |
| 29779448-8151-45b3-b229-9d78bfc35159 | pxeboot0 | platform | ethernet | None | [u'enp24s0f0'] | [] | [u'mgmt0', | MTU=9216 |
| | | | | | | | u'cluster0'] | |
| | | | | | | | | |
| 5ff56ca6-e675-492b-9704-4521afbad331 | data0 | data | ethernet | None | [u'enp175s0f0'] | [] | [] | MTU=1500,accelerated=True |
| 62749374-4856-4a94-8e78-af35834b6561 | oam0 | platform | ethernet | None | [u'eno1'] | [] | [] | MTU=1500 |
+--------------------------------------+----------+----------+----------+---------+-----------------+---------------+-----------------------+---------------------------+
  fm alarm-list
+----------+-------------------------------------------------------------------------------------+------------------------+----------+----------------+
| Alarm ID | Reason Text | Entity ID | Severity | Time Stamp |
+----------+-------------------------------------------------------------------------------------+------------------------+----------+----------------+
| 100.114 | NTP configuration does not contain any valid or reachable NTP servers. | host=controller-1.ntp | major | 2019-09-19T17: |
| | | | | 55:22.213177 |
| | | | | |
| 100.114 | NTP address 2607:4100:2:ff::2 is not a valid or a reachable NTP server. | host=controller-1.ntp= | minor | 2019-09-19T17: |
| | | 2607:4100:2:ff::2 | | 55:22.208122 |
| | | | | |
| 100.114 | NTP address 64:ff9b::9fcb:848 is not a valid or a reachable NTP server. | host=controller-1.ntp= | minor | 2019-09-19T17: |
| | | 64:ff9b::9fcb:848 | | 55:22.203294 |
| | | | | |
| 100.114 | NTP address 64:ff9b::9538:7910 is not a valid or a reachable NTP server. | host=controller-1.ntp= | minor | 2019-09-19T17: |
| | | 64:ff9b::9538:7910 | | 55:22.201814 |
| | | | | |
| 100.114 | NTP address 64:ff9b::d173:b566 is not a valid or a reachable NTP server. | host=controller-0.ntp= | minor | 2019-09-19T17: |
| | | 64:ff9b::d173:b566 | | 43:42.434742 |
| | | | | |
| 200.010 | compute-1 access to board management module has failed. | host=compute-1 | warning | 2019-09-19T17: |
| | | | | 28:37.345704 |
| | | | | |
| 200.010 | compute-0 access to board management module has failed. | host=compute-0 | warning | 2019-09-19T17: |
| | | | | 28:37.340704 |
| | | | | |
| 200.010 | compute-2 access to board management module has failed. | host=compute-2 | warning | 2019-09-19T17: |
| | | | | 28:37.285653 |
| | | | | |
| 200.010 | controller-0 access to board management module has failed. | host=controller-0 | warning | 2019-09-19T17: |
| | | | | 28:37.180633 |
| | | | | |
| 200.010 | controller-1 access to board management module has failed. | host=controller-1 | warning | 2019-09-19T17: |
| | | | | 28:36.274734 |
| | | | | |
| 200.009 | controller-1 experienced a persistent critical 'Cluster-host Network' communication | host=controller-1. | critical | 2019-09-19T17: |
| | failure. | network=Cluster-host | | 28:33.043678 |
| | | | | |
| 400.002 | Service group directory-services loss of redundancy; expected 2 active members but | service_domain= | major | 2019-09-19T17: |
| | only 1 active member available | controller. | | 26:37.078781 |
| | | service_group= | | |
| | | directory-services | | |
| | | | | |
| 400.002 | Service group web-services loss of redundancy; expected 2 active members but only 1 | service_domain= | major | 2019-09-19T17: |
| | active member available | controller. | | 26:37.070086 |
| | | service_group=web- | | |
| | | services | | |
| | | | | |
| 400.002 | Service group storage-services loss of redundancy; expected 2 active members but | service_domain= | major | 2019-09-19T17: |
| | only 1 active member available | controller. | | 26:37.064781 |
| | | service_group=storage- | | |
| | | services | | |
| | | | | |
| 100.114 | NTP address 2607:4100:2:ff::2 is not a valid or a reachable NTP server. | host=controller-0.ntp= | minor | 2019-09-19T02: |
| | | 2607:4100:2:ff::2 | | 03:43.611747 |
| | | | | |
+----------+-------------------------------------------------------------------------------------+------------------------+----------+----------------+

ping6 controller-1
PING controller-1(controller-1 (face::4)) 56 data bytes
64 bytes from controller-1 (face::4): icmp_seq=1 ttl=64 time=0.086 ms
64 bytes from controller-1 (face::4): icmp_seq=2 ttl=64 time=0.107 ms
64 bytes from controller-1 (face::4): icmp_seq=3 ttl=64 time=0.092 ms
64 bytes from controller-1 (face::4): icmp_seq=4 ttl=64 time=0.102 ms
64 bytes from controller-1 (face::4): icmp_seq=5 ttl=64 time=0.077 ms
64 bytes from controller-1 (face::4): icmp_seq=6 ttl=64 time=0.066 ms
64 bytes from controller-1 (face::4): icmp_seq=7 ttl=64 time=0.128 ms

system host-list
+----+--------------+-------------+----------------+-------------+--------------+
| id | hostname | personality | administrative | operational | availability |
+----+--------------+-------------+----------------+-------------+--------------+
| 1 | controller-0 | controller | unlocked | enabled | available |
| 2 | compute-2 | worker | unlocked | enabled | available |
| 3 | controller-1 | controller | unlocked | disabled | failed |
| 4 | compute-0 | worker | unlocked | enabled | available |
| 5 | compute-1 | worker | unlocked | enabled | available |
+----+--------------+-------------+----------------+-------------+--------------+
[sysadmin@controller-0 ~(keystone_admin)]$
[sysadmin@controller-0 ~(keystone_admin)]$
[sysadmin@controller-0 ~(keystone_admin)]$ ping controller-1
ping: controller-1: Name or service not known
[sysadmin@controller-0 ~(keystone_admin)]$ ping6 controller-1-cluster-host
PING controller-1-cluster-host(controller-1-cluster-host (feed:beef::4)) 56 data bytes
64 bytes from controller-1-cluster-host (feed:beef::4): icmp_seq=1 ttl=64 time=0.099 ms
64 bytes from controller-1-cluster-host (feed:beef::4): icmp_seq=2 ttl=64 time=0.164 ms
64 bytes from controller-1-cluster-host (feed:beef::4): icmp_seq=3 ttl=64 time=0.165 ms
64 bytes from controller-1-cluster-host (feed:beef::4): icmp_seq=4 ttl=64 time=0.178 ms
64 bytes from controller-1-cluster-host (feed:beef::4): icmp_seq=5 ttl=64 time=0.073 ms
64 bytes from controller-1-cluster-host (feed:beef::4): icmp_seq=6 ttl=64 time=0.098 ms

--- controller-1-cluster-host ping statistics ---
6 packets transmitted, 6 received, 0% packet loss, time 4999ms
rtt min/avg/max/mdev = 0.073/0.129/0.178/0.042 ms
[sysadmin@controller-0 ~(keystone_admin)]$ ping6 controller-1
PING controller-1(controller-1 (face::4)) 56 data bytes
64 bytes from controller-1 (face::4): icmp_seq=1 ttl=64 time=0.100 ms
64 bytes from controller-1 (face::4): icmp_seq=2 ttl=64 time=0.138 ms
64 bytes from controller-1 (face::4): icmp_seq=3 ttl=64 time=0.128 ms
64 bytes from controller-1 (face::4): icmp_seq=4 ttl=64 time=0.097 ms
64 bytes from controller-1 (face::4): icmp_seq=5 ttl=64 time=0.104 ms

Severity
--------
Major

Steps to Reproduce
------------------
1. Verify the health of the system and check for any alarms.
2. Pull the OAM, cluster, and management interface cables together. Interface enp24s0f0 is provisioned with both the cluster and management networks.
3. Wait 30 seconds and reinsert the cables.
4. Verify the swact and alarms (see the sketch below). As described above, controller-1 went to a failed state and rebooted, the cluster network failure alarm was never cleared, and controller-1 remained failed indefinitely.
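A minimal sketch of the checks behind steps 1 and 4, using the same CLI commands captured in this report; the hostnames and interface names are specific to this lab, and the openrc path is an assumption about a typical StarlingX controller:

  # Step 1: baseline health check from a keystone_admin session on the active controller
  source /etc/platform/openrc   # assumed credentials file; the prompts above show (keystone_admin)
  fm alarm-list
  system host-list

  # Step 4: after reinserting the cables, confirm the cluster-host network is reachable
  # again and watch whether the 200.009 alarm clears and controller-1 leaves "failed"
  ping6 -c 5 controller-1-cluster-host
  fm alarm-list | grep 200.009
  system host-list
  system host-if-list controller-1
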
System Configuration
--------------------
AIO-DX + worker nodes with IPv6 configuration, lab wolfpass-08-12

Expected Behavior
------------------
The failure recovers and controller-1 returns to service after the cables are reinserted.

Actual Behavior
----------------
As described above, controller-1 never recovered.
Reproducibility
---------------
Only able to test once; the lab did not recover and a reinstall was required.
Load
----
Build was on 2019-09-18_20-00-00
Last Pass
---------
Timestamp/Logs
--------------
2019-09-19T17:28:33.
Test Activity
-------------
Regression test

Revision history for this message
Anujeyan Manokeran (anujeyan) wrote :
description: updated
Revision history for this message
Ghada Khalil (gkhalil) wrote :

Marking as stx.3.0 / medium priority - standby controller does not recover after split brain/cable pull test

summary: - AIO-DX+ worker:After cable pull test cluster network failure alarm was
- never cleared when cluster network is back and ping able
+ AIO-DX+ worker: After cable pull test/split brain test, controller-1
+ remains in a failed state
tags: added: stx.3.0 stx.ha
Changed in starlingx:
importance: Undecided → Medium
status: New → Triaged
assignee: nobody → Bin Qian (bqian20)
Revision history for this message
Bin Qian (bqian20) wrote :
Download full text (14.7 KiB)

This looks more like an expected scenario, arising from the blind-guess survivor selection protocol.
Despite some noise from a few service failures (presumably related to the pulled cables), SM on the active controller (controller-0) issued a reboot request to mtcAgent at 2019-09-19T17:27:19.286, and controller-1 then rebooted as expected.
At 2019-09-19T17:34:13, after controller-1 booted up, mtcAgent decided to perform a graceful recovery, so it sent a second reboot request to mtcClient on controller-1 and an unlocked-disabled request to SM.
After the second reboot, controller-1 went standby successfully at 2019-09-19T17:39:42.
Significant logs are listed below.
@Eric MacDonald, please confirm the second reboot is expected as part of graceful recovery.

| 2019-09-19T17:26:08.389 | 333 | interface-scn | oam-interface | enabled | disabled | eno1 disabled
| 2019-09-19T17:26:10.424 | 334 | interface-scn | cluster-host-interface | enabled | disabled | enp24s0f0.187 disabled
| 2019-09-19T17:26:10.425 | 335 | interface-scn | management-interface | enabled | disabled | enp24s0f0.186 disabled
| 2019-09-19T17:27:09.749 | 336 | service-scn | mgr-restful-plugin | enabled-active | disabling | audit failed

| 2019-09-19T17:27:19.310 | 377 | service-scn | etcd | enabled-active | disabled | process (pid=943968) failed
| 2019-09-19T17:27:19.393 | 378 | service-scn | registry-token-server | enabled-active | disabled | process (pid=943102) failed
| 2019-09-19T17:27:19.393 | 379 | service-scn | docker-distribution | enabled-active | disabled | process (pid=944125) failed
| 2019-09-19T17:27:20.557 | 391 | service-scn | docker-distribution | enabling | disabling | enable failed
| 2019-09-19T17:27:20.956 | 394 | service-group-scn | controller-services | active | active-failed | docker-distribution(disabled, failed)

2019-09-19T14:58:34.000 controller-0 sm: debug time[47104.217] log<3118> INFO: sm[107746]: sm_failover_ss.c(520): Host from active to active, Peer from standby to standby.
2019-09-19T14:58:34.000 controller-0 sm: debug time[47104.217] log<3119> INFO: sm[107746]: sm_failover_fsm.cpp(68): Exiting survived state
2019-09-19T14:58:34.000 controller-0 sm: debug time[47104.217] log<3120> INFO: sm[107746]: sm_failover_fsm.cpp(62): Entering normal state
2019-09-19T14:58:34.000 controller-0 sm: debug time[47104.249] log<3121> INFO: sm[107746]: sm_failover_fsm.cpp(107): send_event interface-state-changed
2019-09-19T17:26:09.000 controller-0 sm: debug time[55957.679] log<3709> INFO: sm[107746]: sm_failover_fsm.cpp(107): send_event interface-state...
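
A small sketch of how these SM transitions could be pulled from the logs; the file paths are assumptions about a standard StarlingX controller (or the matching files inside a collect bundle), not quoted from this report:

  # Pipe-delimited state-change records (interface-scn / service-scn / service-group-scn)
  grep -E "interface-scn|service-scn|service-group-scn" /var/log/sm-customer.log | grep "2019-09-19T17:2"

  # Failover FSM decisions around the cable pull and the reboot request to mtcAgent
  grep "sm_failover" /var/log/sm.log | grep "2019-09-19T17:2"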

Changed in starlingx:
assignee: Bin Qian (bqian20) → nobody
assignee: nobody → Bin Qian (bqian20)
Revision history for this message
Eric MacDonald (rocksolidmtce) wrote :
Download full text (8.9 KiB)

Looking at the logs, I can see that the second reboot occurred as a result of the cluster network heartbeat failure immediately following the recovery of controller-1 after the initial reboot.

Here is the log analysis:

Heartbeat Loss due to cable pull

2019-09-19T17:26:37.186 [3163376.00210] controller-0 mtcAgent hbs nodeClass.cpp (4696) manage_heartbeat_failure:Error : controller-1 Mgmnt *** Heartbeat Loss ***
2019-09-19T17:26:37.188 [3163376.00219] controller-0 mtcAgent hbs nodeClass.cpp (4696) manage_heartbeat_failure:Error : controller-1 Clstr *** Heartbeat Loss ***
2019-09-19T17:26:42.189 [3163376.00304] controller-0 mtcAgent hdl mtcNodeHdlrs.cpp (1846) recovery_handler : Warn : controller-1 Loss Of Communication for 5 seconds ; disabling host
2019-09-19T17:26:42.189 [3163376.00305] controller-0 mtcAgent hdl mtcNodeHdlrs.cpp (1847) recovery_handler : Warn : controller-1 ... stopping host services
2019-09-19T17:26:42.189 [3163376.00306] controller-0 mtcAgent hdl mtcNodeHdlrs.cpp (1848) recovery_handler : Warn : controller-1 ... continuing with graceful recovery
2019-09-19T17:26:42.189 [3163376.00310] controller-0 mtcAgent hbs nodeClass.cpp (1672) alarm_enabled_failure :Error : controller-1 critical enable failure
2019-09-19T17:26:42.205 [3163376.00318] controller-0 mtcAgent |-| mtcNodeHdlrs.cpp (1954) recovery_handler : Info : controller-1 Graceful Recovery Wait (1200 secs) (uptime was 0)

Cable reinserted, resulting in observed mtcAlive.

2019-09-19T17:27:19.267 [3163376.00338] controller-0 mtcAgent hdl mtcNodeHdlrs.cpp (2016) recovery_handler : Info : controller-1 regained MTCALIVE from host that did not reboot (uptime:9231)
2019-09-19T17:27:19.267 [3163376.00339] controller-0 mtcAgent hdl mtcNodeHdlrs.cpp (2017) recovery_handler : Info : controller-1 ... uptimes before:0 after:9231
2019-09-19T17:27:19.267 [3163376.00340] controller-0 mtcAgent hdl mtcNodeHdlrs.cpp (2018) recovery_handler : Info : controller-1 ... exiting graceful recovery
2019-09-19T17:27:19.267 [3163376.00341] controller-0 mtcAgent hdl mtcNodeHdlrs.cpp (2019) recovery_handler : Info : controller-1 ... forcing full enable with reset
2019-09-19T17:27:19.267 [3163376.00342] controller-0 mtcAgent |-| nodeClass.cpp (7325) force_full_enable : Info : controller-1 Forcing Full Enable Sequence
2019-09-19T17:27:19.267 [3163376.00343] controller-0 mtcAgent hbs nodeClass.cpp (5941) allStateChange : Info : controller-1 unlocked-disabled-failed (seq:12)
2019-09-19T17:27:19.277 [3163376.00344] controller-0 mtcAgent hdl mtcNodeHdlrs.cpp ( 560) enable_handler :Error : controller-1 Main Enable FSM (from failed)
2019-09-19T17:27:19.277 [3163376.00345] controller-0 mtcAgent msg mtcCtrlMsg.cpp ( 878) send_hbs_command : Info : controller-1 stop host service sent to controller-0 hbsAgent
2019-09-19T17:27:19.277 [3163376.00346] controller-0 mtcAgent msg mtcCtrlMsg.cpp ( 878) send_hbs_command : Info : controller-1 stop host service sent to controller-1 hbsAgent
2019-09-19T17:27:19.283 [107366.01375] controller-0 hbsAgent hbs nodeClass.cpp (7505) mon_host ...

Read more...
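
A sketch of how the sequence above can be traced in the maintenance logs; /var/log/mtcAgent.log on controller-0 (or its copy in a collect bundle) is an assumed location, not quoted from this report:

  # Heartbeat loss, the graceful recovery window, the regained mtcAlive, and the forced full enable
  grep -E "Heartbeat Loss|Graceful Recovery|MTCALIVE|Forcing Full Enable" /var/log/mtcAgent.log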

Revision history for this message
Eric MacDonald (rocksolidmtce) wrote :

Does this issue happen on systems that don't use VLANs for management and cluster on top of a single interface?

That would help to understand the impact and scope, for prioritization and for deciding who investigates.

Yang Liu (yliu12)
tags: added: stx.retestneeded
Ghada Khalil (gkhalil)
Changed in starlingx:
assignee: Bin Qian (bqian20) → Eric MacDonald (rocksolidmtce)
tags: added: stx.metal
removed: stx.ha
Revision history for this message
Eric MacDonald (rocksolidmtce) wrote :

Waiting on retest result.

Revision history for this message
Ghada Khalil (gkhalil) wrote :

The test team will re-test as per Eric's suggestion. Eric will do the initial investigation when the issue is reproduced.

Revision history for this message
Anujeyan Manokeran (anujeyan) wrote :

It was tested on a system without VLANs, and a different issue was observed: https://bugs.launchpad.net/starlingx/+bug/1847657

Revision history for this message
Eric MacDonald (rocksolidmtce) wrote :

Debug is on hold until patch testing of the surrogate issue https://bugs.launchpad.net/starlingx/+bug/1847657 is complete.

Revision history for this message
Bill Zvonar (billzvonar) wrote :

Jeyan, can you please re-test this LP as well now?

Revision history for this message
Anujeyan Manokeran (anujeyan) wrote :

Verified in load 2019-10-17_20-00-00 on wolfpass 8-12. Could not reproduce this issue.

tags: removed: stx.retestneeded
Revision history for this message
Ghada Khalil (gkhalil) wrote :

Closing since Jeyan confirmed that the issue is not reproducible. Thanks Jeyan!

Changed in starlingx:
status: Triaged → Invalid