AIO-DX+ worker: After cable pull test/split brain test, controller-1 remains in a failed state
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
StarlingX | Invalid | Medium | Eric MacDonald | | |
Bug Description
Brief Description
-----------------
Cable pull test(OAM +cluster+
system host-if-list controller-1
+------
| uuid | name | class | type | vlan id | ports | uses i/f | used by i/f | attributes |
+------
| 221e37b9-
| 24cd614e-
| 29779448-
| | | | | | | | u'cluster0'] | |
| | | | | | | | | |
| 5ff56ca6-
| 62749374-
+------
fm alarm-list
+------
| Alarm ID | Reason Text | Entity ID | Severity | Time Stamp |
+------
| 100.114 | NTP configuration does not contain any valid or reachable NTP servers. | host=controller
| | | | | 55:22.213177 |
| | | | | |
| 100.114 | NTP address 2607:4100:2:ff::2 is not a valid or a reachable NTP server. | host=controller
| | | 2607:4100:2:ff::2 | | 55:22.208122 |
| | | | | |
| 100.114 | NTP address 64:ff9b::9fcb:848 is not a valid or a reachable NTP server. | host=controller
| | | 64:ff9b::9fcb:848 | | 55:22.203294 |
| | | | | |
| 100.114 | NTP address 64:ff9b::9538:7910 is not a valid or a reachable NTP server. | host=controller
| | | 64:ff9b::9538:7910 | | 55:22.201814 |
| | | | | |
| 100.114 | NTP address 64:ff9b::d173:b566 is not a valid or a reachable NTP server. | host=controller
| | | 64:ff9b::d173:b566 | | 43:42.434742 |
| | | | | |
| 200.010 | compute-1 access to board management module has failed. | host=compute-1 | warning | 2019-09-19T17: |
| | | | | 28:37.345704 |
| | | | | |
| 200.010 | compute-0 access to board management module has failed. | host=compute-0 | warning | 2019-09-19T17: |
| | | | | 28:37.340704 |
| | | | | |
| 200.010 | compute-2 access to board management module has failed. | host=compute-2 | warning | 2019-09-19T17: |
| | | | | 28:37.285653 |
| | | | | |
| 200.010 | controller-0 access to board management module has failed. | host=controller-0 | warning | 2019-09-19T17: |
| | | | | 28:37.180633 |
| | | | | |
| 200.010 | controller-1 access to board management module has failed. | host=controller-1 | warning | 2019-09-19T17: |
| | | | | 28:36.274734 |
| | | | | |
| 200.009 | controller-1 experienced a persistent critical 'Cluster-host Network' communication | host=controller-1. | critical | 2019-09-19T17: |
| | failure. | network=
| | | | | |
| 400.002 | Service group directory-services loss of redundancy; expected 2 active members but | service_domain= | major | 2019-09-19T17: |
| | only 1 active member available | controller. | | 26:37.078781 |
| | | service_group= | | |
| | | directory-services | | |
| | | | | |
| 400.002 | Service group web-services loss of redundancy; expected 2 active members but only 1 | service_domain= | major | 2019-09-19T17: |
| | active member available | controller. | | 26:37.070086 |
| | | service_group=web- | | |
| | | services | | |
| | | | | |
| 400.002 | Service group storage-services loss of redundancy; expected 2 active members but | service_domain= | major | 2019-09-19T17: |
| | only 1 active member available | controller. | | 26:37.064781 |
| | | service_
| | | services | | |
| | | | | |
| 100.114 | NTP address 2607:4100:2:ff::2 is not a valid or a reachable NTP server. | host=controller
| | | 2607:4100:2:ff::2 | | 03:43.611747 |
| | | | | |
+------
ping6 controller-1
PING controller-
64 bytes from controller-1 (face::4): icmp_seq=1 ttl=64 time=0.086 ms
64 bytes from controller-1 (face::4): icmp_seq=2 ttl=64 time=0.107 ms
64 bytes from controller-1 (face::4): icmp_seq=3 ttl=64 time=0.092 ms
64 bytes from controller-1 (face::4): icmp_seq=4 ttl=64 time=0.102 ms
64 bytes from controller-1 (face::4): icmp_seq=5 ttl=64 time=0.077 ms
64 bytes from controller-1 (face::4): icmp_seq=6 ttl=64 time=0.066 ms
64 bytes from controller-1 (face::4): icmp_seq=7 ttl=64 time=0.128 ms
system host-list
+----+-
| id | hostname | personality | administrative | operational | availability |
+----+-
| 1 | controller-0 | controller | unlocked | enabled | available |
| 2 | compute-2 | worker | unlocked | enabled | available |
| 3 | controller-1 | controller | unlocked | disabled | failed |
| 4 | compute-0 | worker | unlocked | enabled | available |
| 5 | compute-1 | worker | unlocked | enabled | available |
+----+-
[sysadmin@
[sysadmin@
[sysadmin@
ping: controller-1: Name or service not known
[sysadmin@
PING controller-
64 bytes from controller-
64 bytes from controller-
64 bytes from controller-
64 bytes from controller-
64 bytes from controller-
64 bytes from controller-
--- controller-
6 packets transmitted, 6 received, 0% packet loss, time 4999ms
rtt min/avg/max/mdev = 0.073/0.
[sysadmin@
PING controller-
64 bytes from controller-1 (face::4): icmp_seq=1 ttl=64 time=0.100 ms
64 bytes from controller-1 (face::4): icmp_seq=2 ttl=64 time=0.138 ms
64 bytes from controller-1 (face::4): icmp_seq=3 ttl=64 time=0.128 ms
64 bytes from controller-1 (face::4): icmp_seq=4 ttl=64 time=0.097 ms
64 bytes from controller-1 (face::4): icmp_seq=5 ttl=64 time=0.104 ms
Severity
--------
Major
Steps to Reproduce
------------------
1. Verify the health of the system and check for any active alarms.
2. Pull the OAM, cluster, and management (MGT) cables together. Interface enp24s0f0 was provisioned with both the cluster and management networks.
3. Wait 30 seconds, then reconnect the cables.
4. Verify the swact and the alarms. As described above, controller-1 went into a failed state and rebooted, and the cluster network failure alarm never cleared. Controller-1 remained in the failed state indefinitely.
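Steps 1 and 4 could be checked automatically. The following is a minimal sketch, not part of the original test procedure: it assumes the text output of `system host-list` has been captured into a string, and the helper names `parse_host_list` and `unhealthy_hosts` are hypothetical. The sample is the host table from this report, with its truncated border lines left as they appear.

```python
def parse_host_list(text):
    """Parse the ASCII table printed by `system host-list` into a list of dicts.

    Assumption: data rows start with '|' and the first such row is the header.
    Border lines, shell prompts and blank lines are ignored.
    """
    rows = []
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip borders, prompts and blank lines
        cells = [c.strip() for c in line.strip("|").split("|")]
        if header is None:
            header = cells
            continue
        if len(cells) == len(header):
            rows.append(dict(zip(header, cells)))
    return rows


def unhealthy_hosts(rows):
    """Return hostnames that are not unlocked / enabled / available."""
    healthy = ("unlocked", "enabled", "available")
    return [
        r["hostname"]
        for r in rows
        if (r["administrative"], r["operational"], r["availability"]) != healthy
    ]


# Host table copied from this report (borders are truncated in the report).
SAMPLE = """\
+----+-
| id | hostname | personality | administrative | operational | availability |
+----+-
| 1 | controller-0 | controller | unlocked | enabled | available |
| 2 | compute-2 | worker | unlocked | enabled | available |
| 3 | controller-1 | controller | unlocked | disabled | failed |
| 4 | compute-0 | worker | unlocked | enabled | available |
| 5 | compute-1 | worker | unlocked | enabled | available |
+----+-
"""

if __name__ == "__main__":
    print(unhealthy_hosts(parse_host_list(SAMPLE)))
```

Run against the table above, the check flags only controller-1, which matches the failure described in step 4. An empty result before the cable pull would correspond to the healthy baseline in step 1.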
System Configuration
-------
AIO-DX + worker nodes with IPv6 configuration wolfpass-08-12
Expected Behavior
------------------
The failure recovers after the cables are reconnected.
Actual Behavior
----------------
As described above, controller-1 never recovered.
Reproducibility
---------------
Tested only once; the lab did not recover and a reinstall was required.
Load
----
Build was on 2019-09-18_20-00-00
Last Pass
---------
Timestamp/Logs
--------------
2019-09-
Test Activity
-------------
Regression test
tags: | added: stx.retestneeded |
Changed in starlingx: | |
assignee: | Bin Qian (bqian20) → Eric MacDonald (rocksolidmtce) |
tags: | added: stx.metal removed: stx.ha |
Marking as stx.3.0 / medium priority - standby controller does not recover after split brain/cable pull test