2019-09-19 18:58:34 |
Anujeyan Manokeran |
bug |
|
|
added bug |
2019-09-19 19:20:30 |
Anujeyan Manokeran |
description |
Brief Description
-----------------
A cable pull test (OAM + cluster + management) was run on an AIO-DX + worker nodes configuration. The cluster and management networks are provisioned on the same interface, enp24s0f0. The active controller-1 rebooted and controller-0 became active. Controller-1 then rebooted multiple times with the cluster network failure alarm raised, even though the cluster network address of controller-1 was pingable from controller-0. There was therefore no communication issue, but the alarm was not cleared and controller-1 remained in the failed state.
system host-if-list controller-1
+--------------------------------------+----------+----------+----------+---------+-----------------+---------------+-----------------------+---------------------------+
| uuid | name | class | type | vlan id | ports | uses i/f | used by i/f | attributes |
+--------------------------------------+----------+----------+----------+---------+-----------------+---------------+-----------------------+---------------------------+
| 221e37b9-0cb7-41bc-8e96-dea5f9716ad9 | mgmt0 | platform | vlan | 186 | [] | [u'pxeboot0'] | [] | MTU=1500 |
| 24cd614e-bc22-46f1-802f-66a8be6d9b19 | cluster0 | platform | vlan | 187 | [] | [u'pxeboot0'] | [] | MTU=1500 |
| 29779448-8151-45b3-b229-9d78bfc35159 | pxeboot0 | platform | ethernet | None | [u'enp24s0f0'] | [] | [u'mgmt0', | MTU=9216 |
| | | | | | | | u'cluster0'] | |
| | | | | | | | | |
| 5ff56ca6-e675-492b-9704-4521afbad331 | data0 | data | ethernet | None | [u'enp175s0f0'] | [] | [] | MTU=1500,accelerated=True |
| 62749374-4856-4a94-8e78-af35834b6561 | oam0 | platform | ethernet | None | [u'eno1'] | [] | [] | MTU=1500 |
+--------------------------------------+----------+----------+----------+---------+-----------------+---------------+-----------------------+---------------------------+
fm alarm-list
+----------+-------------------------------------------------------------------------------------+------------------------+----------+----------------+
| Alarm ID | Reason Text | Entity ID | Severity | Time Stamp |
+----------+-------------------------------------------------------------------------------------+------------------------+----------+----------------+
| 100.114 | NTP configuration does not contain any valid or reachable NTP servers. | host=controller-1.ntp | major | 2019-09-19T17: |
| | | | | 55:22.213177 |
| | | | | |
| 100.114 | NTP address 2607:4100:2:ff::2 is not a valid or a reachable NTP server. | host=controller-1.ntp= | minor | 2019-09-19T17: |
| | | 2607:4100:2:ff::2 | | 55:22.208122 |
| | | | | |
| 100.114 | NTP address 64:ff9b::9fcb:848 is not a valid or a reachable NTP server. | host=controller-1.ntp= | minor | 2019-09-19T17: |
| | | 64:ff9b::9fcb:848 | | 55:22.203294 |
| | | | | |
| 100.114 | NTP address 64:ff9b::9538:7910 is not a valid or a reachable NTP server. | host=controller-1.ntp= | minor | 2019-09-19T17: |
| | | 64:ff9b::9538:7910 | | 55:22.201814 |
| | | | | |
| 100.114 | NTP address 64:ff9b::d173:b566 is not a valid or a reachable NTP server. | host=controller-0.ntp= | minor | 2019-09-19T17: |
| | | 64:ff9b::d173:b566 | | 43:42.434742 |
| | | | | |
| 200.010 | compute-1 access to board management module has failed. | host=compute-1 | warning | 2019-09-19T17: |
| | | | | 28:37.345704 |
| | | | | |
| 200.010 | compute-0 access to board management module has failed. | host=compute-0 | warning | 2019-09-19T17: |
| | | | | 28:37.340704 |
| | | | | |
| 200.010 | compute-2 access to board management module has failed. | host=compute-2 | warning | 2019-09-19T17: |
| | | | | 28:37.285653 |
| | | | | |
| 200.010 | controller-0 access to board management module has failed. | host=controller-0 | warning | 2019-09-19T17: |
| | | | | 28:37.180633 |
| | | | | |
| 200.010 | controller-1 access to board management module has failed. | host=controller-1 | warning | 2019-09-19T17: |
| | | | | 28:36.274734 |
| | | | | |
| 200.009 | controller-1 experienced a persistent critical 'Cluster-host Network' communication | host=controller-1. | critical | 2019-09-19T17: |
| | failure. | network=Cluster-host | | 28:33.043678 |
| | | | | |
| 400.002 | Service group directory-services loss of redundancy; expected 2 active members but | service_domain= | major | 2019-09-19T17: |
| | only 1 active member available | controller. | | 26:37.078781 |
| | | service_group= | | |
| | | directory-services | | |
| | | | | |
| 400.002 | Service group web-services loss of redundancy; expected 2 active members but only 1 | service_domain= | major | 2019-09-19T17: |
| | active member available | controller. | | 26:37.070086 |
| | | service_group=web- | | |
| | | services | | |
| | | | | |
| 400.002 | Service group storage-services loss of redundancy; expected 2 active members but | service_domain= | major | 2019-09-19T17: |
| | only 1 active member available | controller. | | 26:37.064781 |
| | | service_group=storage- | | |
| | | services | | |
| | | | | |
| 100.114 | NTP address 2607:4100:2:ff::2 is not a valid or a reachable NTP server. | host=controller-0.ntp= | minor | 2019-09-19T02: |
| | | 2607:4100:2:ff::2 | | 03:43.611747 |
| | | | | |
+----------+-------------------------------------------------------------------------------------+------------------------+----------+----------------+
ping6 controller-1
PING controller-1(controller-1 (face::4)) 56 data bytes
64 bytes from controller-1 (face::4): icmp_seq=1 ttl=64 time=0.086 ms
64 bytes from controller-1 (face::4): icmp_seq=2 ttl=64 time=0.107 ms
64 bytes from controller-1 (face::4): icmp_seq=3 ttl=64 time=0.092 ms
64 bytes from controller-1 (face::4): icmp_seq=4 ttl=64 time=0.102 ms
64 bytes from controller-1 (face::4): icmp_seq=5 ttl=64 time=0.077 ms
64 bytes from controller-1 (face::4): icmp_seq=6 ttl=64 time=0.066 ms
64 bytes from controller-1 (face::4): icmp_seq=7 ttl=64 time=0.128 ms
system host-list
+----+--------------+-------------+----------------+-------------+--------------+
| id | hostname | personality | administrative | operational | availability |
+----+--------------+-------------+----------------+-------------+--------------+
| 1 | controller-0 | controller | unlocked | enabled | available |
| 2 | compute-2 | worker | unlocked | enabled | available |
| 3 | controller-1 | controller | unlocked | disabled | failed |
| 4 | compute-0 | worker | unlocked | enabled | available |
| 5 | compute-1 | worker | unlocked | enabled | available |
+----+--------------+-------------+----------------+-------------+--------------+
[sysadmin@controller-0 ~(keystone_admin)]$
[sysadmin@controller-0 ~(keystone_admin)]$
[sysadmin@controller-0 ~(keystone_admin)]$ ping controller-1
ping: controller-1: Name or service not known
[sysadmin@controller-0 ~(keystone_admin)]$ ping6 controller-1-cluster-host
PING controller-1-cluster-host(controller-1-cluster-host (feed:beef::4)) 56 data bytes
64 bytes from controller-1-cluster-host (feed:beef::4): icmp_seq=1 ttl=64 time=0.099 ms
64 bytes from controller-1-cluster-host (feed:beef::4): icmp_seq=2 ttl=64 time=0.164 ms
64 bytes from controller-1-cluster-host (feed:beef::4): icmp_seq=3 ttl=64 time=0.165 ms
64 bytes from controller-1-cluster-host (feed:beef::4): icmp_seq=4 ttl=64 time=0.178 ms
64 bytes from controller-1-cluster-host (feed:beef::4): icmp_seq=5 ttl=64 time=0.073 ms
64 bytes from controller-1-cluster-host (feed:beef::4): icmp_seq=6 ttl=64 time=0.098 ms
--- controller-1-cluster-host ping statistics ---
6 packets transmitted, 6 received, 0% packet loss, time 4999ms
rtt min/avg/max/mdev = 0.073/0.129/0.178/0.042 ms
[sysadmin@controller-0 ~(keystone_admin)]$ ping6 controller-1
PING controller-1(controller-1 (face::4)) 56 data bytes
64 bytes from controller-1 (face::4): icmp_seq=1 ttl=64 time=0.100 ms
64 bytes from controller-1 (face::4): icmp_seq=2 ttl=64 time=0.138 ms
64 bytes from controller-1 (face::4): icmp_seq=3 ttl=64 time=0.128 ms
64 bytes from controller-1 (face::4): icmp_seq=4 ttl=64 time=0.097 ms
64 bytes from controller-1 (face::4): icmp_seq=5 ttl=64 time=0.104 ms
Severity
--------
Major
Steps to Reproduce
------------------
1. Verify the health of the system and check for any alarms.
2. Pull the OAM cable and the cluster/management (MGT) cable together. Interface enp24s0f0 was provisioned with both the cluster and management networks.
3. Wait 30 seconds, then reconnect the cables.
4. Verify the swact and the alarms. As described above, controller-1 went to the failed state and rebooted, and the cluster network failure alarm never cleared. Controller-1 stayed in the failed state indefinitely.
System Configuration
--------------------
Regular system with IPv6 configuration wolfpass-08-12
Expected Behavior
------------------
The failure is recovered after the cable is reconnected.
Actual Behavior
----------------
As described above, controller-1 never recovered.
Reproducibility
---------------
Tested only once. The lab did not recover and a reinstall was required.
Load
----
Build was on 2019-09-18_20-00-00
Last Pass
---------
Timestamp/Logs
--------------
2019-09-19T17:28:33.
Test Activity
-------------
Regression test |
Brief Description
-----------------
A cable pull test (OAM + cluster + management) was run on an AIO-DX + worker nodes configuration. The cluster and management networks are provisioned on the same interface, enp24s0f0. The active controller-1 rebooted and controller-0 became active. Controller-1 then rebooted multiple times with the cluster network failure alarm raised, even though the cluster network address of controller-1 was pingable from controller-0. There was therefore no communication issue, but the alarm was not cleared and controller-1 remained in the failed state.
system host-if-list controller-1
+--------------------------------------+----------+----------+----------+---------+-----------------+---------------+-----------------------+---------------------------+
| uuid | name | class | type | vlan id | ports | uses i/f | used by i/f | attributes |
+--------------------------------------+----------+----------+----------+---------+-----------------+---------------+-----------------------+---------------------------+
| 221e37b9-0cb7-41bc-8e96-dea5f9716ad9 | mgmt0 | platform | vlan | 186 | [] | [u'pxeboot0'] | [] | MTU=1500 |
| 24cd614e-bc22-46f1-802f-66a8be6d9b19 | cluster0 | platform | vlan | 187 | [] | [u'pxeboot0'] | [] | MTU=1500 |
| 29779448-8151-45b3-b229-9d78bfc35159 | pxeboot0 | platform | ethernet | None | [u'enp24s0f0'] | [] | [u'mgmt0', | MTU=9216 |
| | | | | | | | u'cluster0'] | |
| | | | | | | | | |
| 5ff56ca6-e675-492b-9704-4521afbad331 | data0 | data | ethernet | None | [u'enp175s0f0'] | [] | [] | MTU=1500,accelerated=True |
| 62749374-4856-4a94-8e78-af35834b6561 | oam0 | platform | ethernet | None | [u'eno1'] | [] | [] | MTU=1500 |
+--------------------------------------+----------+----------+----------+---------+-----------------+---------------+-----------------------+---------------------------+
fm alarm-list
+----------+-------------------------------------------------------------------------------------+------------------------+----------+----------------+
| Alarm ID | Reason Text | Entity ID | Severity | Time Stamp |
+----------+-------------------------------------------------------------------------------------+------------------------+----------+----------------+
| 100.114 | NTP configuration does not contain any valid or reachable NTP servers. | host=controller-1.ntp | major | 2019-09-19T17: |
| | | | | 55:22.213177 |
| | | | | |
| 100.114 | NTP address 2607:4100:2:ff::2 is not a valid or a reachable NTP server. | host=controller-1.ntp= | minor | 2019-09-19T17: |
| | | 2607:4100:2:ff::2 | | 55:22.208122 |
| | | | | |
| 100.114 | NTP address 64:ff9b::9fcb:848 is not a valid or a reachable NTP server. | host=controller-1.ntp= | minor | 2019-09-19T17: |
| | | 64:ff9b::9fcb:848 | | 55:22.203294 |
| | | | | |
| 100.114 | NTP address 64:ff9b::9538:7910 is not a valid or a reachable NTP server. | host=controller-1.ntp= | minor | 2019-09-19T17: |
| | | 64:ff9b::9538:7910 | | 55:22.201814 |
| | | | | |
| 100.114 | NTP address 64:ff9b::d173:b566 is not a valid or a reachable NTP server. | host=controller-0.ntp= | minor | 2019-09-19T17: |
| | | 64:ff9b::d173:b566 | | 43:42.434742 |
| | | | | |
| 200.010 | compute-1 access to board management module has failed. | host=compute-1 | warning | 2019-09-19T17: |
| | | | | 28:37.345704 |
| | | | | |
| 200.010 | compute-0 access to board management module has failed. | host=compute-0 | warning | 2019-09-19T17: |
| | | | | 28:37.340704 |
| | | | | |
| 200.010 | compute-2 access to board management module has failed. | host=compute-2 | warning | 2019-09-19T17: |
| | | | | 28:37.285653 |
| | | | | |
| 200.010 | controller-0 access to board management module has failed. | host=controller-0 | warning | 2019-09-19T17: |
| | | | | 28:37.180633 |
| | | | | |
| 200.010 | controller-1 access to board management module has failed. | host=controller-1 | warning | 2019-09-19T17: |
| | | | | 28:36.274734 |
| | | | | |
| 200.009 | controller-1 experienced a persistent critical 'Cluster-host Network' communication | host=controller-1. | critical | 2019-09-19T17: |
| | failure. | network=Cluster-host | | 28:33.043678 |
| | | | | |
| 400.002 | Service group directory-services loss of redundancy; expected 2 active members but | service_domain= | major | 2019-09-19T17: |
| | only 1 active member available | controller. | | 26:37.078781 |
| | | service_group= | | |
| | | directory-services | | |
| | | | | |
| 400.002 | Service group web-services loss of redundancy; expected 2 active members but only 1 | service_domain= | major | 2019-09-19T17: |
| | active member available | controller. | | 26:37.070086 |
| | | service_group=web- | | |
| | | services | | |
| | | | | |
| 400.002 | Service group storage-services loss of redundancy; expected 2 active members but | service_domain= | major | 2019-09-19T17: |
| | only 1 active member available | controller. | | 26:37.064781 |
| | | service_group=storage- | | |
| | | services | | |
| | | | | |
| 100.114 | NTP address 2607:4100:2:ff::2 is not a valid or a reachable NTP server. | host=controller-0.ntp= | minor | 2019-09-19T02: |
| | | 2607:4100:2:ff::2 | | 03:43.611747 |
| | | | | |
+----------+-------------------------------------------------------------------------------------+------------------------+----------+----------------+
ping6 controller-1
PING controller-1(controller-1 (face::4)) 56 data bytes
64 bytes from controller-1 (face::4): icmp_seq=1 ttl=64 time=0.086 ms
64 bytes from controller-1 (face::4): icmp_seq=2 ttl=64 time=0.107 ms
64 bytes from controller-1 (face::4): icmp_seq=3 ttl=64 time=0.092 ms
64 bytes from controller-1 (face::4): icmp_seq=4 ttl=64 time=0.102 ms
64 bytes from controller-1 (face::4): icmp_seq=5 ttl=64 time=0.077 ms
64 bytes from controller-1 (face::4): icmp_seq=6 ttl=64 time=0.066 ms
64 bytes from controller-1 (face::4): icmp_seq=7 ttl=64 time=0.128 ms
system host-list
+----+--------------+-------------+----------------+-------------+--------------+
| id | hostname | personality | administrative | operational | availability |
+----+--------------+-------------+----------------+-------------+--------------+
| 1 | controller-0 | controller | unlocked | enabled | available |
| 2 | compute-2 | worker | unlocked | enabled | available |
| 3 | controller-1 | controller | unlocked | disabled | failed |
| 4 | compute-0 | worker | unlocked | enabled | available |
| 5 | compute-1 | worker | unlocked | enabled | available |
+----+--------------+-------------+----------------+-------------+--------------+
[sysadmin@controller-0 ~(keystone_admin)]$
[sysadmin@controller-0 ~(keystone_admin)]$
[sysadmin@controller-0 ~(keystone_admin)]$ ping controller-1
ping: controller-1: Name or service not known
[sysadmin@controller-0 ~(keystone_admin)]$ ping6 controller-1-cluster-host
PING controller-1-cluster-host(controller-1-cluster-host (feed:beef::4)) 56 data bytes
64 bytes from controller-1-cluster-host (feed:beef::4): icmp_seq=1 ttl=64 time=0.099 ms
64 bytes from controller-1-cluster-host (feed:beef::4): icmp_seq=2 ttl=64 time=0.164 ms
64 bytes from controller-1-cluster-host (feed:beef::4): icmp_seq=3 ttl=64 time=0.165 ms
64 bytes from controller-1-cluster-host (feed:beef::4): icmp_seq=4 ttl=64 time=0.178 ms
64 bytes from controller-1-cluster-host (feed:beef::4): icmp_seq=5 ttl=64 time=0.073 ms
64 bytes from controller-1-cluster-host (feed:beef::4): icmp_seq=6 ttl=64 time=0.098 ms
--- controller-1-cluster-host ping statistics ---
6 packets transmitted, 6 received, 0% packet loss, time 4999ms
rtt min/avg/max/mdev = 0.073/0.129/0.178/0.042 ms
[sysadmin@controller-0 ~(keystone_admin)]$ ping6 controller-1
PING controller-1(controller-1 (face::4)) 56 data bytes
64 bytes from controller-1 (face::4): icmp_seq=1 ttl=64 time=0.100 ms
64 bytes from controller-1 (face::4): icmp_seq=2 ttl=64 time=0.138 ms
64 bytes from controller-1 (face::4): icmp_seq=3 ttl=64 time=0.128 ms
64 bytes from controller-1 (face::4): icmp_seq=4 ttl=64 time=0.097 ms
64 bytes from controller-1 (face::4): icmp_seq=5 ttl=64 time=0.104 ms
Severity
--------
Major
Steps to Reproduce
------------------
1. Verify the health of the system and check for any alarms.
2. Pull the OAM cable and the cluster/management (MGT) cable together. Interface enp24s0f0 was provisioned with both the cluster and management networks.
3. Wait 30 seconds, then reconnect the cables.
4. Verify the swact and the alarms. As described above, controller-1 went to the failed state and rebooted, and the cluster network failure alarm never cleared. Controller-1 stayed in the failed state indefinitely.
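The host-state check in step 4 could be scripted roughly as follows. This is only a sketch: it assumes the output of `system host-list` has already been captured as text, and the helper names `parse_host_list` and `failed_hosts` are hypothetical, not part of any StarlingX tooling. The sample rows are copied from the output in this report.

```python
def parse_host_list(text):
    """Parse the ASCII table printed by `system host-list` into a list of dicts."""
    header = None
    rows = []
    for line in text.splitlines():
        line = line.strip()
        # Skip border lines (+----+) and anything that is not a table row.
        if not line.startswith("|"):
            continue
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if header is None:
            header = cells  # the first row holds the column names
        else:
            rows.append(dict(zip(header, cells)))
    return rows


def failed_hosts(text):
    """Return the hostnames whose availability column reads 'failed'."""
    return [row["hostname"] for row in parse_host_list(text)
            if row["availability"] == "failed"]


# Sample rows taken from the `system host-list` output in this report.
SAMPLE = """
+----+--------------+-------------+----------------+-------------+--------------+
| id | hostname | personality | administrative | operational | availability |
+----+--------------+-------------+----------------+-------------+--------------+
| 1 | controller-0 | controller | unlocked | enabled | available |
| 2 | compute-2 | worker | unlocked | enabled | available |
| 3 | controller-1 | controller | unlocked | disabled | failed |
| 4 | compute-0 | worker | unlocked | enabled | available |
| 5 | compute-1 | worker | unlocked | enabled | available |
+----+--------------+-------------+----------------+-------------+--------------+
"""

if __name__ == "__main__":
    print(failed_hosts(SAMPLE))  # prints ['controller-1']
```

Run after the cable is reconnected, an empty list would indicate that every host has left the failed state. In this report controller-1 would still be listed.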
System Configuration
--------------------
AIO-DX + worker nodes with IPv6 configuration wolfpass-08-12
Expected Behavior
------------------
The failure is recovered after the cable is reconnected.
Actual Behavior
----------------
As described above, controller-1 never recovered.
Reproducibility
---------------
Tested only once. The lab did not recover and a reinstall was required.
Load
----
Build was on 2019-09-18_20-00-00
Last Pass
---------
Timestamp/Logs
--------------
2019-09-19T17:28:33.
Test Activity
-------------
Regression test |
|
2019-09-19 19:25:00 |
Anujeyan Manokeran |
attachment added |
|
collect logs https://bugs.launchpad.net/starlingx/+bug/1844717/+attachment/5289785/+files/ALL_NODES_20190919.181541.tar |
|
2019-09-20 18:40:02 |
Ghada Khalil |
summary |
AIO-DX+ worker:After cable pull test cluster network failure alarm was never cleared when cluster network is back and ping able |
AIO-DX+ worker: After cable pull test/split brain test, controller-1 remains in a failed state |
|
2019-09-20 18:40:14 |
Ghada Khalil |
tags |
|
stx.3.0 stx.ha |
|
2019-09-20 18:41:18 |
Ghada Khalil |
bug |
|
|
added subscriber Bill Zvonar |
2019-09-20 18:41:23 |
Ghada Khalil |
starlingx: importance |
Undecided |
Medium |
|
2019-09-20 18:41:25 |
Ghada Khalil |
starlingx: status |
New |
Triaged |
|
2019-09-20 18:41:35 |
Ghada Khalil |
starlingx: assignee |
|
Bin Qian (bqian20) |
|
2019-09-27 20:03:06 |
Bin Qian |
starlingx: assignee |
Bin Qian (bqian20) |
|
|
2019-09-27 20:03:13 |
Bin Qian |
starlingx: assignee |
|
Bin Qian (bqian20) |
|
2019-10-02 17:46:56 |
Yang Liu |
tags |
stx.3.0 stx.ha |
stx.3.0 stx.ha stx.retestneeded |
|
2019-10-08 14:53:03 |
Ghada Khalil |
starlingx: assignee |
Bin Qian (bqian20) |
Eric MacDonald (rocksolidmtce) |
|
2019-10-08 14:53:18 |
Ghada Khalil |
tags |
stx.3.0 stx.ha stx.retestneeded |
stx.3.0 stx.metal stx.retestneeded |
|
2019-10-25 19:43:35 |
Anujeyan Manokeran |
tags |
stx.3.0 stx.metal stx.retestneeded |
stx.3.0 stx.metal |
|
2019-10-25 20:01:40 |
Ghada Khalil |
starlingx: status |
Triaged |
Invalid |
|