Activity log for bug #1844717

Date Who What changed Old value New value Message
2019-09-19 18:58:34 Anujeyan Manokeran bug added bug
2019-09-19 19:20:30 Anujeyan Manokeran description

Old value:

Brief Description
-----------------
Cable pull test (OAM + cluster + management) was run on an AIO-DX + worker nodes configuration. Here the cluster network and the management network are provisioned on the same interface, enp24s0f0. The active controller-1 rebooted and controller-0 became active. Controller-1 was rebooted multiple times with the failure alarm on the cluster network, but the cluster network for controller-1 was pingable from controller-0. So there was no communication issue, but the alarm was not cleared and controller-1 was in the failed state.

system host-if-list controller-1
+--------------------------------------+----------+----------+----------+---------+-----------------+---------------+-----------------------+---------------------------+
| uuid                                 | name     | class    | type     | vlan id | ports           | uses i/f      | used by i/f           | attributes                |
+--------------------------------------+----------+----------+----------+---------+-----------------+---------------+-----------------------+---------------------------+
| 221e37b9-0cb7-41bc-8e96-dea5f9716ad9 | mgmt0    | platform | vlan     | 186     | []              | [u'pxeboot0'] | []                    | MTU=1500                  |
| 24cd614e-bc22-46f1-802f-66a8be6d9b19 | cluster0 | platform | vlan     | 187     | []              | [u'pxeboot0'] | []                    | MTU=1500                  |
| 29779448-8151-45b3-b229-9d78bfc35159 | pxeboot0 | platform | ethernet | None    | [u'enp24s0f0']  | []            | [u'mgmt0',            | MTU=9216                  |
|                                      |          |          |          |         |                 |               | u'cluster0']          |                           |
|                                      |          |          |          |         |                 |               |                       |                           |
| 5ff56ca6-e675-492b-9704-4521afbad331 | data0    | data     | ethernet | None    | [u'enp175s0f0'] | []            | []                    | MTU=1500,accelerated=True |
| 62749374-4856-4a94-8e78-af35834b6561 | oam0     | platform | ethernet | None    | [u'eno1']       | []            | []                    | MTU=1500                  |
+--------------------------------------+----------+----------+----------+---------+-----------------+---------------+-----------------------+---------------------------+

fm alarm-list
+----------+-------------------------------------------------------------------------------------+------------------------+----------+----------------+
| Alarm ID | Reason Text                                                                         | Entity ID              | Severity | Time Stamp     |
+----------+-------------------------------------------------------------------------------------+------------------------+----------+----------------+
| 100.114  | NTP configuration does not contain any valid or reachable NTP servers.              | host=controller-1.ntp  | major    | 2019-09-19T17: |
|          |                                                                                     |                        |          | 55:22.213177   |
|          |                                                                                     |                        |          |                |
| 100.114  | NTP address 2607:4100:2:ff::2 is not a valid or a reachable NTP server.             | host=controller-1.ntp= | minor    | 2019-09-19T17: |
|          |                                                                                     | 2607:4100:2:ff::2      |          | 55:22.208122   |
|          |                                                                                     |                        |          |                |
| 100.114  | NTP address 64:ff9b::9fcb:848 is not a valid or a reachable NTP server.             | host=controller-1.ntp= | minor    | 2019-09-19T17: |
|          |                                                                                     | 64:ff9b::9fcb:848      |          | 55:22.203294   |
|          |                                                                                     |                        |          |                |
| 100.114  | NTP address 64:ff9b::9538:7910 is not a valid or a reachable NTP server.            | host=controller-1.ntp= | minor    | 2019-09-19T17: |
|          |                                                                                     | 64:ff9b::9538:7910     |          | 55:22.201814   |
|          |                                                                                     |                        |          |                |
| 100.114  | NTP address 64:ff9b::d173:b566 is not a valid or a reachable NTP server.            | host=controller-0.ntp= | minor    | 2019-09-19T17: |
|          |                                                                                     | 64:ff9b::d173:b566     |          | 43:42.434742   |
|          |                                                                                     |                        |          |                |
| 200.010  | compute-1 access to board management module has failed.                             | host=compute-1         | warning  | 2019-09-19T17: |
|          |                                                                                     |                        |          | 28:37.345704   |
|          |                                                                                     |                        |          |                |
| 200.010  | compute-0 access to board management module has failed.                             | host=compute-0         | warning  | 2019-09-19T17: |
|          |                                                                                     |                        |          | 28:37.340704   |
|          |                                                                                     |                        |          |                |
| 200.010  | compute-2 access to board management module has failed.                             | host=compute-2         | warning  | 2019-09-19T17: |
|          |                                                                                     |                        |          | 28:37.285653   |
|          |                                                                                     |                        |          |                |
| 200.010  | controller-0 access to board management module has failed.                          | host=controller-0      | warning  | 2019-09-19T17: |
|          |                                                                                     |                        |          | 28:37.180633   |
|          |                                                                                     |                        |          |                |
| 200.010  | controller-1 access to board management module has failed.                          | host=controller-1      | warning  | 2019-09-19T17: |
|          |                                                                                     |                        |          | 28:36.274734   |
|          |                                                                                     |                        |          |                |
| 200.009  | controller-1 experienced a persistent critical 'Cluster-host Network' communication | host=controller-1.     | critical | 2019-09-19T17: |
|          | failure.                                                                            | network=Cluster-host   |          | 28:33.043678   |
|          |                                                                                     |                        |          |                |
| 400.002  | Service group directory-services loss of redundancy; expected 2 active members but  | service_domain=        | major    | 2019-09-19T17: |
|          | only 1 active member available                                                      | controller.            |          | 26:37.078781   |
|          |                                                                                     | service_group=         |          |                |
|          |                                                                                     | directory-services     |          |                |
|          |                                                                                     |                        |          |                |
| 400.002  | Service group web-services loss of redundancy; expected 2 active members but only 1 | service_domain=        | major    | 2019-09-19T17: |
|          | active member available                                                             | controller.            |          | 26:37.070086   |
|          |                                                                                     | service_group=web-     |          |                |
|          |                                                                                     | services               |          |                |
|          |                                                                                     |                        |          |                |
| 400.002  | Service group storage-services loss of redundancy; expected 2 active members but    | service_domain=        | major    | 2019-09-19T17: |
|          | only 1 active member available                                                      | controller.            |          | 26:37.064781   |
|          |                                                                                     | service_group=storage- |          |                |
|          |                                                                                     | services               |          |                |
|          |                                                                                     |                        |          |                |
| 100.114  | NTP address 2607:4100:2:ff::2 is not a valid or a reachable NTP server.             | host=controller-0.ntp= | minor    | 2019-09-19T02: |
|          |                                                                                     | 2607:4100:2:ff::2      |          | 03:43.611747   |
|          |                                                                                     |                        |          |                |
+----------+-------------------------------------------------------------------------------------+------------------------+----------+----------------+

ping6 controller-1
PING controller-1(controller-1 (face::4)) 56 data bytes
64 bytes from controller-1 (face::4): icmp_seq=1 ttl=64 time=0.086 ms
64 bytes from controller-1 (face::4): icmp_seq=2 ttl=64 time=0.107 ms
64 bytes from controller-1 (face::4): icmp_seq=3 ttl=64 time=0.092 ms
64 bytes from controller-1 (face::4): icmp_seq=4 ttl=64 time=0.102 ms
64 bytes from controller-1 (face::4): icmp_seq=5 ttl=64 time=0.077 ms
64 bytes from controller-1 (face::4): icmp_seq=6 ttl=64 time=0.066 ms
64 bytes from controller-1 (face::4): icmp_seq=7 ttl=64 time=0.128 ms

system host-list
+----+--------------+-------------+----------------+-------------+--------------+
| id | hostname     | personality | administrative | operational | availability |
+----+--------------+-------------+----------------+-------------+--------------+
| 1  | controller-0 | controller  | unlocked       | enabled     | available    |
| 2  | compute-2    | worker      | unlocked       | enabled     | available    |
| 3  | controller-1 | controller  | unlocked       | disabled    | failed       |
| 4  | compute-0    | worker      | unlocked       | enabled     | available    |
| 5  | compute-1    | worker      | unlocked       | enabled     | available    |
+----+--------------+-------------+----------------+-------------+--------------+
[sysadmin@controller-0 ~(keystone_admin)]$
[sysadmin@controller-0 ~(keystone_admin)]$
[sysadmin@controller-0 ~(keystone_admin)]$ ping controller-1
ping: controller-1: Name or service not known
[sysadmin@controller-0 ~(keystone_admin)]$ ping6 controller-1-cluster-host
PING controller-1-cluster-host(controller-1-cluster-host (feed:beef::4)) 56 data bytes
64 bytes from controller-1-cluster-host (feed:beef::4): icmp_seq=1 ttl=64 time=0.099 ms
64 bytes from controller-1-cluster-host (feed:beef::4): icmp_seq=2 ttl=64 time=0.164 ms
64 bytes from controller-1-cluster-host (feed:beef::4): icmp_seq=3 ttl=64 time=0.165 ms
64 bytes from controller-1-cluster-host (feed:beef::4): icmp_seq=4 ttl=64 time=0.178 ms
64 bytes from controller-1-cluster-host (feed:beef::4): icmp_seq=5 ttl=64 time=0.073 ms
64 bytes from controller-1-cluster-host (feed:beef::4): icmp_seq=6 ttl=64 time=0.098 ms
--- controller-1-cluster-host ping statistics ---
6 packets transmitted, 6 received, 0% packet loss, time 4999ms
rtt min/avg/max/mdev = 0.073/0.129/0.178/0.042 ms
[sysadmin@controller-0 ~(keystone_admin)]$ ping6 controller-1
PING controller-1(controller-1 (face::4)) 56 data bytes
64 bytes from controller-1 (face::4): icmp_seq=1 ttl=64 time=0.100 ms
64 bytes from controller-1 (face::4): icmp_seq=2 ttl=64 time=0.138 ms
64 bytes from controller-1 (face::4): icmp_seq=3 ttl=64 time=0.128 ms
64 bytes from controller-1 (face::4): icmp_seq=4 ttl=64 time=0.097 ms
64 bytes from controller-1 (face::4): icmp_seq=5 ttl=64 time=0.104 ms

Severity
--------
Major

Steps to Reproduce
------------------
1. Verify the health of the system. Check for any alarms.
2. Pull both the OAM and the cluster/MGT interfaces together. Interface enp24s0f0 was provisioned with cluster and management.
3. Wait for 30 seconds and put the cable back.
4. Verify the swact and alarms.
As per the description, controller-1 went to the failed state and rebooted, and the cluster network failure alarm was never cleared. Controller-1 stayed in the failed state forever.

System Configuration
--------------------
Regular system with IPv6 configuration
wolfpass-08-12

Expected Behavior
------------------
Failure recovered after the cable was put back.

Actual Behavior
----------------
As per the description, never recovered.

Reproducibility
---------------
Able to test only once because the lab did not recover. A reinstall was required.

Load
----
Build was on 2019-09-18_20-00-00

Last Pass
---------

Timestamp/Logs
--------------
2019-09-19T17:28:33.

Test Activity
-------------
Regression test

New value:

Brief Description
-----------------
Cable pull test (OAM + cluster + management) was run on an AIO-DX + worker nodes configuration. Here the cluster network and the management network are provisioned on the same interface, enp24s0f0. The active controller-1 rebooted and controller-0 became active. Controller-1 was rebooted multiple times with the failure alarm on the cluster network, but the cluster network for controller-1 was pingable from controller-0. So there was no communication issue, but the alarm was not cleared and controller-1 was in the failed state.

system host-if-list controller-1
+--------------------------------------+----------+----------+----------+---------+-----------------+---------------+-----------------------+---------------------------+
| uuid                                 | name     | class    | type     | vlan id | ports           | uses i/f      | used by i/f           | attributes                |
+--------------------------------------+----------+----------+----------+---------+-----------------+---------------+-----------------------+---------------------------+
| 221e37b9-0cb7-41bc-8e96-dea5f9716ad9 | mgmt0    | platform | vlan     | 186     | []              | [u'pxeboot0'] | []                    | MTU=1500                  |
| 24cd614e-bc22-46f1-802f-66a8be6d9b19 | cluster0 | platform | vlan     | 187     | []              | [u'pxeboot0'] | []                    | MTU=1500                  |
| 29779448-8151-45b3-b229-9d78bfc35159 | pxeboot0 | platform | ethernet | None    | [u'enp24s0f0']  | []            | [u'mgmt0',            | MTU=9216                  |
|                                      |          |          |          |         |                 |               | u'cluster0']          |                           |
|                                      |          |          |          |         |                 |               |                       |                           |
| 5ff56ca6-e675-492b-9704-4521afbad331 | data0    | data     | ethernet | None    | [u'enp175s0f0'] | []            | []                    | MTU=1500,accelerated=True |
| 62749374-4856-4a94-8e78-af35834b6561 | oam0     | platform | ethernet | None    | [u'eno1']       | []            | []                    | MTU=1500                  |
+--------------------------------------+----------+----------+----------+---------+-----------------+---------------+-----------------------+---------------------------+

fm alarm-list
+----------+-------------------------------------------------------------------------------------+------------------------+----------+----------------+
| Alarm ID | Reason Text                                                                         | Entity ID              | Severity | Time Stamp     |
+----------+-------------------------------------------------------------------------------------+------------------------+----------+----------------+
| 100.114  | NTP configuration does not contain any valid or reachable NTP servers.              | host=controller-1.ntp  | major    | 2019-09-19T17: |
|          |                                                                                     |                        |          | 55:22.213177   |
|          |                                                                                     |                        |          |                |
| 100.114  | NTP address 2607:4100:2:ff::2 is not a valid or a reachable NTP server.             | host=controller-1.ntp= | minor    | 2019-09-19T17: |
|          |                                                                                     | 2607:4100:2:ff::2      |          | 55:22.208122   |
|          |                                                                                     |                        |          |                |
| 100.114  | NTP address 64:ff9b::9fcb:848 is not a valid or a reachable NTP server.             | host=controller-1.ntp= | minor    | 2019-09-19T17: |
|          |                                                                                     | 64:ff9b::9fcb:848      |          | 55:22.203294   |
|          |                                                                                     |                        |          |                |
| 100.114  | NTP address 64:ff9b::9538:7910 is not a valid or a reachable NTP server.            | host=controller-1.ntp= | minor    | 2019-09-19T17: |
|          |                                                                                     | 64:ff9b::9538:7910     |          | 55:22.201814   |
|          |                                                                                     |                        |          |                |
| 100.114  | NTP address 64:ff9b::d173:b566 is not a valid or a reachable NTP server.            | host=controller-0.ntp= | minor    | 2019-09-19T17: |
|          |                                                                                     | 64:ff9b::d173:b566     |          | 43:42.434742   |
|          |                                                                                     |                        |          |                |
| 200.010  | compute-1 access to board management module has failed.                             | host=compute-1         | warning  | 2019-09-19T17: |
|          |                                                                                     |                        |          | 28:37.345704   |
|          |                                                                                     |                        |          |                |
| 200.010  | compute-0 access to board management module has failed.                             | host=compute-0         | warning  | 2019-09-19T17: |
|          |                                                                                     |                        |          | 28:37.340704   |
|          |                                                                                     |                        |          |                |
| 200.010  | compute-2 access to board management module has failed.                             | host=compute-2         | warning  | 2019-09-19T17: |
|          |                                                                                     |                        |          | 28:37.285653   |
|          |                                                                                     |                        |          |                |
| 200.010  | controller-0 access to board management module has failed.                          | host=controller-0      | warning  | 2019-09-19T17: |
|          |                                                                                     |                        |          | 28:37.180633   |
|          |                                                                                     |                        |          |                |
| 200.010  | controller-1 access to board management module has failed.                          | host=controller-1      | warning  | 2019-09-19T17: |
|          |                                                                                     |                        |          | 28:36.274734   |
|          |                                                                                     |                        |          |                |
| 200.009  | controller-1 experienced a persistent critical 'Cluster-host Network' communication | host=controller-1.     | critical | 2019-09-19T17: |
|          | failure.                                                                            | network=Cluster-host   |          | 28:33.043678   |
|          |                                                                                     |                        |          |                |
| 400.002  | Service group directory-services loss of redundancy; expected 2 active members but  | service_domain=        | major    | 2019-09-19T17: |
|          | only 1 active member available                                                      | controller.            |          | 26:37.078781   |
|          |                                                                                     | service_group=         |          |                |
|          |                                                                                     | directory-services     |          |                |
|          |                                                                                     |                        |          |                |
| 400.002  | Service group web-services loss of redundancy; expected 2 active members but only 1 | service_domain=        | major    | 2019-09-19T17: |
|          | active member available                                                             | controller.            |          | 26:37.070086   |
|          |                                                                                     | service_group=web-     |          |                |
|          |                                                                                     | services               |          |                |
|          |                                                                                     |                        |          |                |
| 400.002  | Service group storage-services loss of redundancy; expected 2 active members but    | service_domain=        | major    | 2019-09-19T17: |
|          | only 1 active member available                                                      | controller.            |          | 26:37.064781   |
|          |                                                                                     | service_group=storage- |          |                |
|          |                                                                                     | services               |          |                |
|          |                                                                                     |                        |          |                |
| 100.114  | NTP address 2607:4100:2:ff::2 is not a valid or a reachable NTP server.             | host=controller-0.ntp= | minor    | 2019-09-19T02: |
|          |                                                                                     | 2607:4100:2:ff::2      |          | 03:43.611747   |
|          |                                                                                     |                        |          |                |
+----------+-------------------------------------------------------------------------------------+------------------------+----------+----------------+

ping6 controller-1
PING controller-1(controller-1 (face::4)) 56 data bytes
64 bytes from controller-1 (face::4): icmp_seq=1 ttl=64 time=0.086 ms
64 bytes from controller-1 (face::4): icmp_seq=2 ttl=64 time=0.107 ms
64 bytes from controller-1 (face::4): icmp_seq=3 ttl=64 time=0.092 ms
64 bytes from controller-1 (face::4): icmp_seq=4 ttl=64 time=0.102 ms
64 bytes from controller-1 (face::4): icmp_seq=5 ttl=64 time=0.077 ms
64 bytes from controller-1 (face::4): icmp_seq=6 ttl=64 time=0.066 ms
64 bytes from controller-1 (face::4): icmp_seq=7 ttl=64 time=0.128 ms

system host-list
+----+--------------+-------------+----------------+-------------+--------------+
| id | hostname     | personality | administrative | operational | availability |
+----+--------------+-------------+----------------+-------------+--------------+
| 1  | controller-0 | controller  | unlocked       | enabled     | available    |
| 2  | compute-2    | worker      | unlocked       | enabled     | available    |
| 3  | controller-1 | controller  | unlocked       | disabled    | failed       |
| 4  | compute-0    | worker      | unlocked       | enabled     | available    |
| 5  | compute-1    | worker      | unlocked       | enabled     | available    |
+----+--------------+-------------+----------------+-------------+--------------+
[sysadmin@controller-0 ~(keystone_admin)]$
[sysadmin@controller-0 ~(keystone_admin)]$
[sysadmin@controller-0 ~(keystone_admin)]$ ping controller-1
ping: controller-1: Name or service not known
[sysadmin@controller-0 ~(keystone_admin)]$ ping6 controller-1-cluster-host
PING controller-1-cluster-host(controller-1-cluster-host (feed:beef::4)) 56 data bytes
64 bytes from controller-1-cluster-host (feed:beef::4): icmp_seq=1 ttl=64 time=0.099 ms
64 bytes from controller-1-cluster-host (feed:beef::4): icmp_seq=2 ttl=64 time=0.164 ms
64 bytes from controller-1-cluster-host (feed:beef::4): icmp_seq=3 ttl=64 time=0.165 ms
64 bytes from controller-1-cluster-host (feed:beef::4): icmp_seq=4 ttl=64 time=0.178 ms
64 bytes from controller-1-cluster-host (feed:beef::4): icmp_seq=5 ttl=64 time=0.073 ms
64 bytes from controller-1-cluster-host (feed:beef::4): icmp_seq=6 ttl=64 time=0.098 ms
--- controller-1-cluster-host ping statistics ---
6 packets transmitted, 6 received, 0% packet loss, time 4999ms
rtt min/avg/max/mdev = 0.073/0.129/0.178/0.042 ms
[sysadmin@controller-0 ~(keystone_admin)]$ ping6 controller-1
PING controller-1(controller-1 (face::4)) 56 data bytes
64 bytes from controller-1 (face::4): icmp_seq=1 ttl=64 time=0.100 ms
64 bytes from controller-1 (face::4): icmp_seq=2 ttl=64 time=0.138 ms
64 bytes from controller-1 (face::4): icmp_seq=3 ttl=64 time=0.128 ms
64 bytes from controller-1 (face::4): icmp_seq=4 ttl=64 time=0.097 ms
64 bytes from controller-1 (face::4): icmp_seq=5 ttl=64 time=0.104 ms

Severity
--------
Major

Steps to Reproduce
------------------
1. Verify the health of the system. Check for any alarms.
2. Pull both the OAM and the cluster/MGT interfaces together. Interface enp24s0f0 was provisioned with cluster and management.
3. Wait for 30 seconds and put the cable back.
4. Verify the swact and alarms.
As per the description, controller-1 went to the failed state and rebooted, and the cluster network failure alarm was never cleared. Controller-1 stayed in the failed state forever.

System Configuration
--------------------
AIO-DX + worker nodes with IPv6 configuration
wolfpass-08-12

Expected Behavior
------------------
Failure recovered after the cable was put back.

Actual Behavior
----------------
As per the description, never recovered.

Reproducibility
---------------
Able to test only once because the lab did not recover. A reinstall was required.

Load
----
Build was on 2019-09-18_20-00-00

Last Pass
---------

Timestamp/Logs
--------------
2019-09-19T17:28:33.

Test Activity
-------------
Regression test

2019-09-19 19:25:00 Anujeyan Manokeran attachment added collect logs https://bugs.launchpad.net/starlingx/+bug/1844717/+attachment/5289785/+files/ALL_NODES_20190919.181541.tar
2019-09-20 18:40:02 Ghada Khalil summary AIO-DX+ worker:After cable pull test cluster network failure alarm was never cleared when cluster network is back and ping able AIO-DX+ worker: After cable pull test/split brain test, controller-1 remains in a failed state
2019-09-20 18:40:14 Ghada Khalil tags stx.3.0 stx.ha
2019-09-20 18:41:18 Ghada Khalil bug added subscriber Bill Zvonar
2019-09-20 18:41:23 Ghada Khalil starlingx: importance Undecided Medium
2019-09-20 18:41:25 Ghada Khalil starlingx: status New Triaged
2019-09-20 18:41:35 Ghada Khalil starlingx: assignee Bin Qian (bqian20)
2019-09-27 20:03:06 Bin Qian starlingx: assignee Bin Qian (bqian20)
2019-09-27 20:03:13 Bin Qian starlingx: assignee Bin Qian (bqian20)
2019-10-02 17:46:56 Yang Liu tags stx.3.0 stx.ha stx.3.0 stx.ha stx.retestneeded
2019-10-08 14:53:03 Ghada Khalil starlingx: assignee Bin Qian (bqian20) Eric MacDonald (rocksolidmtce)
2019-10-08 14:53:18 Ghada Khalil tags stx.3.0 stx.ha stx.retestneeded stx.3.0 stx.metal stx.retestneeded
2019-10-25 19:43:35 Anujeyan Manokeran tags stx.3.0 stx.metal stx.retestneeded stx.3.0 stx.metal
2019-10-25 20:01:40 Ghada Khalil starlingx: status Triaged Invalid