StarlingX

Activity log for bug #1844717

Date	Who	What changed	Old value	New value	Message
2019-09-19 18:58:34	Anujeyan Manokeran	bug			added bug
2019-09-19 19:20:30	Anujeyan Manokeran	description

Old value:

Brief Description
-----------------
Cable pull test(OAM +cluster+Management) on AIO-DX + worker nodes configuration was tested. Here cluster network and management network are provisioned in same interface enp24s0f0. Active controller-1 rebooted and controller-0 become active. Controller-1 was rebooted multiple times with the failure alarm on cluster network .But cluster network for controller-1 ping able from controller-0. So there was no communication issue but alarm was not cleared and controller-1 was in failed state .

system host-if-list controller-1
+--------------------------------------+----------+----------+----------+---------+-----------------+---------------+-----------------------+---------------------------+
| uuid                                 | name     | class    | type     | vlan id | ports           | uses i/f      | used by i/f           | attributes                |
+--------------------------------------+----------+----------+----------+---------+-----------------+---------------+-----------------------+---------------------------+
| 221e37b9-0cb7-41bc-8e96-dea5f9716ad9 | mgmt0    | platform | vlan     | 186     | []              | [u'pxeboot0'] | []                    | MTU=1500                  |
| 24cd614e-bc22-46f1-802f-66a8be6d9b19 | cluster0 | platform | vlan     | 187     | []              | [u'pxeboot0'] | []                    | MTU=1500                  |
| 29779448-8151-45b3-b229-9d78bfc35159 | pxeboot0 | platform | ethernet | None    | [u'enp24s0f0']  | []            | [u'mgmt0',            | MTU=9216                  |
|                                      |          |          |          |         |                 |               | u'cluster0']          |                           |
|                                      |          |          |          |         |                 |               |                       |                           |
| 5ff56ca6-e675-492b-9704-4521afbad331 | data0    | data     | ethernet | None    | [u'enp175s0f0'] | []            | []                    | MTU=1500,accelerated=True |
| 62749374-4856-4a94-8e78-af35834b6561 | oam0     | platform | ethernet | None    | [u'eno1']       | []            | []                    | MTU=1500                  |
+--------------------------------------+----------+----------+----------+---------+-----------------+---------------+-----------------------+---------------------------+

fm alarm-list
| Alarm ID | Reason Text | Entity ID | Severity | Time Stamp |
| 100.114 | NTP configuration does not contain any valid or reachable NTP servers. | host=controller-1.ntp | major | 2019-09-19T17:55:22.213177 |
| 100.114 | NTP address 2607:4100:2:ff::2 is not a valid or a reachable NTP server. | host=controller-1.ntp=2607:4100:2:ff::2 | minor | 2019-09-19T17:55:22.208122 |
| 100.114 | NTP address 64:ff9b::9fcb:848 is not a valid or a reachable NTP server. | host=controller-1.ntp=64:ff9b::9fcb:848 | minor | 2019-09-19T17:55:22.203294 |
| 100.114 | NTP address 64:ff9b::9538:7910 is not a valid or a reachable NTP server. | host=controller-1.ntp=64:ff9b::9538:7910 | minor | 2019-09-19T17:55:22.201814 |
| 100.114 | NTP address 64:ff9b::d173:b566 is not a valid or a reachable NTP server. | host=controller-0.ntp=64:ff9b::d173:b566 | minor | 2019-09-19T17:43:42.434742 |
| 200.010 | compute-1 access to board management module has failed. | host=compute-1 | warning | 2019-09-19T17:28:37.345704 |
| 200.010 | compute-0 access to board management module has failed. | host=compute-0 | warning | 2019-09-19T17:28:37.340704 |
| 200.010 | compute-2 access to board management module has failed. | host=compute-2 | warning | 2019-09-19T17:28:37.285653 |
| 200.010 | controller-0 access to board management module has failed. | host=controller-0 | warning | 2019-09-19T17:28:37.180633 |
| 200.010 | controller-1 access to board management module has failed. | host=controller-1 | warning | 2019-09-19T17:28:36.274734 |
| 200.009 | controller-1 experienced a persistent critical 'Cluster-host Network' communication failure. | host=controller-1.network=Cluster-host | critical | 2019-09-19T17:28:33.043678 |
| 400.002 | Service group directory-services loss of redundancy; expected 2 active members but only 1 active member available | service_domain=controller.service_group=directory-services | major | 2019-09-19T17:26:37.078781 |
| 400.002 | Service group web-services loss of redundancy; expected 2 active members but only 1 active member available | service_domain=controller.service_group=web-services | major | 2019-09-19T17:26:37.070086 |
| 400.002 | Service group storage-services loss of redundancy; expected 2 active members but only 1 active member available | service_domain=controller.service_group=storage-services | major | 2019-09-19T17:26:37.064781 |
| 100.114 | NTP address 2607:4100:2:ff::2 is not a valid or a reachable NTP server. | host=controller-0.ntp=2607:4100:2:ff::2 | minor | 2019-09-19T02:03:43.611747 |

ping6 controller-1
PING controller-1(controller-1 (face::4)) 56 data bytes
64 bytes from controller-1 (face::4): icmp_seq=1 ttl=64 time=0.086 ms
64 bytes from controller-1 (face::4): icmp_seq=2 ttl=64 time=0.107 ms
64 bytes from controller-1 (face::4): icmp_seq=3 ttl=64 time=0.092 ms
64 bytes from controller-1 (face::4): icmp_seq=4 ttl=64 time=0.102 ms
64 bytes from controller-1 (face::4): icmp_seq=5 ttl=64 time=0.077 ms
64 bytes from controller-1 (face::4): icmp_seq=6 ttl=64 time=0.066 ms
64 bytes from controller-1 (face::4): icmp_seq=7 ttl=64 time=0.128 ms

system host-list
+----+--------------+-------------+----------------+-------------+--------------+
| id | hostname     | personality | administrative | operational | availability |
+----+--------------+-------------+----------------+-------------+--------------+
| 1  | controller-0 | controller  | unlocked       | enabled     | available    |
| 2  | compute-2    | worker      | unlocked       | enabled     | available    |
| 3  | controller-1 | controller  | unlocked       | disabled    | failed       |
| 4  | compute-0    | worker      | unlocked       | enabled     | available    |
| 5  | compute-1    | worker      | unlocked       | enabled     | available    |
+----+--------------+-------------+----------------+-------------+--------------+
[sysadmin@controller-0 ~(keystone_admin)]$
[sysadmin@controller-0 ~(keystone_admin)]$
[sysadmin@controller-0 ~(keystone_admin)]$ ping controller-1
ping: controller-1: Name or service not known
[sysadmin@controller-0 ~(keystone_admin)]$ ping6 controller-1-cluster-host
PING controller-1-cluster-host(controller-1-cluster-host (feed:beef::4)) 56 data bytes
64 bytes from controller-1-cluster-host (feed:beef::4): icmp_seq=1 ttl=64 time=0.099 ms
64 bytes from controller-1-cluster-host (feed:beef::4): icmp_seq=2 ttl=64 time=0.164 ms
64 bytes from controller-1-cluster-host (feed:beef::4): icmp_seq=3 ttl=64 time=0.165 ms
64 bytes from controller-1-cluster-host (feed:beef::4): icmp_seq=4 ttl=64 time=0.178 ms
64 bytes from controller-1-cluster-host (feed:beef::4): icmp_seq=5 ttl=64 time=0.073 ms
64 bytes from controller-1-cluster-host (feed:beef::4): icmp_seq=6 ttl=64 time=0.098 ms
--- controller-1-cluster-host ping statistics ---
6 packets transmitted, 6 received, 0% packet loss, time 4999ms
rtt min/avg/max/mdev = 0.073/0.129/0.178/0.042 ms
[sysadmin@controller-0 ~(keystone_admin)]$ ping6 controller-1
PING controller-1(controller-1 (face::4)) 56 data bytes
64 bytes from controller-1 (face::4): icmp_seq=1 ttl=64 time=0.100 ms
64 bytes from controller-1 (face::4): icmp_seq=2 ttl=64 time=0.138 ms
64 bytes from controller-1 (face::4): icmp_seq=3 ttl=64 time=0.128 ms
64 bytes from controller-1 (face::4): icmp_seq=4 ttl=64 time=0.097 ms
64 bytes from controller-1 (face::4): icmp_seq=5 ttl=64 time=0.104 ms

Severity
--------
Major

Steps to Reproduce
------------------
1. Verify health of the system. Verify for any alarms.
2. Pull both OAM and Cluster , MGT interface together. Interface enp24s0f0 was provisioned with cluster and Management.
3. Wait for 30 seconds and put back the cable.
4. Verify the swact and alarms.
As per description controller-1 went to failed state and rebooted and cluster network failure alarm was never cleared. Controller-1 was in failed state forever.

System Configuration
--------------------
Regular system with IPv6 configuration wolfpass-08-12

Expected Behavior
------------------
Failure recovered after cable was put back.

Actual Behavior
----------------
As per description never recovered

Reproducibility
---------------
Able Test only once lab was not recovered . Reinstall was required.

Load
----
Build was on 2019-09-18_20-00-00

Last Pass
---------

Timestamp/Logs
--------------
2019-09-19T17:28:33.

Test Activity
-------------
Regression test

New value:

Brief Description
-----------------
Cable pull test(OAM +cluster+Management) on AIO-DX + worker nodes configuration was tested. Here cluster network and management network are provisioned in same interface enp24s0f0. Active controller-1 rebooted and controller-0 become active. Controller-1 was rebooted multiple times with the failure alarm on cluster network .But cluster network for controller-1 ping able from controller-0. So there was no communication issue but alarm was not cleared and controller-1 was in failed state .

system host-if-list controller-1
+--------------------------------------+----------+----------+----------+---------+-----------------+---------------+-----------------------+---------------------------+
| uuid                                 | name     | class    | type     | vlan id | ports           | uses i/f      | used by i/f           | attributes                |
+--------------------------------------+----------+----------+----------+---------+-----------------+---------------+-----------------------+---------------------------+
| 221e37b9-0cb7-41bc-8e96-dea5f9716ad9 | mgmt0    | platform | vlan     | 186     | []              | [u'pxeboot0'] | []                    | MTU=1500                  |
| 24cd614e-bc22-46f1-802f-66a8be6d9b19 | cluster0 | platform | vlan     | 187     | []              | [u'pxeboot0'] | []                    | MTU=1500                  |
| 29779448-8151-45b3-b229-9d78bfc35159 | pxeboot0 | platform | ethernet | None    | [u'enp24s0f0']  | []            | [u'mgmt0',            | MTU=9216                  |
|                                      |          |          |          |         |                 |               | u'cluster0']          |                           |
|                                      |          |          |          |         |                 |               |                       |                           |
| 5ff56ca6-e675-492b-9704-4521afbad331 | data0    | data     | ethernet | None    | [u'enp175s0f0'] | []            | []                    | MTU=1500,accelerated=True |
| 62749374-4856-4a94-8e78-af35834b6561 | oam0     | platform | ethernet | None    | [u'eno1']       | []            | []                    | MTU=1500                  |
+--------------------------------------+----------+----------+----------+---------+-----------------+---------------+-----------------------+---------------------------+

fm alarm-list
| Alarm ID | Reason Text | Entity ID | Severity | Time Stamp |
| 100.114 | NTP configuration does not contain any valid or reachable NTP servers. | host=controller-1.ntp | major | 2019-09-19T17:55:22.213177 |
| 100.114 | NTP address 2607:4100:2:ff::2 is not a valid or a reachable NTP server. | host=controller-1.ntp=2607:4100:2:ff::2 | minor | 2019-09-19T17:55:22.208122 |
| 100.114 | NTP address 64:ff9b::9fcb:848 is not a valid or a reachable NTP server. | host=controller-1.ntp=64:ff9b::9fcb:848 | minor | 2019-09-19T17:55:22.203294 |
| 100.114 | NTP address 64:ff9b::9538:7910 is not a valid or a reachable NTP server. | host=controller-1.ntp=64:ff9b::9538:7910 | minor | 2019-09-19T17:55:22.201814 |
| 100.114 | NTP address 64:ff9b::d173:b566 is not a valid or a reachable NTP server. | host=controller-0.ntp=64:ff9b::d173:b566 | minor | 2019-09-19T17:43:42.434742 |
| 200.010 | compute-1 access to board management module has failed. | host=compute-1 | warning | 2019-09-19T17:28:37.345704 |
| 200.010 | compute-0 access to board management module has failed. | host=compute-0 | warning | 2019-09-19T17:28:37.340704 |
| 200.010 | compute-2 access to board management module has failed. | host=compute-2 | warning | 2019-09-19T17:28:37.285653 |
| 200.010 | controller-0 access to board management module has failed. | host=controller-0 | warning | 2019-09-19T17:28:37.180633 |
| 200.010 | controller-1 access to board management module has failed. | host=controller-1 | warning | 2019-09-19T17:28:36.274734 |
| 200.009 | controller-1 experienced a persistent critical 'Cluster-host Network' communication failure. | host=controller-1.network=Cluster-host | critical | 2019-09-19T17:28:33.043678 |
| 400.002 | Service group directory-services loss of redundancy; expected 2 active members but only 1 active member available | service_domain=controller.service_group=directory-services | major | 2019-09-19T17:26:37.078781 |
| 400.002 | Service group web-services loss of redundancy; expected 2 active members but only 1 active member available | service_domain=controller.service_group=web-services | major | 2019-09-19T17:26:37.070086 |
| 400.002 | Service group storage-services loss of redundancy; expected 2 active members but only 1 active member available | service_domain=controller.service_group=storage-services | major | 2019-09-19T17:26:37.064781 |
| 100.114 | NTP address 2607:4100:2:ff::2 is not a valid or a reachable NTP server. | host=controller-0.ntp=2607:4100:2:ff::2 | minor | 2019-09-19T02:03:43.611747 |

ping6 controller-1
PING controller-1(controller-1 (face::4)) 56 data bytes
64 bytes from controller-1 (face::4): icmp_seq=1 ttl=64 time=0.086 ms
64 bytes from controller-1 (face::4): icmp_seq=2 ttl=64 time=0.107 ms
64 bytes from controller-1 (face::4): icmp_seq=3 ttl=64 time=0.092 ms
64 bytes from controller-1 (face::4): icmp_seq=4 ttl=64 time=0.102 ms
64 bytes from controller-1 (face::4): icmp_seq=5 ttl=64 time=0.077 ms
64 bytes from controller-1 (face::4): icmp_seq=6 ttl=64 time=0.066 ms
64 bytes from controller-1 (face::4): icmp_seq=7 ttl=64 time=0.128 ms

system host-list
+----+--------------+-------------+----------------+-------------+--------------+
| id | hostname     | personality | administrative | operational | availability |
+----+--------------+-------------+----------------+-------------+--------------+
| 1  | controller-0 | controller  | unlocked       | enabled     | available    |
| 2  | compute-2    | worker      | unlocked       | enabled     | available    |
| 3  | controller-1 | controller  | unlocked       | disabled    | failed       |
| 4  | compute-0    | worker      | unlocked       | enabled     | available    |
| 5  | compute-1    | worker      | unlocked       | enabled     | available    |
+----+--------------+-------------+----------------+-------------+--------------+
[sysadmin@controller-0 ~(keystone_admin)]$
[sysadmin@controller-0 ~(keystone_admin)]$
[sysadmin@controller-0 ~(keystone_admin)]$ ping controller-1
ping: controller-1: Name or service not known
[sysadmin@controller-0 ~(keystone_admin)]$ ping6 controller-1-cluster-host
PING controller-1-cluster-host(controller-1-cluster-host (feed:beef::4)) 56 data bytes
64 bytes from controller-1-cluster-host (feed:beef::4): icmp_seq=1 ttl=64 time=0.099 ms
64 bytes from controller-1-cluster-host (feed:beef::4): icmp_seq=2 ttl=64 time=0.164 ms
64 bytes from controller-1-cluster-host (feed:beef::4): icmp_seq=3 ttl=64 time=0.165 ms
64 bytes from controller-1-cluster-host (feed:beef::4): icmp_seq=4 ttl=64 time=0.178 ms
64 bytes from controller-1-cluster-host (feed:beef::4): icmp_seq=5 ttl=64 time=0.073 ms
64 bytes from controller-1-cluster-host (feed:beef::4): icmp_seq=6 ttl=64 time=0.098 ms
--- controller-1-cluster-host ping statistics ---
6 packets transmitted, 6 received, 0% packet loss, time 4999ms
rtt min/avg/max/mdev = 0.073/0.129/0.178/0.042 ms
[sysadmin@controller-0 ~(keystone_admin)]$ ping6 controller-1
PING controller-1(controller-1 (face::4)) 56 data bytes
64 bytes from controller-1 (face::4): icmp_seq=1 ttl=64 time=0.100 ms
64 bytes from controller-1 (face::4): icmp_seq=2 ttl=64 time=0.138 ms
64 bytes from controller-1 (face::4): icmp_seq=3 ttl=64 time=0.128 ms
64 bytes from controller-1 (face::4): icmp_seq=4 ttl=64 time=0.097 ms
64 bytes from controller-1 (face::4): icmp_seq=5 ttl=64 time=0.104 ms

Severity
--------
Major

Steps to Reproduce
------------------
1. Verify health of the system. Verify for any alarms.
2. Pull both OAM and Cluster , MGT interface together. Interface enp24s0f0 was provisioned with cluster and Management.
3. Wait for 30 seconds and put back the cable.
4. Verify the swact and alarms.
As per description controller-1 went to failed state and rebooted and cluster network failure alarm was never cleared. Controller-1 was in failed state forever.

System Configuration
--------------------
AIO-DX + worker nodes with IPv6 configuration wolfpass-08-12

Expected Behavior
------------------
Failure recovered after cable was put back.

Actual Behavior
----------------
As per description never recovered

Reproducibility
---------------
Able Test only once lab was not recovered . Reinstall was required.

Load
----
Build was on 2019-09-18_20-00-00

Last Pass
---------

Timestamp/Logs
--------------
2019-09-19T17:28:33.

Test Activity
-------------
Regression test
2019-09-19 19:25:00	Anujeyan Manokeran	attachment added		collect logs https://bugs.launchpad.net/starlingx/+bug/1844717/+attachment/5289785/+files/ALL_NODES_20190919.181541.tar
2019-09-20 18:40:02	Ghada Khalil	summary	AIO-DX+ worker:After cable pull test cluster network failure alarm was never cleared when cluster network is back and ping able	AIO-DX+ worker: After cable pull test/split brain test, controller-1 remains in a failed state
2019-09-20 18:40:14	Ghada Khalil	tags		stx.3.0 stx.ha
2019-09-20 18:41:18	Ghada Khalil	bug			added subscriber Bill Zvonar
2019-09-20 18:41:23	Ghada Khalil	starlingx: importance	Undecided	Medium
2019-09-20 18:41:25	Ghada Khalil	starlingx: status	New	Triaged
2019-09-20 18:41:35	Ghada Khalil	starlingx: assignee		Bin Qian (bqian20)
2019-09-27 20:03:06	Bin Qian	starlingx: assignee	Bin Qian (bqian20)
2019-09-27 20:03:13	Bin Qian	starlingx: assignee		Bin Qian (bqian20)
2019-10-02 17:46:56	Yang Liu	tags	stx.3.0 stx.ha	stx.3.0 stx.ha stx.retestneeded
2019-10-08 14:53:03	Ghada Khalil	starlingx: assignee	Bin Qian (bqian20)	Eric MacDonald (rocksolidmtce)
2019-10-08 14:53:18	Ghada Khalil	tags	stx.3.0 stx.ha stx.retestneeded	stx.3.0 stx.metal stx.retestneeded
2019-10-25 19:43:35	Anujeyan Manokeran	tags	stx.3.0 stx.metal stx.retestneeded	stx.3.0 stx.metal
2019-10-25 20:01:40	Ghada Khalil	starlingx: status	Triaged	Invalid