unexpected swact happened after host-swact

Bug #1837461 reported by Peng Peng
Affects: StarlingX
Status: Fix Released
Importance: Medium
Assigned to: Bin Qian
Milestone: (none listed)

Bug Description

Brief Description
-----------------
After a host-swact from controller-1 to controller-0 completed, a second, unexpected swact occurred and controller-1 became active again.

Severity
--------
Major

Steps to Reproduce
------------------
Upload helm charts with the helm-upload command from the active controller
Swact the active controller and verify that the uploaded charts are synced over
host-swact

TC-name: z_containers/test_custom_containers.py::test_upload_charts_via_helm_upload

Expected Behavior
------------------

Actual Behavior
----------------

Reproducibility
---------------
Seen once

System Configuration
--------------------
Two node system

Lab-name: IP_5-6

Branch/Pull Time/Commit
-----------------------
stx master as of 20190721T233000Z

Last Pass
---------
Lab: SM_3
Load: 20190721T233000Z

Timestamp/Logs
--------------
[2019-07-22 18:16:53,234] 301 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.204.2:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne host-swact controller-1'

[sysadmin@controller-1 ~(keystone_admin)]$
[2019-07-22 18:17:17,636] 301 DEBUG MainThread ssh.send :: Send ''
[2019-07-22 18:17:20,740] 275 INFO MainThread ssh.wait_for_disconnect:: ssh session to 128.224.151.216 disconnected
[2019-07-22 18:17:20,740] 1563 INFO MainThread host_helper.wait_for_swact_complete:: ssh to 128.224.151.216 OAM floating IP disconnected, indicating swact initiated.
[2019-07-22 18:17:50,758] 151 INFO MainThread ssh.connect :: Attempt to connect to host - 128.224.151.216
[2019-07-22 18:17:52,163] 301 DEBUG MainThread ssh.send :: Send ''
[2019-07-22 18:17:52,265] 423 DEBUG MainThread ssh.expect :: Output:
controller-0:~$

[2019-07-22 18:18:18,248] 301 DEBUG MainThread ssh.send :: Send 'fm --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.204.2:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne alarm-list --nowrap'
[2019-07-22 18:18:19,655] 423 DEBUG MainThread ssh.expect :: Output:
+----------+-----------------------------------------------------------------------------------------------------+---------------------------------------------------------------------+----------+----------------------------+
| Alarm ID | Reason Text | Entity ID | Severity | Time Stamp |
+----------+-----------------------------------------------------------------------------------------------------+---------------------------------------------------------------------+----------+----------------------------+
| 400.002 | Service group controller-services has no active members available; expected 1 active member | service_domain=controller.service_group=controller-services | critical | 2019-07-22T18:17:18.366921 |
| 400.002 | Service group storage-monitoring-services has no active members available; expected 1 active member | service_domain=controller.service_group=storage-monitoring-services | critical | 2019-07-22T18:17:17.854626 |
| 100.114 | NTP configuration does not contain any valid or reachable NTP servers. | host=controller-1.ntp | major | 2019-07-22T17:28:29.937004 |
+----------+-----------------------------------------------------------------------------------------------------+---------------------------------------------------------------------+----------+----------------------------+
controller-0:~$

[2019-07-22 18:20:30,831] 483 WARNING MainThread ssh.exec_cmd:: Timeout exceeded.

[2019-07-22 18:20:46,718] 301 DEBUG MainThread ssh.send :: Send 'fm --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.204.2:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne alarm-list --nowrap --uuid'
[2019-07-22 18:20:47,696] 423 DEBUG MainThread ssh.expect :: Output:
Must provide Keystone credentials or user-defined endpoint and token, error was: Unable to establish connection to http://192.168.204.2:5000/v3/auth/tokens: HTTPConnectionPool(host='192.168.204.2', port=5000): Max retries exceeded with url: /v3/auth/tokens (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fa5f56e3910>: Failed to establish a new connection: [Errno 111] Connection refused',))
controller-1:~$

[2019-07-22 18:20:51,284] 301 DEBUG MainThread ssh.send :: Send 'fm --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.204.2:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne alarm-list --nowrap --uuid'
[2019-07-22 18:21:01,400] 423 DEBUG MainThread ssh.expect :: Output:
+--------------------------------------+----------+------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------+----------+----------------------------+
| UUID | Alarm ID | Reason Text | Entity ID | Severity | Time Stamp |
+--------------------------------------+----------+------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------+----------+----------------------------+
| de47abb9-b624-41e0-86de-6eeb687ae358 | 400.002 | Service group controller-services loss of redundancy; expected 1 standby member but no standby members available | service_domain=controller.service_group=controller-services | major | 2019-07-22T18:20:51.914146 |
| c6f879b4-f322-4fa1-a54f-a8b8daaf3650 | 400.002 | Service group cloud-services loss of redundancy; expected 1 standby member but no standby members available | service_domain=controller.service_group=cloud-services | major | 2019-07-22T18:20:51.872017 |
| 171fca9f-c7d9-4698-a787-8a26de96401d | 400.002 | Service group vim-services loss of redundancy; expected 1 standby member but no standby members available | service_domain=controller.service_group=vim-services | major | 2019-07-22T18:20:51.787936 |
| 0a51027f-4c39-405f-b832-7a6d3ae6dcd3 | 400.001 | Service group cloud-services warning; dbmon(enabled-active, ) | service_domain=controller.service_group=cloud-services.host=controller-1 | minor | 2019-07-22T18:20:51.701538 |
| 0c26854a-b525-4dfe-939c-e149e65626e0 | 400.002 | Service group oam-services loss of redundancy; expected 1 standby member but no standby members available | service_domain=controller.service_group=oam-services | major | 2019-07-22T18:20:45.861781 |
| 510f3d0c-3c0c-4da8-8c58-724be7100519 | 400.001 | Service group controller-services failure; ceph-mon(disabled, ) | service_domain=controller.service_group=controller-services.host=controller-0 | critical | 2019-07-22T18:20:45.693684 |
| 3974a8df-3f91-49de-a943-bbfd4db578a6 | 400.002 | Service group storage-monitoring-services has no active members available; expected 1 active member | service_domain=controller.service_group=storage-monitoring-services | critical | 2019-07-22T18:20:45.686474 |
| b0f0ad7a-e220-4d9b-b1cd-7f3b8e1d233b | 100.114 | NTP configuration does not contain any valid or reachable NTP servers. | host=controller-1.ntp | major | 2019-07-22T17:28:29.937004 |
+--------------------------------------+----------+------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------+----------+----------------------------+
controller-1:~$
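
The alarm tables above are easier to compare once they are parsed into records. The following is a minimal Python sketch, not part of the test framework: the function name parse_alarm_table is hypothetical, the sample rows are copied from the first alarm-list output above with the border lines shortened, and it assumes no cell contains a "|" character.

```python
def parse_alarm_table(text):
    """Parse an ASCII table as printed by `fm alarm-list` into a list of dicts.

    Hypothetical helper for log triage; assumes cells never contain '|'.
    """
    header = None
    rows = []
    for line in text.splitlines():
        line = line.strip()
        # Border lines (+----+) and shell prompts are not table rows.
        if not (line.startswith("|") and line.endswith("|")):
            continue
        cells = [cell.strip() for cell in line[1:-1].split("|")]
        if header is None:
            header = cells  # first row is the column header
        elif len(cells) == len(header):
            rows.append(dict(zip(header, cells)))
    return rows


# Rows copied from the 18:18:19 alarm-list output; borders shortened.
SAMPLE = """\
+----------+-------------+-----------+----------+------------+
| Alarm ID | Reason Text | Entity ID | Severity | Time Stamp |
+----------+-------------+-----------+----------+------------+
| 400.002 | Service group controller-services has no active members available; expected 1 active member | service_domain=controller.service_group=controller-services | critical | 2019-07-22T18:17:18.366921 |
| 400.002 | Service group storage-monitoring-services has no active members available; expected 1 active member | service_domain=controller.service_group=storage-monitoring-services | critical | 2019-07-22T18:17:17.854626 |
| 100.114 | NTP configuration does not contain any valid or reachable NTP servers. | host=controller-1.ntp | major | 2019-07-22T17:28:29.937004 |
+----------+-------------+-----------+----------+------------+
"""

alarms = parse_alarm_table(SAMPLE)
critical = [a for a in alarms if a["Severity"] == "critical"]
for alarm in critical:
    print(alarm["Time Stamp"], alarm["Entity ID"])
```

Filtering on the Severity column this way separates the swact-related critical alarms from the pre-existing NTP alarm on controller-1.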

Test Activity
-------------
Sanity

Peng Peng (ppeng) wrote :
Ghada Khalil (gkhalil) wrote :

Assigning to Bin Qian to triage

tags: added: stx.ha
Changed in starlingx:
assignee: nobody → Bin Qian (bqian20)
Bin Qian (bqian20) wrote :

The swact was terminated because ceph-mon failed to reach the enabled state on controller-0.

| 2019-07-22T18:18:13.607 | 650 | service-scn | ceph-mon | enabling | disabling | enable failed
| 2019-07-22T18:18:13.633 | 651 | service-scn | ceph-radosgw | enabling-throttle | disabling | disabled state requested
| 2019-07-22T18:18:13.778 | 652 | service-scn | ceph-radosgw | disabling | disabled | disable success
| 2019-07-22T18:19:14.235 | 653 | service-scn | ceph-mon | disabling | disabling-failed | disable failed
| 2019-07-22T18:19:14.235 | 654 | service-group-scn | controller-services | go-active | go-active-failed | ceph-mon(disabling, failed)
| 2019-07-22T18:19:15.466 | 655 | service-scn | ceph-mon | disabling-failed | enabling-failed | enabled-active state requested
| 2019-07-22T18:19:15.466 | 656 | service-scn | ceph-radosgw | disabled | enabling-throttle | enabled-active state requested
| 2019-07-22T18:19:18.946 | 657 | service-group-scn | vim-services | active | disabling |
| 2019-07-22T18:19:18.947 | 658 | service-group-scn | cloud-services | active | disabling |
| 2019-07-22T18:19:18.948 | 659 | service-group-scn | controller-services | go-active-failed | disabling-failed | ceph-mon(enabling, failed)

Duplicate of https://bugs.launchpad.net/starlingx/+bug/1836075
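
The state-change records quoted in this comment can be scanned for the first failure with a short script. This is a rough Python sketch under stated assumptions: the field names (timestamp, seq, kind, entity, from_state, to_state, reason) are labels chosen here for illustration and not names taken from the service manager, and the sample is an excerpt of five rows copied from the comment.

```python
from collections import namedtuple

# Field names are assumptions made for this sketch, not official column names.
Event = namedtuple(
    "Event", "timestamp seq kind entity from_state to_state reason"
)


def parse_events(text):
    """Parse pipe-delimited state-change rows into Event tuples."""
    events = []
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue
        # Drop the leading '|'; a trailing '|' yields an empty reason cell.
        cells = [cell.strip() for cell in line[1:].split("|")]
        if len(cells) == 7:
            events.append(Event(*cells))
    return events


def failures(events):
    """Return events whose target state or reason mentions a failure."""
    return [e for e in events if "failed" in e.to_state or "failed" in e.reason]


# Excerpt of the rows quoted in the comment.
SAMPLE = """\
| 2019-07-22T18:18:13.607 | 650 | service-scn | ceph-mon | enabling | disabling | enable failed
| 2019-07-22T18:18:13.778 | 652 | service-scn | ceph-radosgw | disabling | disabled | disable success
| 2019-07-22T18:19:14.235 | 653 | service-scn | ceph-mon | disabling | disabling-failed | disable failed
| 2019-07-22T18:19:14.235 | 654 | service-group-scn | controller-services | go-active | go-active-failed | ceph-mon(disabling, failed)
| 2019-07-22T18:19:18.946 | 657 | service-group-scn | vim-services | active | disabling |
"""

events = parse_events(SAMPLE)
failed = failures(events)
first = failed[0]
print(first.timestamp, first.entity, first.from_state, "->", first.to_state, first.reason)
```

On this excerpt the earliest failure is the ceph-mon enable failure, which precedes the controller-services go-active failure, consistent with the triage above.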

Yang Liu (yliu12)
tags: added: stx.retestneeded
Ghada Khalil (gkhalil) wrote :

Duplicate bug #1836075 was resolved by:
https://review.opendev.org/672708
https://review.opendev.org/672709

Marking as Fix Released

Changed in starlingx:
status: New → Fix Released
tags: added: stx.2.0 stx.storage
removed: stx.ha
Changed in starlingx:
importance: Undecided → Medium
Peng Peng (ppeng)
tags: removed: stx.retestneeded