unexpected swact occurred after host-swact completed due to ceph-mon issue

Bug #1842966 reported by Peng Peng
This bug affects 1 person
Affects: StarlingX
Status: Fix Released
Importance: Medium
Assigned to: Daniel Badea

Bug Description

Brief Description
-----------------
After the host-swact command execution completed, alarm 400.002 | "Service group controller-services has no active members available; expected 1 active member" was raised, and an unexpected swact occurred.

Severity
--------
Critical

Steps to Reproduce
------------------
host-swact
check fm alarm-list

TC-name: z_containers/test_custom_containers.py::test_upload_charts_via_helm_upload

Expected Behavior
------------------
no unexpected swact

Actual Behavior
----------------
unexpected swact occurred

Reproducibility
---------------
Seen once

System Configuration
--------------------
Two node system

Lab-name: WCP_76-77

Branch/Pull Time/Commit
-----------------------
stx master as of 2019-09-05_00-10-00

Last Pass
---------
2019-09-03_00-10-00

Timestamp/Logs
--------------
[2019-09-05 17:54:19,018] 301 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.204.2:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne host-swact controller-1'

[2019-09-05 17:58:02,286] 301 DEBUG MainThread ssh.send :: Send 'fm --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.204.2:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne alarm-list --nowrap --uuid'
[2019-09-05 17:58:03,661] 423 DEBUG MainThread ssh.expect :: Output:
+--------------------------------------+----------+---------------------------------------------------------------------------------------------+-------------------------------------------------------------+----------+----------------------------+
| UUID | Alarm ID | Reason Text | Entity ID | Severity | Time Stamp |
+--------------------------------------+----------+---------------------------------------------------------------------------------------------+-------------------------------------------------------------+----------+----------------------------+
| e1cdca69-48f8-47ab-9b19-075980764dc1 | 400.002 | Service group controller-services has no active members available; expected 1 active member | service_domain=controller.service_group=controller-services | critical | 2019-09-05T17:55:09.620802 |
| 3bd10bb5-d210-446e-9429-2d1be70efa58 | 100.114 | NTP address 2607:5300:201:3 is not a valid or a reachable NTP server. | host=controller-1.ntp=2607:5300:201:3 | minor | 2019-09-05T17:41:01.187069 |
| 2c68c9ae-cf55-494a-a449-76b80445a288 | 100.114 | NTP address 199.19.167.36 is not a valid or a reachable NTP server. | host=controller-1.ntp=199.19.167.36 | minor | 2019-09-05T17:41:01.172656 |
+--------------------------------------+----------+---------------------------------------------------------------------------------------------+-------------------------------------------------------------+----------+----------------------------+
controller-0:~$
[2019-09-05 17:58:03,661] 301 DEBUG MainThread ssh.send :: Send 'echo $?'
[2019-09-05 17:58:03,764] 423 DEBUG MainThread ssh.expect :: Output:
0
controller-0:~$
[2019-09-05 17:58:08,769] 1534 DEBUG MainThread ssh.get_active_controller:: Getting active controller client for wcp_76_77
[2019-09-05 17:58:08,770] 466 DEBUG MainThread ssh.exec_cmd:: Executing command...
[2019-09-05 17:58:08,770] 301 DEBUG MainThread ssh.send :: Send 'fm --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.204.2:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne alarm-list --nowrap --uuid'
[2019-09-05 17:59:26,256] 387 WARNING MainThread ssh.expect :: EOF caught.
[2019-09-05 17:59:26,785] 60 DEBUG MainThread conftest.update_results:: ***Failure at test teardown: /usr/local/lib/python3.4/site-packages/pexpect/expect.py:49: pexpect.exceptions.EOF: End Of File (EOF). Exception style platform.
***Details: self = <pexpect.pxssh.pxssh object at 0x7f0e2669d278>, size = 100000

[2019-09-05 17:59:30,537] 151 INFO MainThread ssh.connect :: Attempt to connect to host - 128.224.150.5
[2019-09-05 17:59:32,033] 301 DEBUG MainThread ssh.send :: Send ''
[2019-09-05 17:59:32,141] 423 DEBUG MainThread ssh.expect :: Output:
controller-1:~$
[2019-09-05 17:59:32,141] 183 INFO MainThread ssh.connect :: Login successful!

[2019-09-05 17:59:38,877] 301 DEBUG MainThread ssh.send :: Send 'fm --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.204.2:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne alarm-list --nowrap --uuid'
[2019-09-05 17:59:41,688] 423 DEBUG MainThread ssh.expect :: Output:
+--------------------------------------+----------+------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------+----------+----------------------------+
| UUID | Alarm ID | Reason Text | Entity ID | Severity | Time Stamp |
+--------------------------------------+----------+------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------+----------+----------------------------+
| 741c611c-52d9-41e3-ba41-e24e3aba7f58 | 400.002 | Service group controller-services loss of redundancy; expected 1 standby member but no standby members available | service_domain=controller.service_group=controller-services | major | 2019-09-05T17:59:34.555618 |
| 2ef8b4d6-721c-48d7-a7b4-8d564f8f6462 | 400.002 | Service group cloud-services loss of redundancy; expected 1 standby member but no standby members available | service_domain=controller.service_group=cloud-services | major | 2019-09-05T17:59:31.553536 |
| 9c08c0ca-4a96-4be1-8bac-25e6d2d358ba | 400.002 | Service group oam-services loss of redundancy; expected 1 standby member but no standby members available | service_domain=controller.service_group=oam-services | major | 2019-09-05T17:59:29.727512 |
| 4d5a0a49-e80b-4fc1-bb1d-51ee1ab7aba1 | 400.002 | Service group vim-services loss of redundancy; expected 1 standby member but no standby members available | service_domain=controller.service_group=vim-services | major | 2019-09-05T17:59:29.710517 |
| fe1ef010-18c7-4d24-adb4-40b17bf7e98b | 400.001 | Service group controller-services failure; ceph-mon(disabled, ) | service_domain=controller.service_group=controller-services.host=controller-0 | critical | 2019-09-05T17:59:29.680300 |
| 3bd10bb5-d210-446e-9429-2d1be70efa58 | 100.114 | NTP address 2607:5300:201:3 is not a valid or a reachable NTP server. | host=controller-1.ntp=2607:5300:201:3 | minor | 2019-09-05T17:41:01.187069 |
| 2c68c9ae-cf55-494a-a449-76b80445a288 | 100.114 | NTP address 199.19.167.36 is not a valid or a reachable NTP server. | host=controller-1.ntp=199.19.167.36 | minor | 2019-09-05T17:41:01.172656 |
+--------------------------------------+----------+------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------+----------+----------------------------+
controller-1:~$

Test Activity
-------------
Sanity

Revision history for this message
Peng Peng (ppeng) wrote :
summary: - unexpected swact occured after host-swact completed
+ unexpected swact occurred after host-swact completed
Revision history for this message
Bin Qian (bqian20) wrote : Re: unexpected swact occurred after host-swact completed

Ceph-mon failed on the newly active controller, causing a swact rollback.
| 2019-09-05T17:55:12.677 | 565 | service-scn | hw-mon | enabling | enabled-active | enable success
| 2019-09-05T17:56:02.854 | 566 | service-scn | ceph-mon | enabling | disabling | enable failed
| 2019-09-05T17:58:03.580 | 567 | service-scn | ceph-mon | disabling | disabling-failed | disable failed
| 2019-09-05T17:58:03.581 | 568 | service-group-scn | controller-services | go-active | go-active-failed | ceph-mon(disabling, failed)
| 2019-09-05T17:58:04.477 | 569 | service-scn | ceph-mon | disabling-failed | enabling-failed | enabled-active state requested
| 2019-09-05T17:58:07.793 | 570 | service-group-scn | vim-services | active | disabling |
| 2019-09-05T17:58:07.794 | 571 | service-group-scn | cloud-services | active | disabling |
| 2019-09-05T17:58:07.795 | 572 | service-group-scn | controller-services | go-active-failed | disabling-failed | ceph-mon(enabling, failed)
| 2019-09-05T17:58:07.796 | 573 | service-group-scn | oam-services | active | disabling |

2019-09-05T17:55:02.000 controller-0 sm: debug time[7529.198] log<1426> INFO: sm[107278]: sm_service_enable.c(461): Started enable action (1008359) for service (ceph-mon).
2019-09-05T17:55:02.000 controller-0 sm: debug time[7529.198] log<1427> INFO: sm[107278]: sm_service_fsm.c(1032): Service (ceph-mon) received event (enable-throttle) was in the enabling-throttle state and is now in the enabling state.
2019-09-05T17:55:32.000 controller-0 sm: debug time[7559.216] log<1572> INFO: sm[107278]: sm_service_enable.c(245): Action (enable) timeout with result (failed), state (unknown), status (unknown), and condition (unknown) for service (ceph-mon), reason_text=, exit_code=-65534.
2019-09-05T17:55:32.000 controller-0 sm: debug time[7559.216] log<1573> INFO: sm[107278]: sm_service_action.c(345): Aborting service (ceph-mon) with kill signal, pid=1008359.
2019-09-05T17:55:32.000 controller-0 sm: debug time[7559.216] log<1574> INFO: sm[107278]: sm_service_enable.c(154): Max retires not met for action (enable) of service (ceph-mon), attempts=1.
2019-09-05T17:55:32.000 controller-0 sm: debug time[7559.216] log<1575> INFO: sm[107278]: sm_service_enable.c(461): Started enable action (1013901) for service (ceph-mon).
2019-09-05T17:56:02.000 controller-0 sm: debug time[7589.223] log<1577> INFO: sm[107278]: sm_service_enable.c(245): Action (enable) timeout with result (failed), state (unknown), status (unknown), and condition (unknown) for service (ceph-mon), reason_text=, ...
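The enable timeout/retry behavior visible in the sm logs above can be sketched roughly as follows. This is an assumed model, not the real sm C code: each enable action runs under a deadline, and on failure it is retried up to a maximum attempt count before the service is declared enable-failed.

```shell
#!/bin/sh
# Sketch of sm's enable action loop as observed in the logs (assumption,
# not the actual implementation): run the action under a deadline, kill
# and retry on timeout, give up after the attempt limit.
MAX_ATTEMPTS=2
DEADLINE=30   # seconds; the logs show ~30s between enable attempts

enable_service() {
    service="$1"; action="$2"
    attempt=1
    while [ "$attempt" -le "$MAX_ATTEMPTS" ]; do
        echo "Started enable action for service ($service), attempt=$attempt"
        if timeout "$DEADLINE" "$action"; then
            echo "Service ($service) enabled"
            return 0
        fi
        echo "Action (enable) failed for service ($service)"
        attempt=$((attempt + 1))
    done
    return 1
}
```

In the logs above, both attempts for ceph-mon hit the 30-second deadline, which is why the service group transitioned to go-active-failed.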


Revision history for this message
Ghada Khalil (gkhalil) wrote :

Sounds the same as https://bugs.launchpad.net/starlingx/+bug/1840176
Assigning to the developer to confirm and mark as duplicate if appropriate.

summary: - unexpected swact occurred after host-swact completed
+ unexpected swact occurred after host-swact completed due to ceph-mon
+ issue
tags: added: stx.3.0 stx.storage
tags: added: stx.retestneeded
Changed in starlingx:
importance: Undecided → Medium
status: New → Triaged
assignee: nobody → Daniel Badea (daniel.badea)
Revision history for this message
Daniel Badea (daniel.badea) wrote :

Found the same log "signature" present in https://bugs.launchpad.net/starlingx/+bug/1840176
(ceph start mon command issued repeatedly with no feedback about current status)

controller-0:

2019-09-05 17:55:02.841 /etc/init.d/ceph-init-wrapper - INFO: Ceph START mon command received
2019-09-05 17:55:26.510 /etc/init.d/ceph mgr.controller-0 WARN: /var/lib/ceph/mgr/ceph-controller-0/sysvinit file is missing
2019-09-05 17:55:26.830 /etc/init.d/ceph mon.controller INFO: Process STARTED successfully, waiting for it to become OPERATIONAL
2019-09-05 17:55:32.853 /etc/init.d/ceph-init-wrapper - INFO: Ceph START mon command received
2019-09-05 17:56:02.861 /etc/init.d/ceph-init-wrapper - INFO: Ceph STOP mon command received.
2019-09-05 17:56:32.879 /etc/init.d/ceph-init-wrapper - INFO: Ceph STOP mon command received.
2019-09-05 17:57:03.203 /etc/init.d/ceph-init-wrapper - INFO: Ceph STOP mon command received.
2019-09-05 17:57:33.576 /etc/init.d/ceph-init-wrapper - INFO: Ceph STOP mon command received.

The in-progress swact is cancelled because there is no indication of ceph-mon having started (although it is running):

2019-09-05T17:54:38.186 | 428 | node-scn | controller-0 | | swact | issued against host controller-1
2019-09-05T17:55:01.482 | 433 | service-group-scn | controller-services | standby | go-active |
2019-09-05T17:55:01.828 | 439 | service-scn | ceph-mon | disabled | enabling-throttle | enabled-active state requested
2018-09-05 17:55:26.897 7f7291556140 0 ceph version 13.2.2 (00071a2d9e839b95f9439daaccd4677c5d15eaa6) mimic (stable), process ceph-mon, pid 1012721
2019-09-05T17:56:02.854 | 566 | service-scn | ceph-mon | enabling | disabling | enable failed
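The missing guard described above (repeated START commands with no status feedback) could look like this minimal sketch. All function names here are hypothetical, and this is not the actual fix from ceph-init-wrapper; it only illustrates the idea of reporting an already-running monitor as success instead of blindly starting it again.

```shell
#!/bin/sh
# Hypothetical init-wrapper guard (illustration only): if the monitor is
# already up, report success so the caller's enable action can complete
# instead of timing out and re-issuing START every 30 seconds.

MON_PID=""

is_mon_running() {
    # Hypothetical probe; a real wrapper might check a pid file or ask
    # the daemon itself for its status.
    [ -n "$MON_PID" ]
}

start_mon() {
    if is_mon_running; then
        echo "mon already running"
        return 0
    fi
    MON_PID=1234   # stand-in for actually forking ceph-mon
    echo "mon started"
}

start_mon   # first request: starts the daemon
start_mon   # repeated request: reports running, does not restart
```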

Duplicate of #1840176

Revision history for this message
Ghada Khalil (gkhalil) wrote :

Duplicate bug has been addressed via: https://review.opendev.org/678474
Merged on 2019-09-09

Marking this as Fix Released as well

Changed in starlingx:
status: Triaged → Fix Released
Peng Peng (ppeng)
tags: removed: stx.retestneeded