After host-swact completed, unexpected swact occurred due to ceph error

Bug #1840176 reported by Peng Peng
This bug affects 1 person
Affects: StarlingX
Status: Fix Released
Importance: Medium
Assigned to: Daniel Badea

Bug Description

Brief Description
-----------------
Five minutes after host-swact completed, an unexpected swact occurred, and alarm-list showed alarm 270.001 "Host controller-1 compute services failure".

Severity
--------
Major

Steps to Reproduce
------------------
Perform a host-swact
Run some read-only commands

TC-name: mtc/test_swact.py::test_swact_controller_platform

Expected Behavior
------------------
No unexpected swact occurs.

Actual Behavior
----------------
An unexpected swact occurred.

Reproducibility
---------------
Seen once

System Configuration
--------------------
Two node system

Lab-name: IP_5-6

Branch/Pull Time/Commit
-----------------------
stx master as of 2019-08-13_20-59-00

Last Pass
---------
Lab: WCP_63_66
Load: 2019-08-13_20-59-00

Timestamp/Logs
--------------
[2019-08-14 08:52:10,857] 301 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.204.2:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne host-swact controller-0'

[sysadmin@controller-0 ~(keystone_admin)]$
[2019-08-14 08:52:35,441] 301 DEBUG MainThread ssh.send :: Send ''
[2019-08-14 08:52:38,545] 275 INFO MainThread ssh.wait_for_disconnect:: ssh session to 128.224.151.216 disconnected
[2019-08-14 08:52:38,545] 1564 INFO MainThread host_helper.wait_for_swact_complete:: ssh to 128.224.151.216 OAM floating IP disconnected, indicating swact initiated.
[2019-08-14 08:53:08,575] 151 INFO MainThread ssh.connect :: Attempt to connect to host - 128.224.151.216
[2019-08-14 08:53:09,956] 301 DEBUG MainThread ssh.send :: Send ''
[2019-08-14 08:53:10,059] 423 DEBUG MainThread ssh.expect :: Output:
controller-1:~$

controller-1:~$
[2019-08-14 08:56:09,694] 466 DEBUG MainThread ssh.exec_cmd:: Executing command...
[2019-08-14 08:56:09,695] 301 DEBUG MainThread ssh.send :: Send 'stat /etc/platform/.task_affining_incomplete'
[2019-08-14 08:57:09,812] 394 WARNING MainThread ssh.expect :: No match found for ['.*controller\\-[01][:| ].*\\$ '].
expect timeout.

[2019-08-14 09:01:50,420] 793 DEBUG MainThread ssh.close :: connection closed. host: 128.224.151.216, user: sysadmin. Object ID: 140532888100312
[2019-08-14 09:01:50,420] 248 DEBUG MainThread ssh.connect :: Retry in 10 seconds
[2019-08-14 09:02:01,700] 301 DEBUG MainThread ssh.send :: Send ''
[2019-08-14 09:02:01,803] 423 DEBUG MainThread ssh.expect :: Output:
controller-0:~$

[2019-08-14 09:02:16,375] 301 DEBUG MainThread ssh.send :: Send 'fm --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.204.2:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne alarm-list --nowrap --uuid'
[2019-08-14 09:02:17,752] 423 DEBUG MainThread ssh.expect :: Output:
+--------------------------------------+----------+-----------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------+----------+----------------------------+
| UUID | Alarm ID | Reason Text | Entity ID | Severity | Time Stamp |
+--------------------------------------+----------+-----------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------+----------+----------------------------+
| f900135a-72ca-42b1-a8dc-00a2c76e52e7 | 270.001 | Host controller-1 compute services failure, | host=controller-1.services=compute | critical | 2019-08-14T09:02:12.648454 |
| 752dc3ff-4e14-4acc-8482-b4de9c90cd03 | 200.004 | controller-1 experienced a service-affecting failure. Auto-recovery in progress. Manual Lock and Unlock may be required if auto-recovery is unsuccessful. | host=controller-1 | critical | 2019-08-14T09:02:08.649239 |
| a8b4dc4e-429c-4650-86ca-bd542b56aed6 | 400.002 | Service group vim-services loss of redundancy; expected 1 standby member but no standby members available | service_domain=controller.service_group=vim-services | major | 2019-08-14T09:02:01.418293 |
| a6c0928c-b82f-4145-bab6-329dcd84a2a3 | 400.002 | Service group directory-services loss of redundancy; expected 2 active members but only 1 active member available | service_domain=controller.service_group=directory-services | major | 2019-08-14T09:02:01.337270 |
| c06d1156-8220-4b84-82a5-5e23798d3154 | 400.002 | Service group web-services loss of redundancy; expected 2 active members but only 1 active member available | service_domain=controller.service_group=web-services | major | 2019-08-14T09:02:01.256242 |
| 6edce745-1622-4afc-b64a-26ecbbf08faf | 400.002 | Service group storage-services loss of redundancy; expected 2 active members but only 1 active member available | service_domain=controller.service_group=storage-services | major | 2019-08-14T09:02:01.175300 |
| d88e327a-861a-425a-81cc-b78cc2a2b994 | 400.002 | Service group cloud-services loss of redundancy; expected 1 standby member but no standby members available | service_domain=controller.service_group=cloud-services | major | 2019-08-14T09:02:00.501254 |
| 700abbc8-c44f-495a-b08a-ad655d23982a | 400.002 | Service group patching-services loss of redundancy; expected 1 standby member but no standby members available | service_domain=controller.service_group=patching-services | major | 2019-08-14T09:02:00.420246 |
| fe6b6b82-eb2d-4919-9673-86bcd39e91cc | 400.002 | Service group storage-monitoring-services loss of redundancy; expected 1 standby member but no standby members available | service_domain=controller.service_group=storage-monitoring-services | major | 2019-08-14T09:02:00.096293 |
| 0bac73c7-0207-4140-a382-6011c8c68dc7 | 400.005 | Communication failure detected with peer over port enp11s0f1 on host controller-0 | host=controller-0.network=mgmt | major | 2019-08-14T09:02:00.015249 |
| 2b2b2bf2-47ae-4b75-8bbc-81815347613e | 400.005 | Communication failure detected with peer over port enp11s0f0 on host controller-0 | host=controller-0.network=oam | major | 2019-08-14T09:01:59.934269 |
| c570309c-2e7e-48ef-aac6-69ea24d7980f | 400.005 | Communication failure detected with peer over port ens3f0 on host controller-0 | host=controller-0.network=cluster-host | major | 2019-08-14T09:01:59.853257 |
| 001ca8f1-fee3-4f3d-ba78-7b935518a8ea | 400.002 | Service group controller-services has no active members available; expected 1 active member | service_domain=controller.service_group=controller-services | critical | 2019-08-14T09:01:59.611279 |
| 74ec6293-a70a-4c24-97e5-1fb5bde6549a | 400.002 | Service group oam-services loss of redundancy; expected 1 standby member but no standby members available | service_domain=controller.service_group=oam-services | major | 2019-08-14T09:01:59.449250 |
| 113a5ed5-3971-43c1-b5c4-7f084e17c50d | 100.114 | NTP address 2607:5300:60:97 is not a valid or a reachable NTP server. | host=controller-1.ntp=2607:5300:60:97 | minor | 2019-08-14T08:39:57.707422 |
+--------------------------------------+----------+-----------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------+----------+----------------------------+
controller-0:~$

Test Activity
-------------
Sanity

Revision history for this message
Peng Peng (ppeng) wrote :
Revision history for this message
Ghada Khalil (gkhalil) wrote :

Assigning to Bin to triage before deciding on priority / release gate.

tags: added: stx.ha stx.sanity
Changed in starlingx:
assignee: nobody → Bin Qian (bqian20)
Numan Waheed (nwaheed)
tags: added: stx.retestneeded
Revision history for this message
Bin Qian (bqian20) wrote :

Logs show that controller-1 was shut down due to a fatal ceph-mon failure (exceeded max transition failures). Controller-0 then went active, as expected.

| 2019-08-14T08:54:00.609 | 501 | service-scn | ceph-mon | enabling | disabling | enable failed
| 2019-08-14T08:56:01.831 | 502 | service-scn | ceph-mon | disabling | disabling-failed | disable failed
| 2019-08-14T08:56:01.831 | 503 | service-group-scn | controller-services | go-active | go-active-failed | ceph-mon(disabling, failed)
| 2019-08-14T08:56:02.256 | 504 | service-scn | ceph-mon | disabling-failed | enabling-failed | enabled-active state requested
| 2019-08-14T08:56:05.362 | 505 | service-group-scn | vim-services | active | disabling |
| 2019-08-14T08:56:05.363 | 506 | service-group-scn | cloud-services | active | disabling |
| 2019-08-14T08:56:05.364 | 507 | service-group-scn | controller-services | go-active-failed | disabling-failed | ceph-mon(enabling, failed)

2019-08-14T09:01:13.000 controller-1 sm: debug time[1513.951] log<1946> INFO: sm[97172]: sm_service_audit.c(364): Service (dbmon) already running audit-disabled action.
2019-08-14T09:01:13.000 controller-1 sm: debug time[1514.051] log<1947> INFO: sm[97172]: sm_service_engine.c(157): Service (ceph-mon) has had a fatal failure and is unrecoverable.
2019-08-14T09:01:14.000 controller-1 sm: debug time[1514.555] log<1948> INFO: sm[97172]: sm_service_engine.c(157): Service (ceph-mon) has had a fatal failure and is unrecoverable.
2019-08-14T09:01:14.000 controller-1 sm: debug time[1515.056] log<1949> INFO: sm[97172]: sm_service_engine.c(157): Service (ceph-mon) has had a fatal failure and is unrecoverable.
2019-08-14T09:01:15.000 controller-1 sm: debug time[1515.558] log<1950> INFO: sm[97172]: sm_service_engine.c(157): Service (ceph-mon) has had a fatal failure and is unrecoverable.
2019-08-14T09:01:15.000 controller-1 sm: debug time[1515.917] log<1951> INFO: sm[97172]: sm_service_group_api.c(212): Service group (controller-services) recovery from fatal condition escalated to a reboot.
2019-08-14T09:01:15.000 controller-1 sm: debug time[1515.917] log<1952> INFO: sm[97172]: sm_node_api.cpp(964): Reboot of controller-1 requested, reason=service group (controller-services) recovery from fatal condition escalated to a reboot..
2019-08-14T09:01:15.000 controller-1 sm: debug time[1515.918] log<1953> INFO: sm[97172]: sm_troubleshoot.c(150): Troubleshoot process (401470) created.

ceph-mon errors:
2019-08-14T09:00:01.762 controller-1 OCF_drbd(drbd-cephmon:warning 1)[380462]: WARNING: drbd-cephmon still Primary, demoting.
2019-08-14T09:00:01.768 controller-1 OCF_drbd(drbd-cephmon:warning 1)[380462]: WARNING: drbd-cephmon is still m...


Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nfv (master)

Reviewed: https://review.opendev.org/676689
Committed: https://git.openstack.org/cgit/starlingx/nfv/commit/?id=f98b388a745f95737bd4c05c24d59aed8b5e3699
Submitter: Zuul
Branch: master

commit f98b388a745f95737bd4c05c24d59aed8b5e3699
Author: Bart Wensley <email address hidden>
Date: Thu Aug 15 07:41:01 2019 -0500

    Correct VIM host audit criteria for failing instances

    The VIM's host audit will fail instances on any host that is
    "not enabled". That includes hosts where the operational state
    is unknown.

    Updating the check to ensure the host is "disabled" not that it
    is "not enabled" to avoid failing instances on a host where we
    don't know the operational state.

    Change-Id: I68d3e9f63695de721c10fb1dd2b7ac5917cb50fa
    Closes-Bug: 1840176
    Signed-off-by: Bart Wensley <email address hidden>
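The criteria change described in the commit message above can be sketched as follows. This is an illustrative reconstruction, not the actual nfv-vim code: the function names and state constants are hypothetical, but the logic mirrors the described change from "not enabled" to "disabled".

```python
# Hypothetical sketch of the VIM host-audit criteria change.
# Names (OPER_*, should_fail_instances_*) are illustrative, not nfv-vim API.

OPER_ENABLED = "enabled"
OPER_DISABLED = "disabled"
OPER_UNKNOWN = "unknown"


def should_fail_instances_old(oper_state: str) -> bool:
    # Old check: fail instances on any host that is "not enabled".
    # This also matches hosts whose operational state is unknown.
    return oper_state != OPER_ENABLED


def should_fail_instances_new(oper_state: str) -> bool:
    # New check: only fail instances when the host is known to be disabled,
    # so a host with an unknown operational state is left alone.
    return oper_state == OPER_DISABLED
```

Under the old check, a host whose state was merely unknown (e.g. mid-swact) would have its instances failed; under the new check it would not.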

Changed in starlingx:
status: New → Fix Released
Revision history for this message
Bart Wensley (bartwensley) wrote : Re: After host-swact completed, unexpected swact occurred

Whoops - the above commit referenced the wrong bug. Please ignore.

Dariush Eslimi (deslimi)
Changed in starlingx:
status: Fix Released → In Progress
Revision history for this message
Ghada Khalil (gkhalil) wrote :

Marking as stx.3.0 / medium priority -- appears to be a single occurrence due to a ceph failure

summary: - After host-swact completed, unexpected swact occurred
+ After host-swact completed, unexpected swact occurred due to ceph error
tags: added: stx.3.0 stx.storage
removed: stx.ha
Changed in starlingx:
importance: Undecided → Medium
assignee: Bin Qian (bqian20) → Daniel Badea (daniel.badea)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to integ (master)

Fix proposed to branch: master
Review: https://review.opendev.org/678474

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to integ (master)

Reviewed: https://review.opendev.org/678474
Committed: https://git.openstack.org/cgit/starlingx/integ/commit/?id=9faad457030dc7ffd88f2e4106b01157c9b9a0ed
Submitter: Zuul
Branch: master

commit 9faad457030dc7ffd88f2e4106b01157c9b9a0ed
Author: Daniel Badea <email address hidden>
Date: Mon Aug 26 06:59:28 2019 +0000

    ceph-init-wrapper use flock instead of flag files

    When swact occurs and ceph-init-wrapper is slow to respond
    to a status request it gets killed by SM. This means the
    corresponding flag file that marks status in progress is left
    behind.

    When controller swacts back ceph-init-wrapper sees status
    in progress and waits for it to finish (with a timeout).
    Because it does not respond fast enough SM tries to start
    again ceph-init-wrapper to get ceph-mon service up and running.

    This happens a couple of times until the service is declared
    failed and controller swacts back.

    To fix this we need to use flock instead of flag files as the
    locks will be automatically released by the OS when process
    is killed.

    Change-Id: If1912e8575258a4f79321d8435c8ae1b96b78b98
    Closes-bug: 1840176
    Signed-off-by: Daniel Badea <email address hidden>
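The reasoning in the commit message above (flock is released by the kernel when the holding process dies, whereas a flag file left on disk survives a SIGKILL) can be demonstrated with a minimal sketch. This is not the ceph-init-wrapper code (which is a shell script using flock(1)); the helper name and lock path are illustrative.

```python
# Minimal sketch of flock-based status locking. Unlike a flag file,
# the lock is tied to the open file descriptor: if the process holding
# it is killed, the OS releases the lock automatically.
import fcntl
import os


def try_lock(path):
    """Try to take an exclusive, non-blocking flock on path.

    Returns the open fd on success (caller keeps it open to hold the
    lock), or None if another holder already has the lock.
    """
    fd = os.open(path, os.O_CREAT | os.O_RDWR)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        return fd
    except BlockingIOError:
        # Lock is held elsewhere; close our fd and report failure.
        os.close(fd)
        return None
```

Closing the fd (or the process dying) releases the lock, so a killed status check can never leave a stale "in progress" marker behind the way a flag file can.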

Changed in starlingx:
status: In Progress → Fix Released
Revision history for this message
Peng Peng (ppeng) wrote :

Issue was reproduced on
Lab: WCP_76_77
Load: 2019-09-09_00-10-00
Job: Titanium_R6_build

[2019-09-10 08:22:07,526] 301 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.204.2:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne host-swact controller-1'

[sysadmin@controller-1 ~(keystone_admin)]$
[2019-09-10 08:22:31,455] 301 DEBUG MainThread ssh.send :: Send ''
[2019-09-10 08:22:34,559] 275 INFO MainThread ssh.wait_for_disconnect:: ssh session to 128.224.150.5 disconnected
[2019-09-10 08:22:34,560] 1572 INFO MainThread host_helper.wait_for_swact_complete:: ssh to 128.224.150.5 OAM floating IP disconnected, indicating swact initiated.
[2019-09-10 08:23:04,590] 151 INFO MainThread ssh.connect :: Attempt to connect to host - 128.224.150.5
[2019-09-10 08:23:06,099] 301 DEBUG MainThread ssh.send :: Send ''
[2019-09-10 08:23:06,204] 423 DEBUG MainThread ssh.expect :: Output:
controller-0:~$

[2019-09-10 08:25:51,734] 301 DEBUG MainThread ssh.send :: Send 'fm --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.204.2:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne alarm-list --nowrap --uuid'
[2019-09-10 08:25:53,012] 423 DEBUG MainThread ssh.expect :: Output:
+--------------------------------------+----------+---------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------+----------+----------------------------+
| UUID | Alarm ID | Reason Text | Entity ID | Severity | Time Stamp |
+--------------------------------------+----------+---------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------+----------+----------------------------+
| 32d95c7d-f2d7-4c51-917c-9a122b6a6923 | 400.001 | Service group controller-services failure; ceph-mon(disabling, failed) | service_domain=controller.service_group=controller-services.host=controller-0 | critical | 2019-09-10T08:25:51.606031 |
| 2177bd57-2a1c-43fd-b54f-41931c6221bc | 400.002 | Service group controller-services has no active members available; expected 1 active member | service_domain=controller.service_group=controller-services | critical | 2019-09-10T08:22:30.015971 |
| 4c643552-1006-48a2-8234-d6d6d3f3fbf7 | 100.114 | NTP address 2607:5300:201:3 is not a valid or a reachable NTP server. | host=controller-1.ntp=2607:5300:201:3 | minor | 2019-09-10T08:08:12.089195 |
+--------------------------------------+----------+--------------------------------------------------...


Revision history for this message
Daniel Badea (daniel.badea) wrote :

Load does not include the fix.
Proof:

$ grep -nHrIm1 CEPH_RESTARTING_FILE ALL_NODES_20190910.133325 2>&-
ALL_NODES_20190910.133325/controller-0_20190910.133325/etc/rc.d/init.d/ceph-init-wrapper:42:CEPH_RESTARTING_FILE="$VOLATILE_PATH/.ceph_restarting"
ALL_NODES_20190910.133325/controller-0_20190910.133325/etc/init.d/ceph-init-wrapper:42:CEPH_RESTARTING_FILE="$VOLATILE_PATH/.ceph_restarting"
ALL_NODES_20190910.133325/controller-1_20190910.133325/etc/rc.d/init.d/ceph-init-wrapper:42:CEPH_RESTARTING_FILE="$VOLATILE_PATH/.ceph_restarting"
ALL_NODES_20190910.133325/controller-1_20190910.133325/etc/init.d/ceph-init-wrapper:42:CEPH_RESTARTING_FILE="$VOLATILE_PATH/.ceph_restarting"

But CEPH_RESTARTING_FILE was removed by commit:
https://opendev.org/starlingx/integ/commit/9faad457030dc7ffd88f2e4106b01157c9b9a0ed?style=split

Revision history for this message
Peng Peng (ppeng) wrote :

Not seeing this issue recently.

tags: removed: stx.retestneeded