After host-swact completed, unexpected swact occurred due to ceph error

Bug #1840176 reported by Peng Peng
This bug affects 1 person
Affects: StarlingX
Status: Fix Released
Importance: Medium
Assigned to: Daniel Badea

Bug Description

Brief Description
-----------------
Five minutes after host-swact completed, an unexpected swact occurred, and alarm-list showed alarm 270.001 "Host controller-1 compute services failure".

Severity
--------
Major

Steps to Reproduce
------------------
Perform a host-swact
Run some read-only commands

TC-name: mtc/test_swact.py::test_swact_controller_platform

Expected Behavior
------------------
No unexpected swact occurs.

Actual Behavior
----------------
An unexpected swact occurred.

Reproducibility
---------------
Seen once

System Configuration
--------------------
Two node system

Lab-name: IP_5-6

Branch/Pull Time/Commit
-----------------------
stx master as of 2019-08-13_20-59-00

Last Pass
---------
Lab: WCP_63_66
Load: 2019-08-13_20-59-00

Timestamp/Logs
--------------
[2019-08-14 08:52:10,857] 301 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.204.2:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne host-swact controller-0'

[sysadmin@controller-0 ~(keystone_admin)]$
[2019-08-14 08:52:35,441] 301 DEBUG MainThread ssh.send :: Send ''
[2019-08-14 08:52:38,545] 275 INFO MainThread ssh.wait_for_disconnect:: ssh session to 128.224.151.216 disconnected
[2019-08-14 08:52:38,545] 1564 INFO MainThread host_helper.wait_for_swact_complete:: ssh to 128.224.151.216 OAM floating IP disconnected, indicating swact initiated.
[2019-08-14 08:53:08,575] 151 INFO MainThread ssh.connect :: Attempt to connect to host - 128.224.151.216
[2019-08-14 08:53:09,956] 301 DEBUG MainThread ssh.send :: Send ''
[2019-08-14 08:53:10,059] 423 DEBUG MainThread ssh.expect :: Output:
controller-1:~$

controller-1:~$
[2019-08-14 08:56:09,694] 466 DEBUG MainThread ssh.exec_cmd:: Executing command...
[2019-08-14 08:56:09,695] 301 DEBUG MainThread ssh.send :: Send 'stat /etc/platform/.task_affining_incomplete'
[2019-08-14 08:57:09,812] 394 WARNING MainThread ssh.expect :: No match found for ['.*controller\\-[01][:| ].*\\$ '].
expect timeout.

[2019-08-14 09:01:50,420] 793 DEBUG MainThread ssh.close :: connection closed. host: 128.224.151.216, user: sysadmin. Object ID: 140532888100312
[2019-08-14 09:01:50,420] 248 DEBUG MainThread ssh.connect :: Retry in 10 seconds
[2019-08-14 09:02:01,700] 301 DEBUG MainThread ssh.send :: Send ''
[2019-08-14 09:02:01,803] 423 DEBUG MainThread ssh.expect :: Output:
controller-0:~$

[2019-08-14 09:02:16,375] 301 DEBUG MainThread ssh.send :: Send 'fm --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.204.2:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne alarm-list --nowrap --uuid'
[2019-08-14 09:02:17,752] 423 DEBUG MainThread ssh.expect :: Output:
+--------------------------------------+----------+-----------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------+----------+----------------------------+
| UUID | Alarm ID | Reason Text | Entity ID | Severity | Time Stamp |
+--------------------------------------+----------+-----------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------+----------+----------------------------+
| f900135a-72ca-42b1-a8dc-00a2c76e52e7 | 270.001 | Host controller-1 compute services failure, | host=controller-1.services=compute | critical | 2019-08-14T09:02:12.648454 |
| 752dc3ff-4e14-4acc-8482-b4de9c90cd03 | 200.004 | controller-1 experienced a service-affecting failure. Auto-recovery in progress. Manual Lock and Unlock may be required if auto-recovery is unsuccessful. | host=controller-1 | critical | 2019-08-14T09:02:08.649239 |
| a8b4dc4e-429c-4650-86ca-bd542b56aed6 | 400.002 | Service group vim-services loss of redundancy; expected 1 standby member but no standby members available | service_domain=controller.service_group=vim-services | major | 2019-08-14T09:02:01.418293 |
| a6c0928c-b82f-4145-bab6-329dcd84a2a3 | 400.002 | Service group directory-services loss of redundancy; expected 2 active members but only 1 active member available | service_domain=controller.service_group=directory-services | major | 2019-08-14T09:02:01.337270 |
| c06d1156-8220-4b84-82a5-5e23798d3154 | 400.002 | Service group web-services loss of redundancy; expected 2 active members but only 1 active member available | service_domain=controller.service_group=web-services | major | 2019-08-14T09:02:01.256242 |
| 6edce745-1622-4afc-b64a-26ecbbf08faf | 400.002 | Service group storage-services loss of redundancy; expected 2 active members but only 1 active member available | service_domain=controller.service_group=storage-services | major | 2019-08-14T09:02:01.175300 |
| d88e327a-861a-425a-81cc-b78cc2a2b994 | 400.002 | Service group cloud-services loss of redundancy; expected 1 standby member but no standby members available | service_domain=controller.service_group=cloud-services | major | 2019-08-14T09:02:00.501254 |
| 700abbc8-c44f-495a-b08a-ad655d23982a | 400.002 | Service group patching-services loss of redundancy; expected 1 standby member but no standby members available | service_domain=controller.service_group=patching-services | major | 2019-08-14T09:02:00.420246 |
| fe6b6b82-eb2d-4919-9673-86bcd39e91cc | 400.002 | Service group storage-monitoring-services loss of redundancy; expected 1 standby member but no standby members available | service_domain=controller.service_group=storage-monitoring-services | major | 2019-08-14T09:02:00.096293 |
| 0bac73c7-0207-4140-a382-6011c8c68dc7 | 400.005 | Communication failure detected with peer over port enp11s0f1 on host controller-0 | host=controller-0.network=mgmt | major | 2019-08-14T09:02:00.015249 |
| 2b2b2bf2-47ae-4b75-8bbc-81815347613e | 400.005 | Communication failure detected with peer over port enp11s0f0 on host controller-0 | host=controller-0.network=oam | major | 2019-08-14T09:01:59.934269 |
| c570309c-2e7e-48ef-aac6-69ea24d7980f | 400.005 | Communication failure detected with peer over port ens3f0 on host controller-0 | host=controller-0.network=cluster-host | major | 2019-08-14T09:01:59.853257 |
| 001ca8f1-fee3-4f3d-ba78-7b935518a8ea | 400.002 | Service group controller-services has no active members available; expected 1 active member | service_domain=controller.service_group=controller-services | critical | 2019-08-14T09:01:59.611279 |
| 74ec6293-a70a-4c24-97e5-1fb5bde6549a | 400.002 | Service group oam-services loss of redundancy; expected 1 standby member but no standby members available | service_domain=controller.service_group=oam-services | major | 2019-08-14T09:01:59.449250 |
| 113a5ed5-3971-43c1-b5c4-7f084e17c50d | 100.114 | NTP address 2607:5300:60:97 is not a valid or a reachable NTP server. | host=controller-1.ntp=2607:5300:60:97 | minor | 2019-08-14T08:39:57.707422 |
+--------------------------------------+----------+-----------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------+----------+----------------------------+
controller-0:~$

Test Activity
-------------
Sanity

Revision history for this message
Peng Peng (ppeng) wrote :
Revision history for this message
Ghada Khalil (gkhalil) wrote :

Assigning to Bin to triage before deciding on priority / release gate.

tags: added: stx.ha stx.sanity
Changed in starlingx:
assignee: nobody → Bin Qian (bqian20)
Numan Waheed (nwaheed)
tags: added: stx.retestneeded
Revision history for this message
Bin Qian (bqian20) wrote :

Logs show that controller-1 was shut down due to a fatal ceph-mon failure (exceeded max transition failures). Controller-0 then went active, as expected.

| 2019-08-14T08:54:00.609 | 501 | service-scn | ceph-mon | enabling | disabling | enable failed
| 2019-08-14T08:56:01.831 | 502 | service-scn | ceph-mon | disabling | disabling-failed | disable failed
| 2019-08-14T08:56:01.831 | 503 | service-group-scn | controller-services | go-active | go-active-failed | ceph-mon(disabling, failed)
| 2019-08-14T08:56:02.256 | 504 | service-scn | ceph-mon | disabling-failed | enabling-failed | enabled-active state requested
| 2019-08-14T08:56:05.362 | 505 | service-group-scn | vim-services | active | disabling |
| 2019-08-14T08:56:05.363 | 506 | service-group-scn | cloud-services | active | disabling |
| 2019-08-14T08:56:05.364 | 507 | service-group-scn | controller-services | go-active-failed | disabling-failed | ceph-mon(enabling, failed)

2019-08-14T09:01:13.000 controller-1 sm: debug time[1513.951] log<1946> INFO: sm[97172]: sm_service_audit.c(364): Service (dbmon) already running audit-disabled action.
2019-08-14T09:01:13.000 controller-1 sm: debug time[1514.051] log<1947> INFO: sm[97172]: sm_service_engine.c(157): Service (ceph-mon) has had a fatal failure and is unrecoverable.
2019-08-14T09:01:14.000 controller-1 sm: debug time[1514.555] log<1948> INFO: sm[97172]: sm_service_engine.c(157): Service (ceph-mon) has had a fatal failure and is unrecoverable.
2019-08-14T09:01:14.000 controller-1 sm: debug time[1515.056] log<1949> INFO: sm[97172]: sm_service_engine.c(157): Service (ceph-mon) has had a fatal failure and is unrecoverable.
2019-08-14T09:01:15.000 controller-1 sm: debug time[1515.558] log<1950> INFO: sm[97172]: sm_service_engine.c(157): Service (ceph-mon) has had a fatal failure and is unrecoverable.
2019-08-14T09:01:15.000 controller-1 sm: debug time[1515.917] log<1951> INFO: sm[97172]: sm_service_group_api.c(212): Service group (controller-services) recovery from fatal condition escalated to a reboot.
2019-08-14T09:01:15.000 controller-1 sm: debug time[1515.917] log<1952> INFO: sm[97172]: sm_node_api.cpp(964): Reboot of controller-1 requested, reason=service group (controller-services) recovery from fatal condition escalated to a reboot..
2019-08-14T09:01:15.000 controller-1 sm: debug time[1515.918] log<1953> INFO: sm[97172]: sm_troubleshoot.c(150): Troubleshoot process (401470) created.

ceph-mon errors:
2019-08-14T09:00:01.762 controller-1 OCF_drbd(drbd-cephmon:warning 1)[380462]: WARNING: drbd-cephmon still Primary, demoting.
2019-08-14T09:00:01.768 controller-1 OCF_drbd(drbd-cephmon:warning 1)[380462]: WARNING: drbd-cephmon is still m...


Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nfv (master)

Reviewed: https://review.opendev.org/676689
Committed: https://git.openstack.org/cgit/starlingx/nfv/commit/?id=f98b388a745f95737bd4c05c24d59aed8b5e3699
Submitter: Zuul
Branch: master

commit f98b388a745f95737bd4c05c24d59aed8b5e3699
Author: Bart Wensley <email address hidden>
Date: Thu Aug 15 07:41:01 2019 -0500

    Correct VIM host audit criteria for failing instances

    The VIM's host audit will fail instances on any host that is
    "not enabled". That includes hosts where the operational state
    is unknown.

    Updating the check to ensure the host is "disabled" not that it
    is "not enabled" to avoid failing instances on a host where we
    don't know the operational state.

    Change-Id: I68d3e9f63695de721c10fb1dd2b7ac5917cb50fa
    Closes-Bug: 1840176
    Signed-off-by: Bart Wensley <email address hidden>
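The criteria change described in the commit message above can be sketched as follows. This is an illustrative reconstruction, not the actual nfv-vim code: the function names and state constants are hypothetical, but the logic mirrors the described change from "not enabled" to "disabled".

```python
# Hypothetical sketch of the VIM host-audit criteria change.
# Names (OPER_*, should_fail_instances_*) are illustrative, not nfv-vim API.

OPER_ENABLED = "enabled"
OPER_DISABLED = "disabled"
OPER_UNKNOWN = "unknown"


def should_fail_instances_old(oper_state: str) -> bool:
    # Old check: fail instances on any host that is "not enabled".
    # This also matches hosts whose operational state is unknown.
    return oper_state != OPER_ENABLED


def should_fail_instances_new(oper_state: str) -> bool:
    # New check: only fail instances when the host is known to be disabled,
    # so a host with an unknown operational state is left alone.
    return oper_state == OPER_DISABLED
```

Under the old check, a host whose state was merely unknown (e.g. mid-swact) would have its instances failed; under the new check it would not.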

Changed in starlingx:
status: New → Fix Released
Revision history for this message
Bart Wensley (bartwensley) wrote : Re: After host-swact completed, unexpected swact occurred

Whoops - the above commit referenced the wrong bug. Please ignore.

Dariush Eslimi (deslimi)
Changed in starlingx:
status: Fix Released → In Progress
Revision history for this message
Ghada Khalil (gkhalil) wrote :

Marking as stx.3.0 / medium priority -- appears to be a single occurrence due to a ceph failure

summary: - After host-swact completed, unexpected swact occurred
+ After host-swact completed, unexpected swact occurred due to ceph error
tags: added: stx.3.0 stx.storage
removed: stx.ha
Changed in starlingx:
importance: Undecided → Medium
assignee: Bin Qian (bqian20) → Daniel Badea (daniel.badea)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to integ (master)

Fix proposed to branch: master
Review: https://review.opendev.org/678474

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to integ (master)

Reviewed: https://review.opendev.org/678474
Committed: https://git.openstack.org/cgit/starlingx/integ/commit/?id=9faad457030dc7ffd88f2e4106b01157c9b9a0ed
Submitter: Zuul
Branch: master

commit 9faad457030dc7ffd88f2e4106b01157c9b9a0ed
Author: Daniel Badea <email address hidden>
Date: Mon Aug 26 06:59:28 2019 +0000

    ceph-init-wrapper use flock instead of flag files

    When swact occurs and ceph-init-wrapper is slow to respond
    to a status request it gets killed by SM. This means the
    corresponding flag file that marks status in progress is left
    behind.

    When controller swacts back ceph-init-wrapper sees status
    in progress and waits for it to finish (with a timeout).
    Because it does not respond fast enough SM tries to start
    again ceph-init-wrapper to get ceph-mon service up and running.

    This happens a couple of times until the service is declared
    failed and controller swacts back.

    To fix this we need to use flock instead of flag files as the
    locks will be automatically released by the OS when process
    is killed.

    Change-Id: If1912e8575258a4f79321d8435c8ae1b96b78b98
    Closes-bug: 1840176
    Signed-off-by: Daniel Badea <email address hidden>
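The reasoning in the commit message above (flock is released by the kernel when the holding process dies, whereas a flag file left on disk survives a SIGKILL) can be demonstrated with a minimal sketch. This is not the ceph-init-wrapper code (which is a shell script using flock(1)); the helper name and lock path are illustrative.

```python
# Minimal sketch of flock-based status locking. Unlike a flag file,
# the lock is tied to the open file descriptor: if the process holding
# it is killed, the OS releases the lock automatically.
import fcntl
import os


def try_lock(path):
    """Try to take an exclusive, non-blocking flock on path.

    Returns the open fd on success (caller keeps it open to hold the
    lock), or None if another holder already has the lock.
    """
    fd = os.open(path, os.O_CREAT | os.O_RDWR)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        return fd
    except BlockingIOError:
        # Lock is held elsewhere; close our fd and report failure.
        os.close(fd)
        return None
```

Closing the fd (or the process dying) releases the lock, so a killed status check can never leave a stale "in progress" marker behind the way a flag file can.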

Changed in starlingx:
status: In Progress → Fix Released
Revision history for this message
Peng Peng (ppeng) wrote :

Issue was reproduced on
Lab: WCP_76_77
Load: 2019-09-09_00-10-00
Job: Titanium_R6_build

[2019-09-10 08:22:07,526] 301 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.204.2:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne host-swact controller-1'

[sysadmin@controller-1 ~(keystone_admin)]$
[2019-09-10 08:22:31,455] 301 DEBUG MainThread ssh.send :: Send ''
[2019-09-10 08:22:34,559] 275 INFO MainThread ssh.wait_for_disconnect:: ssh session to 128.224.150.5 disconnected
[2019-09-10 08:22:34,560] 1572 INFO MainThread host_helper.wait_for_swact_complete:: ssh to 128.224.150.5 OAM floating IP disconnected, indicating swact initiated.
[2019-09-10 08:23:04,590] 151 INFO MainThread ssh.connect :: Attempt to connect to host - 128.224.150.5
[2019-09-10 08:23:06,099] 301 DEBUG MainThread ssh.send :: Send ''
[2019-09-10 08:23:06,204] 423 DEBUG MainThread ssh.expect :: Output:
controller-0:~$

[2019-09-10 08:25:51,734] 301 DEBUG MainThread ssh.send :: Send 'fm --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.204.2:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne alarm-list --nowrap --uuid'
[2019-09-10 08:25:53,012] 423 DEBUG MainThread ssh.expect :: Output:
+--------------------------------------+----------+---------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------+----------+----------------------------+
| UUID | Alarm ID | Reason Text | Entity ID | Severity | Time Stamp |
+--------------------------------------+----------+---------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------+----------+----------------------------+
| 32d95c7d-f2d7-4c51-917c-9a122b6a6923 | 400.001 | Service group controller-services failure; ceph-mon(disabling, failed) | service_domain=controller.service_group=controller-services.host=controller-0 | critical | 2019-09-10T08:25:51.606031 |
| 2177bd57-2a1c-43fd-b54f-41931c6221bc | 400.002 | Service group controller-services has no active members available; expected 1 active member | service_domain=controller.service_group=controller-services | critical | 2019-09-10T08:22:30.015971 |
| 4c643552-1006-48a2-8234-d6d6d3f3fbf7 | 100.114 | NTP address 2607:5300:201:3 is not a valid or a reachable NTP server. | host=controller-1.ntp=2607:5300:201:3 | minor | 2019-09-10T08:08:12.089195 |
+--------------------------------------+----------+--------------------------------------------------...


Revision history for this message
Daniel Badea (daniel.badea) wrote :

Load does not include the fix.
Proof:

$ grep -nHrIm1 CEPH_RESTARTING_FILE ALL_NODES_20190910.133325 2>&-
ALL_NODES_20190910.133325/controller-0_20190910.133325/etc/rc.d/init.d/ceph-init-wrapper:42:CEPH_RESTARTING_FILE="$VOLATILE_PATH/.ceph_restarting"
ALL_NODES_20190910.133325/controller-0_20190910.133325/etc/init.d/ceph-init-wrapper:42:CEPH_RESTARTING_FILE="$VOLATILE_PATH/.ceph_restarting"
ALL_NODES_20190910.133325/controller-1_20190910.133325/etc/rc.d/init.d/ceph-init-wrapper:42:CEPH_RESTARTING_FILE="$VOLATILE_PATH/.ceph_restarting"
ALL_NODES_20190910.133325/controller-1_20190910.133325/etc/init.d/ceph-init-wrapper:42:CEPH_RESTARTING_FILE="$VOLATILE_PATH/.ceph_restarting"

But CEPH_RESTARTING_FILE was removed by commit:
https://opendev.org/starlingx/integ/commit/9faad457030dc7ffd88f2e4106b01157c9b9a0ed?style=split

Revision history for this message
Peng Peng (ppeng) wrote :

Not seeing this issue recently.

tags: removed: stx.retestneeded