AIO-DX after host-swact, active controller not change due to ceph-mon error

Bug #1836075 reported by Peng Peng
This bug affects 2 people
Affects: StarlingX
Status: Fix Released
Importance: Medium
Assigned to: Stefan Dinescu

Bug Description

Brief Description
-----------------
After host-swact, the ssh connection was disconnected, but when ssh re-connected it was still connected to the same host, i.e. the active controller had not changed.

Severity
--------
Major

Steps to Reproduce
------------------
host-swact

TC-name: test_swact.py::test_swact_controllers
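
To make the verification explicit, a minimal reproduction/check sketch follows; it assumes admin credentials are loaded from /etc/platform/openrc and otherwise uses only the CLI commands captured in the logs below.

    # Sketch only; /etc/platform/openrc is an assumed credentials file on the
    # active controller, the remaining commands match the logs below.
    source /etc/platform/openrc
    system host-swact controller-0        # swact away from the active controller

    # After the ssh session to the OAM floating IP drops, reconnect and check
    # that the prompt/hostname now reports the other controller:
    hostname                              # expected: controller-1

    # cloud-services should now be active on controller-1:
    system servicegroup-list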

Expected Behavior
------------------
After the swact, controller-1 becomes active and the re-established ssh session to the OAM floating IP lands on the newly active controller.

Actual Behavior
----------------
The ssh session reconnected to controller-0, and servicegroup-list still showed cloud-services active on controller-0 with controller-1 in standby-warn; the active controller did not change.

Reproducibility
---------------
Seen once

System Configuration
--------------------
Two node system

Lab-name: IP_5-6

Branch/Pull Time/Commit
-----------------------
stx master as of 20190708T233000Z

Last Pass
---------
20190707T013000Z

Timestamp/Logs
--------------
[2019-07-10 11:24:11,047] 301 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.204.2:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne host-swact controller-0'

[2019-07-10 11:24:37,886] 275 INFO MainThread ssh.wait_for_disconnect:: ssh session to 128.224.151.216 disconnected
[2019-07-10 11:24:37,887] 1564 INFO MainThread host_helper.wait_for_swact_complete:: ssh to 128.224.151.216 OAM floating IP disconnected, indicating swact initiated.
[2019-07-10 11:25:07,896] 301 DEBUG MainThread ssh.send :: Send ''
[2019-07-10 11:25:11,000] 151 INFO MainThread ssh.connect :: Attempt to connect to host - 128.224.151.216

controller-0:~$
[2019-07-10 11:26:14,483] 466 DEBUG MainThread ssh.exec_cmd:: Executing command...
[2019-07-10 11:26:14,484] 301 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.204.2:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne host-show controller-0'
[2019-07-10 11:26:16,146] 423 DEBUG MainThread ssh.expect :: Output:
+---------------------+----------------------------------------------------------------------+
| Property | Value |
+---------------------+----------------------------------------------------------------------+
| action | none |
| administrative | unlocked |
| availability | degraded |

[2019-07-10 11:26:20,418] 301 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.204.2:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne servicegroup-list'
[2019-07-10 11:26:22,763] 423 DEBUG MainThread ssh.expect :: Output:
+--------------------------------------+-----------------------------+--------------+------------------+
| uuid | service_group_name | hostname | state |
+--------------------------------------+-----------------------------+--------------+------------------+
| 2ebeab1b-3a27-4052-8180-1507cb141f24 | cloud-services | controller-0 | active |
| 2970594f-c951-4975-bd54-79d04e091211 | cloud-services | controller-1 | standby-warn |

Test Activity
-------------
Sanity

Revision history for this message
Peng Peng (ppeng) wrote :
Revision history for this message
Ghada Khalil (gkhalil) wrote :

Assigning to Bin to triage before deciding whether this is a gating issue.

tags: added: stx.ha
Changed in starlingx:
assignee: nobody → Bin Qian (bqian20)
Revision history for this message
Bin Qian (bqian20) wrote :

| 2019-07-10T11:24:29.997 | 312 | node-scn | controller-0 | | swact | issued against host controller-0

| 2019-07-10T11:25:35.026 | 451 | service-group-scn | controller-services | disabling | disabling-failed | ceph-mon(disabling, failed)
| 2019-07-10T11:25:38.863 | 455 | service-group-scn | vim-services | disabled | go-active |
| 2019-07-10T11:25:38.864 | 456 | service-group-scn | cloud-services | disabled | go-active |

ceph-mon failed while going disabled, which failed the swact. No logs were found that indicate what went wrong with ceph-mon.
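
When this reproduces, the relevant transitions can be pulled out quickly with something like the following; the log paths are assumptions for a typical StarlingX controller (the scn lines quoted above come from SM's customer log).

    # Log paths are assumptions for a typical StarlingX controller.
    # SM state-change records (the node-scn / service-group-scn lines above):
    grep -E 'swact|ceph-mon|service-group-scn' /var/log/sm-customer.log
    # Ceph monitor logs, in case they explain the failed disable:
    tail -n 100 /var/log/ceph/ceph-mon*.log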

A separate issue: dbmon was running fine after stx-openstack was applied around 08:10, then it started failing after stx-openstack was reapplied. The reapply apparently was not successful, and sysinv reported stx-openstack as "not applied".

This does not match the expected behavior, as a failed reapply should not change the fact that the application is applied and running with the latest successful apply. dbmon continued running but lost access to the mariadb pod and other resources, so it has been reporting failures since the reapply (a quick way to check the reported application status is sketched after the log excerpt below).

2019-07-10 08:10:05.824 110608 INFO sysinv.conductor.kube_app [-] Application stx-openstack (1.0-17-centos-stable-versioned) apply completed.
...
2019-07-10 11:01:21.802 110608 INFO sysinv.conductor.kube_app [-] Application stx-openstack (1.0-17-centos-stable-versioned) apply started.
2019-07-10 11:08:37.690 110608 ERROR sysinv.conductor.kube_app [-] Failed to apply application manifest /manifests/stx-openstack/1.0-17-centos-stable-versioned/stx-openstack-stx-openstack.yaml. See /var/log/armada stx-openstack-apply.log for details.
2019-07-10 11:08:37.696 110608 INFO sysinv.conductor.kube_app [-] Exiting progress monitoring thread for app stx-openstack
2019-07-10 12:17:52.178 228896 INFO sysinv.api.controllers.v1.host [-] stx-openstack system app is present but not applied, skipping re-apply

| 2019-07-10T08:11:06.772 | 290 | service-scn | dbmon | unknown | enabled-active | audit success
| 2019-07-10T11:05:46.432 | 295 | service-scn | dbmon | enabled-active | disabling | audit failed
...
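
To check what sysinv reports for the application after the failed reapply, the standard application commands can be used; a sketch (the armada log path is the one referenced in the sysinv error above):

    # Run on the active controller; commands are from the standard stx CLI.
    system application-list
    system application-show stx-openstack
    # Armada details for the failed apply (path referenced in the sysinv error):
    tail -n 100 /var/log/armada/stx-openstack-apply.log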

Numan Waheed (nwaheed)
tags: added: stx.retestneeded
Revision history for this message
Ghada Khalil (gkhalil) wrote :

Bin, can you open a separate launchpad for the dbmon issue? I assume it needs to be investigated/resolved.

Revision history for this message
Bin Qian (bqian20) wrote :

Launchpad https://bugs.launchpad.net/starlingx/+bug/1836634 has been created to track the unexpected stx-openstack application behavior after the failed reapply.

Revision history for this message
Ghada Khalil (gkhalil) wrote :

Marking as stx.2.0 - ceph-mon error is preventing swacts from occurring. Frequency is not fully understood.

summary: - AIO-DX after host-swact, active controller not change
+ AIO-DX after host-swact, active controller not change due to ceph-mon
+ error
Changed in starlingx:
assignee: Bin Qian (bqian20) → Stefan Dinescu (stefandinescu)
tags: added: stx.2.0 stx.storage
removed: stx.ha
Changed in starlingx:
importance: Undecided → Medium
status: New → Triaged
Revision history for this message
Bart Wensley (bartwensley) wrote :

I saw this same issue in an AIO-DX lab (wolfpass_1-2) in a designer load built on July 15th. In my case swacting from controller-1 to controller-0 failed because the ceph-mon service did not come up. Even worse, the automatic swact back to controller-1 also failed for the same reason. Finally, the third swact was OK.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to ha (master)

Fix proposed to branch: master
Review: https://review.opendev.org/672708

Changed in starlingx:
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to integ (master)

Fix proposed to branch: master
Review: https://review.opendev.org/672709

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to ha (master)

Reviewed: https://review.opendev.org/672708
Committed: https://git.openstack.org/cgit/starlingx/ha/commit/?id=95367fd675529363280c09120bb3c73c09f6dbc3
Submitter: Zuul
Branch: master

commit 95367fd675529363280c09120bb3c73c09f6dbc3
Author: Stefan Dinescu <email address hidden>
Date: Thu Jul 25 14:54:43 2019 +0300

    Increase SM timeout for ceph-mon

    Note: this only affects AIO-DX setups as that is the only kind
          of setup where ceph-mon is managed by SM

    In some edge-cases, during a swact, ceph-mon may take too long
    to be stopped on the active controller resulting in a failed
    swact.

    This change increases the timeout to account for those
    edge cases.

    Change-Id: I3ace73650e4fe9aafc84c82e2ffe048f2039305e
    Partial-bug: 1836075
    Signed-off-by: Stefan Dinescu <email address hidden>
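
As an illustration of the failure mode this commit addresses (not the actual SM change): the disable action polls for the process to stop within a fixed budget, and if ceph-mon needs longer than that budget the disable is declared failed and the swact aborts. The values and the stop invocation below are hypothetical.

    # Illustration only, not the actual SM change; TIMEOUT and the stop
    # invocation are hypothetical stand-ins for SM's configured disable
    # action and its timeout for the ceph-mon service.
    TIMEOUT=30
    /etc/init.d/ceph-init-wrapper stop mon &
    for ((i = 0; i < TIMEOUT; i++)); do
        pgrep -f ceph-mon > /dev/null || { echo "ceph-mon stopped after ${i}s"; exit 0; }
        sleep 1
    done
    echo "ceph-mon still running after ${TIMEOUT}s: disable fails, swact aborts" >&2
    exit 1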

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to integ (master)

Reviewed: https://review.opendev.org/672709
Committed: https://git.openstack.org/cgit/starlingx/integ/commit/?id=12f604b4dd86663910fa152ea50265a2ae56d932
Submitter: Zuul
Branch: master

commit 12f604b4dd86663910fa152ea50265a2ae56d932
Author: Stefan Dinescu <email address hidden>
Date: Thu Jul 25 15:00:21 2019 +0300

    Change ceph-init-wrapper wait logic

    The stop, start and restart commands are waiting for any status
    commands to finish before attempting the actual command

    This would cause issues as some commands that are related to OSDs
    only would wait for monitor status and vice-versa.

    Depending on the number of OSDs, the osd status command could take
    too much time to finish, resulting in a "stop mon" command waiting
    just as long, even though it didn't need to.

    Changes in this commit:
    - commands related to OSDs and monitors have their own wait times
      and separate flag files
    - add improved logging to better see if the script is waiting
      for a certain function to finish

    Change-Id: Ia03981b2b49f999e8a96aa12361209a418da4c50
    Closes-bug: 1836075
    Depends-On: I3ace73650e4fe9aafc84c82e2ffe048f2039305e
    Signed-off-by: Stefan Dinescu <email address hidden>
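
A minimal sketch of the described technique (not the actual ceph-init-wrapper code; flag paths and wait budgets are hypothetical): monitor and OSD operations each track their own in-progress status with a separate flag file and wait budget, so "stop mon" no longer blocks behind a long OSD status poll.

    # Sketch only; flag paths and budgets are hypothetical, not the real
    # ceph-init-wrapper values.
    MON_STATUS_FLAG=/var/run/.ceph_mon_status_in_progress
    OSD_STATUS_FLAG=/var/run/.ceph_osd_status_in_progress
    MON_WAIT=30     # seconds a mon command waits for an in-flight mon status
    OSD_WAIT=120    # seconds an osd command waits for an in-flight osd status

    wait_for_flag() {
        # Wait for an in-progress flag to clear, up to a per-target budget,
        # logging progress so it is visible what the script is waiting on.
        local flag=$1 budget=$2 waited=0
        while [ -f "$flag" ] && [ "$waited" -lt "$budget" ]; do
            sleep 1
            waited=$((waited + 1))
            echo "waited ${waited}s for $(basename "$flag") to clear"
        done
    }

    # "stop mon" waits only on the monitor flag; a long-running osd status
    # (tracked by OSD_STATUS_FLAG) no longer delays it.
    wait_for_flag "$MON_STATUS_FLAG" "$MON_WAIT"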

Changed in starlingx:
status: In Progress → Fix Released
Revision history for this message
Peng Peng (ppeng) wrote :

The issue has not been reproduced recently.

tags: removed: stx.retestneeded