AIO-DX after host-swact, active controller not change due to ceph-mon error

Bug #1836075 reported by Peng Peng
This bug affects 2 people
Affects: StarlingX
Status: Fix Released
Importance: Medium
Assigned to: Stefan Dinescu

Bug Description

Brief Description
-----------------
After host-swact, the ssh connection was disconnected, but when ssh re-connected it was still connected to the same host, i.e. the active controller had not changed.

Severity
--------
Major

Steps to Reproduce
------------------
host-swact

TC-name: test_swact.py::test_swact_controllers
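
To make the verification explicit, a minimal reproduction/check sketch follows; it assumes admin credentials are loaded from /etc/platform/openrc and otherwise uses only the CLI commands captured in the logs below.

    # Sketch only; /etc/platform/openrc is an assumed credentials file on the
    # active controller, the remaining commands match the logs below.
    source /etc/platform/openrc
    system host-swact controller-0        # swact away from the active controller

    # After the ssh session to the OAM floating IP drops, reconnect and check
    # that the prompt/hostname now reports the other controller:
    hostname                              # expected: controller-1

    # cloud-services should now be active on controller-1:
    system servicegroup-list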

Expected Behavior
------------------
After the swact, controller-1 becomes active and the re-established ssh session to the OAM floating IP lands on the newly active controller.

Actual Behavior
----------------
The ssh session reconnected to controller-0, and servicegroup-list still showed cloud-services active on controller-0 with controller-1 in standby-warn; the active controller did not change.

Reproducibility
---------------
Seen once

System Configuration
--------------------
Two node system

Lab-name: IP_5-6

Branch/Pull Time/Commit
-----------------------
stx master as of 20190708T233000Z

Last Pass
---------
20190707T013000Z

Timestamp/Logs
--------------
[2019-07-10 11:24:11,047] 301 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.204.2:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne host-swact controller-0'

[2019-07-10 11:24:37,886] 275 INFO MainThread ssh.wait_for_disconnect:: ssh session to 128.224.151.216 disconnected
[2019-07-10 11:24:37,887] 1564 INFO MainThread host_helper.wait_for_swact_complete:: ssh to 128.224.151.216 OAM floating IP disconnected, indicating swact initiated.
[2019-07-10 11:25:07,896] 301 DEBUG MainThread ssh.send :: Send ''
[2019-07-10 11:25:11,000] 151 INFO MainThread ssh.connect :: Attempt to connect to host - 128.224.151.216

controller-0:~$
[2019-07-10 11:26:14,483] 466 DEBUG MainThread ssh.exec_cmd:: Executing command...
[2019-07-10 11:26:14,484] 301 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.204.2:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne host-show controller-0'
[2019-07-10 11:26:16,146] 423 DEBUG MainThread ssh.expect :: Output:
+---------------------+----------------------------------------------------------------------+
| Property | Value |
+---------------------+----------------------------------------------------------------------+
| action | none |
| administrative | unlocked |
| availability | degraded |

[2019-07-10 11:26:20,418] 301 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.204.2:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne servicegroup-list'
[2019-07-10 11:26:22,763] 423 DEBUG MainThread ssh.expect :: Output:
+--------------------------------------+-----------------------------+--------------+------------------+
| uuid | service_group_name | hostname | state |
+--------------------------------------+-----------------------------+--------------+------------------+
| 2ebeab1b-3a27-4052-8180-1507cb141f24 | cloud-services | controller-0 | active |
| 2970594f-c951-4975-bd54-79d04e091211 | cloud-services | controller-1 | standby-warn |

Test Activity
-------------
Sanity

Revision history for this message
Peng Peng (ppeng) wrote :
Revision history for this message
Ghada Khalil (gkhalil) wrote :

Assigning to Bin to triage before deciding whether this is a gating issue.

tags: added: stx.ha
Changed in starlingx:
assignee: nobody → Bin Qian (bqian20)
Revision history for this message
Bin Qian (bqian20) wrote :

| 2019-07-10T11:24:29.997 | 312 | node-scn | controller-0 | | swact | issued against host controller-0

| 2019-07-10T11:25:35.026 | 451 | service-group-scn | controller-services | disabling | disabling-failed | ceph-mon(disabling, failed)
| 2019-07-10T11:25:38.863 | 455 | service-group-scn | vim-services | disabled | go-active |
| 2019-07-10T11:25:38.864 | 456 | service-group-scn | cloud-services | disabled | go-active |

ceph-mon failed while going disabled, which failed the swact. No logs were found that indicate what went wrong with ceph-mon.
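
When this reproduces, the relevant transitions can be pulled out quickly with something like the following; the log paths are assumptions for a typical StarlingX controller (the scn lines quoted above come from SM's customer log).

    # Log paths are assumptions for a typical StarlingX controller.
    # SM state-change records (the node-scn / service-group-scn lines above):
    grep -E 'swact|ceph-mon|service-group-scn' /var/log/sm-customer.log
    # Ceph monitor logs, in case they explain the failed disable:
    tail -n 100 /var/log/ceph/ceph-mon*.log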

A separate issue: dbmon was running fine after stx-openstack was applied around 08:10, then it started failing after stx-openstack was reapplied. The reapply apparently was not successful, and sysinv reported stx-openstack as "not applied".

This does not match the expected behavior, as a failed reapply should not change the fact that the application is applied and running with the latest successful apply. dbmon continued running but lost access to the mariadb pod and other resources, so it has been reporting failures since the reapply (a quick way to check the reported application status is sketched after the log excerpt below).

2019-07-10 08:10:05.824 110608 INFO sysinv.conductor.kube_app [-] Application stx-openstack (1.0-17-centos-stable-versioned) apply completed.
...
2019-07-10 11:01:21.802 110608 INFO sysinv.conductor.kube_app [-] Application stx-openstack (1.0-17-centos-stable-versioned) apply started.
2019-07-10 11:08:37.690 110608 ERROR sysinv.conductor.kube_app [-] Failed to apply application manifest /manifests/stx-openstack/1.0-17-centos-stable-versioned/stx-openstack-stx-openstack.yaml. See /var/log/armada stx-openstack-apply.log for details.
2019-07-10 11:08:37.696 110608 INFO sysinv.conductor.kube_app [-] Exiting progress monitoring thread for app stx-openstack
2019-07-10 12:17:52.178 228896 INFO sysinv.api.controllers.v1.host [-] stx-openstack system app is present but not applied, skipping re-apply

| 2019-07-10T08:11:06.772 | 290 | service-scn | dbmon | unknown | enabled-active | audit success
| 2019-07-10T11:05:46.432 | 295 | service-scn | dbmon | enabled-active | disabling | audit failed
...
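
To check what sysinv reports for the application after the failed reapply, the standard application commands can be used; a sketch (the armada log path is the one referenced in the sysinv error above):

    # Run on the active controller; commands are from the standard stx CLI.
    system application-list
    system application-show stx-openstack
    # Armada details for the failed apply (path referenced in the sysinv error):
    tail -n 100 /var/log/armada/stx-openstack-apply.log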

Numan Waheed (nwaheed)
tags: added: stx.retestneeded
Revision history for this message
Ghada Khalil (gkhalil) wrote :

Bin, can you open a separate launchpad for the dbmon issue? I assume it needs to be investigated/resolved.

Revision history for this message
Bin Qian (bqian20) wrote :

Launchpad https://bugs.launchpad.net/starlingx/+bug/1836634 has been created to track the unexpected stx-openstack application behavior after the failed reapply.

Revision history for this message
Ghada Khalil (gkhalil) wrote :

Marking as stx.2.0 - ceph-mon error is preventing swacts from occurring. Frequency is not fully understood.

summary: - AIO-DX after host-swact, active controller not change
+ AIO-DX after host-swact, active controller not change due to ceph-mon
+ error
Changed in starlingx:
assignee: Bin Qian (bqian20) → Stefan Dinescu (stefandinescu)
tags: added: stx.2.0 stx.storage
removed: stx.ha
Changed in starlingx:
importance: Undecided → Medium
status: New → Triaged
Revision history for this message
Bart Wensley (bartwensley) wrote :

I saw this same issue in an AIO-DX lab (wolfpass_1-2) in a designer load built on July 15th. In my case swacting from controller-1 to controller-0 failed because the ceph-mon service did not come up. Even worse, the automatic swact back to controller-1 also failed for the same reason. Finally, the third swact was OK.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to ha (master)

Fix proposed to branch: master
Review: https://review.opendev.org/672708

Changed in starlingx:
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to integ (master)

Fix proposed to branch: master
Review: https://review.opendev.org/672709

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to ha (master)

Reviewed: https://review.opendev.org/672708
Committed: https://git.openstack.org/cgit/starlingx/ha/commit/?id=95367fd675529363280c09120bb3c73c09f6dbc3
Submitter: Zuul
Branch: master

commit 95367fd675529363280c09120bb3c73c09f6dbc3
Author: Stefan Dinescu <email address hidden>
Date: Thu Jul 25 14:54:43 2019 +0300

    Increase SM timeout for ceph-mon

    Note: this only affects AIO-DX setups as that is the only kind
          of setup where ceph-mon is managed by SM

    In some edge-cases, during a swact, ceph-mon may take too long
    to be stopped on the active controller resulting in a failed
    swact.

    This change increases the timeout to account for those
    edge cases.

    Change-Id: I3ace73650e4fe9aafc84c82e2ffe048f2039305e
    Partial-bug: 1836075
    Signed-off-by: Stefan Dinescu <email address hidden>
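
As an illustration of the failure mode this commit addresses (not the actual SM change): the disable action polls for the process to stop within a fixed budget, and if ceph-mon needs longer than that budget the disable is declared failed and the swact aborts. The values and the stop invocation below are hypothetical.

    # Illustration only, not the actual SM change; TIMEOUT and the stop
    # invocation are hypothetical stand-ins for SM's configured disable
    # action and its timeout for the ceph-mon service.
    TIMEOUT=30
    /etc/init.d/ceph-init-wrapper stop mon &
    for ((i = 0; i < TIMEOUT; i++)); do
        pgrep -f ceph-mon > /dev/null || { echo "ceph-mon stopped after ${i}s"; exit 0; }
        sleep 1
    done
    echo "ceph-mon still running after ${TIMEOUT}s: disable fails, swact aborts" >&2
    exit 1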

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to integ (master)

Reviewed: https://review.opendev.org/672709
Committed: https://git.openstack.org/cgit/starlingx/integ/commit/?id=12f604b4dd86663910fa152ea50265a2ae56d932
Submitter: Zuul
Branch: master

commit 12f604b4dd86663910fa152ea50265a2ae56d932
Author: Stefan Dinescu <email address hidden>
Date: Thu Jul 25 15:00:21 2019 +0300

    Change ceph-init-wrapper wait logic

    The stop, start and restart commands are waiting for any status
    commands to finish before attempting the actual command

    This would cause issues as some commands that are related to OSDs
    only would wait for monitor status and vice-versa.

    Depending on the number of OSDs, the osd status command could take
    too much time to finish, resulting in a "stop mon" command waiting
    just as long, even though it didn't need to.

    Changes in this commit:
    - commands related to OSDs and monitors have their own wait times
      and separate flag files
    - add improved logging to better see if the script is waiting
      for a certain function to finish

    Change-Id: Ia03981b2b49f999e8a96aa12361209a418da4c50
    Closes-bug: 1836075
    Depends-On: I3ace73650e4fe9aafc84c82e2ffe048f2039305e
    Signed-off-by: Stefan Dinescu <email address hidden>
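
A minimal sketch of the described technique (not the actual ceph-init-wrapper code; flag paths and wait budgets are hypothetical): monitor and OSD operations each track their own in-progress status with a separate flag file and wait budget, so "stop mon" no longer blocks behind a long OSD status poll.

    # Sketch only; flag paths and budgets are hypothetical, not the real
    # ceph-init-wrapper values.
    MON_STATUS_FLAG=/var/run/.ceph_mon_status_in_progress
    OSD_STATUS_FLAG=/var/run/.ceph_osd_status_in_progress
    MON_WAIT=30     # seconds a mon command waits for an in-flight mon status
    OSD_WAIT=120    # seconds an osd command waits for an in-flight osd status

    wait_for_flag() {
        # Wait for an in-progress flag to clear, up to a per-target budget,
        # logging progress so it is visible what the script is waiting on.
        local flag=$1 budget=$2 waited=0
        while [ -f "$flag" ] && [ "$waited" -lt "$budget" ]; do
            sleep 1
            waited=$((waited + 1))
            echo "waited ${waited}s for $(basename "$flag") to clear"
        done
    }

    # "stop mon" waits only on the monitor flag; a long-running osd status
    # (tracked by OSD_STATUS_FLAG) no longer delays it.
    wait_for_flag "$MON_STATUS_FLAG" "$MON_WAIT"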

Changed in starlingx:
status: In Progress → Fix Released
Revision history for this message
Peng Peng (ppeng) wrote :

The issue has not been reproduced recently.

tags: removed: stx.retestneeded