System is unstable after swact due to ceph-mon critical process failures

Bug #2017133 reported by Pedro Vinícius Silva da Cruz
This bug affects 1 person
Affects    Status        Importance  Assigned to                   Milestone
StarlingX  Fix Released  Medium      Pedro Vinícius Silva da Cruz

Bug Description

Brief Description
-----------------
After one of the critical processes (e.g., fmManager or dnsmasq) is killed, a swact occurs as expected, but the system does not stabilize and multiple further switchovers occur.

Severity
--------
Major

Steps to Reproduce
------------------
   1. On an AIO-DX or Standard lab, kill the critical process fmManager twice.
   2. This triggers a swact, as expected.
   3. Try to log back in to the system for normal use.
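
Step 1 above could be scripted roughly as follows. This is only a sketch, not the test harness used for this report: the helper names `find_pids` and `kill_repeatedly` are hypothetical, and the 5-second pause between kills is an assumed value meant to give the process monitor time to respawn the process before the second kill.

```python
import os
import signal
import subprocess
import time


def find_pids(name):
    """Return the PIDs of processes whose name matches exactly (pgrep -x)."""
    result = subprocess.run(["pgrep", "-x", name], capture_output=True, text=True)
    return [int(pid) for pid in result.stdout.split()]


def kill_repeatedly(name, times=2, wait_s=5.0,
                    find=find_pids, kill=os.kill, sleep=time.sleep):
    """Send SIGKILL to every matching process, `times` rounds in a row.

    Returns the PIDs that were signalled, in order. The find/kill/sleep
    parameters exist only so the logic can be exercised without root.
    """
    killed = []
    for attempt in range(times):
        for pid in find(name):
            kill(pid, signal.SIGKILL)
            killed.append(pid)
        if attempt < times - 1:
            # Assumed delay: let the process be respawned before the next kill.
            sleep(wait_s)
    return killed
```

On the active controller this would be called as `kill_repeatedly("fmManager")` with root privileges.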

Expected Behavior
------------------
The system should be stable after the swact and all applications/services should be running on the new Active controller.

Actual Behavior
----------------
The system is not stable: multiple switchovers occur due to controller-services failures, and it takes several minutes before the system reaches a stable state.

Reproducibility
---------------
100%

System Configuration
--------------------
AIO-DX

Last Pass
---------
2023-03-14_18-00-08

Test Activity
-------------
Automated Regression

Workaround
----------
Not known yet.

Changed in starlingx:
assignee: nobody → Pedro Vinícius Silva da Cruz (psilvada)
OpenStack Infra (hudson-openstack) wrote : Fix proposed to integ (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/integ/+/880961

Changed in starlingx:
status: New → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to integ (master)

Reviewed: https://review.opendev.org/c/starlingx/integ/+/880961
Committed: https://opendev.org/starlingx/integ/commit/09e29800cb4b55ebd4370cf5f23a333c70259c4e
Submitter: "Zuul (22348)"
Branch: master

commit 09e29800cb4b55ebd4370cf5f23a333c70259c4e
Author: Pedro Vinícius Silva da Cruz <email address hidden>
Date: Thu Apr 20 08:36:22 2023 -0400

    Fix AIO-DX Uncontrolled Swact ceph-mon failure

    This change resolves the scenario where, after an uncontrolled
    swact caused by killing one of the critical processes twice, the
    ceph-mon service does not start on the new active controller,
    triggering another swact.

    A flag was created to signal a complete shutdown of ceph-mon.
    After an uncontrolled swact, the system checks whether the flag
    exists and, if so, starts the ceph-mon service on the new active
    controller.

    Test Plan:
        PASS: System host-swact.
        PASS: Ceph recovery after rebooting the active controller.
        PASS: Ceph recovery after uncontrolled swact killing a critical
              process twice.
        PASS: Ceph recovery after mgmt network outage for a few minutes
              even when rebooting controllers.
        PASS: Ceph recovery after case of dead office recovery (DOR).
        PASS: Upgrade success from stx 7.0 to 8.0 in a duplex lab.

    Closes-bug: 2017133

    Signed-off-by: Pedro Vinícius Silva da Cruz <email address hidden>
    Change-Id: I6784ec76afa3e62ee14e8ca8f3d6c0212a9f6f3e
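
The flag mechanism described in the commit message can be illustrated with a minimal sketch. This is not the code from the merged change: the function names, the flag location, and the choice to delete the flag after a successful start are all assumptions made for illustration.

```python
from pathlib import Path


def mark_mon_shutdown_complete(flag: Path) -> None:
    """Record that ceph-mon finished shutting down (hypothetical helper)."""
    flag.parent.mkdir(parents=True, exist_ok=True)
    flag.touch()


def recover_mon_after_swact(flag: Path, start_mon) -> bool:
    """Start ceph-mon on the new active controller if the flag exists.

    `start_mon` is any callable that starts the service. Returns True
    when a start was attempted, False when no flag was found.
    """
    if not flag.exists():
        return False
    start_mon()
    # Assumption of this sketch: clear the flag so that the next
    # swact does not act on a stale marker.
    flag.unlink()
    return True
```

Passing the start action in as a callable keeps the check separate from how the service is actually started, which the commit message does not specify.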

Changed in starlingx:
status: In Progress → Fix Released
Ghada Khalil (gkhalil)
Changed in starlingx:
importance: Undecided → Medium
tags: added: stx.9.0 stx.storage