Able to swact to a controller that is in the Locking state

Bug #2064347 reported by Eric MacDonald
Affects    Status        Importance  Assigned to     Milestone
StarlingX  Fix Released  Low         Eric MacDonald

Bug Description

Brief Description
-----------------
Locking a controller takes a finite amount of time, so there is a small window between issuing a lock command toward the inactive controller and that controller actually entering the locked state. This window is typically a few seconds. However, it can stretch to a minute or more if the system is busy and/or VMs or other migrations have to complete before the node is permitted to enter the locked state.

Cases have been seen where issuing a 'system host-swact' command while the inactive controller is in this 'Locking but not yet Locked' state leads to a switch of activity to a locked controller.

The existing pre-swact semantic check is not sufficient to prevent this race condition, which can leave the system with a locked active controller.
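The window described above can be illustrated with a toy model. This is only a sketch under stated assumptions: the `Host` class, its field names, and `naive_swact_check` are hypothetical and are not taken from the sysinv code. The sketch assumes a check that looks at the administrative state alone, which still reads 'unlocked' while the lock is in progress.

```python
from dataclasses import dataclass


@dataclass
class Host:
    # Hypothetical model of a controller; not the sysinv data model.
    hostname: str
    administrative: str = "unlocked"  # 'unlocked' or 'locked'
    task: str = ""                    # in-progress action, empty when idle


def request_lock(host):
    """Lock accepted but not yet complete: only the task field changes."""
    host.task = "Locking"


def complete_lock(host):
    """Lock finished: the administrative state changes and the task clears."""
    host.administrative = "locked"
    host.task = ""


def naive_swact_check(target):
    """Hypothetical check that considers the administrative state only."""
    return target.administrative == "unlocked"


standby = Host("controller-1")
request_lock(standby)
allowed_in_window = naive_swact_check(standby)   # check passes during the window
complete_lock(standby)
allowed_after_lock = naive_swact_check(standby)  # check only fails once locked
```

In this model the swact request is accepted during the 'Locking' window because nothing in the check consults the in-progress task.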

Severity
--------
Minor: Based on probability, which is very low.
          There is no reason to swact to a controller you are trying to lock. Just don't do it.
Critical: Based on system impact.
          See Workaround.

Steps to Reproduce
------------------
system host-lock controller-1 ; sleep 1 ; system host-swact controller-0
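The one-line reproduction can also be wrapped in a small helper so the delay between the two commands can be varied. This is a sketch only: the function names and the `dry_run` flag are hypothetical, the two `system` commands are the ones from the steps above, and it assumes the platform credentials have already been sourced (`source /etc/platform/openrc`).

```python
import shlex
import subprocess
import time


def build_repro_commands(standby="controller-1", active="controller-0"):
    """Return the two CLI invocations from the reproduction steps."""
    return [
        ["system", "host-lock", standby],
        ["system", "host-swact", active],
    ]


def run_repro(delay_s=1.0, dry_run=True):
    """Issue the lock, wait delay_s seconds, then issue the swact.

    With dry_run=True nothing is executed; the commands are returned as
    strings so the sequence can be inspected first.
    """
    lock_cmd, swact_cmd = build_repro_commands()
    if dry_run:
        return [shlex.join(lock_cmd), shlex.join(swact_cmd)]
    subprocess.run(lock_cmd, check=True)
    time.sleep(delay_s)  # try to land inside the Locking window
    # check=False because the swact may be rejected by the CLI
    result = subprocess.run(swact_cmd, check=False)
    return result.returncode
```

Calling `run_repro()` with the default `dry_run=True` only returns the command strings; pass `dry_run=False` on a lab DX system to actually run them.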

Expected Behavior
------------------
Swact does not occur. Locked controller does not activate.

Actual Behavior
----------------
Swact does occur. Locked controller does activate.

Reproducibility
---------------
100%

System Configuration
--------------------
Any DX system

Branch/Pull Time/Commit
-----------------------
Any context prior to the date of this bug report (April 30, 2024)

Last Pass
---------
Never tested

Timestamp/Logs
--------------

  sysadmin@controller-1:~$ source /etc/platform/openrc ; system host-list
  +----+--------------+-------------+----------------+-------------+--------------+
  | id | hostname     | personality | administrative | operational | availability |
  +----+--------------+-------------+----------------+-------------+--------------+
  | 1  | controller-0 | controller  | unlocked       | enabled     | available    |
  | 2  | controller-1 | controller  | locked         | disabled    | online       |
  +----+--------------+-------------+----------------+-------------+--------------+

Test Activity
-------------
Developer Testing

Workaround
----------
Unlock the active controller. The system will swact to the unlocked controller, then reboot and unlock the previously locked/active controller.

OpenStack Infra (hudson-openstack) wrote : Fix proposed to config (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/config/+/917791

Changed in starlingx:
status: New → In Progress
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/config/+/917844

OpenStack Infra (hudson-openstack) wrote : Change abandoned on config (master)

Change abandoned by "Eric MacDonald <email address hidden>" on branch: master
Review: https://review.opendev.org/c/starlingx/config/+/917844
Reason: Accidentally posted review with new change id

OpenStack Infra (hudson-openstack) wrote : Fix merged to config (master)

Reviewed: https://review.opendev.org/c/starlingx/config/+/917791
Committed: https://opendev.org/starlingx/config/commit/f29cc84ba3dcfd634e338660d0657d3ec557287e
Submitter: "Zuul (22348)"
Branch: master

commit f29cc84ba3dcfd634e338660d0657d3ec557287e
Author: Eric MacDonald <email address hidden>
Date: Tue Apr 30 21:16:01 2024 +0000

    Prevent swacting to a 'Locking' controller

    Locking a controller takes a finite amount of time, resulting in a
    brief window between issuing a lock command toward the inactive
    controller and the controller actually entering the locked state.

    Typically, this window lasts only a few seconds. However, during
    periods of high system activity or when VMs or other migrations are
    occurring, it can extend to a minute or longer before the controller
    enters the locked state.

    In some cases, initiating a 'system host-swact' command while the
    inactive controller is in this 'Locking but not yet Locked' state has
    led to a switch of activity to a locked controller.

    The current pre-swact semantic check is inadequate in preventing
    this race condition, which could result in a locked active controller.

    This update adds a precheck of a list of in-progress actions, any of
    which will now reject a swact request.

    Test Plan:

    PASS: Verify sysinv package build.
    PASS: Verify swact is rejected for any of the in-progress actions
          listed in the precheck.
    PASS: Verify swact reject handling and output text.
    PASS: Verify pep8 of changed lines.

    Regression:

    PASS: Verify swact handling when task is empty
    PASS: Verify swact handling when task is not empty and not Locking
    PASS: Verify Swact soak (10x)

    Closes-Bug: 2064347
    Change-Id: I78238fa649c330d7b908dbcf50f654c004205ee6
    Signed-off-by: Eric MacDonald <email address hidden>
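
The precheck described in the commit message might look roughly like the following. This is a minimal sketch, not the merged code: `IN_PROGRESS_TASKS`, `SwactRejected`, `swact_precheck`, and the reject text are all hypothetical, and the list contains only 'Locking' because that is the one in-progress action this report names. The actual list is in the linked review.

```python
# Hypothetical list; the real set of in-progress actions is defined in the
# merged change and is not reproduced in this report.
IN_PROGRESS_TASKS = ("Locking",)


class SwactRejected(Exception):
    """Raised when the swact target has an in-progress action."""


def swact_precheck(target_hostname, target_task):
    """Reject a swact if the target's task starts with a listed action.

    An empty or missing task, or a task that is not in the list, passes.
    """
    task = (target_task or "").strip()
    for action in IN_PROGRESS_TASKS:
        if task.startswith(action):
            raise SwactRejected(
                "Rejected: cannot swact to %s while it is %s"
                % (target_hostname, action)
            )
```

The three cases mirror the test plan above: an empty task passes, a non-empty task that is not in the list passes, and a 'Locking' task is rejected.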

Changed in starlingx:
status: In Progress → Fix Released
Ghada Khalil (gkhalil)
Changed in starlingx:
importance: Undecided → Low
tags: added: stx.10.0 stx.config
Changed in starlingx:
assignee: nobody → Eric MacDonald (rocksolidmtce)