Locked controller comes up as the active controller after power cycling both controllers

Bug #2051578 reported by Eric MacDonald
This bug affects 1 person
Affects: StarlingX
Status: Fix Released
Importance: Medium
Assigned to: Eric MacDonald
Milestone: (none)

Bug Description

Brief Description:
------------------
Service Management (SM) sometimes selects and activates services on a locked controller following a Dead Office Recovery (DOR). A typical DOR is recovery from a power outage.

Severity:
---------
Major

Steps to Reproduce:
-------------------
Step 1: Lock a controller in an otherwise unlocked-enabled duplex system.
Step 2: Once locked, reboot or power cycle both controllers simultaneously.

Expected Behavior:
------------------
A locked controller should never be selected for service activation.

Actual Behavior:
----------------
SM sometimes activates services on the locked controller.

Reproducibility:
----------------
Intermittent

System Configuration:
---------------------
AIO Duplex

Work Around:
-----------
Retry: reset or power cycle again.

OpenStack Infra (hudson-openstack) wrote : Fix proposed to metal (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/metal/+/907620

Changed in starlingx:
status: New → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix proposed to ha (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/ha/+/907623

OpenStack Infra (hudson-openstack) wrote : Fix merged to metal (master)

Reviewed: https://review.opendev.org/c/starlingx/metal/+/907620
Committed: https://opendev.org/starlingx/metal/commit/d9982a3b7e32549d0a671cbd42e7c8fe18b783b4
Submitter: "Zuul (22348)"
Branch: master

commit d9982a3b7e32549d0a671cbd42e7c8fe18b783b4
Author: Eric MacDonald <email address hidden>
Date: Fri Feb 2 16:44:01 2024 +0000

    Mtce: Create non-volatile backup of node locked flag file

    The existing /var/run/.node_locked flag file is volatile, meaning
    it is lost over a host reboot, which has DOR implications.

    Service Management (SM) sometimes selects and activates services
    on a locked controller following a DOR (Dead Office Recovery).

    This update is part one of a two-part update that solves both
    of the above problems. Part two is a change to SM in the ha git.
    This update can be merged without part two.

    This update maintains the existing volatile node locked file because
    it is looked at by other system services. So to minimize the change
    and therefore patchback impact, a new non-volatile 'backup' of the
    existing node locked flag file is created.

    This update incorporates modifications to the mtcAgent and mtcClient,
    introducing a new backup file and managing it in sync with the
    existing file so that both are always present or absent together.

    Note: A design choice was made to manage both files directly
          rather than make one a symlink to the other, which would
          have required adding symlink management support to the code.
          This approach was chosen for its simplicity and reliability.
          At some point in the future the volatile file could be
          deprecated, contingent upon identifying and updating all
          services that directly reference it.

    This update also removes some dead code that was adjacent to my update.

    Test Plan: This test plan covers the maintenance management of
               both files to ensure they always align and the expected
               behavior exists.

    PASS: Verify AIO DX Install.
    PASS: Verify Storage System Install.
    PASS: Verify Swact back and forth.
    PASS: Verify mtcClient and mtcAgent logging.
    PASS: Verify node lock/unlock soak.

    Non-volatile (Nv) node locked management test cases:

    PASS: Verify Nv node locked file is present when a node is locked.
          Confirmed on all node types.
    PASS: Verify any system node install comes up locked with both node
          locked flag files present.
    PASS: Verify mtcClient logs when a node is locked and unlocked.
    PASS: Verify Nv node locked file present/absent state mirrors the
          already existing /var/run/.node_locked flag file.
    PASS: Verify node locked file is present on controller-0 during
          ansible run following initial install and removed as part
          of the self-unlock.
    PASS: Verify the Nv node locked file is removed over the unlock
          along with the administrative state change prior to the
          unlock reboot.
    PASS: Verify both node locked files are always present or absent
          together.
    PASS: Verify node locked file management while the manag...
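
The paired flag file management described in this commit message can be sketched as follows. This is only an illustration, not the mtcAgent/mtcClient code: the function names are hypothetical, and the demo uses a temporary directory in place of the real /var/run/.node_locked and /etc/mtc/tmp/.node_locked paths.

```python
import tempfile
from pathlib import Path


def set_node_locked(locked: bool, volatile: Path, backup: Path) -> None:
    """Create or remove both flag files so they are present or absent together."""
    for flag in (volatile, backup):
        if locked:
            flag.parent.mkdir(parents=True, exist_ok=True)
            flag.touch(exist_ok=True)
        else:
            flag.unlink(missing_ok=True)  # requires Python 3.8+


def flags_in_sync(volatile: Path, backup: Path) -> bool:
    """True when both files exist or both are absent."""
    return volatile.exists() == backup.exists()


def simulate_reboot(volatile: Path) -> None:
    """Stand-in for a host reboot: only the volatile file is lost."""
    volatile.unlink(missing_ok=True)


if __name__ == "__main__":
    root = Path(tempfile.mkdtemp())
    volatile = root / "var" / "run" / ".node_locked"        # stand-in path
    backup = root / "etc" / "mtc" / "tmp" / ".node_locked"  # stand-in path

    set_node_locked(True, volatile, backup)
    print("in sync after lock:", flags_in_sync(volatile, backup))

    simulate_reboot(volatile)
    print("backup survives reboot:", backup.exists())
```

The point of the backup is visible in the last step: after the simulated reboot the volatile file is gone, but the locked state can still be read from the non-volatile copy.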


OpenStack Infra (hudson-openstack) wrote : Fix merged to ha (master)

Reviewed: https://review.opendev.org/c/starlingx/ha/+/907623
Committed: https://opendev.org/starlingx/ha/commit/23d0d8ab2f3225f10594547c5f8a67c409f815a0
Submitter: "Zuul (22348)"
Branch: master

commit 23d0d8ab2f3225f10594547c5f8a67c409f815a0
Author: Eric MacDonald <email address hidden>
Date: Fri Feb 2 16:52:50 2024 +0000

    Add node locked gate to SM enable

    Service Management (SM) sometimes selects and activates services on a
    locked controller following a dead office recovery.

    This update adds a node locked check to SM's enable handler that
    blocks enable if the node locked file is present, much like the
    existing goenabled check in the same function blocks enable if the
    goenabled file is not present.

    The enable gate file is /etc/mtc/tmp/.node_locked on the local host.

    Maintenance manages the presence or absence of this file based on
    the node's administrative state.

    This update also cleans up some extra whitespace in the changed file.

    Test Plan:

    PASS: Verify system build.
    PASS: Verify AIO DX install.
    PASS: Verify Standard DX system install with worker and storage.

    For Both 'AIO DX' and 'Standard DX with worker and storage':

    PASS: Verify SM does not activate on a locked controller.
    PASS: ... DOR case
    PASS: ... Uncontrolled Swact case
    PASS: Verify Standard DX behavior over DOR with one locked controller
          while the only unlocked controller does not recover.
    PASS: Verify behavior after above test case once the only unlocked
          controller does recover.
    PASS: Verify lock of the standby controller and its sm logs
    PASS: Verify manually creating the new Nv locked file on the active
          controller will cause SM to go disabled and shut down all
          services on that controller.
          ... If there is another unlocked controller then verify it
              takes over as an uncontrolled swact.
          ... If there is no unlocked standby controller then verify SM
              remains shut down until the manually created Nv node locked
              file is removed, at which point SM proceeds to activate
              services on that controller again.

    Regression:

    PASS: Verify controlled swact with unlocked enabled standby.
    PASS: Verify uncontrolled swact with unlocked enabled standby.
    PASS: Verify standby controller lock/unlock soak loop (10).
    PASS: Verify swact loop soak (10).
    PASS: Verify no crash or core dumps.

    Closes-Bug: 2051578
    Change-Id: I0f0e3d199586513ddce484fdcc056e1b2562b45f
    Signed-off-by: Eric MacDonald <email address hidden>
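
The two gates described in this commit message can be sketched as a single check. This is a minimal sketch and not SM's actual enable handler: the function name is hypothetical, and the goenabled path is passed in as a parameter because its location is not given in this report.

```python
import tempfile
from pathlib import Path


def enable_allowed(goenabled: Path, node_locked: Path) -> bool:
    """Decide whether enable may proceed, based on two flag files."""
    # Existing gate: block enable while the goenabled file is NOT present.
    if not goenabled.exists():
        return False
    # New gate: block enable while the node locked file IS present.
    if node_locked.exists():
        return False
    return True


if __name__ == "__main__":
    tmp = Path(tempfile.mkdtemp())
    goenabled = tmp / "goenabled"       # stand-in path (assumption)
    node_locked = tmp / ".node_locked"  # stands in for /etc/mtc/tmp/.node_locked

    goenabled.touch()
    print("unlocked host:", enable_allowed(goenabled, node_locked))
    node_locked.touch()
    print("locked host:", enable_allowed(goenabled, node_locked))
```

The two checks are deliberately opposite in sense: one file must exist for enable to proceed, the other must not.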

Changed in starlingx:
status: In Progress → Fix Released
Ghada Khalil (gkhalil)
tags: added: stx.9.0 stx.metal
Changed in starlingx:
assignee: nobody → Eric MacDonald (rocksolidmtce)
importance: Undecided → Medium
Ghada Khalil (gkhalil) wrote:

The above change was reverted as it caused a sanity issue related to AIO-SX lock/unlock.
Revert: https://review.opendev.org/c/starlingx/ha/+/909836

Changed in starlingx:
status: Fix Released → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix proposed to ha (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/ha/+/910227

OpenStack Infra (hudson-openstack) wrote : Fix merged to ha (master)

Reviewed: https://review.opendev.org/c/starlingx/ha/+/910227
Committed: https://opendev.org/starlingx/ha/commit/91fa44188cd6fa24645e958550cf3c9c0ca3e654
Submitter: "Zuul (22348)"
Branch: master

commit 91fa44188cd6fa24645e958550cf3c9c0ca3e654
Author: Eric MacDonald <email address hidden>
Date: Fri Feb 2 16:52:50 2024 +0000

    Add node locked gate to SM enable for DX systems

    Service Management (SM) sometimes selects and activates services on a
    locked controller following a dead office recovery.

    This update adds a node locked check to SM's enable handler that
    blocks enable if the node locked file is present, much like the
    existing goenabled check in the same function blocks enable if the
    goenabled file is not present.

    The enable gate file is /etc/mtc/tmp/.node_locked on the local host.

    Maintenance manages the presence or absence of this file based on
    the node's administrative state.

    This update also cleans up some extra whitespace in the changed file.

    Test Plan:

    PASS: Verify system build.
    PASS: Verify AIO SX install.
    PASS: Verify AIO DX install.
    PASS: Verify Standard DX system install with worker and storage.

    For Both 'AIO DX' and 'Standard DX with worker and storage':

    PASS: Verify SM does not activate on a locked DX controller.
    PASS: ... DOR case
    PASS: ... Uncontrolled Swact case
    PASS: Verify Standard DX behavior over DOR with one locked controller
          while the only unlocked controller does not recover.
    PASS: Verify behavior after above test case once the only unlocked
          controller does recover.
    PASS: Verify lock of the standby controller and its sm logs
    PASS: Verify manually creating the new Nv locked file on the active
          controller will cause SM to go disabled and shut down all
          services on that controller.
          ... If there is another unlocked controller then verify it
              takes over as an uncontrolled swact.
          ... If there is no unlocked standby controller then verify SM
              remains shut down until the manually created Nv node locked
              file is removed, at which point SM proceeds to activate
              services on that controller again.

    PASS: Verify SM ignores the node locked flag file for AIO SX systems.
    PASS: Verify lock/unlock of AIO SX controller.
    PASS: Verify original reported issue is resolved for AIO DX systems.

    Regression:

    PASS: Verify controlled swact with unlocked enabled standby.
    PASS: Verify uncontrolled swact with unlocked enabled standby.
    PASS: Verify standby controller lock/unlock soak loop (10).
    PASS: Verify swact loop soak (10).
    PASS: Verify no crash or core dumps.
    PASS: Verify SM logging

    Closes-Bug: 2051578
    Change-Id: If8e27ef30d62096fa77c3868f4d460b18e10ade2
    (cherry picked from commit 23d0d8ab2f3225f10594547c5f8a67c409f815a0)
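
The difference between this resubmitted change and the reverted one is that the node locked gate applies only to DX systems. A minimal sketch of that decision is shown below. The enum and function name are hypothetical, and how SM actually determines the system mode is not stated in this report.

```python
from enum import Enum


class SystemMode(Enum):
    SIMPLEX = "simplex"  # AIO SX
    DUPLEX = "duplex"    # AIO DX or Standard DX


def node_locked_blocks_enable(mode: SystemMode, node_locked_present: bool) -> bool:
    """The node locked flag file blocks enable only on duplex systems."""
    return mode is SystemMode.DUPLEX and node_locked_present


if __name__ == "__main__":
    for mode in SystemMode:
        for present in (False, True):
            blocked = node_locked_blocks_enable(mode, present)
            print(f"{mode.value:8} file_present={present!s:5} blocked={blocked}")
```

On a simplex system there is no other controller to take over, so ignoring the flag there keeps services running on the only controller during lock/unlock.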

Changed in starlingx:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix proposed to metal (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/metal/+/912278

OpenStack Infra (hudson-openstack) wrote : Fix merged to metal (master)

Reviewed: https://review.opendev.org/c/starlingx/metal/+/912278
Committed: https://opendev.org/starlingx/metal/commit/3c94b0e552ce7aeed29603a3476b7a7d07847f84
Submitter: "Zuul (22348)"
Branch: master

commit 3c94b0e552ce7aeed29603a3476b7a7d07847f84
Author: Eric MacDonald <email address hidden>
Date: Fri Mar 8 15:55:33 2024 +0000

    Avoid creating non-volatile node locked file while in simplex mode

    It is possible to lock controller-0 on a DX system before controller-1
    has been configured/enabled. Due to the following recent updates this
    can lead to SM disabling all controller services on that now locked
    controller-0 thereby preventing any subsequent controller-0 unlock
    attempts.

    https://review.opendev.org/c/starlingx/metal/+/907620
    https://review.opendev.org/c/starlingx/ha/+/910227

    This update modifies the mtce node locked flag file management so that
    the non-volatile node locked file (/etc/mtc/tmp/.node_locked) is only
    created on a locked host after controller-1 is installed, provisioned
    and configured.

    This prevents SM from shutting down if the administrator locks
    controller-0 before controller-1 is configured.

    Test Plan:

    PASS: Verify AIO DX Install.
    PASS: Verify Standard System Install.
    PASS: Verify Swact back and forth.
    PASS: Verify lock/unlock of controller-0 prior to controller-1 config
    PASS: Verify the non-volatile node locked flag file is not created
          while the /etc/platform/simplex file exists on the active
          controller.
    PASS: Verify lock and delete of controller-1 puts the system back
          into simplex mode where the non-volatile node locked flag file
          is once again not created if controller-0 is then unlocked.
    PASS: Verify an existing non-volatile node locked flag file is removed
          if present on a node that is locked without new persist option.
    PASS: Verify original reported issue is resolved for DX systems.

    Closes-Bug: 2051578
    Change-Id: I40e9dd77aa3e5b0dc03dca3b1d3d73153d8816be
    Signed-off-by: Eric MacDonald <email address hidden>
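
The simplex-aware flag management described in this commit message can be sketched as follows. This is only a sketch under assumptions: the function names are hypothetical, the "persist" decision is modeled as the absence of the simplex flag file named in the test plan, and the demo uses temporary stand-in paths rather than the real ones.

```python
import tempfile
from pathlib import Path


def lock_persist_option(simplex_flag: Path) -> bool:
    """Assumed rule: persist the locked state only when not in simplex mode."""
    return not simplex_flag.exists()


def apply_lock(volatile: Path, nonvolatile: Path, persist: bool) -> None:
    """Handle a lock: always set the volatile flag, manage the Nv flag by option."""
    volatile.parent.mkdir(parents=True, exist_ok=True)
    volatile.touch(exist_ok=True)
    if persist:
        nonvolatile.parent.mkdir(parents=True, exist_ok=True)
        nonvolatile.touch(exist_ok=True)
    else:
        # Locked without the persist option: remove any existing Nv file.
        nonvolatile.unlink(missing_ok=True)  # requires Python 3.8+


if __name__ == "__main__":
    root = Path(tempfile.mkdtemp())
    simplex_flag = root / "etc" / "platform" / "simplex"         # stand-in path
    volatile = root / "var" / "run" / ".node_locked"             # stand-in path
    nonvolatile = root / "etc" / "mtc" / "tmp" / ".node_locked"  # stand-in path

    # controller-1 not yet configured: the simplex flag file exists.
    simplex_flag.parent.mkdir(parents=True, exist_ok=True)
    simplex_flag.touch()
    apply_lock(volatile, nonvolatile, lock_persist_option(simplex_flag))
    print("Nv file created in simplex mode:", nonvolatile.exists())

    # controller-1 configured: the simplex flag file is gone.
    simplex_flag.unlink()
    apply_lock(volatile, nonvolatile, lock_persist_option(simplex_flag))
    print("Nv file created in duplex mode:", nonvolatile.exists())
```

Because SM gates enable on the non-volatile file only, leaving that file out in simplex mode is what keeps SM from shutting down services when controller-0 is locked before controller-1 is configured.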
