2 computes in degraded state after DOR

Bug #1829289 reported by Nimalini Rasa
Affects: StarlingX
Status: Fix Released
Importance: Medium
Assigned to: Bin Qian

Bug Description

Brief Description
-----------------
After DOR (Dead Office Recovery), 2 out of 10 computes stuck in degraded state. The following alarms were found:
200.004 compute-3 experienced a service-affecting failure. Auto-recovery in progress. Manual Lock and Unlock may be required if auto-recovery is unsuccessful. host=compute-3 critical 2019-05-15T18:48:03.140007
 200.004 compute-7 experienced a service-affecting failure. Auto-recovery in progress. Manual Lock and Unlock may be required if auto-recovery is unsuccessful. host=compute-7 critical 2019-05-15T18:48:02.170012

Severity
--------
Major

Steps to Reproduce
------------------
On a 2+10 system with 500 pods, power down all nodes and then power all nodes back up.

Expected Behavior
------------------
Expected all nodes to recover after DOR.

Actual Behavior
----------------
2 compute nodes were stuck in the degraded state.

Reproducibility
---------------
Seen once.

System Configuration
--------------------
2+10 system with 500 pods.

Branch/Pull Time/Commit
-----------------------
BUILD_ID="2019-05-09"

Last Pass
---------
First try.

Timestamp/Logs
--------------
Wed May 15 14:46:25 EDT 2019 login prompt of controllers after power on

Test Activity
-------------
platform testing

Revision history for this message
Nimalini Rasa (nrasa) wrote :
Changed in starlingx:
assignee: nobody → Eric MacDonald (rocksolidmtce)
Revision history for this message
Eric MacDonald (rocksolidmtce) wrote :

Controller-1 came up after the DOR, and maintenance was in the process of recovering all nodes. compute-3 and compute-7 were the last remaining when an event caused a swact.

Here are the logs for compute-7; the same signature appears for compute-3.

2019-05-15T18:54:36.685 [57765.01021] controller-1 mtcAgent hbs nodeClass.cpp (1712) alarm_enabled_clear : Info : compute-7 enable alarm clear
2019-05-15T18:54:36.685 [57765.01022] controller-1 mtcAgent alm mtcAlarm.cpp ( 396) mtcAlarm_clear : Info : compute-7 clearing 'In-Service' alarm (200.004)
2019-05-15T18:54:36.685 fmAPI.cpp(492): Enqueue clear alarm request: alarm id (200.004), instant id (host=compute-7)
2019-05-15T18:54:36.685 [57765.01023] controller-1 mtcAgent alm mtcAlarm.cpp ( 396) mtcAlarm_clear : Info : compute-7 clearing 'Configuration' alarm (200.011)
2019-05-15T18:54:36.685 fmAPI.cpp(492): Enqueue clear alarm request: alarm id (200.011), instant id (host=compute-7)
2019-05-15T18:54:36.685 [57765.01024] controller-1 mtcAgent vim mtcVimApi.cpp ( 258) mtcVimApi_state_change : Info : compute-7 sending 'host' state change to vim (enabled)
2019-05-15T18:54:36.685 [57765.01025] controller-1 mtcAgent |-| mtcNodeHdlrs.cpp (1567) enable_handler : Info : compute-7 is ENABLED

This shows that maintenance enqueued the Configuration and Enable alarm clears, followed by a failure to communicate with the VIM and FM.

2019-05-15T18:54:37.415 [57765.01045] controller-1 mtcAgent --- mtcWorkQueue.cpp ( 540) workQueue_process :Error : compute-3 mtcVimApi_state_change seq:16 Failed (rc:29) - (3 of 3) (work->done) (Critical:Yes) (Total Fails:1)
2019-05-15T18:54:37.534 fmAPI.cpp(131): Failed to connect to FM Manager.

The last log in FM (below) is dated before these alarm clears, showing that FM was stopped before they could be serviced.

2019-05-15T18:54:34.881 fmMsgServer.cpp(414): Send response for create log, uuid:(6b224ca2-2d80-4481-a865-d12faaf23081) (0)

When maintenance came up on the other side, it loaded its alarms from FM and found the 200.004 alarms for compute-7 and compute-3. Loading the critical alarm for these hosts put them in the degraded state.

The issue is that the mtce in-service test handler did not take action to verify the Enable state health and clear the condition.

The issue is caused by a race condition during an uncontrolled swact that was not handled or recovered properly on the new side, leading to a stuck alarm and a degraded condition.
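The sequence can be modelled roughly as the following minimal Python sketch (illustrative only, not the actual mtcAgent/fm-mgr code; the class and function names are invented for this example): the clear request is issued, fm-mgr is stopped by the swact before servicing it, and the newly active controller re-reads the stale critical alarm and degrades the host with nothing left to clear it.

class FakeFM:
    """Stand-in for fm-mgr's alarm store; 'running' models the swact stopping it."""
    def __init__(self):
        self.alarms = {}          # (alarm_id, host) -> severity
        self.running = True

    def raise_alarm(self, alarm_id, host, severity):
        self.alarms[(alarm_id, host)] = severity

    def clear_alarm(self, alarm_id, host):
        if not self.running:
            # mirrors "Failed to connect to FM Manager." in the logs above
            raise ConnectionError("Failed to connect to FM Manager.")
        self.alarms.pop((alarm_id, host), None)


def mtc_agent_startup(fm):
    """New active controller: load alarms from FM and degrade any host that
    still carries a critical 200.004 alarm, with no follow-up action."""
    degraded = set()
    for (alarm_id, host), severity in fm.alarms.items():
        if alarm_id == "200.004" and severity == "critical":
            degraded.add(host)
    return degraded


fm = FakeFM()
fm.raise_alarm("200.004", "compute-7", "critical")   # raised while the host was failed

fm.running = False                                   # uncontrolled swact stops fm-mgr first
try:
    fm.clear_alarm("200.004", "compute-7")           # mtcAgent's clear request is lost
except ConnectionError as err:
    print(err)

print(mtc_agent_startup(fm))                         # {'compute-7'}: stuck degraded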

Revision history for this message
Eric MacDonald (rocksolidmtce) wrote :

To reproduce this issue, a host must be enabling while alarm 200.004 is asserted, and a swact must occur before the enable completes. Mtce on the new side reads the alarm, marks the host degraded, and never takes action to deal with it.

Recovery is by lock/unlock of the affected host(s).
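A rough sketch of that recovery follows, assuming the StarlingX 'system' CLI is reachable and platform credentials are already sourced; the host list and the fixed wait are placeholders for this example, not part of the report.

import subprocess
import time

AFFECTED_HOSTS = ["compute-3", "compute-7"]   # hosts stuck in degraded after the DOR

def run(cmd):
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

for host in AFFECTED_HOSTS:
    run(["system", "host-lock", host])
    time.sleep(60)    # crude wait for the lock; in practice poll 'system host-show'
    run(["system", "host-unlock", host])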

Numan Waheed (nwaheed)
tags: added: stx.retestneeded
Revision history for this message
Ghada Khalil (gkhalil) wrote :

Marking as release gating; the system does not fully recover after a dead office recovery test.

Changed in starlingx:
importance: Undecided → Medium
status: New → Triaged
tags: added: stx.2.0 stx.metal
Bin Qian (bqian20)
Changed in starlingx:
assignee: Eric MacDonald (rocksolidmtce) → Bin Qian (bqian20)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to ha (master)

Fix proposed to branch: master
Review: https://review.opendev.org/670603

Changed in starlingx:
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to ha (master)

Reviewed: https://review.opendev.org/670603
Committed: https://git.openstack.org/cgit/starlingx/ha/commit/?id=a729bbabc64596f9bfafd78d601d9129c7d3c614
Submitter: Zuul
Branch: master

commit a729bbabc64596f9bfafd78d601d9129c7d3c614
Author: Bin Qian <email address hidden>
Date: Fri Jul 12 13:00:09 2019 -0400

    Add mtc-agent service dependency to fm-mgr

    Add mtc-agent service dependency to fm-mgr to ensure mtc-agent shuts
    down before fm-mgr does.

    An issue was found where, in rare cases, a swact occurs while mtc-agent
    tries to clear an alarm after fm-mgr has already been disabled, so the
    clear-alarm message is lost and the alarm can therefore never be cleared.

    Closes-bug 1829289

    Change-Id: I39196d5f3ce764a14b4d1e0fb1a4f3344ddd6a1a
    Signed-off-by: Bin Qian <email address hidden>
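Conceptually, the added dependency means service management stops dependents before the services they depend on, so mtc-agent's final alarm clears are serviced before fm-mgr goes away. A minimal Python sketch of that ordering (illustrative only, not the actual SM implementation or its configuration format):

# service -> services it depends on at runtime (they must outlive it during shutdown)
DEPENDS_ON = {
    "mtc-agent": ["fm-mgr"],   # the dependency added by this fix
    "fm-mgr": [],
}

def shutdown_order(deps):
    """Stop dependents before the services they depend on (reverse topological order)."""
    order, seen = [], set()

    def visit(svc):
        if svc in seen:
            return
        seen.add(svc)
        # anything that depends on svc must stop first
        for other, needs in deps.items():
            if svc in needs:
                visit(other)
        order.append(svc)

    for svc in deps:
        visit(svc)
    return order

print(shutdown_order(DEPENDS_ON))   # ['mtc-agent', 'fm-mgr']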

Changed in starlingx:
status: In Progress → Fix Released
Revision history for this message
Anujeyan Manokeran (anujeyan) wrote :

This was verified on a 2+2+4 system with load "20190814T053000Z".

Revision history for this message
Anujeyan Manokeran (anujeyan) wrote :

Verified on a 2+2+8 system with load 2019-08-12_20-59-00 and 500 pods.

tags: removed: stx.retestneeded