2 computes in degraded state after DOR

Bug #1829289 reported by Nimalini Rasa on 2019-05-15
This bug affects 1 person
Affects: StarlingX
Importance: Undecided
Assigned to: Eric MacDonald

Bug Description

Brief Description
-----------------
After DOR (Dead Office Recovery), 2 out of 10 computes stuck in degraded state. The following alarms were found:
200.004 compute-3 experienced a service-affecting failure. Auto-recovery in progress. Manual Lock and Unlock may be required if auto-recovery is unsuccessful. host=compute-3 critical 2019-05-15T18:48:03.140007
 200.004 compute-7 experienced a service-affecting failure. Auto-recovery in progress. Manual Lock and Unlock may be required if auto-recovery is unsuccessful. host=compute-7 critical 2019-05-15T18:48:02.170012

Severity
--------
Major

Steps to Reproduce
------------------
On a 2+10 system with 500 pods, power down all nodes and then power all nodes back up.

Expected Behavior
------------------
Expected all nodes to recover after DOR.

Actual Behavior
----------------
2 compute nodes stuck in degraded.

Reproducibility
---------------
Seen once.

System Configuration
--------------------
2+10 system with 500 pods.

Branch/Pull Time/Commit
-----------------------
BUILD_ID="2019-05-09"

Last Pass
---------
First try.

Timestamp/Logs
--------------
Wed May 15 14:46:25 EDT 2019: login prompt on controllers after power on.

Test Activity
-------------
platform testing

Changed in starlingx:
assignee: nobody → Eric MacDonald (rocksolidmtce)
Eric MacDonald (rocksolidmtce) wrote :

Controller-1 came up after the DOR, and maintenance was in the process of recovering all nodes. compute-3 and compute-7 were the last ones being handled when an event caused a swact.

Here are the logs for compute-7; the same signature appears for compute-3.

2019-05-15T18:54:36.685 [57765.01021] controller-1 mtcAgent hbs nodeClass.cpp (1712) alarm_enabled_clear : Info : compute-7 enable alarm clear
2019-05-15T18:54:36.685 [57765.01022] controller-1 mtcAgent alm mtcAlarm.cpp ( 396) mtcAlarm_clear : Info : compute-7 clearing 'In-Service' alarm (200.004)
2019-05-15T18:54:36.685 fmAPI.cpp(492): Enqueue clear alarm request: alarm id (200.004), instant id (host=compute-7)
2019-05-15T18:54:36.685 [57765.01023] controller-1 mtcAgent alm mtcAlarm.cpp ( 396) mtcAlarm_clear : Info : compute-7 clearing 'Configuration' alarm (200.011)
2019-05-15T18:54:36.685 fmAPI.cpp(492): Enqueue clear alarm request: alarm id (200.011), instant id (host=compute-7)
2019-05-15T18:54:36.685 [57765.01024] controller-1 mtcAgent vim mtcVimApi.cpp ( 258) mtcVimApi_state_change : Info : compute-7 sending 'host' state change to vim (enabled)
2019-05-15T18:54:36.685 [57765.01025] controller-1 mtcAgent |-| mtcNodeHdlrs.cpp (1567) enable_handler : Info : compute-7 is ENABLED

These logs show that maintenance enqueued the Configuration and Enable alarm clears; this was followed by a failure to communicate with the VIM and FM:

2019-05-15T18:54:37.415 [57765.01045] controller-1 mtcAgent --- mtcWorkQueue.cpp ( 540) workQueue_process :Error : compute-3 mtcVimApi_state_change seq:16 Failed (rc:29) - (3 of 3) (work->done) (Critical:Yes) (Total Fails:1)
2019-05-15T18:54:37.534 fmAPI.cpp(131): Failed to connect to FM Manager.
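
To make the failure mode concrete, below is a minimal C++ sketch (hypothetical types and names, not mtcAgent or fmAPI code) of how a clear request that is only enqueued can be lost when the FM process on the active controller stops before the queue is serviced:

#include <iostream>
#include <queue>
#include <string>

// Hypothetical model of a pending alarm-clear request.
struct PendingClear {
    std::string alarm_id;   // e.g. "200.004"
    std::string entity;     // e.g. "host=compute-7"
};

// Hypothetical client that enqueues clears and services them later.
class FmClient {
public:
    bool connected = true;                // FM reachable on the active controller
    std::queue<PendingClear> pending;     // clears are enqueued, serviced later

    void enqueue_clear(const std::string& id, const std::string& entity) {
        pending.push({id, entity});       // mirrors "Enqueue clear alarm request"
    }

    void service_queue() {
        while (!pending.empty()) {
            if (!connected) {
                // FM stopped (e.g. during the swact): queued clears are never
                // delivered, so the alarm stays raised in the FM database.
                return;
            }
            std::cout << "cleared " << pending.front().alarm_id << " "
                      << pending.front().entity << "\n";
            pending.pop();
        }
    }
};

int main() {
    FmClient fm;
    fm.enqueue_clear("200.004", "host=compute-7");
    fm.enqueue_clear("200.011", "host=compute-7");
    fm.connected = false;  // uncontrolled swact stops FM on this controller
    fm.service_queue();    // nothing gets cleared; the alarms persist
    return 0;
}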

The last log in FM (below) is timestamped before the clear requests above, which shows that FM was stopped before these alarm clears could be serviced.

2019-05-15T18:54:34.881 fmMsgServer.cpp(414): Send response for create log, uuid:(6b224ca2-2d80-4481-a865-d12faaf23081) (0)

When maintenance came up on the other side, it loaded its alarms from FM and found the 200.004 alarms for compute-7 and compute-3. Loading the critical alarm for these hosts put them in the degraded state.
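
The effect on the new side can be modelled with another small, purely illustrative C++ sketch (hypothetical structures, not the real nodeClass code): the newly active mtcAgent reads the persisted alarms back from FM and, finding a critical 200.004 against an enabled host, flags that host degraded with nothing left to re-evaluate the condition.

#include <iostream>
#include <map>
#include <string>
#include <vector>

struct Alarm { std::string id; std::string host; std::string severity; };
struct Host  { bool enabled = true; bool degraded = false; };

// Alarms the new active mtcAgent reads back from FM after the swact
// (illustrative data matching this bug's symptoms).
std::vector<Alarm> load_alarms_from_fm() {
    return { {"200.004", "compute-3", "critical"},
             {"200.004", "compute-7", "critical"} };
}

int main() {
    std::map<std::string, Host> hosts = { {"compute-3", {}}, {"compute-7", {}} };

    for (const Alarm& a : load_alarms_from_fm()) {
        if (a.id == "200.004" && a.severity == "critical") {
            hosts[a.host].degraded = true;   // host is enabled but flagged degraded
        }
    }
    // No later audit re-checks enabled, healthy hosts against this stale alarm,
    // so compute-3 and compute-7 remain enabled-degraded until a lock/unlock.
    for (const auto& [name, h] : hosts)
        std::cout << name << (h.degraded ? " degraded\n" : " available\n");
    return 0;
}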

The issue is that the mtce in-service test handler did not take action to verify the health of the enabled hosts and clear the stale alarm and degrade condition.

The root cause is a race condition during an uncontrolled swact that was not handled or recovered properly on the new side, leading to a stuck alarm and a degrade condition.
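
One possible shape of the missing handling, sketched with hypothetical helpers (the real fix would belong in the mtcAgent in-service test handler and would use the actual FM clear path): if a host is enabled and passing its in-service tests but a stale critical 200.004 alarm is still asserted, retry the clear and drop the degrade condition.

#include <iostream>

// Hypothetical host state; not the real mtce node structure.
struct HostState {
    bool enabled;
    bool insv_tests_passing;
    bool degraded;
    bool alarm_200_004_asserted;
};

// Hypothetical stand-in for re-issuing the FM clear request.
void clear_enable_alarm(HostState& host) {
    host.alarm_200_004_asserted = false;
}

// Audit the in-service test handler could run each pass.
void insv_test_audit(HostState& host) {
    if (host.enabled && host.insv_tests_passing && host.alarm_200_004_asserted) {
        clear_enable_alarm(host);  // retry the clear lost during the swact
        host.degraded = false;     // recover without a manual lock/unlock
    }
}

int main() {
    HostState compute7 = { true, true, true, true };  // enabled-degraded, stale alarm
    insv_test_audit(compute7);
    std::cout << std::boolalpha << "degraded=" << compute7.degraded << "\n";  // degraded=false
    return 0;
}

Today, a lock/unlock of the host performs an equivalent reset, which is why the manual recovery works.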

Eric MacDonald (rocksolidmtce) wrote :

To reproduce this issue, a host must be enabling while alarm 200.004 is asserted, and a swact must occur before the enable completes. Mtce on the new side reads the alarm, puts the host into the degraded state, and never takes action to deal with it.

Recovery is by lock/unlock of the affected host(s).

Numan Waheed (nwaheed) on 2019-05-17
tags: added: stx.retestneeded