2 computes in degraded state after DOR
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
StarlingX | Fix Released | Medium | Bin Qian |
Bug Description
Brief Description
-----------------
After DOR (Dead Office Recovery), 2 out of 10 computes were stuck in the degraded state. The following alarms were found:
200.004 compute-3 experienced a service-affecting failure. Auto-recovery in progress. Manual Lock and Unlock may be required if auto-recovery is unsuccessful. host=compute-3 critical 2019-05-
200.004 compute-7 experienced a service-affecting failure. Auto-recovery in progress. Manual Lock and Unlock may be required if auto-recovery is unsuccessful. host=compute-7 critical 2019-05-
Severity
--------
Major
Steps to Reproduce
------------------
On a 2+10 system with 500 pods, power down all nodes and then power all nodes back up.
Expected Behavior
------------------
Expected all nodes to recover after DOR.
Actual Behavior
----------------
2 compute nodes remained stuck in the degraded state.
Reproducibility
---------------
Seen once.
System Configuration
--------------------
2+10 system with 500 pods.
Branch/Pull Time/Commit
-----------------------
BUILD_ID=
Last Pass
---------
First try.
Timestamp/Logs
--------------
Wed May 15 14:46:25 EDT 2019: login prompt on the controllers after power on.
Test Activity
-------------
platform testing
Changed in starlingx:
assignee: nobody → Eric MacDonald (rocksolidmtce)
tags: added: stx.retestneeded
Changed in starlingx:
assignee: Eric MacDonald (rocksolidmtce) → Bin Qian (bqian20)
Controller-1 came up after the DOR and maintenance was in the process of handling recovery of all nodes. compute-3 and compute-7 were the last ones being recovered when an event caused a swact.
Here are the logs for compute-7; the same signature appears for compute-3.
2019-05-15T18:54:36.685 [57765.01021] controller-1 mtcAgent hbs nodeClass.cpp (1712) alarm_enabled_clear : Info : compute-7 enable alarm clear
2019-05-15T18:54:36.685 [57765.01022] controller-1 mtcAgent alm mtcAlarm.cpp ( 396) mtcAlarm_clear : Info : compute-7 clearing 'In-Service' alarm (200.004)
2019-05-15T18:54:36.685 fmAPI.cpp(492): Enqueue clear alarm request: alarm id (200.004), instant id (host=compute-7)
2019-05-15T18:54:36.685 [57765.01023] controller-1 mtcAgent alm mtcAlarm.cpp ( 396) mtcAlarm_clear : Info : compute-7 clearing 'Configuration' alarm (200.011)
2019-05-15T18:54:36.685 fmAPI.cpp(492): Enqueue clear alarm request: alarm id (200.011), instant id (host=compute-7)
2019-05-15T18:54:36.685 [57765.01024] controller-1 mtcAgent vim mtcVimApi.cpp ( 258) mtcVimApi_state_change : Info : compute-7 sending 'host' state change to vim (enabled)
2019-05-15T18:54:36.685 [57765.01025] controller-1 mtcAgent |-| mtcNodeHdlrs.cpp (1567) enable_handler : Info : compute-7 is ENABLED
This shows that maintenance enqueued the Configuration and Enable (In-Service) alarm clears, followed by a failure to communicate with the VIM and FM.
2019-05-15T18:54:37.415 [57765.01045] controller-1 mtcAgent --- mtcWorkQueue.cpp ( 540) workQueue_process :Error : compute-3 mtcVimApi_state_change seq:16 Failed (rc:29) - (3 of 3) (work->done) (Critical:Yes) (Total Fails:1)
2019-05-15T18:54:37.534 fmAPI.cpp(131): Failed to connect to FM Manager.
The last log from FM (below) is time-stamped before these alarm clears, showing that FM had already stopped before the clear requests could be serviced.
2019-05-15T18:54:34.881 fmMsgServer.cpp(414): Send response for create log, uuid:(6b224ca2-2d80-4481-a865-d12faaf23081) (0)
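For illustration, here is a minimal standalone sketch of this failure mode (hypothetical names; not the actual mtcAgent or fmAPI code): the alarm clears are only placed on an asynchronous work queue, and because FM is already down the queued clears are dropped, so the 200.004 alarms stay asserted in the FM database across the swact.

// Sketch only: simulates an alarm clear that is enqueued but never delivered
// because the FM service has already stopped. All names are hypothetical.
#include <iostream>
#include <queue>
#include <set>
#include <string>

struct AlarmRequest { std::string alarm_id; std::string host; bool clear; };

// Stand-in for the FM alarm database that persists across a swact.
std::set<std::string> fm_active_alarms = {"200.004/compute-7", "200.004/compute-3"};
bool fm_service_running = false;   // FM already stopped when the clears are queued

std::queue<AlarmRequest> work_queue;  // mtcAgent-style asynchronous work queue

void enqueue_clear(const std::string& id, const std::string& host) {
    work_queue.push({id, host, true});             // "Enqueue clear alarm request"
}

void process_work_queue() {
    while (!work_queue.empty()) {
        AlarmRequest req = work_queue.front();
        work_queue.pop();
        if (!fm_service_running) {
            // Equivalent of "Failed to connect to FM Manager": the clear is lost.
            std::cout << "clear " << req.alarm_id << " for " << req.host
                      << " dropped; FM unreachable\n";
            continue;
        }
        fm_active_alarms.erase(req.alarm_id + "/" + req.host);
    }
}

int main() {
    enqueue_clear("200.004", "compute-7");
    enqueue_clear("200.004", "compute-3");
    process_work_queue();  // runs just before the uncontrolled swact
    // The clears were never delivered, so the 200.004 alarms remain asserted in FM.
    std::cout << "alarms still asserted: " << fm_active_alarms.size() << "\n";
}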
When maintenance came up on the other side, it loaded its alarms from FM and found the 200.004 alarms for compute-7 and compute-3. Loading the critical alarm for these hosts put them in the degraded state.
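A minimal sketch of this assumed behavior on the newly active controller (hypothetical names; not the mtcAgent source): maintenance derives the host state from the alarms it reloads from FM, and a critical 200.004 in-service alarm alone is enough to mark the host degraded.

// Sketch only: maps a reloaded critical 200.004 alarm to a degraded host state.
#include <iostream>
#include <string>
#include <vector>

enum class HostState { ENABLED, DEGRADED };

struct Alarm { std::string id; std::string host; std::string severity; };

HostState state_from_alarms(const std::string& host, const std::vector<Alarm>& alarms) {
    for (const auto& a : alarms) {
        if (a.host == host && a.id == "200.004" && a.severity == "critical")
            return HostState::DEGRADED;   // the stale alarm alone degrades the host
    }
    return HostState::ENABLED;
}

int main() {
    // Alarms as reloaded from FM after the uncontrolled swact.
    std::vector<Alarm> reloaded = {
        {"200.004", "compute-3", "critical"},
        {"200.004", "compute-7", "critical"},
    };
    for (const std::string host : {"compute-3", "compute-7", "compute-5"}) {
        std::cout << host << ": "
                  << (state_from_alarms(host, reloaded) == HostState::DEGRADED
                          ? "degraded" : "enabled") << "\n";
    }
}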
The issue is that the mtce in-service test handler did not take action to verify the hosts' Enable state health and clear the stale alarm and degrade condition.
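A minimal sketch of the kind of audit being described (hypothetical structure; not the actual in-service test handler): if the host's in-service tests pass but a 200.004 alarm is still asserted, clear the stale alarm and drop the degrade condition.

// Sketch only: an in-service audit that recovers a host from a stale 200.004 alarm.
#include <iostream>
#include <set>
#include <string>

std::set<std::string> asserted_alarms = {"200.004/compute-7"};  // stale alarm
std::set<std::string> degraded_hosts  = {"compute-7"};

bool in_service_tests_pass(const std::string& /*host*/) { return true; }

void insv_test_handler(const std::string& host) {
    const std::string key = "200.004/" + host;
    if (in_service_tests_pass(host) && asserted_alarms.count(key)) {
        asserted_alarms.erase(key);   // clear the stale enable alarm in FM
        degraded_hosts.erase(host);   // and recover the degrade condition
        std::cout << host << ": stale 200.004 cleared, degrade removed\n";
    }
}

int main() {
    insv_test_handler("compute-7");
    std::cout << "degraded hosts remaining: " << degraded_hosts.size() << "\n";
}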
The root cause is a race condition during an uncontrolled swact that was not handled or recovered properly on the newly active controller, leaving a stuck alarm and a degrade condition on the affected hosts.