2 computes in degraded state after DOR

Bug #1829289 reported by Nimalini Rasa
Affects: StarlingX
Status: Fix Released
Importance: Medium
Assigned to: Bin Qian

Bug Description

Brief Description
-----------------
After DOR (Dead Office Recovery), 2 out of 10 computes stuck in degraded state. The following alarms were found:
200.004 compute-3 experienced a service-affecting failure. Auto-recovery in progress. Manual Lock and Unlock may be required if auto-recovery is unsuccessful. host=compute-3 critical 2019-05-15T18:48:03.140007
 200.004 compute-7 experienced a service-affecting failure. Auto-recovery in progress. Manual Lock and Unlock may be required if auto-recovery is unsuccessful. host=compute-7 critical 2019-05-15T18:48:02.170012

Severity
--------
Major

Steps to Reproduce
------------------
On a 2+10 system with 500 pods, power down all nodes and then power all nodes back up.

Expected Behavior
------------------
Expected all nodes to recover after DOR.

Actual Behavior
----------------
2 compute nodes were stuck in the degraded state.

Reproducibility
---------------
Seen once.

System Configuration
--------------------
2+10 system with 500 pods.

Branch/Pull Time/Commit
-----------------------
BUILD_ID="2019-05-09"

Last Pass
---------
First try.

Timestamp/Logs
--------------
Wed May 15 14:46:25 EDT 2019 login prompt of controllers after power on

Test Activity
-------------
platform testing

Revision history for this message
Nimalini Rasa (nrasa) wrote :
Changed in starlingx:
assignee: nobody → Eric MacDonald (rocksolidmtce)
Revision history for this message
Eric MacDonald (rocksolidmtce) wrote :

Controller-1 came up after the DOR, and maintenance was in the process of recovering all nodes. compute-3 and compute-7 were the last remaining when an event caused a swact.

Here are the logs for compute-7; the same signature appears for compute-3.

2019-05-15T18:54:36.685 [57765.01021] controller-1 mtcAgent hbs nodeClass.cpp (1712) alarm_enabled_clear : Info : compute-7 enable alarm clear
2019-05-15T18:54:36.685 [57765.01022] controller-1 mtcAgent alm mtcAlarm.cpp ( 396) mtcAlarm_clear : Info : compute-7 clearing 'In-Service' alarm (200.004)
2019-05-15T18:54:36.685 fmAPI.cpp(492): Enqueue clear alarm request: alarm id (200.004), instant id (host=compute-7)
2019-05-15T18:54:36.685 [57765.01023] controller-1 mtcAgent alm mtcAlarm.cpp ( 396) mtcAlarm_clear : Info : compute-7 clearing 'Configuration' alarm (200.011)
2019-05-15T18:54:36.685 fmAPI.cpp(492): Enqueue clear alarm request: alarm id (200.011), instant id (host=compute-7)
2019-05-15T18:54:36.685 [57765.01024] controller-1 mtcAgent vim mtcVimApi.cpp ( 258) mtcVimApi_state_change : Info : compute-7 sending 'host' state change to vim (enabled)
2019-05-15T18:54:36.685 [57765.01025] controller-1 mtcAgent |-| mtcNodeHdlrs.cpp (1567) enable_handler : Info : compute-7 is ENABLED

This shows that maintenance enqueued the Configuration and Enable alarm clears, followed by a failure to communicate with the VIM and FM.

2019-05-15T18:54:37.415 [57765.01045] controller-1 mtcAgent --- mtcWorkQueue.cpp ( 540) workQueue_process :Error : compute-3 mtcVimApi_state_change seq:16 Failed (rc:29) - (3 of 3) (work->done) (Critical:Yes) (Total Fails:1)
2019-05-15T18:54:37.534 fmAPI.cpp(131): Failed to connect to FM Manager.

The last log in FM (below) is dated before these alarm clears, showing that FM was stopped before they could be serviced.

2019-05-15T18:54:34.881 fmMsgServer.cpp(414): Send response for create log, uuid:(6b224ca2-2d80-4481-a865-d12faaf23081) (0)

When maintenance came up on the other side, it loaded its alarms from FM and found the 200.004 alarms for compute-7 and compute-3. Loading the critical alarm for these hosts put them in the degraded state.

The issue is that the mtce in-service test handler did not take action to verify the Enable state health and clear the condition.

The issue is caused by a race condition during an uncontrolled swact that was not handled or recovered properly on the new side, leading to a stuck alarm and a degraded condition.
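The sequence can be modelled roughly as the following minimal Python sketch (illustrative only, not the actual mtcAgent/fm-mgr code; the class and function names are invented for this example): the clear request is issued, fm-mgr is stopped by the swact before servicing it, and the newly active controller re-reads the stale critical alarm and degrades the host with nothing left to clear it.

class FakeFM:
    """Stand-in for fm-mgr's alarm store; 'running' models the swact stopping it."""
    def __init__(self):
        self.alarms = {}          # (alarm_id, host) -> severity
        self.running = True

    def raise_alarm(self, alarm_id, host, severity):
        self.alarms[(alarm_id, host)] = severity

    def clear_alarm(self, alarm_id, host):
        if not self.running:
            # mirrors "Failed to connect to FM Manager." in the logs above
            raise ConnectionError("Failed to connect to FM Manager.")
        self.alarms.pop((alarm_id, host), None)


def mtc_agent_startup(fm):
    """New active controller: load alarms from FM and degrade any host that
    still carries a critical 200.004 alarm, with no follow-up action."""
    degraded = set()
    for (alarm_id, host), severity in fm.alarms.items():
        if alarm_id == "200.004" and severity == "critical":
            degraded.add(host)
    return degraded


fm = FakeFM()
fm.raise_alarm("200.004", "compute-7", "critical")   # raised while the host was failed

fm.running = False                                   # uncontrolled swact stops fm-mgr first
try:
    fm.clear_alarm("200.004", "compute-7")           # mtcAgent's clear request is lost
except ConnectionError as err:
    print(err)

print(mtc_agent_startup(fm))                         # {'compute-7'}: stuck degraded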

Revision history for this message
Eric MacDonald (rocksolidmtce) wrote :

To reproduce this issue, a host must be enabling while alarm 200.004 is asserted, and a swact must occur before the enable completes. Mtce on the new side reads the alarm, marks the host degraded, and never takes action to deal with it.

Recovery is by lock/unlock of the affected host(s).
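A rough sketch of that recovery follows, assuming the StarlingX 'system' CLI is reachable and platform credentials are already sourced; the host list and the fixed wait are placeholders for this example, not part of the report.

import subprocess
import time

AFFECTED_HOSTS = ["compute-3", "compute-7"]   # hosts stuck in degraded after the DOR

def run(cmd):
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

for host in AFFECTED_HOSTS:
    run(["system", "host-lock", host])
    time.sleep(60)    # crude wait for the lock; in practice poll 'system host-show'
    run(["system", "host-unlock", host])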

Numan Waheed (nwaheed)
tags: added: stx.retestneeded
Revision history for this message
Ghada Khalil (gkhalil) wrote :

Marking as release gating; the system does not fully recover after a dead office recovery test.

Changed in starlingx:
importance: Undecided → Medium
status: New → Triaged
tags: added: stx.2.0 stx.metal
Bin Qian (bqian20)
Changed in starlingx:
assignee: Eric MacDonald (rocksolidmtce) → Bin Qian (bqian20)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to ha (master)

Fix proposed to branch: master
Review: https://review.opendev.org/670603

Changed in starlingx:
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to ha (master)

Reviewed: https://review.opendev.org/670603
Committed: https://git.openstack.org/cgit/starlingx/ha/commit/?id=a729bbabc64596f9bfafd78d601d9129c7d3c614
Submitter: Zuul
Branch: master

commit a729bbabc64596f9bfafd78d601d9129c7d3c614
Author: Bin Qian <email address hidden>
Date: Fri Jul 12 13:00:09 2019 -0400

    Add mtc-agent service dependency to fm-mgr

    Add mtc-agent service dependency to fm-mgr to ensure mtc-agent shuts
    down before fm-mgr does.

    An issue was found where, in rare cases, a swact occurs while mtc-agent
    tries to clear an alarm after fm-mgr has already been disabled, so the
    clear-alarm message is lost and the alarm can therefore never be cleared.

    Closes-bug 1829289

    Change-Id: I39196d5f3ce764a14b4d1e0fb1a4f3344ddd6a1a
    Signed-off-by: Bin Qian <email address hidden>
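Conceptually, the added dependency means service management stops dependents before the services they depend on, so mtc-agent's final alarm clears are serviced before fm-mgr goes away. A minimal Python sketch of that ordering (illustrative only, not the actual SM implementation or its configuration format):

# service -> services it depends on at runtime (they must outlive it during shutdown)
DEPENDS_ON = {
    "mtc-agent": ["fm-mgr"],   # the dependency added by this fix
    "fm-mgr": [],
}

def shutdown_order(deps):
    """Stop dependents before the services they depend on (reverse topological order)."""
    order, seen = [], set()

    def visit(svc):
        if svc in seen:
            return
        seen.add(svc)
        # anything that depends on svc must stop first
        for other, needs in deps.items():
            if svc in needs:
                visit(other)
        order.append(svc)

    for svc in deps:
        visit(svc)
    return order

print(shutdown_order(DEPENDS_ON))   # ['mtc-agent', 'fm-mgr']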

Changed in starlingx:
status: In Progress → Fix Released
Revision history for this message
Anujeyan Manokeran (anujeyan) wrote :

This was verified on a 2+2+4 system with load "20190814T053000Z".

Revision history for this message
Anujeyan Manokeran (anujeyan) wrote :

Verified on a 2+2+8 system with load 2019-08-12_20-59-00 and 500 pods.

tags: removed: stx.retestneeded