Comment 3 for bug 1829289

Eric MacDonald (rocksolidmtce) wrote:

Controller-1 came up after the DOR (dead office recovery) and maintenance was in the process of recovering all nodes. compute-3 and compute-7 were the last two outstanding when an event caused a swact.

Here are the logs for compute-7; compute-3 shows the same signature.

2019-05-15T18:54:36.685 [57765.01021] controller-1 mtcAgent hbs nodeClass.cpp (1712) alarm_enabled_clear : Info : compute-7 enable alarm clear
2019-05-15T18:54:36.685 [57765.01022] controller-1 mtcAgent alm mtcAlarm.cpp ( 396) mtcAlarm_clear : Info : compute-7 clearing 'In-Service' alarm (200.004)
2019-05-15T18:54:36.685 fmAPI.cpp(492): Enqueue clear alarm request: alarm id (200.004), instant id (host=compute-7)
2019-05-15T18:54:36.685 [57765.01023] controller-1 mtcAgent alm mtcAlarm.cpp ( 396) mtcAlarm_clear : Info : compute-7 clearing 'Configuration' alarm (200.011)
2019-05-15T18:54:36.685 fmAPI.cpp(492): Enqueue clear alarm request: alarm id (200.011), instant id (host=compute-7)
2019-05-15T18:54:36.685 [57765.01024] controller-1 mtcAgent vim mtcVimApi.cpp ( 258) mtcVimApi_state_change : Info : compute-7 sending 'host' state change to vim (enabled)
2019-05-15T18:54:36.685 [57765.01025] controller-1 mtcAgent |-| mtcNodeHdlrs.cpp (1567) enable_handler : Info : compute-7 is ENABLED

These logs show that maintenance enqueued the Configuration and Enable alarm clears, which were then followed by failures to communicate with the VIM and FM.

2019-05-15T18:54:37.415 [57765.01045] controller-1 mtcAgent --- mtcWorkQueue.cpp ( 540) workQueue_process :Error : compute-3 mtcVimApi_state_change seq:16 Failed (rc:29) - (3 of 3) (work->done) (Critical:Yes) (Total Fails:1)
2019-05-15T18:54:37.534 fmAPI.cpp(131): Failed to connect to FM Manager.

The last log in FM (below) is timestamped before the alarm clears above, showing that FM was stopped before those clear requests could be serviced.

2019-05-15T18:54:34.881 fmMsgServer.cpp(414): Send response for create log, uuid:(6b224ca2-2d80-4481-a865-d12faaf23081) (0)

When maintenance came up on the other controller it loaded its alarms from FM and found the 200.004 alarms for compute-7 and compute-3. Loading this critical alarm for these hosts put them into the degraded state.

The issue is that the mtce in-service test handler did not take action to verify the enabled host's health and clear the stale alarm and degrade condition.

The root cause is a race condition during an uncontrolled swact: the in-flight alarm clears were lost and the newly active side did not handle or recover from this, leaving a stuck alarm and a degrade condition.
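
To illustrate the kind of recovery that is missing, below is a minimal C++ sketch (not the actual mtce code) of an in-service audit that re-clears alarms which should not be present on an enabled host. The Host struct and the fm_clear_alarm() / in_service_alarm_audit() names are hypothetical stand-ins; only the 200.004/200.011 alarm IDs come from the logs above.

#include <iostream>
#include <set>
#include <string>
#include <vector>

// Alarm IDs taken from the logs above.
static const std::string ALARM_ENABLE = "200.004"; // 'In-Service' alarm
static const std::string ALARM_CONFIG = "200.011"; // 'Configuration' alarm

// Hypothetical stand-in for the per-host state mtcAgent tracks.
struct Host {
    std::string name;
    bool enabled;                 // maintenance enable state
    bool degraded;                // degrade driven by a loaded alarm
    std::set<std::string> alarms; // alarms currently asserted in FM
};

// Stand-in for an FM clear request; in the real code this is enqueued
// on a work queue and can be lost across an uncontrolled swact.
void fm_clear_alarm(Host &host, const std::string &alarm_id)
{
    host.alarms.erase(alarm_id);
    std::cout << host.name << " clearing alarm " << alarm_id << "\n";
}

// In-service audit: if a host is enabled but FM still shows alarms that
// only apply to a disabled/failed host, re-issue the clears and drop
// the degrade condition. This is the kind of check the in-service test
// handler was not performing.
void in_service_alarm_audit(Host &host)
{
    if (!host.enabled)
        return;

    for (const auto &stale : {ALARM_ENABLE, ALARM_CONFIG}) {
        if (host.alarms.count(stale))
            fm_clear_alarm(host, stale);
    }
    if (host.degraded && host.alarms.empty()) {
        host.degraded = false;
        std::cout << host.name << " degrade cleared" << "\n";
    }
}

int main()
{
    // Post-swact state from this bug: hosts are enabled but the newly
    // active controller loaded the stale 200.004 alarms from FM.
    std::vector<Host> hosts = {
        {"compute-3", true, true, {ALARM_ENABLE}},
        {"compute-7", true, true, {ALARM_ENABLE}},
    };

    for (auto &host : hosts)
        in_service_alarm_audit(host);
    return 0;
}

In the real fix such an audit would run periodically from the in-service test handler, so any clears dropped across the swact eventually get re-issued and the degrade is removed.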