StarlingX

Bug #1841653
Comment #5

Eric MacDonald (rocksolidmtce) wrote on 2019-09-20:

The hbsAgent manages this alarm. It raised the alarm at 11:46:39:

2019-08-27T11:46:39.026 [89435.00045] controller-0 hbsAgent hbs nodeClass.cpp (7636) manage_pulse_flags : Warn : controller-0 sending pmon degrade event to maintenance
2019-08-27T11:46:39.027 [89435.00046] controller-0 hbsAgent alm alarm.cpp ( 146) alarm_ : Info : controller-0 pmond set major 200.006

... and did try to clear it 11 seconds later by sending a message to the mtcalarm daemon, which does the actual work:

2019-08-27T11:46:50.267 [89435.00047] controller-0 hbsAgent alm alarm.cpp ( 146) alarm_ : Info : controller-0 pmond clear clear 200.006

In this case the mtcalarm daemon got a failure from FM when it tried to clear the alarm:

2019-08-27T11:46:50.271 [89360.00011] localhost mtcalarmd --- alarmMgr.cpp ( 42) alarmMgr_manage_alarm : Info : Alarm: alarmid:200.006 hostname:controller-0 operation:clear severity:clear entity:pmond prefix:
2019-08-27T11:46:50.271 [89360.00012] localhost mtcalarmd --- alarmUtil.cpp ( 316) alarmUtil : Info : controller-0 clearing 200.006 host=controller-0.process=pmond alarm
2019-08-27T11:46:50.271 fmSocket.cpp(140): Socket Error: Failed to write to fd:(5), len:(526), rc:(-1), error:(Broken pipe)
2019-08-27T11:46:50.271 [89360.00013] localhost mtcalarmd --- alarmUtil.cpp ( 321) alarmUtil :Error : controller-0 failed to fm_clear_fault (rc:3)
2019-08-27T11:46:50.271 [89360.00014] localhost mtcalarmd --- alarmMgr.cpp ( 101) alarmMgr_manage_alarm :Error : controller-0 failed to clear alarm '200.006:pmond'
2019-08-27T11:46:50.271 [89360.00015] localhost mtcalarmd --- alarmInit.cpp ( 245) daemon_service_run : Warn : failed to handle alarm request (rc:1)
2019-08-27T11:46:50.271 [89360.00016] localhost mtcalarmd sig daemon_signal.cpp ( 222) daemon_signal_hdlr : Info : Received SIGPIPE
2019-08-27T11:47:48.193 [89360.00017] localhost mtcalarmd --- alarmInit.cpp ( 251) daemon_service_run : Warn : alarm request receive error ; thresholeded ; (11:Resource temporarily unavailable)
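
The "Broken pipe" write error and the "Received SIGPIPE" line are what POSIX reports when a process writes to a connection whose peer has gone away. The sketch below reproduces that condition in isolation. It is an illustration only: it uses a pipe as a stand-in for the FM socket, ignores SIGPIPE rather than installing a handler as mtcalarmd appears to, and the function name is hypothetical.

```cpp
#include <cerrno>
#include <csignal>
#include <unistd.h>

// Hypothetical helper: writes one byte to a pipe whose read end is already
// closed and returns the errno reported by write() (0 if the write succeeded).
// SIGPIPE is ignored so the failure surfaces as a return code, not a signal
// that terminates the process.
int errno_from_write_to_closed_pipe() {
    std::signal(SIGPIPE, SIG_IGN);

    int fds[2];
    if (pipe(fds) != 0) {
        return errno;  // could not create the pipe at all
    }

    close(fds[0]);  // the "peer" goes away before we write

    const char byte = 'x';
    const ssize_t rc = write(fds[1], &byte, 1);
    const int err = (rc == -1) ? errno : 0;

    close(fds[1]);
    return err;
}
```

On a POSIX system the returned value is EPIPE, whose message string is "Broken pipe", matching the fmSocket.cpp line in the log.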

The mtcalarm daemon currently does not retry failed requests.

Adding retries is non-trivial: it would require adding a first-in, first-out (FIFO) message queue.
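
To make the shape of such a queue concrete, here is a minimal sketch of a FIFO retry queue. It is not taken from the mtcalarmd sources: AlarmRequest, AlarmRetryQueue, the Sender callback and the max_attempts limit are all hypothetical names chosen for illustration. The key property is that a failed request stays at the head, so a later "set" can never overtake an earlier "clear" for the same alarm.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <functional>
#include <string>
#include <utility>

// Hypothetical stand-in for an alarm request; field names are illustrative.
struct AlarmRequest {
    std::string alarm_id;   // e.g. "200.006"
    std::string hostname;   // e.g. "controller-0"
    std::string entity;     // e.g. "pmond"
    std::string operation;  // e.g. "set" or "clear"
};

// FIFO retry queue: requests are delivered strictly in arrival order.
class AlarmRetryQueue {
public:
    // The sender returns true when the request was accepted downstream.
    using Sender = std::function<bool(const AlarmRequest&)>;

    AlarmRetryQueue(Sender send, int max_attempts)
        : send_(std::move(send)), max_attempts_(max_attempts) {
        assert(max_attempts_ > 0);
    }

    void enqueue(AlarmRequest req) { pending_.push_back(std::move(req)); }

    // Drain from the head until the queue is empty or a send fails.
    // Meant to be called when a request arrives and again from a retry timer.
    // Returns the number of requests delivered during this call.
    std::size_t service() {
        std::size_t delivered = 0;
        while (!pending_.empty()) {
            ++head_attempts_;
            if (send_(pending_.front())) {
                pending_.pop_front();
                head_attempts_ = 0;
                ++delivered;
                continue;
            }
            if (head_attempts_ >= max_attempts_) {
                // Give up on the head so it cannot block the queue forever.
                dropped_.push_back(std::move(pending_.front()));
                pending_.pop_front();
                head_attempts_ = 0;
            }
            break;  // stop here; the next service() call retries from the head
        }
        return delivered;
    }

    std::size_t pending() const { return pending_.size(); }
    const std::deque<AlarmRequest>& dropped() const { return dropped_; }

private:
    Sender send_;
    int max_attempts_;
    int head_attempts_ = 0;  // attempts made so far on the current head
    std::deque<AlarmRequest> pending_;
    std::deque<AlarmRequest> dropped_;
};
```

In a daemon, service() would be driven by the main loop, and the sender would wrap whatever call actually talks to FM. What to do with requests that exhaust their attempts (log, re-raise, audit later) is a policy decision this sketch leaves open by parking them in dropped().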
