The hbsAgent manages this alarm. It raised it here:
2019-08-27T11:46:39.026 [89435.00045] controller-0 hbsAgent hbs nodeClass.cpp (7636) manage_pulse_flags : Warn : controller-0 sending pmon degrade event to maintenance
2019-08-27T11:46:39.027 [89435.00046] controller-0 hbsAgent alm alarm.cpp ( 146) alarm_ : Info : controller-0 pmond set major 200.006
... and it did try to clear it 11 seconds later by sending a message to the mtcalarm daemon, which does the actual work.
2019-08-27T11:46:50.267 [89435.00047] controller-0 hbsAgent alm alarm.cpp ( 146) alarm_ : Info : controller-0 pmond clear clear 200.006
In this case, the mtcalarm daemon got a failure from FM while trying to clear it.
2019-08-27T11:46:50.271 [89360.00011] localhost mtcalarmd --- alarmMgr.cpp ( 42) alarmMgr_manage_alarm : Info : Alarm: alarmid:200.006 hostname:controller-0 operation:clear severity:clear entity:pmond prefix:
2019-08-27T11:46:50.271 [89360.00012] localhost mtcalarmd --- alarmUtil.cpp ( 316) alarmUtil : Info : controller-0 clearing 200.006 host=controller-0.process=pmond alarm
2019-08-27T11:46:50.271 fmSocket.cpp(140): Socket Error: Failed to write to fd:(5), len:(526), rc:(-1), error:(Broken pipe)
2019-08-27T11:46:50.271 [89360.00013] localhost mtcalarmd --- alarmUtil.cpp ( 321) alarmUtil :Error : controller-0 failed to fm_clear_fault (rc:3)
2019-08-27T11:46:50.271 [89360.00014] localhost mtcalarmd --- alarmMgr.cpp ( 101) alarmMgr_manage_alarm :Error : controller-0 failed to clear alarm '200.006:pmond'
2019-08-27T11:46:50.271 [89360.00015] localhost mtcalarmd --- alarmInit.cpp ( 245) daemon_service_run : Warn : failed to handle alarm request (rc:1)
2019-08-27T11:46:50.271 [89360.00016] localhost mtcalarmd sig daemon_signal.cpp ( 222) daemon_signal_hdlr : Info : Received SIGPIPE
2019-08-27T11:47:48.193 [89360.00017] localhost mtcalarmd --- alarmInit.cpp ( 251) daemon_service_run : Warn : alarm request receive error ; thresholeded ; (11:Resource temporarily unavailable)
The mtcalarm daemon currently does not retry failed requests.
Adding retries is non-trivial. It would require adding a first-in, first-out message queue so that queued requests are replayed in their original order.
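
To make the idea concrete, here is a minimal sketch of what such a first-in, first-out retry queue could look like. It is not code from mtcalarmd: AlarmRequest, AlarmRetryQueue and the sender callback are hypothetical names, and the callback only stands in for whatever call actually talks to FM. The one property it tries to show is ordering: servicing stops at the first failed request, so a later "clear" can never overtake an earlier "set" for the same alarm.

```cpp
#include <cstddef>
#include <deque>
#include <functional>
#include <string>
#include <utility>

// Hypothetical request record; the fields mirror those printed in the
// mtcalarmd log line (alarmid, hostname, entity, operation).
struct AlarmRequest {
    std::string alarm_id;   // e.g. "200.006"
    std::string hostname;   // e.g. "controller-0"
    std::string entity;     // e.g. "pmond"
    std::string operation;  // "set" or "clear"
};

// Hypothetical FIFO retry queue. The sender callback is an assumption: it
// stands in for the real FM call and returns true when the request succeeded.
class AlarmRetryQueue {
public:
    using Sender = std::function<bool(const AlarmRequest&)>;

    explicit AlarmRetryQueue(Sender sender) : sender_(std::move(sender)) {}

    // Append a request to the tail of the queue.
    void enqueue(AlarmRequest request) { pending_.push_back(std::move(request)); }

    // Try to deliver queued requests from the head. Stop at the first
    // failure and keep that request at the head so order is preserved.
    // Returns the number of requests delivered in this pass.
    std::size_t service() {
        std::size_t delivered = 0;
        while (!pending_.empty()) {
            if (!sender_(pending_.front())) {
                break;  // leave it queued; retry on the next pass
            }
            pending_.pop_front();
            ++delivered;
        }
        return delivered;
    }

    // Number of requests still waiting for a successful delivery.
    std::size_t pending() const { return pending_.size(); }

private:
    Sender sender_;
    std::deque<AlarmRequest> pending_;
};
```

In a daemon, service() would presumably be called from the main loop on a timer, with some bound on queue depth or retry count. Those policies are left out here and are part of why the change is non-trivial.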