200.006 alarm "(pmond) process is not running or functioning properly" not cleared after hbsClient killed and recovered
Affects: StarlingX
Status: Fix Released
Importance: Medium
Assigned to: Eric MacDonald
Bug Description
Brief Description
-----------------
After hbsClient kill-and-recover testing was done, the following 200.006 alarm was not cleared: controller-0 "'Process Monitor' (pmond) process is not running or functioning properly. The system is trying to recover this process."
Severity
--------
Major
Steps to Reproduce
------------------
1. Kill the hbsClient process.
2. Wait for hbsClient to be recovered (respawned by pmond).
3. Check the alarm list with 'fm alarm-list'.
TC-name: mtc/test_
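The steps above can be sketched roughly as below. This is an illustrative sketch, not the actual test case: `wait_for_respawn` is a hypothetical helper, and it assumes `pgrep` is available and that pmond respawns the killed process.

```shell
#!/bin/bash
# Wait up to $3 seconds for process $1 to be running again with a PID
# different from $2 (the instance we killed); print the new PID.
wait_for_respawn() {
    local name=$1 old_pid=$2 timeout=${3:-60} pid i
    for ((i = 0; i < timeout * 10; i++)); do
        pid=$(pgrep -n -x "$name" 2>/dev/null)
        if [ -n "$pid" ] && [ "$pid" != "$old_pid" ]; then
            echo "$pid"
            return 0
        fi
        sleep 0.1
    done
    return 1
}

# Intended use on the controller (only meaningful in the lab):
#   old=$(pgrep -n -x hbsClient)
#   kill -9 "$old"
#   wait_for_respawn hbsClient "$old" 60   # pmond should respawn it
#   fm alarm-list                          # 200.006 should be gone
```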
Expected Behavior
------------------
The 200.006 alarm is cleared after the test completes.
Actual Behavior
----------------
The 200.006 alarm was not cleared.
Reproducibility
---------------
Seen once
System Configuration
-------
One node system
Lab-name: SM-3
Branch/Pull Time/Commit
-------
stx master as of 2019-08-26_20-59-00
Last Pass
---------
2019-08-24_20-59-00
Timestamp/Logs
--------------
[2019-08-27 11:44:25,041] 771 DEBUG MainThread test_process_
[2019-08-27 11:44:25,041] 466 DEBUG MainThread ssh.exec_cmd:: Executing command...
[2019-08-27 11:44:25,041] 301 DEBUG MainThread ssh.send :: Send 'fm --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://
[2019-08-27 11:44:26,954] 423 DEBUG MainThread ssh.expect :: Output:
+------
| Time Stamp | State | Event Log ID | Reason Text | Entity Instance ID | Severity |
+------
| 2019-08-
| 2019-08-
+------
controller-0:~$
[2019-08-27 11:44:26,954] 301 DEBUG MainThread ssh.send :: Send 'echo $?'
[2019-08-27 11:44:27,057] 423 DEBUG MainThread ssh.expect :: Output:
0
controller-0:~$
[2019-08-27 11:44:27,057] 800 INFO MainThread test_process_
[2019-08-27 11:44:27,057] 809 INFO MainThread test_process_
true; n=1; last_pid=''; pid=''; for((;n<4;)); do pid=\$(cat /var/run/
if [ "x\$pid" = "x" -o "\$pid" = "\$last_pid" ]; then echo "stale or empty PID:\$pid, last_pid=
if [ \$? -eq 0 ]; then echo "OK \$n - \$pid killed"; ((n++)); last_pid=\$pid; pid=''; sleep 15;
[2019-08-27 11:44:27,629] 301 DEBUG MainThread ssh.send :: Send 'nohup ./test_process.sh > ./results.txt 2>&1 &'
[2019-08-27 11:44:27,738] 423 DEBUG MainThread ssh.expect :: Output:
[1] 536347
[2019-08-27 12:00:04,321] 301 DEBUG Monitor-
[2019-08-27 12:00:04,424] 423 DEBUG Monitor-
89983
[2019-08-27 12:00:04,638] 301 DEBUG Monitor-
[2019-08-27 12:00:04,771] 423 DEBUG Monitor-
PID TTY TIME CMD
89983 ? 00:00:00 hbsClient
[2019-08-27 12:05:53,988] 301 DEBUG MainThread ssh.send :: Send 'fm --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://
[2019-08-27 12:05:55,554] 423 DEBUG MainThread ssh.expect :: Output:
+------
| UUID | Alarm ID | Reason Text | Entity ID | Severity | Time Stamp |
+------
| 1ff85c57-
| a1d8e7a5-
+------
controller-0:~$
Test Activity
-------------
Sanity
tags: added: stx.retestneeded
Changed in starlingx:
importance: Undecided → Medium
status: New → Triaged
assignee: nobody → Eric MacDonald (rocksolidmtce)
tags: added: stx.3.0 stx.metal
The hbsAgent manages this alarm:
2019-08-27T11:46:39.026 [89435.00045] controller-0 hbsAgent hbs nodeClass.cpp (7636) manage_pulse_flags : Warn : controller-0 sending pmon degrade event to maintenance
2019-08-27T11:46:39.027 [89435.00046] controller-0 hbsAgent alm alarm.cpp ( 146) alarm_ : Info : controller-0 pmond set major 200.006
... and it did try to clear the alarm 11 seconds later by sending a message to the mtcalarm daemon, which does the work:
2019-08-27T11:46:50.267 [89435.00047] controller-0 hbsAgent alm alarm.cpp ( 146) alarm_ : Info : controller-0 pmond clear clear 200.006
In this case, mtcalarmd got a failure back from FM when trying to clear it:
2019-08-27T11:46:50.271 [89360.00011] localhost mtcalarmd --- alarmMgr.cpp ( 42) alarmMgr_manage_alarm : Info : Alarm: alarmid:200.006 hostname:controller-0 operation:clear severity:clear entity:pmond prefix:
2019-08-27T11:46:50.271 [89360.00012] localhost mtcalarmd --- alarmUtil.cpp ( 316) alarmUtil : Info : controller-0 clearing 200.006 host=controller-0.process=pmond alarm
2019-08-27T11:46:50.271 fmSocket.cpp(140): Socket Error: Failed to write to fd:(5), len:(526), rc:(-1), error:(Broken pipe)
2019-08-27T11:46:50.271 [89360.00013] localhost mtcalarmd --- alarmUtil.cpp ( 321) alarmUtil :Error : controller-0 failed to fm_clear_fault (rc:3)
2019-08-27T11:46:50.271 [89360.00014] localhost mtcalarmd --- alarmMgr.cpp ( 101) alarmMgr_manage_alarm :Error : controller-0 failed to clear alarm '200.006:pmond'
2019-08-27T11:46:50.271 [89360.00015] localhost mtcalarmd --- alarmInit.cpp ( 245) daemon_service_run : Warn : failed to handle alarm request (rc:1)
2019-08-27T11:46:50.271 [89360.00016] localhost mtcalarmd sig daemon_signal.cpp ( 222) daemon_signal_hdlr : Info : Received SIGPIPE
2019-08-27T11:47:48.193 [89360.00017] localhost mtcalarmd --- alarmInit.cpp ( 251) daemon_service_run : Warn : alarm request receive error ; thresholded ; (11:Resource temporarily unavailable)
The mtcalarm daemon currently does not retry failed FM requests.
Adding retries is non-trivial: it would require adding a first-in, first-out message queue so that failed requests are re-driven in order rather than dropped.
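As a rough illustration only (this is not StarlingX code; `fm_call` is a hypothetical stand-in for the real FM API such as `fm_clear_fault`), a first-in, first-out retry queue could work like this:

```shell
#!/bin/bash
# Hypothetical sketch of a FIFO retry queue for alarm requests.
# fm_call stands in for the real FM API call; here it "succeeds" only
# when a flag file exists, mimicking FM coming back up after a
# broken-pipe failure.
FM_UP_FLAG="/tmp/fm_available"

fm_call() {
    [ -e "$FM_UP_FLAG" ]
}

QUEUE=()    # pending alarm requests, oldest first

# Add a request (e.g. "clear 200.006:pmond") to the tail of the queue.
enqueue() {
    QUEUE+=("$1")
}

# Re-drive queued requests in arrival order; stop at the first failure
# so ordering is preserved and nothing is dropped.
retry_queue() {
    while [ ${#QUEUE[@]} -gt 0 ]; do
        if fm_call "${QUEUE[0]}"; then
            echo "delivered: ${QUEUE[0]}"
            QUEUE=("${QUEUE[@]:1}")   # pop the head
        else
            echo "FM unavailable; ${#QUEUE[@]} request(s) still queued"
            return 1
        fi
    done
    return 0
}
```

In the daemon, a failed fm_clear_fault would enqueue the request, and the main loop would call the retry pass on an audit interval instead of discarding the clear.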