200.006 alarm "(pmond) process is not running or functioning properly" not cleared after hbsClient killed and recovered

Bug #1841653 reported by Peng Peng
This bug affects 1 person
Affects: StarlingX
Status: Fix Released
Importance: Medium
Assigned to: Eric MacDonald

Bug Description

Brief Description
-----------------
After hbsClient kill-and-recovery testing completed, the 200.006 alarm "controller-0 'Process Monitor' (pmond) process is not running or functioning properly. The system is trying to recover this process." was not cleared.

Severity
--------
Major

Steps to Reproduce
------------------
Kill the hbsClient process
Wait for the hbsClient process to recover
Check alarm-list (a rough standalone sketch of this flow follows the TC name below)

TC-name: mtc/test_process_monitoring.py::test_process_monitoring[hbsClient]
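
For reference, a minimal standalone sketch of this reproduction flow, assuming the fm CLI and hbsClient pidfile path shown in the logs below and admin credentials already sourced in the environment; the helper names and the 60-second wait are illustrative only and are not part of the automated test module:

import subprocess
import time

PID_FILE = "/var/run/hbsClient.pid"   # pmon pidfile for hbsClient, per the logs

def kill_hbsclient():
    """Kill hbsClient using the PID in its pidfile (must be run as root)."""
    pid = open(PID_FILE).read().split()[0]
    subprocess.run(["kill", "-9", pid], check=True)

def pmond_alarm_present():
    """Return True if a 200.006 pmond alarm is still shown by 'fm alarm-list'."""
    out = subprocess.run(["fm", "alarm-list", "--nowrap"],
                         capture_output=True, text=True).stdout
    return "200.006" in out and "process=pmond" in out

kill_hbsclient()
time.sleep(60)   # give pmon time to respawn hbsClient and for alarms to settle
assert not pmond_alarm_present(), "200.006 pmond alarm was not cleared"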

Expected Behavior
------------------
200.006 alarm cleared after testing

Actual Behavior
----------------
200.006 alarm not cleared

Reproducibility
---------------
Seen once

System Configuration
--------------------
One node system

Lab-name: SM-3

Branch/Pull Time/Commit
-----------------------
stx master as of 2019-08-26_20-59-00

Last Pass
---------
2019-08-24_20-59-00

Timestamp/Logs
--------------
[2019-08-27 11:44:25,041] 771 DEBUG MainThread test_process_monitoring.kill_pmon_process_and_verify_impact:: Kill process and verify system behavior for PMON process:hbsClient, impact=enabled-degraded, process_type=pmon
[2019-08-27 11:44:25,041] 466 DEBUG MainThread ssh.exec_cmd:: Executing command...
[2019-08-27 11:44:25,041] 301 DEBUG MainThread ssh.send :: Send 'fm --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.204.2:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne event-list -l 2 -q 'event_log_id=200.006' --nowrap --nopaging'
[2019-08-27 11:44:26,954] 423 DEBUG MainThread ssh.expect :: Output:
+----------------------------+-------+--------------+------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------+----------+
| Time Stamp | State | Event Log ID | Reason Text | Entity Instance ID | Severity |
+----------------------------+-------+--------------+------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------+----------+
| 2019-08-27T11:13:37.152211 | clear | 200.006 | controller-0 is degraded due to the failure of its 'pci-irq-affinity-agent' process. Auto recovery of this major process is in progress. | host=controller-0.process=pci-irq-affinity-agent | major |
| 2019-08-27T11:11:25.657507 | set | 200.006 | controller-0 is degraded due to the failure of its 'pci-irq-affinity-agent' process. Auto recovery of this major process is in progress. | host=controller-0.process=pci-irq-affinity-agent | major |
+----------------------------+-------+--------------+------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------+----------+
controller-0:~$
[2019-08-27 11:44:26,954] 301 DEBUG MainThread ssh.send :: Send 'echo $?'
[2019-08-27 11:44:27,057] 423 DEBUG MainThread ssh.expect :: Output:
0
controller-0:~$
[2019-08-27 11:44:27,057] 800 INFO MainThread test_process_monitoring.kill_pmon_process_and_verify_impact:: retries=4, interval=1, debounce=10, wait_each_kill=15
[2019-08-27 11:44:27,057] 809 INFO MainThread test_process_monitoring.kill_pmon_process_and_verify_impact:: Attempt to kill process:hbsClient on host:controller-0, cli:
true; n=1; last_pid=''; pid=''; for((;n<4;)); do pid=\$(cat /var/run/hbsClient.pid 2>/dev/null); date;
                if [ "x\$pid" = "x" -o "\$pid" = "\$last_pid" ]; then echo "stale or empty PID:\$pid, last_pid=\$last_pid";
                sleep 0.5; continue; fi; echo "Li69nux*" | sudo -S kill -9 \$pid &>/dev/null;
                if [ \$? -eq 0 ]; then echo "OK \$n - \$pid killed"; ((n++)); last_pid=\$pid; pid=''; sleep 15;
                else sleep 0.5; fi; done; echo \$pid

[2019-08-27 11:44:27,629] 301 DEBUG MainThread ssh.send :: Send 'nohup ./test_process.sh > ./results.txt 2>&1 &'
[2019-08-27 11:44:27,738] 423 DEBUG MainThread ssh.expect :: Output:
[1] 536347

[2019-08-27 12:00:04,321] 301 DEBUG Monitor-hbsClient--1 ssh.send :: Send 'cat /var/run/hbsClient.pid 2>/dev/null | head -n1 && echo 2>/dev/null'
[2019-08-27 12:00:04,424] 423 DEBUG Monitor-hbsClient--1 ssh.expect :: Output:
89983

[2019-08-27 12:00:04,638] 301 DEBUG Monitor-hbsClient--1 ssh.send :: Send 'ps -p 89983'
[2019-08-27 12:00:04,771] 423 DEBUG Monitor-hbsClient--1 ssh.expect :: Output:
    PID TTY TIME CMD
  89983 ? 00:00:00 hbsClient

[2019-08-27 12:05:53,988] 301 DEBUG MainThread ssh.send :: Send 'fm --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.204.2:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne alarm-list --nowrap --uuid'
[2019-08-27 12:05:55,554] 423 DEBUG MainThread ssh.expect :: Output:
+--------------------------------------+----------+--------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------+----------+----------------------------+
| UUID | Alarm ID | Reason Text | Entity ID | Severity | Time Stamp |
+--------------------------------------+----------+--------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------+----------+----------------------------+
| 1ff85c57-4f9d-4da8-9397-cbec17a0ccd5 | 100.114 | NTP address 2607:5300:60:92 is not a valid or a reachable NTP server. | host=controller-0.ntp=2607:5300:60:92 | minor | 2019-08-27T11:53:15.553284 |
| a1d8e7a5-d851-4f2e-8d3d-d93116f2159d | 200.006 | controller-0 'Process Monitor' (pmond) process is not running or functioning properly. The system is trying to recover this process. | host=controller-0.process=pmond | major | 2019-08-27T11:46:39.027741 |
+--------------------------------------+----------+--------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------+----------+----------------------------+
controller-0:~$

Test Activity
-------------
Sanity

Revision history for this message
Peng Peng (ppeng) wrote :
Numan Waheed (nwaheed)
tags: added: stx.retestneeded
Dariush Eslimi (deslimi)
Changed in starlingx:
importance: Undecided → Medium
status: New → Triaged
assignee: nobody → Eric MacDonald (rocksolidmtce)
tags: added: stx.3.0 stx.metal
Revision history for this message
Eric MacDonald (rocksolidmtce) wrote :

The hbsAgent manages this alarm:

2019-08-27T11:46:39.026 [89435.00045] controller-0 hbsAgent hbs nodeClass.cpp (7636) manage_pulse_flags : Warn : controller-0 sending pmon degrade event to maintenance
2019-08-27T11:46:39.027 [89435.00046] controller-0 hbsAgent alm alarm.cpp ( 146) alarm_ : Info : controller-0 pmond set major 200.006

... and it did try to clear the alarm 11 seconds later by sending a message to the mtcalarm daemon, which does the actual clearing work.

2019-08-27T11:46:50.267 [89435.00047] controller-0 hbsAgent alm alarm.cpp ( 146) alarm_ : Info : controller-0 pmond clear clear 200.006

In this case the mtcalarm daemon got a failure back from FM when trying to clear the alarm.

2019-08-27T11:46:50.271 [89360.00011] localhost mtcalarmd --- alarmMgr.cpp ( 42) alarmMgr_manage_alarm : Info : Alarm: alarmid:200.006 hostname:controller-0 operation:clear severity:clear entity:pmond prefix:
2019-08-27T11:46:50.271 [89360.00012] localhost mtcalarmd --- alarmUtil.cpp ( 316) alarmUtil : Info : controller-0 clearing 200.006 host=controller-0.process=pmond alarm
2019-08-27T11:46:50.271 fmSocket.cpp(140): Socket Error: Failed to write to fd:(5), len:(526), rc:(-1), error:(Broken pipe)
2019-08-27T11:46:50.271 [89360.00013] localhost mtcalarmd --- alarmUtil.cpp ( 321) alarmUtil :Error : controller-0 failed to fm_clear_fault (rc:3)
2019-08-27T11:46:50.271 [89360.00014] localhost mtcalarmd --- alarmMgr.cpp ( 101) alarmMgr_manage_alarm :Error : controller-0 failed to clear alarm '200.006:pmond'
2019-08-27T11:46:50.271 [89360.00015] localhost mtcalarmd --- alarmInit.cpp ( 245) daemon_service_run : Warn : failed to handle alarm request (rc:1)
2019-08-27T11:46:50.271 [89360.00016] localhost mtcalarmd sig daemon_signal.cpp ( 222) daemon_signal_hdlr : Info : Received SIGPIPE
2019-08-27T11:47:48.193 [89360.00017] localhost mtcalarmd --- alarmInit.cpp ( 251) daemon_service_run : Warn : alarm request receive error ; thresholeded ; (11:Resource temporarily unavailable)

The mtcalarm daemon currently does not retry failed FM requests.

Adding retry support is non-trivial: it would require adding a first-in, first-out message queue so alarm requests can be re-serviced in order.
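
As a rough sketch only (hypothetical names and return codes, not the actual mtcalarmd code), such a queue would let the daemon post requests and service them in order, leaving the head request queued when FM returns a retryable error:

from collections import deque
import time

RETRYABLE = {"NOCONNECT"}        # e.g. FM process not running yet (illustrative)
alarm_queue = deque()            # FIFO: preserves the order requests arrived in

def post_alarm_request(request):
    """Queue a set/clear request instead of calling FM directly."""
    request.setdefault("first_try", time.time())   # remember the original timestamp
    alarm_queue.append(request)

def service_alarm_queue(fm_call):
    """Service the queue head; stop and retry later on a retryable FM error."""
    while alarm_queue:
        rc = fm_call(alarm_queue[0])   # wrapper around fm_set_fault / fm_clear_fault
        if rc == "OK":
            alarm_queue.popleft()      # success: move on to the next request
        elif rc in RETRYABLE:
            break                      # leave it at the head; retry on the next pass
        else:
            alarm_queue.popleft()      # unrecoverable error: log and drop it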

Revision history for this message
Eric MacDonald (rocksolidmtce) wrote :

In the past, hbsAgent was managed by SM, whose service dependencies ensured FM was running before hbsAgent started. Now hbsAgent is a systemd process, so there is no guarantee FM is running when hbsAgent tries to raise an alarm.

This will have to be considered in the fix strategy.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to metal (master)

Fix proposed to branch: master
Review: https://review.opendev.org/685377

Changed in starlingx:
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to metal (master)

Reviewed: https://review.opendev.org/685377
Committed: https://git.openstack.org/cgit/starlingx/metal/commit/?id=f2fedc0446bc39ca59e9512dc7418ed7c02b67a1
Submitter: Zuul
Branch: master

commit f2fedc0446bc39ca59e9512dc7418ed7c02b67a1
Author: Eric MacDonald <email address hidden>
Date: Fri Sep 27 11:04:33 2019 -0400

    Add alarm retry support to maintenance alarm handling daemon

    The maintenance alarm handling daemon (mtcalarmd) should not
    drop alarm requests simply because the FM process is not running.
    Instead, it should retry for this and the other FM error cases that
    are likely to succeed in time if they are retried.

    Some error cases, however, do need to be dropped, such as those
    that are unlikely to succeed with retries.

    Reviewed the FM return codes with the FM designer, which led to a
    list of errors that should be dropped and others that should be retried.

    This update implements that handling with a posting and
    servicing of a first-in / first-out alarm queue.

    Typical retry case is the NOCONNECT error code which occurs
    when FM is not running.

    Alarm ordering and first try timestamp is maintained.
    Retries and logs are throttled to avoid flooding.

    Test Plan:

    PASS: Verify success path alarm handling End-to-End.
    PASS: Verify retry handling while FM is not running.
    PASS: Verify handling of all FM error codes (fit tool).
    PASS: Verify alarm handling under stress (inject-alarm script) soak.
    PASS: verify no memory leak over stress soak.
    PASS: Verify logging (success, retry, failure)
    PASS: Verify alarm posted date is maintained over retry success.

    Change-Id: Icd1e75583ef660b767e0788dd4af7f184bdb9e86
    Closes-Bug: 1841653
    Signed-off-by: Eric MacDonald <email address hidden>
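
As a rough illustration of the retry and log throttling described above (assumed values and names only, not the merged C++ implementation), each queued request can keep its original posting timestamp while its retries and log lines are rate-limited:

import time

RETRY_INTERVAL = 10        # seconds between retries of the queue head (assumed)
LOG_EVERY_N_RETRIES = 5    # only log every Nth retry to avoid flooding (assumed)

def retry_due(request, now=None):
    """Return True when the head-of-queue request is due for another retry."""
    now = time.time() if now is None else now
    if now - request.get("last_try", 0.0) < RETRY_INTERVAL:
        return False
    request["last_try"] = now
    request["retries"] = request.get("retries", 0) + 1
    if request["retries"] % LOG_EVERY_N_RETRIES == 1:
        # 'first_try', set when the request was first posted, survives retries
        print("retrying alarm %s (first posted %.0f, attempt %d)"
              % (request["alarm_id"], request["first_try"], request["retries"]))
    return True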

Changed in starlingx:
status: In Progress → Fix Released
Revision history for this message
Peng Peng (ppeng) wrote :

Not seeing this issue recently

tags: removed: stx.retestneeded