A pmond config reload results in an active monitoring failure of the hbsClient process.
The hbsClient process is failed and then restarted properly, but it should not have failed over a pmon config reload.
Severity: Minor, since the process is auto-recovered with no customer-visible impact.
Steps to Reproduce: Edit any config file in /etc/pmon.d to cause a config reload.
Expected Behavior: No process should fail over the config reload.
Actual Behavior: The hbsClient process fails active monitoring and is then successfully recovered on the first attempt.
Reproducibility: Reproducible
System Configuration: Any
Branch/Pull Time/Commit: All current loads as of this date.
Timestamp/Logs
--------------
2018-12-10T15:34:28.536 [35598.00525] controller-0 pmond mon pmonHdlr.cpp (1901) pmon_service : Info : Setting config reload flag
2018-12-10T15:34:38.985 [35598.00601] controller-0 pmond mon pmonMsg.cpp ( 555) amon_send_request :Error : hbsClient sendto error (88:Socket operation on non-socket) (hbsClient 12345678 1) (0.0.0.0)
2018-12-10T15:34:38.985 [35598.00602] controller-0 pmond mon pmonFsm.cpp ( 241) pmon_active_handler : Warn : hbsClient pulse request send failed (0:1)
2018-12-10T15:34:45.485 [35598.00603] controller-0 pmond mon pmonMsg.cpp ( 555) amon_send_request :Error : hbsClient sendto error (88:Socket operation on non-socket) (hbsClient 12345678 2) (0.0.0.0)
2018-12-10T15:34:45.485 [35598.00604] controller-0 pmond mon pmonFsm.cpp ( 241) pmon_active_handler : Warn : hbsClient pulse request send failed (1:2)
2018-12-10T15:34:51.985 [35598.00605] controller-0 pmond mon pmonMsg.cpp ( 555) amon_send_request :Error : hbsClient sendto error (88:Socket operation on non-socket) (hbsClient 12345678 3) (0.0.0.0)
2018-12-10T15:34:51.985 [35598.00606] controller-0 pmond mon pmonFsm.cpp ( 241) pmon_active_handler : Warn : hbsClient pulse request send failed (2:3)
2018-12-10T15:34:58.485 [35598.00607] controller-0 pmond mon pmonMsg.cpp ( 555) amon_send_request :Error : hbsClient sendto error (88:Socket operation on non-socket) (hbsClient 12345678 4) (0.0.0.0)
2018-12-10T15:34:58.485 [35598.00608] controller-0 pmond mon pmonFsm.cpp ( 241) pmon_active_handler : Warn : hbsClient pulse request send failed (3:4)
2018-12-10T15:35:03.985 [35598.00609] controller-0 pmond mon pmonMsg.cpp ( 555) amon_send_request :Error : hbsClient sendto error (88:Socket operation on non-socket) (hbsClient 12345678 5) (0.0.0.0)
2018-12-10T15:35:03.985 [35598.00610] controller-0 pmond mon pmonFsm.cpp ( 241) pmon_active_handler : Warn : hbsClient pulse request send failed (4:5)
2018-12-10T15:35:04.489 [35598.00611] controller-0 pmond com nodeUtil.cpp (1927) get_system_state : Info : systemctl reports host as 'degraded'
2018-12-10T15:35:04.489 [35598.00612] controller-0 pmond mon pmonHdlr.cpp ( 320) manage_process_failure :Error : hbsClient failed (2712) (p:0 a:1)
2018-12-10T15:35:04.985 [35598.00613] controller-0 pmond mon pmonFsm.cpp ( 509) pmon_passive_handler : Info : hbsClient Sending Log Event to Maintenance
2018-12-10T15:35:04.985 [35598.00614] controller-0 pmond mon pmonHdlr.cpp (1540) manage_alarm : Info : hbsClient process has failed ; Auto recovery in progress.
2018-12-10T15:35:05.486 [35598.00615] controller-0 pmond mon pmonFsm.cpp ( 562) pmon_passive_handler : Info : hbsClient stability period (10 secs)
2018-12-10T15:35:05.486 [35598.00616] controller-0 pmond mon pmonHdlr.cpp (1102) unregister_process : Info : hbsClient unregistered (2712)
2018-12-10T15:35:05.486 [35598.00617] controller-0 pmond mon pmonHdlr.cpp (1204) respawn_process : Info : hbsClient restart of running process
2018-12-10T15:35:05.486 [35598.00618] controller-0 pmond mon pmonHdlr.cpp ( 942) kill_running_process : Warn : hbsClient kill succeeded (2712)
2018-12-10T15:35:05.488 [35598.00619] controller-0 pmond mon pmonHdlr.cpp (1304) respawn_process : Info : hbsClient Spawn (20171)
2018-12-10T15:35:06.985 [35598.00620] controller-0 pmond mon pmonFsm.cpp ( 619) pmon_passive_handler : Info : hbsClient Monitor (20187)
2018-12-10T15:35:17.485 [35598.00621] controller-0 pmond mon pmonFsm.cpp ( 653) pmon_passive_handler : Info : hbsClient Stable (20187)
2018-12-10T15:35:17.985 [35598.00622] controller-0 pmond mon pmonFsm.cpp ( 725) pmon_passive_handler : Info : hbsClient Recovered (20187)
2018-12-10T15:35:17.985 [35598.00623] controller-0 pmond mon pmonHdlr.cpp (1137) register_process : Info : hbsClient Registered (20187)
2018-12-10T15:35:18.603 [35598.00624] controller-0 pmond mon pmonFsm.cpp ( 291) pmon_active_handler : Info : hbsClient is healthy (debouncing)
2018-12-10T15:35:23.985 [35598.00625] controller-0 pmond mon pmonFsm.cpp ( 291) pmon_active_handler : Info : hbsClient is healthy (debouncing)
2018-12-10T15:35:28.603 [35598.00626] controller-0 pmond mon pmonFsm.cpp ( 291) pmon_active_handler : Info : hbsClient is healthy (debouncing)
2018-12-10T15:35:33.485 [35598.00627] controller-0 pmond mon pmonFsm.cpp ( 291) pmon_active_handler : Info : hbsClient is healthy (debouncing)
2018-12-10T15:35:38.485 [35598.00628] controller-0 pmond mon pmonFsm.cpp ( 291) pmon_active_handler : Info : hbsClient is healthy (debouncing)
2018-12-10T15:35:43.485 [35598.00629] controller-0 pmond mon pmonFsm.cpp ( 291) pmon_active_handler : Info : hbsClient is healthy (debouncing)
2018-12-10T15:35:43.485 [35598.00630] controller-0 pmond mon pmonFsm.cpp ( 303) pmon_active_handler : Info : hbsClient Debounced (20187)
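The timeline in the log above can be checked mechanically. The following is a minimal sketch, assuming the line layout shown in this report (an ISO-style timestamp as the first field, followed by free text). The helper names `parse_time` and `summarize` are illustrative and are not part of pmond; the sample lines are copied from the log above.

```python
import re
from datetime import datetime

# Lines copied verbatim from the log in this report (subset used by the summary).
SAMPLE_LOG = """\
2018-12-10T15:34:28.536 [35598.00525] controller-0 pmond mon pmonHdlr.cpp (1901) pmon_service : Info : Setting config reload flag
2018-12-10T15:34:38.985 [35598.00602] controller-0 pmond mon pmonFsm.cpp ( 241) pmon_active_handler : Warn : hbsClient pulse request send failed (0:1)
2018-12-10T15:34:45.485 [35598.00604] controller-0 pmond mon pmonFsm.cpp ( 241) pmon_active_handler : Warn : hbsClient pulse request send failed (1:2)
2018-12-10T15:34:51.985 [35598.00606] controller-0 pmond mon pmonFsm.cpp ( 241) pmon_active_handler : Warn : hbsClient pulse request send failed (2:3)
2018-12-10T15:34:58.485 [35598.00608] controller-0 pmond mon pmonFsm.cpp ( 241) pmon_active_handler : Warn : hbsClient pulse request send failed (3:4)
2018-12-10T15:35:03.985 [35598.00610] controller-0 pmond mon pmonFsm.cpp ( 241) pmon_active_handler : Warn : hbsClient pulse request send failed (4:5)
2018-12-10T15:35:04.489 [35598.00612] controller-0 pmond mon pmonHdlr.cpp ( 320) manage_process_failure :Error : hbsClient failed (2712) (p:0 a:1)
2018-12-10T15:35:17.985 [35598.00622] controller-0 pmond mon pmonFsm.cpp ( 725) pmon_passive_handler : Info : hbsClient Recovered (20187)
"""


def parse_time(line):
    """Parse the leading timestamp field of a log line."""
    return datetime.strptime(line.split()[0], "%Y-%m-%dT%H:%M:%S.%f")


def summarize(log_text, process="hbsClient"):
    """Count pulse send failures and measure reload->failure->recovery times."""
    reload_at = failed_at = recovered_at = None
    pulse_failures = 0
    failed_re = re.compile(rf"{re.escape(process)} failed \(\d+\)")
    for line in log_text.splitlines():
        if not line.strip():
            continue
        if "Setting config reload flag" in line:
            reload_at = parse_time(line)
        elif f"{process} pulse request send failed" in line:
            pulse_failures += 1
        elif failed_re.search(line):
            failed_at = parse_time(line)
        elif f"{process} Recovered" in line:
            recovered_at = parse_time(line)
    return {
        "pulse_failures": pulse_failures,
        "reload_to_failure_s": (failed_at - reload_at).total_seconds(),
        "failure_to_recovery_s": (recovered_at - failed_at).total_seconds(),
    }


if __name__ == "__main__":
    print(summarize(SAMPLE_LOG))
```

Run against the sample, the summary counts five failed pulse requests between the reload and the declared failure, which matches the (0:1) through (4:5) counters in the warnings.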
Reviewed: https://review.openstack.org/624784
Committed: https://git.openstack.org/cgit/openstack/stx-metal/commit/?id=4e132af30884d2a81700be0667a7b03cab1d3d94
Submitter: Zuul
Branch: master
commit 4e132af30884d2a81700be0667a7b03cab1d3d94
Author: Eric MacDonald <email address hidden>
Date: Wed Dec 12 08:10:30 2018 -0500
Mtce: fix hbsClient active monitoring over config reload
The maintenance process monitor is failing the hbsClient
process over config or process reload operations.
The issue relates to the hbsClient's subfunction being
'last-config' without pmon properly gating the active
monitoring FSM from starting until the passive monitoring
phase is complete and in the MANAGE state.
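The gating described above can be illustrated with a small model. This is only a sketch under stated assumptions: pmond itself is written in C++, and the class, field, and stage names below are hypothetical, with only the MANAGE state taken from the commit message. It shows the intended rule, namely that no pulse request is sent for a process until its passive monitoring has reached MANAGE.

```python
from dataclasses import dataclass
from enum import Enum, auto


class PassiveStage(Enum):
    # Hypothetical stage names; only MANAGE is named in the commit message.
    START = auto()
    MONITOR = auto()
    MANAGE = auto()


@dataclass
class Process:
    name: str
    active_monitoring: bool
    passive_stage: PassiveStage = PassiveStage.START
    pulse_requests_sent: int = 0


def run_active_fsm(proc):
    """Send a pulse request only when the gate is open; return True if sent."""
    if not proc.active_monitoring:
        return False  # process is not actively monitored at all
    if proc.passive_stage is not PassiveStage.MANAGE:
        return False  # gate closed: passive monitoring has not completed
    proc.pulse_requests_sent += 1  # stand-in for the real pulse request
    return True


if __name__ == "__main__":
    hbs = Process("hbsClient", active_monitoring=True)
    print(run_active_fsm(hbs))  # gate closed right after a config reload
    hbs.passive_stage = PassiveStage.MANAGE
    print(run_active_fsm(hbs))  # gate open once passive monitoring is in MANAGE
```

In this model a config reload would reset `passive_stage`, so the active FSM stays idle until the passive phase completes instead of sending pulse requests on a socket that is not ready.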
Test Plan
PASS: Verify active monitoring failure detection and handling
PASS: Verify proper process monitoring over pmond config reload
PASS: Verify proper process monitoring over SIGHUP -> pmond
PASS: Verify proper process monitoring over SIGUSR2 -> pmond
PASS: Verify proper process monitoring over process failure recovery
PASS: Verify pmond regression test soak ; on active and inactive controllers
PASS: Verify pmond regression test soak ; on compute node
PASS: Verify pmond regression test soak ; kill/recovery function
PASS: Verify pmond regression test soak ; restart function
PASS: Verify pmond regression test soak ; alarming function
PASS: Verify pmond handles critical process failure with no restart config
PASS: Verify pmond handles ntpd process failure
PASS: Verify AIO DX Install
PASS: Verify AIO DX Inactive Controller process management over Lock/Unlock.
Change-Id: Ie2fe7b6ce479f660725e5600498cc98f36f78337
Closes-Bug: 1807724
Signed-off-by: Eric MacDonald <email address hidden>