hbsClient fails active monitoring after pmon config reload

Bug #1807724 reported by Eric MacDonald
This bug affects 1 person
Affects    Status        Importance  Assigned to     Milestone
StarlingX  Fix Released  Medium      Eric MacDonald

Bug Description

A pmon config reload causes the hbsClient process to fail active monitoring.
pmond restarts hbsClient and it recovers properly, but the process should not have failed over a pmon config reload in the first place.

Severity: Minor, since the process is auto-recovered with no customer-visible impact
Steps to Reproduce: Edit any config file in /etc/pmon.d to cause a config reload.
Expected Behavior: No process should fail over the config reload.
Actual Behavior: hbsClient fails active monitoring and is then successfully recovered on the first attempt.
Reproducibility: Reproducible
System Configuration: Any
Branch/Pull Time/Commit: All current loads as of this date.
Timestamp/Logs
--------------

2018-12-10T15:34:28.536 [35598.00525] controller-0 pmond mon pmonHdlr.cpp (1901) pmon_service : Info : Setting config reload flag
2018-12-10T15:34:38.985 [35598.00601] controller-0 pmond mon pmonMsg.cpp ( 555) amon_send_request :Error : hbsClient sendto error (88:Socket operation on non-socket) (hbsClient 12345678 1) (0.0.0.0)
2018-12-10T15:34:38.985 [35598.00602] controller-0 pmond mon pmonFsm.cpp ( 241) pmon_active_handler : Warn : hbsClient pulse request send failed (0:1)
2018-12-10T15:34:45.485 [35598.00603] controller-0 pmond mon pmonMsg.cpp ( 555) amon_send_request :Error : hbsClient sendto error (88:Socket operation on non-socket) (hbsClient 12345678 2) (0.0.0.0)
2018-12-10T15:34:45.485 [35598.00604] controller-0 pmond mon pmonFsm.cpp ( 241) pmon_active_handler : Warn : hbsClient pulse request send failed (1:2)
2018-12-10T15:34:51.985 [35598.00605] controller-0 pmond mon pmonMsg.cpp ( 555) amon_send_request :Error : hbsClient sendto error (88:Socket operation on non-socket) (hbsClient 12345678 3) (0.0.0.0)
2018-12-10T15:34:51.985 [35598.00606] controller-0 pmond mon pmonFsm.cpp ( 241) pmon_active_handler : Warn : hbsClient pulse request send failed (2:3)
2018-12-10T15:34:58.485 [35598.00607] controller-0 pmond mon pmonMsg.cpp ( 555) amon_send_request :Error : hbsClient sendto error (88:Socket operation on non-socket) (hbsClient 12345678 4) (0.0.0.0)
2018-12-10T15:34:58.485 [35598.00608] controller-0 pmond mon pmonFsm.cpp ( 241) pmon_active_handler : Warn : hbsClient pulse request send failed (3:4)
2018-12-10T15:35:03.985 [35598.00609] controller-0 pmond mon pmonMsg.cpp ( 555) amon_send_request :Error : hbsClient sendto error (88:Socket operation on non-socket) (hbsClient 12345678 5) (0.0.0.0)
2018-12-10T15:35:03.985 [35598.00610] controller-0 pmond mon pmonFsm.cpp ( 241) pmon_active_handler : Warn : hbsClient pulse request send failed (4:5)
2018-12-10T15:35:04.489 [35598.00611] controller-0 pmond com nodeUtil.cpp (1927) get_system_state : Info : systemctl reports host as 'degraded'
2018-12-10T15:35:04.489 [35598.00612] controller-0 pmond mon pmonHdlr.cpp ( 320) manage_process_failure :Error : hbsClient failed (2712) (p:0 a:1)
2018-12-10T15:35:04.985 [35598.00613] controller-0 pmond mon pmonFsm.cpp ( 509) pmon_passive_handler : Info : hbsClient Sending Log Event to Maintenance
2018-12-10T15:35:04.985 [35598.00614] controller-0 pmond mon pmonHdlr.cpp (1540) manage_alarm : Info : hbsClient process has failed ; Auto recovery in progress.
2018-12-10T15:35:05.486 [35598.00615] controller-0 pmond mon pmonFsm.cpp ( 562) pmon_passive_handler : Info : hbsClient stability period (10 secs)
2018-12-10T15:35:05.486 [35598.00616] controller-0 pmond mon pmonHdlr.cpp (1102) unregister_process : Info : hbsClient unregistered (2712)
2018-12-10T15:35:05.486 [35598.00617] controller-0 pmond mon pmonHdlr.cpp (1204) respawn_process : Info : hbsClient restart of running process
2018-12-10T15:35:05.486 [35598.00618] controller-0 pmond mon pmonHdlr.cpp ( 942) kill_running_process : Warn : hbsClient kill succeeded (2712)
2018-12-10T15:35:05.488 [35598.00619] controller-0 pmond mon pmonHdlr.cpp (1304) respawn_process : Info : hbsClient Spawn (20171)
2018-12-10T15:35:06.985 [35598.00620] controller-0 pmond mon pmonFsm.cpp ( 619) pmon_passive_handler : Info : hbsClient Monitor (20187)
2018-12-10T15:35:17.485 [35598.00621] controller-0 pmond mon pmonFsm.cpp ( 653) pmon_passive_handler : Info : hbsClient Stable (20187)
2018-12-10T15:35:17.985 [35598.00622] controller-0 pmond mon pmonFsm.cpp ( 725) pmon_passive_handler : Info : hbsClient Recovered (20187)
2018-12-10T15:35:17.985 [35598.00623] controller-0 pmond mon pmonHdlr.cpp (1137) register_process : Info : hbsClient Registered (20187)
2018-12-10T15:35:18.603 [35598.00624] controller-0 pmond mon pmonFsm.cpp ( 291) pmon_active_handler : Info : hbsClient is healthy (debouncing)
2018-12-10T15:35:23.985 [35598.00625] controller-0 pmond mon pmonFsm.cpp ( 291) pmon_active_handler : Info : hbsClient is healthy (debouncing)
2018-12-10T15:35:28.603 [35598.00626] controller-0 pmond mon pmonFsm.cpp ( 291) pmon_active_handler : Info : hbsClient is healthy (debouncing)
2018-12-10T15:35:33.485 [35598.00627] controller-0 pmond mon pmonFsm.cpp ( 291) pmon_active_handler : Info : hbsClient is healthy (debouncing)
2018-12-10T15:35:38.485 [35598.00628] controller-0 pmond mon pmonFsm.cpp ( 291) pmon_active_handler : Info : hbsClient is healthy (debouncing)
2018-12-10T15:35:43.485 [35598.00629] controller-0 pmond mon pmonFsm.cpp ( 291) pmon_active_handler : Info : hbsClient is healthy (debouncing)
2018-12-10T15:35:43.485 [35598.00630] controller-0 pmond mon pmonFsm.cpp ( 303) pmon_active_handler : Info : hbsClient Debounced (20187)
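
For anyone triaging a similar log, here is a minimal Python sketch that parses lines in the format shown above and summarizes the failure window. The regex and the field names (tag, daemon, cat, and so on) are inferred from this excerpt only and are assumptions, not an official pmond log specification.

```python
import re
from datetime import datetime

# Field layout inferred from the log excerpt above; not an official pmond format.
LOG_RE = re.compile(
    r"^(?P<ts>\S+) \[(?P<tag>[\d.]+)\] (?P<host>\S+) (?P<daemon>\S+) (?P<cat>\S+) "
    r"(?P<file>\S+) \(\s*(?P<line>\d+)\) (?P<func>\S+)\s*:\s*(?P<level>\w+)\s*:\s*(?P<msg>.*)$"
)

# Subset of the lines from this report: reload flag, five send failures, failure.
SAMPLE = """\
2018-12-10T15:34:28.536 [35598.00525] controller-0 pmond mon pmonHdlr.cpp (1901) pmon_service : Info : Setting config reload flag
2018-12-10T15:34:38.985 [35598.00602] controller-0 pmond mon pmonFsm.cpp ( 241) pmon_active_handler : Warn : hbsClient pulse request send failed (0:1)
2018-12-10T15:34:45.485 [35598.00604] controller-0 pmond mon pmonFsm.cpp ( 241) pmon_active_handler : Warn : hbsClient pulse request send failed (1:2)
2018-12-10T15:34:51.985 [35598.00606] controller-0 pmond mon pmonFsm.cpp ( 241) pmon_active_handler : Warn : hbsClient pulse request send failed (2:3)
2018-12-10T15:34:58.485 [35598.00608] controller-0 pmond mon pmonFsm.cpp ( 241) pmon_active_handler : Warn : hbsClient pulse request send failed (3:4)
2018-12-10T15:35:03.985 [35598.00610] controller-0 pmond mon pmonFsm.cpp ( 241) pmon_active_handler : Warn : hbsClient pulse request send failed (4:5)
2018-12-10T15:35:04.489 [35598.00612] controller-0 pmond mon pmonHdlr.cpp ( 320) manage_process_failure :Error : hbsClient failed (2712) (p:0 a:1)
"""


def parse(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    m = LOG_RE.match(line)
    if m is None:
        return None
    rec = m.groupdict()
    rec["ts"] = datetime.strptime(rec["ts"], "%Y-%m-%dT%H:%M:%S.%f")
    rec["line"] = int(rec["line"])
    return rec


def summarize(text, process="hbsClient"):
    """Count pulse send failures and time from reload flag to declared failure."""
    records = [r for r in map(parse, text.splitlines()) if r]
    reload_ts = next(r["ts"] for r in records if "config reload flag" in r["msg"])
    send_failures = [
        r for r in records
        if r["msg"].startswith(f"{process} pulse request send failed")
    ]
    failed_ts = next(
        r["ts"] for r in records
        if r["func"] == "manage_process_failure"
        and r["msg"].startswith(f"{process} failed")
    )
    return {
        "send_failures": len(send_failures),
        "reload_to_failure_secs": (failed_ts - reload_ts).total_seconds(),
    }


summary = summarize(SAMPLE)
print(summary)
```

On the lines from this report, the summary shows five failed pulse requests and roughly 36 seconds between the config reload flag being set and hbsClient being declared failed.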

Ghada Khalil (gkhalil)
Changed in starlingx:
assignee: nobody → Eric MacDonald (rocksolidmtce)
Ghada Khalil (gkhalil)
Changed in starlingx:
importance: Undecided → Medium
status: New → Triaged
tags: added: stx.2019.03 stx.metal
OpenStack Infra (hudson-openstack) wrote : Fix merged to stx-metal (master)

Reviewed: https://review.openstack.org/624784
Committed: https://git.openstack.org/cgit/openstack/stx-metal/commit/?id=4e132af30884d2a81700be0667a7b03cab1d3d94
Submitter: Zuul
Branch: master

commit 4e132af30884d2a81700be0667a7b03cab1d3d94
Author: Eric MacDonald <email address hidden>
Date: Wed Dec 12 08:10:30 2018 -0500

    Mtce: fix hbsClient active monitoring over config reload

    The maintenance process monitor is failing the hbsClient
    process over config or process reload operations.

    The issue relates to the hbsClient's subfunction being
    'last-config' without pmon properly gating the active
    monitoring FSM from starting until the passive monitoring
    phase is complete and in the MANAGE state.

    Test Plan

    PASS: Verify active monitoring failure detection and handling
    PASS: Verify proper process monitoring over pmond config reload
    PASS: Verify proper process monitoring over SIGHUP -> pmond
    PASS: Verify proper process monitoring over SIGUSR2 -> pmond
    PASS: Verify proper process monitoring over process failure recovery
    PASS: Verify pmond regression test soak ; on active and inactive controllers
    PASS: Verify pmond regression test soak ; on compute node
    PASS: Verify pmond regression test soak ; kill/recovery function
    PASS: Verify pmond regression test soak ; restart function
    PASS: Verify pmond regression test soak ; alarming function
    PASS: Verify pmond handles critical process failure with no restart config
    PASS: Verify pmond handles ntpd process failure

    PASS: Verify AIO DX Install
    PASS: Verify AIO DX Inactive Controller process management over Lock/Unlock.

    Change-Id: Ie2fe7b6ce479f660725e5600498cc98f36f78337
    Closes-Bug: 1807724
    Signed-off-by: Eric MacDonald <email address hidden>
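
The gating described in the commit message can be illustrated with a small toy model. This is only a sketch in Python, not the actual stx-metal C++ change: the class, the method names, and every stage name other than MANAGE are hypothetical, and the assumption that a reload restarts the passive phase is mine.

```python
from enum import Enum, auto


class PassiveStage(Enum):
    # Only MANAGE is named in the commit message; the other stages are hypothetical.
    START = auto()
    MONITOR = auto()
    MANAGE = auto()


class MonitoredProcess:
    """Toy model of one monitored process; not the real pmond data structure."""

    def __init__(self, name):
        self.name = name
        self.passive_stage = PassiveStage.START
        self.pulse_requests = 0

    def on_config_reload(self):
        # Assumption: a reload restarts the passive phase from the beginning.
        self.passive_stage = PassiveStage.START

    def active_tick(self):
        """Run one active-monitoring step; return True if a pulse request was sent."""
        if self.passive_stage is not PassiveStage.MANAGE:
            return False  # gate: active monitoring waits for passive MANAGE
        self.pulse_requests += 1
        return True


proc = MonitoredProcess("hbsClient")
proc.on_config_reload()
sent_before = proc.active_tick()          # gated, nothing is sent
proc.passive_stage = PassiveStage.MANAGE  # passive phase complete
sent_after = proc.active_tick()           # gate open, pulse request sent
print(sent_before, sent_after, proc.pulse_requests)
```

The point of the gate is that no pulse request is attempted, and so none can be counted as missed, until passive monitoring has reached MANAGE.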

Changed in starlingx:
status: Triaged → Fix Released
Ken Young (kenyis)
tags: added: stx.2019.05
removed: stx.2019.03
Ken Young (kenyis)
tags: added: stx.2.0
removed: stx.2019.05