hbsAgent monitored interface name changes during monitoring

Bug #1829608 reported by Eric MacDonald
8
This bug affects 1 person
Affects Status Importance Assigned to Milestone
StarlingX
Fix Released
High
Eric MacDonald

Bug Description

The hbsAgent, the active/active maintenance monitoring service, seen on 2 occasions to start flooding its log file with 2 messages that indicate the name of the monitored interface has changed to ...

 Occurrence 1: a cali container interface
 Occurrence 2: a host name of compute-0 wildcat 3-6 - titanium load - designer
 Occurrence 3: host name of self controller-0 ironpass-14_17 - starling-x load - cengn
Bug report uses Occurrence 2 logs

Severity: Critical

Steps to Reproduce: Unknown
Steps to Recover: restart hbsAgent by 'sudo systemctl restart hbsAgent'

Expected Behavior: heartbeat service monitors heartbeat to in-service hosts with failure detection.

Actual Behavior: hbsAgent disables itself after getting a netlink event and flooding logs stating the incorrect interface is pown.

2019-05-17T11:42:41.504 [91165.12505818] controller-0 hbsAgent hbs hbsAgent.cpp (1737) daemon_service_run : Info : heartbeat service going enabled
2019-05-17T11:42:41.504 [91165.12505819] controller-0 hbsAgent hbs hbsAgent.cpp (2033) daemon_service_run : Info : Heartbeat disabled by compute-0 link down event

[flood at rate of both logs every heartbeat period ; 100msec]

This is significant ; frequent hbsAgent.log rotate

start of issue logs are rotated out ; impacts ability to debug

Reproducibility: Intermittent ; rare
System Configuration: Normal system -

wildcat 3-6 - designer build titanium load
ironpass-14_17 - cengen built starling-x load

logged into several large system labs in search for issue but not found

Branch/Pull Time/Commit: Designer build
-----------------------
controller-0:~$ cat /etc/build.info <- wildcat 3-6
SW_VERSION="19.01"
BUILD_TARGET="Unknown"
BUILD_TYPE="Informal"
BUILD_ID="n/a"

JOB="n/a"
BUILD_BY="swebster"
BUILD_NUMBER="n/a"
BUILD_HOST="yow-swebster-d3-vm1"
BUILD_DATE="2019-05-13 09:26:52 -0500"

BUILD_DIR="/"
WRS_SRC_DIR="/localdisk/designer/swebster/workspace/titanium//starlingx-0/cgcs-root"
WRS_GIT_BRANCH="HEAD"
CGCS_SRC_DIR="/localdisk/designer/swebster/workspace/titanium//starlingx-0/cgcs-root/stx"
CGCS_GIT_BRANCH="HEAD"

3rd Occurrence in ironpass-14_17

controller-0:~$ cat /etc/build.info <- ironpass-14_17
OS="centos"
SW_VERSION="19.01"
BUILD_TARGET="Host Installer"
BUILD_TYPE="Formal"
BUILD_ID="20190517T013000Z"

JOB="STX_build_master_master"
<email address hidden>"
BUILD_NUMBER="105"
BUILD_HOST="starlingx_mirror"
BUILD_DATE="2019-05-17 01:30:00 +0000"
controller-0:~$ vi /^C
controller-0:~$ tail -f /var/log/hbsAgent.log
2019-05-18T12:46:33.556 [96412.3019422] controller-0 hbsAgent hbs hbsAgent.cpp (1737) daemon_service_run : Info : heartbeat service going enabled
2019-05-18T12:46:33.556 [96412.3019423] controller-0 hbsAgent hbs hbsAgent.cpp (2033) daemon_service_run : Info : Heartbeat disabled by controller-0 link down event
2019-05-18T12:46:33.601 [96412.3019424] controller-0 hbsAgent hbs hbsAgent.cpp (1737) daemon_service_run : Info : heartbeat service going enabled
2019-05-18T12:46:33.606 [96412.3019425] controller-0 hbsAgent hbs hbsAgent.cpp (2033) daemon_service_run : Info : Heartbeat disabled by controller-0 link down event

Last Pass: N/A ; raised by designer

Timestamp/Logs:
--------------

Logs were collected for Occurrence 2 and are attached.

offending logs above.

Test Activity: N/A

Changed in starlingx:
assignee: nobody → Eric MacDonald (rocksolidmtce)
Revision history for this message
Eric MacDonald (rocksolidmtce) wrote :

Issue is not as rare as I thought.
A scan of large systems show that it was happening in about 50% of them.

Cannot debug due to the flooding having rotated the logs that lead up the the start of the issue rotating out.

First need to deliver a change that stops the flooding.
Development of such update is in progress.

Revision history for this message
Eric MacDonald (rocksolidmtce) wrote :

Logs from WC3-6

Revision history for this message
Eric MacDonald (rocksolidmtce) wrote :

Seems the hbs_config.mgmnt_iface pointer gets changed to an invalid value.

Update will save initial configured mgmnt name as a c string and its pointer value, change hbsAgent runtime code to use that new variable and create a unique Swerr log when the main loop detects that the hbs_config.mgmnt_iface pointer or what it points to ever changes and then will proceed to auto correct the hbs_config.mgmnt_iface pointer and content to see how often it changes.

Doing so auto corrects from the issue and avoids the log flood but does not address the root cause of the name change.

Revision history for this message
Eric MacDonald (rocksolidmtce) wrote :

I think I found a bug in the code that could lead to this issue.
As a test I added Fault Insertion capability to force the code down the path that leads to the suspect bug.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to metal (master)

Fix proposed to branch: master
Review: https://review.opendev.org/660704

Changed in starlingx:
status: New → In Progress
Revision history for this message
Eric MacDonald (rocksolidmtce) wrote :

The aforementioned suspect code was proven by fault insertion testing to be the root cause of the issue.
An update has been tested and submitted for review.

Revision history for this message
Eric MacDonald (rocksolidmtce) wrote :

fix update out for review

https://review.opendev.org/#/c/660704/1

Revision history for this message
Ghada Khalil (gkhalil) wrote :

Marking as release gating; system ends up in a bad state and requires manual intervention to recover.

Changed in starlingx:
importance: Undecided → High
tags: added: stx.2.0 stx.metal
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to metal (master)

Reviewed: https://review.opendev.org/660704
Committed: https://git.openstack.org/cgit/starlingx/metal/commit/?id=fb89f62a9e0e08cb7a995497d0d1f6cdc86e53fc
Submitter: Zuul
Branch: master

commit fb89f62a9e0e08cb7a995497d0d1f6cdc86e53fc
Author: Eric MacDonald <email address hidden>
Date: Wed May 22 08:45:14 2019 -0400

    Fix hbsAgent mgmnt interface name corruption

    The hbsAgent socket initialization process is declaring
    a local string variable and then loading the management
    interface name into it and setting its static management
    interface config pointer to that variable.

    So, when that local variable falls out of scope and that
    local (stack) variable memory is recycled and changed by
    its new owner then the data the management interface config
    pointer points to changes and that leads to the reported
    error condition and log flooding.

    This update allocates a new memory pointer instead.

    Change-Id: I858bcc69455f1d915f2873c47a75dd1139cf8fcb
    Closes-Bug: 1829608
    Signed-off-by: Eric MacDonald <email address hidden>

Changed in starlingx:
status: In Progress → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.