hbsAgent monitored interface name changes during monitoring
Affects | Status | Importance | Assigned to | Milestone |
---|---|---|---|---|
StarlingX | Fix Released | High | Eric MacDonald | |
Bug Description
The hbsAgent, the active/active maintenance monitoring service, has been seen on 3 occasions to start flooding its log file with 2 messages that indicate the name of the monitored interface has changed to ...
Occurrence 1: a cali container interface
Occurrence 2: a host name of compute-0 wildcat 3-6 - titanium load - designer
Occurrence 3: host name of self controller-0 ironpass-14_17 - starling-x load - cengn
Bug report uses Occurrence 2 logs
Severity: Critical
Steps to Reproduce: Unknown
Steps to Recover: restart hbsAgent by 'sudo systemctl restart hbsAgent'
Expected Behavior: heartbeat service monitors heartbeat to in-service hosts with failure detection.
Actual Behavior: hbsAgent disables itself after getting a netlink event and floods its log with messages stating that the incorrect interface is down.
2019-05-
2019-05-
[flood at rate of both logs every heartbeat period ; 100msec]
This is significant ; hbsAgent.log rotates frequently
logs from the start of the issue are rotated out ; impacts ability to debug
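To put the flood rate quoted above in perspective, here is a rough back-of-the-envelope sketch. The 100 msec period and the 2 messages per period come from this report; the average line length is an assumption made only for illustration, not a measured value.

```python
# Rough log-volume estimate for the flood described in this report.
HEARTBEAT_PERIOD_MS = 100  # from the report: one heartbeat period
LOGS_PER_PERIOD = 2        # from the report: both messages logged every period
AVG_LINE_BYTES = 150       # ASSUMPTION: illustrative log line length, not measured


def flood_lines_per_hour(period_ms: int = HEARTBEAT_PERIOD_MS,
                         logs_per_period: int = LOGS_PER_PERIOD) -> int:
    """Number of log lines written per hour at the given flood rate."""
    periods_per_hour = 3_600_000 // period_ms
    return periods_per_hour * logs_per_period


def flood_bytes_per_hour(avg_line_bytes: int = AVG_LINE_BYTES) -> int:
    """Approximate bytes written per hour, under the assumed line length."""
    return flood_lines_per_hour() * avg_line_bytes


if __name__ == "__main__":
    print(flood_lines_per_hour(), "lines/hour")
    print(flood_bytes_per_hour() / 1e6, "MB/hour (assumed line length)")
```

At 20 lines per second the log file fills quickly, which is consistent with the start-of-issue logs being rotated out before anyone can collect them.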
Reproducibility: Intermittent ; rare
System Configuration: Normal system -
wildcat 3-6 - designer build titanium load
ironpass-14_17 - cengn built starling-x load
logged into several large system labs in search for issue but not found
Branch/Pull Time/Commit: Designer build
-------
controller-0:~$ cat /etc/build.info <- wildcat 3-6
SW_VERSION="19.01"
BUILD_TARGET=
BUILD_TYPE=
BUILD_ID="n/a"
JOB="n/a"
BUILD_BY="swebster"
BUILD_NUMBER="n/a"
BUILD_HOST=
BUILD_DATE=
BUILD_DIR="/"
WRS_SRC_
WRS_GIT_
CGCS_SRC_
CGCS_GIT_
3rd Occurrence in ironpass-14_17
controller-0:~$ cat /etc/build.info <- ironpass-14_17
OS="centos"
SW_VERSION="19.01"
BUILD_TARGET="Host Installer"
BUILD_TYPE="Formal"
BUILD_ID=
JOB="STX_
<email address hidden>"
BUILD_NUMBER="105"
BUILD_HOST=
BUILD_DATE=
controller-0:~$ vi /^C
controller-0:~$ tail -f /var/log/
2019-05-
2019-05-
2019-05-
2019-05-
Last Pass: N/A ; raised by designer
Timestamp/Logs:
--------------
Logs were collected for Occurrence 2 and are attached.
The offending logs are shown above.
Test Activity: N/A
Changed in starlingx: | |
assignee: | nobody → Eric MacDonald (rocksolidmtce) |
Issue is not as rare as I thought.
A scan of large systems shows that it was happening in about 50% of them.
Cannot debug because the flooding has rotated out the logs that lead up to the start of the issue.
First need to deliver a change that stops the flooding.
Development of such update is in progress.
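As an illustration of what "stopping the flooding" could look like, here is a minimal sketch of a log throttle that emits a repeated message at most once per interval and reports how many repeats were suppressed. This is not the actual hbsAgent change (the agent itself is not written in Python); the class name `ThrottledLogger`, the 60 second interval, and the message text are all hypothetical.

```python
import time
from typing import Callable, Dict, Tuple


class ThrottledLogger:
    """Emit each distinct message at most once per `interval` seconds.

    Hypothetical helper for illustration only; not the hbsAgent fix.
    """

    def __init__(self, emit: Callable[[str], None], interval: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._emit = emit
        self._interval = interval
        self._clock = clock
        # message -> (time of last emit, repeats suppressed since then)
        self._state: Dict[str, Tuple[float, int]] = {}

    def log(self, message: str) -> bool:
        """Return True if the message was emitted, False if suppressed."""
        now = self._clock()
        last = self._state.get(message)
        if last is not None and now - last[0] < self._interval:
            self._state[message] = (last[0], last[1] + 1)
            return False
        suppressed = last[1] if last is not None else 0
        suffix = f" (suppressed {suppressed} repeats)" if suppressed else ""
        self._emit(message + suffix)
        self._state[message] = (now, 0)
        return True


if __name__ == "__main__":
    logger = ThrottledLogger(print, interval=1.0)
    for _ in range(15):
        logger.log("monitored interface is down")  # placeholder message text
        time.sleep(0.1)  # one heartbeat period, per the report
```

With a throttle of this kind the first occurrence of each message stays in the log, so the events leading up to the start of the issue are less likely to be rotated out.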