Frequent mtce heartbeat misses in virtual environment

Bug #1885581 reported by Eric MacDonald
This bug affects 1 person
Affects     Status    Importance  Assigned to     Milestone
StarlingX   Triaged   Low         Eric MacDonald

Bug Description

The hbsAgent log reports frequent heartbeat misses, at a rate that could (rarely) escalate to an alarm or host degrade.

Heartbeating a cluster of hosts at a 100 ms cadence in a virtual environment is known to exhibit this behavior. To account for this, the hbsAgent supports a -V (virtual) startup option that commands it to run in 'virtual' mode. In virtual mode the hbsAgent overrides the configured heartbeat cadence with a static 500 ms cadence. Heartbeating at a 500 ms cadence in a virtual environment is fine.

The hbsAgent startup script calls 'virt-what' to detect whether the active controller is running in a virtual environment and, if so, enables virtual mode. However, the output of 'virt-what' on systems installed with the new virtual installer tool 'vdm' differs from the output the script was tested against in the past. The script's parsing of that output no longer detects the virtual environment, so the hbsAgent heartbeats at the configured value, which causes this issue.

The fix is to enhance the hbsAgent startup script's handling of 'virt-what' output so that it continues to handle the old output and also accommodates the way vdm presents it.
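
The detection logic described above could be made less sensitive to the exact shape of the 'virt-what' output along the following lines. This is only a sketch under stated assumptions: the real startup script is shell and its current parsing is not shown in this report, the exact output under vdm is not given here, and the function names is_virtual and hbs_agent_args are hypothetical. It relies on the documented 'virt-what' behavior of printing one fact per line and printing nothing when no virtualization is detected, so any non-blank line is treated as "virtual" rather than matching one specific string.

```python
def is_virtual(virt_what_output):
    """Return True if virt-what style output contains at least one fact.

    Assumption: the captured output has one fact per line (for example
    "kvm", or "virtualbox" followed by "kvm") and is empty when no
    virtualization was detected.
    """
    facts = [line.strip() for line in virt_what_output.splitlines() if line.strip()]
    return len(facts) > 0


def hbs_agent_args(virt_what_output):
    """Build the extra hbsAgent startup arguments (hypothetical helper).

    "-V" is the virtual-mode option named in this report.
    """
    return ["-V"] if is_virtual(virt_what_output) else []


if __name__ == "__main__":
    # Single-fact and multi-fact outputs both select virtual mode;
    # empty output leaves the configured cadence in place.
    for sample in ("kvm\n", "virtualbox\nkvm\n", ""):
        print(repr(sample), hbs_agent_args(sample))
```

Treating any non-blank output as virtual avoids depending on the number or order of lines, at the cost of also enabling virtual mode for any environment type that 'virt-what' reports.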

Severity
--------
Minor: Affects systems running in a virtual environment in a minor way.

Steps to Reproduce
------------------
Install system with 'vdm' tool

Expected Behavior
------------------
Heartbeat at 500 msec cadence

Actual Behavior
----------------
Heartbeat at 100 msec cadence

Reproducibility
---------------
100% reproducible

System Configuration
--------------------
Any duplex system

Branch/Pull Time/Commit
-----------------------
2020-06-26_04-10-00

Last Pass
---------
N/A

Timestamp/Logs
--------------
2020-06-25T13:44:39.702 Cluster Vault : C0 Mgmnt [3:3] [3:3] [3:3]>[3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:2] [3:2] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3]
2020-06-25T13:44:39.702 Cluster Vault : C1 Mgmnt [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3]>[3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3] [3:3]
2020-06-25T13:44:47.154 [90647.01127] controller-0 hbsAgent hbs nodeClass.cpp (8593) lost_pulses : Info : controller-1 Mgmnt Pulse Miss ( 2) (max: 4)
2020-06-25T13:44:47.261 [90647.01128] controller-0 hbsAgent hbs nodeClass.cpp (8593) lost_pulses : Info : controller-1 Mgmnt Pulse Miss ( 3) (max: 4)
2020-06-25T13:44:47.367 [90647.01129] controller-0 hbsAgent hbs nodeClass.cpp (8593) lost_pulses : Info : controller-1 Mgmnt Pulse Miss ( 4) (max: 4)
2020-06-25T13:44:47.367 [90647.01130] controller-0 hbsAgent hbs nodeClass.cpp (8616) lost_pulses : Warn : controller-1 Mgmnt -> MINOR

No collect required. The issue and fix are understood.
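
The spacing of the 'Pulse Miss' logs above gives a quick way to confirm which cadence is in effect. The sketch below is an illustration, not part of the maintenance code: the helper name miss_intervals_ms is hypothetical, and it assumes every line starts with a timestamp in the format shown in the logs above and that the lines passed in belong to a single host and network.

```python
import re
from datetime import datetime, timedelta

# Capture the leading timestamp of any line that reports a pulse miss.
PULSE_MISS = re.compile(r"^(\S+)\s.*Pulse Miss")


def miss_intervals_ms(log_lines):
    """Return the gaps, in whole milliseconds, between consecutive miss logs."""
    stamps = []
    for line in log_lines:
        match = PULSE_MISS.match(line)
        if match:
            stamps.append(datetime.strptime(match.group(1), "%Y-%m-%dT%H:%M:%S.%f"))
    one_ms = timedelta(milliseconds=1)
    return [(later - earlier) // one_ms for earlier, later in zip(stamps, stamps[1:])]


if __name__ == "__main__":
    sample = [
        "2020-06-25T13:44:47.154 [90647.01127] controller-0 hbsAgent hbs nodeClass.cpp (8593) lost_pulses : Info : controller-1 Mgmnt Pulse Miss ( 2) (max: 4)",
        "2020-06-25T13:44:47.261 [90647.01128] controller-0 hbsAgent hbs nodeClass.cpp (8593) lost_pulses : Info : controller-1 Mgmnt Pulse Miss ( 3) (max: 4)",
        "2020-06-25T13:44:47.367 [90647.01129] controller-0 hbsAgent hbs nodeClass.cpp (8593) lost_pulses : Info : controller-1 Mgmnt Pulse Miss ( 4) (max: 4)",
    ]
    print(miss_intervals_ms(sample))  # prints [107, 106]
```

Gaps of roughly 100 ms, as in the sample, are consistent with the configured 100 ms cadence; in virtual mode, consecutive misses would be expected roughly 500 ms apart.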

Test Activity
-------------
Feature Testing

Workaround
----------
system service-parameter-modify platform maintenance heartbeat_period=500
system service-parameter-update

Tags: stx.metal
Changed in starlingx:
assignee: nobody → Eric MacDonald (rocksolidmtce)
Ghada Khalil (gkhalil) wrote :

low priority - minor issue on virtual env

Changed in starlingx:
importance: Undecided → Low
status: New → Triaged
tags: added: stx.metal
John Kung (john-kung) wrote :

No progress on this bug for more than 2 years. Candidate for closure.

If there is no update, this issue is targeted to be closed as 'Won't Fix' in 2 weeks.

Eric MacDonald (rocksolidmtce) wrote :

Fine by me. I tried to close as Won't fix but that option is not selectable.
