AIO Plus computes don't get heartbeat enabled over a DOR

Bug #1954949 reported by Eric MacDonald
Affects: StarlingX
Status: Triaged
Importance: Low
Assigned to: Eric MacDonald

Bug Description

Brief Description:
------------------

AIO Plus compute nodes do not get heartbeat enabled after a Dead Office Recovery (DOR). When the Plus feature was added to the AIO system, maintenance was never retrofitted to handle the DOR case for the 'Plus' (compute) nodes.

Severity:
---------

Major: There is no heartbeat fault detection for AIO Plus compute nodes following a Dead Office Recovery (DOR) until the node is locked and unlocked.

Steps to Reproduce:
-------------------

Power off and then power back on a fully unlocked-enabled AIO Plus system, including all controllers and Plus nodes.

Expected Behavior:
------------------

All nodes recover unlocked-enabled with maintenance heartbeat monitoring.

Actual Behavior:
----------------

All nodes recover but maintenance heartbeat is not enabled for plus (compute) nodes.

Reproducibility:
----------------

Reproducible 100% of the time.

System Configuration:
---------------------

AIO Plus system

Branch/Pull Time/Commit:
------------------------

BUILD_DATE="2021-12-12 21:14:20 -0500"

Last Pass:
----------

Test escape. There is no CLI command to display the nodes that are being heartbeat-monitored. The hbsAgent logs have to be inspected instead, which is not convenient. Adding a command for this should be considered.

Timestamp/Logs:
---------------

Run 'sudo pkill -usr2 hbsAgent' and then look in hbsAgent.log for the following:

2021-12-15T01:10:55.633 +--------------+-----+-------+-------+-------+-------+------------+----------+-----------------+
2021-12-15T01:10:55.633 | Mgmnt: 5 | Mon | Mis | Max | Deg | Fail | Pulses Tot | Pulses | Enabled ( 100) |
2021-12-15T01:10:55.633 +--------------+-----+-------+-------+-------+-------+------------+----------+-----------------+
2021-12-15T01:10:55.633 | controller-1 | Y | 0 | 0 | 0 | 0 | 86d1 | 86d1 | 100 msec
2021-12-15T01:10:55.633 | controller-0 | Y | 0 | 0 | 0 | 0 | 86d1 | 86d1 | 100 msec
2021-12-15T01:10:55.633 | compute-0 | n | 0 | 0 | 0 | 0 | 0 | 0 | 100 msec
2021-12-15T01:10:55.633 | compute-1 | n | 0 | 0 | 0 | 0 | 0 | 0 | 100 msec
2021-12-15T01:10:55.633 | compute-2 | n | 0 | 0 | 0 | 0 | 0 | 0 | 100 msec
2021-12-15T01:10:55.633 +--------------+-----+-------+-------+-------+-------+------------+----------+-----------------+
2021-12-15T01:10:55.633 | Clstr: 5 | Mon | Mis | Max | Deg | Fail | Pulses Tot | Pulses | Enabled ( 100) |
2021-12-15T01:10:55.633 +--------------+-----+-------+-------+-------+-------+------------+----------+-----------------+
2021-12-15T01:10:55.633 | controller-1 | Y | 0 | 0 | 0 | 0 | 86d1 | 86d1 | 100 msec
2021-12-15T01:10:55.633 | controller-0 | Y | 0 | 0 | 0 | 0 | 86d1 | 86d1 | 100 msec
2021-12-15T01:10:55.633 | compute-0 | n | 0 | 0 | 0 | 0 | 0 | 0 | 100 msec
2021-12-15T01:10:55.633 | compute-1 | n | 0 | 0 | 0 | 0 | 0 | 0 | 100 msec
2021-12-15T01:10:55.633 | compute-2 | n | 0 | 0 | 0 | 0 | 0 | 0 | 100 msec
2021-12-15T01:10:55.633 +--------------+-----+-------+-------+-------+-------+------------+----------+-----------------+
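
Since there is no CLI command for this, the check above can be scripted. The following is a minimal sketch, not part of maintenance itself: it assumes the dump rows look like the ones pasted above (a `Mgmnt:` or `Clstr:` header row, then one row per host with the Mon column as `Y` or `n`). The function name `unmonitored_hosts` and the regular expressions are illustrative only, and the log file path is passed in by the caller.

```python
import re
import sys

# Section header row, e.g. "... | Mgmnt: 5 | Mon | Mis | Max | ..."
HEADER = re.compile(r"\|\s*(?P<net>\w+):\s*\d+\s*\|\s*Mon\s*\|")
# Host row, e.g. "... | compute-0 | n | 0 | 0 | ..."
ROW = re.compile(r"\|\s*(?P<host>[\w.-]+)\s*\|\s*(?P<mon>[YyNn])\s*\|")


def unmonitored_hosts(lines):
    """Return {network: [hosts whose Mon column is not 'Y']}.

    Assumes the row layout shown in this report; other lines are ignored.
    If the log holds several dumps, hosts from all of them are accumulated.
    """
    result = {}
    network = None
    for line in lines:
        header = HEADER.search(line)
        if header:
            network = header.group("net")
            result.setdefault(network, [])
            continue
        row = ROW.search(line)
        if row and network is not None and row.group("mon").upper() != "Y":
            result[network].append(row.group("host"))
    return result


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage (path supplied by the caller): python3 check_hbs.py <hbsAgent.log>
    with open(sys.argv[1]) as log:
        for net, hosts in unmonitored_hosts(log).items():
            print(net, "not monitored:", ", ".join(hosts) or "none")
```

Run against the dump above, this would report compute-0, compute-1 and compute-2 as not monitored on both the Mgmnt and Clstr networks.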

Alarms:
-------

None, and that is part of the issue.

Test Activity:
--------------

Debugging of another issue.

Workaround:
-----------

Lock and then unlock each affected compute host.
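
As a rough sketch of the workaround, the snippet below only builds and prints the `system host-lock` / `system host-unlock` command lines for a list of hosts; it does not run them. The helper name `lock_unlock_commands` and the host list are illustrative assumptions. An operator running these for real would need to wait for each lock to complete before issuing the matching unlock.

```python
def lock_unlock_commands(hosts):
    """Build the lock and unlock command lines (as argv lists) for each host."""
    commands = []
    for host in hosts:
        commands.append(["system", "host-lock", host])
        commands.append(["system", "host-unlock", host])
    return commands


# Hypothetical host list matching the compute nodes shown in the log above.
for argv in lock_unlock_commands(["compute-0", "compute-1", "compute-2"]):
    print(" ".join(argv))
```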

Tags: stx.metal
Changed in starlingx:
assignee: nobody → Eric MacDonald (rocksolidmtce)
Ghada Khalil (gkhalil) wrote :

screening: stx.7.0 / medium - specific scenario/config related to AIO-DX+ and DOR; a workaround exists. Should be fixed in the stx master branch, but is not required for stx.6.0

Changed in starlingx:
importance: Undecided → Medium
tags: added: stx.7.0 stx.metal
Changed in starlingx:
status: New → Triaged
Ghada Khalil (gkhalil) wrote :

screening: lowering the priority as this is an issue with a specific config. This will not gate stx.7.0

tags: removed: stx.7.0
Changed in starlingx:
importance: Medium → Low