Worker node does not recover from heartbeat failure
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
StarlingX | Fix Released | Medium | Eric MacDonald | | |
Bug Description
Brief Description
-----------------
While testing patch orchestration, a heartbeat loss was reported for a worker node. The node did not recover from the heartbeat loss and had to be manually locked and unlocked.
Severity
--------
Major
Steps to Reproduce
------------------
Issue was seen during patch orchestration. May need to induce heartbeat loss to reproduce.
Expected Behavior
------------------
After heartbeat loss, the host should recover without manual intervention.
Actual Behavior
----------------
A lock/unlock was required to recover the host.
Reproducibility
---------------
Has occurred repeatedly in this lab with different worker nodes, but not 100% of the time.
System Configuration
-------
2 + 20 system. IPv6
Branch/Pull Time/Commit
-------
Build from 2019-09-26_20-00-00
Last Pass
---------
Patch orchestration appears to have last been tested with build 2019-05-27_09-53-12
Timestamp/Logs
--------------
Hosts unlocked
2019-10-
Compute-4 is coming up
2019-10-
Heartbeat loss
2019-10-
2019-10-
Manual lock/unlock
2019-10-10 05:16:45.752 113985 INFO sysinv.
2019-10-10 05:19:03.374 113985 INFO sysinv.
Test Activity
-------------
System Testing
Changed in starlingx: | |
assignee: | nobody → Eric MacDonald (rocksolidmtce) |
Marking as stx.3.0 / medium priority - system should automatically recover from a heartbeat loss