Worker node does not recover from heartbeat failure

Bug #1847656 reported by David Sullivan
This bug affects 1 person
Affects: StarlingX
Status: Fix Released
Importance: Medium
Assigned to: Eric MacDonald
Milestone: (none listed)

Bug Description

Brief Description
-----------------
While testing patch orchestration, a heartbeat loss was reported for a worker node. That node did not recover from the heartbeat loss and had to be manually locked and unlocked.

Severity
--------
Major

Steps to Reproduce
------------------
Issue was seen during patch orchestration. May need to induce heartbeat loss to reproduce.

Expected Behavior
------------------
After heartbeat loss, the host should recover without manual intervention.

Actual Behavior
----------------
A lock/unlock was required to recover the host.

Reproducibility
---------------
Has occurred repeatedly in this lab, on different worker nodes, but not 100% of the time.

System Configuration
--------------------
2 + 20 system. IPv6

Branch/Pull Time/Commit
-----------------------
Build from 2019-09-26_20-00-00

Last Pass
---------
Patch orchestration appears to have last been tested with build 2019-05-27_09-53-12

Timestamp/Logs
--------------
Hosts unlocked
2019-10-09T22:26:03.938 controller-0 VIM_Thread[114048] INFO _host_director.py.523 Unlock hosts: [u'compute-4', u'compute-6', u'compute-3', u'compute-7', u'compute-0', u'compute-10', u'compute-9', u'compute-14', u'compute-2', u'compute-15']
Compute-4 is coming up
2019-10-09T22:33:05.971 [113648.01502] controller-0 mtcAgent inv mtcInvApi.cpp ( 334) mtcInvApi_update_task : Info : compute-4 Task: Enabling (seq:26)
Heartbeat loss
2019-10-09T22:33:07.420 [113648.01529] controller-0 mtcAgent msg mtcCtrlMsg.cpp ( 971) service_events : Info : compute-4 heartbeat loss (Mgmnt)
2019-10-09T22:33:07.420 [113648.01530] controller-0 mtcAgent hbs nodeClass.cpp (4679) manage_heartbeat_failure:Error : compute-4 Mgmnt *** Heartbeat Loss *** (during enable soak)
Manual lock/unlock
2019-10-10 05:16:45.752 113985 INFO sysinv.api.controllers.v1.host [-] compute-4 host.update.ihost_val_prenotify {'ihost_action': 'lock'}
2019-10-10 05:19:03.374 113985 INFO sysinv.api.controllers.v1.host [-] compute-4 host.update.ihost_val_prenotify {'ihost_action': 'unlock'}
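
To put a number on how long compute-4 stayed unrecovered, the gap between the heartbeat loss and the manual lock can be computed from the two log lines above. The following is only a sketch: the helper name `parse_ts` is made up for illustration, and it assumes the mtcAgent and sysinv logs use the same clock and time zone, which this report does not state.

```python
import re
from datetime import datetime, timedelta

# Log lines copied verbatim from this report.
MTC_LOSS = ("2019-10-09T22:33:07.420 [113648.01529] controller-0 mtcAgent msg "
            "mtcCtrlMsg.cpp ( 971) service_events : Info : compute-4 heartbeat loss (Mgmnt)")
SYSINV_LOCK = ("2019-10-10 05:16:45.752 113985 INFO sysinv.api.controllers.v1.host [-] "
               "compute-4 host.update.ihost_val_prenotify {'ihost_action': 'lock'}")

# Both log formats start with a date and a millisecond time, separated by
# either 'T' (mtcAgent) or a space (sysinv).
TS = re.compile(r"^(\d{4}-\d{2}-\d{2})[T ](\d{2}:\d{2}:\d{2}\.\d{3})")


def parse_ts(line: str) -> datetime:
    """Hypothetical helper: extract the leading timestamp from a log line."""
    m = TS.match(line)
    if m is None:
        raise ValueError(f"no leading timestamp in: {line!r}")
    return datetime.strptime(f"{m.group(1)} {m.group(2)}", "%Y-%m-%d %H:%M:%S.%f")


# Assumption: both logs share one clock and time zone.
gap: timedelta = parse_ts(SYSINV_LOCK) - parse_ts(MTC_LOSS)
print(f"compute-4 unrecovered for {gap}")
```

Under that assumption the host sat in the failed state for a little under seven hours before the manual lock.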

Test Activity
-------------
System Testing

David Sullivan (dsullivanwr) wrote :
Changed in starlingx:
assignee: nobody → Eric MacDonald (rocksolidmtce)
Ghada Khalil (gkhalil) wrote :

Marking as stx.3.0 / medium priority - system should automatically recover from a heartbeat loss

tags: added: stx.3.0 stx.metal
Changed in starlingx:
importance: Undecided → Medium
status: New → Triaged
Eric MacDonald (rocksolidmtce) wrote :

The mtcAgent.log file is not sufficient to debug this issue.
Need a collect for compute-4.

OpenStack Infra (hudson-openstack) wrote : Fix proposed to metal (master)

Fix proposed to branch: master
Review: https://review.opendev.org/688724

Changed in starlingx:
status: Triaged → In Progress
Eric MacDonald (rocksolidmtce) wrote :

Fix posted for review.

OpenStack Infra (hudson-openstack) wrote : Fix merged to metal (master)

Reviewed: https://review.opendev.org/688724
Committed: https://git.openstack.org/cgit/starlingx/metal/commit/?id=df50847580735b84dd68c56ea16b24bfc6f64f5c
Submitter: Zuul
Branch: master

commit df50847580735b84dd68c56ea16b24bfc6f64f5c
Author: Eric MacDonald <email address hidden>
Date: Tue Oct 15 10:30:14 2019 -0400

    Ensure hbsClient ready event is cleared over a reboot.

    A host sometimes (rarely) fails heartbeat immediately
    following unlock.

    The hbsClient sends its ready event every 5 seconds.
    Mtce uses this event message as a clue that the target
    host is ready to start heartbeat following Graceful
    Recovery or in this case Enable sequence.

    This update fixes a potential race condition where the
    hbsClient ready event snuck through immediately following
    the unlock reboot. This tricked mtc into starting heartbeat
    too early following the online event that follows a reboot
    which led to a heartbeat failure.

    Test Plan:
    PASS: compute system install
    PASS: standby controller lock/unlock soak (25 loops)
    PASS: 2 compute async locked/unlock soak (50 loops each)

    Regression:
    PASS: inservice heartbeat failure detection and handling

    Change-Id: I21699dbb2f0ab7355a9384d78b47a1fd1cea496d
    Closes-Bug: 1847656
    Signed-off-by: Eric MacDonald <email address hidden>
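
The race described in the commit message above can be illustrated with a toy event model. This is a sketch only, not the mtcAgent code: the event names, the function `heartbeat_start_index` and the flag `clear_ready_over_reboot` are all hypothetical, and the report does not say at which point the real fix clears the ready event. The model assumes heartbeat starts as soon as the host is both online and flagged ready.

```python
from typing import List, Optional


def heartbeat_start_index(events: List[str],
                          clear_ready_over_reboot: bool) -> Optional[int]:
    """Return the index of the event at which heartbeat would start.

    Toy model (hypothetical): heartbeat starts once the host is online
    and a hbsClient ready event has been recorded.
    """
    ready = False    # a hbsClient ready event has been recorded
    online = False   # host has reported online after the reboot
    for i, ev in enumerate(events):
        if ev == "reboot":
            # Unlock reboot is issued: reset the controller-side view.
            online = False
            ready = False
        elif ev == "ready":
            # Periodic hbsClient ready message (sent every 5 seconds).
            ready = True
        elif ev == "online":
            online = True
            if clear_ready_over_reboot:
                # Modeled fix: discard any ready event recorded before the
                # host came back, so only a fresh one can start heartbeat.
                ready = False
        else:
            raise ValueError(f"unknown event: {ev!r}")
        if online and ready:
            return i
    return None


# A stale ready event arrives just after the reboot is issued, before the
# host has actually gone down; a fresh one arrives after it is back online.
EVENTS = ["reboot", "ready", "online", "ready"]

early = heartbeat_start_index(EVENTS, clear_ready_over_reboot=False)
fixed = heartbeat_start_index(EVENTS, clear_ready_over_reboot=True)
print(early, fixed)
```

Without the clearing step, the model starts heartbeat on the online event itself, using the stale ready flag. With it, heartbeat waits for the first ready event sent after the reboot.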

Changed in starlingx:
status: In Progress → Fix Released