StarlingX

Worker node does not recover from heartbeat failure

Bug #1847656 reported by David Sullivan on 2019-10-10

This bug affects 1 person

Affects		Status	Importance	Assigned to	Milestone
	StarlingX	Fix Released	Medium	Eric MacDonald

Bug Description

Brief Description
-----------------
While testing patch orchestration, a heartbeat loss was reported for a worker node. The node did not recover from the heartbeat loss and had to be manually locked and unlocked.

Severity
--------
Major

Steps to Reproduce
------------------
Issue was seen during patch orchestration. Heartbeat loss may need to be induced to reproduce it.

Expected Behavior
-----------------
After heartbeat loss, the host should recover without manual intervention.

Actual Behavior
---------------
A lock/unlock was required to recover the host.

Reproducibility
---------------
Has occurred repeatedly in this lab on different worker nodes, but not 100% of the time.

System Configuration
--------------------
2 + 20 system. IPv6

Branch/Pull Time/Commit
-----------------------
Build from 2019-09-26_20-00-00

Last Pass
---------
Patch orchestration appears to have last been tested with build 2019-05-27_09-53-12.

Timestamp/Logs
--------------
Hosts unlocked:
2019-10-09T22:26:03.938 controller-0 VIM_Thread[114048] INFO _host_director.py.523 Unlock hosts: [u'compute-4', u'compute-6', u'compute-3', u'compute-7', u'compute-0', u'compute-10', u'compute-9', u'compute-14', u'compute-2', u'compute-15']

compute-4 is coming up:
2019-10-09T22:33:05.971 [113648.01502] controller-0 mtcAgent inv mtcInvApi.cpp ( 334) mtcInvApi_update_task : Info : compute-4 Task: Enabling (seq:26)

Heartbeat loss:
2019-10-09T22:33:07.420 [113648.01529] controller-0 mtcAgent msg mtcCtrlMsg.cpp ( 971) service_events : Info : compute-4 heartbeat loss (Mgmnt)
2019-10-09T22:33:07.420 [113648.01530] controller-0 mtcAgent hbs nodeClass.cpp (4679) manage_heartbeat_failure:Error : compute-4 Mgmnt *** Heartbeat Loss *** (during enable soak)

Manual lock/unlock:
2019-10-10 05:16:45.752 113985 INFO sysinv.api.controllers.v1.host [-] compute-4 host.update.ihost_val_prenotify {'ihost_action': 'lock'}
2019-10-10 05:19:03.374 113985 INFO sysinv.api.controllers.v1.host [-] compute-4 host.update.ihost_val_prenotify {'ihost_action': 'unlock'}
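
For readers who want the timeline at a glance, the intervals between the log entries above can be computed with a short script. This is only a sketch: it assumes all three logs (VIM, mtcAgent, sysinv) are stamped from the same clock and timezone, which the report does not state, and the event labels are informal names chosen here.

```python
from datetime import datetime


def parse(ts):
    # Accept both the "T" separator (VIM/mtcAgent logs) and a space (sysinv log).
    return datetime.strptime(ts.replace("T", " "), "%Y-%m-%d %H:%M:%S.%f")


# Timestamps copied from the log lines in this report; labels are informal.
events = [
    ("unlock requested", "2019-10-09T22:26:03.938"),
    ("compute-4 enabling", "2019-10-09T22:33:05.971"),
    ("heartbeat loss", "2019-10-09T22:33:07.420"),
    ("manual lock", "2019-10-10 05:16:45.752"),
    ("manual unlock", "2019-10-10 05:19:03.374"),
]


def gaps(events):
    # Return (from_label, to_label, timedelta) for each consecutive pair.
    parsed = [(name, parse(ts)) for name, ts in events]
    return [(a[0], b[0], b[1] - a[1]) for a, b in zip(parsed, parsed[1:])]


for start, end, delta in gaps(events):
    print(f"{start} -> {end}: {delta}")
```

Under that same-clock assumption, the heartbeat loss was logged about 1.4 seconds after the Enabling task update, and compute-4 stayed failed for roughly 6 hours 43 minutes before the manual lock.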

Test Activity
-------------
System Testing

Tags:

David Sullivan (dsullivanwr) wrote on 2019-10-10:

#1

mtcAgent.log (11.0 MiB, text/plain)

Eric MacDonald (rocksolidmtce) on 2019-10-10

Changed in starlingx:
assignee:	nobody → Eric MacDonald (rocksolidmtce)

Ghada Khalil (gkhalil) wrote on 2019-10-11:

#2

Marking as stx.3.0 / medium priority - system should automatically recover from a heartbeat loss

tags:	added: stx.3.0 stx.metal
Changed in starlingx:
importance:	Undecided → Medium
status:	New → Triaged

Eric MacDonald (rocksolidmtce) wrote on 2019-10-11:

#3

The mtcAgent.log file is not sufficient to debug this issue.
Need a collect for compute-4.

OpenStack Infra (hudson-openstack) wrote on 2019-10-15: Fix proposed to metal (master)

#4

Fix proposed to branch: master
Review: https://review.opendev.org/688724

Changed in starlingx:
status:	Triaged → In Progress

Eric MacDonald (rocksolidmtce) wrote on 2019-10-15:

#5

Fix posted for review.

OpenStack Infra (hudson-openstack) wrote on 2019-10-16: Fix merged to metal (master)

#6

Reviewed: https://review.opendev.org/688724
Committed: https://git.openstack.org/cgit/starlingx/metal/commit/?id=df50847580735b84dd68c56ea16b24bfc6f64f5c
Submitter: Zuul
Branch: master

commit df50847580735b84dd68c56ea16b24bfc6f64f5c
Author: Eric MacDonald <email address hidden>
Date: Tue Oct 15 10:30:14 2019 -0400

    Ensure hbsClient ready event is cleared over a reboot.

    A host sometimes (rarely) fails heartbeat immediately
    following unlock.

    The hbsClient sends its ready event every 5 seconds.
    Mtce uses this event message as a clue that the target
    host is ready to start heartbeat following Graceful
    Recovery or in this case Enable sequence.

    This update fixes a potential race condition where the
    hbsClient ready event snuck through immediately following
    the unlock reboot. This tricked mtc into starting heartbeat
    too early following the online event that follows a reboot
    which led to a heartbeat failure.

    Test Plan:
    PASS: compute system install
    PASS: standby controller lock/unlock soak (25 loops)
    PASS: 2 compute async locked/unlock soak (50 loops each)

    Regression:
    PASS: inservice heartbeat failure detection and handling

    Change-Id: I21699dbb2f0ab7355a9384d78b47a1fd1cea496d
    Closes-Bug: 1847656
    Signed-off-by: Eric MacDonald <email address hidden>

Changed in starlingx:
status:	In Progress → Fix Released
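
The race described in the commit message can be illustrated with a toy event model. This is not the mtce code: the function, the event names, and the choice to clear the cached flag when the host goes offline are all assumptions made for illustration, since the commit message only says the ready event is cleared "over a reboot".

```python
def simulate(events, clear_ready_on_offline):
    """Toy model of mtc's cached "hbsClient ready" flag for one host.

    Hypothetical names throughout; clear_ready_on_offline=True stands in
    for the fix, False for the behavior before it.
    """
    ready = False        # cached ready flag for the host
    back_online = False  # online event seen after the unlock reboot?
    for ev in events:
        if ev == "reboot_cmd":
            ready = False
        elif ev == "ready":
            ready = True
            if back_online:
                return "heartbeat started after fresh ready event"
        elif ev == "offline":
            if clear_ready_on_offline:
                ready = False  # assumed location of the clear
        elif ev == "online":
            back_online = True
            if ready:
                # Flag was set by a pre-reboot hbsClient: too early.
                return "heartbeat started too early (stale ready event)"
    return "heartbeat not started"


# A ready event sneaks in between the reboot command and the host going down.
racy_sequence = ["reboot_cmd", "ready", "offline", "online", "ready"]

print(simulate(racy_sequence, clear_ready_on_offline=False))
print(simulate(racy_sequence, clear_ready_on_offline=True))
```

With the flag left set across the reboot, the model starts heartbeat on the online event, before the rebooted host's hbsClient has reported in. With the flag cleared, heartbeat waits for a ready event that arrives after the host is back online.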
