StarlingX

Debian: Host unlock sometimes fails with heartbeat loss

Bug #1993656 reported by Eric MacDonald on 2022-10-20

This bug affects 1 person

Affects     Status         Importance   Assigned to      Milestone
StarlingX   Fix Released   Medium       Eric MacDonald

Bug Description

Brief Description
-----------------
Unlocking a system node sometimes fails with a heartbeat loss.

Initial analysis shows that a slower shutdown sequence in Debian can
trick the offline handler into seeing the node as recovered when in
fact it's still shutting down.

This early recovery causes the enable heartbeat soak to fail.

Severity
--------
Major: Unlock takes longer than it should due to retry

Steps to Reproduce
------------------
On an installed and provisioned Debian system:

   system host-lock <hostname>
   sleep 60
   system host-unlock <hostname>
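
A minimal sketch of how the steps above could be repeated as a soak, given the low reproduction rate. Only the `system host-lock` and `system host-unlock` commands come from this report. The function name, its parameters, and the `wait_until_enabled` check are hypothetical; the caller has to supply a check that returns True only when the host went enabled after a single unlock reboot.

```python
import subprocess
import time
from typing import Callable, Sequence


def run_cli(argv: Sequence[str]) -> None:
    """Run one CLI command and raise if it exits non-zero."""
    subprocess.run(list(argv), check=True)


def lock_unlock_soak(
    hostname: str,
    iterations: int,
    wait_until_enabled: Callable[[str], bool],
    run: Callable[[Sequence[str]], None] = run_cli,
    settle_secs: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Repeat lock / wait / unlock and return how many iterations failed.

    wait_until_enabled is a hypothetical, caller-supplied check; this sketch
    does not know how to query host state itself.
    """
    failures = 0
    for _ in range(iterations):
        run(["system", "host-lock", hostname])
        sleep(settle_secs)  # mirrors the 'sleep 60' step
        run(["system", "host-unlock", hostname])
        if not wait_until_enabled(hostname):
            failures += 1
    return failures
```

The command runner and sleep are injectable so the loop can be dry-run without a system. After a failed iteration the host may still be retrying its enable, so a real soak would probably need to wait for it to settle before the next lock.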

Expected Behavior
-----------------
Host goes enabled after single unlock reboot.

Actual Behavior
---------------
Host unlock fails and is retried. This leads to 2 more reboots, and in rare cases more.

Reproducibility
---------------
less than 5% reproducible

System Configuration
--------------------
All-In-One, IPv4 and IPv6; possibly Standard as well.

Branch/Pull Time/Commit
-----------------------
2022-09-01_18-00-06

Last Pass
---------
CentOS

Timestamp/Logs
--------------
2022-09-06T16:51:05.522 [90166.00512] controller-1 mtcAgent hbs nodeClass.cpp (5050) manage_heartbeat_failure:Error : controller-0 Mgmnt *** Heartbeat Loss *** (during enable soak)
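
A minimal sketch for picking lines like the one above out of an mtcAgent log when checking regression runs. The field layout is inferred from this single log line only, so the pattern and the names `HeartbeatLoss` and `parse_heartbeat_loss` are assumptions, not a documented log format.

```python
import re
from typing import NamedTuple, Optional

# Pattern inferred from the one log line in this report (an assumption).
HEARTBEAT_LOSS = re.compile(
    r"^(?P<timestamp>\S+)\s+\[[^\]]*\]\s+(?P<reporter>\S+)\s+mtcAgent\b.*?"
    r":Error\s*:\s*(?P<host>\S+)\s+(?P<network>\S+)\s+"
    r"\*\*\* Heartbeat Loss \*\*\*\s*(?:\((?P<context>[^)]*)\))?"
)


class HeartbeatLoss(NamedTuple):
    timestamp: str
    reporter: str           # node whose mtcAgent logged the event
    host: str               # node that lost heartbeat
    network: str            # e.g. 'Mgmnt'
    context: Optional[str]  # e.g. 'during enable soak', if present


def parse_heartbeat_loss(line: str) -> Optional[HeartbeatLoss]:
    """Return the parsed event, or None if the line is not a heartbeat loss."""
    match = HEARTBEAT_LOSS.match(line)
    if match is None:
        return None
    return HeartbeatLoss(**match.groupdict())
```

Filtering on `context == "during enable soak"` would separate this failure mode from heartbeat losses reported in other states.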

Test Activity
-------------
Automated Regression Testing

Workaround
----------
System auto recovers after first failure.
Lock and unlock the node if it enters the Auto Recovery Disable state after 2 back-to-back unlock enable failures.

Tags:

Eric MacDonald (rocksolidmtce) on 2022-10-20

Changed in starlingx:
assignee:	nobody → Eric MacDonald (rocksolidmtce)

OpenStack Infra (hudson-openstack) wrote on 2022-10-20: Fix proposed to metal (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/metal/+/862161

Changed in starlingx:
status:	New → In Progress

OpenStack Infra (hudson-openstack) wrote on 2022-10-24: Fix merged to metal (master)

Reviewed: https://review.opendev.org/c/starlingx/metal/+/862161
Committed: https://opendev.org/starlingx/metal/commit/da398e0c5f12a797a5a4039b99743d5648710527
Submitter: "Zuul (22348)"
Branch: master

commit da398e0c5f12a797a5a4039b99743d5648710527
Author: Eric MacDonald <email address hidden>
Date: Thu Oct 20 15:45:49 2022 +0000

Debian: Make Mtce offline handler more resilient to slow shutdowns

    The current offline handler assumes the node is offline after
    'offline_search_count' reaches 'offline_threshold' count
    regardless of whether mtcAlive messages were received during
    the search window.

    The offline algorithm requires that no mtcAlive messages
    be seen for the full offline_threshold count.

    During a slow shutdown the mtcClient runs for longer than
    it should and as a result can lead to maintenance seeing
    the node as recovered before it should.

    This update manages the offline search counter to ensure that
    it only reaches the count threshold after seeing no mtcAlive
    messages for the full search count. Any mtcAlive message seen
    during the count triggers a count reset.

    This update also
    1. Adjusts the reset retry cadence from 7 to 12 secs
       to prevent unnecessary reboot thrash during
       the current shutdown.
    2. Clears the hbsClient ready event at the start of the
       subfunction handler so the heartbeat soak is only
       started after seeing heartbeat client ready events
       that follow the main config.

    Test Plan:

    PASS: Debian and CentOS Build and DX install
    PASS: Verify search count management
    PASS: Verify issue does not occur over lock/unlock soak (100+)
          - where the same test without update did show issue.
    PASS: Monitor alive logs for behavioral correctness
    PASS: Verify recovery reset occurs after expected extended time.

    Closes-Bug: 1993656
    Signed-off-by: Eric MacDonald <email address hidden>
    Change-Id: If10bb75a1fb01d0ecd3f88524d74c232658ca29e
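
The counter management described in the commit message can be sketched as follows. This is an illustration only, not the mtcAgent code: `offline_search_count` and `offline_threshold` are the names used in the commit message, while the `OfflineSearch` class, its `audit` method, and the threshold value used in the example are assumptions.

```python
from dataclasses import dataclass


@dataclass
class OfflineSearch:
    """Counts consecutive audit intervals in which no mtcAlive was seen."""

    offline_threshold: int
    offline_search_count: int = 0

    def audit(self, mtc_alive_seen: bool) -> bool:
        """Process one audit interval; return True once the node is offline."""
        if mtc_alive_seen:
            # mtcClient is still running (slow shutdown): restart the count.
            self.offline_search_count = 0
            return False
        self.offline_search_count += 1
        return self.offline_search_count >= self.offline_threshold


# Example with an assumed threshold of 3: a late mtcAlive in the third
# interval forces three further silent intervals before offline is declared.
search = OfflineSearch(offline_threshold=3)
history = [search.audit(seen) for seen in
           [False, False, True, False, False, False]]
```

With the reset in place, a node that is still sending mtcAlive during a slow shutdown cannot be declared offline, so the later recovery detection is not triggered by messages from the old boot.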

Changed in starlingx:
status:	In Progress → Fix Released

Ghada Khalil (gkhalil) on 2022-10-26

Changed in starlingx:
importance:	Undecided → Medium
tags:	added: stx.8.0 stx.metal
