Fix the logic error for hostwd

Bug #1824983 reported by Tao Li
This bug affects 1 person
Affects    Status        Importance  Assigned to  Milestone
StarlingX  Fix Released  Low         wanghao

Bug Description

Brief Description
-----------------
In each period, pmon_grace_loops is decremented twice. A single decrement is enough.

Severity
--------
major

Steps to Reproduce
------------------

Expected Behavior
------------------

Actual Behavior
----------------

Reproducibility
---------------

System Configuration
--------------------
STD system

Branch/Pull Time/Commit
-----------------------
2019-10-04_16

Eric MacDonald (rocksolidmtce) wrote :

You did not fill in the template.

What is the impact to you?

Even though it decrements the number by 2, the start value is 5.
As I recall, the design is that more than 2 back-to-back misses is a failure, resulting in a watchdog timeout reboot.

If there is a code change for this issue then be sure to maintain the current behavior in terms of misses before self reboot.

Guessing that would be removing one of the "ctrl->pmon_grace_loops--" and reducing the config value of "config->hostwd_failure_threshold" to 3.

Severity is minor.

I'll take the issue.
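
The arithmetic behind this suggestion can be sketched as follows. This is a minimal, hypothetical model and not the actual hostwd source: the function name `misses_until_reboot` is invented for illustration, and it assumes that every period without a pmon message lowers the counter by a fixed step and that the reboot fires once the counter reaches zero or below.

```cpp
#include <cassert>

// Hypothetical model of the hostwd grace counter; not the actual source.
// Assumption: each period without a pmon message lowers the counter by
// `step`, and the host reboots once the counter is zero or below.
// Returns the number of back-to-back misses needed to trigger the reboot.
int misses_until_reboot(int start, int step) {
    assert(start > 0 && step > 0);
    int counter = start;
    int misses = 0;
    while (counter > 0) {
        counter -= step;  // one missed period
        ++misses;
    }
    return misses;
}
```

Under these assumptions, `misses_until_reboot(5, 2)` and `misses_until_reboot(3, 1)` both return 3, which is why a threshold of 3 with a single decrement would keep the current behavior.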

Changed in starlingx:
assignee: nobody → Eric MacDonald (rocksolidmtce)
Ghada Khalil (gkhalil) wrote :

Low priority / does not gate any starlingx release given there is no end system impact.

Changed in starlingx:
status: New → Triaged
importance: Undecided → Low
tags: added: stx.metal
wanghao (wanghao749) wrote :

Hi Eric, there is a patch that fixes this: https://review.openstack.org/#/c/652982/. We can work on this together. Thanks.

OpenStack Infra (hudson-openstack) wrote : Fix merged to metal (master)

Reviewed: https://review.opendev.org/652982
Committed: https://git.openstack.org/cgit/starlingx/metal/commit/?id=4150f91ff0687268e094a58ab177a12d56e97114
Submitter: Zuul
Branch: master

commit 4150f91ff0687268e094a58ab177a12d56e97114
Author: wanghao <email address hidden>
Date: Tue Apr 16 19:33:22 2019 +0800

    Fix the logic error in hostwd

    In each period, pmon_grace_loops is decremented twice. A single
    decrement is enough.

    Closes-Bug: #1824983
    Signed-off-by: wanghao <email address hidden>

    Change-Id: I38ad97be5554f320d3a6e7366e39afe518ae7a92

Changed in starlingx:
status: Triaged → Fix Released
Ghada Khalil (gkhalil) wrote :

Adding the stx.2.0 release label as the fix was merged for that release

Changed in starlingx:
assignee: Eric MacDonald (rocksolidmtce) → wanghao (wanghao749)
tags: added: stx.2.0
OpenStack Infra (hudson-openstack) wrote :

Reviewed: https://review.opendev.org/675452
Committed: https://git.openstack.org/cgit/starlingx/metal/commit/?id=8f9decd6e53c1712084399350aefb2b50b9f5236
Submitter: Zuul
Branch: master

commit 8f9decd6e53c1712084399350aefb2b50b9f5236
Author: Alex Kozyrev <email address hidden>
Date: Thu Aug 8 15:26:08 2019 -0400

    Fix number of allowed missed messages from pmon in hostwd

    The following commit attempted to fix failure logic in hostwd:
    4150f91ff0687268e094a58ab177a12d56e97114
    hostwd_failure_threshold has been reduced from 5 to 3 in it, but the
    logic around pmon_grace_loops hasn't been touched.
    hostwd_failure_threshold is still reduced by 2 every time we miss a
    message from pmon. So, we only allow 1 missed message before the reboot.
    Removing this double subtraction and reverting to the old behavior:
    allow 2 missed messages from PMON and reboot a system on a 3rd one.

    Closes-Bug: 1824983
    Change-Id: I57293f9383302d042faee78189409043e21b5801
    Signed-off-by: Alex Kozyrev <email address hidden>
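
The regression described in this commit message can be illustrated with a small simulation. It is a hedged sketch rather than the hostwd implementation: `reboot_period` is a hypothetical helper, and the model assumes that a received pmon message restores the counter to the threshold, a missed message lowers it by a fixed step, and a counter of zero or below triggers the reboot.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical model, not the actual hostwd source.
// Assumptions: a received pmon message restores the counter to `threshold`;
// a missed message lowers it by `step`; counter <= 0 triggers the reboot.
// Returns the 1-based period in which the reboot fires, or 0 if it never does.
int reboot_period(const std::vector<bool>& received, int threshold, int step) {
    assert(threshold > 0 && step > 0);
    int counter = threshold;
    for (std::size_t i = 0; i < received.size(); ++i) {
        if (received[i]) {
            counter = threshold;  // pmon is alive, restore the grace count
        } else {
            counter -= step;      // missed message
            if (counter <= 0) {
                return static_cast<int>(i) + 1;
            }
        }
    }
    return 0;
}
```

With the sequence received, missed, missed, received, this model survives for a threshold of 5 with a step of 2 (the original code) and for a threshold of 3 with a step of 1 (this commit), but reboots in the third period for a threshold of 3 with a step of 2 (the state after the first fix).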

OpenStack Infra (hudson-openstack) wrote : Fix proposed to metal (r/stx.2.0)

Fix proposed to branch: r/stx.2.0
Review: https://review.opendev.org/676253

OpenStack Infra (hudson-openstack) wrote : Fix merged to metal (r/stx.2.0)

Reviewed: https://review.opendev.org/676253
Committed: https://git.openstack.org/cgit/starlingx/metal/commit/?id=7ffbe49712e8b12a69646d9ac98e07b156c4a64a
Submitter: Zuul
Branch: r/stx.2.0

commit 7ffbe49712e8b12a69646d9ac98e07b156c4a64a
Author: Alex Kozyrev <email address hidden>
Date: Thu Aug 8 15:26:08 2019 -0400

    Fix number of allowed missed messages from pmon in hostwd

    The following commit attempted to fix failure logic in hostwd:
    4150f91ff0687268e094a58ab177a12d56e97114
    hostwd_failure_threshold has been reduced from 5 to 3 in it, but the
    logic around pmon_grace_loops hasn't been touched.
    hostwd_failure_threshold is still reduced by 2 every time we miss a
    message from pmon. So, we only allow 1 missed message before the reboot.
    Removing this double subtraction and reverting to the old behavior:
    allow 2 missed messages from PMON and reboot a system on a 3rd one.

    Closes-Bug: 1824983
    Change-Id: I57293f9383302d042faee78189409043e21b5801
    Signed-off-by: Alex Kozyrev <email address hidden>
    (cherry picked from commit 8f9decd6e53c1712084399350aefb2b50b9f5236)

Ghada Khalil (gkhalil) wrote :

As per discussion with John Kung and Alex Kozyrev, the subsequent commit for this bug was cherry-picked to the stx.2.0 release branch because the original commit introduced an issue. See above for more details.

tags: added: in-r-stx20