Loss of BMC access is not alarmed

Bug #1858110 reported by Eric MacDonald
Affects: StarlingX
Status: Fix Released
Importance: Medium
Assigned to: Eric MacDonald
Milestone: (none listed)

Bug Description

After initial BMC access is established and then lost, the BMC access alarm is not raised.

This bug was introduced by the following recent update - https://review.opendev.org/#/c/697309/

Severity: Minor

Steps to Reproduce: Power off a host in VLM.

Expected Behavior: The BMC access alarm is raised after a few minutes.
Actual Behavior: The BMC access alarm is not raised.

Reproducibility: 100%

System Configuration: Any system with BMCs provisioned.

Branch/Pull Time/Commit: Latest

Last Pass: Unknown

Changed in starlingx:
assignee: nobody → Eric MacDonald (rocksolidmtce)
Eric MacDonald (rocksolidmtce) wrote :

Maintenance continues to manage the BMC Access alarm on initial access
failure and continues to auto recover BMC accessibility failures.

However, a loss of accessibility is not alarmed.
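
The gap described above can be pictured with a small sketch. This is not the mtce code: the class and method names are hypothetical, and the 120 second default is only borrowed from the "usual 2 minute BMC access failure alarm" wording in the fix's commit message. The point is that the alarm decision should depend on how long access has been unavailable, not on whether access had ever been established before.

```python
class BmcAccessAlarm:
    """Hypothetical tracker for a BMC access alarm (illustration only).

    The alarm is raised once access has been unavailable for
    `threshold_secs`, both for an initial access failure and for a
    loss of access after it was previously established.
    """

    def __init__(self, threshold_secs=120):
        self.threshold_secs = threshold_secs  # assumed 2 minute window
        self.inaccessible_since = None        # start of the current outage
        self.alarmed = False

    def update(self, accessible, now):
        """Record one accessibility sample taken at time `now` (seconds)."""
        if accessible:
            # Access is working: end the outage and clear any alarm.
            self.inaccessible_since = None
            self.alarmed = False
        else:
            if self.inaccessible_since is None:
                # First failed sample of a new outage.
                self.inaccessible_since = now
            if now - self.inaccessible_since >= self.threshold_secs:
                self.alarmed = True
        return self.alarmed
```

In this sketch an outage that begins after a period of good access goes through the same path as an initial access failure, which is the behavior the bug report expects.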

OpenStack Infra (hudson-openstack) wrote : Fix proposed to metal (master)

Fix proposed to branch: master
Review: https://review.opendev.org/701030

Changed in starlingx:
status: New → In Progress
Ghada Khalil (gkhalil) wrote :

stx.4.0 / medium priority: the issue is a missing alarm, so it is not critical enough to cherry-pick to r/stx.3.0

Changed in starlingx:
importance: Undecided → Medium
tags: added: stx.4.0 stx.metal
OpenStack Infra (hudson-openstack) wrote : Fix merged to metal (master)

Reviewed: https://review.opendev.org/701030
Committed: https://git.openstack.org/cgit/starlingx/metal/commit/?id=9bf231a2866c0ff737064755d0106198d4df7d7d
Submitter: Zuul
Branch: master

commit 9bf231a2866c0ff737064755d0106198d4df7d7d
Author: Eric MacDonald <email address hidden>
Date: Fri Jan 3 09:34:37 2020 -0500

    Fix BMC access loss handling

    Recent refactoring of the BMC handler FSM introduced a code change that
    prevents the BMC Access alarm from being raised after initial BMC
    accessibility was established and is then lost.

    This update ensures BMC access alarm management is working properly.

    This update also implements ping failure debounce so that a single ping
    failure does not trigger full reconnection handling. Instead, that now
    requires 3 ping failures in a row. This has the effect of adding a minute
    to ping failure action handling before the usual 2 minute BMC access failure
    alarm is raised. Ping failure logging is reduced/improved.

    Test Plan: for both hwmond and mtcAgent

    PASS: Verify BMC access alarm due to bad provisioning (un, pw, ip, type)
    PASS: Verify BMC ping failure debounce handling, recovery and logging
    PASS: Verify BMC ping persistent failure handling
    PASS: Verify BMC ping periodic miss handling
    PASS: Verify BMC ping and access failure recovery timing
    PASS: Verify BMC ping failure and recovery handling over BMC link pull/plug
    PASS: Verify BMC sensor monitoring stops/resumes over ping failure/recovery

    Regression:

    PASS: Verify IPv6 System Install using provisioned BMCs (wp8-12)
    PASS: Verify BMC power-off request handling with BMC ping failing & recovering
    PASS: Verify BMC power-on request handling with BMC ping failing & recovering
    PASS: Verify BMC reset request handling with BMC ping failing & recovering
    PASS: Verify BMC sensor group read failure handling & recovery
    PASS: Verify sensor monitoring after ping failure handling & recovery

    Change-Id: I74870816930ef6cdb11f987424ffed300ff8affe
    Closes-Bug: 1858110
    Signed-off-by: Eric MacDonald <email address hidden>
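
The ping failure debounce described in the commit message above can be sketched as a consecutive-failure counter. This is a minimal illustration, not the code from the commit: the class name, method name, and return convention are hypothetical, and only the threshold of 3 failures in a row comes from the commit message.

```python
class PingDebounce:
    """Hypothetical consecutive-failure debounce (illustration only)."""

    def __init__(self, threshold=3):
        self.threshold = threshold       # 3 in a row, per the commit message
        self.consecutive_failures = 0

    def record(self, ping_ok):
        """Record one ping result.

        Returns True when the failure count has reached the threshold,
        meaning reconnection handling should run. A successful ping
        resets the count, so isolated misses never trigger it.
        """
        if ping_ok:
            self.consecutive_failures = 0
            return False
        self.consecutive_failures += 1
        return self.consecutive_failures >= self.threshold


debounce = PingDebounce()
# One miss, a recovery, then three misses in a row.
results = [debounce.record(ok) for ok in (False, True, False, False, False)]
```

Only the last sample in the usage example reports True; the single miss before the recovery is absorbed by the debounce.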

Changed in starlingx:
status: In Progress → Fix Released
Eric MacDonald (rocksolidmtce) wrote :

Please consider this issue's update for stx 3.0

Impact: Loss of BMC access is not alarmed

Commit Message: Fix BMC access loss handling

Recent refactoring of the BMC handler FSM introduced a code change that prevents the BMC Access alarm from being raised after initial BMC accessibility was established and is then lost.

This update ensures BMC access alarm management is working properly.

This update also implements ping failure debounce so that a single ping failure does not trigger full reconnection handling. Instead, that now requires 3 ping failures in a row. This has the effect of adding a minute to ping failure action handling before the usual 2 minute BMC access failure alarm is raised. Ping failure logging is reduced/improved.

Ghada Khalil (gkhalil) wrote :

Adding stx.3.0 release tag based on recommendation by development prime (above).
@Eric, please cherrypick to r/stx.3.0 at your earliest convenience.

tags: added: stx.3.0
OpenStack Infra (hudson-openstack) wrote : Fix proposed to metal (r/stx.3.0)

Fix proposed to branch: r/stx.3.0
Review: https://review.opendev.org/702534

OpenStack Infra (hudson-openstack) wrote : Fix merged to metal (r/stx.3.0)

Reviewed: https://review.opendev.org/702534
Committed: https://git.openstack.org/cgit/starlingx/metal/commit/?id=eaf1f0c93db8e1219914ee0ae4b81a99549cff18
Submitter: Zuul
Branch: r/stx.3.0

commit eaf1f0c93db8e1219914ee0ae4b81a99549cff18
Author: Eric MacDonald <email address hidden>
Date: Fri Jan 3 09:34:37 2020 -0500

    Fix BMC access loss handling

    Recent refactoring of the BMC handler FSM introduced a code change that
    prevents the BMC Access alarm from being raised after initial BMC
    accessibility was established and is then lost.

    This update ensures BMC access alarm management is working properly.

    This update also implements ping failure debounce so that a single ping
    failure does not trigger full reconnection handling. Instead, that now
    requires 3 ping failures in a row. This has the effect of adding a minute
    to ping failure action handling before the usual 2 minute BMC access failure
    alarm is raised. Ping failure logging is reduced/improved.

    Test Plan: for both hwmond and mtcAgent

    PASS: Verify BMC access alarm due to bad provisioning (un, pw, ip, type)
    PASS: Verify BMC ping failure debounce handling, recovery and logging
    PASS: Verify BMC ping persistent failure handling
    PASS: Verify BMC ping periodic miss handling
    PASS: Verify BMC ping and access failure recovery timing
    PASS: Verify BMC ping failure and recovery handling over BMC link pull/plug
    PASS: Verify BMC sensor monitoring stops/resumes over ping failure/recovery

    Regression:

    PASS: Verify IPv6 System Install using provisioned BMCs (wp8-12)
    PASS: Verify BMC power-off request handling with BMC ping failing & recovering
    PASS: Verify BMC power-on request handling with BMC ping failing & recovering
    PASS: Verify BMC reset request handling with BMC ping failing & recovering
    PASS: Verify BMC sensor group read failure handling & recovery
    PASS: Verify sensor monitoring after ping failure handling & recovery

    Change-Id: I74870816930ef6cdb11f987424ffed300ff8affe
    Closes-Bug: 1858110
    Signed-off-by: Eric MacDonald <email address hidden>
    (cherry picked from commit 9bf231a2866c0ff737064755d0106198d4df7d7d)

Ghada Khalil (gkhalil)
tags: added: in-r-stx30
OpenStack Infra (hudson-openstack) wrote : Fix proposed to metal (f/centos8)

Fix proposed to branch: f/centos8
Review: https://review.opendev.org/705848

OpenStack Infra (hudson-openstack) wrote : Fix merged to metal (f/centos8)

Reviewed: https://review.opendev.org/705848
Committed: https://git.openstack.org/cgit/starlingx/metal/commit/?id=3e2ecfca815619a0aea2a7b8169ad297c09aaec8
Submitter: Zuul
Branch: f/centos8

commit 1f0706ee30f44c666714ea16739c346326074fcb
Author: Don Penney <email address hidden>
Date: Mon Feb 3 12:48:03 2020 -0500

    Drop clone.py references from boot menu comment blocks

    As part of recent code cleanup activities, the controllerconfig
    clone.py module was dropped. This update removes references to this
    file from the comment blocks of the boot menu files grub.cfg and
    centos.syslinux.cfg.

    Change-Id: If2260c74f58a61481cf3ed7c7fbbe5ebb5292b00
    Partial-Bug: 1834218
    Signed-off-by: Don Penney <email address hidden>

commit 91f488af02e2bc27007690450a3ee63826e67c86
Author: Jim Somerville <email address hidden>
Date: Mon Jan 27 17:18:47 2020 -0500

    Security: Handle nospectre_v1 in the bootargs

    Most of the v1 mitigation is baked into the kernel and not
    optional. The swapgs barriers are, however, optional.
    They have a negative performance impact so we disable them
    by using the nospectre_v1 kernel bootarg.

    Partial-Bug: 1860193
    Depends-On: https://review.opendev.org/#/c/704406
    Change-Id: Id11232fe113293ed04b2802aaf038e2eedf9d797
    Signed-off-by: Jim Somerville <email address hidden>

commit eff0d663776587a6ccca6c30a0433baf8663aa09
Author: Angie Wang <email address hidden>
Date: Mon Jan 20 15:09:10 2020 -0500

    Remove unused post_clone_iso_ks.cfg

    Change-Id: I3be9384b94473cc6e0f6efbc1e404c5878856ffc
    Partial-Bug: 1834218
    Depends-On: https://review.opendev.org/#/c/703516/
    Signed-off-by: Angie Wang <email address hidden>

commit 4609dacc5f2ffa54012eaffc60196488a6c589b6
Author: Eric MacDonald <email address hidden>
Date: Mon Jan 20 11:34:07 2020 -0500

    Fix rvmc container build

    The current rvmc container build has a conflict over the
    python3-pip install.

    The CentOS repo has a prior version of python3-pip that
    requires python3 3.6 but the CENGN repo requires 3.7 due
    to the Titanium base image.

    The issue is resolved by updating this dockerfile with an
    older version of python3-pip.

    Also, since python3-pip provides pip3 the explicit pip3
    install is no longer required.

    Change-Id: Ic0cf0d070eaa8f437a043ac52dfa7ecf0e42f957
    Story: 2006980
    Task: 37775
    Signed-off-by: Eric MacDonald <email address hidden>

commit a9a2ca64bc409dd74fc24639e2ece334324c4b8d
Author: Saul Wold <email address hidden>
Date: Wed Jan 15 19:22:09 2020 -0800

    rvmc: remove un-used build data

    The error below is reported in the build logs because rvmc is not
    set up to be built as an RPM package.

    ERROR: build_dir (425): Neither srpm_path nor .spec file not found
    in '/localdisk/designer/swold/stx/cgcs-root/stx/metal/tools/rvmc/centos'

    Closes-Bug: 1859893

    Change-Id: I9b2788bb227afbdf49e2faa5f05628331719233e
    Signed-off-by: Saul Wold <email address hidden>

commit d59ba5fdc21a89581cdc4e3fad038645b9d20754
Author: Al Baile...


tags: added: in-f-centos8