mtcAgent and/or hwmond connecting to the BMC over process restart failed intermittently

Bug #1818284 reported by mhg
Affects: StarlingX
Status: Fix Released
Importance: Medium
Assigned to: Alexander Kozyrev
Milestone: (none listed)

Bug Description

Brief Description
-----------------
The mtcAgent and/or hwmond intermittently failed to connect to the BMC after a process restart. This raised alarms, which in turn caused several Sanity test cases to fail. The alarms were of the form:
200.010 | controller-1 access to board management module has failed.

Severity
--------
Major

Steps to Reproduce
------------------
Swact the active controllers or force-reboot a node with running VMs

Expected Behavior
------------------
No alarms remained uncleared after 5 minutes

Actual Behavior
----------------
Some alarms remained uncleared after 5 minutes, e.g.:
200.010 | controller-0 access to board management module has failed
200.010 | controller-1 access to board management module has failed
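
The expected-behavior check above (no BMC access alarms left after 5 minutes) could be automated roughly as follows. This is only a sketch: `list_alarms` is a hypothetical callable that returns the current alarm lines in the `ID | text` form shown in this report, and the function names and the 10-second polling interval are assumptions, not part of any StarlingX tool.

```python
import time

BMC_ALARM_ID = "200.010"  # alarm ID taken from this report


def bmc_alarms(lines):
    """Return the alarm lines whose ID column equals the BMC access alarm ID."""
    return [ln for ln in lines if ln.split("|", 1)[0].strip() == BMC_ALARM_ID]


def wait_for_clear(list_alarms, timeout_s=300, interval_s=10,
                   sleep=time.sleep, clock=time.monotonic):
    """Poll until no BMC access alarms remain or the timeout expires.

    list_alarms is a hypothetical callable returning alarm lines as strings.
    Returns an empty list on success, or the alarms still raised at timeout.
    """
    deadline = clock() + timeout_s
    while True:
        remaining = bmc_alarms(list_alarms())
        if not remaining:
            return []
        if clock() >= deadline:
            return remaining
        sleep(interval_s)
```

A test would pass if `wait_for_clear(...)` returns an empty list, and would otherwise report the returned lines as the alarms that stayed raised.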

Reproducibility
---------------
Intermittent

System Configuration
--------------------
Found on a two-node system, but may also occur on other lab types

Branch/Pull Time/Commit
-----------------------
CentOS7.6

Timestamp/Logs
--------------
2019-02-27 22:44:47

Revision history for this message
Ghada Khalil (gkhalil) wrote :

Marking as release gating; may be related to the code changes that introduced Barbican

Changed in starlingx:
assignee: nobody → Alex Kozyrev (akozyrev)
importance: Undecided → Medium
status: New → Triaged
tags: added: stx.2019.05 stx.metal
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to stx-metal (master)

Fix proposed to branch: master
Review: https://review.openstack.org/649988

Changed in starlingx:
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to stx-metal (master)

Reviewed: https://review.openstack.org/649988
Committed: https://git.openstack.org/cgit/openstack/stx-metal/commit/?id=aeb2c1f20af84047803533de1873d8887722501b
Submitter: Zuul
Branch: master

commit aeb2c1f20af84047803533de1873d8887722501b
Author: Alex Kozyrev <email address hidden>
Date: Thu Apr 4 09:14:27 2019 -0400

    Fix for MTCE race condition in BMC secret handling

    There is an intermittent issue in getting the BMC password in MTCE.
    The process of obtaining a secret from Barbican stops after
    a secret reference is received. No attempt is made to retrieve
    the actual payload. This happens when the secret reference
    reply is received right after BMC queries are initiated. This
    was fine before, when we had a one-stage process of getting
    a password from the keyring. We cannot allow it now because of
    the two-stage Barbican process.

    Change-Id: I381f69ab6a1a54118b22dd31feefcd93698120ad
    Closes-bug: 1818284
    Signed-off-by: Alex Kozyrev <email address hidden>
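
To illustrate the race described in the commit message, here is a minimal sketch of a two-stage secret lookup in which a reference reply always triggers the payload request, regardless of what else is already in progress. It is not the actual stx-metal patch: the `SecretFetch` class, the `Stage` names, and the `send_request` callable are all hypothetical and stand in for whatever MTCE uses internally.

```python
from enum import Enum, auto


class Stage(Enum):
    IDLE = auto()
    WAIT_REFERENCE = auto()  # stage 1: secret reference requested
    WAIT_PAYLOAD = auto()    # stage 2: secret payload requested
    DONE = auto()


class SecretFetch:
    """Hypothetical per-host tracker for the two-stage secret lookup."""

    def __init__(self, send_request):
        # send_request(kind, arg) is an assumed transport callable.
        self.send_request = send_request
        self.stage = Stage.IDLE
        self.password = None

    def start(self, host_uuid):
        self.stage = Stage.WAIT_REFERENCE
        self.send_request("reference", host_uuid)

    def on_reply(self, kind, value):
        if kind == "reference" and self.stage is Stage.WAIT_REFERENCE:
            # Always continue to stage 2, even if BMC queries have
            # already been initiated; stopping here is the failure
            # mode the commit message describes.
            self.stage = Stage.WAIT_PAYLOAD
            self.send_request("payload", value)
        elif kind == "payload" and self.stage is Stage.WAIT_PAYLOAD:
            self.password = value
            self.stage = Stage.DONE

    def ready(self):
        """True only once the payload (the password) has arrived."""
        return self.stage is Stage.DONE
```

In this sketch, BMC access would be attempted only once `ready()` returns true, so a reference reply that arrives late still leads to the payload being fetched.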

Changed in starlingx:
status: In Progress → Fix Released
Ken Young (kenyis)
tags: added: stx.2.0
removed: stx.2019.05
Ghada Khalil (gkhalil)
tags: added: stx.retestneeded
Revision history for this message
mhg (marvinhg) wrote :

The problem hasn't been seen in recent loads and could not be reproduced in sanity with Load: 20190421T233001Z.
Can close this bug as fixed.

tags: removed: stx.retestneeded