Debian: mtcAgent segfaults when handling barbican secret fetch failures

Bug #1975520 reported by Eric MacDonald
Affects    Status        Importance  Assigned to     Milestone
StarlingX  Fix Released  High        Eric MacDonald

Bug Description

Brief Description
-----------------
Provisioning a BMC has been seen to cause an mtcAgent segmentation fault in the Debian environment.

Severity
--------
Critical: mtcAgent segfault can cause a SWACT

Steps to Reproduce
------------------
Provision a BMC while barbican is unable to provide the BMC password.

Expected Behavior
------------------
mtcAgent handles the secret fetch failure without a segmentation fault.

Actual Behavior
----------------
mtcAgent sometimes exits or segfaults.

Reproducibility
---------------
Intermittent

System Configuration
--------------------
Any system that supports provisionable BMCs

Branch/Pull Time/Commit
-----------------------
BUILD_DATE="2022-05-11 10:03:49 +0000"

Last Pass
---------
Passes in the CentOS environment. This is a new bug found in the Debian environment.

Timestamp/Logs
--------------
2022-04-11T18:02:13.089 localhost kernel: info [ 2684.071691] mtcAgent[95961]: segfault at 55959eeebc60 ip 00007f96f1d3b208 sp 00007ffc36ffd0c0 error 4 in libc-2.31.so[7f96f1cd6000+14b000]
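
The kernel log line above can be read more precisely with a little arithmetic. The following is a minimal sketch, not part of the fix: the struct and function names are hypothetical, and the only inputs are the numbers copied from the log line. It computes the offset of the faulting instruction inside the libc-2.31.so mapping and decodes the low three bits of the x86 page-fault error code.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Fields copied from the kernel log line (names are illustrative only).
struct SegfaultInfo {
    std::uint64_t ip;        // faulting instruction pointer
    std::uint64_t map_base;  // start of the libc-2.31.so mapping
    std::uint64_t map_size;  // size of that mapping
    unsigned      error;     // x86 page-fault error code
};

// Offset of the faulting instruction relative to the start of the mapping.
inline std::uint64_t offset_in_mapping(const SegfaultInfo& s) {
    assert(s.ip >= s.map_base && s.ip < s.map_base + s.map_size);
    return s.ip - s.map_base;
}

// Decode the low three bits of the x86 page-fault error code:
//   bit 0: 0 = page not present, 1 = protection violation
//   bit 1: 0 = read access,      1 = write access
//   bit 2: 0 = kernel mode,      1 = user mode
inline std::string describe_error(unsigned error) {
    std::string out;
    out += (error & 0x4) ? "user-mode " : "kernel-mode ";
    out += (error & 0x2) ? "write" : "read";
    out += (error & 0x1) ? ", protection violation" : ", page not present";
    return out;
}

// The values reported in this bug.
inline SegfaultInfo reported_fault() {
    return {0x7f96f1d3b208ULL, 0x7f96f1cd6000ULL, 0x14b000ULL, 4};
}
```

For the reported values the offset is 0x65208 and error 4 decodes to a user-mode read of a page that is not present. The offset is relative to the start of the mapping shown in the log. Resolving it to a libc symbol would also need the file offset of that mapping, which the log line does not give.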

Test Activity
-------------
Debian integration testing

Workaround
----------
Fix barbican so that secret fetch failures are avoided

Changed in starlingx:
assignee: nobody → Eric MacDonald (rocksolidmtce)
OpenStack Infra (hudson-openstack) wrote : Fix proposed to metal (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/metal/+/843134

Changed in starlingx:
status: New → In Progress
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/metal/+/843139

OpenStack Infra (hudson-openstack) wrote : Change abandoned on metal (master)

Change abandoned by "Eric MacDonald <email address hidden>" on branch: master
Review: https://review.opendev.org/c/starlingx/metal/+/843134
Reason: Accidentally uploaded new review after amending changes without managing the update id properly.

Ghada Khalil (gkhalil)
summary: - mtcAgent segfaults when handling barbican secret fetch failures
+ Debian: mtcAgent segfaults when handling barbican secret fetch failures
tags: added: stx.7.0 stx.debian stx.metal stx.security
OpenStack Infra (hudson-openstack) wrote : Fix merged to metal (master)

Reviewed: https://review.opendev.org/c/starlingx/metal/+/843139
Committed: https://opendev.org/starlingx/metal/commit/aaf9d080289b63a2d90fc9874b7bc91524bed0a1
Submitter: "Zuul (22348)"
Branch: master

commit aaf9d080289b63a2d90fc9874b7bc91524bed0a1
Author: Eric MacDonald <email address hidden>
Date: Tue May 24 12:10:06 2022 +0000

    Mtce: Fix bmc password fetch error handling

    The mtcAgent process sometimes segfaults while trying to fetch
    the bmc password from a failing barbican process.

    With that issue fixed the mtcAgent sends the bmc access
    credentials to the hardware monitor (hwmond) process which
    then segfaults for a similar reason.

    In cases where the process does not segfault but also does not
    get a bmc password, the mtcAgent will flood its log file.

    This update:

     1. Prevents the segfault case by properly managing acquired
        json-c object releases. There was one in the mtcAgent and
        another in the hardware monitor (hwmond).

        The json_object_put object release api should only be called
        against objects that were created with very specific apis.
        See new comments in the code.

     2. Avoids the log flooding error case by performing a password
        size check rather than assuming the password is valid following
        the secret payload receive stage.

     3. Simplifies the secret fsm and error and retry handling.

     4. Deletes useless creation and release of a few unused json
        objects in the common jsonUtil and hwmonJson modules.
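
The ownership rule in item 1 can be illustrated with a toy model. This is a sketch only: the commit text does not show the offending code, and the Node type and node_* helpers below are hypothetical stand-ins, not json-c. They mirror the json-c convention that json_tokener_parse and the json_object_new_* calls return a reference the caller owns, that json_object_object_get_ex hands back a borrowed pointer without raising the reference count, and that json_object_put drops one reference and frees the object at zero.

```cpp
#include <cassert>
#include <map>
#include <string>

// Toy stand-in for a reference-counted JSON node (hypothetical, not json-c).
struct Node {
    int refcount = 1;                        // creator owns the first reference
    std::string value;
    std::map<std::string, Node*> children;   // parent holds one ref per child
};

static int g_live_nodes = 0;                 // lets a test check for leaks

// Mirrors json_object_new_*: the caller owns the returned reference.
Node* node_new(const std::string& value = "") {
    Node* n = new Node;
    n->value = value;
    ++g_live_nodes;
    return n;
}

// Mirrors json_object_put: drop one reference, free at zero.
// Returns 1 if the node was freed, 0 otherwise.
int node_put(Node* n) {
    if (n == nullptr) return 0;
    assert(n->refcount > 0);
    if (--n->refcount > 0) return 0;
    for (auto& kv : n->children) node_put(kv.second);
    delete n;
    --g_live_nodes;
    return 1;
}

// Mirrors json_object_object_add: the parent takes over the caller's reference.
void node_add(Node* parent, const std::string& key, Node* child) {
    Node*& slot = parent->children[key];
    if (slot != nullptr) node_put(slot);     // release any value being replaced
    slot = child;
}

// Mirrors json_object_object_get_ex: borrowed pointer, refcount unchanged.
Node* node_get_borrowed(Node* parent, const std::string& key) {
    auto it = parent->children.find(key);
    return it == parent->children.end() ? nullptr : it->second;
}

// Correct pattern: copy what is needed out of the borrowed child, then
// release only the root object that this function created.
std::string read_password_ok() {
    Node* root = node_new();
    node_add(root, "password", node_new("secret"));
    Node* pw = node_get_borrowed(root, "password");  // borrowed: do NOT put
    std::string out = pw ? pw->value : "";
    node_put(root);                                  // frees root and child
    return out;
}
```

In this model, calling node_put(pw) in addition to node_put(root) would release a reference the function never owned, so the child would be freed twice or read after being freed. That is the class of error the rule guards against. Whether the original fault took exactly this form is an assumption.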

    Note: This update temporarily disables sensor and sensorgroup
          suppression support for the debian hardware monitor while
          a suppression type fix in sysinv is being investigated.

    Test Plan:

    PASS: Verify success path bmc password secret fetch
    PASS: Verify secret reference get error handling
    PASS: Verify secret password read error handling
    PASS: Verify 24 hr provision/deprov success path soak
    PASS: Verify 24 hr provision/deprov error path soak
    PASS: Verify no memory leak over success and failure path soaking
    PASS: Verify failure handling stress soak with reduced retry delay
    PASS: Verify blocking secret fetch success and error handling
    PASS: Verify non-blocking secret fetch success and error handling
    PASS: Verify secret fetch is set non-blocking
    PASS: Verify success and failure path logging
    PASS: Verify all of jsonUtil module manages object release properly
    PASS: Verify hardware monitor sensor model creation, monitoring,
                 alarming and relearning. This test requires suppress
                 disable in order to create sensor groups in debian.
    PASS: Verify both ipmi and redfish and switch between them with
                 just bm_type change.
    PASS: Verify all above tests in CentOS
    PASS: Verify over 4000 provision/deprovision cycles across both
                 failure and success path handling with no process
                 failures

    Closes-Bug: 1975520
    Signed-off-by: Eric MacDonald <email address hidden>
    Change-Id: Ibbfdaa1de662290f641d845...
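
Items 2 and 3 of the commit message describe a password size check and a simplified secret FSM with retry handling. The following is a minimal sketch of that idea under stated assumptions. The stage names, the Reply type and the counters are hypothetical and are not taken from the mtcAgent source. It only shows one way an empty payload can be treated as a failure that is logged once and retried, rather than accepted as a valid password.

```cpp
#include <cassert>
#include <string>

// Hypothetical stages of a secret fetch: get the secret reference, read the
// password payload, and a wait state entered after any failure.
enum class SecretStage { GetReference, ReadPassword, RetryWait, Done };

// Hypothetical result of one request to the secret store.
struct Reply {
    bool ok;
    std::string body;
};

struct SecretFetch {
    SecretStage stage = SecretStage::GetReference;
    std::string reference;
    std::string password;
    int retries = 0;
    int error_logs = 0;   // stands in for log lines written
};

// Single failure path: clear partial state, count one log, wait for retry.
inline void on_failure(SecretFetch& f) {
    f.reference.clear();
    f.password.clear();
    ++f.retries;
    ++f.error_logs;       // one log per failed attempt, not one per poll
    f.stage = SecretStage::RetryWait;
}

// Advance the FSM by one event. In RetryWait the caller is assumed to invoke
// step() once the retry delay has expired; the reply is ignored there.
inline void step(SecretFetch& f, const Reply& r) {
    switch (f.stage) {
    case SecretStage::GetReference:
        if (r.ok && !r.body.empty()) {
            f.reference = r.body;
            f.stage = SecretStage::ReadPassword;
        } else {
            on_failure(f);
        }
        break;
    case SecretStage::ReadPassword:
        // Size check: a successful receive with an empty payload is a failure.
        if (r.ok && !r.body.empty()) {
            f.password = r.body;
            f.stage = SecretStage::Done;
        } else {
            on_failure(f);
        }
        break;
    case SecretStage::RetryWait:
        f.stage = SecretStage::GetReference;
        break;
    case SecretStage::Done:
        break;
    }
}
```

Routing every error through one on_failure() path keeps the retry and logging behavior identical for reference and payload failures, which is the kind of simplification item 3 refers to.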


Changed in starlingx:
status: In Progress → Fix Released
Ghada Khalil (gkhalil)
Changed in starlingx:
importance: Undecided → High