hwmond coredump generated when handling a failed token request from keystone

Bug #2063475 reported by Eric MacDonald
This bug affects 1 person
Affects    Status        Importance  Assigned to     Milestone
StarlingX  Fix Released  Low         Eric MacDonald

Bug Description

Brief Description:
------------------
A hardware monitor coredump was generated during the handling of a failed token request.

Severity:
---------
Minor - process is restarted automatically

Steps to Reproduce:
-------------------
Receive a token response from keystone that has no token in the header

Expected Behavior
------------------
Token request is retried until success

Actual Behavior
----------------
Process core dumps

Reproducibility
---------------
100% reproducible when the failure occurs.

System Configuration
--------------------
Any

Branch/Pull Time/Commit
-----------------------
Master at time of issue report

Last Pass
---------
Day 1 failure path behavior.

Timestamp/Logs
--------------
2024-02-09T21:22:31.677 [2763684.00091] controller-0 hwmond tok tokenUtil.cpp ( 697) tokenUtil_new_token : Info : controller-0 Requesting Authentication Token
2024-02-09T21:22:33.788 [2763684.00092] controller-0 hwmond tok tokenUtil.cpp ( 455) tokenUtil_handler :Error : controller-0 Token Request Failed - Error Code (78)
2024-02-09T21:22:33.788 [2763684.00093] controller-0 hwmond tok tokenUtil.cpp ( 465) tokenUtil_handler :Error : controller-0 Token Request Failed - no token in header
2024-02-09T21:23:04.216 [2923495.00000] localhost hwmond --- daemon_files.cpp (1091) daemon_files_init : Info : --- Daemon Start-Up --- pid:2923495

Test Activity
-------------
Developer Testing

Workaround
----------
Avoid sending a token response with no token in the header.

OpenStack Infra (hudson-openstack) wrote : Fix proposed to metal (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/metal/+/917413

Changed in starlingx:
status: New → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to metal (master)

Reviewed: https://review.opendev.org/c/starlingx/metal/+/917413
Committed: https://opendev.org/starlingx/metal/commit/4e62e3ac9f29c44872a93933a9c4075bd778b293
Submitter: "Zuul (22348)"
Branch: master

commit 4e62e3ac9f29c44872a93933a9c4075bd778b293
Author: Eric MacDonald <email address hidden>
Date: Mon Apr 29 13:09:00 2024 +0000

    Prevent process coredump due to missing token in response header

    Both Maintenance and the Hardware Monitor use a common token refresh
    utility that has been seen to crash the calling process when a token
    'get' request is missing the token in its response header.

    This update avoids that by exiting the token handler at error
    detection point rather than continue handling the response with
    invalid data.

    Significant fault insertion testing was performed on the update
    which led to some additional improvements in token request error
    handling that both processes benefit from.

    Additional specific fixes include
    - fixed race condition memory leak around authentication error handling
    - differentiate token refresh from failure recovery renewal.
    - fixed a few missing event status / rc updates.

    Test Plan:
     - used mtce fault insertion tools to create failure modes
     - 24+ hr memory leak test run for both success & token error handling
     - all tests were done with both hwmond and mtcAgent

    PASS: Verify build and AIO DX install.
    PASS: Verify reported hwmon coredump issue is avoided/resolved.
    PASS: Verify issue also exists in the mtcAgent and is also
          avoided/resolved by this update.

    Regression:

    PASS: Verify token get failure retry handling:
    PASS: - get first token inline - retry cadence: 5 seconds
    PASS: - refresh token by http - retry cadence: 10, 30 and 1200 secs
    PASS: Verify recovery handling cases:
    PASS: - corrupt token
    PASS: - no token present
    PASS: - no token in header
    PASS: Verify token renewal stress soak ; every 10 seconds for 24+ hrs
    PASS: - repeat over token get failure cases
    PASS: - in each success and failure case verify no memory leaks.
    PASS: Verify authentication error handling soak
          - every 10-60 secs for 24+ hrs
          - token is corrupted followed by a sysinv request to
            exercise authentication error handling and renewal process.
    PASS: Verify no coredumps.
    PASS: Verify logging and token retry.
    PASS: Verify process continues to use the previous token until a new
          one is acquired.
          - Token Refresh is on time.
          - Token Renew is on event.
    PASS: Verify soak of persistent authentication error / token
          renewal cycle. No memory leak or coredumps.

    Closes-Bug: 2063475
    Change-Id: I5eef62518ac606e6b54323b46fbb6f9475b5c1ef
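
The commit message above describes the core of the fix as exiting the token handler at the point where the error is detected, instead of continuing to process a response that carries no token. The following is a minimal sketch of that pattern only, not the actual tokenUtil.cpp code. The TokenResponse, TokenCache, TokenResult and handle_token_response names are hypothetical and exist only for illustration. The one external fact assumed is that Keystone v3 returns the token in the X-Subject-Token response header; the sketch also assumes header names have already been normalized to that spelling.

```cpp
#include <map>
#include <string>

// Hypothetical, simplified stand-in for an HTTP response.
// This is NOT the event structure used by hwmond or mtcAgent.
struct TokenResponse {
    // Assumes header names are already normalized to canonical case.
    std::map<std::string, std::string> headers;
};

// Hypothetical token cache. The previous token stays in place until a
// new one has been successfully acquired.
struct TokenCache {
    std::string token;
};

// Hypothetical result type: the caller schedules a retry on RetryLater.
enum class TokenResult { Ok, RetryLater };

TokenResult handle_token_response(const TokenResponse& rsp, TokenCache& cache) {
    // Keystone v3 carries the token in the X-Subject-Token header.
    auto it = rsp.headers.find("X-Subject-Token");
    if (it == rsp.headers.end() || it->second.empty()) {
        // Exit at the error detection point. Nothing past this line
        // runs with invalid data, and the cached token is untouched.
        return TokenResult::RetryLater;
    }

    // Only reached when a non-empty token is present.
    cache.token = it->second;
    return TokenResult::Ok;
}
```

In this sketch a response without the header leaves the cache unchanged and returns RetryLater, which mirrors the tested behavior listed in the commit message: the process continues to use the previous token until a new one is acquired.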

Changed in starlingx:
status: In Progress → Fix Released
Ghada Khalil (gkhalil)
Changed in starlingx:
importance: Undecided → Low
tags: added: stx.10.0 stx.metal
Changed in starlingx:
assignee: nobody → Eric MacDonald (rocksolidmtce)