Comment 2 for bug 2063475

OpenStack Infra (hudson-openstack) wrote : Fix merged to metal (master)

Reviewed: https://review.opendev.org/c/starlingx/metal/+/917413
Committed: https://opendev.org/starlingx/metal/commit/4e62e3ac9f29c44872a93933a9c4075bd778b293
Submitter: "Zuul (22348)"
Branch: master

commit 4e62e3ac9f29c44872a93933a9c4075bd778b293
Author: Eric MacDonald <email address hidden>
Date: Mon Apr 29 13:09:00 2024 +0000

    Prevent process coredump due to missing token in response header

    Both Maintenance and the Hardware Monitor use a common token refresh
    utility that has been seen to crash the calling process when the
    response to a token 'get' request is missing the token in its header.

    This update avoids that by exiting the token handler at the error
    detection point rather than continuing to handle the response with
    invalid data.
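
The early-exit behaviour described above can be illustrated with a minimal Python sketch. It is not the mtce implementation, which is C++ and is not reproduced here. The names `TokenState`, `handle_token_response` and the return codes are hypothetical. The header name assumes a Keystone v3 style response, where the token ID is carried in `X-Subject-Token`.

```python
from dataclasses import dataclass
from typing import Dict, Optional

# Assumption: Keystone v3 style response, token ID in this header.
TOKEN_HEADER = "X-Subject-Token"

# Hypothetical return codes, for illustration only.
PASS = 0
FAIL_TOKEN_MISSING = 1


@dataclass
class TokenState:
    token: Optional[str] = None  # last known-good token
    status: int = PASS           # result of the most recent request


def handle_token_response(state: TokenState, headers: Dict[str, str]) -> int:
    """Update state from a token 'get' response, exiting early on error."""
    token = headers.get(TOKEN_HEADER)
    if not token:
        # Exit at the point of detection: record the failure and return
        # without processing the rest of the response or overwriting
        # the stored token.
        state.status = FAIL_TOKEN_MISSING
        return state.status

    state.token = token
    state.status = PASS
    return state.status
```

In this sketch a response without the header leaves the stored token untouched, so the caller can keep using it while a retry is scheduled.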

    Significant fault insertion testing was performed on the update,
    which led to some additional improvements in token request error
    handling that both processes benefit from.

    Additional specific fixes include:
    - fixed a race condition memory leak around authentication error handling
    - differentiated token refresh from failure recovery renewal
    - fixed a few missing event status / rc updates

    Test Plan:
     - used mtce fault insertion tools to create failure modes
     - 24+ hr memory leak test run for both success & token error handling
     - all tests were done with both hwmond and mtcAgent

    PASS: Verify build and AIO DX install.
    PASS: Verify reported hwmon coredump issue is avoided/resolved.
    PASS: Verify issue also exists in the mtcAgent and is also
          avoided/resolved by this update.

    Regression:

    PASS: Verify token get failure retry handling:
    PASS: - get first token inline - retry cadence: 5 seconds
    PASS: - refresh token by http - retry cadence: 10, 30 and 1200 secs
    PASS: Verify recovery handling cases:
    PASS: - corrupt token
    PASS: - no token present
    PASS: - no token in header
    PASS: Verify token renewal stress soak; every 10 seconds for 24+ hrs
    PASS: - repeat over token get failure cases
    PASS: - in each success and failure case verify no memory leaks.
    PASS: Verify authentication error handling soak
          - every 10-60 secs for 24+ hrs
          - token is corrupted followed by a sysinv request to
            exercise authentication error handling and renewal process.
    PASS: Verify no coredumps.
    PASS: Verify logging and token retry.
    PASS: Verify process continues to use the previous token until a new
          one is acquired.
          - Token Refresh is on time.
          - Token Renew is on event.
    PASS: Verify soak of persistent authentication error / token
          renewal cycle. No memory leak or coredumps.

    Closes-Bug: 2063475
    Change-Id: I5eef62518ac606e6b54323b46fbb6f9475b5c1ef
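
The regression items above mention retry cadences of 5 seconds for the first inline token get and 10, 30 and 1200 seconds for the HTTP refresh, and state that the process keeps using the previous token until a new one is acquired. The Python sketch below shows one way such behaviour could be modelled. It assumes the three refresh values form an escalating schedule whose last value repeats, which the commit message does not state explicitly. `retry_delay` and `TokenHolder` are hypothetical names, not part of mtce.

```python
from typing import Callable, Optional, Sequence

FIRST_GET_RETRY_SECS = (5,)          # first token, fetched inline
REFRESH_RETRY_SECS = (10, 30, 1200)  # refresh over HTTP (assumed escalating)


def retry_delay(failures: int, schedule: Sequence[int] = REFRESH_RETRY_SECS) -> int:
    """Seconds to wait after `failures` consecutive failed attempts.

    The last entry of the schedule repeats once it has been reached.
    """
    index = min(max(failures, 1), len(schedule)) - 1
    return schedule[index]


class TokenHolder:
    """Keeps the previous token in use until a new one is acquired."""

    def __init__(self, fetch: Callable[[], Optional[str]]):
        self._fetch = fetch      # returns a token string, or None on failure
        self.token: Optional[str] = None
        self.failures = 0

    def update(self) -> int:
        """Attempt a fetch; return seconds until the retry (0 on success)."""
        new_token = self._fetch()
        if new_token is None:
            # Failure: leave self.token as it was and schedule a retry.
            self.failures += 1
            return retry_delay(self.failures)
        self.token = new_token
        self.failures = 0
        return 0
```

Passing `FIRST_GET_RETRY_SECS` as the schedule gives a constant 5 second retry for the first token. The same `update` call could be driven by a timer for a scheduled refresh or by an authentication error for an event-driven renewal.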