Flooding of DC state logs when subcloud kubernetes failed

Bug #1976108 reported by Kyle MacLeod
This bug affects 1 person
Affects: StarlingX
Status: Fix Released
Importance: Medium
Assigned to: Kyle MacLeod
Milestone: (none)

Bug Description

Brief Description

After a Kubernetes root CA upgrade fails, the affected subclouds attempt to audit every 20-30 seconds while in this state, flooding the DC state logs.

Severity

Major: Affects performance of the system controller

Steps to Reproduce

1. Upgrade the Kubernetes root CA.
2. If the upgrade fails, the affected subclouds attempt to audit every 20-30 seconds while in this state.

Expected Behavior

The audit logs should not be flooded with errors.

Actual Behavior

The logs are flooded with errors.

Reproducibility

(1/1)

System Configuration
 (DC1000-2/1000 AWS subclouds)

Load Info
starlingx master

Last Pass
Unknown

Timestamp/Logs

Workaround

OpenStack Infra (hudson-openstack) wrote : Fix proposed to distcloud (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/distcloud/+/843667

Changed in starlingx:
status: New → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to distcloud (master)

Reviewed: https://review.opendev.org/c/starlingx/distcloud/+/843667
Committed: https://opendev.org/starlingx/distcloud/commit/fa76a6415e6b58e90e4e310c500bcc25cb5becbb
Submitter: "Zuul (22348)"
Branch: master

commit fa76a6415e6b58e90e4e310c500bcc25cb5becbb
Author: Kyle MacLeod <email address hidden>
Date: Fri May 27 13:42:12 2022 -0400

    Preserve successful endpoints across subcloud audit failures

    When an audited endpoint raises an exception, we want to
    preserve the successful endpoints that have been audited
    for this subcloud.

    This commit adds exception handling at the endpoint level
    during subcloud audits. If an exception occurs while auditing
    an endpoint, it now affects only that endpoint; others are
    still audited at their own intervals, and the audit result
    is updated accordingly.

    We therefore avoid re-auditing successful endpoints at every
    audit retry interval.

    Note: the root cause in the original code is that audits_done is not
    set when self._audit_subcloud throws an exception. When audits_done
    is not set, any successful endpoints are not updated in the
    db_api.subcloud_audits_end_audit. We fix the issue by pushing the
    exception handling down a level and keeping track of successful
    endpoints.

    Test cases:

    PASS:

    - On a large system having kubernetes upgrade issues, the
      kubernetes endpoint audit is failing with an exception.
      Before this update, the entire subcloud audits are re-run
      every 30s (for all endpoints). After applying this update,
      only the kubernetes endpoint audit is re-tried.

    - Simulate audit exceptions plus multiple audit exceptions
      by introducing fake exception raises during the endpoint
      audits

    Closes-Bug: 1976108

    Change-Id: I71f5de2d94e27205375e81c10cd2fee85c3259f8
    Signed-off-by: Kyle MacLeod <email address hidden>
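
The approach described in the commit message above (handling exceptions per endpoint and keeping track of the endpoints that succeeded) can be sketched roughly as follows. This is a minimal illustration, not the distcloud implementation: the function audit_subcloud_endpoints, the endpoint names, and the simulated failure are all hypothetical, and only the idea of an audits_done list comes from the commit message.

```python
import logging
from typing import Callable, Dict, List, Tuple

LOG = logging.getLogger(__name__)


def audit_subcloud_endpoints(
    endpoint_audits: Dict[str, Callable[[], None]],
) -> Tuple[List[str], List[str]]:
    """Run each endpoint audit independently (hypothetical helper).

    Returns (audits_done, failures). An exception in one endpoint audit
    is caught here, so it does not discard the results of the others.
    """
    audits_done: List[str] = []
    failures: List[str] = []
    for endpoint, audit in endpoint_audits.items():
        try:
            audit()
        except Exception:
            # Only this endpoint is marked as failed and retried later.
            LOG.exception("Audit failed for endpoint %s", endpoint)
            failures.append(endpoint)
        else:
            audits_done.append(endpoint)
    return audits_done, failures


def failing_kubernetes_audit() -> None:
    """Simulate an endpoint audit that raises (illustrative only)."""
    raise RuntimeError("simulated kubernetes audit failure")


if __name__ == "__main__":
    # Endpoint names are illustrative assumptions.
    audits = {
        "patching": lambda: None,
        "kubernetes": failing_kubernetes_audit,
        "firmware": lambda: None,
    }
    done, failed = audit_subcloud_endpoints(audits)
    print("audits_done:", done)
    print("failures:", failed)
```

In this sketch the caller would record audits_done even when failures is non-empty, so only the failed endpoint needs to be re-audited at the next retry interval. Injecting a raising callable, as in the demo, mirrors the "fake exception raises" mentioned in the test cases.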

Changed in starlingx:
status: In Progress → Fix Released
Ghada Khalil (gkhalil)
Changed in starlingx:
importance: Undecided → Medium
tags: added: stx.7.0 stx.distcloud
Changed in starlingx:
assignee: nobody → Kyle MacLeod (kmacleod)