StarlingX

Bug #1976108
Comment #2

OpenStack Infra (hudson-openstack) wrote on 2022-06-08: Fix merged to distcloud (master)

Reviewed: https://review.opendev.org/c/starlingx/distcloud/+/843667
Committed: https://opendev.org/starlingx/distcloud/commit/fa76a6415e6b58e90e4e310c500bcc25cb5becbb
Submitter: "Zuul (22348)"
Branch: master

commit fa76a6415e6b58e90e4e310c500bcc25cb5becbb
Author: Kyle MacLeod <email address hidden>
Date: Fri May 27 13:42:12 2022 -0400

Preserve successful endpoints across subcloud audit failures

    When auditing an endpoint raises an exception, we want to
    preserve the endpoints that have already been audited
    successfully for this subcloud.

    This commit adds exception handling at the endpoint level
    during subcloud audits. If an exception occurs while auditing
    an endpoint, it now affects only that endpoint; the others are
    still audited at their own intervals, and the audit result
    is updated accordingly.

    We therefore avoid re-auditing successful endpoints at every
    audit retry interval.

    Note: the root cause in the original code is that audits_done is
    not set when self._audit_subcloud throws an exception. When
    audits_done is not set, the successful endpoints are not updated
    in the db_api.subcloud_audits_end_audit call. We fix the issue by
    pushing the exception handling down a level and keeping track of
    the successful endpoints.
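
The idea in the note above can be illustrated with a minimal sketch. This is not the code from the commit: only the name audits_done comes from the commit message, while audit_endpoints, endpoint_audits and the endpoint names are hypothetical stand-ins chosen for illustration. Each endpoint audit runs inside its own try/except, so one failure no longer discards the endpoints that succeeded.

```python
import logging

LOG = logging.getLogger(__name__)


def audit_endpoints(subcloud_name, endpoint_audits):
    """Run each endpoint audit independently (illustrative sketch only).

    endpoint_audits maps an endpoint name to a zero-argument callable;
    these are hypothetical stand-ins for the real per-endpoint audits.
    Returns (audits_done, failures).
    """
    audits_done = []
    failures = []
    for endpoint, run_audit in endpoint_audits.items():
        try:
            run_audit()
        except Exception:
            # The failure is contained to this endpoint: log it and
            # continue with the remaining endpoints.
            LOG.exception(
                "%s audit failed for subcloud %s", endpoint, subcloud_name
            )
            failures.append(endpoint)
        else:
            # Track the success so it can still be recorded even if a
            # later endpoint raises.
            audits_done.append(endpoint)
    return audits_done, failures


def _failing_audit():
    raise RuntimeError("simulated kubernetes audit failure")


done, failed = audit_endpoints(
    "subcloud1",
    {
        "patching": lambda: None,
        "kubernetes": _failing_audit,
        "firmware": lambda: None,
    },
)
print(done)    # ['patching', 'firmware']
print(failed)  # ['kubernetes']
```

In this sketch the caller would pass audits_done on to whatever records the end of the audit, so that successful endpoints are persisted regardless of failures elsewhere.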

Test cases:

PASS:

    - On a large system with kubernetes upgrade issues, the
      kubernetes endpoint audit fails with an exception.
      Before this update, the entire subcloud audit was re-run
      every 30s (for all endpoints). After applying this update,
      only the kubernetes endpoint audit is retried.

    - Simulate a single audit exception as well as multiple audit
      exceptions by raising fake exceptions during the endpoint
      audits.
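
One way to reproduce the behaviour described in these test cases is to inject fake exceptions with unittest.mock. The sketch below is an assumption about how such a simulation could look, not the test code used for this change: the audits mapping, the pending set and the retry loop are all hypothetical, and each loop iteration stands in for one audit retry interval.

```python
from unittest import mock

# Hypothetical endpoint audits. The kubernetes audit raises a fake
# exception twice and then succeeds; the patching audit always succeeds.
audits = {
    "patching": mock.Mock(return_value=None),
    "kubernetes": mock.Mock(
        side_effect=[RuntimeError("fake"), RuntimeError("fake"), None]
    ),
}

pending = set(audits)
rounds = 0
while pending:
    rounds += 1  # one iteration stands in for one retry interval
    for endpoint in sorted(pending):
        try:
            audits[endpoint]()
        except Exception:
            continue  # leave only this endpoint pending for the retry
        pending.discard(endpoint)

print(rounds)                           # 3
print(audits["patching"].call_count)    # 1
print(audits["kubernetes"].call_count)  # 3
```

The call counts show the intended effect: the successful endpoint is audited once, and only the failing endpoint is retried.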

Closes-Bug: 1976108

Change-Id: I71f5de2d94e27205375e81c10cd2fee85c3259f8
Signed-off-by: Kyle MacLeod <email address hidden>