Preserve successful endpoints across subcloud audit failures
When an audited endpoint raises an exception, we want to
preserve the successful endpoints that have been audited
for this subcloud.
This commit adds exception handling at the endpoint level
during subcloud audits. If an exception occurs while auditing
an endpoint, it now affects only that endpoint; the other
endpoints are still audited at their own intervals, and the
audit result is updated accordingly.
We therefore avoid re-auditing successful endpoints at every
audit retry interval.
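A minimal sketch of the scheduling effect, assuming each endpoint
has its own audit interval and a last-successful-audit timestamp.
endpoints_due, run_audit_cycle, the endpoint names and the 300s
interval are hypothetical and not taken from distcloud. A failed
endpoint keeps its old timestamp, so it stays due at the next retry,
while the successful endpoints wait for their own interval:

```python
def endpoints_due(last_audit, intervals, now):
    """Return the endpoints whose audit interval has elapsed (hypothetical)."""
    # Endpoints with no successful audit yet (missing key) are always due.
    return sorted(
        ep for ep, interval in intervals.items()
        if now - last_audit.get(ep, float("-inf")) >= interval
    )


def run_audit_cycle(auditors, last_audit, intervals, now):
    """Audit only the due endpoints; advance the timestamp of each success."""
    for endpoint in endpoints_due(last_audit, intervals, now):
        try:
            auditors[endpoint]()
        except Exception:
            # The failed endpoint keeps its old timestamp, so it stays due
            # and is retried next cycle; other endpoints are unaffected.
            continue
        last_audit[endpoint] = now


def _kube_audit():
    # Simulated failure, standing in for the kubernetes upgrade issue.
    raise RuntimeError("simulated kubernetes audit failure")


auditors = {"patching": lambda: None, "kubernetes": _kube_audit,
            "load": lambda: None}
intervals = {ep: 300 for ep in auditors}  # assumed 300s per-endpoint interval
last_audit = {}

run_audit_cycle(auditors, last_audit, intervals, now=0)
print(endpoints_due(last_audit, intervals, now=30))  # ['kubernetes']
```

With the kubernetes audit failing at t=0, only kubernetes is due at
the 30s retry; patching and load are not audited again until their
own interval elapses.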
Note: the root cause in the original code is that audits_done is not
set when self._audit_subcloud throws an exception. When audits_done
is not set, the successful endpoints are not recorded by
db_api.subcloud_audits_end_audit. We fix the issue by pushing the
exception handling down a level and keeping track of successful
endpoints.
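The root cause can be illustrated with a simplified before/after
sketch. This is not the actual distcloud code: old_audit, new_audit
and the dict-based db are hypothetical, and end_audit only stands in
for db_api.subcloud_audits_end_audit, whose real signature is
assumed here:

```python
def end_audit(db, subcloud_name, audits_done):
    """Stand-in for db_api.subcloud_audits_end_audit (signature assumed):
    records the endpoints whose audits completed."""
    db.setdefault(subcloud_name, set()).update(audits_done)


def old_audit(db, subcloud_name, auditors):
    """Pre-fix shape: one try/except around the whole subcloud audit."""
    audits_done = []
    try:
        for audit in auditors.values():
            audit()
        # Only reached if no endpoint audit raised.
        audits_done = list(auditors)
    except Exception:
        pass  # audits_done is still empty, so earlier successes are lost
    end_audit(db, subcloud_name, audits_done)


def new_audit(db, subcloud_name, auditors):
    """Post-fix shape: exception handling pushed down to each endpoint."""
    audits_done = []
    for endpoint, audit in auditors.items():
        try:
            audit()
        except Exception:
            continue  # only this endpoint goes unrecorded
        audits_done.append(endpoint)
    end_audit(db, subcloud_name, audits_done)


def _kube_audit():
    raise RuntimeError("simulated kubernetes audit failure")


auditors = {"patching": lambda: None, "kubernetes": _kube_audit,
            "load": lambda: None}
old_db, new_db = {}, {}
old_audit(old_db, "subcloud1", auditors)
new_audit(new_db, "subcloud1", auditors)
print(old_db)  # {'subcloud1': set()}
print(new_db["subcloud1"] == {"patching", "load"})  # True
```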
Test cases:
PASS:
- On a large system with kubernetes upgrade issues, the
kubernetes endpoint audit fails with an exception.
Before this update, all endpoint audits for the subcloud
were re-run every 30s. After applying this update, only
the kubernetes endpoint audit is retried.
- Simulate single and multiple audit exceptions by
injecting fake exceptions during the endpoint audits.
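The second test case can be approximated in a unit test with
unittest.mock, where Mock(side_effect=exc) raises exc when called.
The audit_endpoints loop and the endpoint names below are
hypothetical stand-ins for the real endpoint audits, not
distcloud's test suite:

```python
import unittest
from unittest import mock


def audit_endpoints(auditors):
    """Hypothetical per-endpoint audit loop; returns endpoints that succeeded."""
    done = []
    for endpoint, audit in auditors.items():
        try:
            audit()
        except Exception:
            continue
        done.append(endpoint)
    return done


class EndpointAuditFaultInjectionTest(unittest.TestCase):
    def _auditors(self, failing):
        # Endpoints in `failing` get a Mock that raises when called.
        return {ep: mock.Mock(side_effect=RuntimeError(ep)) if ep in failing
                else mock.Mock(return_value=None)
                for ep in ("patching", "kubernetes", "load", "firmware")}

    def test_single_failure(self):
        auditors = self._auditors({"kubernetes"})
        self.assertEqual(audit_endpoints(auditors),
                         ["patching", "load", "firmware"])
        # Every endpoint was still attempted exactly once.
        for audit in auditors.values():
            audit.assert_called_once_with()

    def test_multiple_failures(self):
        auditors = self._auditors({"kubernetes", "firmware"})
        self.assertEqual(audit_endpoints(auditors), ["patching", "load"])
```

Run it with python -m unittest against the module containing it.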
Reviewed: https://review.opendev.org/c/starlingx/distcloud/+/843667
Committed: https://opendev.org/starlingx/distcloud/commit/fa76a6415e6b58e90e4e310c500bcc25cb5becbb
Submitter: "Zuul (22348)"
Branch: master
commit fa76a6415e6b58e90e4e310c500bcc25cb5becbb
Author: Kyle MacLeod <email address hidden>
Date: Fri May 27 13:42:12 2022 -0400
Closes-Bug: 1976108
Change-Id: I71f5de2d94e27205375e81c10cd2fee85c3259f8
Signed-off-by: Kyle MacLeod <email address hidden>