StarlingX

Flooding of DC state logs when subcloud kubernetes failed

Bug #1976108 reported by Kyle MacLeod on 2022-05-27

This bug affects 1 person

Affects     Status         Importance   Assigned to    Milestone
StarlingX   Fix Released   Medium       Kyle MacLeod

Bug Description

Brief Description

After a Kubernetes root CA upgrade fails, audits of the affected subclouds are attempted every 20-30 seconds for as long as the subclouds remain in this state.

Severity

Major: Affects performance of the system controller

Steps to Reproduce

1. Upgrade the Kubernetes root CA.
2. If the upgrade fails, audits of the affected subclouds are attempted every 20-30 seconds while they remain in this state.

Expected Behavior

The audit logs should not be flooded with errors.

Actual Behavior

The logs are flooded with errors.

Reproducibility

(1/1)

System Configuration
(DC1000-2/1000 AWS subclouds)

Load Info
starlingx master

Last Pass
Unknown

Timestamp/Logs

Workaround

Tags:

OpenStack Infra (hudson-openstack) wrote on 2022-05-27: Fix proposed to distcloud (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/distcloud/+/843667

Changed in starlingx:
status:	New → In Progress

OpenStack Infra (hudson-openstack) wrote on 2022-06-08: Fix merged to distcloud (master)

Reviewed: https://review.opendev.org/c/starlingx/distcloud/+/843667
Committed: https://opendev.org/starlingx/distcloud/commit/fa76a6415e6b58e90e4e310c500bcc25cb5becbb
Submitter: "Zuul (22348)"
Branch: master

commit fa76a6415e6b58e90e4e310c500bcc25cb5becbb
Author: Kyle MacLeod <email address hidden>
Date: Fri May 27 13:42:12 2022 -0400

Preserve successful endpoints across subcloud audit failures

    When an audited endpoint raises exception, we want to
    preserve the successful endpoints that have been audited
    for this subcloud.

    This commit adds exception handling at the endpoint level
    during subcloud audits. If an exception occurs auditing
    an endpoint it now affects only that endpoint; others are
    still audited at their own intervals, and the audit result
    is updated accordingly.

    We therefore avoid re-auditing successful endpoints at every
    audit retry interval.

    Note: the root cause in the original code is that audits_done is not
    set when self._audit_subcloud throws an exception. When audits_done
    is not set, any successful endpoints are not updated in the
    db_api.subcloud_audits_end_audit. We fix the issue by pushing the
    exception handling down a level and keeping track of successful
    endpoints.

    Test cases:

    PASS:

    - On a large system having kubernetes upgrade issues, the
      kubernetes endpoint audit is failing with an exception.
      Before this update, the entire subcloud audits are re-run
      every 30s (for all endpoints). After applying this update,
      only the kubernetes endpoint audit is re-tried.

    - Simulate audit exceptions plus multiple audit exceptions
      by introducing fake exception raises during the endpoint
      audits

Closes-Bug: 1976108

Change-Id: I71f5de2d94e27205375e81c10cd2fee85c3259f8
Signed-off-by: Kyle MacLeod <email address hidden>
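
The approach described in the commit message can be illustrated with a minimal sketch. This is not the distcloud code: only the name audits_done comes from the commit message, while audit_subcloud, the endpoint names, and the fake audit functions are hypothetical and exist for illustration only. The sketch assumes each endpoint audit is a callable that raises on failure, and it catches the exception per endpoint so that successful endpoints are still recorded.

```python
from typing import Callable, Dict, List, Tuple


def audit_subcloud(
    endpoint_audits: Dict[str, Callable[[], None]],
) -> Tuple[List[str], List[str]]:
    """Run every endpoint audit, isolating failures per endpoint.

    Returns (audits_done, failed_endpoints). Hypothetical helper,
    not part of distcloud.
    """
    audits_done: List[str] = []
    failed_endpoints: List[str] = []
    for endpoint, run_audit in endpoint_audits.items():
        try:
            run_audit()
        except Exception:
            # The failure affects only this endpoint; keep going.
            failed_endpoints.append(endpoint)
        else:
            # Recorded even if another endpoint fails later.
            audits_done.append(endpoint)
    return audits_done, failed_endpoints


def fake_ok_audit() -> None:
    """Simulated endpoint audit that succeeds."""


def fake_failing_audit() -> None:
    """Simulated endpoint audit that raises, as in the test cases."""
    raise RuntimeError("simulated endpoint audit failure")


if __name__ == "__main__":
    done, failed = audit_subcloud(
        {
            "patching": fake_ok_audit,
            "kubernetes": fake_failing_audit,
            "firmware": fake_ok_audit,
        }
    )
    print("audits_done:", done)
    print("failed:", failed)
```

With the try/except wrapped around the whole loop instead, a single failing endpoint would leave audits_done unset, which matches the root cause noted in the commit message.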

Changed in starlingx:
status:	In Progress → Fix Released

Ghada Khalil (gkhalil) on 2022-06-28

Changed in starlingx:
importance:	Undecided → Medium
tags:	added: stx.7.0 stx.distcloud
Changed in starlingx:
assignee:	nobody → Kyle MacLeod (kmacleod)
