cert-mon geo-redundancy: watchers are being handled on standby systemcontroller

Bug #2060068 reported by Kyle MacLeod
Affects    Status        Importance  Assigned to   Milestone
StarlingX  Fix Released  Medium      Kyle MacLeod

Bug Description

Brief Description

During geo-redundancy testing we are seeing the cert-mon watchers fire and be handled by the standby system controller.

The watchers are Kubernetes 'watch' entities that fire an event when the cert-mon/cert-manager secrets holding the certificates change. These watches also fire on the standby system controller, which then handles the events. They need to be ignored on the standby.

Severity

Major

Steps to Reproduce

This is a geo-redundant system. The easiest way to see this is to restart the cert-mon service.

Expected Behavior

The cert-mon watch events should be ignored for any subcloud whose deploy-status is not valid for this system controller.

Actual Behavior

The deploy-status is not checked when handling cert-mon watch events. The system controller attempts to handle the DC_CertWatcher events for subclouds that are standby on this system controller.
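
A minimal sketch of the check that the expected behavior implies, not the actual cert-mon code. The function name, the dictionary keys ('availability-status', 'deploy-status') and the set of invalid deploy-status values are assumptions made for illustration.

```python
# Hypothetical sketch: decide whether this system controller should act on a
# watch event for a given subcloud. All names and values here are assumed.

# Assumed set of deploy-status values that mark a subcloud as not owned by
# this system controller (for example, a secondary subcloud).
ASSUMED_INVALID_DEPLOY_STATUSES = frozenset(
    {"secondary", "secondary-failed", "rehome-pending"}
)


def should_handle_watch_event(subcloud: dict) -> bool:
    """Return True only if the subcloud is online and its deploy-status is valid."""
    if subcloud.get("availability-status") != "online":
        # Matches the "subcloud is not online" case seen in the logs.
        return False
    # The missing piece described in this bug: also check the deploy-status.
    return subcloud.get("deploy-status") not in ASSUMED_INVALID_DEPLOY_STATUSES
```

With a check like this, an online subcloud whose deploy-status is 'secondary' would be skipped instead of reaching the update_certificate step.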

Reproducibility

Reproducible

System Configuration

Geo-redundant configuration

Load info (eg: 2022-03-10_20-00-07)

Master load

Last Pass

New

Logs

2024-03-28T19:20:07.952 controller-0 cert-mon: info 3658130 INFO sysinv.cert_mon.watcher [-] DCIntermediateCertRenew check_filter[c336b7f8285f4b909fc980fda57ebfb2]: subcloud is not online
2024-03-28T19:20:07.952 controller-0 cert-mon: info 3658130 INFO sysinv.cert_mon.utils [-] api_cmd http://[2620:10a:a001:df0::2]:8119/v1.0/subclouds/c367e92af29844f28057792a61de17b3
2024-03-28T19:20:08.232 controller-0 cert-mon: info 3658130 INFO sysinv.cert_mon.watcher [-] DCIntermediateCertRenew check_filter[c367e92af29844f28057792a61de17b3]: subcloud is not online
2024-03-28T19:20:08.232 controller-0 cert-mon: info 3658130 INFO sysinv.cert_mon.utils [-] api_cmd http://[2620:10a:a001:df0::2]:8119/v1.0/subclouds/c36e4d851c5b4efd88e0e63de3c74768
2024-03-28T19:20:08.437 controller-0 cert-mon: info 3658130 INFO sysinv.cert_mon.watcher [-] DCIntermediateCertRenew check_filter[c36e4d851c5b4efd88e0e63de3c74768]: subcloud is not online
2024-03-28T19:20:08.438 controller-0 cert-mon: info 3658130 INFO sysinv.cert_mon.utils [-] api_cmd http://[2620:10a:a001:df0::2]:8119/v1.0/subclouds/c36f5dad82934246901108c9d1559dd2
2024-03-28T19:20:08.646 controller-0 cert-mon: info 3658130 INFO sysinv.cert_mon.watcher [-] DCIntermediateCertRenew do_action: action EXISTING (c36f5dad82934246901108c9d1559dd2-adminep-ca-certificate)
hash: ca_crt: 8286e79fb954b8cf9df8ea973458ae89 tls_crt 702b356a74ed896b3f709931dc238fb5 tls_key c5a4e60f7020ea584e00ab4f54404354
created at 2024-03-13 16:29:43 last operation Apply last update at 2024-03-13 16:29:43
2024-03-28T19:20:08.646 controller-0 cert-mon: info 3658130 INFO sysinv.cert_mon.watcher [-] update_certificate: subcloud c36f5dad82934246901108c9d1559dd2 action EXISTING (c36f5dad82934246901108c9d1559dd2-adminep-ca-certificate)
hash: ca_crt: 8286e79fb954b8cf9df8ea973458ae89 tls_crt 702b356a74ed896b3f709931dc238fb5 tls_key c5a4e60f7020ea584e00ab4f54404354
created at 2024-03-13 16:29:43 last operation Apply last update at 2024-03-13 16:29:43
2024-03-28T19:20:09.134 controller-0 cert-mon: info 3658130 INFO sysinv.cert_mon.utils [-] Update c36f5dad82934246901108c9d1559dd2 intermediate CA cert request succeed
2024-03-28T19:20:09.134 controller-0 cert-mon: info 3658130 INFO sysinv.cert_mon.subcloud_audit_queue [-] Enqueued: SubcloudAuditData: {name: c36f5dad82934246901108c9d1559dd2, audit_count: 1}
2024-03-28T19:20:09.135 controller-0 cert-mon: info 3658130 INFO sysinv.cert_mon.utils [-] api_cmd http://[2620:10a:a001:df0::2]:8119/v1.0/subclouds/c38a5a8dbe0343bba7991a9997493688
2024-03-28T19:20:09.336 controller-0 cert-mon: info 3658130 INFO sysinv.cert_mon.watcher [-] DCIntermediateCertRenew check_filter[c38a5a8dbe0343bba7991a9997493688]: subcloud is not online
2024-03-28T19:20:09.337 controller-0 cert-mon: info 3658130 INFO sysinv.cert_mon.utils [-] api_cmd http://[2620:10a:a001:df0::2]:8119/v1.0/subclouds/c3900b6586c749fc9af2e18969971ce2
2024-03-28T19:20:09.439 controller-0 cert-mon: info 3658130 INFO sysinv.cert_mon.certificate_mon_manager [-] Auditing subcloud c31754a0a12040a396b170d65770f6f4, attempt #1 [qsize: 1]
2024-03-28T19:20:09.439 controller-0 cert-mon: info 3658130 INFO sysinv.cert_mon.utils [-] api_cmd http://[2620:10a:a001:df0::2]:8119/v1.0/subclouds/c31754a0a12040a396b170d65770f6f4
2024-03-28T19:20:09.440 controller-0 cert-mon: info 3658130 INFO sysinv.cert_mon.certificate_mon_manager [-] Auditing subcloud c36f5dad82934246901108c9d1559dd2, attempt #1 [qsize: 0]
2024-03-28T19:20:09.440 controller-0 cert-mon: info 3658130 INFO sysinv.cert_mon.utils [-] api_cmd http://[2620:10a:a001:df0::2]:8119/v1.0/subclouds/c36f5dad82934246901108c9d1559dd2
2024-03-28T19:20:09.828 controller-0 cert-mon: info 3658130 INFO sysinv.cert_mon.watcher [-] DCIntermediateCertRenew check_filter[c3900b6586c749fc9af2e18969971ce2]: subcloud is not online

Test Activity

Scale testing

Workaround

n/a

Changed in starlingx:
status: New → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to config (master)

Reviewed: https://review.opendev.org/c/starlingx/config/+/914907
Committed: https://opendev.org/starlingx/config/commit/03443ef16c0c47d15631eb9001b413a3b8ea39fc
Submitter: "Zuul (22348)"
Branch: master

commit 03443ef16c0c47d15631eb9001b413a3b8ea39fc
Author: Kyle MacLeod <email address hidden>
Date: Tue Apr 2 11:52:39 2024 -0400

    Filter cert-mon for geo-redundancy in audit and DC_CertWatcher

    This commit adds a filter for querying all subclouds from dcmanager, to
    account for secondary subclouds that should not be audited by cert-mon
    for this system controller. The filter is performed against a list of
    invalid deploy states that should be considered when querying
    the list of subclouds from dcmanager.

    Likewise, the DC_CertWatcher -> DCIntermediateCertRenew flow must ensure
    that subclouds which are secondary to this system controller are ignored
    by the kubernetes watch in place for the DC intermediate cert renewal
    detection. Subclouds are filtered by the watch based on their online
    state and their deploy-status. A subcloud with invalid deploy state is
    ignored by this system controller.

    Test Cases

    PASS:
    - Trigger audits on service restart. Verify that offline/secondary
      subclouds are excluded.
    - Ensure full daily audit is executed. Verify that all subclouds
      belonging to this system controller are audited. Secondary subclouds
      are not audited.
    - Verify that DC_CertWatcher -> DCIntermediateCertRenew watch fires are
      ignored for offline and/or invalid deploy state

    Closes-Bug: 2060068

    Change-Id: Iffe3d7c76db8d2f17aed0bfebc792af0f9d75ca2
    Signed-off-by: Kyle MacLeod <email address hidden>
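
The audit-side filter described in the commit message could look roughly like the sketch below. It is an illustration only: the helper name, the record layout and the list of invalid deploy states are assumptions, not taken from the merged change.

```python
# Hypothetical sketch of filtering the subcloud list returned by dcmanager
# before queueing audits. Names and state values are assumed.
from typing import Iterable, List

# Assumed deploy states that exclude a subcloud from audits on this
# system controller.
ASSUMED_INVALID_DEPLOY_STATES = ("secondary", "secondary-failed")


def subclouds_to_audit(
    subclouds: Iterable[dict],
    invalid_states: Iterable[str] = ASSUMED_INVALID_DEPLOY_STATES,
) -> List[str]:
    """Return the names of subclouds whose deploy-status is not in invalid_states."""
    excluded = set(invalid_states)
    return [
        sc["name"]
        for sc in subclouds
        if sc.get("deploy-status") not in excluded
    ]
```

For example, given one subcloud with deploy-status 'complete' and one with 'secondary', only the first would be returned for auditing.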

Changed in starlingx:
status: In Progress → Fix Released
Ghada Khalil (gkhalil)
Changed in starlingx:
importance: Undecided → Medium
tags: added: stx.10.0 stx.distcloud
Changed in starlingx:
assignee: nobody → Kyle MacLeod (kmacleod)