Stale dc-cert alarm for an already unmanaged and deleted subcloud

Bug #1951506 reported by Ghada Khalil
This bug affects 1 person
Affects: StarlingX
Status: Fix Released
Importance: Low
Assigned to: Kyle MacLeod
Milestone:

Bug Description

Brief Description
-----------------
After a subcloud was unmanaged and deleted, a dc-cert out-of-sync alarm (280.002) continued to be reported. The expected behavior is that no alarms related to deleted subclouds remain, since a stale alarm is confusing for an operator.

Severity
--------
Minor

Steps to Reproduce
------------------
- dcmanager subcloud unmanage
- dcmanager subcloud delete
- Check alarms

Expected Behavior
------------------
Alarms related to the deleted subcloud are removed

Actual Behavior
----------------
A dc-cert out-of-sync alarm is still present

Reproducibility
---------------
Intermittent

System Configuration
--------------------
Distributed Cloud

Branch/Pull Time/Commit
-----------------------
stx.5.0; not sure if issue also exists in stx master

Last Pass
---------
Unknown

Timestamp/Logs
--------------
cert-mon.log.* shows the following errors related to the subcloud even after a deletion:
2021-10-02T00:03:02.000 controller-0 cert-mon: info 187857 INFO sysinv.cert_mon.watcher [-] action ADDED (subcloud1-adminep-ca-certificate)
2021-10-02T00:03:02.000 controller-0 cert-mon: info 187857 INFO sysinv.cert_mon.watcher [-] subcloud subcloud1 action ADDED (subcloud1-adminep-ca-certificate)
2021-10-02T00:03:02.000 controller-0 cert-mon: info 187857 INFO sysinv.cert_mon.watcher [-] subcloud subcloud1 action ADDED (subcloud1-adminep-ca-certificate)
2021-10-02T00:03:03.000 controller-0 cert-mon: err 187857 ERROR sysinv.cert_mon.watcher [-] 13 Reattempt failed Cannot find sysinv endpoint for subcloud1. action ADDED (subcloud1-adminep-ca-certificate)
2021-10-02T00:03:03.000 controller-0 cert-mon: err 187857 ERROR sysinv.cert_mon.watcher [-] 13 Reattempt failed Cannot find sysinv endpoint for subcloud1. action ADDED (subcloud1-adminep-ca-certificate)
2021-10-02T00:03:03.000 controller-0 cert-mon: err 187857 ERROR sysinv.cert_mon.watcher [-] Cannot find sysinv endpoint for subcloud1: Exception: Cannot find sysinv endpoint for subcloud1
2021-10-02T00:03:03.000 controller-0 cert-mon: err 187857 ERROR sysinv.cert_mon.watcher [-] Cannot find sysinv endpoint for subcloud1: Exception: Cannot find sysinv endpoint for subcloud1
2021-10-02 00:03:03.292 187857 ERROR sysinv.cert_mon.watcher Exception: Cannot find sysinv endpoint for subcloud1
2021-10-02T00:03:03.000 controller-0 cert-mon: info 187857 INFO sysinv.cert_mon.watcher [-] Attempt to update intermediate CA cert for subcloud1 has failed

Test Activity
-------------
Testing

Workaround
----------
Manually delete the alarm

Ghada Khalil (gkhalil)
Changed in starlingx:
assignee: nobody → Kyle MacLeod (kmacleod)
importance: Undecided → Low
status: New → Triaged
tags: added: stx.distcloud
description: updated
Ghada Khalil (gkhalil)
description: updated
OpenStack Infra (hudson-openstack) wrote : Fix proposed to distcloud (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/distcloud/+/820534

Changed in starlingx:
status: Triaged → In Progress
Kyle MacLeod (kmacleod) wrote:

This issue results from the fairly complex interaction between cert-mon and dcmanager subcloud states, and the corresponding fault alarm condition when a cert-mon managed SSL certificate is "out-of-sync".
Note that cert-mon differs from other dcmanager audit components in that it purposefully audits subclouds that are unmanaged and/or offline - this is due to the nature of SSL certificates and their need for basic connectivity.
Behaviour

There is a general issue in stx5.0 with the interaction between subcloud offline state and the cert-mon daily audit that can result in the 'dc-cert sync_status is out-of-sync' alarm being raised after a subcloud has gone offline and before it is deleted. In this case the alarm is never cleaned up.

The order of events is:

    1. The subcloud Intermediate CA cert is invalid (the cause is unknown; I am assuming this is a customer issue with the cert validity).
    2. The subcloud is unmanaged.
    3. The subcloud goes offline (reason unknown). But it either stays up, or restarts, because at some point it resumes responding to cert-mon audits.
    4. When the subcloud goes offline, dcmanager clears all alarms for the subcloud (see: https://github.com/starlingx/distcloud/blob/a4b30db3cc01c12da4c1dcb24ba7c8afc751c682/distributedcloud/dcmanager/manager/subcloud_manager.py#L1514) and raises a new alarm for 'dran0306 is offline'.
    5. The cert-mon daily audit executes every 24 hours, and includes all subclouds (including both offline and unmanaged). During this audit, the subcloud ICA cert validation fails. cert-mon notifies dcmanager via RPC call. dcmanager raises the 'dc-cert sync_status is out-of-sync' alarm. From the log file, this happens repeatedly due to cert-mon audit behaviour in stx5.0 (the subcloud is requeued for auditing before the failure path).
    6. The unmanaged/offline subcloud is deleted. The '<subcloud> is offline' alarm is cleared by dcmanager. However, the 'dc-cert sync_status is out-of-sync' alarm is not cleared (since dcmanager does not know about this alarm).
    7. The cert-mon watcher retry mechanism keeps attempting to monitor the subcloud, even though it is deleted. This is noise - nothing really happens with this other than it is confusing in the logs.
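
The sequence in steps 4 to 6 can be replayed with a small toy model. This is only an illustrative sketch: ToyAlarmStore, replay_stx5_sequence and the alarm labels are hypothetical names invented here and are not part of dcmanager, cert-mon or the fault management API. It only shows why one alarm survives the delete.

```python
from dataclasses import dataclass, field

# Hypothetical labels for the two alarms discussed above.
OFFLINE = "offline"
DC_CERT_OUT_OF_SYNC = "dc-cert out-of-sync"


@dataclass
class ToyAlarmStore:
    """Hypothetical in-memory stand-in for the fault manager."""
    alarms: set = field(default_factory=set)

    def raise_alarm(self, subcloud, kind):
        self.alarms.add((subcloud, kind))

    def clear_alarm(self, subcloud, kind):
        self.alarms.discard((subcloud, kind))

    def clear_all(self, subcloud):
        self.alarms = {a for a in self.alarms if a[0] != subcloud}


def replay_stx5_sequence(store, subcloud):
    # Step 4: the offline transition clears every alarm for the
    # subcloud, then raises the 'offline' alarm.
    store.clear_all(subcloud)
    store.raise_alarm(subcloud, OFFLINE)
    # Step 5: the daily cert-mon audit reaches the subcloud again and
    # the ICA validation fails, so the out-of-sync alarm is raised.
    store.raise_alarm(subcloud, DC_CERT_OUT_OF_SYNC)
    # Step 6: the delete path only clears the 'offline' alarm.
    store.clear_alarm(subcloud, OFFLINE)
    return store.alarms


store = ToyAlarmStore()
leftover = replay_stx5_sequence(store, "subcloud1")
print(leftover)
```

In this model the only alarm left after the delete is the dc-cert out-of-sync one, which matches the stale alarm described in this report.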

Behaviour Summary

These two conditions must be present to see this issue:

    1. Subcloud has an invalid Intermediate CA certificate
        Reason: there is no alarm unless the ICA is invalid
    2. Subcloud connectivity resumes after going offline, before deletion
        Reason: the offline transition clears all alarms. The cert-mon 24-hour audit re-raises the out-of-sync alarm only if subcloud connectivity is present.

Fix

1) The ideal fix is to extend the fault management API to clear all alarms for a given resource tree (i.e. a subcloud). Since a subcloud is being deleted, this API would clean out any and all alarms for a subcloud regardless of underlying state.

2) Until such an API exists, we add a call to set all subcloud audit endpoints to 'unknown' status during the delete process. This has the effect of cleaning up the cert-mon state, and also clears the alarm associated with the dc-cert endpoint.
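
A minimal sketch of option 2, assuming a simplified model in which an alarm exists for an endpoint exactly while its status is 'out-of-sync'. ToySubcloud, set_endpoint_status and delete_subcloud are hypothetical names used for illustration only, and the endpoint list is limited to the two endpoints mentioned in this report; the actual change is in the linked review.

```python
UNKNOWN = "unknown"
OUT_OF_SYNC = "out-of-sync"


class ToySubcloud:
    """Hypothetical subcloud record holding per-endpoint sync status."""

    def __init__(self, name, endpoints):
        self.name = name
        self.sync_status = dict(endpoints)


def set_endpoint_status(alarms, subcloud, endpoint, status):
    # Assumption: an alarm is present only while the endpoint is
    # out-of-sync; any other status (including 'unknown') clears it.
    subcloud.sync_status[endpoint] = status
    key = (subcloud.name, endpoint)
    if status == OUT_OF_SYNC:
        alarms.add(key)
    else:
        alarms.discard(key)


def delete_subcloud(alarms, subclouds, name):
    subcloud = subclouds[name]
    # Reset every audit endpoint to 'unknown' before removing the
    # record, so endpoint alarms do not outlive the subcloud.
    for endpoint in list(subcloud.sync_status):
        set_endpoint_status(alarms, subcloud, endpoint, UNKNOWN)
    del subclouds[name]


alarms = set()
subclouds = {
    "subcloud1": ToySubcloud(
        "subcloud1", {"dc-cert": UNKNOWN, "patching": UNKNOWN}
    )
}
set_endpoint_status(alarms, subclouds["subcloud1"], "dc-cert", OUT_OF_SYNC)
delete_subcloud(alarms, subclouds, "subcloud1")
print(alarms, subclouds)
```

The point of resetting the status rather than deleting alarms directly is that it reuses the existing status-to-alarm path, so no new fault management API is needed.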

OpenStack Infra (hudson-openstack) wrote : Fix merged to distcloud (master)

Reviewed: https://review.opendev.org/c/starlingx/distcloud/+/820534
Committed: https://opendev.org/starlingx/distcloud/commit/2beccc63e1157ffa603b806d41a1dc8811977c95
Submitter: "Zuul (22348)"
Branch: master

commit 2beccc63e1157ffa603b806d41a1dc8811977c95
Author: Kyle MacLeod <email address hidden>
Date: Fri Dec 3 16:25:45 2021 -0500

    Fix: Ensure dc-cert alarm is cleared on subcloud delete

    This fix updates the subcloud delete operation to clear both the
    'offline' and dc-cert 'out-of-sync' alarms before deleting a
    subcloud.

    This handles the case where cert-mon regains connectivity to
    an offline subcloud with underlying SSL cert issues
    before the subcloud is deleted.

    Other minor changes:
    - fixed indentation in _prepare_for_deployment
    - rename _update_online_managed_subcloud to
      _do_update_subcloud_endpoint_status since this method also
      operates on offline/unmanaged subclouds (for dc-cert case)
    - clarify ordering for a complex 'if' statement updating
      subcloud sync status, and update local documentation
      covering how the rules are applied

    Test Plan:

    PASS:
    - Verify fault behaviour for the following:
        - Manage a new subcloud.
        - Unmanage a subcloud. Re-manage it.
        - Attempt to delete an online subcloud (not allowed)
    - Verify subcloud manage/unmanage/delete with a non dc-cert
      endpoint out-of-sync (used patching status)
    - Delete subcloud with all dc endpoints 'unknown' (default case)
    - Delete subcloud with dc-cert endpoint 'out-of-sync'
        - verify that the subcloud out-of-sync alarm is cleared

    Closes-Bug: 1951506

    Change-Id: I082cc85aeb2d85febdcb82d7bac87b046f1dd5a4
    Signed-off-by: Kyle MacLeod <email address hidden>

Changed in starlingx:
status: In Progress → Fix Released
Ghada Khalil (gkhalil)
tags: added: stx.7.0