Comment 2 for bug 1951506

Kyle MacLeod (kmacleod) wrote:

This issue results from the fairly complex interaction between cert-mon and dcmanager subcloud states, and the corresponding fault alarm condition when a cert-mon managed SSL certificate is "out-of-sync".
Note that cert-mon differs from other dcmanager audit components in that it purposefully audits subclouds that are unmanaged and/or offline - this is due to the nature of SSL certificates and their need for basic connectivity.

Behaviour

There is a general issue in stx5.0 with the interaction between subcloud offline state and the cert-mon daily audit that can result in the 'dc-cert sync_status is out-of-sync' alarm being raised after a subcloud has gone offline and before it is deleted. In this case the alarm is never cleaned up.

The order of events is:

    1. The subcloud Intermediate CA cert is invalid (the reason is unknown; I am assuming this is a customer issue with the cert validity)
    2. The subcloud is unmanaged.
    3. The subcloud goes offline (reason unknown). But it either stays up, or restarts, because at some point it resumes responding to cert-mon audits.
    4. When the subcloud goes offline, dcmanager clears all alarms for the subcloud (see: https://github.com/starlingx/distcloud/blob/a4b30db3cc01c12da4c1dcb24ba7c8afc751c682/distributedcloud/dcmanager/manager/subcloud_manager.py#L1514), and raises a new alarm for 'dran0306 is offline'
    5. The cert-mon daily audit executes every 24 hours, and includes all subclouds (including both offline and unmanaged). During this audit, the subcloud ICA cert validation fails. cert-mon notifies dcmanager via RPC call. dcmanager raises the 'dc-cert sync_status is out-of-sync' alarm. From the log file, this happens repeatedly due to cert-mon audit behaviour in stx5.0 (the subcloud is requeued for auditing before the failure path).
    6. The unmanaged/offline subcloud is deleted. The '<subcloud> is offline' alarm is cleared by dcmanager. However, the 'dc-cert sync_status is out-of-sync' alarm is not cleared (since dcmanager does not know about this alarm).
    7. The cert-mon watcher retry mechanism keeps attempting to monitor the subcloud, even though it is deleted. This is only noise: it has no effect other than cluttering the logs.
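
The sequence above can be replayed with a small in-memory model. This is only an illustrative sketch: the class `DcManagerModel`, its method names, and the alarm strings are hypothetical stand-ins, not the real dcmanager or fault management interfaces.

```python
class DcManagerModel:
    """Toy model of the stx5.0 behaviour described above (hypothetical names)."""

    def __init__(self):
        self.alarms = set()      # active alarms as (subcloud, alarm) pairs
        self.subclouds = set()

    def add_subcloud(self, name):
        self.subclouds.add(name)

    def go_offline(self, name):
        # Step 4: the offline transition clears every alarm, then raises 'offline'.
        self.alarms = {a for a in self.alarms if a[0] != name}
        self.alarms.add((name, "offline"))

    def cert_audit_failed(self, name):
        # Step 5: cert-mon reports an ICA validation failure via RPC.
        self.alarms.add((name, "dc-cert out-of-sync"))

    def delete_subcloud(self, name):
        # Step 6: only the alarm that the delete path knows about is cleared.
        self.alarms.discard((name, "offline"))
        self.subclouds.discard(name)


dc = DcManagerModel()
dc.add_subcloud("dran0306")
dc.go_offline("dran0306")
dc.cert_audit_failed("dran0306")   # the subcloud answered the daily audit
dc.delete_subcloud("dran0306")
leaked = sorted(dc.alarms)
print(leaked)
```

In this model the out-of-sync alarm survives the delete, which matches the stale alarm seen in the field.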

Behaviour Summary

These two conditions must be present to see this issue:

    1. Subcloud has an invalid Intermediate CA certificate
        Reason: there is no alarm unless the ICA is invalid
    2. Subcloud connectivity resumes after going offline, before deletion
        Reason: the offline transition clears all alarms. The cert-mon 24-hour audit re-raises the out-of-sync alarm only if subcloud connectivity is present.

Fix

1) The ideal fix is to extend the fault management API to clear all alarms for a given resource tree (i.e. a subcloud). Since a subcloud is being deleted, this API would clean out any and all alarms for a subcloud regardless of underlying state.
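
A rough sketch of what such a resource-tree clear could look like follows. No such fault management call exists today, so the function names are hypothetical, and the dotted `subcloud=<name>.resource=<endpoint>` entity ID layout is an assumption made for illustration. The point of the sketch is the matching rule: a plain string-prefix test would also hit a different subcloud whose name merely starts with the same characters.

```python
def in_resource_tree(entity_instance_id, root):
    """True if the entity is the root itself or a dotted descendant of it."""
    return entity_instance_id == root or entity_instance_id.startswith(root + ".")


def clear_resource_tree(active_alarms, root):
    """Return the alarms that remain after clearing everything under root."""
    return [
        alarm for alarm in active_alarms
        if not in_resource_tree(alarm["entity_instance_id"], root)
    ]


# Assumed entity ID layout; the third entry belongs to a different subcloud.
active = [
    {"alarm": "offline", "entity_instance_id": "subcloud=dran0306"},
    {"alarm": "dc-cert out-of-sync",
     "entity_instance_id": "subcloud=dran0306.resource=dc-cert"},
    {"alarm": "offline", "entity_instance_id": "subcloud=dran03061"},
]
remaining = clear_resource_tree(active, "subcloud=dran0306")
print(remaining)
```

Only the alarm for the unrelated subcloud remains, regardless of which component raised the cleared alarms.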

2) Until such an API exists, we add a call to set all subcloud audit endpoints to 'unknown' status during the delete process. This has the effect of cleaning up the cert-mon state, and also clears the alarm associated with the dc-cert endpoint.
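
The interim fix can be sketched as follows, assuming that an endpoint status update clears the out-of-sync alarm whenever the new status is anything other than 'out-of-sync'. The names `SubcloudState`, `update_endpoint_status`, and `delete_subcloud` are hypothetical here, and the endpoint tuple is an illustrative subset rather than the real list.

```python
ENDPOINTS = ("dc-cert", "platform")  # illustrative subset, not the real list


class SubcloudState:
    """Toy per-subcloud state (hypothetical, not the dcmanager data model)."""

    def __init__(self, name):
        self.name = name
        self.endpoint_status = {ep: "unknown" for ep in ENDPOINTS}
        self.alarms = set()

    def update_endpoint_status(self, endpoint, status):
        """Record the status; raise the alarm on out-of-sync, clear it otherwise."""
        self.endpoint_status[endpoint] = status
        alarm = f"{endpoint} sync_status is out-of-sync"
        if status == "out-of-sync":
            self.alarms.add(alarm)
        else:
            self.alarms.discard(alarm)


def delete_subcloud(subcloud):
    # Interim fix: reset every audited endpoint before removing the subcloud.
    for endpoint in ENDPOINTS:
        subcloud.update_endpoint_status(endpoint, "unknown")
    return subcloud.alarms


sc = SubcloudState("dran0306")
sc.update_endpoint_status("dc-cert", "out-of-sync")
remaining = delete_subcloud(sc)
print(remaining)
```

Because the reset goes through the same status update path that raised the alarm, the delete path does not need to know which alarms exist.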