cert-mon: incorrect exception handling on failure to obtain subcloud sysinv URL
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
StarlingX | Fix Released | Low | Kyle MacLeod | | |
Bug Description
Brief Description
-----------------
When bringing a subcloud online, if the DC system-controller cannot talk to the subcloud sysinv subsystem, then cert-mon fails to audit the subcloud, and the normal retry mechanism is skipped. The subcloud will stay out-of-sync until either the subcloud is managed, or the next cert-mon daily audit runs.
Severity
--------
Minor
Steps to Reproduce
-----------------
Importing 50 subclouds
Expected Behavior
-----------------
After the subcloud comes online, cert-mon audits the subcloud.
In this case, for some reason after the subcloud comes online, the sysinv API is not yet ready.
The subcloud comes online, but when the cert-mon audit starts it cannot pull the remote sysinv URL from the subcloud (a separate issue, but exposes this one).
Any failure in the remote subcloud sysinv API should cause cert-mon to retry the subcloud audit several times before giving up. A retry would likely succeed in this case.
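The expected retry behavior can be sketched roughly as below. This is only an illustration, not the cert-mon code: `audit_subclouds`, `SubcloudAuditError`, `flaky_audit` and the `MAX_ATTEMPTS` value are all hypothetical names and values chosen for the example, and the sketch omits the delay between attempts that a real retry mechanism would presumably apply.

```python
from collections import deque

MAX_ATTEMPTS = 3  # assumed retry limit; the real value is not given in this report


class SubcloudAuditError(Exception):
    """Hypothetical error for any failure while talking to a subcloud."""


def audit_subclouds(names, audit_fn, max_attempts=MAX_ATTEMPTS):
    """Audit each subcloud, requeueing failures up to max_attempts.

    Returns a (synced, abandoned) pair of lists of subcloud names.
    """
    queue = deque((name, 1) for name in names)
    synced, abandoned = [], []
    while queue:
        name, attempt = queue.popleft()
        try:
            audit_fn(name)
        except Exception as exc:
            # Catch broadly so that no failure can bypass the retry path.
            if attempt < max_attempts:
                print(f"{name}: audit failed ({exc}), re-attempt {attempt + 1}")
                queue.append((name, attempt + 1))
            else:
                print(f"{name}: giving up after {attempt} attempts")
                abandoned.append(name)
        else:
            synced.append(name)
    return synced, abandoned


_calls = {}


def flaky_audit(name):
    """Fail on the first call per subcloud, simulating sysinv not being ready."""
    _calls[name] = _calls.get(name, 0) + 1
    if _calls[name] == 1:
        raise SubcloudAuditError("sysinv URL unavailable")


if __name__ == "__main__":
    print(audit_subclouds(["subcloud1", "subcloud2"], flaky_audit))
```

The point of the sketch is that the failure is handled inside the audit loop, so one subcloud's error reschedules that subcloud rather than ending the audit.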
Actual Behavior
---------------
A simple bug in the exception handling causes the cert-mon audit to die instead of rescheduling an audit for this subcloud.
Reproducibility
---------------
Intermittent
System Configuration
-------
Large system testing, 1000 subclouds
Branch/Pull Time/Commit
-------
stx master commit 2021-09-14: Handle network-level issues during subcloud audit
Last Pass
---------
Passed prior to commit 2021-09-14: Handle network-level issues during subcloud audit
Timestamp/Logs
--------------
We should see the subcloud logging a re-attempt, but we don't:
2021-12-
2021-12-
2021-12-
2021-12-
2021-12-
2021-12-
Alarms
------
n/a
Test Activity
-------------
System Engineering
Workaround
----------
The workaround is one of the following:
1) Manage the subcloud. This will queue an audit and cert-mon will retry.
2) Wait for the daily cert-mon audit, which will clear the issue on the subcloud. The audit runs every 24 hours and could start at any time, so the maximum wait is 24 hours.
3) Restart the cert-mon service: sudo sm-restart service cert-mon. This will immediately audit any out-of-sync subclouds.
Changed in starlingx: | |
assignee: | nobody → Kyle MacLeod (kmacleod) |
Changed in starlingx: | |
importance: | Undecided → Low |
tags: | added: stx.7.0 stx.distcloud |
Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/config/+/821139