cert-mon: incorrect exception handling on failure to obtain subcloud sysinv URL

Bug #1953691 reported by Kyle MacLeod
This bug affects 1 person
Affects     Status         Importance   Assigned to    Milestone
StarlingX   Fix Released   Low          Kyle MacLeod

Bug Description

Brief Description
-----------------

When bringing a subcloud online, if the DC system-controller cannot talk to the subcloud sysinv subsystem, then cert-mon fails to audit the subcloud, and the normal retry mechanism is skipped. The subcloud will stay out-of-sync until either the subcloud is managed, or the next cert-mon daily audit runs.

Severity
--------
Minor

Steps to Reproduce
-----------------

Importing 50 subclouds

Expected Behavior
-----------------

After the subcloud comes online, cert-mon audits the subcloud.

In this case, for some reason after the subcloud comes online, the sysinv API is not yet ready.

The subcloud comes online, but when the cert-mon audit starts it cannot pull the remote sysinv URL from the subcloud (a separate issue, but one that exposes this bug).

Any failure in the remote subcloud sysinv API should result in cert-mon retrying the subcloud for several attempts before giving up. It would likely succeed in this case.
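The expected retry behavior can be sketched roughly as follows. This is a minimal illustration, not the cert-mon implementation: the `run_audits` helper, the `MAX_AUDIT_ATTEMPTS` value, and the queue layout are assumptions. Only the `SubcloudAuditData` name and its `name`/`audit_count` fields come from the log output in this report.

```python
import collections

# Assumed limit; the report only says "several attempts".
MAX_AUDIT_ATTEMPTS = 3

# Mirrors the fields shown in the log line "SubcloudAuditData: {name: ..., audit_count: ...}".
SubcloudAuditData = collections.namedtuple(
    "SubcloudAuditData", ["name", "audit_count"])


def run_audits(queue, audit_fn, max_attempts=MAX_AUDIT_ATTEMPTS):
    """Drain the queue, re-enqueueing failed subclouds up to max_attempts.

    Returns the names of subclouds that were abandoned after the last attempt.
    """
    gave_up = []
    while queue:
        item = queue.popleft()
        try:
            audit_fn(item.name)
        except Exception:
            # Any failure reschedules the subcloud instead of dropping it.
            if item.audit_count < max_attempts:
                queue.append(item._replace(audit_count=item.audit_count + 1))
            else:
                gave_up.append(item.name)
    return gave_up
```

With a subcloud whose sysinv API becomes ready on the third attempt, `run_audits` returns an empty list, which is the outcome this report expects for the affected subcloud.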

Actual Behavior
---------------

A simple bug in the exception handling causes the cert-mon audit to die instead of rescheduling an audit for this subcloud: the handler references the 'subcloud_sysinv_url' variable before it has been initialized.
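The class of bug can be reproduced in isolation. The sketch below is an assumption about the shape of the problem, based only on the fix description in this report (initializing 'subcloud_sysinv_url' before it is referenced in the exception handling). `get_sysinv_url`, `audit_buggy`, `audit_fixed`, and `EndpointNotFound` are hypothetical names, not the cert-mon source.

```python
import logging

LOG = logging.getLogger(__name__)


class EndpointNotFound(Exception):
    """Hypothetical error raised when the sysinv endpoint lookup fails."""


def get_sysinv_url(subcloud_name):
    # Stand-in for the real lookup; always fails to mimic the reported condition.
    raise EndpointNotFound(
        "Cannot find sysinv endpoint for %s" % subcloud_name)


def audit_buggy(subcloud_name, requeue):
    try:
        subcloud_sysinv_url = get_sysinv_url(subcloud_name)
    except Exception:
        # The assignment above never completed, so reading the variable here
        # raises UnboundLocalError and requeue() is never reached.
        LOG.error("Audit of %s failed (url: %s)",
                  subcloud_name, subcloud_sysinv_url)
        requeue(subcloud_name)


def audit_fixed(subcloud_name, requeue):
    subcloud_sysinv_url = None  # initialized before the try block
    try:
        subcloud_sysinv_url = get_sysinv_url(subcloud_name)
    except Exception:
        LOG.error("Audit of %s failed (url: %s)",
                  subcloud_name, subcloud_sysinv_url)
        requeue(subcloud_name)
```

In the buggy variant the secondary exception escapes the handler, so the caller sees a crash rather than a rescheduled audit. In the fixed variant the handler completes and the subcloud is requeued.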

Reproducibility
---------------
Intermittent

System Configuration
--------------------
Large system testing, 1000 subclouds

Branch/Pull Time/Commit
-----------------------
stx master commit 2021-09-14: Handle network-level issues during subcloud audit

Last Pass
---------
Passed prior to commit 2021-09-14: Handle network-level issues during subcloud audit

Timestamp/Logs
--------------
cert-mon should log a re-attempt for the subcloud, but it does not:

2021-12-08T15:08:09.000 controller-0 cert-mon: info 2021-12-08 15:08:09.948 175173 INFO sysinv.cert_mon.service [req-6aa5fce8-6d44-4778-8a62-9912460797c9 - - - - -] subcloud845 is online. An online audit is queued
2021-12-08T15:08:09.000 controller-0 cert-mon: info 2021-12-08 15:08:09.950 175173 INFO sysinv.cert_mon.subcloud_audit_queue [req-6aa5fce8-6d44-4778-8a62-9912460797c9 - - - - -] Enqueued: SubcloudAuditData: {name: subcloud845, audit_count: 1}
2021-12-08T15:08:12.000 controller-0 cert-mon: info 175173 INFO sysinv.cert_mon.certificate_mon_manager [-] Auditing subcloud subcloud845, attempt #1 [qsize: 0]
2021-12-08T15:08:12.000 controller-0 cert-mon: err 175173 ERROR sysinv.cert_mon.utils [-] Cannot find sysinv endpoint for subcloud845
2021-12-08T15:08:12.000 controller-0 cert-mon: info 175173 INFO sysinv.cert_mon.utils [-] api_cmd http://[2620:10a:a001:a114::d00]:8119/v1.0/subclouds/subcloud845
2021-12-08T15:17:21.000 controller-0 cert-mon: info 175173 INFO sysinv.cert_mon.utils [-] api_cmd http://[2620:10a:a001:a114::d00]:8119/v1.0/subclouds/subcloud10

Alarms
------
n/a

Test Activity
-------------
System Engineering

Workaround
----------
The workaround is one of the following:

1) Manage the subcloud. This will queue an audit and cert-mon will retry.

2) Wait for the daily cert-mon audit, which clears the issue on the subcloud. The audit runs every 24 hours and could run at any time, so the maximum wait is 24 hours.

3) Restart the cert-mon service: sudo sm-restart service cert-mon. This will immediately audit any out-of-sync subclouds.

OpenStack Infra (hudson-openstack) wrote : Fix proposed to config (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/config/+/821139

Changed in starlingx:
status: New → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to config (master)

Reviewed: https://review.opendev.org/c/starlingx/config/+/821139
Committed: https://opendev.org/starlingx/config/commit/81d7521c608693dbaef0c48fc9a0fc0b35bb844a
Submitter: "Zuul (22348)"
Branch: master

commit 81d7521c608693dbaef0c48fc9a0fc0b35bb844a
Author: Kyle MacLeod <email address hidden>
Date: Wed Dec 8 14:31:21 2021 -0500

    Fix incorrect exception handling on failure to obtain sysinv URL

    This fix ensures the 'subcloud_sysinv_url' variable is initialized
    before being referenced in the exception handling.

    Test Plan:

    PASS: Simulate the condition by raising the exception from the utils.py
    dc_get_subcloud_sysinv_url function call. As expected, cert-mon now
    properly re-attempts the subcloud audit.

    Closes-Bug: 1953691

    Change-Id: Ic05390ec90f5b7c821a81ef66416440ffa6cfaa3
    Signed-off-by: Kyle MacLeod <email address hidden>
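
The test plan in the commit message (raising the exception from the dc_get_subcloud_sysinv_url call) could be simulated along these lines. This is only a sketch under stated assumptions: the `Utils` class stands in for the real utils.py module, and `audit_subcloud` and `simulate_failure` are hypothetical helpers. Only the `dc_get_subcloud_sysinv_url` and `subcloud_sysinv_url` names come from this report.

```python
from unittest import mock


class Utils:
    """Stand-in for the utils module; not the real cert-mon code."""

    @staticmethod
    def dc_get_subcloud_sysinv_url(subcloud_name):
        return "http://example.invalid/%s" % subcloud_name


def audit_subcloud(subcloud_name, requeue):
    """Return True if the audit ran, False if it was rescheduled."""
    subcloud_sysinv_url = None  # initialized before use, as in the fix
    try:
        subcloud_sysinv_url = Utils.dc_get_subcloud_sysinv_url(subcloud_name)
    except Exception:
        requeue(subcloud_name)
        return False
    # A real audit would use subcloud_sysinv_url here.
    return subcloud_sysinv_url is not None


def simulate_failure(subcloud_name):
    """Force the URL lookup to raise and report whether a retry was queued."""
    requeued = []
    with mock.patch.object(Utils, "dc_get_subcloud_sysinv_url",
                           side_effect=Exception("simulated failure")):
        ok = audit_subcloud(subcloud_name, requeued.append)
    return ok, requeued
```

Patching the lookup with `side_effect` keeps the simulation local to one call, so the stand-in function behaves normally again once the `with` block exits.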

Changed in starlingx:
status: In Progress → Fix Released
Kyle MacLeod (kmacleod)
Changed in starlingx:
assignee: nobody → Kyle MacLeod (kmacleod)
Ghada Khalil (gkhalil)
Changed in starlingx:
importance: Undecided → Low
tags: added: stx.7.0 stx.distcloud