cert-mon: audit giving up too early on encountering network issues

Bug #1964811 reported by Kyle MacLeod
This bug affects 1 person
Affects: StarlingX
Status: Fix Released
Importance: Medium
Assigned to: Kyle MacLeod
Milestone: none listed

Bug Description

Brief Description
----

We are seeing slow subclouds in a large DC environment with slower hardware. During the 24-hour audit, cert-mon could not communicate with several subclouds. There is a retry mechanism, but it retries only every 3 minutes, up to 5 times (about 15 minutes in total). This is too short.

After the retry limit is reached, the subcloud's dc-cert_sync_status endpoint remains out-of-sync for up to 24 hours, until the next daily audit runs.

Severity
----

Major

Steps to Reproduce
----

Occurs when there are network issues, or when subclouds are very slow to respond or to start up on boot.

Expected Behavior
----

cert-mon should audit the subcloud, and the subcloud sync state should go to in-sync

Actual Behavior
----

The subcloud sync state moves to out-of-sync

Reproducibility
----

Intermittent. Depends on hardware/network

System Configuration
----

Large DC system

Load info (eg: 2022-03-10_20-00-07)

BUILD_ID="2022-01-10_02-00-35"

Last Pass
----

New

Timestamp/Logs
----

2022-03-10T21:32:59.000 controller-0 cert-mon: err 3236085 ERROR sysinv.cert_mon.certificate_mon_manager [-] Cannot retrieve ssl certificate for subcloud2036 via: https://[fdff:719a:bf60:36::2]:6386/v1; maximum retry limit exceeded [5], giving up: timeout: timed out
2022-03-10 21:32:59.938 3236085 ERROR sysinv.cert_mon.certificate_mon_manager Traceback (most recent call last):
2022-03-10 21:32:59.938 3236085 ERROR sysinv.cert_mon.certificate_mon_manager File "/usr/lib64/python2.7/site-packages/sysinv/cert_mon/certificate_mon_manager.py", line 220, in _subcloud_audit
2022-03-10 21:32:59.938 3236085 ERROR sysinv.cert_mon.certificate_mon_manager timeout_secs=CONF.certmon.certificate_timeout_secs)
2022-03-10 21:32:59.938 3236085 ERROR sysinv.cert_mon.certificate_mon_manager File "/usr/lib64/python2.7/site-packages/sysinv/cert_mon/utils.py", line 542, in get_endpoint_certificate
2022-03-10 21:32:59.938 3236085 ERROR sysinv.cert_mon.certificate_mon_manager sock = socket.create_connection((host, port), timeout=timeout_secs)
2022-03-10 21:32:59.938 3236085 ERROR sysinv.cert_mon.certificate_mon_manager File "/usr/lib/python2.7/site-packages/eventlet/green/socket.py", line 63, in create_connection
2022-03-10 21:32:59.938 3236085 ERROR sysinv.cert_mon.certificate_mon_manager raise err
2022-03-10 21:32:59.938 3236085 ERROR sysinv.cert_mon.certificate_mon_manager timeout: timed out
2022-03-10 21:32:59.938 3236085 ERROR sysinv.cert_mon.certificate_mon_manager
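
The traceback shows the failure coming from socket.create_connection timing out, after which the retry limit of 5 is hit. The following is only a rough sketch of a bounded-retry connection probe, to illustrate why a slow subcloud exhausts a small retry budget. It is not the cert-mon implementation: the function name probe_with_retries and its parameters are hypothetical, and the real service may schedule its reattempts differently.

```python
import socket
import time


def probe_with_retries(host, port, max_retry=5, retry_interval=180.0,
                       timeout_secs=10.0,
                       connect=socket.create_connection,
                       sleep=time.sleep):
    """Open a TCP connection, retrying on network errors.

    Hypothetical helper for illustration only. Returns the number of
    attempts used on success; re-raises the last OSError (which includes
    socket timeouts) once max_retry attempts have failed.
    """
    for attempt in range(1, max_retry + 1):
        try:
            sock = connect((host, port), timeout=timeout_secs)
            sock.close()
            return attempt
        except OSError:
            if attempt == max_retry:
                # Retry budget exhausted: give up, as in the log above.
                raise
            sleep(retry_interval)
```

The connect and sleep arguments are injectable so the behaviour can be exercised without a real network or real delays.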

Alarms
----

Test Activity
----

Feature Testing/Developer Testing

Workaround
----

A quick workaround is to restart the cert-mon service. This will trigger audits on all the out-of-sync subclouds:

sudo sm-restart service cert-mon

A longer-term workaround is to increase the network_max_retry value in /etc/sysconfig/cert-mon.conf from 5 to a larger value, such as 30:

sudo vim /etc/sysconfig/cert-mon

[certmon]
...
network_max_retry=30
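
To see what a given network_max_retry value means in practice, the total retry window can be estimated as the retry count multiplied by the retry interval. The sketch below assumes the 3-minute (180 s) interval described in this report; the helper name retry_window_minutes is made up for illustration and is not part of cert-mon.

```python
def retry_window_minutes(network_max_retry, network_retry_interval_secs):
    """Approximate total time spent retrying before giving up, in minutes.

    Hypothetical helper: retry count times the per-retry interval.
    """
    return network_max_retry * network_retry_interval_secs / 60.0


# Original default: 5 retries, 180 s apart.
print(retry_window_minutes(5, 180))   # 15.0
# Workaround value: 30 retries, 180 s apart.
print(retry_window_minutes(30, 180))  # 90.0
```

With the interval unchanged, raising the retry count from 5 to 30 extends the window from about 15 minutes to about 90 minutes.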

OpenStack Infra (hudson-openstack) wrote : Fix proposed to stx-puppet (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/stx-puppet/+/833888

Changed in starlingx:
status: New → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix proposed to config (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/config/+/833889

OpenStack Infra (hudson-openstack) wrote : Fix merged to stx-puppet (master)

Reviewed: https://review.opendev.org/c/starlingx/stx-puppet/+/833888
Committed: https://opendev.org/starlingx/stx-puppet/commit/a5c4cd1bbaaaf06fbd42ab8e10ea994aa1420ee6
Submitter: "Zuul (22348)"
Branch: master

commit a5c4cd1bbaaaf06fbd42ab8e10ea994aa1420ee6
Author: Kyle MacLeod <email address hidden>
Date: Tue Mar 15 13:21:25 2022 -0400

    Extend cert-mon network_max_retry default value

    We are running into cases where slow hardware means that subclouds can
    take a long time to complete their startup. In such cases, cert-mon
    is giving up on the initial audit after 15m, leaving the subcloud in
    an out-of-sync state until the next daily cert-mon audit run.

    The network_max_retry works with the network_retry_interval (in seconds)
    to affect the total time spent retrying a subcloud before giving up:

      network_max_retry * (network_retry_interval/60) = time in minutes

    So we are increasing the overall retry time from 15m to 90m:

       30 retries * (180s / 60) = 90m

    Test Plan:

    PASS:
    - install using new values, verify that the default is now changed
    - verify retry limit is reached
    - verify case where daily audit is initiated while the subcloud is in the
      reattempt state

    Closes-Bug: 1964811

    Signed-off-by: Kyle MacLeod <email address hidden>
    Change-Id: I3d279a4c0f9f041047df4b5a3e850396d91bbb24

Changed in starlingx:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix merged to config (master)

Reviewed: https://review.opendev.org/c/starlingx/config/+/833889
Committed: https://opendev.org/starlingx/config/commit/1bafdbde4b42b58f80292a5c69221b55f44b957f
Submitter: "Zuul (22348)"
Branch: master

commit 1bafdbde4b42b58f80292a5c69221b55f44b957f
Author: Kyle MacLeod <email address hidden>
Date: Tue Mar 15 13:20:18 2022 -0400

    Extend cert-mon network_max_retry default value

    We are running into cases where slow hardware means that subclouds can
    take a long time to complete their startup. In such cases, cert-mon
    is giving up on the initial audit after 15m, leaving the subcloud in
    an out-of-sync state until the next daily cert-mon audit run.

    The network_max_retry works with the network_retry_interval (in seconds)
    to affect the total time spent retrying a subcloud before giving up:

      network_max_retry * (network_retry_interval/60) = time in minutes

    So we are increasing the overall retry time from 15m to 90m:

       30 retries * (180s / 60) = 90m

    Test Plan:

    PASS:
    - install using new values, verify that the default is now changed
    - verify retry limit is reached
    - verify case where daily audit is initiated while the subcloud is in the
      reattempt state

    Partial-Bug: 1964811

    Signed-off-by: Kyle MacLeod <email address hidden>
    Change-Id: I706334eb9b665a4303a971255fa66fc2f27f3976

Ghada Khalil (gkhalil)
Changed in starlingx:
assignee: nobody → Kyle MacLeod (kmacleod)
importance: Undecided → Medium
tags: added: stx.7.0 stx.config stx.security