cert-mon: audit giving up too early on encountering network issues
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
StarlingX | Fix Released | Medium | Kyle MacLeod |
Bug Description
Brief Description
----
We are seeing subclouds respond slowly in a large DC (Distributed Cloud) environment with slower hardware. During the 24-hour audit, cert-mon could not communicate with several subclouds. There is a retry mechanism, but it only retries every 3 minutes, up to 5 times, so the audit gives up after roughly 15 minutes. This is too short.
After the retry limit is reached, the subcloud's dc-cert_sync_status endpoint remains out-of-sync for up to 24 hours, until the next daily audit runs.
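The effect of the retry budget can be illustrated with a small model. This is only a sketch based on the numbers in the description (3-minute interval, 5 retries); it is not cert-mon's actual code, and the names `RetryPolicy`, `audit_outcome`, and `unreachable_minutes` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RetryPolicy:
    """Fixed-interval retry budget; defaults follow the bug description."""
    interval_minutes: int = 3
    max_retries: int = 5

    def window_minutes(self) -> int:
        # Total time spent retrying before the audit gives up.
        return self.interval_minutes * self.max_retries


def audit_outcome(policy: RetryPolicy, unreachable_minutes: int) -> str:
    """Return the sync state after one audit pass (simplified model).

    unreachable_minutes is a hypothetical input: how long the subcloud
    stays unreachable after the first failed attempt.
    """
    for attempt in range(1, policy.max_retries + 1):
        elapsed = attempt * policy.interval_minutes
        if elapsed >= unreachable_minutes:
            return "in-sync"  # a retry finally reached the subcloud
    return "out-of-sync"  # budget exhausted; wait for the next daily audit
```

Under this model, a subcloud that needs 20 minutes to become reachable ends up out-of-sync with the current budget (`audit_outcome(RetryPolicy(), 20)`), while doubling the retry count (`RetryPolicy(max_retries=10)`) would let the same audit succeed.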
Severity
----
Major
Steps to Reproduce
----
Run the audit while there are network issues, or while subclouds are very slow to respond or to start up on boot.
Expected Behavior
----
cert-mon should audit the subcloud, and the subcloud sync state should go to in-sync
Actual Behavior
----
The subcloud sync state moves to out-of-sync
Reproducibility
----
Intermittent. Depends on hardware/network
System Configuration
----
Large DC system
Load info (eg: 2022-03-
BUILD_ID=
Last Pass
----
New
Timestamp/Logs
----
2022-03-
2022-03-10 21:32:59.938 3236085 ERROR sysinv.
2022-03-10 21:32:59.938 3236085 ERROR sysinv.
2022-03-10 21:32:59.938 3236085 ERROR sysinv.
2022-03-10 21:32:59.938 3236085 ERROR sysinv.
2022-03-10 21:32:59.938 3236085 ERROR sysinv.
2022-03-10 21:32:59.938 3236085 ERROR sysinv.
2022-03-10 21:32:59.938 3236085 ERROR sysinv.
2022-03-10 21:32:59.938 3236085 ERROR sysinv.
2022-03-10 21:32:59.938 3236085 ERROR sysinv.
Alarms
----
Test Activity
----
Feature Testing/Developer Testing
Workaround
----
A quick workaround is to restart the cert-mon service. This will trigger audits on all the out-of-sync subclouds:
sudo sm-restart service cert-mon
A longer-term workaround is to increase the network_max_retry value in /etc/sysconfig/
sudo vim /etc/sysconfig/
[certmon]
...
network_
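As a rough illustration of the setting involved, the sketch below reads a `network_max_retry` value from a `[certmon]` section using Python's standard `configparser`. Only the section and option names come from this report; the fragment, the value 30, the fallback of 5 (taken from the retry count mentioned in the description), and the helper `read_max_retry` are assumptions, and the real service may parse its configuration differently.

```python
import configparser

# Hypothetical fragment: only the [certmon] section and the
# network_max_retry option name come from the report. The value 30 is an
# arbitrary example, not a recommended setting.
EXAMPLE_CONF = """
[certmon]
network_max_retry = 30
"""


def read_max_retry(text: str, default: int = 5) -> int:
    """Return network_max_retry from the given config text.

    Falls back to `default` when the section or option is absent. The
    default of 5 is assumed from the retry count in the description.
    """
    parser = configparser.ConfigParser()
    parser.read_string(text)
    return parser.getint("certmon", "network_max_retry", fallback=default)
```

With the 3-minute interval from the description, a value of 30 would stretch the retry window from about 15 minutes to about 90 minutes.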
Changed in starlingx:
assignee: nobody → Kyle MacLeod (kmacleod)
importance: Undecided → Medium
tags: added: stx.7.0 stx.config stx.security
Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/stx-puppet/+/833888