cert-mon start_watch call spinning 100% CPU after soak period
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
StarlingX | Fix Released | High | Kyle MacLeod |
Bug Description
Brief Description
-----------------
This is seen on a DC system. After some period of time (uptime: 28 days), the cert-mon process is seen spinning and taking up full CPU on a single core.
The call to start_watch fails inside the Kubernetes Python client's load_kube_config call. Because the call fails on every attempt and is retried each time, the process loops endlessly and pins one core at full CPU usage.
From a quick look, we may be hitting this issue:
https:/
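To make the failure mode concrete: a watch loop that retries immediately after every exception will spin when the underlying call can never succeed. The sketch below shows one common way to bound that cost, using exponential backoff between failed attempts. It is a minimal illustration only, not the cert-mon code or the released fix; `run_watch`, its parameters, and the injected `start_watch` callable are all hypothetical names.

```python
import time


def run_watch(start_watch, sleep=time.sleep, base_delay=1.0,
              max_delay=60.0, max_failures=None):
    """Call start_watch() repeatedly, backing off after each failure.

    Hypothetical helper for illustration. Returns the failure count once
    max_failures consecutive failures occur; with max_failures=None it
    never returns.
    """
    delay = base_delay
    failures = 0
    while True:
        try:
            start_watch()
            # The watch ended without raising: reset the backoff state.
            delay = base_delay
            failures = 0
        except Exception:
            failures += 1
            if max_failures is not None and failures >= max_failures:
                return failures
            # Sleep before retrying so a persistent failure cannot
            # consume a full core.
            sleep(delay)
            delay = min(delay * 2, max_delay)
```

With the defaults, a persistently failing call is retried after 1, 2, 4, ... seconds, capped at 60 seconds, instead of in a tight loop. The `sleep` parameter is injectable only so the loop can be exercised without real delays.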
Severity
--------
Major. During the failure condition, https admin endpoint certificates would not be renewed, resulting in communication failures to the affected subclouds.
Steps to Reproduce
------------------
Probably requires a long soak time to see this happen in the field.
Manually removing the file used in /tmp should reproduce this issue, but you will have to wait for the watch to be restarted. Am trying to reproduce locally.
Expected Behavior
-----------------
We should not hit this. The start_watch call should not fail repeatedly, and cert-mon should not spin at full CPU.
Actual Behavior
---------------
It looks like there may be a cleanup that happens in /tmp after some period of time, which is triggering this bug in the kubernetes-client code.
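One way a /tmp cleanup could produce a persistent failure is a process that remembers the path of a temp file it wrote earlier and keeps reusing that path after the file is gone. The sketch below models that pattern and a simple guard against it. It is an assumption made for illustration, not a copy of the kubernetes-client code; `_temp_paths` and `temp_file_for` are hypothetical names.

```python
import os
import tempfile

# Hypothetical cache mapping file content to the temp path that holds it.
_temp_paths = {}


def temp_file_for(content, verify=True):
    """Return the path of a temp file containing content.

    With verify=False the cached path is trusted blindly, which models
    the suspected bug. With verify=True the file is recreated if it has
    been removed from disk.
    """
    path = _temp_paths.get(content)
    if path is not None and (not verify or os.path.isfile(path)):
        return path
    # No usable cached file: write a fresh one and remember its path.
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "w") as f:
        f.write(content)
    _temp_paths[content] = path
    return path
```

If the file is deleted between calls, `temp_file_for(content, verify=False)` keeps returning a path that no longer exists, so every later attempt to read it fails in the same way. The `verify=True` path checks the file first and rewrites it when it is missing.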
Reproducibility
---------------
Seen in one DC lab, but not in others with similar uptime. We're still investigating the full set of contributing factors causing this issue.
System Configuration
-------
Distributed Cloud - this is on the systemcontroller node.
Branch/Pull Time/Commit
-------
stx master build from 2021-06-01_00-00-08
Last Pass
---------
Unknown. Not clear if enough soak was done on DC systems before and if CPU usage was monitored.
Timestamp/Logs
--------------
2021-07-
2021-07-13 20:50:37.769 1973275 ERROR sysinv.
2021-07-13 20:50:37.769 1973275 ERROR sysinv.
2021-07-13 20:50:37.769 1973275 ERROR sysinv.
2021-07-13 20:50:37.769 1973275 ERROR sysinv.
2021-07-13 20:50:37.769 1973275 ERROR sysinv.
2021-07-13 20:50:37.769 1973275 ERROR sysinv.
2021-07-13 20:50:37.769 1973275 ERROR sysinv.
2021-07-13 20:50:37.769 1973275 ERROR sysinv.
2021-07-13 20:50:37.769 1973275 ERROR sysinv.
2021-07-13 20:50:37.769 1973275 ERROR sysinv.
2021-07-13 20:50:37.769 1973275 ERROR sysinv.
2021-07-13 20:50:37.769 1973275 ERROR sysinv.
2021-07-13 20:50:37.769 1973275 ERROR sysinv.
2021-07-13 20:50:37.769 1973275 ERROR sysinv.
Test Activity
-------------
Found during DC subcloud scaling testing.
Workaround
-----------
Restart the cert-mon service:
sudo sm-restart service cert-mon
Similar issues were reported/fixed in other subsystems previously: https://bugs.launchpad.net/starlingx/+bug/1883599