cert-mon start_watch call spinning 100% CPU after soak period

Bug #1936435 reported by Ghada Khalil
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
StarlingX
Fix Released
High
Kyle MacLeod

Bug Description

Brief Description
-----------------
This is seen on a DC system. After some period of time (uptime: 28 days), the cert-mon process is seen spinning and taking up full CPU on a single core.

The call to start_watch is failing on a kubernetes python client call to load_kube_config. The call is continuously failing, which causes an endless loop and full CPU usage on a core.

From a quick look, we may be hitting this issue:
https://github.com/kubernetes-client/python/issues/765 which has an outstanding pull request at https://github.com/kubernetes-client/python-base/pull/38

Severity
--------
Major. During the failure condition, https admin endpoint certificates would not be renewed, resulting in communication failures to the affected subclouds.

Steps to Reproduce
------------------
Probably requires a long soak time time to see this happen in the field.

Manually removing the file used in /tmp should reproduce this issue, but you will have to wait for the watch to be restarted. Am trying to reproduce locally.

Expected Behavior
-----------------
We should not hit this.

Actual Behavior
---------------
It looks like there may be a cleanup that happens in /tmp after some period of time, which is triggering this bug in the kubernetes-client code.

Reproducibility
---------------
Seen in one DC lab, but not others with similar uptime. We're still investigating the full set of contributing factors causing this issue

System Configuration
--------------------
Distributed Cloud - this is on the systemcontroller node.

Branch/Pull Time/Commit
-----------------------
stx master build from 2021-06-01_00-00-08

Last Pass
---------
Unknown. Not clear if enough soak was done on DC systems before and if CPU usage was monitored.

Timestamp/Logs
--------------
2021-07-13T20:50:37.000 controller-0 cert-mon: err 1973275 ERROR sysinv.cert_mon.certificate_mon_manager [-] File does not exists: /tmp/tmph49hfi: ConfigException: File does not exists: /tmp/tmph49hfi
2021-07-13 20:50:37.769 1973275 ERROR sysinv.cert_mon.certificate_mon_manager Traceback (most recent call last):
2021-07-13 20:50:37.769 1973275 ERROR sysinv.cert_mon.certificate_mon_manager File "/usr/lib64/python2.7/site-packages/sysinv/cert_mon/certificate_mon_manager.py", line 259, in monitor_cert
2021-07-13 20:50:37.769 1973275 ERROR sysinv.cert_mon.certificate_mon_manager on_error=lambda task: self._add_reattempt_task(task),
2021-07-13 20:50:37.769 1973275 ERROR sysinv.cert_mon.certificate_mon_manager File "/usr/lib64/python2.7/site-packages/sysinv/cert_mon/watcher.py", line 214, in start_watch
2021-07-13 20:50:37.769 1973275 ERROR sysinv.cert_mon.certificate_mon_manager config.load_kube_config(KUBE_CONFIG_PATH)
2021-07-13 20:50:37.769 1973275 ERROR sysinv.cert_mon.certificate_mon_manager File "/usr/lib/python2.7/site-packages/kubernetes/config/kube_config.py", line 531, in load_kube_config
2021-07-13 20:50:37.769 1973275 ERROR sysinv.cert_mon.certificate_mon_manager loader.load_and_set(config)
2021-07-13 20:50:37.769 1973275 ERROR sysinv.cert_mon.certificate_mon_manager File "/usr/lib/python2.7/site-packages/kubernetes/config/kube_config.py", line 413, in load_and_set
2021-07-13 20:50:37.769 1973275 ERROR sysinv.cert_mon.certificate_mon_manager self._load_cluster_info()
2021-07-13 20:50:37.769 1973275 ERROR sysinv.cert_mon.certificate_mon_manager File "/usr/lib/python2.7/site-packages/kubernetes/config/kube_config.py", line 392, in _load_cluster_info
2021-07-13 20:50:37.769 1973275 ERROR sysinv.cert_mon.certificate_mon_manager file_base_path=self._config_base_path).as_file()
2021-07-13 20:50:37.769 1973275 ERROR sysinv.cert_mon.certificate_mon_manager File "/usr/lib/python2.7/site-packages/kubernetes/config/kube_config.py", line 116, in as_file
2021-07-13 20:50:37.769 1973275 ERROR sysinv.cert_mon.certificate_mon_manager raise ConfigException("File does not exists: %s" % self._file)
2021-07-13 20:50:37.769 1973275 ERROR sysinv.cert_mon.certificate_mon_manager ConfigException: File does not exists: /tmp/tmph49hfi

Test Activity
-------------
Found during DC subcloud scaling testing.

Workaround
-----------
Restart the cert-mon service:
sudo sm-restart service cert-mon

Revision history for this message
Ghada Khalil (gkhalil) wrote :

Similar issues were reported/fixed in other subsystems previously: https://bugs.launchpad.net/starlingx/+bug/1883599

Changed in starlingx:
assignee: nobody → Kyle MacLeod (kmacleod)
Revision history for this message
Ghada Khalil (gkhalil) wrote :

screening: stx.6.0 / high. will fix in stx master to start given there is an easy workaround that can be applied to stx.5.0

tags: added: stx.6.0 stx.distcloud
tags: added: stx.containers
Changed in starlingx:
importance: Undecided → High
status: New → Incomplete
status: Incomplete → Triaged
description: updated
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to config (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/config/+/801131

Changed in starlingx:
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to config (master)

Reviewed: https://review.opendev.org/c/starlingx/config/+/801131
Committed: https://opendev.org/starlingx/config/commit/e255896eea3a2c773be5293df1bad4ed6c9c4c51
Submitter: "Zuul (22348)"
Branch: master

commit e255896eea3a2c773be5293df1bad4ed6c9c4c51
Author: Kyle MacLeod <email address hidden>
Date: Fri Jul 16 10:42:19 2021 -0400

    Set cert-mon temp dir location to /var/run/cert-mon_tmp

    Redirect the k8s python client's use of /tmp to /var/run/cert-mon_tmp
    via setting TMPDIR

    This is a known issue of kubernetes python client:
    https://github.com/kubernetes-client/python/issues/765

    The fix is the same as for
    https://bugs.launchpad.net/starlingx/+bug/1883599
    See commit message there for more details.

    Related-Bug: 1883599
    Closes-Bug: 1936435

    Signed-off-by: Kyle MacLeod <email address hidden>
    Change-Id: I0e163bd1b4d5a19f07267dd4cd14bad1b8cb20bb

Changed in starlingx:
status: In Progress → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.