StarlingX

cert-mon start_watch call spinning 100% CPU after soak period

Bug #1936435 reported by Ghada Khalil on 2021-07-15

This bug affects 1 person

Affects		Status	Importance	Assigned to	Milestone
	StarlingX	Fix Released	High	Kyle MacLeod

Bug Description

Brief Description
-----------------
This is seen on a DC system. After some period of time (uptime: 28 days), the cert-mon process is seen spinning and taking up full CPU on a single core.

The call to start_watch is failing on a kubernetes python client call to load_kube_config. The call is continuously failing, which causes an endless loop and full CPU usage on a core.

From a quick look, we may be hitting this issue:
https://github.com/kubernetes-client/python/issues/765 which has an outstanding pull request at https://github.com/kubernetes-client/python-base/pull/38

Severity
--------
Major. During the failure condition, https admin endpoint certificates would not be renewed, resulting in communication failures to the affected subclouds.

Steps to Reproduce
------------------
Probably requires a long soak time time to see this happen in the field.

Manually removing the file used in /tmp should reproduce this issue, but you will have to wait for the watch to be restarted. Am trying to reproduce locally.

Expected Behavior
-----------------
We should not hit this.

Actual Behavior
---------------
It looks like there may be a cleanup that happens in /tmp after some period of time, which is triggering this bug in the kubernetes-client code.

Reproducibility
---------------
Seen in one DC lab, but not others with similar uptime. We're still investigating the full set of contributing factors causing this issue

System Configuration
--------------------
Distributed Cloud - this is on the systemcontroller node.

Branch/Pull Time/Commit
-----------------------
stx master build from 2021-06-01_00-00-08

Last Pass
---------
Unknown. Not clear if enough soak was done on DC systems before and if CPU usage was monitored.

Timestamp/Logs
--------------
2021-07-13T20:50:37.000 controller-0 cert-mon: err 1973275 ERROR sysinv.cert_mon.certificate_mon_manager [-] File does not exists: /tmp/tmph49hfi: ConfigException: File does not exists: /tmp/tmph49hfi
2021-07-13 20:50:37.769 1973275 ERROR sysinv.cert_mon.certificate_mon_manager Traceback (most recent call last):
2021-07-13 20:50:37.769 1973275 ERROR sysinv.cert_mon.certificate_mon_manager File "/usr/lib64/python2.7/site-packages/sysinv/cert_mon/certificate_mon_manager.py", line 259, in monitor_cert
2021-07-13 20:50:37.769 1973275 ERROR sysinv.cert_mon.certificate_mon_manager on_error=lambda task: self._add_reattempt_task(task),
2021-07-13 20:50:37.769 1973275 ERROR sysinv.cert_mon.certificate_mon_manager File "/usr/lib64/python2.7/site-packages/sysinv/cert_mon/watcher.py", line 214, in start_watch
2021-07-13 20:50:37.769 1973275 ERROR sysinv.cert_mon.certificate_mon_manager config.load_kube_config(KUBE_CONFIG_PATH)
2021-07-13 20:50:37.769 1973275 ERROR sysinv.cert_mon.certificate_mon_manager File "/usr/lib/python2.7/site-packages/kubernetes/config/kube_config.py", line 531, in load_kube_config
2021-07-13 20:50:37.769 1973275 ERROR sysinv.cert_mon.certificate_mon_manager loader.load_and_set(config)
2021-07-13 20:50:37.769 1973275 ERROR sysinv.cert_mon.certificate_mon_manager File "/usr/lib/python2.7/site-packages/kubernetes/config/kube_config.py", line 413, in load_and_set
2021-07-13 20:50:37.769 1973275 ERROR sysinv.cert_mon.certificate_mon_manager self._load_cluster_info()
2021-07-13 20:50:37.769 1973275 ERROR sysinv.cert_mon.certificate_mon_manager File "/usr/lib/python2.7/site-packages/kubernetes/config/kube_config.py", line 392, in _load_cluster_info
2021-07-13 20:50:37.769 1973275 ERROR sysinv.cert_mon.certificate_mon_manager file_base_path=self._config_base_path).as_file()
2021-07-13 20:50:37.769 1973275 ERROR sysinv.cert_mon.certificate_mon_manager File "/usr/lib/python2.7/site-packages/kubernetes/config/kube_config.py", line 116, in as_file
2021-07-13 20:50:37.769 1973275 ERROR sysinv.cert_mon.certificate_mon_manager raise ConfigException("File does not exists: %s" % self._file)
2021-07-13 20:50:37.769 1973275 ERROR sysinv.cert_mon.certificate_mon_manager ConfigException: File does not exists: /tmp/tmph49hfi

Test Activity
-------------
Found during DC subcloud scaling testing.

Workaround
-----------
Restart the cert-mon service:
sudo sm-restart service cert-mon

See original description

Tags:

Revision history for this message

Ghada Khalil (gkhalil) wrote on 2021-07-15:

Similar issues were reported/fixed in other subsystems previously: https://bugs.launchpad.net/starlingx/+bug/1883599

Changed in starlingx:
assignee:	nobody → Kyle MacLeod (kmacleod)

Revision history for this message

Ghada Khalil (gkhalil) wrote on 2021-07-15:

screening: stx.6.0 / high. will fix in stx master to start given there is an easy workaround that can be applied to stx.5.0

tags:	added: stx.6.0 stx.distcloud
tags:	added: stx.containers
Changed in starlingx:
importance:	Undecided → High
status:	New → Incomplete
status:	Incomplete → Triaged
description:	updated

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2021-07-16: Fix proposed to config (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/config/+/801131

Changed in starlingx:
status:	Triaged → In Progress

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2021-07-20: Fix merged to config (master)

Reviewed: https://review.opendev.org/c/starlingx/config/+/801131
Committed: https://opendev.org/starlingx/config/commit/e255896eea3a2c773be5293df1bad4ed6c9c4c51
Submitter: "Zuul (22348)"
Branch: master

commit e255896eea3a2c773be5293df1bad4ed6c9c4c51
Author: Kyle MacLeod <email address hidden>
Date: Fri Jul 16 10:42:19 2021 -0400

Set cert-mon temp dir location to /var/run/cert-mon_tmp

Redirect the k8s python client's use of /tmp to /var/run/cert-mon_tmp
via setting TMPDIR

This is a known issue of kubernetes python client:
https://github.com/kubernetes-client/python/issues/765

    The fix is the same as for
    https://bugs.launchpad.net/starlingx/+bug/1883599
    See commit message there for more details.

Related-Bug: 1883599
Closes-Bug: 1936435

Signed-off-by: Kyle MacLeod <email address hidden>
Change-Id: I0e163bd1b4d5a19f07267dd4cd14bad1b8cb20bb

Changed in starlingx:
status:	In Progress → Fix Released

Report a bug

This report contains Public information

Everyone can see this information.

You are

Subscribing...

Edit bug mail

Other bug subscribers

Remote bug watches

auto-github-kubernetes-client-python #765 Edit

Bug watches keep track of this bug in other bug trackers.