Brief Description
-----------------
This is seen on a DC system. After some period of time (uptime: 28 days), the cert-mon process is observed spinning, consuming 100% of a single CPU core.
The call to start_watch is failing on the kubernetes python client call to load_kube_config. The call fails continuously, which results in an endless retry loop and full CPU usage on one core.
From a quick look, we may be hitting this upstream issue:
https://github.com/kubernetes-client/python/issues/765, which has an outstanding pull request at https://github.com/kubernetes-client/python-base/pull/38.
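For illustration only, a rough sketch (not the actual sysinv code; the kubeconfig path and the helper name are assumed) of how a bounded backoff around load_kube_config would keep a persistent failure like this from pegging a core:

import time
from kubernetes import config
from kubernetes.config.config_exception import ConfigException

KUBE_CONFIG_PATH = '/etc/kubernetes/admin.conf'  # assumed location

def start_watch_with_backoff(max_delay=60):
    # Hypothetical guard, not the actual cert-mon retry logic: retry
    # load_kube_config with exponential backoff instead of re-attempting
    # immediately, so a persistent failure does not busy-loop on a core.
    delay = 1
    while True:
        try:
            config.load_kube_config(KUBE_CONFIG_PATH)
            return
        except ConfigException:
            time.sleep(delay)
            delay = min(delay * 2, max_delay)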
Severity
--------
Major. During the failure condition, HTTPS admin endpoint certificates would not be renewed, resulting in communication failures to the affected subclouds.
Steps to Reproduce
------------------
Probably requires a long soak time to see this happen in the field.
Manually removing the file used in /tmp should reproduce this issue, but you will have to wait for the watch to be restarted. Attempting to reproduce locally.
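As a rough aid for the manual step above (hypothetical helper, not part of the product), the kubernetes client writes embedded cert data to anonymous temp files such as /tmp/tmph49hfi, so candidates can be listed before removing the one created for cert-mon's kubeconfig:

import glob
import os

# List candidate temp files written by the kubernetes python client.
# Do not blindly delete files in /tmp on a live system; confirm a file
# was created for cert-mon before removing it.
for path in glob.glob('/tmp/tmp*'):
    st = os.stat(path)
    print(path, st.st_uid, st.st_size)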
Expected Behavior
-----------------
We should not hit this; cert-mon should not busy-loop at full CPU when load_kube_config fails persistently.
Actual Behavior
---------------
It looks like there may be a cleanup of /tmp after some period of time (possibly a periodic tmp cleaner such as systemd-tmpfiles) which removes the temporary file the kubernetes client wrote for certificate data; that is what triggers this bug in the kubernetes-client code.
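A minimal sketch of why a /tmp cleanup produces the traceback below (paraphrasing the as_file() existence check from kube_config.py, not the exact upstream code):

import os
import tempfile
from kubernetes.config.config_exception import ConfigException

# The client writes embedded certificate data to an anonymous temp file
# (e.g. /tmp/tmph49hfi) and keeps referring to that path on later loads.
tmp = tempfile.NamedTemporaryFile(delete=False)
tmp.write(b'certificate data written by the client')
tmp.close()

os.remove(tmp.name)  # stands in for the periodic /tmp cleanup

# On the next load_kube_config(), the existence check fails and raises the
# same error seen in the logs.
if not os.path.isfile(tmp.name):
    raise ConfigException("File does not exists: %s" % tmp.name)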
Reproducibility
---------------
Seen in one DC lab, but not in others with similar uptime. We are still investigating the full set of contributing factors.
System Configuration
--------------------
Distributed Cloud - this is on the systemcontroller node.
Branch/Pull Time/Commit
-----------------------
stx master build from 2021-06-01_00-00-08
Last Pass
---------
Unknown. It is not clear whether DC systems were soaked long enough in earlier testing, or whether CPU usage was monitored when they were.
Timestamp/Logs
--------------
2021-07-13T20:50:37.000 controller-0 cert-mon: err 1973275 ERROR sysinv.cert_mon.certificate_mon_manager [-] File does not exists: /tmp/tmph49hfi: ConfigException: File does not exists: /tmp/tmph49hfi
2021-07-13 20:50:37.769 1973275 ERROR sysinv.cert_mon.certificate_mon_manager Traceback (most recent call last):
2021-07-13 20:50:37.769 1973275 ERROR sysinv.cert_mon.certificate_mon_manager File "/usr/lib64/python2.7/site-packages/sysinv/cert_mon/certificate_mon_manager.py", line 259, in monitor_cert
2021-07-13 20:50:37.769 1973275 ERROR sysinv.cert_mon.certificate_mon_manager on_error=lambda task: self._add_reattempt_task(task),
2021-07-13 20:50:37.769 1973275 ERROR sysinv.cert_mon.certificate_mon_manager File "/usr/lib64/python2.7/site-packages/sysinv/cert_mon/watcher.py", line 214, in start_watch
2021-07-13 20:50:37.769 1973275 ERROR sysinv.cert_mon.certificate_mon_manager config.load_kube_config(KUBE_CONFIG_PATH)
2021-07-13 20:50:37.769 1973275 ERROR sysinv.cert_mon.certificate_mon_manager File "/usr/lib/python2.7/site-packages/kubernetes/config/kube_config.py", line 531, in load_kube_config
2021-07-13 20:50:37.769 1973275 ERROR sysinv.cert_mon.certificate_mon_manager loader.load_and_set(config)
2021-07-13 20:50:37.769 1973275 ERROR sysinv.cert_mon.certificate_mon_manager File "/usr/lib/python2.7/site-packages/kubernetes/config/kube_config.py", line 413, in load_and_set
2021-07-13 20:50:37.769 1973275 ERROR sysinv.cert_mon.certificate_mon_manager self._load_cluster_info()
2021-07-13 20:50:37.769 1973275 ERROR sysinv.cert_mon.certificate_mon_manager File "/usr/lib/python2.7/site-packages/kubernetes/config/kube_config.py", line 392, in _load_cluster_info
2021-07-13 20:50:37.769 1973275 ERROR sysinv.cert_mon.certificate_mon_manager file_base_path=self._config_base_path).as_file()
2021-07-13 20:50:37.769 1973275 ERROR sysinv.cert_mon.certificate_mon_manager File "/usr/lib/python2.7/site-packages/kubernetes/config/kube_config.py", line 116, in as_file
2021-07-13 20:50:37.769 1973275 ERROR sysinv.cert_mon.certificate_mon_manager raise ConfigException("File does not exists: %s" % self._file)
2021-07-13 20:50:37.769 1973275 ERROR sysinv.cert_mon.certificate_mon_manager ConfigException: File does not exists: /tmp/tmph49hfi
Test Activity
-------------
Found during DC subcloud scaling testing.
Workaround
----------
Restart the cert-mon service:
sudo sm-restart service cert-mon