Activity log for bug #1936435

Date Who What changed Old value New value Message
2021-07-15 21:17:19 Ghada Khalil bug added bug
2021-07-15 21:17:40 Ghada Khalil starlingx: assignee Kyle MacLeod (kmacleod)
2021-07-15 21:26:01 Ghada Khalil tags stx.6.0 stx.distcloud
2021-07-15 21:26:17 Ghada Khalil tags stx.6.0 stx.distcloud stx.6.0 stx.containers stx.distcloud
2021-07-15 21:26:28 Ghada Khalil starlingx: importance Undecided High
2021-07-15 21:26:29 Ghada Khalil starlingx: status New Incomplete
2021-07-15 21:26:33 Ghada Khalil starlingx: status Incomplete Triaged
2021-07-15 21:29:47 Ghada Khalil description

Old value:

Brief Description
-----------------
This is seen on a DC system. After some period of time (uptime: 28 days), the cert-mon process is seen spinning and taking up full CPU on a single core.

The call to start_watch is failing on a kubernetes python client call to load_kube_config. The call is continuously failing, which causes an endless loop and full CPU usage on a core.

From a quick look, we may be hitting this issue: https://github.com/kubernetes-client/python/issues/765 which has an outstanding pull request at https://github.com/kubernetes-client/python-base/pull/38

Severity
--------
Major. During the failure condition, https admin endpoint certificates would not be renewed, resulting in communication failures to the affected subclouds.

Steps to Reproduce
------------------
Probably requires a long soak time to see this happen in the field. Manually removing the file used in /tmp should reproduce this issue, but you will have to wait for the watch to be restarted. Am trying to reproduce locally.

Expected Behavior
-----------------
We should not hit this.

Actual Behavior
---------------
It looks like there may be a cleanup that happens in /tmp after some period of time, which is triggering this bug in the kubernetes-client code.

Reproducibility
---------------
Seen in one DC lab, but looks to be reproducible.

System Configuration
--------------------
Distributed Cloud - this is on the systemcontroller node.

Branch/Pull Time/Commit
-----------------------
stx master build from 2021-06-01_00-00-08

Last Pass
---------
Unknown. Not clear if enough soak was done on DC systems before and if CPU usage was monitored.

Timestamp/Logs
--------------
2021-07-13T20:50:37.000 controller-0 cert-mon: err 1973275 ERROR sysinv.cert_mon.certificate_mon_manager [-] File does not exists: /tmp/tmph49hfi: ConfigException: File does not exists: /tmp/tmph49hfi
2021-07-13 20:50:37.769 1973275 ERROR sysinv.cert_mon.certificate_mon_manager Traceback (most recent call last):
2021-07-13 20:50:37.769 1973275 ERROR sysinv.cert_mon.certificate_mon_manager File "/usr/lib64/python2.7/site-packages/sysinv/cert_mon/certificate_mon_manager.py", line 259, in monitor_cert
2021-07-13 20:50:37.769 1973275 ERROR sysinv.cert_mon.certificate_mon_manager on_error=lambda task: self._add_reattempt_task(task),
2021-07-13 20:50:37.769 1973275 ERROR sysinv.cert_mon.certificate_mon_manager File "/usr/lib64/python2.7/site-packages/sysinv/cert_mon/watcher.py", line 214, in start_watch
2021-07-13 20:50:37.769 1973275 ERROR sysinv.cert_mon.certificate_mon_manager config.load_kube_config(KUBE_CONFIG_PATH)
2021-07-13 20:50:37.769 1973275 ERROR sysinv.cert_mon.certificate_mon_manager File "/usr/lib/python2.7/site-packages/kubernetes/config/kube_config.py", line 531, in load_kube_config
2021-07-13 20:50:37.769 1973275 ERROR sysinv.cert_mon.certificate_mon_manager loader.load_and_set(config)
2021-07-13 20:50:37.769 1973275 ERROR sysinv.cert_mon.certificate_mon_manager File "/usr/lib/python2.7/site-packages/kubernetes/config/kube_config.py", line 413, in load_and_set
2021-07-13 20:50:37.769 1973275 ERROR sysinv.cert_mon.certificate_mon_manager self._load_cluster_info()
2021-07-13 20:50:37.769 1973275 ERROR sysinv.cert_mon.certificate_mon_manager File "/usr/lib/python2.7/site-packages/kubernetes/config/kube_config.py", line 392, in _load_cluster_info
2021-07-13 20:50:37.769 1973275 ERROR sysinv.cert_mon.certificate_mon_manager file_base_path=self._config_base_path).as_file()
2021-07-13 20:50:37.769 1973275 ERROR sysinv.cert_mon.certificate_mon_manager File "/usr/lib/python2.7/site-packages/kubernetes/config/kube_config.py", line 116, in as_file
2021-07-13 20:50:37.769 1973275 ERROR sysinv.cert_mon.certificate_mon_manager raise ConfigException("File does not exists: %s" % self._file)
2021-07-13 20:50:37.769 1973275 ERROR sysinv.cert_mon.certificate_mon_manager ConfigException: File does not exists: /tmp/tmph49hfi

Test Activity
-------------
Found during DC subcloud scaling testing.

Workaround
-----------
Restart the cert-mon service: sudo sm-restart service cert-mon

New value:

Brief Description
-----------------
This is seen on a DC system. After some period of time (uptime: 28 days), the cert-mon process is seen spinning and taking up full CPU on a single core.

The call to start_watch is failing on a kubernetes python client call to load_kube_config. The call is continuously failing, which causes an endless loop and full CPU usage on a core.

From a quick look, we may be hitting this issue: https://github.com/kubernetes-client/python/issues/765 which has an outstanding pull request at https://github.com/kubernetes-client/python-base/pull/38

Severity
--------
Major. During the failure condition, https admin endpoint certificates would not be renewed, resulting in communication failures to the affected subclouds.

Steps to Reproduce
------------------
Probably requires a long soak time to see this happen in the field. Manually removing the file used in /tmp should reproduce this issue, but you will have to wait for the watch to be restarted. Am trying to reproduce locally.

Expected Behavior
-----------------
We should not hit this.

Actual Behavior
---------------
It looks like there may be a cleanup that happens in /tmp after some period of time, which is triggering this bug in the kubernetes-client code.

Reproducibility
---------------
Seen in one DC lab, but not others with similar uptime. We're still investigating the full set of contributing factors causing this issue.

System Configuration
--------------------
Distributed Cloud - this is on the systemcontroller node.

Branch/Pull Time/Commit
-----------------------
stx master build from 2021-06-01_00-00-08

Last Pass
---------
Unknown. Not clear if enough soak was done on DC systems before and if CPU usage was monitored.

Timestamp/Logs
--------------
2021-07-13T20:50:37.000 controller-0 cert-mon: err 1973275 ERROR sysinv.cert_mon.certificate_mon_manager [-] File does not exists: /tmp/tmph49hfi: ConfigException: File does not exists: /tmp/tmph49hfi
2021-07-13 20:50:37.769 1973275 ERROR sysinv.cert_mon.certificate_mon_manager Traceback (most recent call last):
2021-07-13 20:50:37.769 1973275 ERROR sysinv.cert_mon.certificate_mon_manager File "/usr/lib64/python2.7/site-packages/sysinv/cert_mon/certificate_mon_manager.py", line 259, in monitor_cert
2021-07-13 20:50:37.769 1973275 ERROR sysinv.cert_mon.certificate_mon_manager on_error=lambda task: self._add_reattempt_task(task),
2021-07-13 20:50:37.769 1973275 ERROR sysinv.cert_mon.certificate_mon_manager File "/usr/lib64/python2.7/site-packages/sysinv/cert_mon/watcher.py", line 214, in start_watch
2021-07-13 20:50:37.769 1973275 ERROR sysinv.cert_mon.certificate_mon_manager config.load_kube_config(KUBE_CONFIG_PATH)
2021-07-13 20:50:37.769 1973275 ERROR sysinv.cert_mon.certificate_mon_manager File "/usr/lib/python2.7/site-packages/kubernetes/config/kube_config.py", line 531, in load_kube_config
2021-07-13 20:50:37.769 1973275 ERROR sysinv.cert_mon.certificate_mon_manager loader.load_and_set(config)
2021-07-13 20:50:37.769 1973275 ERROR sysinv.cert_mon.certificate_mon_manager File "/usr/lib/python2.7/site-packages/kubernetes/config/kube_config.py", line 413, in load_and_set
2021-07-13 20:50:37.769 1973275 ERROR sysinv.cert_mon.certificate_mon_manager self._load_cluster_info()
2021-07-13 20:50:37.769 1973275 ERROR sysinv.cert_mon.certificate_mon_manager File "/usr/lib/python2.7/site-packages/kubernetes/config/kube_config.py", line 392, in _load_cluster_info
2021-07-13 20:50:37.769 1973275 ERROR sysinv.cert_mon.certificate_mon_manager file_base_path=self._config_base_path).as_file()
2021-07-13 20:50:37.769 1973275 ERROR sysinv.cert_mon.certificate_mon_manager File "/usr/lib/python2.7/site-packages/kubernetes/config/kube_config.py", line 116, in as_file
2021-07-13 20:50:37.769 1973275 ERROR sysinv.cert_mon.certificate_mon_manager raise ConfigException("File does not exists: %s" % self._file)
2021-07-13 20:50:37.769 1973275 ERROR sysinv.cert_mon.certificate_mon_manager ConfigException: File does not exists: /tmp/tmph49hfi

Test Activity
-------------
Found during DC subcloud scaling testing.

Workaround
-----------
Restart the cert-mon service: sudo sm-restart service cert-mon

2021-07-16 20:31:55 OpenStack Infra starlingx: status Triaged In Progress
2021-07-20 12:47:06 OpenStack Infra starlingx: status In Progress Fix Released
2021-07-20 12:47:07 OpenStack Infra bug watch added https://github.com/kubernetes-client/python/issues/765
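
Illustrative note: the description above reports that a continuously failing load_kube_config call turns the watch restart into an endless loop that saturates a core. One common way to bound the cost of a persistent failure is to sleep between attempts with a capped, doubling delay. The following is a minimal sketch of that general technique in Python 3, not the fix that was released for this bug; call_with_backoff and all of its parameters are hypothetical names introduced here for illustration.

```python
import time


def call_with_backoff(func, max_attempts=5, base_delay=1.0, max_delay=60.0,
                      sleep=time.sleep):
    """Call func() until it succeeds, sleeping between failed attempts.

    Hypothetical helper, not part of sysinv or the kubernetes client.
    The delay doubles after each failure and is capped at max_delay, so a
    persistent error cannot degrade into a busy loop. The last exception
    is re-raised once max_attempts is exhausted.
    """
    delay = base_delay
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except Exception:
            if attempt == max_attempts:
                # Out of attempts: surface the original error to the caller.
                raise
            sleep(delay)
            delay = min(delay * 2, max_delay)
```

In the scenario described here, func would wrap whatever restarts the watch. Backoff only bounds the CPU cost of the failure; it would not by itself bring back the missing file under /tmp, so the documented workaround (restarting cert-mon) or a fix for the underlying cause would still be needed.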