IPV6: compute-2 in reboot loop due to critical 'kubelet' process failure
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
StarlingX | Fix Released | High | Eric MacDonald |
Bug Description
Brief Description
-----------------
The kubelet process failed on compute-2 after unlock. Auto-recovery was triggered to recover kubelet by rebooting compute-2, but the host never recovered and remained in a reboot loop.
compute-2:~$ ps -ef | grep kubelet
sysadmin 45596 45415 0 21:05 ttyS0 00:00:00 grep --color=auto kubelet
compute-2:~$
fm alarm-list
+------
| Alarm ID | Reason Text | Entity ID | Severity | Time Stamp |
+------
| 200.004 | compute-2 experienced a service-affecting failure. Auto-recovery in progress. Manual | host=compute-2 | critical | 2019-09-09T21:16 |
| | Lock and Unlock may be required if auto-recovery is unsuccessful. | | | :33.704384 |
| | | | | |
| 200.006 | compute-2 critical 'kubelet' process has failed and could not be auto-recovered | host=compute-
| | gracefully. Auto-recovery progression by host reboot is required and in progress. Manual | kubelet | | :33.600828 |
| | Lock and Unlock may be required if auto-recovery is unsuccessful. | | | |
| | | | | |
| 100.114 | NTP address 2607:5300:60:97 is not a valid or a reachable NTP server. | host=controller
| | | 2607:5300:60:97 | | :24.678864 |
| | | | | |
| 100.114 | NTP address 2600:3c00::f03c is not a valid or a reachable NTP server. | host=controller
| | | 2600:3c00::f03c | | :55.021560 |
| | | | | |
| 200.010 | controller-0 access to board management module has failed. | host=controller-0 | warning | 2019-09-09T19:56 |
| | | | | :31.703469 |
| | | | | |
+------
Severity
--------
Critical
Steps to Reproduce
------------------
1. Follow the install procedure for a regular system with IPv6 configuration.
2. Install controller-0 and configure it with ansible.
3. Install all other nodes.
4. Unlock compute-2. After the unlock, compute-2 fails as described above.
System Configuration
-------
Regular system with IPv6 configuration
Expected Behavior
------------------
Kubelet process up and running
Actual Behavior
----------------
The kubelet process fails, as described above.
Reproducibility
---------------
Tested only once in this load.
System Configuration
-------
Regular system IPV6 - wolfpass-3-7
Load
----
Build was on "2019-09-09_00-10-00"
Last Pass
---------
Build was on "2019-09-
Timestamp/Logs
--------------
2019-09-
Test Activity
-------------
Regression test
summary: |
- IPV6:compute-2 critical 'kubelet' process has failed
+ IPV6: compute-2 critical 'kubelet' process has failed |
description: | updated |
tags: | added: stx.retestneeded |
Changed in starlingx: | |
assignee: | Bart Wensley (bartwensley) → Eric MacDonald (rocksolidmtce) |
The following entries appear in the compute-2 daemon log. It is still unclear why kubelet is failing:
2019-09-09T20:36:23.832 compute-2 kubelet[26386]: info F0909 20:36:23.832769 26386 server.go:198] failed to load Kubelet config file /var/lib/kubelet/config.yaml, error kubelet config file "/var/lib/kubelet/config.yaml" was empty
2019-09-09T20:36:23.838 compute-2 systemd[1]: notice kubelet.service: main process exited, code=exited, status=255/n/a
2019-09-09T20:36:23.865 compute-2 systemd[1]: notice Unit kubelet.service entered failed state.
2019-09-09T20:36:23.865 compute-2 systemd[1]: warning kubelet.service failed.
2019-09-09T20:37:24.056 compute-2 kubelet[50919]: info F0909 20:37:24.056760 50919 server.go:273] failed to run Kubelet: failed to initialize client certificate manager: could not convert data from "/var/lib/kubelet/pki/kubelet-client-current.pem" into cert/key pair: tls: failed to find any PEM data in certificate input
2019-09-09T20:37:24.061 compute-2 systemd[1]: notice kubelet.service: main process exited, code=exited, status=255/n/a
2019-09-09T20:37:24.077 compute-2 systemd[1]: notice Unit kubelet.service entered failed state.
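
The two kubelet failures in this log point at file contents: a config file reported as empty and a client certificate file with no PEM data. A minimal sketch of how both conditions could be checked on the host is shown below. It is not part of the StarlingX tooling; the helper names (`is_effectively_empty`, `pem_block_types`, `check_kubelet_files`) are hypothetical, and treating a whitespace-only file as empty is an assumption that may be stricter than kubelet's own check.

```python
import re
from pathlib import Path

# Matches one complete PEM block, e.g. a certificate or a private key.
PEM_BLOCK = re.compile(
    r"-----BEGIN ([A-Z0-9 ]+)-----\s.*?-----END \1-----", re.DOTALL
)


def is_effectively_empty(path):
    """Return True if the file is missing, zero-length, or whitespace only."""
    p = Path(path)
    if not p.is_file():
        return True
    return p.read_bytes().strip() == b""


def pem_block_types(path):
    """Return the labels of all PEM blocks found in the file (may be empty)."""
    p = Path(path)
    if not p.is_file():
        return []
    text = p.read_text(errors="replace")
    return [m.group(1) for m in PEM_BLOCK.finditer(text)]


def check_kubelet_files(config_path, cert_path):
    """Collect human-readable problems for the two files named in the log."""
    problems = []
    if is_effectively_empty(config_path):
        problems.append(f"{config_path}: missing or empty")
    if not pem_block_types(cert_path):
        problems.append(f"{cert_path}: no PEM data found")
    return problems


if __name__ == "__main__":
    # Paths taken from the log entries above.
    for problem in check_kubelet_files(
        "/var/lib/kubelet/config.yaml",
        "/var/lib/kubelet/pki/kubelet-client-current.pem",
    ):
        print(problem)
```

Run on the affected node, the script prints one line per problem it finds and nothing if both files look populated. It only inspects file contents and does not explain how the files came to be empty.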