IPV6: compute-2 in reboot loop due to critical 'kubelet' process failure

Bug #1843344 reported by Anujeyan Manokeran
This bug affects 1 person
Affects: StarlingX
Status: Fix Released
Importance: High
Assigned to: Eric MacDonald

Bug Description

Brief Description
-----------------
The kubelet process failed on compute-2 after unlock. Auto-recovery attempted to recover kubelet by rebooting compute-2 but never succeeded, leaving compute-2 in a reboot loop.

compute-2:~$ ps -ef | grep kubelet
sysadmin 45596 45415 0 21:05 ttyS0 00:00:00 grep --color=auto kubelet
compute-2:~$

fm alarm-list
+----------+------------------------------------------------------------------------------------------+-------------------------+----------+------------------+
| Alarm ID | Reason Text | Entity ID | Severity | Time Stamp |
+----------+------------------------------------------------------------------------------------------+-------------------------+----------+------------------+
| 200.004 | compute-2 experienced a service-affecting failure. Auto-recovery in progress. Manual | host=compute-2 | critical | 2019-09-09T21:16 |
| | Lock and Unlock may be required if auto-recovery is unsuccessful. | | | :33.704384 |
| | | | | |
| 200.006 | compute-2 critical 'kubelet' process has failed and could not be auto-recovered | host=compute-2.process= | critical | 2019-09-09T21:16 |
| | gracefully. Auto-recovery progression by host reboot is required and in progress. Manual | kubelet | | :33.600828 |
| | Lock and Unlock may be required if auto-recovery is unsuccessful. | | | |
| | | | | |
| 100.114 | NTP address 2607:5300:60:97 is not a valid or a reachable NTP server. | host=controller-1.ntp= | minor | 2019-09-09T20:35 |
| | | 2607:5300:60:97 | | :24.678864 |
| | | | | |
| 100.114 | NTP address 2600:3c00::f03c is not a valid or a reachable NTP server. | host=controller-0.ntp= | minor | 2019-09-09T20:12 |
| | | 2600:3c00::f03c | | :55.021560 |
| | | | | |
| 200.010 | controller-0 access to board management module has failed. | host=controller-0 | warning | 2019-09-09T19:56 |
| | | | | :31.703469 |
| | | | | |
+----------+------------------------------------------------------------------------------------------+-------------------------+----------+------------------+

Severity
--------
Critical

Steps to Reproduce
------------------
1. Follow the install procedure for a regular system with IPv6 configuration.
2. Install controller-0 and configure it with Ansible.
3. Install all other nodes.
4. Unlock compute-2; it fails as described above.

System Configuration
--------------------
Regular system with IPv6 configuration

Expected Behavior
------------------
Kubelet process up and running

Actual Behavior
----------------
Kubelet process fails as described above.

Reproducibility
---------------
Tested only once in this load.

System Configuration
--------------------
Regular system with IPv6 configuration, lab wolfpass-3-7

Load
----
Build was on 2019-09-09_00-10-00

Last Pass
---------
Build was on 2019-09-08_00-10-00

Timestamp/Logs
--------------
2019-09-09T15:42:20.000

Test Activity
-------------
Regression test

Revision history for this message
Anujeyan Manokeran (anujeyan) wrote :
Ghada Khalil (gkhalil)
summary: - IPV6:compute-2 critical 'kubelet' process has failed
+ IPV6: compute-2 critical 'kubelet' process has failed
description: updated
description: updated
Revision history for this message
Frank Miller (sensfan22) wrote : Re: IPV6: compute-2 critical 'kubelet' process has failed

See these logs in compute-2 daemon log. Still unclear why kubelet is failing:

2019-09-09T20:36:23.832 compute-2 kubelet[26386]: info F0909 20:36:23.832769 26386 server.go:198] failed to load Kubelet config file /var/lib/kubelet/config.yaml, error kubelet config file "/var/lib/kubelet/config.yaml" was empty
2019-09-09T20:36:23.838 compute-2 systemd[1]: notice kubelet.service: main process exited, code=exited, status=255/n/a
2019-09-09T20:36:23.865 compute-2 systemd[1]: notice Unit kubelet.service entered failed state.
2019-09-09T20:36:23.865 compute-2 systemd[1]: warning kubelet.service failed.

2019-09-09T20:37:24.056 compute-2 kubelet[50919]: info F0909 20:37:24.056760 50919 server.go:273] failed to run Kubelet: failed to initialize client certificate manager: could not convert data from "/var/lib/kubelet/pki/kubelet-client-current.pem" into cert/key pair: tls: failed to find any PEM data in certificate input
2019-09-09T20:37:24.061 compute-2 systemd[1]: notice kubelet.service: main process exited, code=exited, status=255/n/a
2019-09-09T20:37:24.077 compute-2 systemd[1]: notice Unit kubelet.service entered failed state.
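
The two failure modes in these logs can be checked directly on the failing node. A minimal diagnostic sketch (the paths come from the log lines above; the check logic itself is ours, not part of kubelet):

```shell
# Check for the two kubelet startup failures seen in the daemon log.
CFG=/var/lib/kubelet/config.yaml
PEM=/var/lib/kubelet/pki/kubelet-client-current.pem

# First failure: config.yaml exists but is empty ("-s" tests exists-and-non-empty)
if [ ! -s "$CFG" ]; then
    echo "config.yaml missing or empty"
fi

# Second failure: client certificate file contains no PEM data
if ! grep -q "BEGIN" "$PEM" 2>/dev/null; then
    echo "no PEM data in kubelet-client-current.pem"
fi
```

Either message printing would explain kubelet exiting with status 255 as shown above.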

Revision history for this message
Ghada Khalil (gkhalil) wrote :

Marking as stx.3.0 for now; also need to get a better understanding of reproducibility

tags: added: stx.containers
tags: added: stx.3.0
summary: - IPV6: compute-2 critical 'kubelet' process has failed
+ IPV6: compute-2 in reboot loop due to critical 'kubelet' process failure
Changed in starlingx:
status: New → Triaged
importance: Undecided → High
assignee: nobody → Bart Wensley (bartwensley)
Ghada Khalil (gkhalil)
tags: added: stx.retestneeded
Frank Miller (sensfan22)
Changed in starlingx:
assignee: Bart Wensley (bartwensley) → Eric MacDonald (rocksolidmtce)
Revision history for this message
Eric MacDonald (rocksolidmtce) wrote :

The issue is not seen in the latest installs of the following IPv6 labs:

cgcs-wildcat-35_60
cgcs-wolfpass-03_07
cgcs-wildcat-71_75
cgcs-wolfpass-08_12

fm event-list -q event_log_id=200.006

Revision history for this message
Eric MacDonald (rocksolidmtce) wrote :

In this lab and load, the kubelet process was seen to fail immediately after the host was enabled.

Questions to PV:

1. Has this issue been reproduced in any other lab or any other load since this case occurred?

2. Would it be possible to run a lock/unlock soak on computes in an IPv6 lab?

3. Has this issue been seen in an IPv4 lab?

Revision history for this message
Anujeyan Manokeran (anujeyan) wrote :

My answers.

1. Has this issue been reproduced in any other lab or any other load since this case occurred?
I haven't seen this issue since.

2. Would it be possible to run a lock/unlock soak on computes in an IPv6 lab?
We can try a soak.

3. Has this issue been seen in an IPv4 lab?
I haven't heard of this issue from any IPv4 lab.

Revision history for this message
Eric MacDonald (rocksolidmtce) wrote :

Thank you. Let's move forward with the lock/unlock soak once the new version of Kubernetes is in the load. I'll update the LP when that is merged.

Revision history for this message
Eric MacDonald (rocksolidmtce) wrote :

A decision was made in today's scrum to move forward with the lock/unlock soak with the current version of kubelet, not to wait for the new version as I suggested.

If the failure does occur please record how many lock and unlocks it took and leave the system in that state for debug.

PV, Please start the lock/unlock soak and make an update to this LP when that test is starting and what lab it is running in.
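
The requested soak could be scripted along these lines. The `system host-lock`, `system host-unlock`, and `system host-show` commands are the standard StarlingX CLI; the loop, cycle count, and wait logic are a hypothetical sketch, not an official test procedure:

```shell
# Hypothetical lock/unlock soak for a worker host.
# Stops on the first failed cycle so the system is left in the failed
# state for debug, as requested above.
soak() {
    local host=$1 cycles=$2
    for i in $(seq 1 "$cycles"); do
        echo "cycle $i: locking $host"
        system host-lock "$host"   || { echo "lock failed on cycle $i"; return 1; }
        system host-unlock "$host" || { echo "unlock failed on cycle $i"; return 1; }
        # wait until the host reports available again before the next cycle
        until system host-show "$host" | grep -q "available"; do sleep 30; done
    done
}
# Usage: soak compute-2 50
```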

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to stx-puppet (master)

Fix proposed to branch: master
Review: https://review.opendev.org/685773

Changed in starlingx:
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to stx-puppet (master)

Reviewed: https://review.opendev.org/685773
Committed: https://git.openstack.org/cgit/starlingx/stx-puppet/commit/?id=79525f2edd56df142de22f6f2ed7fd9ff8577840
Submitter: Zuul
Branch: master

commit 79525f2edd56df142de22f6f2ed7fd9ff8577840
Author: Eric MacDonald <email address hidden>
Date: Mon Sep 30 14:34:38 2019 -0400

    Add filesystem dependency to kubernetes worker init class

    There is a race condition that can lead to kubeadm join
    being called before the '/var/lib/kubelet' mount point
    is created.

    An early run of the join executes ok leaving expected
    config and otherwise content in the local /var/lib/kubelet
    directory.

    Then when the filesystem manifests creates the
    /var/lib/kubelet mountpoint the kubernetes config content
    created earlier gets hidden by the mount thereby causing
    a non-recoverable kubelet process failure due to missing
    config.yaml

    This update adds a filesystem class dependency in the
    kubernetes::worker:init class to ensure the
    /var/lib/kubelet mountpoint is created before the
    kubeadm join is run.

    Test Plan:

    PASS: Verify Normal System Install

    Change-Id: Ibc110589260c23f86beb5a6eaf1008b3d4f387b3
    Closes-Bug: 1843344
    Signed-off-by: Eric MacDonald <email address hidden>
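
The race the commit describes can be illustrated directly: mounting a filesystem over an already-populated directory hides the earlier content, which is how the config written by an early kubeadm join ended up invisible. A minimal demonstration using an unprivileged user/mount namespace (requires util-linux `unshare` and kernel support for unprivileged user namespaces; the demo directory and file are ours, not the actual kubelet paths):

```shell
# Demonstrate that mounting over a populated directory hides its contents.
dir=$(mktemp -d)
echo "kind: KubeletConfiguration" > "$dir/config.yaml"
echo "before mount: $(ls "$dir")"
# Inside a private mount namespace, mount tmpfs over the directory and list it:
# the earlier config.yaml is hidden by the new mount.
unshare --map-root-user --mount sh -c \
    "mount -t tmpfs tmpfs '$dir'; echo \"after mount: \$(ls '$dir')\""
rm -rf "$dir"
```

This is exactly why the fix orders the /var/lib/kubelet mount before kubeadm join rather than after it.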

Changed in starlingx:
status: In Progress → Fix Released
Revision history for this message
Anujeyan Manokeran (anujeyan) wrote :

Killed the kubelet process (via sudo) many times on a worker node to trigger a reboot, and verified recovery and alarms.

Verified in load 2019-10-24_15-48-12.

tags: removed: stx.retestneeded