IPv6: ansible playbook bootstrap failed by kubelet isn't running or healthy

Bug #1847147 reported by Peng Peng
Affects: StarlingX
Status: Fix Released
Importance: High
Assigned to: Al Bailey

Bug Description

Brief Description
-----------------
Installed DC system-controller controller-0; the ansible-playbook bootstrap failed. The error message shows:
[kubelet-check] It seems like the kubelet isn't running or healthy.
[kubelet-check] The HTTP call equal to 'curl -sSL http://localhost:10248/healthz' failed with error: Get http://localhost:10248/healthz: dial tcp 127.0.0.1:10248: connect: connection refused.
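The failing kubelet-check can be reproduced by hand. A minimal polling sketch (the function name, retry count, and timeout are illustrative, not part of the playbook):

```shell
# Poll the kubelet healthz endpoint the way kubeadm's kubelet-check does.
# URL and retry count are illustrative defaults.
kubelet_healthy() {
  local url=${1:-http://localhost:10248/healthz} tries=${2:-10}
  local i
  for i in $(seq "$tries"); do
    # A healthy kubelet answers with the plain text "ok".
    if [ "$(curl -sSL --max-time 2 "$url" 2>/dev/null)" = "ok" ]; then
      return 0
    fi
    sleep 1
  done
  return 1
}
```

Usage on a node would be `kubelet_healthy && echo healthy || echo "kubelet not healthy"`.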

Severity
--------
Critical

Steps to Reproduce
------------------
See the brief description above.

TC-name: DC installation

Expected Behavior
------------------

Actual Behavior
----------------

Reproducibility
---------------
intermittent (failed 2 of 3)

System Configuration
--------------------
DC system
IPv6

Lab-name: WCP_90

Branch/Pull Time/Commit
-----------------------
19.10 master as of 2019-10-06_20-00-00

Last Pass
---------
2019-09-30_20-00-00

Timestamp/Logs
--------------
TASK [bootstrap/bringup-essential-services : Initializing Kubernetes master] ********************************************************************************************************************************
fatal: [localhost]: FAILED! => {"changed": true, "cmd": ["kubeadm", "init", "--config=/etc/kubernetes/kubeadm.yaml"], "delta": "0:01:57.697315", "end": "2019-10-07 19:48:22.425851", "msg": "non-zero return code", "rc": 1, "start": "2019-10-07 19:46:24.728536", "stderr": "\t[WARNING IsDockerSystemdCheck]: detected \"cgroupfs\" as the Docker cgroup driver. The recommended driver is \"systemd\". Please follow the guide at https://kubernetes.io/docs/setup/cri/\nerror execution phase wait-control-plane: couldn't initialize a Kubernetes cluster\nTo see the stack trace of this error execute with --v=5 or higher", "stderr_lines": ["\t[WARNING IsDockerSystemdCheck]: detected \"cgroupfs\" as the Docker cgroup driver. The recommended driver is \"systemd\". Please follow the guide at https://kubernetes.io/docs/setup/cri/", "error execution phase wait-control-plane: couldn't initialize a Kubernetes cluster", "To see the stack trace of this error execute with --v=5 or higher"], "stdout": "[init] Using Kubernetes version: v1.16.0\n[preflight] Running pre-flight checks\n[preflight] Pulling images required for setting up a Kubernetes cluster\n[preflight] This might take a minute or two, depending on the speed of your internet connection\n[preflight] You can also perform this action in beforehand using 'kubeadm config images pull'\n[kubelet-start] Writing kubelet environment file with flags to file \"/var/lib/kubelet/kubeadm-flags.env\"\n[kubelet-start] Writing kubelet configuration to file \"/var/lib/kubelet/config.yaml\"\n[kubelet-start] Activating the kubelet service\n[certs] Using certificateDir folder \"/etc/kubernetes/pki\"\n[certs] Generating \"ca\" certificate and key\n[certs] Generating \"apiserver\" certificate and key\n[certs] apiserver serving cert is signed for DNS names [controller-0 kubernetes kubernetes.default kubernetes.default.svc kubernetes.default.svc.cluster.local] and IPs [fd04::1 aefd::2 aefd::2 ::1]\n[certs] Generating \"apiserver-kubelet-client\" 
certificate and key\n[certs] Generating \"front-proxy-ca\" certificate and key\n[certs] Generating \"front-proxy-client\" certificate and key\n[certs] External etcd mode: Skipping etcd/ca certificate authority generation\n[certs] External etcd mode: Skipping etcd/server certificate generation\n[certs] External etcd mode: Skipping etcd/peer certificate generation\n[certs] External etcd mode: Skipping etcd/healthcheck-client certificate generation\n[certs] External etcd mode: Skipping apiserver-etcd-client certificate generation\n[certs] Generating \"sa\" key and public key\n[kubeconfig] Using kubeconfig folder \"/etc/kubernetes\"\n[kubeconfig] Writing \"admin.conf\" kubeconfig file\n[kubeconfig] Writing \"kubelet.conf\" kubeconfig file\n[kubeconfig] Writing \"controller-manager.conf\" kubeconfig file\n[kubeconfig] Writing \"scheduler.conf\" kubeconfig file\n[control-plane] Using manifest folder \"/etc/kubernetes/manifests\"\n[control-plane] Creating static Pod manifest for \"kube-apiserver\"\n[control-plane] Creating static Pod manifest for \"kube-controller-manager\"\n[control-plane] Creating static Pod manifest for \"kube-scheduler\"\n[wait-control-plane] Waiting for the kubelet to boot up the control plane as static Pods from directory \"/etc/kubernetes/manifests\". This can take up to 4m0s\n[kubelet-check] Initial timeout of 40s passed.\n[kubelet-check] It seems like the kubelet isn't running or healthy.\n[kubelet-check] The HTTP call equal to 'curl -sSL http://localhost:10248/healthz' failed with error: Get http://localhost:10248/healthz: dial tcp 127.0.0.1:10248: connect: connection refused.

It seems kubelet keeps restarting; note the Main PID changes between the two status checks below:

yow-cgcs-wildcat-90:~$ systemctl status kubelet
● kubelet.service - Kubernetes Kubelet Server
   Loaded: loaded (/usr/lib/systemd/system/kubelet.service; enabled; vendor preset: disabled)
  Drop-In: /etc/systemd/system/kubelet.service.d
           └─kubeadm.conf
   Active: active (running) since Mon 2019-10-07 20:20:08 UTC; 3min 23s ago
     Docs: https://kubernetes.io/docs/concepts/overview/components/#kubelet
           https://kubernetes.io/docs/reference/generated/kubelet/
  Process: 193679 ExecStartPost=/bin/bash -c echo $MAINPID > /var/run/kubelet.pid; (code=exited, status=0/SUCCESS)
  Process: 193670 ExecStartPre=/usr/bin/kubelet-cgroup-setup.sh (code=exited, status=0/SUCCESS)
 Main PID: 193678 (kubelet)
    Tasks: 17
   Memory: 33.0M
   CGroup: /system.slice/kubelet.service
           └─193678 /usr/bin/kubelet --bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf --config=/var/lib/kubelet/config.yaml --cgroup-driver=cgroupf...
yow-cgcs-wildcat-90:~$ systemctl status kubelet
● kubelet.service - Kubernetes Kubelet Server
   Loaded: loaded (/usr/lib/systemd/system/kubelet.service; enabled; vendor preset: disabled)
  Drop-In: /etc/systemd/system/kubelet.service.d
           └─kubeadm.conf
   Active: active (running) since Mon 2019-10-07 20:24:01 UTC; 11s ago
     Docs: https://kubernetes.io/docs/concepts/overview/components/#kubelet
           https://kubernetes.io/docs/reference/generated/kubelet/
  Process: 196594 ExecStopPost=/bin/rm -f /var/run/kubelet.pid (code=exited, status=0/SUCCESS)
  Process: 197931 ExecStartPost=/bin/bash -c echo $MAINPID > /var/run/kubelet.pid; (code=exited, status=0/SUCCESS)
  Process: 197901 ExecStartPre=/usr/bin/kubelet-cgroup-setup.sh (code=exited, status=0/SUCCESS)
 Main PID: 197930 (kubelet)
    Tasks: 15
   Memory: 27.5M
   CGroup: /system.slice/kubelet.service
           └─197930 /usr/bin/kubelet --bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf --config=/var/lib/kubelet/config.yaml --cgroup-driver=cgroupf...
yow-cgcs-wildcat-90:~$

Test Activity
-------------
DC installation

Revision history for this message
Peng Peng (ppeng) wrote :

Reinstalled the same load and ran the ansible bootstrap again. This time it passed. It is an intermittent issue.

description: updated
Revision history for this message
Al Bailey (albailey1974) wrote :

There is a Kubernetes bug report that sounds similar:
https://github.com/kubernetes/kubernetes/issues/82988

It suggests --cgroup-driver=systemd as a possible fix.

It should also get rid of this warning:

[WARNING IsDockerSystemdCheck]: detected "cgroupfs" as the Docker cgroup driver. The recommended driver is "systemd". Please follow the guide at https://kubernetes.io/docs/setup/cri/
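Aligning Docker's cgroup driver with kubelet would follow Docker's standard daemon.json mechanism. A hedged sketch (the path and restart steps are the generic Docker approach, not necessarily the StarlingX change; the function takes the target path so it can be exercised outside /etc):

```shell
# Sketch only: point Docker at the systemd cgroup driver so it matches
# kubelet. "exec-opts"/"native.cgroupdriver" is Docker's documented knob.
write_docker_daemon_json() {
  cat > "$1" <<'EOF'
{
  "exec-opts": ["native.cgroupdriver=systemd"]
}
EOF
}

# On a live node one would then run (requires root):
#   write_docker_daemon_json /etc/docker/daemon.json
#   systemctl restart docker kubelet
```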

Revision history for this message
Ghada Khalil (gkhalil) wrote :

As per Tao Liu, this is not specific to Distributed Cloud.

tags: added: stx.containers
Revision history for this message
Peng Peng (ppeng) wrote :

Same issue was also reproduced on WCP_71-75.
load 2019-10-07_20-00-00

Revision history for this message
Ghada Khalil (gkhalil) wrote :

Marking as stx.3.0 / high priority - k8s related; results in installation failures

Changed in starlingx:
importance: Undecided → High
status: New → Triaged
assignee: nobody → Al Bailey (albailey1974)
tags: added: stx.3.0
Yang Liu (yliu12)
summary: - IPv6 Distributed Cloud: ansible playbook bootstrap failed by kubelet
- isn't running or healthy
+ IPv6: ansible playbook bootstrap failed by kubelet isn't running or
+ healthy
Revision history for this message
Eric MacDonald (rocksolidmtce) wrote :

There is no evidence in pmond.logs that kubelet ever failed.

kubelet is running now and is responding to the health check with "ok":

curl -sSL http://localhost:10248/healthz
ok

C0 has been up for 21 minutes.

[sysadmin@controller-0 ~(keystone_admin)]$ systemctl status kubelet
● kubelet.service - Kubernetes Kubelet Server
   Loaded: loaded (/usr/lib/systemd/system/kubelet.service; enabled; vendor preset: disabled)
  Drop-In: /etc/systemd/system/kubelet.service.d
           └─kube-stx-override.conf, kubeadm.conf
   Active: active (running) since Tue 2019-10-08 18:36:58 UTC; 9min ago
     Docs: https://kubernetes.io/docs/concepts/overview/components/#kubelet
           https://kubernetes.io/docs/reference/generated/kubelet/
  Process: 142074 ExecStopPost=/bin/rm -f /var/run/kubelet.pid (code=exited, status=0/SUCCESS)
  Process: 142086 ExecStartPost=/bin/bash -c echo $MAINPID > /var/run/kubelet.pid; (code=exited, status=0/SUCCESS)
  Process: 142079 ExecStartPre=/usr/bin/kubelet-cgroup-setup.sh (code=exited, status=0/SUCCESS)
 Main PID: 142085 (kubelet)
    Tasks: 77
   Memory: 45.9M
   CGroup: /system.slice/kubelet.service
           └─142085 /usr/bin/kubelet --bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf --config=/var/lib/kubelet/config.yaml --cgroup-driver=cgroupfs --network-plugin=cni --pod-infra-container-image=registry....

[sysadmin@controller-0 ~(keystone_admin)]$ uptime
 18:47:05 up 21 min, 3 users, load average: 2.35, 1.26, 0.74

Based on this, it looks like we need to go back in time and see what went wrong, since it's no longer in that bad state.

Revision history for this message
Tao Liu (tliu88) wrote :

The failure occurred during the bootstrap and the logs were in ansible.log.

It looks like kubelet failed to start due to a missing config file and retried many times. In the DC lab, in one case it failed after many retries (perhaps a timeout). In another install (with the same load), the process started OK after the retries.

2019-10-08T13:18:27.402 localhost systemd[1]: info Started Kubernetes Kubelet Server.
2019-10-08T13:18:27.507 localhost kubelet[234370]: info F1008 13:18:27.507355 234370 server.go:196] failed to load Kubelet config file /var/lib/kubelet/config.yaml, error failed to read kubelet config file "/var/lib/kubelet/config.yaml", error: open /var/lib/kubelet/config.yaml: no such file or directory
2019-10-08T13:18:27.510 localhost systemd[1]: notice kubelet.service: main process exited, code=exited, status=255/n/a
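The retry loop in these logs amounts to kubelet crashing until /var/lib/kubelet/config.yaml exists. A minimal wait helper sketching that dependency (the helper name and timeout are illustrative, not part of StarlingX):

```shell
# Wait for a file (e.g. /var/lib/kubelet/config.yaml) to appear instead of
# letting systemd restart kubelet in a tight crash loop. Purely a sketch.
wait_for_file() {
  local path=$1 timeout=${2:-60} waited=0
  until [ -f "$path" ]; do
    if [ "$waited" -ge "$timeout" ]; then
      return 1  # gave up: file never appeared
    fi
    sleep 1
    waited=$((waited + 1))
  done
  return 0
}
```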

Revision history for this message
Al Bailey (albailey1974) wrote :

The following type of error is shown over and over in daemon.log

2019-10-07T19:09:47.183 localhost kubelet[20437]: info F1007 19:09:47.183783 20437 server.go:196] failed to load Kubelet config file /var/lib/kubelet/config.yaml, error failed to read kubelet config file "/var/lib/kubelet/config.yaml", error: open /var/lib/kubelet/config.yaml: no such file or directory
2019-10-07T19:09:47.189 localhost systemd[1]: notice kubelet.service: main process exited, code=exited, status=255/n/a
2019-10-07T19:09:47.205 localhost systemd[1]: notice Unit kubelet.service entered failed state.
2019-10-07T19:09:47.205 localhost systemd[1]: warning kubelet.service failed.

....

2019-10-07T19:48:18.298 localhost kubelet[158287]: info F1007 19:48:18.298759 158287 server.go:196] failed to load Kubelet config file /var/lib/kubelet/config.yaml, error failed to read kubelet config file "/var/lib/kubelet/config.yaml", error: open /var/lib/kubelet/config.yaml: no such file or directory

Our collect script does not include config.yaml

Note: looking at the file on the same machine, which is now working, these two addresses do not look like IPv6:
   address: 0.0.0.0
   healthzBindAddress: 127.0.0.1

A later install in the same lab worked. Oddly, the same files were missing at first, but the error message changed.

2019-10-08T17:31:44.885 localhost kubelet[20411]: info F1008 17:31:44.885450 20411 server.go:196] failed to load Kubelet config file /var/lib/kubelet/config.yaml, error failed to read kubelet config file "/var/lib/kubelet/config.yaml", error: open /var/lib/kubelet/config.yaml: no such file or directory

.. a little later..

2019-10-08T18:17:29.638 localhost kubelet[148940]: info F1008 18:17:29.638164 148940 server.go:271] failed to run Kubelet: unable to load bootstrap kubeconfig: stat /etc/kubernetes/bootstrap-kubelet.conf: no such file or directory

and then it starts working
2019-10-08T18:17:39.745 localhost systemd[1]: info Started Kubernetes Kubelet Server.
2019-10-08T18:17:39.847 localhost kubelet[149005]: info Flag --cgroup-driver has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.
2019-10-08T18:17:39.849 localhost kubelet[149005]: info Flag --cgroup-driver has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.
2019-10-08T18:17:39.869 localhost systemd[1]: info Started Kubernetes systemd probe.
2019-10-08T18:17:39.870 localhost kubelet[149005]: info I1008 18:17:39.869957 149005 server.go:410] Version: v1.16.0
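The scale of the crash loop can be quantified by counting the repeated failure line in whatever log holds it (daemon.log here, or `journalctl -u kubelet` on a live node). A small grep sketch, with an illustrative function name:

```shell
# Count how many times kubelet died on the missing config file.
# Argument is a log file containing kubelet output.
count_kubelet_config_failures() {
  grep -c 'failed to load Kubelet config file' "$1"
}
```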

Revision history for this message
Al Bailey (albailey1974) wrote :

The health check (http://localhost:10248/healthz) works fine in the IPv6 lab setup at the moment, so that is a red herring.

Revision history for this message
Al Bailey (albailey1974) wrote :

Debugging on a system that encountered the problem has revealed what the bug is.

The problem is very similar to https://bugs.launchpad.net/starlingx/+bug/1843344
That bug affected worker nodes, but the controllers have the same mount-timing issue.

A similar fix will be implemented for these types of hosts.

Revision history for this message
Al Bailey (albailey1974) wrote :

This issue is not the same as the worker bug.
The issue here seems to occur after the puppet phase; something related to LVM.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to ansible-playbooks (master)

Fix proposed to branch: master
Review: https://review.opendev.org/687943

Changed in starlingx:
status: Triaged → In Progress
Revision history for this message
Al Bailey (albailey1974) wrote :

The Ansible team is requesting that the change be abandoned and that David cherry-pick his change to master:
https://bugs.launchpad.net/starlingx/+bug/1838692

Yang Liu (yliu12)
tags: added: stx.retestneeded
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to ansible-playbooks (master)

Reviewed: https://review.opendev.org/687943
Committed: https://git.openstack.org/cgit/starlingx/ansible-playbooks/commit/?id=e2968044bda39c98bc74ce29a4fbcdb010e502b5
Submitter: Zuul
Branch: master

commit e2968044bda39c98bc74ce29a4fbcdb010e502b5
Author: David Sullivan <email address hidden>
Date: Thu Aug 8 12:49:16 2019 -0400

    ansible replay fails if kubeadm init was not successful

    This is a cherry-pick of:
    https://review.opendev.org/#/c/678649/
    with the Change Id and Bug Id updated.

    k8s 1.16 re-introduced this bug.

    Also, a kube reset triggers the unmount before it does its rm,
    which means the rm in ansible is a requirement.

    The cherry pick had to be changed because the rm is required,
    and the 'when' clause is applied per task in the block
    rather than the entire block.

    Original commit message text:

    Ansible replay is broken since the introduction of kubelet-fs
    https://opendev.org/starlingx/config/commit/e74ef5f7c4c71464347142932615e1545884133c

    The kubelet-fs change uncovered a flaw in kubeadm reset
    https://github.com/kubernetes/kubeadm/issues/1294

    kubeadm will sometimes unmount /var/lib/kubelet/. To correct this we
    will remount the kubelet-lv on ansible replays and wipe the contents.

    Change-Id: Ief5990784e6af5ac01612552aa7a11b1493ac7f6
    Closes-Bug: 1847147
    Signed-off-by: Al Bailey <email address hidden>
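The commit's point that "the rm in ansible is a requirement" can be sketched as a small helper that empties /var/lib/kubelet without removing the mount point itself (a sketch only; the actual change is an Ansible task in the review above, and the helper name is illustrative):

```shell
# Wipe the contents of the kubelet directory so a replayed kubeadm init
# starts clean, while keeping the directory (the mount point) in place.
wipe_kubelet_dir() {
  local dir=${1:?usage: wipe_kubelet_dir DIR}
  # Remove regular and hidden entries; ${dir:?} guards against an empty arg.
  rm -rf "$dir"/* "$dir"/.[!.]* 2>/dev/null
  return 0
}
```

On a node this would be `wipe_kubelet_dir /var/lib/kubelet`, run after remounting kubelet-lv.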

Changed in starlingx:
status: In Progress → Fix Released
Revision history for this message
Peng Peng (ppeng) wrote :

Not seeing this issue on load:"2019-10-15_20-00-00"

tags: removed: stx.retestneeded
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to ansible-playbooks (master)

Reviewed: https://review.opendev.org/691734
Committed: https://git.openstack.org/cgit/starlingx/ansible-playbooks/commit/?id=38b20a77b4884e3053795e77acd91d447de14a26
Submitter: Zuul
Branch: master

commit 38b20a77b4884e3053795e77acd91d447de14a26
Author: Al Bailey <email address hidden>
Date: Mon Oct 28 13:22:29 2019 -0500

    Fix for missing mount when kubeadm init invoked

    When kubeadm reset is invoked, the /var/lib/kubelet
    mount is removed.

    The "when" conditional in the ansible environment did not seem
    to detect that the mount command was needed.

    Updated the command to be a compound bash command to ensure
    it works in local and remote ansible environments.

    Change-Id: Ic81d180df9691161e0b75ab7cdeaf2f639b47728
    Fixes-Bug: 1849710
    Related-Bug: 1847147
    Related-Bug: 1838692
    Signed-off-by: Al Bailey <email address hidden>
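The "compound bash command" pattern this commit describes checks before mounting, so the task is idempotent in both local and remote Ansible runs. A dry-run sketch (it prints rather than mounts, so it is safe to run anywhere; the device name is illustrative, and the real command lives in the review above):

```shell
# Mount the kubelet volume only if the directory is not already a
# mount point, per /proc/mounts. Pass dir without a trailing slash.
ensure_kubelet_mount() {
  local dev=$1 dir=$2
  if grep -qs " $dir " /proc/mounts; then
    echo "already mounted: $dir"
  else
    echo "would run: mount $dev $dir"
  fi
}
```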
