IPv6: ansible playbook bootstrap failed by kubelet isn't running or healthy

Bug #1847147 reported by Peng Peng
Affects: StarlingX
Status: Fix Released
Importance: High
Assigned to: Al Bailey

Bug Description

Brief Description
-----------------
Installed DC system-controller controller-0; the ansible-playbook bootstrap failed. The error message shows:
[kubelet-check] It seems like the kubelet isn't running or healthy.
[kubelet-check] The HTTP call equal to 'curl -sSL http://localhost:10248/healthz' failed with error: Get http://localhost:10248/healthz: dial tcp 127.0.0.1:10248: connect: connection refused.
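The failing kubelet-check can be reproduced by hand. A minimal polling sketch (the function name, retry count, and timeout are illustrative, not part of the playbook):

```shell
# Poll the kubelet healthz endpoint the way kubeadm's kubelet-check does.
# URL and retry count are illustrative defaults.
kubelet_healthy() {
  local url=${1:-http://localhost:10248/healthz} tries=${2:-10}
  local i
  for i in $(seq "$tries"); do
    # A healthy kubelet answers with the plain text "ok".
    if [ "$(curl -sSL --max-time 2 "$url" 2>/dev/null)" = "ok" ]; then
      return 0
    fi
    sleep 1
  done
  return 1
}
```

Usage on a node would be `kubelet_healthy && echo healthy || echo "kubelet not healthy"`.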

Severity
--------
Critical

Steps to Reproduce
------------------
See the brief description above.

TC-name: DC installation

Expected Behavior
------------------

Actual Behavior
----------------

Reproducibility
---------------
intermittent (failed 2 of 3)

System Configuration
--------------------
DC system
IPv6

Lab-name: WCP_90

Branch/Pull Time/Commit
-----------------------
19.10 master as of 2019-10-06_20-00-00

Last Pass
---------
2019-09-30_20-00-00

Timestamp/Logs
--------------
TASK [bootstrap/bringup-essential-services : Initializing Kubernetes master] ********************************************************************************************************************************
fatal: [localhost]: FAILED! => {"changed": true, "cmd": ["kubeadm", "init", "--config=/etc/kubernetes/kubeadm.yaml"], "delta": "0:01:57.697315", "end": "2019-10-07 19:48:22.425851", "msg": "non-zero return code", "rc": 1, "start": "2019-10-07 19:46:24.728536", "stderr": "\t[WARNING IsDockerSystemdCheck]: detected \"cgroupfs\" as the Docker cgroup driver. The recommended driver is \"systemd\". Please follow the guide at https://kubernetes.io/docs/setup/cri/\nerror execution phase wait-control-plane: couldn't initialize a Kubernetes cluster\nTo see the stack trace of this error execute with --v=5 or higher", "stderr_lines": ["\t[WARNING IsDockerSystemdCheck]: detected \"cgroupfs\" as the Docker cgroup driver. The recommended driver is \"systemd\". Please follow the guide at https://kubernetes.io/docs/setup/cri/", "error execution phase wait-control-plane: couldn't initialize a Kubernetes cluster", "To see the stack trace of this error execute with --v=5 or higher"], "stdout": "[init] Using Kubernetes version: v1.16.0\n[preflight] Running pre-flight checks\n[preflight] Pulling images required for setting up a Kubernetes cluster\n[preflight] This might take a minute or two, depending on the speed of your internet connection\n[preflight] You can also perform this action in beforehand using 'kubeadm config images pull'\n[kubelet-start] Writing kubelet environment file with flags to file \"/var/lib/kubelet/kubeadm-flags.env\"\n[kubelet-start] Writing kubelet configuration to file \"/var/lib/kubelet/config.yaml\"\n[kubelet-start] Activating the kubelet service\n[certs] Using certificateDir folder \"/etc/kubernetes/pki\"\n[certs] Generating \"ca\" certificate and key\n[certs] Generating \"apiserver\" certificate and key\n[certs] apiserver serving cert is signed for DNS names [controller-0 kubernetes kubernetes.default kubernetes.default.svc kubernetes.default.svc.cluster.local] and IPs [fd04::1 aefd::2 aefd::2 ::1]\n[certs] Generating \"apiserver-kubelet-client\" 
certificate and key\n[certs] Generating \"front-proxy-ca\" certificate and key\n[certs] Generating \"front-proxy-client\" certificate and key\n[certs] External etcd mode: Skipping etcd/ca certificate authority generation\n[certs] External etcd mode: Skipping etcd/server certificate generation\n[certs] External etcd mode: Skipping etcd/peer certificate generation\n[certs] External etcd mode: Skipping etcd/healthcheck-client certificate generation\n[certs] External etcd mode: Skipping apiserver-etcd-client certificate generation\n[certs] Generating \"sa\" key and public key\n[kubeconfig] Using kubeconfig folder \"/etc/kubernetes\"\n[kubeconfig] Writing \"admin.conf\" kubeconfig file\n[kubeconfig] Writing \"kubelet.conf\" kubeconfig file\n[kubeconfig] Writing \"controller-manager.conf\" kubeconfig file\n[kubeconfig] Writing \"scheduler.conf\" kubeconfig file\n[control-plane] Using manifest folder \"/etc/kubernetes/manifests\"\n[control-plane] Creating static Pod manifest for \"kube-apiserver\"\n[control-plane] Creating static Pod manifest for \"kube-controller-manager\"\n[control-plane] Creating static Pod manifest for \"kube-scheduler\"\n[wait-control-plane] Waiting for the kubelet to boot up the control plane as static Pods from directory \"/etc/kubernetes/manifests\". This can take up to 4m0s\n[kubelet-check] Initial timeout of 40s passed.\n[kubelet-check] It seems like the kubelet isn't running or healthy.\n[kubelet-check] The HTTP call equal to 'curl -sSL http://localhost:10248/healthz' failed with error: Get http://localhost:10248/healthz: dial tcp 127.0.0.1:10248: connect: connection refused.

It seems kubelet keeps restarting; note the Main PID changes between the two status checks below:

yow-cgcs-wildcat-90:~$ systemctl status kubelet
● kubelet.service - Kubernetes Kubelet Server
   Loaded: loaded (/usr/lib/systemd/system/kubelet.service; enabled; vendor preset: disabled)
  Drop-In: /etc/systemd/system/kubelet.service.d
           └─kubeadm.conf
   Active: active (running) since Mon 2019-10-07 20:20:08 UTC; 3min 23s ago
     Docs: https://kubernetes.io/docs/concepts/overview/components/#kubelet
           https://kubernetes.io/docs/reference/generated/kubelet/
  Process: 193679 ExecStartPost=/bin/bash -c echo $MAINPID > /var/run/kubelet.pid; (code=exited, status=0/SUCCESS)
  Process: 193670 ExecStartPre=/usr/bin/kubelet-cgroup-setup.sh (code=exited, status=0/SUCCESS)
 Main PID: 193678 (kubelet)
    Tasks: 17
   Memory: 33.0M
   CGroup: /system.slice/kubelet.service
           └─193678 /usr/bin/kubelet --bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf --config=/var/lib/kubelet/config.yaml --cgroup-driver=cgroupf...
yow-cgcs-wildcat-90:~$ systemctl status kubelet
● kubelet.service - Kubernetes Kubelet Server
   Loaded: loaded (/usr/lib/systemd/system/kubelet.service; enabled; vendor preset: disabled)
  Drop-In: /etc/systemd/system/kubelet.service.d
           └─kubeadm.conf
   Active: active (running) since Mon 2019-10-07 20:24:01 UTC; 11s ago
     Docs: https://kubernetes.io/docs/concepts/overview/components/#kubelet
           https://kubernetes.io/docs/reference/generated/kubelet/
  Process: 196594 ExecStopPost=/bin/rm -f /var/run/kubelet.pid (code=exited, status=0/SUCCESS)
  Process: 197931 ExecStartPost=/bin/bash -c echo $MAINPID > /var/run/kubelet.pid; (code=exited, status=0/SUCCESS)
  Process: 197901 ExecStartPre=/usr/bin/kubelet-cgroup-setup.sh (code=exited, status=0/SUCCESS)
 Main PID: 197930 (kubelet)
    Tasks: 15
   Memory: 27.5M
   CGroup: /system.slice/kubelet.service
           └─197930 /usr/bin/kubelet --bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf --config=/var/lib/kubelet/config.yaml --cgroup-driver=cgroupf...
yow-cgcs-wildcat-90:~$

Test Activity
-------------
DC installation

Revision history for this message
Peng Peng (ppeng) wrote :

Reinstalled the same load and ran the ansible bootstrap again. This time it passed. It is an intermittent issue.

description: updated
Revision history for this message
Al Bailey (albailey1974) wrote :

There is a Kubernetes bug report that sounds similar:
https://github.com/kubernetes/kubernetes/issues/82988

It suggests --cgroup-driver=systemd as a possible fix.

It should also get rid of this warning:

[WARNING IsDockerSystemdCheck]: detected "cgroupfs" as the Docker cgroup driver. The recommended driver is "systemd". Please follow the guide at https://kubernetes.io/docs/setup/cri/
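Aligning Docker's cgroup driver with kubelet would follow Docker's standard daemon.json mechanism. A hedged sketch (the path and restart steps are the generic Docker approach, not necessarily the StarlingX change; the function takes the target path so it can be exercised outside /etc):

```shell
# Sketch only: point Docker at the systemd cgroup driver so it matches
# kubelet. "exec-opts"/"native.cgroupdriver" is Docker's documented knob.
write_docker_daemon_json() {
  cat > "$1" <<'EOF'
{
  "exec-opts": ["native.cgroupdriver=systemd"]
}
EOF
}

# On a live node one would then run (requires root):
#   write_docker_daemon_json /etc/docker/daemon.json
#   systemctl restart docker kubelet
```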

Revision history for this message
Ghada Khalil (gkhalil) wrote :

As per Tao Liu, this is not specific to Distributed Cloud.

tags: added: stx.containers
Revision history for this message
Peng Peng (ppeng) wrote :

Same issue was also reproduced on WCP_71-75.
load 2019-10-07_20-00-00

Revision history for this message
Ghada Khalil (gkhalil) wrote :

Marking as stx.3.0 / high priority - k8s related; results in installation failures

Changed in starlingx:
importance: Undecided → High
status: New → Triaged
assignee: nobody → Al Bailey (albailey1974)
tags: added: stx.3.0
Yang Liu (yliu12)
summary: - IPv6 Distributed Cloud: ansible playbook bootstrap failed by kubelet
- isn't running or healthy
+ IPv6: ansible playbook bootstrap failed by kubelet isn't running or
+ healthy
Revision history for this message
Eric MacDonald (rocksolidmtce) wrote :

There is no evidence in pmond.logs that kubelet ever failed.

kubelet is running now and is responding to the health check with "ok":

curl -sSL http://localhost:10248/healthz
ok

C0 has been up for 21 minutes.

[sysadmin@controller-0 ~(keystone_admin)]$ systemctl status kubelet
● kubelet.service - Kubernetes Kubelet Server
   Loaded: loaded (/usr/lib/systemd/system/kubelet.service; enabled; vendor preset: disabled)
  Drop-In: /etc/systemd/system/kubelet.service.d
           └─kube-stx-override.conf, kubeadm.conf
   Active: active (running) since Tue 2019-10-08 18:36:58 UTC; 9min ago
     Docs: https://kubernetes.io/docs/concepts/overview/components/#kubelet
           https://kubernetes.io/docs/reference/generated/kubelet/
  Process: 142074 ExecStopPost=/bin/rm -f /var/run/kubelet.pid (code=exited, status=0/SUCCESS)
  Process: 142086 ExecStartPost=/bin/bash -c echo $MAINPID > /var/run/kubelet.pid; (code=exited, status=0/SUCCESS)
  Process: 142079 ExecStartPre=/usr/bin/kubelet-cgroup-setup.sh (code=exited, status=0/SUCCESS)
 Main PID: 142085 (kubelet)
    Tasks: 77
   Memory: 45.9M
   CGroup: /system.slice/kubelet.service
           └─142085 /usr/bin/kubelet --bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf --config=/var/lib/kubelet/config.yaml --cgroup-driver=cgroupfs --network-plugin=cni --pod-infra-container-image=registry....

[sysadmin@controller-0 ~(keystone_admin)]$ uptime
 18:47:05 up 21 min, 3 users, load average: 2.35, 1.26, 0.74

Based on this, it looks like we need to go back in time and see what went wrong, since it's no longer in that bad state.

Revision history for this message
Tao Liu (tliu88) wrote :

The failure occurred during the bootstrap and the logs were in ansible.log.

It looks like kubelet failed to start due to a missing config file and retried many times. In the DC lab, in one case it failed after many retries (perhaps a timeout). In another install (with the same load), the process started OK after the retries.

2019-10-08T13:18:27.402 localhost systemd[1]: info Started Kubernetes Kubelet Server.
2019-10-08T13:18:27.507 localhost kubelet[234370]: info F1008 13:18:27.507355 234370 server.go:196] failed to load Kubelet config file /var/lib/kubelet/config.yaml, error failed to read kubelet config file "/var/lib/kubelet/config.yaml", error: open /var/lib/kubelet/config.yaml: no such file or directory
2019-10-08T13:18:27.510 localhost systemd[1]: notice kubelet.service: main process exited, code=exited, status=255/n/a
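The retry loop in these logs amounts to kubelet crashing until /var/lib/kubelet/config.yaml exists. A minimal wait helper sketching that dependency (the helper name and timeout are illustrative, not part of StarlingX):

```shell
# Wait for a file (e.g. /var/lib/kubelet/config.yaml) to appear instead of
# letting systemd restart kubelet in a tight crash loop. Purely a sketch.
wait_for_file() {
  local path=$1 timeout=${2:-60} waited=0
  until [ -f "$path" ]; do
    if [ "$waited" -ge "$timeout" ]; then
      return 1  # gave up: file never appeared
    fi
    sleep 1
    waited=$((waited + 1))
  done
  return 0
}
```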

Revision history for this message
Al Bailey (albailey1974) wrote :

The following type of error is shown over and over in daemon.log

2019-10-07T19:09:47.183 localhost kubelet[20437]: info F1007 19:09:47.183783 20437 server.go:196] failed to load Kubelet config file /var/lib/kubelet/config.yaml, error failed to read kubelet config file "/var/lib/kubelet/config.yaml", error: open /var/lib/kubelet/config.yaml: no such file or directory
2019-10-07T19:09:47.189 localhost systemd[1]: notice kubelet.service: main process exited, code=exited, status=255/n/a
2019-10-07T19:09:47.205 localhost systemd[1]: notice Unit kubelet.service entered failed state.
2019-10-07T19:09:47.205 localhost systemd[1]: warning kubelet.service failed.

....

2019-10-07T19:48:18.298 localhost kubelet[158287]: info F1007 19:48:18.298759 158287 server.go:196] failed to load Kubelet config file /var/lib/kubelet/config.yaml, error failed to read kubelet config file "/var/lib/kubelet/config.yaml", error: open /var/lib/kubelet/config.yaml: no such file or directory

Our collect script does not include config.yaml

Note: looking at the file on the same machine, which is now working, these two addresses do not look like IPv6:
   address: 0.0.0.0
   healthzBindAddress: 127.0.0.1

A later install in the same lab worked. Oddly, the same files were missing at first, but the error message changed.

2019-10-08T17:31:44.885 localhost kubelet[20411]: info F1008 17:31:44.885450 20411 server.go:196] failed to load Kubelet config file /var/lib/kubelet/config.yaml, error failed to read kubelet config file "/var/lib/kubelet/config.yaml", error: open /var/lib/kubelet/config.yaml: no such file or directory

.. a little later..

2019-10-08T18:17:29.638 localhost kubelet[148940]: info F1008 18:17:29.638164 148940 server.go:271] failed to run Kubelet: unable to load bootstrap kubeconfig: stat /etc/kubernetes/bootstrap-kubelet.conf: no such file or directory

and then it starts working
2019-10-08T18:17:39.745 localhost systemd[1]: info Started Kubernetes Kubelet Server.
2019-10-08T18:17:39.847 localhost kubelet[149005]: info Flag --cgroup-driver has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.
2019-10-08T18:17:39.849 localhost kubelet[149005]: info Flag --cgroup-driver has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.
2019-10-08T18:17:39.869 localhost systemd[1]: info Started Kubernetes systemd probe.
2019-10-08T18:17:39.870 localhost kubelet[149005]: info I1008 18:17:39.869957 149005 server.go:410] Version: v1.16.0
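The scale of the crash loop can be quantified by counting the repeated failure line in whatever log holds it (daemon.log here, or `journalctl -u kubelet` on a live node). A small grep sketch, with an illustrative function name:

```shell
# Count how many times kubelet died on the missing config file.
# Argument is a log file containing kubelet output.
count_kubelet_config_failures() {
  grep -c 'failed to load Kubelet config file' "$1"
}
```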

Revision history for this message
Al Bailey (albailey1974) wrote :

The health check (http://localhost:10248/healthz) works fine in the IPv6 lab setup at the moment, so that is a red herring.

Revision history for this message
Al Bailey (albailey1974) wrote :

Debugging on a system that encountered the problem has revealed what the bug is.

The problem is very similar to https://bugs.launchpad.net/starlingx/+bug/1843344
That bug affected worker nodes, but the controllers have the same mount-timing issue.

A similar fix will be implemented for these types of hosts.

Revision history for this message
Al Bailey (albailey1974) wrote :

This issue is not the same as the worker bug.
The issue here seems to occur after the puppet phase; something related to LVM.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to ansible-playbooks (master)

Fix proposed to branch: master
Review: https://review.opendev.org/687943

Changed in starlingx:
status: Triaged → In Progress
Revision history for this message
Al Bailey (albailey1974) wrote :

The Ansible team is requesting that the change be abandoned and that David cherry-pick his change to master:
https://bugs.launchpad.net/starlingx/+bug/1838692

Yang Liu (yliu12)
tags: added: stx.retestneeded
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to ansible-playbooks (master)

Reviewed: https://review.opendev.org/687943
Committed: https://git.openstack.org/cgit/starlingx/ansible-playbooks/commit/?id=e2968044bda39c98bc74ce29a4fbcdb010e502b5
Submitter: Zuul
Branch: master

commit e2968044bda39c98bc74ce29a4fbcdb010e502b5
Author: David Sullivan <email address hidden>
Date: Thu Aug 8 12:49:16 2019 -0400

    ansible replay fails if kubeadm init was not successful

    This is a cherry-pick of:
    https://review.opendev.org/#/c/678649/
    with the Change Id and Bug Id updated.

    k8s 1.16 re-introduced this bug.

    Also, a kube reset triggers the unmount before it does its rm,
    which means the rm in ansible is a requirement.

    The cherry pick had to be changed because the rm is required,
    and the 'when' clause is applied per task in the block
    rather than the entire block.

    Original commit message text:

    Ansible replay is broken since the introduction of kubelet-fs
    https://opendev.org/starlingx/config/commit/e74ef5f7c4c71464347142932615e1545884133c

    The kubelet-fs change uncovered a flaw in kubeadm reset
    https://github.com/kubernetes/kubeadm/issues/1294

    kubeadm will sometimes unmount /var/lib/kubelet/. To correct this we
    will remount the kubelet-lv on ansible replays and wipe the contents.

    Change-Id: Ief5990784e6af5ac01612552aa7a11b1493ac7f6
    Closes-Bug: 1847147
    Signed-off-by: Al Bailey <email address hidden>
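The commit's point that "the rm in ansible is a requirement" can be sketched as a small helper that empties /var/lib/kubelet without removing the mount point itself (a sketch only; the actual change is an Ansible task in the review above, and the helper name is illustrative):

```shell
# Wipe the contents of the kubelet directory so a replayed kubeadm init
# starts clean, while keeping the directory (the mount point) in place.
wipe_kubelet_dir() {
  local dir=${1:?usage: wipe_kubelet_dir DIR}
  # Remove regular and hidden entries; ${dir:?} guards against an empty arg.
  rm -rf "$dir"/* "$dir"/.[!.]* 2>/dev/null
  return 0
}
```

On a node this would be `wipe_kubelet_dir /var/lib/kubelet`, run after remounting kubelet-lv.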

Changed in starlingx:
status: In Progress → Fix Released
Revision history for this message
Peng Peng (ppeng) wrote :

Not seeing this issue on load:"2019-10-15_20-00-00"

tags: removed: stx.retestneeded
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to ansible-playbooks (master)

Reviewed: https://review.opendev.org/691734
Committed: https://git.openstack.org/cgit/starlingx/ansible-playbooks/commit/?id=38b20a77b4884e3053795e77acd91d447de14a26
Submitter: Zuul
Branch: master

commit 38b20a77b4884e3053795e77acd91d447de14a26
Author: Al Bailey <email address hidden>
Date: Mon Oct 28 13:22:29 2019 -0500

    Fix for missing mount when kubeadm init invoked

    When kubeadm reset is invoked, the /var/lib/kubelet
    mount is removed.

    The "when" conditional in the ansible environment did not seem
    to detect that the mount command was needed.

    Updated the command to be a compound bash command to ensure
    it works in local and remote ansible environments.

    Change-Id: Ic81d180df9691161e0b75ab7cdeaf2f639b47728
    Fixes-Bug: 1849710
    Related-Bug: 1847147
    Related-Bug: 1838692
    Signed-off-by: Al Bailey <email address hidden>
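The "compound bash command" pattern this commit describes checks before mounting, so the task is idempotent in both local and remote Ansible runs. A dry-run sketch (it prints rather than mounts, so it is safe to run anywhere; the device name is illustrative, and the real command lives in the review above):

```shell
# Mount the kubelet volume only if the directory is not already a
# mount point, per /proc/mounts. Pass dir without a trailing slash.
ensure_kubelet_mount() {
  local dev=$1 dir=$2
  if grep -qs " $dir " /proc/mounts; then
    echo "already mounted: $dir"
  else
    echo "would run: mount $dev $dir"
  fi
}
```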
