Distributed Cloud: Subsequent subcloud bootstrapping may cause the kubelet to fail.

Bug #1849710 reported by Yosief Gebremariam
Affects: StarlingX
Status: Fix Released
Importance: Medium
Assigned to: Al Bailey

Bug Description

Brief Description
-----------------
DC system install: attempting to bootstrap a subcloud from the System Controller after an initial non-trivial bootstrap failure may leave Kubernetes in a bad state on the subcloud, causing subsequent bootstrap attempts to fail.

Initially, the bootstrap failed due to SSH Error:
TASK [bootstrap/prepare-env : Look for unmistakenly StarlingX package] *********
fatal: [subcloud6]: UNREACHABLE! => {"changed": false, "msg": "SSH Error: data could not be sent to remote host \"2620:10a:a001:a103:a6bf:1ff:fe0a:7af6\". Make sure this host can be reached over ssh", "unreachable": true}

After correcting the OAM address, subsequent attempts to re-add the subcloud failed. Please see the sample logs below.

+----+-----------+------------+--------------+------------------+-------------+
| id | name      | management | availability | deploy status    | sync        |
+----+-----------+------------+--------------+------------------+-------------+
| 5  | subcloud4 | managed    | online       | complete         | out-of-sync |
| 6  | subcloud5 | managed    | online       | complete         | in-sync     |
| 9  | subcloud1 | managed    | online       | complete         | in-sync     |
| 11 | subcloud6 | unmanaged  | offline      | bootstrap-failed | unknown     |
+----+-----------+------------+--------------+------------------+-------------+

The workaround is to manually run "kubeadm init" on the subcloud to restart the kubelet.
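The workaround above can be sketched as the following command sequence on the failed subcloud. This is an illustrative recovery sequence based on the commands visible in the logs below, not a documented procedure; verify the mount state and config path against your deployment before running it:

```shell
# Reset any partially initialized control plane state (as seen in the ansible logs)
sudo kubeadm reset -f

# Ensure the kubelet logical volume is actually mounted before re-initializing;
# the root cause analysis below shows the mount being skipped on replay
mountpoint -q /var/lib/kubelet/ || sudo mount /var/lib/kubelet/

# Re-run the init with the same config the playbook uses
sudo kubeadm init --config=/etc/kubernetes/kubeadm.yaml

# Confirm the kubelet came up
systemctl status kubelet
```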

Severity
--------
Major

Steps to Reproduce
------------------
As described above.

TC-name: DC system AIO SX subcloud installation

Expected Behavior
------------------
The subcloud bootstrap from the System Controller completes successfully.

Actual Behavior
----------------
The bootstrap failed with a Kubernetes (kubelet) failure.

Reproducibility
---------------
Tested once.

System Configuration
--------------------
One node system

Lab-name: DC System subcloud6 AIO-SX (WCP-89)

Branch/Pull Time/Commit
-----------------------
"2019-10-22_20-00-00"

Last Pass
---------

Timestamp/Logs
--------------

fatal: [subcloud6]: FAILED! => {"changed": true, "cmd": ["kubeadm", "init", "--config=/etc/kubernetes/kubeadm.yaml"], "delta": "0:01:56.961250", "end": "2019-10-24 02:18:38.869237", "msg": "non-zero return code", "rc": 1, "start": "2019-10-24 02:16:41.907987", "stderr": "\t[WARNING IsDockerSystemdCheck]: detected \"cgroupfs\" as the Docker cgroup driver. The recommended driver is \"systemd\". Please follow the guide at https://kubernetes.io/docs/setup/cri/\nerror execution phase wait-control-plane: couldn't initialize a Kubernetes cluster\nTo see the stack trace of this error execute with --v=5 or higher", "stderr_lines": ["\t[WARNING IsDockerSystemdCheck]: detected \"cgroupfs\" as the Docker cgroup driver. The recommended driver is \"systemd\". Please follow the guide at https://kubernetes.io/docs/setup/cri/", "error execution phase wait-control-plane: couldn't initialize a Kubernetes cluster", "To see the stack trace of this error execute with --v=5 or higher"], "stdout": "[init] Using Kubernetes version: v1.16.2\n[preflight] Running pre-flight checks\n[preflight] Pulling images required for setting up a Kubernetes cluster\n[preflight] This might take a minute or two, depending on the speed of your internet connection\n[preflight] You can also perform this action in beforehand using 'kubeadm config images pull'\n[kubelet-start] Writing kubelet environment file with flags to file \"/var/lib/kubelet/kubeadm-flags.env\"\n[kubelet-start] Writing kubelet configuration to file \"/var/lib/kubelet/config.yaml\"\n[kubelet-start] Activating the kubelet service\n[certs] Using certificateDir folder \"/etc/kubernetes/pki\"\n[certs] Generating \"ca\" certificate and key\n[certs] Generating \"apiserver\" certificate and key\n[certs] apiserver serving cert is signed for DNS names [controller-0 kubernetes kubernetes.default kubernetes.default.svc kubernetes.default.svc.cluster.local] and IPs [fd04::1 aefd::2 aefd::2 ::1]\n[certs] Generating \"apiserver-kubelet-client\" 
certificate and key\n[certs] Generating \"front-proxy-ca\" certificate and key\n[certs] Generating \"front-proxy-client\" certificate and key\n[certs] External etcd mode: Skipping etcd/ca certificate authority generation\n[certs] External etcd mode: Skipping etcd/server certificate generation\n[certs] External etcd mode: Skipping etcd/peer certificate generation\n[certs] External etcd mode: Skipping etcd/healthcheck-client certificate generation\n[certs] External etcd mode: Skipping apiserver-etcd-client certificate generation\n[certs] Generating \"sa\" key and public key\n[kubeconfig] Using kubeconfig folder \"/etc/kubernetes\"\n[kubeconfig] Writing \"admin.conf\" kubeconfig file\n[kubeconfig] Writing \"kubelet.conf\" kubeconfig file\n[kubeconfig] Writing \"controller-manager.conf\" kubeconfig file\n[kubeconfig] Writing \"scheduler.conf\" kubeconfig file\n[control-plane] Using manifest folder \"/etc/kubernetes/manifests\"\n[control-plane] Creating static Pod manifest for \"kube-apiserver\"\n[control-plane] Creating static Pod manifest for \"kube-controller-manager\"\n[control-plane] Creating static Pod manifest for \"kube-scheduler\"\n[wait-control-plane] Waiting for the kubelet to boot up the control plane as static Pods from directory \"/etc/kubernetes/manifests\". 
This can take up to 4m0s\n[kubelet-check] Initial timeout of 40s passed.\n[kubelet-check] It seems like the kubelet isn't running or healthy.\n[kubelet-check] The HTTP call equal to 'curl -sSL http://localhost:10248/healthz' failed with error: Get http://localhost:10248/healthz: dial tcp 127.0.0.1:10248: connect: connection refused.\n[kubelet-check] It seems like the kubelet isn't running or healthy.\n[kubelet-check] The HTTP call equal to 'curl -sSL http://localhost:10248/healthz' failed with error: Get http://localhost:10248/healthz: dial tcp 127.0.0.1:10248: connect: connection refused.\n[kubelet-check] It seems like the kubelet isn't running or healthy.\n[kubelet-check] The HTTP call equal to 'curl -sSL http://localhost:10248/healthz' failed with error: Get http://localhost:10248/healthz: dial tcp 127.0.0.1:10248: connect: connection refused.\n[kubelet-check] It seems like the kubelet isn't running or healthy.\n[kubelet-check] The HTTP call equal to 'curl -sSL http://localhost:10248/healthz' failed with error: Get http://localhost:10248/healthz: dial tcp 127.0.0.1:10248: connect: connection refused.\n[kubelet-check] It seems like the kubelet isn't running or healthy.\n[kubelet-check] The HTTP call equal to 'curl -sSL http://localhost:10248/healthz' failed with error: Get http://localhost:10248/healthz: dial tcp 127.0.0.1:10248: connect: connection refused.\n\nUnfortunately, an error has occurred:\n\ttimed out waiting for the condition\n\nThis error is likely caused by:\n\t- The kubelet is not running\n\t- The kubelet is unhealthy due to a misconfiguration of the node in some way (required cgroups disabled)\n\nIf you are on a systemd-powered system, you can try to troubleshoot the error with the following commands:\n\t- 'systemctl status kubelet'\n\t- 'journalctl -xeu kubelet'\n\nAdditionally, a control plane component may have crashed or exited when started by the container runtime.\nTo troubleshoot, list all containers using your preferred container runtimes 
CLI, e.g. docker.\nHere is one example how you may list all Kubernetes containers running in docker:\n\t- 'docker ps -a | grep kube | grep -v pause'\n\tOnce you have found the failing container, you can inspect its logs with:\n\t- 'docker logs CONTAINERID'", "stdout_lines": ["[init] Using Kubernetes version: v1.16.2", "[preflight] Running pre-flight checks", "[preflight] Pulling images required for setting up a Kubernetes cluster", "[preflight] This might take a minute or two, depending on the speed of your internet connection", "[preflight] You can also perform this action in beforehand using 'kubeadm config images pull'", "[kubelet-start] Writing kubelet environment file with flags to file \"/var/lib/kubelet/kubeadm-flags.env\"", "[kubelet-start] Writing kubelet configuration to file \"/var/lib/kubelet/config.yaml\"", "[kubelet-start] Activating the

Test Activity
-------------
DC System installation

Revision history for this message
Yosief Gebremariam (ygebrema) wrote :
summary: - Distributed Cloud: Subsequent bootstrapping on subcloud may cause the
+ Distributed Cloud: Subsequent bootstrapping subcloud may cause the
kubelet to fail.
summary: - Distributed Cloud: Subsequent bootstrapping subcloud may cause the
+ Distributed Cloud: Subsequent subcloud bootstrapping may cause the
kubelet to fail.
Ghada Khalil (gkhalil)
tags: added: stx.distcloud
Revision history for this message
Tyler Smith (tyler.smith) wrote :

Not sure why the kubeadm init failed initially; when I ran it manually it passed with no issue… I don't think it's tied to the earlier SSH failure (or the second failure, which was during image pull). Also worth noting that deleting the subcloud in dcmanager doesn't actually do any cleanup on the subcloud; it just removes it from the database and lets you rerun the bootstrap.

Ghada Khalil (gkhalil)
Changed in starlingx:
assignee: nobody → Tyler Smith (tyler.smith)
Revision history for this message
Ghada Khalil (gkhalil) wrote :

Waiting to see if the issue is reproducible

Revision history for this message
Ghada Khalil (gkhalil) wrote :

Reading this more closely, this appears to be an issue after an initial bootstrap failure, so that's the scenario that needs to be tested when trying to reproduce.

Changed in starlingx:
status: New → Incomplete
Revision history for this message
Yosief Gebremariam (ygebrema) wrote :

The same issue was reproduced when bootstrapping the same subcloud. The initial failure was on image pull and the second was on kubeadm init. Collected logs are attached.
[sysadmin@controller-0 ~(keystone_admin)]$ dcmanager subcloud list
+----+-----------+------------+--------------+------------------+-------------+
| id | name      | management | availability | deploy status    | sync        |
+----+-----------+------------+--------------+------------------+-------------+
| 5  | subcloud4 | managed    | online       | complete         | out-of-sync |
| 6  | subcloud5 | managed    | online       | complete         | in-sync     |
| 9  | subcloud1 | managed    | online       | complete         | out-of-sync |
| 19 | subcloud6 | unmanaged  | offline      | bootstrap-failed | unknown     |
+----+-----------+------------+--------------+------------------+-------------+

Revision history for this message
Ghada Khalil (gkhalil) wrote :

Marking as stx.3.0 / medium priority - issue is reproducible and impacts bootstrap replay for the subcloud.

tags: added: stx.3.0 stx.containers
Changed in starlingx:
status: Incomplete → Triaged
importance: Undecided → Medium
Revision history for this message
Yang Liu (yliu12) wrote :

Comments from Al Bailey:

umount /var/lib/kubelet/ ; ls -al /var/lib/kubelet/
total 16
drwxr-x---. 2 root root 4096 Oct 25 23:57 .
drwxr-xr-x. 63 root root 4096 Oct 25 22:55 ..
-rw-r--r-- 1 root root 1790 Oct 25 23:57 config.yaml
-rw-r--r-- 1 root root 138 Oct 25 23:57 kubeadm-flags.env

User logs show:
2019-10-25T23:53:57.000 localhost ansible-command: info Invoked with warn=True executable=None _uses_shell=False _raw_params=kubeadm reset -f removes=None argv=None creates=None chdir=None stdin=None
2019-10-25T23:53:58.000 localhost ansible-command: info Invoked with warn=False executable=None _uses_shell=True _raw_params=/bin/rm -rf /var/lib/kubelet/* removes=None argv=None creates=None chdir=None stdin=None

There is no sign of the mount command from ansible having been invoked:

This is in the ansible log but it looks like it must have skipped:
TASK [bootstrap/persist-config : Mount kubelet-lv] *****************************

Here’s what the code looks like:
- name: Mount kubelet-lv
  command: mount /var/lib/kubelet/
  args:
    warn: false
  when: '"/var/lib/kubelet/" is not mount'

Later on, it is mounted, but by then the data was already written to the raw disk.

2019-10-25T23:57:38.048 localhost systemd[1]: notice var-lib-kubelet.mount: Directory /var/lib/kubelet to mount over is not empty, mounting anyway.
2019-10-25T23:57:38.064 localhost systemd[1]: info Mounting /var/lib/kubelet...
2019-10-25T23:57:38.105 localhost systemd[1]: info Mounted /var/lib/kubelet.

The sequence was:
1. The "kubeadm reset" completed.
2. The mount was skipped. That is (in my opinion) the bug: we know it was unmounted, so it should not have been skipped.
3. The rm was successful.
4. The files were then re-created on the unmounted filesystem (a symptom of the bug).
5. The code re-mounted over top of the populated filesystem.
6. Kubernetes could not start.


In this environment, the "when" clause of the mount task didn't work; possibly it's due to caching.
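One way around a stale mount test is to check the live system instead of relying on the Jinja "is mount" evaluation. The sketch below is illustrative (the `ensure_mounted` function and its messages are mine, not from the fix): mountpoint(8) queries the kernel's current mount table directly, so a prior `kubeadm reset`/umount cannot leave it with a cached answer.

```shell
#!/bin/sh
# Sketch: check-and-mount logic that queries the live mount table via
# mountpoint(8) rather than a cached fact. The actual mount is commented
# out because it requires root and an fstab entry for the directory.
ensure_mounted() {
    dir="$1"
    if mountpoint -q "$dir"; then
        echo "already mounted"
    else
        echo "not mounted; would run: mount $dir"
        # mount "$dir"
    fi
}

ensure_mounted /var/lib/kubelet/
```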

Frank Miller (sensfan22)
Changed in starlingx:
assignee: Tyler Smith (tyler.smith) → Al Bailey (albailey1974)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to ansible-playbooks (master)

Reviewed: https://review.opendev.org/691734
Committed: https://git.openstack.org/cgit/starlingx/ansible-playbooks/commit/?id=38b20a77b4884e3053795e77acd91d447de14a26
Submitter: Zuul
Branch: master

commit 38b20a77b4884e3053795e77acd91d447de14a26
Author: Al Bailey <email address hidden>
Date: Mon Oct 28 13:22:29 2019 -0500

    Fix for missing mount when kubeadm init invoked

    When kubeadm reset is invoked, the /var/lib/kubelet
    mount is removed.

    The "when" conditional in the ansible environment did not seem
    to detect that the mount command was needed.

    Updated the command to be a compound bash command to ensure
    it works in local and remote ansible environments.

    Change-Id: Ic81d180df9691161e0b75ab7cdeaf2f639b47728
    Fixes-Bug: 1849710
    Related-Bug: 1847147
    Related-Bug: 1838692
    Signed-off-by: Al Bailey <email address hidden>
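The exact diff is in the review linked above. Going only by the commit message ("compound bash command" that "works in local and remote ansible environments"), the fixed task plausibly took a shape like the following sketch, which checks the live mount state and mounts in a single shell invocation rather than relying on the "is not mount" conditional; this is my reconstruction, not the verbatim merged code:

```
- name: Mount kubelet-lv
  shell: mountpoint -q /var/lib/kubelet/ || mount /var/lib/kubelet/
  args:
    warn: false
```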

Changed in starlingx:
status: Triaged → Fix Released
Yang Liu (yliu12)
tags: added: stx.retestneeded
Revision history for this message
Yosief Gebremariam (ygebrema) wrote :

Tested the same scenario on DC lab installed with build: "2019-11-05_07-34-20":
After an initial simulated failure, a subsequent replay successfully completed the subcloud bootstrapping.

Yang Liu (yliu12)
tags: removed: stx.retestneeded