Distributed Cloud: Subsequent subcloud bootstrapping may cause the kubelet to fail.

Bug #1849710 reported by Yosief Gebremariam
Affects: StarlingX
Status: Fix Released
Importance: Medium
Assigned to: Al Bailey

Bug Description

Brief Description
-----------------
DC system install: attempting to bootstrap a subcloud from the System Controller after an initial non-trivial bootstrap failure may leave Kubernetes in a bad state on the subcloud, causing subsequent bootstrap attempts to fail.

Initially, the bootstrap failed due to SSH Error:
TASK [bootstrap/prepare-env : Look for unmistakenly StarlingX package] *********
fatal: [subcloud6]: UNREACHABLE! => {"changed": false, "msg": "SSH Error: data could not be sent to remote host \"2620:10a:a001:a103:a6bf:1ff:fe0a:7af6\". Make sure this host can be reached over ssh", "unreachable": true}

After correcting the OAM address, subsequent attempts to re-add the subcloud failed. Please see the sample logs below.

+----+-----------+------------+--------------+------------------+-------------+
| id | name      | management | availability | deploy status    | sync        |
+----+-----------+------------+--------------+------------------+-------------+
| 5  | subcloud4 | managed    | online       | complete         | out-of-sync |
| 6  | subcloud5 | managed    | online       | complete         | in-sync     |
| 9  | subcloud1 | managed    | online       | complete         | in-sync     |
| 11 | subcloud6 | unmanaged  | offline      | bootstrap-failed | unknown     |
+----+-----------+------------+--------------+------------------+-------------+

The workaround is to manually run "kubeadm init" on the subcloud to restart the kubelet.
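The workaround above can be sketched as the following command sequence on the failed subcloud. This is an illustrative recovery sequence based on the commands visible in the logs below, not a documented procedure; verify the mount state and config path against your deployment before running it:

```shell
# Reset any partially initialized control plane state (as seen in the ansible logs)
sudo kubeadm reset -f

# Ensure the kubelet logical volume is actually mounted before re-initializing;
# the root cause analysis below shows the mount being skipped on replay
mountpoint -q /var/lib/kubelet/ || sudo mount /var/lib/kubelet/

# Re-run the init with the same config the playbook uses
sudo kubeadm init --config=/etc/kubernetes/kubeadm.yaml

# Confirm the kubelet came up
systemctl status kubelet
```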

Severity
--------
Major

Steps to Reproduce
------------------
As described above.

TC-name: DC system AIO SX subcloud installation

Expected Behavior
------------------
The subcloud bootstrap from the System Controller completes successfully.

Actual Behavior
----------------
The bootstrap failed with a Kubernetes (kubelet) failure.

Reproducibility
---------------
Tested once.

System Configuration
--------------------
One node system

Lab-name: DC System subcloud6 AIO-SX (WCP-89)

Branch/Pull Time/Commit
-----------------------
"2019-10-22_20-00-00"

Last Pass
---------

Timestamp/Logs
--------------

fatal: [subcloud6]: FAILED! => {"changed": true, "cmd": ["kubeadm", "init", "--config=/etc/kubernetes/kubeadm.yaml"], "delta": "0:01:56.961250", "end": "2019-10-24 02:18:38.869237", "msg": "non-zero return code", "rc": 1, "start": "2019-10-24 02:16:41.907987", "stderr": "\t[WARNING IsDockerSystemdCheck]: detected \"cgroupfs\" as the Docker cgroup driver. The recommended driver is \"systemd\". Please follow the guide at https://kubernetes.io/docs/setup/cri/\nerror execution phase wait-control-plane: couldn't initialize a Kubernetes cluster\nTo see the stack trace of this error execute with --v=5 or higher", "stderr_lines": ["\t[WARNING IsDockerSystemdCheck]: detected \"cgroupfs\" as the Docker cgroup driver. The recommended driver is \"systemd\". Please follow the guide at https://kubernetes.io/docs/setup/cri/", "error execution phase wait-control-plane: couldn't initialize a Kubernetes cluster", "To see the stack trace of this error execute with --v=5 or higher"], "stdout": "[init] Using Kubernetes version: v1.16.2\n[preflight] Running pre-flight checks\n[preflight] Pulling images required for setting up a Kubernetes cluster\n[preflight] This might take a minute or two, depending on the speed of your internet connection\n[preflight] You can also perform this action in beforehand using 'kubeadm config images pull'\n[kubelet-start] Writing kubelet environment file with flags to file \"/var/lib/kubelet/kubeadm-flags.env\"\n[kubelet-start] Writing kubelet configuration to file \"/var/lib/kubelet/config.yaml\"\n[kubelet-start] Activating the kubelet service\n[certs] Using certificateDir folder \"/etc/kubernetes/pki\"\n[certs] Generating \"ca\" certificate and key\n[certs] Generating \"apiserver\" certificate and key\n[certs] apiserver serving cert is signed for DNS names [controller-0 kubernetes kubernetes.default kubernetes.default.svc kubernetes.default.svc.cluster.local] and IPs [fd04::1 aefd::2 aefd::2 ::1]\n[certs] Generating \"apiserver-kubelet-client\" 
certificate and key\n[certs] Generating \"front-proxy-ca\" certificate and key\n[certs] Generating \"front-proxy-client\" certificate and key\n[certs] External etcd mode: Skipping etcd/ca certificate authority generation\n[certs] External etcd mode: Skipping etcd/server certificate generation\n[certs] External etcd mode: Skipping etcd/peer certificate generation\n[certs] External etcd mode: Skipping etcd/healthcheck-client certificate generation\n[certs] External etcd mode: Skipping apiserver-etcd-client certificate generation\n[certs] Generating \"sa\" key and public key\n[kubeconfig] Using kubeconfig folder \"/etc/kubernetes\"\n[kubeconfig] Writing \"admin.conf\" kubeconfig file\n[kubeconfig] Writing \"kubelet.conf\" kubeconfig file\n[kubeconfig] Writing \"controller-manager.conf\" kubeconfig file\n[kubeconfig] Writing \"scheduler.conf\" kubeconfig file\n[control-plane] Using manifest folder \"/etc/kubernetes/manifests\"\n[control-plane] Creating static Pod manifest for \"kube-apiserver\"\n[control-plane] Creating static Pod manifest for \"kube-controller-manager\"\n[control-plane] Creating static Pod manifest for \"kube-scheduler\"\n[wait-control-plane] Waiting for the kubelet to boot up the control plane as static Pods from directory \"/etc/kubernetes/manifests\". 
This can take up to 4m0s\n[kubelet-check] Initial timeout of 40s passed.\n[kubelet-check] It seems like the kubelet isn't running or healthy.\n[kubelet-check] The HTTP call equal to 'curl -sSL http://localhost:10248/healthz' failed with error: Get http://localhost:10248/healthz: dial tcp 127.0.0.1:10248: connect: connection refused.\n[kubelet-check] It seems like the kubelet isn't running or healthy.\n[kubelet-check] The HTTP call equal to 'curl -sSL http://localhost:10248/healthz' failed with error: Get http://localhost:10248/healthz: dial tcp 127.0.0.1:10248: connect: connection refused.\n[kubelet-check] It seems like the kubelet isn't running or healthy.\n[kubelet-check] The HTTP call equal to 'curl -sSL http://localhost:10248/healthz' failed with error: Get http://localhost:10248/healthz: dial tcp 127.0.0.1:10248: connect: connection refused.\n[kubelet-check] It seems like the kubelet isn't running or healthy.\n[kubelet-check] The HTTP call equal to 'curl -sSL http://localhost:10248/healthz' failed with error: Get http://localhost:10248/healthz: dial tcp 127.0.0.1:10248: connect: connection refused.\n[kubelet-check] It seems like the kubelet isn't running or healthy.\n[kubelet-check] The HTTP call equal to 'curl -sSL http://localhost:10248/healthz' failed with error: Get http://localhost:10248/healthz: dial tcp 127.0.0.1:10248: connect: connection refused.\n\nUnfortunately, an error has occurred:\n\ttimed out waiting for the condition\n\nThis error is likely caused by:\n\t- The kubelet is not running\n\t- The kubelet is unhealthy due to a misconfiguration of the node in some way (required cgroups disabled)\n\nIf you are on a systemd-powered system, you can try to troubleshoot the error with the following commands:\n\t- 'systemctl status kubelet'\n\t- 'journalctl -xeu kubelet'\n\nAdditionally, a control plane component may have crashed or exited when started by the container runtime.\nTo troubleshoot, list all containers using your preferred container runtimes 
CLI, e.g. docker.\nHere is one example how you may list all Kubernetes containers running in docker:\n\t- 'docker ps -a | grep kube | grep -v pause'\n\tOnce you have found the failing container, you can inspect its logs with:\n\t- 'docker logs CONTAINERID'", "stdout_lines": ["[init] Using Kubernetes version: v1.16.2", "[preflight] Running pre-flight checks", "[preflight] Pulling images required for setting up a Kubernetes cluster", "[preflight] This might take a minute or two, depending on the speed of your internet connection", "[preflight] You can also perform this action in beforehand using 'kubeadm config images pull'", "[kubelet-start] Writing kubelet environment file with flags to file \"/var/lib/kubelet/kubeadm-flags.env\"", "[kubelet-start] Writing kubelet configuration to file \"/var/lib/kubelet/config.yaml\"", "[kubelet-start] Activating the

Test Activity
-------------
DC System installation

Revision history for this message
Yosief Gebremariam (ygebrema) wrote :
summary: - Distributed Cloud: Subsequent bootstrapping on subcloud may cause the
+ Distributed Cloud: Subsequent bootstrapping subcloud may cause the
kubelet to fail.
summary: - Distributed Cloud: Subsequent bootstrapping subcloud may cause the
+ Distributed Cloud: Subsequent subcloud bootstrapping may cause the
kubelet to fail.
Ghada Khalil (gkhalil)
tags: added: stx.distcloud
Revision history for this message
Tyler Smith (tyler.smith) wrote :

Not sure why the kubeadm init failed initially; when I ran it manually it passed with no issue… I don't think it's tied to the earlier SSH failure (or the second failure, which was during image pull). Also worth noting that deleting the subcloud in dcmanager doesn't actually do any cleanup on the subcloud; it just removes it from the database and lets you rerun the bootstrap.

Ghada Khalil (gkhalil)
Changed in starlingx:
assignee: nobody → Tyler Smith (tyler.smith)
Revision history for this message
Ghada Khalil (gkhalil) wrote :

Waiting to see if the issue is reproducible

Revision history for this message
Ghada Khalil (gkhalil) wrote :

Reading this more closely, this appears to be an issue after an initial bootstrap failure, so that's the scenario that needs to be tested when trying to reproduce.

Changed in starlingx:
status: New → Incomplete
Revision history for this message
Yosief Gebremariam (ygebrema) wrote :

The same issue was reproduced when bootstrapping the same subcloud. The initial failure was on image pull and the second was on kubeadm init. Collected logs are attached.
[sysadmin@controller-0 ~(keystone_admin)]$ dcmanager subcloud list
+----+-----------+------------+--------------+------------------+-------------+
| id | name      | management | availability | deploy status    | sync        |
+----+-----------+------------+--------------+------------------+-------------+
| 5  | subcloud4 | managed    | online       | complete         | out-of-sync |
| 6  | subcloud5 | managed    | online       | complete         | in-sync     |
| 9  | subcloud1 | managed    | online       | complete         | out-of-sync |
| 19 | subcloud6 | unmanaged  | offline      | bootstrap-failed | unknown     |
+----+-----------+------------+--------------+------------------+-------------+

Revision history for this message
Ghada Khalil (gkhalil) wrote :

Marking as stx.3.0 / medium priority - issue is reproducible and impacts bootstrap replay for the subcloud.

tags: added: stx.3.0 stx.containers
Changed in starlingx:
status: Incomplete → Triaged
importance: Undecided → Medium
Revision history for this message
Yang Liu (yliu12) wrote :

Comments from Al Bailey:

umount /var/lib/kubelet/ ; ls -al /var/lib/kubelet/
total 16
drwxr-x---. 2 root root 4096 Oct 25 23:57 .
drwxr-xr-x. 63 root root 4096 Oct 25 22:55 ..
-rw-r--r-- 1 root root 1790 Oct 25 23:57 config.yaml
-rw-r--r-- 1 root root 138 Oct 25 23:57 kubeadm-flags.env

User logs show:
2019-10-25T23:53:57.000 localhost ansible-command: info Invoked with warn=True executable=None _uses_shell=False _raw_params=kubeadm reset -f removes=None argv=None creates=None chdir=None stdin=None
2019-10-25T23:53:58.000 localhost ansible-command: info Invoked with warn=False executable=None _uses_shell=True _raw_params=/bin/rm -rf /var/lib/kubelet/* removes=None argv=None creates=None chdir=None stdin=None

There is no sign of the mount command from ansible having been invoked:

This is in the ansible log but it looks like it must have skipped:
TASK [bootstrap/persist-config : Mount kubelet-lv] *****************************

Here’s what the code looks like:
- name: Mount kubelet-lv
  command: mount /var/lib/kubelet/
  args:
    warn: false
  when: '"/var/lib/kubelet/" is not mount'

Later on, it is mounted, but by then the data was already written to the raw disk.

2019-10-25T23:57:38.048 localhost systemd[1]: notice var-lib-kubelet.mount: Directory /var/lib/kubelet to mount over is not empty, mounting anyway.
2019-10-25T23:57:38.064 localhost systemd[1]: info Mounting /var/lib/kubelet...
2019-10-25T23:57:38.105 localhost systemd[1]: info Mounted /var/lib/kubelet.

The sequence was:
1. The "kubeadm reset" completed.
2. The mount was skipped. That is (in my opinion) the bug: we know it was unmounted, so it should not have been skipped.
3. The rm was successful.
4. The files were then re-created on the unmounted filesystem (a symptom of the bug).
5. The code re-mounted over top of the populated filesystem.
6. Kubernetes could not start.


In this environment, the "when" clause of the mount task didn't work; possibly it's due to caching.
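One way around a stale mount test is to check the live system instead of relying on the Jinja "is mount" evaluation. The sketch below is illustrative (the `ensure_mounted` function and its messages are mine, not from the fix): mountpoint(8) queries the kernel's current mount table directly, so a prior `kubeadm reset`/umount cannot leave it with a cached answer.

```shell
#!/bin/sh
# Sketch: check-and-mount logic that queries the live mount table via
# mountpoint(8) rather than a cached fact. The actual mount is commented
# out because it requires root and an fstab entry for the directory.
ensure_mounted() {
    dir="$1"
    if mountpoint -q "$dir"; then
        echo "already mounted"
    else
        echo "not mounted; would run: mount $dir"
        # mount "$dir"
    fi
}

ensure_mounted /var/lib/kubelet/
```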

Frank Miller (sensfan22)
Changed in starlingx:
assignee: Tyler Smith (tyler.smith) → Al Bailey (albailey1974)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to ansible-playbooks (master)

Reviewed: https://review.opendev.org/691734
Committed: https://git.openstack.org/cgit/starlingx/ansible-playbooks/commit/?id=38b20a77b4884e3053795e77acd91d447de14a26
Submitter: Zuul
Branch: master

commit 38b20a77b4884e3053795e77acd91d447de14a26
Author: Al Bailey <email address hidden>
Date: Mon Oct 28 13:22:29 2019 -0500

    Fix for missing mount when kubeadm init invoked

    When kubeadm reset is invoked, the /var/lib/kubelet
    mount is removed.

    The "when" conditional in the ansible environment did not seem
    to detect that the mount command was needed.

    Updated the command to be a compound bash command to ensure
    it works in local and remote ansible environments.

    Change-Id: Ic81d180df9691161e0b75ab7cdeaf2f639b47728
    Fixes-Bug: 1849710
    Related-Bug: 1847147
    Related-Bug: 1838692
    Signed-off-by: Al Bailey <email address hidden>
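The exact diff is in the review linked above. Going only by the commit message ("compound bash command" that "works in local and remote ansible environments"), the fixed task plausibly took a shape like the following sketch, which checks the live mount state and mounts in a single shell invocation rather than relying on the "is not mount" conditional; this is my reconstruction, not the verbatim merged code:

```
- name: Mount kubelet-lv
  shell: mountpoint -q /var/lib/kubelet/ || mount /var/lib/kubelet/
  args:
    warn: false
```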

Changed in starlingx:
status: Triaged → Fix Released
Yang Liu (yliu12)
tags: added: stx.retestneeded
Revision history for this message
Yosief Gebremariam (ygebrema) wrote :

Tested the same scenario on DC lab installed with build: "2019-11-05_07-34-20":
After an initial simulated failure, a subsequent replay successfully completed the subcloud bootstrapping.

Yang Liu (yliu12)
tags: removed: stx.retestneeded