ansible replay fails if kubeadm init was not successful

Bug #1838692 reported by David Sullivan
Affects    Status        Importance  Assigned to     Milestone
StarlingX  Fix Released  High        David Sullivan

Bug Description

Brief Description
-----------------
If the first attempt to run the ansible playbook fails to complete kubeadm init, subsequent runs will always fail.

Severity
--------
Critical

Steps to Reproduce
------------------
1. Bring up the system and run the ansible bootstrap playbook with a unified docker registry that is invalid or not reachable from the host.
2. The playbook will fail (this is expected). The error should look like this:
   Some fatal errors occurred:\n\t[ERROR ImagePull]: failed to pull image 10.10.10.13:5000/kube-apiserver:v1.13.5: output: Error response from daemon: Get https://10.10.10.13:5000/v2/: dial tcp 10.10.10.13:5000: connect: no route to host
3. Remove the invalid docker registry and rerun the bootstrap playbook.

Expected Behavior
------------------
The ansible playbook should complete.

Actual Behavior
----------------
The ansible playbook fails with the following error:
[kubelet-check] It seems like the kubelet isn't running or healthy

Reproducibility
---------------
100%

System Configuration
--------------------
All configurations

Branch/Pull Time/Commit
-----------------------
20190728T233000Z

Last Pass
---------
I believe this passed with 20190707T233000Z

Timestamp/Logs
--------------
NA

Test Activity
-------------
Developer Testing

David Sullivan (dsullivanwr) wrote :

I used the following data for localhost.yml to induce the first error in vbox

external_oam_subnet: 10.10.10.0/24
external_oam_gateway_address: 10.10.10.1
external_oam_floating_address: 10.10.10.2
management_subnet: 192.168.204.0/24
dns_servers:
  - 8.8.4.4
admin_password: Li69nux*
ansible_become_pass: Li69nux*
docker_registries:
  unified: 10.10.10.13:5000

David Sullivan (dsullivanwr) wrote :

I believe this issue stems from the addition of the kubelet-fs
https://opendev.org/starlingx/config/commit/e74ef5f7c4c71464347142932615e1545884133c

Digging into the issue, kubelet fails to start because its config file is missing:
localhost kubelet[103405]: info F0731 20:52:58.914248 103405 server.go:189] failed to load Kubelet config file /var/lib/kubelet/config.yaml, error failed to read kubelet config file "/var/lib/kubelet/config.yaml", error: open /var/lib/kubelet/config.yaml: no such file or directory

As part of the kubelet-fs change, we mount the filesystem at /var/lib/kubelet. Unfortunately, kubeadm reset unmounts this path, so on replays the following happens:
kubeadm reset -> unmounts kubelet-lv from /var/lib/kubelet
kubeadm init -> writes config to /var/lib/kubelet
kubeadm init -> starts kubelet
kubelet service -> mounts kubelet-lv to /var/lib/kubelet
kubelet service -> reads in config.yaml

If the first run of kubeadm init failed, the kubelet-lv will be empty and the kubelet service will fail. If the first kubeadm init passed, the contents of /var/lib/kubelet will be stale.

A quick solution would probably be to remount kubelet-lv to /var/lib/kubelet between kubeadm reset and kubeadm init.
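The idea of remounting between kubeadm reset and kubeadm init could be sketched roughly as below. This is only an illustration, not the merged fix: the device path /dev/cgts-vg/kubelet-lv and the function names is_mounted and steps_before_init are assumptions made for the example. The script is a dry run that prints the steps instead of executing them, and it reads the mount table from a file so it can be tried without root.

```shell
#!/bin/sh
# Sketch only. The device path below is a hypothetical placeholder.
KUBELET_DIR=/var/lib/kubelet
KUBELET_DEV=/dev/cgts-vg/kubelet-lv

# Return 0 if $1 is listed as a mount point in the mounts table $2
# (defaults to /proc/mounts). Field 2 of each line is the mount point.
is_mounted() {
    awk -v target="$1" '$2 == target { found = 1 } END { exit !found }' \
        "${2:-/proc/mounts}"
}

# Print the steps needed after "kubeadm reset", given the mounts table
# as it looks at that point. Nothing is executed.
steps_before_init() {
    mounts_file="${1:-/proc/mounts}"
    if ! is_mounted "$KUBELET_DIR" "$mounts_file"; then
        # kubeadm reset removed the mount, so put it back first.
        echo "mount $KUBELET_DEV $KUBELET_DIR"
    fi
    echo "kubeadm init"
}
```

Calling steps_before_init with no argument inspects /proc/mounts on the running host. With the mount in place, kubeadm init writes config.yaml onto kubelet-lv, which is where the kubelet service later looks for it.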

See also
https://github.com/kubernetes/kubernetes/blob/v1.13.5/cmd/kubeadm/app/cmd/reset.go#L157

David Sullivan (dsullivanwr) wrote :

Frank Miller (sensfan22) wrote :

For stx.2.0 we cannot upversion to Kubernetes v1.15. Therefore a temporary solution is required in stx.2.0 that can be removed in stx.3.0 when the rebase to Kubernetes is completed. Assigning to David to implement the temporary solution for stx.2.0.

Changed in starlingx:
status: New → Triaged
importance: Undecided → High
assignee: nobody → David Sullivan (dsullivanwr)
tags: added: stx.2.0 stx.config

OpenStack Infra (hudson-openstack) wrote : Fix proposed to ansible-playbooks (master)

Fix proposed to branch: master
Review: https://review.opendev.org/675407

Changed in starlingx:
status: Triaged → In Progress

OpenStack Infra (hudson-openstack) wrote : Fix proposed to ansible-playbooks (r/stx.2.0)

Fix proposed to branch: r/stx.2.0
Review: https://review.opendev.org/678649

OpenStack Infra (hudson-openstack) wrote : Change abandoned on ansible-playbooks (master)

Change abandoned by David Sullivan (<email address hidden>) on branch: master
Review: https://review.opendev.org/675407

David Sullivan (dsullivanwr) wrote :

This change is no longer needed in master, only in stx.2.0.

Tested in master with this commit (https://opendev.org/starlingx/integ/commit/70a510bd0442f6e8c4e742d27ab4188725e3c96a) and the issue is resolved.

OpenStack Infra (hudson-openstack) wrote : Fix merged to ansible-playbooks (r/stx.2.0)

Reviewed: https://review.opendev.org/678649
Committed: https://git.openstack.org/cgit/starlingx/ansible-playbooks/commit/?id=7dc747071a50cf29535cfa55ed9a33776557f925
Submitter: Zuul
Branch: r/stx.2.0

commit 7dc747071a50cf29535cfa55ed9a33776557f925
Author: David Sullivan <email address hidden>
Date: Thu Aug 8 12:49:16 2019 -0400

    ansible replay fails if kubeadm init was not successful

    Ansible replay is broken since the introduction of kubelet-fs
    https://opendev.org/starlingx/config/commit/e74ef5f7c4c71464347142932615e1545884133c

    The kubelet-fs change uncovered a flaw in kubeadm reset
    https://github.com/kubernetes/kubeadm/issues/1294

    kubeadm will sometimes unmount /var/lib/kubelet/. To correct this we
    will remount the kubelet-lv on ansible replays and wipe the contents.

    Change-Id: Ie4d6009bb2d53561586a8b62d1ab92a0859119fb
    Signed-off-by: David Sullivan <email address hidden>
    Closes-Bug: 1838692
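The "wipe the contents" step from the commit message above can be illustrated with a small helper. This is a sketch under assumptions, not the code from the commit: the function name wipe_contents is hypothetical, and it relies on the -mindepth and -delete options that GNU find provides (they are not part of POSIX find). It removes everything inside a directory while leaving the directory itself, and therefore any mount on it, in place.

```shell
#!/bin/sh
# Sketch only: wipe_contents is a hypothetical helper name.
# Remove every entry below the given directory, including hidden files
# and nested directories, but keep the directory itself so that an
# active mount point stays where it is.
wipe_contents() {
    dir=$1
    # Refuse to run on a missing or empty argument.
    [ -n "$dir" ] && [ -d "$dir" ] || return 1
    # -mindepth 1 skips the directory itself; -delete implies depth-first
    # traversal, so files are removed before their parent directories.
    find "$dir" -mindepth 1 -delete
}
```

On a replay this would be called as wipe_contents /var/lib/kubelet once the volume is mounted again, so that stale state from an earlier kubeadm init does not survive.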

Changed in starlingx:
status: In Progress → Fix Released
Ghada Khalil (gkhalil)
tags: added: in-r-stx20

Al Bailey (albailey1974) wrote :

It looks like it is needed in master, since kube 1.16 changed how it does the unmount and it now unmounts /var/lib/kubelet.

See: https://review.opendev.org/#/q/topic:bug/1847147

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to ansible-playbooks (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/691734

OpenStack Infra (hudson-openstack) wrote : Related fix merged to ansible-playbooks (master)

Reviewed: https://review.opendev.org/691734
Committed: https://git.openstack.org/cgit/starlingx/ansible-playbooks/commit/?id=38b20a77b4884e3053795e77acd91d447de14a26
Submitter: Zuul
Branch: master

commit 38b20a77b4884e3053795e77acd91d447de14a26
Author: Al Bailey <email address hidden>
Date: Mon Oct 28 13:22:29 2019 -0500

    Fix for missing mount when kubeadm init invoked

    When kubeadm reset is invoked, the /var/lib/kubelet
    mount is removed.

    The "when" conditional in the ansible environment did not seem
    to detect that the mount command was needed.

    Updated the command to be a compound bash command to ensure
    it works in local and remote ansible environments.

    Change-Id: Ic81d180df9691161e0b75ab7cdeaf2f639b47728
    Fixes-Bug: 1849710
    Related-Bug: 1847147
    Related-Bug: 1838692
    Signed-off-by: Al Bailey <email address hidden>
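A compound command of the kind the commit message above describes might look like the following. This is a guess at the general shape, not the actual command from the commit: the function name ensure_mounted and the MOUNT_CMD override are assumptions added for illustration, and calling mount with only a mount point assumes a matching /etc/fstab entry exists. The check and the mount happen in one shell invocation, so the decision does not depend on a separately evaluated condition.

```shell
#!/bin/sh
# Sketch only. MOUNT_CMD is a hypothetical override so the logic can be
# exercised without root; by default the real mount command is used.
MOUNT_CMD="${MOUNT_CMD:-mount}"

# Mount the target only when it is not already a mount point.
# "mountpoint -q" (util-linux) exits 0 if the path is a mount point;
# the right-hand side of "||" runs only when that check fails.
ensure_mounted() {
    mountpoint -q "$1" || $MOUNT_CMD "$1"
}
```

Note that the || operator is interpreted by a shell, so in Ansible a command like this would need the shell module rather than the command module.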
