Containers: controllerconfig fails on controller-1 due to k8s taint missing

Bug #1815795 reported by Tyler Smith
Affects: StarlingX
Status: Fix Released
Importance: Low
Assigned to: Don Penney

Bug Description

Title
-----
Containers: controllerconfig fails on controller-1 due to k8s taint missing

Brief Description
-----------------
I just encountered an issue (2+2 Pike load from Feb. 8th) where controllerconfig on controller-1 fails to remove the taint from the newly configured k8s node (it is not clear why the taint was not found), so the configuration fails. On the second configuration attempt it runs through the same setup again, but kubeadm init has already been run, so it fails at that point because the ports are already in use.

Locking and unlocking the failed node keeps failing in the same way; running kubeadm reset and deleting the node before trying again does fix the issue, though. Maybe we should execute those steps after a failed configuration attempt (roughly as sketched below) to prepare the node for a retry, or check whether the init command has already succeeded before reattempting.
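
For reference, the manual recovery described above amounts to the following commands (a rough sketch only; the node name and kubeconfig path are taken from the logs below):

    # On the failed node: wipe the partially initialized kubernetes state.
    kubeadm reset
    # From a host with the cluster admin credentials: remove the stale node
    # object so a fresh kubeadm init can register the node cleanly.
    kubectl --kubeconfig=/etc/kubernetes/admin.conf delete node controller-1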

First failure logs:
2019-02-08T17:52:32.379 Notice: 2019-02-08 17:52:32 +0000 /Stage[main]/Platform::Kubernetes::Master::Init/Exec[restrict coredns to master nodes]/returns: deployment.extensions/coredns patched
2019-02-08T17:52:32.384 Notice: 2019-02-08 17:52:32 +0000 /Stage[main]/Platform::Kubernetes::Master::Init/Exec[restrict coredns to master nodes]/returns: executed successfully
2019-02-08T17:52:32.387 Debug: 2019-02-08 17:52:32 +0000 /Stage[main]/Platform::Kubernetes::Master::Init/Exec[restrict coredns to master nodes]: The container Class[Platform::Kubernetes::Master::Init] will propagate my refresh event
2019-02-08T17:52:32.398 Debug: 2019-02-08 17:52:32 +0000 Exec[remove taint from master node](provider=posix): Executing 'kubectl --kubeconfig=/etc/kubernetes/admin.conf taint node controller-1 node-role.kubernetes.io/master-'
2019-02-08T17:52:32.409 Debug: 2019-02-08 17:52:32 +0000 Executing: 'kubectl --kubeconfig=/etc/kubernetes/admin.conf taint node controller-1 node-role.kubernetes.io/master-'
2019-02-08T17:52:32.745 Notice: 2019-02-08 17:52:32 +0000 /Stage[main]/Platform::Kubernetes::Master::Init/Exec[remove taint from master node]/returns: error: taint "node-role.kubernetes.io/master:" not found
2019-02-08T17:52:32.764 Error: 2019-02-08 17:52:32 +0000 kubectl --kubeconfig=/etc/kubernetes/admin.conf taint node controller-1 node-role.kubernetes.io/master- returned 1 instead of one of [0]
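
(Not part of the original report: when this happens, the taint state on the node can be checked directly, for example:)

    # Show whether the node-role.kubernetes.io/master taint is actually present.
    kubectl --kubeconfig=/etc/kubernetes/admin.conf describe node controller-1 | grep -i taint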

Second failure logs:
2019-02-08T18:02:35.166 Notice: 2019-02-08 18:02:35 +0000 /Stage[main]/Platform::Kubernetes::Kubeadm/Exec[enable-kubelet]/returns: executed successfully
2019-02-08T18:02:35.168 Debug: 2019-02-08 18:02:35 +0000 /Stage[main]/Platform::Kubernetes::Kubeadm/Exec[enable-kubelet]: The container Class[Platform::Kubernetes::Kubeadm] will propagate my refresh event
2019-02-08T18:02:35.171 Debug: 2019-02-08 18:02:35 +0000 Class[Platform::Kubernetes::Kubeadm]: The container Stage[main] will propagate my refresh event
2019-02-08T18:02:35.173 Debug: 2019-02-08 18:02:35 +0000 Class[Platform::Kubernetes::Kubeadm]: The container Class[Platform::Kubernetes::Master] will propagate my refresh event
2019-02-08T18:02:35.177 Debug: 2019-02-08 18:02:35 +0000 Exec[configure master node](provider=posix): Executing 'kubeadm init --config=/etc/kubernetes/kubeadm.yaml'
2019-02-08T18:02:35.180 Debug: 2019-02-08 18:02:35 +0000 Executing: 'kubeadm init --config=/etc/kubernetes/kubeadm.yaml'
2019-02-08T18:02:35.618 Notice: 2019-02-08 18:02:35 +0000 /Stage[main]/Platform::Kubernetes::Master::Init/Exec[configure master node]/returns: [init] using Kubernetes version: v1.12.3
2019-02-08T18:02:35.621 Notice: 2019-02-08 18:02:35 +0000 /Stage[main]/Platform::Kubernetes::Master::Init/Exec[configure master node]/returns: [preflight] running pre-flight checks
2019-02-08T18:02:35.623 Notice: 2019-02-08 18:02:35 +0000 /Stage[main]/Platform::Kubernetes::Master::Init/Exec[configure master node]/returns: [WARNING SystemVerification]: this Docker version is not on the list of validated versions: 18.03.1-ce. Latest validated version: 18.06
2019-02-08T18:02:35.629 Notice: 2019-02-08 18:02:35 +0000 /Stage[main]/Platform::Kubernetes::Master::Init/Exec[configure master node]/returns: [preflight] Some fatal errors occurred:
2019-02-08T18:02:35.631 Notice: 2019-02-08 18:02:35 +0000 /Stage[main]/Platform::Kubernetes::Master::Init/Exec[configure master node]/returns: [ERROR Port-6443]: Port 6443 is in use
2019-02-08T18:02:35.633 Notice: 2019-02-08 18:02:35 +0000 /Stage[main]/Platform::Kubernetes::Master::Init/Exec[configure master node]/returns: [ERROR Port-10251]: Port 10251 is in use
2019-02-08T18:02:35.637 Notice: 2019-02-08 18:02:35 +0000 /Stage[main]/Platform::Kubernetes::Master::Init/Exec[configure master node]/returns: [ERROR Port-10252]: Port 10252 is in use
2019-02-08T18:02:35.640 Notice: 2019-02-08 18:02:35 +0000 /Stage[main]/Platform::Kubernetes::Master::Init/Exec[configure master node]/returns: [ERROR FileAvailable--etc-kubernetes-manifests-kube-apiserver.yaml]: /etc/kubernetes/manifests/kube-apiserver.yaml already exists
2019-02-08T18:02:35.643 Notice: 2019-02-08 18:02:35 +0000 /Stage[main]/Platform::Kubernetes::Master::Init/Exec[configure master node]/returns: [ERROR FileAvailable--etc-kubernetes-manifests-kube-controller-manager.yaml]: /etc/kubernetes/manifests/kube-controller-manager.yaml already exists
2019-02-08T18:02:35.645 Notice: 2019-02-08 18:02:35 +0000 /Stage[main]/Platform::Kubernetes::Master::Init/Exec[configure master node]/returns: [ERROR FileAvailable--etc-kubernetes-manifests-kube-scheduler.yaml]: /etc/kubernetes/manifests/kube-scheduler.yaml already exists
2019-02-08T18:02:35.648 Notice: 2019-02-08 18:02:35 +0000 /Stage[main]/Platform::Kubernetes::Master::Init/Exec[configure master node]/returns: [ERROR Port-10250]: Port 10250 is in use
2019-02-08T18:02:35.651 Notice: 2019-02-08 18:02:35 +0000 /Stage[main]/Platform::Kubernetes::Master::Init/Exec[configure master node]/returns: [preflight] If you know what you are doing, you can make a check non-fatal with `--ignore-preflight-errors=...`
2019-02-08T18:02:35.653 Error: 2019-02-08 18:02:35 +0000 kubeadm init --config=/etc/kubernetes/kubeadm.yaml returned 2 instead of one of [0]

Reproducibility
---------------
Intermittent

Revision history for this message
Tyler Smith (tyler.smith) wrote :

Input from Matt regarding this:

I did see this once and Bart and I tried to figure out what happened. From the logs it looked like the taint never existed, so either kubeadm didn’t add the taint even though the logs said it did, or it was a race condition.

I wasn’t able to reproduce it to debug it further.

Revision history for this message
Ghada Khalil (gkhalil) wrote :

Marking as release gating; issue related to containers. Requires further investigation to better understand reproducibility and impact.

Changed in starlingx:
importance: Undecided → Medium
status: New → Triaged
importance: Medium → Low
assignee: nobody → David Sullivan (dsullivanwr)
tags: added: stx.2019.05
Don Penney (dpenney)
Changed in starlingx:
assignee: David Sullivan (dsullivanwr) → Don Penney (dpenney)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to stx-config (master)

Fix proposed to branch: master
Review: https://review.openstack.org/638460

Changed in starlingx:
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to stx-config (master)

Reviewed: https://review.openstack.org/638460
Committed: https://git.openstack.org/cgit/openstack/stx-config/commit/?id=4b35404d6a03c4bfe6ea12e176d8624710a10b2c
Submitter: Zuul
Branch: master

commit 4b35404d6a03c4bfe6ea12e176d8624710a10b2c
Author: Don Penney <email address hidden>
Date: Thu Feb 21 11:33:30 2019 -0500

    Ignore error on k8s taint removal from puppet

    There are cases where the kubernetes taint is not present on,
    or has already been removed from, a newly configured standby
    controller. This causes the taint removal command run by the
    puppet manifest to fail. This failure can be safely ignored,
    so the command is updated by this commit to always return
    success.

    Change-Id: Icdb55738e052c65a28e44582e345038b0de83c37
    Closes-Bug: 1815795
    Signed-off-by: Don Penney <email address hidden>
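
(The review itself is not reproduced here. One common way to make the taint removal "always return success", as described above, is to force a zero exit status on the command run by the exec, for example; the exact wording of the merged change may differ:)

    # Illustrative only: ignore a "taint not found" error so the manifest
    # apply does not fail when the taint is already absent. Node name as in
    # the logs above.
    kubectl --kubeconfig=/etc/kubernetes/admin.conf taint node controller-1 node-role.kubernetes.io/master- || true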

Changed in starlingx:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to stx-config (f/stein)

Fix proposed to branch: f/stein
Review: https://review.openstack.org/638513

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to stx-config (f/stein)

Reviewed: https://review.openstack.org/638513
Committed: https://git.openstack.org/cgit/openstack/stx-config/commit/?id=160ec4eca9b999c7dfc1c0a60d40c41998d1e9ed
Submitter: Zuul
Branch: f/stein

commit 52a829d1803056da8222f30dcc002c39c86c6f54
Author: Matt Peters <email address hidden>
Date: Thu Feb 21 11:20:15 2019 -0500

    Temporarily disable iptables restore during puppet

    Docker and kubernetes add rules to iptables, which can end up
    persisted in /etc/sysconfig/iptables by calls to iptables-save.
    When the puppet manifest is applied during node initialization,
    kubernetes is not yet running, and any related iptables rules
    will fail.

    This update disables the restoration of iptables rules from
    previous boots, to ensure the puppet manifest does not fail
    to apply due to invalid rules. However, this means that in
    a DOR scenario (Dead Office Recovery, where both controllers
    will be initializing at the same time), the firewall rules
    will not get reapplied.

    Firewall management will be moved to Calico under story 2005066,
    at which point this code will be removed.

    Change-Id: I43369dba34e6859088af3794de25a68571c7154c
    Closes-Bug: 1815124
    Signed-off-by: Don Penney <email address hidden>
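
(For context only, not taken from that review: one generic way to disable the restore of previously saved rules is to gate it behind a flag, for example:)

    # Purely illustrative; RESTORE_SAVED_IPTABLES is a hypothetical switch.
    # Skip restoring rules saved on a previous boot, which may reference
    # docker/kubernetes chains that do not exist yet at manifest-apply time.
    if [ "${RESTORE_SAVED_IPTABLES:-no}" = "yes" ] && [ -f /etc/sysconfig/iptables ]; then
        iptables-restore < /etc/sysconfig/iptables
    fi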

commit cba2b66e9b27efc077b89fb5e661b8dffc890fd8
Author: Erich Cordoba <email address hidden>
Date: Thu Feb 21 11:21:28 2019 -0600

    Move DNS requirement into kubernetes::master

    This was causing a failure in the compute unlock process where the
    Platform::Dns class could not be found.

    Closes-bug: 1817126
    Change-Id: I0a9e9b60580944a49b9672803fc05216f204b222
    Signed-off-by: Erich Cordoba <email address hidden>

commit 4b35404d6a03c4bfe6ea12e176d8624710a10b2c
Author: Don Penney <email address hidden>
Date: Thu Feb 21 11:33:30 2019 -0500

    Ignore error on k8s taint removal from puppet

    There are cases where the kubernetes taint is not present on,
    or has already been removed from, a newly configured standby
    controller. This causes the taint removal command run by the
    puppet manifest to fail. This failure can be safely ignored,
    so the command is updated by this commit to always return
    success.

    Change-Id: Icdb55738e052c65a28e44582e345038b0de83c37
    Closes-Bug: 1815795
    Signed-off-by: Don Penney <email address hidden>

tags: added: in-f-stein
Ken Young (kenyis)
tags: added: stx.2.0
removed: stx.2019.05