Title
-----
Containers: controllerconfig fails on controller-1 due to k8s taint missing
Brief Description
-----------------
I just encountered an issue (2+2 Pike load from Feb. 8th) where controllerconfig on controller-1 fails to remove the taint from the newly configured k8s node (it is not clear why the taint was not found), so the configuration fails. On the second configuration attempt it runs through the same setup steps, but kubeadm init has already been run, so it fails at that point because the ports are already in use.
Locking and unlocking the failed node just keeps failing in the same way; however, running kubeadm reset and deleting the node before trying again fixes the issue. We could either execute that cleanup after a failed configuration attempt to prepare the node for a retry, or check whether the init command has already succeeded before reattempting it.
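The manual recovery described above (kubeadm reset plus deleting the node) could be sketched as a cleanup step for a failed configure attempt. This is only an illustration, not the actual controllerconfig code; the function name is hypothetical, and the kubeconfig path is taken from the logs below.

```shell
# Hypothetical cleanup to make a failed "configure master node" step retryable.
cleanup_failed_k8s_init() {
    local node="$1"
    # Tear down anything a partial "kubeadm init" left behind (static pod
    # manifests, bound ports) so a retry starts from a clean state.
    kubeadm reset
    # Remove the half-configured node object from the cluster, if present.
    kubectl --kubeconfig=/etc/kubernetes/admin.conf delete node "$node" \
        --ignore-not-found
}
```

Running this after a failed configure attempt (e.g. `cleanup_failed_k8s_init controller-1`) would put the node in the same state the manual workaround produces.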
First failure logs:
2019-02-08T17:52:32.379 Notice: 2019-02-08 17:52:32 +0000 /Stage[main]/Platform::Kubernetes::Master::Init/Exec[restrict coredns to master nodes]/returns: deployment.extensions/coredns patched
2019-02-08T17:52:32.384 Notice: 2019-02-08 17:52:32 +0000 /Stage[main]/Platform::Kubernetes::Master::Init/Exec[restrict coredns to master nodes]/returns: executed successfully
2019-02-08T17:52:32.387 Debug: 2019-02-08 17:52:32 +0000 /Stage[main]/Platform::Kubernetes::Master::Init/Exec[restrict coredns to master nodes]: The container Class[Platform::Kubernetes::Master::Init] will propagate my refresh event
2019-02-08T17:52:32.398 Debug: 2019-02-08 17:52:32 +0000 Exec[remove taint from master node](provider=posix): Executing 'kubectl --kubeconfig=/etc/kubernetes/admin.conf taint node controller-1 node-role.kubernetes.io/master-'
2019-02-08T17:52:32.409 Debug: 2019-02-08 17:52:32 +0000 Executing: 'kubectl --kubeconfig=/etc/kubernetes/admin.conf taint node controller-1 node-role.kubernetes.io/master-'
2019-02-08T17:52:32.745 Notice: 2019-02-08 17:52:32 +0000 /Stage[main]/Platform::Kubernetes::Master::Init/Exec[remove taint from master node]/returns: error: taint "node-role.kubernetes.io/master:" not found
2019-02-08T17:52:32.764 Error: 2019-02-08 17:52:32 +0000 kubectl --kubeconfig=/etc/kubernetes/admin.conf taint node controller-1 node-role.kubernetes.io/master- returned 1 instead of one of [0]
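The taint removal above fails hard when the taint is absent. One way to make it idempotent, so a retry does not fail on "taint not found", is to check for the taint before removing it. This is a sketch only (the real logic lives in the Puppet manifest); the function name is hypothetical, and the node name and kubeconfig path are taken from the logs above.

```shell
# Sketch of an idempotent taint removal: only attempt the removal if the
# master taint is actually present on the node.
remove_master_taint() {
    local node="$1"
    local kc=/etc/kubernetes/admin.conf
    # List the keys of all taints on the node; remove only if ours is there.
    if kubectl --kubeconfig="$kc" get node "$node" \
            -o jsonpath='{.spec.taints[*].key}' \
            | grep -qw node-role.kubernetes.io/master; then
        kubectl --kubeconfig="$kc" taint node "$node" \
            node-role.kubernetes.io/master-
    fi
}
```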
Second failure logs:
2019-02-08T18:02:35.166 Notice: 2019-02-08 18:02:35 +0000 /Stage[main]/Platform::Kubernetes::Kubeadm/Exec[enable-kubelet]/returns: executed successfully
2019-02-08T18:02:35.168 Debug: 2019-02-08 18:02:35 +0000 /Stage[main]/Platform::Kubernetes::Kubeadm/Exec[enable-kubelet]: The container Class[Platform::Kubernetes::Kubeadm] will propagate my refresh event
2019-02-08T18:02:35.171 Debug: 2019-02-08 18:02:35 +0000 Class[Platform::Kubernetes::Kubeadm]: The container Stage[main] will propagate my refresh event
2019-02-08T18:02:35.173 Debug: 2019-02-08 18:02:35 +0000 Class[Platform::Kubernetes::Kubeadm]: The container Class[Platform::Kubernetes::Master] will propagate my refresh event
2019-02-08T18:02:35.177 Debug: 2019-02-08 18:02:35 +0000 Exec[configure master node](provider=posix): Executing 'kubeadm init --config=/etc/kubernetes/kubeadm.yaml'
2019-02-08T18:02:35.180 Debug: 2019-02-08 18:02:35 +0000 Executing: 'kubeadm init --config=/etc/kubernetes/kubeadm.yaml'
2019-02-08T18:02:35.618 Notice: 2019-02-08 18:02:35 +0000 /Stage[main]/Platform::Kubernetes::Master::Init/Exec[configure master node]/returns: [init] using Kubernetes version: v1.12.3
2019-02-08T18:02:35.621 Notice: 2019-02-08 18:02:35 +0000 /Stage[main]/Platform::Kubernetes::Master::Init/Exec[configure master node]/returns: [preflight] running pre-flight checks
2019-02-08T18:02:35.623 Notice: 2019-02-08 18:02:35 +0000 /Stage[main]/Platform::Kubernetes::Master::Init/Exec[configure master node]/returns: [WARNING SystemVerification]: this Docker version is not on the list of validated versions: 18.03.1-ce. Latest validated version: 18.06
2019-02-08T18:02:35.629 Notice: 2019-02-08 18:02:35 +0000 /Stage[main]/Platform::Kubernetes::Master::Init/Exec[configure master node]/returns: [preflight] Some fatal errors occurred:
2019-02-08T18:02:35.631 Notice: 2019-02-08 18:02:35 +0000 /Stage[main]/Platform::Kubernetes::Master::Init/Exec[configure master node]/returns: [ERROR Port-6443]: Port 6443 is in use
2019-02-08T18:02:35.633 Notice: 2019-02-08 18:02:35 +0000 /Stage[main]/Platform::Kubernetes::Master::Init/Exec[configure master node]/returns: [ERROR Port-10251]: Port 10251 is in use
2019-02-08T18:02:35.637 Notice: 2019-02-08 18:02:35 +0000 /Stage[main]/Platform::Kubernetes::Master::Init/Exec[configure master node]/returns: [ERROR Port-10252]: Port 10252 is in use
2019-02-08T18:02:35.640 Notice: 2019-02-08 18:02:35 +0000 /Stage[main]/Platform::Kubernetes::Master::Init/Exec[configure master node]/returns: [ERROR FileAvailable--etc-kubernetes-manifests-kube-apiserver.yaml]: /etc/kubernetes/manifests/kube-apiserver.yaml already exists
2019-02-08T18:02:35.643 Notice: 2019-02-08 18:02:35 +0000 /Stage[main]/Platform::Kubernetes::Master::Init/Exec[configure master node]/returns: [ERROR FileAvailable--etc-kubernetes-manifests-kube-controller-manager.yaml]: /etc/kubernetes/manifests/kube-controller-manager.yaml already exists
2019-02-08T18:02:35.645 Notice: 2019-02-08 18:02:35 +0000 /Stage[main]/Platform::Kubernetes::Master::Init/Exec[configure master node]/returns: [ERROR FileAvailable--etc-kubernetes-manifests-kube-scheduler.yaml]: /etc/kubernetes/manifests/kube-scheduler.yaml already exists
2019-02-08T18:02:35.648 Notice: 2019-02-08 18:02:35 +0000 /Stage[main]/Platform::Kubernetes::Master::Init/Exec[configure master node]/returns: [ERROR Port-10250]: Port 10250 is in use
2019-02-08T18:02:35.651 Notice: 2019-02-08 18:02:35 +0000 /Stage[main]/Platform::Kubernetes::Master::Init/Exec[configure master node]/returns: [preflight] If you know what you are doing, you can make a check non-fatal with `--ignore-preflight-errors=...`
2019-02-08T18:02:35.653 Error: 2019-02-08 18:02:35 +0000 kubeadm init --config=/etc/kubernetes/kubeadm.yaml returned 2 instead of one of [0]
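The other option mentioned above is to detect a previously completed kubeadm init before rerunning it, for example by checking for the files the preflight errors report as already existing. This is a sketch under that assumption; the function name is hypothetical and the real guard would live in the Puppet manifest.

```shell
# Sketch: skip "kubeadm init" on retry if a previous run already wrote the
# control-plane static pod manifests and the admin kubeconfig (the same
# files the preflight check above reports as "already exists").
configure_master_node() {
    if [ -f /etc/kubernetes/manifests/kube-apiserver.yaml ] \
            && [ -f /etc/kubernetes/admin.conf ]; then
        echo "kubeadm init already completed; skipping"
        return 0
    fi
    kubeadm init --config=/etc/kubernetes/kubeadm.yaml
}
```

Note this guard alone would not have helped here, since the first init appears to have only partially succeeded; it would need to be combined with a cleanup (kubeadm reset) on failure.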
Reproducibility
---------------
Intermittent
Input from Matt regarding this issue:
I did see this once and Bart and I tried to figure out what happened. From the logs it looked like the taint never existed, so either kubeadm didn’t add the taint even though the logs said it did, or it was a race condition.
I wasn’t able to reproduce it to debug it further.