Brief Description
-----------------
The Armada pod fails to be scheduled on the control plane node because the default taint "node-role.kubernetes.io/master:NoSchedule" applied by kubeadm was not removed by the Ansible bootstrap task "Remove taint from master node", which is meant to allow pods to be scheduled on the master node.
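The bootstrap task presumably wraps `kubectl taint nodes ... node-role.kubernetes.io/master-`. Since the failure is intermittent, a retried, verified removal is one way to harden it; the sketch below is illustrative only (the function name, retry count, and delay are not taken from the StarlingX playbook):

```shell
# Hedged sketch: retried, idempotent removal of the master NoSchedule taint.
# remove_master_taint and the retry/delay values are illustrative.
remove_master_taint() {
    local node="$1" attempt
    for attempt in 1 2 3 4 5; do
        # kubectl exits non-zero when the taint is already absent,
        # so verify the node state before retrying.
        if kubectl taint nodes "$node" node-role.kubernetes.io/master- 2>/dev/null; then
            return 0
        fi
        if ! kubectl describe node "$node" | grep -q 'node-role.kubernetes.io/master:NoSchedule'; then
            return 0    # taint already gone; treat as success
        fi
        sleep 2
    done
    return 1
}
```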
Severity
--------
Major
Steps to Reproduce
------------------
Hard to reproduce; it has been observed when bootstrapping 50 subclouds at a time.
Expected Behavior
------------------
Ansible bootstrap completes
Actual Behavior
----------------
Armada pod creation fails; the pod stays in Pending because it cannot be scheduled.
Reproducibility
---------------
Intermittent
System Configuration
--------------------
Any
Timestamp/Logs
--------------
[sysadmin@controller-0 ~(keystone_admin)]$ kubectl describe pod -n armada armada-api-7b95f799f4-brkvs
Name:           armada-api-7b95f799f4-brkvs
Namespace:      armada
Priority:       0
Node:           <none>
Labels:         application=armada
                component=api
                pod-template-hash=7b95f799f4
                release_group=armada
Annotations:    configmap-bin-hash: 18bd6a6f166ebd091de412ec635cc785b5eaff9e26242fa0e8c77bb0d88046b0
                configmap-etc-hash: 0196a2b125d15f739c2a432c12b290e6825ecc6c7ccd7eae2ff3e5415b53dd42
                openstackhelm.openstack.org/release_uuid:
                prometheus.io/path: /api/v1.0/metrics
                prometheus.io/port: 8000
                prometheus.io/scrape: true
Status:         Pending
IP:
IPs:            <none>
Controlled By:  ReplicaSet/armada-api-7b95f799f4
Init Containers:
  init:
    Image:      registry.local:9001/quay.io/stackanetes/kubernetes-entrypoint:v0.3.1
    Port:       <none>
    Host Port:  <none>
    Command:
      kubernetes-entrypoint
    Environment:
      POD_NAME:                   armada-api-7b95f799f4-brkvs (v1:metadata.name)
      NAMESPACE:                  armada (v1:metadata.namespace)
      INTERFACE_NAME:             eth0
      PATH:                       /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/
      DEPENDENCY_SERVICE:
      DEPENDENCY_DAEMONSET:
      DEPENDENCY_CONTAINER:
      DEPENDENCY_POD_JSON:
      DEPENDENCY_CUSTOM_RESOURCE:
      COMMAND:                    echo done
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from armada-api-token-g64wn (ro)
Containers:
  armada-api:
    Image:        registry.local:9001/quay.io/airshipit/armada:8a1638098f88d92bf799ef4934abe569789b885e-ubuntu_bionic
    Port:         8000/TCP
    Host Port:    0/TCP
    Environment:  <none>
    Mounts:
      /etc/armada from pod-etc-armada (rw)
      /etc/armada/api-paste.ini from armada-etc (ro,path="api-paste.ini")
      /etc/armada/armada.conf from armada-etc (ro,path="armada.conf")
      /etc/armada/policy.yaml from armada-etc (ro,path="policy.yaml")
      /tmp from pod-tmp (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from armada-api-token-g64wn (ro)
  tiller:
    Image:      registry.local:9001/gcr.io/kubernetes-helm/tiller:v2.16.1
    Port:       24134/TCP
    Host Port:  0/TCP
    Command:
      /tiller --storage=sql --sql-dialect=postgres --sql-connection-string=postgresql://admin-helmv2:sYizMNUPW1L=i*Lt@[2620:10a:a001:ac01::422]:5432/helmv2?sslmode=disable -listen :24134 -probe-listen :24135 -logtostderr -v 5
    Liveness:   http-get http://:24135/liveness delay=1s timeout=1s period=10s #success=1 #failure=3
    Readiness:  http-get http://:24135/readiness delay=1s timeout=1s period=10s #success=1 #failure=3
    Environment:
      TILLER_NAMESPACE:    kube-system
      TILLER_HISTORY_MAX:  0
    Mounts:
      /tmp from tiller-tmp (rw)
      /tmp/.kube from kubernetes-client-cache (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from armada-api-token-g64wn (ro)
Conditions:
  Type          Status
  PodScheduled  False
Volumes:
  pod-tmp:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:  <unset>
  pod-etc-armada:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:  <unset>
  armada-bin:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      armada-bin
    Optional:  false
  armada-etc:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      armada-etc
    Optional:  false
  kubernetes-client-cache:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:  <unset>
  tiller-tmp:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:  <unset>
  armada-api-token-g64wn:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  armada-api-token-g64wn
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  armada=enabled
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 30s
                 node.kubernetes.io/unreachable:NoExecute for 30s
Events:
  Type     Reason            Age                   From               Message
  ----     ------            ----                  ----               -------
  Warning  FailedScheduling  20s (x109 over 160m)  default-scheduler  0/1 nodes are available: 1 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate.
"kubectl get nodes -o json" shows that the node still carries the "node-role.kubernetes.io/master:NoSchedule" taint:
"spec": {
    "podCIDR": "dead:beef::/80",
    "podCIDRs": [
        "dead:beef::/80"
    ],
    "taints": [
        {
            "effect": "NoSchedule",
            "key": "node-role.kubernetes.io/master"
        }
    ]
},
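The same check can be scripted when triaging many subclouds. The helper below is a minimal sketch (the function name is assumed, and `grep` stands in for a proper JSON query such as `kubectl`'s jsonpath output or `jq`):

```shell
# Minimal check (illustrative): reads node JSON on stdin and reports
# whether the master taint key is still present.
has_master_taint() {
    grep -q '"key": "node-role.kubernetes.io/master"'
}
```

Usage against a live cluster might look like:
`kubectl get node controller-0 -o json | has_master_taint && echo "taint still present"`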
Test Activity
-------------
Developer Testing
Workaround
----------
Manually remove the taint:
kubectl taint nodes controller-0 node-role.kubernetes.io/master-
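Note that `kubectl taint` exits non-zero when the taint has already been removed, so a script applying this workaround across subclouds should tolerate that case. A sketch with an assumed wrapper name:

```shell
# Illustrative wrapper around the manual workaround: tolerate the case
# where the taint is already absent (kubectl then exits non-zero).
untaint_node() {
    local node="${1:-controller-0}"
    if kubectl taint nodes "$node" node-role.kubernetes.io/master- 2>/dev/null; then
        echo "taint removed from $node"
    else
        echo "taint already absent on $node (or removal failed)"
    fi
}
```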
Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/ansible-playbooks/+/791831