Armada FailedScheduling during subcloud deployment

Bug #1928722 reported by Angie Wang
Affects    Status        Importance  Assigned to  Milestone
StarlingX  Fix Released  Medium      Angie Wang

Bug Description

Brief Description
-----------------
The Armada pod fails to be scheduled on the control-plane node because the default taint "node-role.kubernetes.io/master:NoSchedule" applied by kubeadm was not removed by the Ansible bootstrap task "Remove taint from master node", which normally removes it so that pods can be scheduled on the master node.
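
A quick way to confirm whether the default taint is still on the node (a minimal check, assuming the node name controller-0 as in the logs below):

kubectl get node controller-0 -o jsonpath='{.spec.taints}'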

Severity
--------
Major

Steps to Reproduce
------------------
It's hard to reproduce. This happens when bootstrapping 50 subclouds at a time.

Expected Behavior
------------------
Ansible bootstrap completes successfully

Actual Behavior
----------------
The armada-api pod fails to be scheduled and remains in Pending state

Reproducibility
---------------
Intermittent

System Configuration
--------------------
Any

Timestamp/Logs
--------------
[sysadmin@controller-0 ~(keystone_admin)]$ kubectl describe pod -n armada armada-api-7b95f799f4-brkvs
Name:           armada-api-7b95f799f4-brkvs
Namespace:      armada
Priority:       0
Node:           <none>
Labels:         application=armada
                component=api
                pod-template-hash=7b95f799f4
                release_group=armada
Annotations:    configmap-bin-hash: 18bd6a6f166ebd091de412ec635cc785b5eaff9e26242fa0e8c77bb0d88046b0
                configmap-etc-hash: 0196a2b125d15f739c2a432c12b290e6825ecc6c7ccd7eae2ff3e5415b53dd42
                openstackhelm.openstack.org/release_uuid:
                prometheus.io/path: /api/v1.0/metrics
                prometheus.io/port: 8000
                prometheus.io/scrape: true
Status:         Pending
IP:
IPs:            <none>
Controlled By:  ReplicaSet/armada-api-7b95f799f4
Init Containers:
  init:
    Image:      registry.local:9001/quay.io/stackanetes/kubernetes-entrypoint:v0.3.1
    Port:       <none>
    Host Port:  <none>
    Command:
      kubernetes-entrypoint
    Environment:
      POD_NAME:                   armada-api-7b95f799f4-brkvs (v1:metadata.name)
      NAMESPACE:                  armada (v1:metadata.namespace)
      INTERFACE_NAME:             eth0
      PATH:                       /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/
      DEPENDENCY_SERVICE:
      DEPENDENCY_DAEMONSET:
      DEPENDENCY_CONTAINER:
      DEPENDENCY_POD_JSON:
      DEPENDENCY_CUSTOM_RESOURCE:
      COMMAND:                    echo done
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from armada-api-token-g64wn (ro)
Containers:
  armada-api:
    Image:      registry.local:9001/quay.io/airshipit/armada:8a1638098f88d92bf799ef4934abe569789b885e-ubuntu_bionic
    Port:       8000/TCP
    Host Port:  0/TCP
    Environment:  <none>
    Mounts:
      /etc/armada from pod-etc-armada (rw)
      /etc/armada/api-paste.ini from armada-etc (ro,path="api-paste.ini")
      /etc/armada/armada.conf from armada-etc (ro,path="armada.conf")
      /etc/armada/policy.yaml from armada-etc (ro,path="policy.yaml")
      /tmp from pod-tmp (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from armada-api-token-g64wn (ro)
  tiller:
    Image:      registry.local:9001/gcr.io/kubernetes-helm/tiller:v2.16.1
    Port:       24134/TCP
    Host Port:  0/TCP
    Command:
      /tiller
      --storage=sql
      --sql-dialect=postgres
      --sql-connection-string=postgresql://admin-helmv2:sYizMNUPW1L=i*Lt@[2620:10a:a001:ac01::422]:5432/helmv2?sslmode=disable
      -listen :24134
      -probe-listen :24135
      -logtostderr
      -v 5
    Liveness:   http-get http://:24135/liveness delay=1s timeout=1s period=10s #success=1 #failure=3
    Readiness:  http-get http://:24135/readiness delay=1s timeout=1s period=10s #success=1 #failure=3
    Environment:
      TILLER_NAMESPACE:    kube-system
      TILLER_HISTORY_MAX:  0
    Mounts:
      /tmp from tiller-tmp (rw)
      /tmp/.kube from kubernetes-client-cache (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from armada-api-token-g64wn (ro)
Conditions:
  Type           Status
  PodScheduled   False
Volumes:
  pod-tmp:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:  <unset>
  pod-etc-armada:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:  <unset>
  armada-bin:
    Type:       ConfigMap (a volume populated by a ConfigMap)
    Name:       armada-bin
    Optional:   false
  armada-etc:
    Type:       ConfigMap (a volume populated by a ConfigMap)
    Name:       armada-etc
    Optional:   false
  kubernetes-client-cache:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:  <unset>
  tiller-tmp:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:  <unset>
  armada-api-token-g64wn:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  armada-api-token-g64wn
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  armada=enabled
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 30s
                 node.kubernetes.io/unreachable:NoExecute for 30s
Events:
  Type     Reason            Age                   From               Message
  ----     ------            ----                  ----               -------
  Warning  FailedScheduling  20s (x109 over 160m)  default-scheduler  0/1 nodes are available: 1 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate.

"kubectl get nodes -o json" shows that the node still has "node-role.kubernetes.io/master: NoSchedule" tainted.

"spec": {
        "podCIDR": "dead:beef::/80",
        "podCIDRs": [
        "dead:beef::/80"
        ],
        "taints": [

{ "effect": "NoSchedule", "key": "node-role.kubernetes.io/master" }
        ]
},

Test Activity
-------------
Developer Testing

Workaround
----------
Manually remove the taint:
kubectl taint nodes controller-0 node-role.kubernetes.io/master-
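
Then confirm the taint is gone (a quick check, assuming the same node name):

kubectl describe node controller-0 | grep -i taints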

Angie Wang (angiewang)
Changed in starlingx:
assignee: nobody → Angie Wang (angiewang)
OpenStack Infra (hudson-openstack) wrote : Fix proposed to ansible-playbooks (master)
Changed in starlingx:
status: New → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to ansible-playbooks (master)

Reviewed: https://review.opendev.org/c/starlingx/ansible-playbooks/+/791831
Committed: https://opendev.org/starlingx/ansible-playbooks/commit/cfc719b82a6f1651a2b3950b316244f907d58491
Submitter: "Zuul (22348)"
Branch: master

commit cfc719b82a6f1651a2b3950b316244f907d58491
Author: Angie Wang <email address hidden>
Date: Mon May 17 17:11:12 2021 -0400

    Configure kubeadm to not apply the default taint

    The taint "node-role.kubernetes.io/master:NoSchedule" needs
    to be removed from the master node so that pods can be scheduled
    on it. This is handled by a bootstrap task. However, an issue
    was seen where the default taint was not removed during bootstrap,
    which caused the armada pod to fail to be scheduled on controller-0.
    This happened on one of the subclouds when bootstrapping a batch
    of 50 subclouds.

    Add configuration in kubeadm to not apply the default taint
    at the beginning so it doesn't need to be removed afterwards.

    Tested: AIO-SX, DX upgrade and a batch deployment of 50 subclouds

    Change-Id: I543280ddd55ec94ccf0586dc07877349baa06bdd
    Closes-Bug: 1928722
    Signed-off-by: Angie Wang <email address hidden>
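
For reference, a minimal sketch of the kind of kubeadm configuration the commit describes (assuming kubeadm's v1beta2 InitConfiguration API; not necessarily the exact fragment that was merged). Setting nodeRegistration.taints to an empty list tells kubeadm to skip the default control-plane taint:

    apiVersion: kubeadm.k8s.io/v1beta2
    kind: InitConfiguration
    nodeRegistration:
      # empty list: kubeadm does not apply node-role.kubernetes.io/master:NoSchedule
      taints: []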

Changed in starlingx:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix proposed to ansible-playbooks (f/centos8)
Ghada Khalil (gkhalil)
Changed in starlingx:
importance: Undecided → Medium
tags: added: stx.6.0 stx.containers
OpenStack Infra (hudson-openstack) wrote : Change abandoned on ansible-playbooks (f/centos8)

Change abandoned by "Chuck Short <email address hidden>" on branch: f/centos8
Review: https://review.opendev.org/c/starlingx/ansible-playbooks/+/792195

OpenStack Infra (hudson-openstack) wrote : Fix merged to ansible-playbooks (f/centos8)

Reviewed: https://review.opendev.org/c/starlingx/ansible-playbooks/+/794324
Committed: https://opendev.org/starlingx/ansible-playbooks/commit/163ec9989cc7360dba4c572b2c43effd10306048
Submitter: "Zuul (22348)"
Branch: f/centos8

commit 4e96b762f549aadb0291cc9bcf3352ae923e94eb
Author: Mihnea Saracin <email address hidden>
Date: Sat May 22 15:48:19 2021 +0000

    Revert "Restore host filesystems with collected sizes"

    This reverts commit 255488739efa4ac072424b19f2dbb7a3adb0254e.

    Reason for revert: Did a rework to fix https://bugs.launchpad.net/starlingx/+bug/1926591. The original problem was in puppet, and this fix in ansible was not good enough; it generated some other problems.

    Change-Id: Iea79701a874effecb7fe995ac468d50081d1a84f
    Depends-On: I55ae6954d24ba32e40c2e5e276ec17015d9bba44

commit c064aacc377c8bd5336ceab825d4bcbf5af0b5e8
Author: Angie Wang <email address hidden>
Date: Fri May 21 21:28:02 2021 -0400

    Ensure apiserver keys are present before extract from tarball

    This fixes an upgrade playbook issue that happens during an
    AIO-SX upgrade from stx4.0 to stx5.0 and was introduced by
    https://review.opendev.org/c/starlingx/ansible-playbooks/+/792093.
    The apiserver keys are not available on the stx4.0 side, so we need
    to ensure the keys under /etc/kubernetes/pki are present in the
    backed-up tarball before extracting; otherwise the playbook fails
    because the keys are not found in the archive.

    Change-Id: I8602f07d1b1041a7fd3fff21e6f9a422b9784ab5
    Closes-Bug: 928925
    Signed-off-by: Angie Wang <email address hidden>

commit 0261f22ff7c23d2a8608fe3b51725c9f29931281
Author: Don Penney <email address hidden>
Date: Thu May 20 23:09:07 2021 -0400

    Update SX to DX migration to wait for coredns config

    This commit updates the SX to DX migration playbook to wait after
    modifying the system mode to duplex until the runtime manifest that
    updates coredns config has completed. The playbook will wait for up to
    20 minutes to allow for the possibility that sysinv has multiple
    runtime manifests queued up, each of which could take several minutes.

    Depends-On: https://review.opendev.org/c/starlingx/stx-puppet/+/792494
    Depends-On: https://review.opendev.org/c/starlingx/config/+/792496
    Change-Id: I3bf94d3493ae20eeb16b3fdcb27576ee18c0dc4d
    Closes-Bug: 1929148
    Signed-off-by: Don Penney <email address hidden>

commit 7c4f17bd0d92fc1122823211e1c9787829d206a9
Author: Daniel Safta <email address hidden>
Date: Wed May 19 09:08:16 2021 +0000

    Fixed missing apiserver-etcd-client certs

    When controller-1 is the active controller
    the backup archive does not contain
    /etc/etcd/apiserver-etcd-client.{crt, key}

    This change adds a new task which brings
    the certs from /etc/kubernetes/pki

    Closes-bug: 1928925
    Signed-off-by: Daniel Safta <email address hidden>
    Change-Id: I3c68377603e1af9a71d104e5b1108e9582497a09

commit e221ef8fbe51aa6ca229b584fb5632fe512ad5cb
Author: David Sullivan <email address hidden>
Date: Wed May 19 16:01:27 2021 -0500

    Support boo...

tags: added: in-f-centos8