Armada FailedScheduling during subcloud deployment

Bug #1928722 reported by Angie Wang
Affects    Status        Importance  Assigned to  Milestone
StarlingX  Fix Released  Medium      Angie Wang

Bug Description

Brief Description
-----------------
The Armada pod fails to be scheduled on the control-plane node because the default taint "node-role.kubernetes.io/master:NoSchedule" applied by kubeadm was not removed by the Ansible bootstrap task "Remove taint from master node", which normally removes it so that pods can be scheduled on the master node.
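
A quick way to confirm whether the default taint is still on the node (a minimal check, assuming the node name controller-0 as in the logs below):

kubectl get node controller-0 -o jsonpath='{.spec.taints}'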

Severity
--------
Major

Steps to Reproduce
------------------
It's hard to reproduce. This happens when bootstrapping 50 subclouds at a time.

Expected Behavior
------------------
Ansible bootstrap completes successfully

Actual Behavior
----------------
The armada-api pod fails to be scheduled and remains in Pending state

Reproducibility
---------------
Intermittent

System Configuration
--------------------
Any

Timestamp/Logs
--------------
[sysadmin@controller-0 ~(keystone_admin)]$ kubectl describe pod -n armada armada-api-7b95f799f4-brkvs
Name:           armada-api-7b95f799f4-brkvs
Namespace:      armada
Priority:       0
Node:           <none>
Labels:         application=armada
                component=api
                pod-template-hash=7b95f799f4
                release_group=armada
Annotations:    configmap-bin-hash: 18bd6a6f166ebd091de412ec635cc785b5eaff9e26242fa0e8c77bb0d88046b0
                configmap-etc-hash: 0196a2b125d15f739c2a432c12b290e6825ecc6c7ccd7eae2ff3e5415b53dd42
                openstackhelm.openstack.org/release_uuid:
                prometheus.io/path: /api/v1.0/metrics
                prometheus.io/port: 8000
                prometheus.io/scrape: true
Status:         Pending
IP:
IPs:            <none>
Controlled By:  ReplicaSet/armada-api-7b95f799f4
Init Containers:
  init:
    Image:      registry.local:9001/quay.io/stackanetes/kubernetes-entrypoint:v0.3.1
    Port:       <none>
    Host Port:  <none>
    Command:
      kubernetes-entrypoint
    Environment:
      POD_NAME:                   armada-api-7b95f799f4-brkvs (v1:metadata.name)
      NAMESPACE:                  armada (v1:metadata.namespace)
      INTERFACE_NAME:             eth0
      PATH:                       /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/
      DEPENDENCY_SERVICE:
      DEPENDENCY_DAEMONSET:
      DEPENDENCY_CONTAINER:
      DEPENDENCY_POD_JSON:
      DEPENDENCY_CUSTOM_RESOURCE:
      COMMAND:                    echo done
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from armada-api-token-g64wn (ro)
Containers:
  armada-api:
    Image:      registry.local:9001/quay.io/airshipit/armada:8a1638098f88d92bf799ef4934abe569789b885e-ubuntu_bionic
    Port:       8000/TCP
    Host Port:  0/TCP
    Environment:  <none>
    Mounts:
      /etc/armada from pod-etc-armada (rw)
      /etc/armada/api-paste.ini from armada-etc (ro,path="api-paste.ini")
      /etc/armada/armada.conf from armada-etc (ro,path="armada.conf")
      /etc/armada/policy.yaml from armada-etc (ro,path="policy.yaml")
      /tmp from pod-tmp (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from armada-api-token-g64wn (ro)
  tiller:
    Image:      registry.local:9001/gcr.io/kubernetes-helm/tiller:v2.16.1
    Port:       24134/TCP
    Host Port:  0/TCP
    Command:
      /tiller
      --storage=sql
      --sql-dialect=postgres
      --sql-connection-string=postgresql://admin-helmv2:sYizMNUPW1L=i*Lt@[2620:10a:a001:ac01::422]:5432/helmv2?sslmode=disable
      -listen :24134
      -probe-listen :24135
      -logtostderr
      -v 5
    Liveness:   http-get http://:24135/liveness delay=1s timeout=1s period=10s #success=1 #failure=3
    Readiness:  http-get http://:24135/readiness delay=1s timeout=1s period=10s #success=1 #failure=3
    Environment:
      TILLER_NAMESPACE:    kube-system
      TILLER_HISTORY_MAX:  0
    Mounts:
      /tmp from tiller-tmp (rw)
      /tmp/.kube from kubernetes-client-cache (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from armada-api-token-g64wn (ro)
Conditions:
  Type           Status
  PodScheduled   False
Volumes:
  pod-tmp:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:  <unset>
  pod-etc-armada:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:  <unset>
  armada-bin:
    Type:       ConfigMap (a volume populated by a ConfigMap)
    Name:       armada-bin
    Optional:   false
  armada-etc:
    Type:       ConfigMap (a volume populated by a ConfigMap)
    Name:       armada-etc
    Optional:   false
  kubernetes-client-cache:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:  <unset>
  tiller-tmp:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:  <unset>
  armada-api-token-g64wn:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  armada-api-token-g64wn
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  armada=enabled
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 30s
                 node.kubernetes.io/unreachable:NoExecute for 30s
Events:
  Type     Reason            Age                   From               Message
  ----     ------            ----                  ----               -------
  Warning  FailedScheduling  20s (x109 over 160m)  default-scheduler  0/1 nodes are available: 1 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate.

"kubectl get nodes -o json" shows that the node still has "node-role.kubernetes.io/master: NoSchedule" tainted.

"spec": {
        "podCIDR": "dead:beef::/80",
        "podCIDRs": [
        "dead:beef::/80"
        ],
        "taints": [

{ "effect": "NoSchedule", "key": "node-role.kubernetes.io/master" }
        ]
},

Test Activity
-------------
Developer Testing

Workaround
----------
Manually remove the taint:
kubectl taint nodes controller-0 node-role.kubernetes.io/master-
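
Then confirm the taint is gone (a quick check, assuming the same node name):

kubectl describe node controller-0 | grep -i taints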

Angie Wang (angiewang)
Changed in starlingx:
assignee: nobody → Angie Wang (angiewang)
OpenStack Infra (hudson-openstack) wrote : Fix proposed to ansible-playbooks (master)
Changed in starlingx:
status: New → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to ansible-playbooks (master)

Reviewed: https://review.opendev.org/c/starlingx/ansible-playbooks/+/791831
Committed: https://opendev.org/starlingx/ansible-playbooks/commit/cfc719b82a6f1651a2b3950b316244f907d58491
Submitter: "Zuul (22348)"
Branch: master

commit cfc719b82a6f1651a2b3950b316244f907d58491
Author: Angie Wang <email address hidden>
Date: Mon May 17 17:11:12 2021 -0400

    Configure kubeadm to not apply the default taint

    The taint "node-role.kubernetes.io/master:NoSchedule" needs
    to be removed from the master node so that pods can be scheduled
    on it. This is handled by a bootstrap task. However, an issue
    was seen where the default taint was not removed during bootstrap,
    which caused the armada pod to fail to be scheduled on controller-0.
    This happened on one of the subclouds when bootstrapping a batch
    of 50 subclouds.

    Add configuration in kubeadm to not apply the default taint
    at the beginning so it doesn't need to be removed afterwards.

    Tested: AIO-SX, DX upgrade and a batch deployment of 50 subclouds

    Change-Id: I543280ddd55ec94ccf0586dc07877349baa06bdd
    Closes-Bug: 1928722
    Signed-off-by: Angie Wang <email address hidden>
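
For reference, a minimal sketch of the kind of kubeadm configuration the commit describes (assuming kubeadm's v1beta2 InitConfiguration API; not necessarily the exact fragment that was merged). Setting nodeRegistration.taints to an empty list tells kubeadm to skip the default control-plane taint:

    apiVersion: kubeadm.k8s.io/v1beta2
    kind: InitConfiguration
    nodeRegistration:
      # empty list: kubeadm does not apply node-role.kubernetes.io/master:NoSchedule
      taints: []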

Changed in starlingx:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix proposed to ansible-playbooks (f/centos8)
Ghada Khalil (gkhalil)
Changed in starlingx:
importance: Undecided → Medium
tags: added: stx.6.0 stx.containers
OpenStack Infra (hudson-openstack) wrote : Change abandoned on ansible-playbooks (f/centos8)

Change abandoned by "Chuck Short <email address hidden>" on branch: f/centos8
Review: https://review.opendev.org/c/starlingx/ansible-playbooks/+/792195

OpenStack Infra (hudson-openstack) wrote : Fix merged to ansible-playbooks (f/centos8)

Reviewed: https://review.opendev.org/c/starlingx/ansible-playbooks/+/794324
Committed: https://opendev.org/starlingx/ansible-playbooks/commit/163ec9989cc7360dba4c572b2c43effd10306048
Submitter: "Zuul (22348)"
Branch: f/centos8

commit 4e96b762f549aadb0291cc9bcf3352ae923e94eb
Author: Mihnea Saracin <email address hidden>
Date: Sat May 22 15:48:19 2021 +0000

    Revert "Restore host filesystems with collected sizes"

    This reverts commit 255488739efa4ac072424b19f2dbb7a3adb0254e.

    Reason for revert: Did a rework to fix https://bugs.launchpad.net/starlingx/+bug/1926591. The original problem was in puppet, and this fix in ansible was not good enough; it generated some other problems.

    Change-Id: Iea79701a874effecb7fe995ac468d50081d1a84f
    Depends-On: I55ae6954d24ba32e40c2e5e276ec17015d9bba44

commit c064aacc377c8bd5336ceab825d4bcbf5af0b5e8
Author: Angie Wang <email address hidden>
Date: Fri May 21 21:28:02 2021 -0400

    Ensure apiserver keys are present before extract from tarball

    This fixes an upgrade playbook issue that happens during an
    AIO-SX upgrade from stx4.0 to stx5.0 and was introduced by
    https://review.opendev.org/c/starlingx/ansible-playbooks/+/792093.
    The apiserver keys are not available on the stx4.0 side, so we need
    to ensure the keys under /etc/kubernetes/pki are present in the
    backed-up tarball before extracting; otherwise the playbook fails
    because the keys are not found in the archive.

    Change-Id: I8602f07d1b1041a7fd3fff21e6f9a422b9784ab5
    Closes-Bug: 928925
    Signed-off-by: Angie Wang <email address hidden>

commit 0261f22ff7c23d2a8608fe3b51725c9f29931281
Author: Don Penney <email address hidden>
Date: Thu May 20 23:09:07 2021 -0400

    Update SX to DX migration to wait for coredns config

    This commit updates the SX to DX migration playbook to wait after
    modifying the system mode to duplex until the runtime manifest that
    updates coredns config has completed. The playbook will wait for up to
    20 minutes to allow for the possibility that sysinv has multiple
    runtime manifests queued up, each of which could take several minutes.

    Depends-On: https://review.opendev.org/c/starlingx/stx-puppet/+/792494
    Depends-On: https://review.opendev.org/c/starlingx/config/+/792496
    Change-Id: I3bf94d3493ae20eeb16b3fdcb27576ee18c0dc4d
    Closes-Bug: 1929148
    Signed-off-by: Don Penney <email address hidden>

commit 7c4f17bd0d92fc1122823211e1c9787829d206a9
Author: Daniel Safta <email address hidden>
Date: Wed May 19 09:08:16 2021 +0000

    Fixed missing apiserver-etcd-client certs

    When controller-1 is the active controller
    the backup archive does not contain
    /etc/etcd/apiserver-etcd-client.{crt, key}

    This change adds a new task which brings
    the certs from /etc/kubernetes/pki

    Closes-bug: 1928925
    Signed-off-by: Daniel Safta <email address hidden>
    Change-Id: I3c68377603e1af9a71d104e5b1108e9582497a09

commit e221ef8fbe51aa6ca229b584fb5632fe512ad5cb
Author: David Sullivan <email address hidden>
Date: Wed May 19 16:01:27 2021 -0500

    Support boo...

tags: added: in-f-centos8