StarlingX

AIO-SX upgrade_platform playbook fails waiting for armada-api pod

Bug #1928141 reported by Dan Voiculeasa on 2021-05-11

This bug affects 2 people

Affects		Status	Importance	Assigned to	Milestone
	StarlingX	Fix Released	Medium	Dan Voiculeasa

Bug Description

Investigations showed that tiller container started executing commands before NDP finished.

Severity
--------
Critical: System/Feature is not usable due to the defect

Steps to Reproduce
------------------
Run an AIO-SX upgrade.
In fact scale the armada-pod in a loop
kubectl --kubeconfig=/etc/kubernetes/admin.conf scale deployment -n armada armada-api --replicas=0
kubectl --kubeconfig=/etc/kubernetes/admin.conf scale deployment -n armada armada-api --replicas=1

Expected Behavior
------------------
Write down what was expected after taking the steps written above

Actual Behavior
----------------
State what is the actual behavior

Reproducibility
---------------
Once, but tiller problem can be reproduced when running a script in a loop fir several hours.

System Configuration
--------------------
AIO-SX IPv6

Branch/Pull Time/Commit
-----------------------
Any April 2021 load or older

Last Pass
---------
Not relevant

Timestamp/Logs
--------------

2021-04-27 20:30:13,563 p=21107 u=sysadmin | TASK [bootstrap/bringup-essential-services : Fail if any of the Kubernetes component, Networking or Armada pods are not ready by this time] ***************************************************************************

2021-04-27 20:30:13,698 p=21107 u=sysadmin | failed: [localhost] (item={'_ansible_parsed': True, 'stderr_lines': [u'error: timed out waiting for the condition on deployments/armada-api'], u'changed': True, u'stderr': u'error: timed out waiting for the condition on deployments/armada-api', u'ansible_job_id': u'168691713518.173381', u'stdout': u'', '_ansible_item_result': True, u'invocation': {u'module_args': {u'creates': None, u'executable': None, u'_uses_shell': False, u'_raw_params': u'kubectl --kubeconfig=/etc/kubernetes/admin.conf wait --namespace=armada --for=condition=Available deployment armada-api --timeout=30s', u'removes': None, u'argv': None, u'warn': True, u'chdir': None, u'stdin': None}}, 'attempts': 6, u'delta': u'0:00:30.082435', 'stdout_lines': [], 'failed_when_result': False, '_ansible_no_log': False, u'end': u'2021-04-27 20:30:09.129848', '_ansible_item_label': {'_ansible_parsed': True, '_ansible_item_result': True, '_ansible_item_label':
{u'namespace': u'armada', u'deployment': u'armada-api'}

, u'ansible_job_id': u'168691713518.173381', 'item': {u'namespace': u'armada', u'deployment': u'armada-api'}, u'started': 1, 'changed': True, 'failed': False, u'finished': 0, u'results_file': u'/root/.ansible_async/168691713518.173381', '_ansible_ignore_errors': None, '_ansible_no_log': False}, u'start': u'2021-04-27 20:29:39.047413', u'cmd': [u'kubectl', u'--kubeconfig=/etc/kubernetes/admin.conf', u'wait', u'--namespace=armada', u'--for=condition=Available', u'deployment', u'armada-api', u'--timeout=30s'], u'finished': 1, u'failed': False, 'item': {'_ansible_parsed': True, '_ansible_item_result': True, '_ansible_no_log': False, u'ansible_job_id': u'168691713518.173381', 'item':
{u'namespace': u'armada', u'deployment': u'armada-api'}

, u'started': 1, 'changed': True, 'failed': False, u'finished': 0, u'results_file': u'/root/.ansible_async/168691713518.173381', '_ansible_ignore_errors': None, '_ansible_item_label': {u'namespace': u'armada', u'deployment': u'armada-api'}}, u'rc': 1, u'msg': u'non-zero return code', '_ansible_ignore_errors': None}) => {"changed": false, "item": {"ansible_job_id": "168691713518.173381", "attempts": 6, "changed": true, "cmd": ["kubectl", "--kubeconfig=/etc/kubernetes/admin.conf", "wait", "--namespace=armada", "--for=condition=Available", "deployment", "armada-api", "--timeout=30s"], "delta": "0:00:30.082435", "end": "2021-04-27 20:30:09.129848", "failed": false, "failed_when_result": false, "finished": 1, "invocation": {"module_args": {"_raw_params": "kubectl --kubeconfig=/etc/kubernetes/admin.conf wait --namespace=armada --for=condition=Available deployment armada-api --timeout=30s", "_uses_shell": false, "argv": null, "chdir": null, "creates": null, "executable": null, "removes": null, "stdin": null, "warn": true}}, "item": {"ansible_job_id": "168691713518.173381", "changed": true, "failed": false, "finished": 0, "item":
{"deployment": "armada-api", "namespace": "armada"}

, "results_file": "/root/.ansible_async/168691713518.173381", "started": 1}, "msg": "non-zero return code", "rc": 1, "start": "2021-04-27 20:29:39.047413", "stderr": "error: timed out waiting for the condition on deployments/armada-api", "stderr_lines": ["error: timed out waiting for the condition on deployments/armada-api"], "stdout": "", "stdout_lines": []}, "msg": "Pod {u'namespace': u'armada', u'deployment': u'armada-api'} is still not ready."}

2021-04-27 20:30:13,715 p=21107 u=sysadmin | PLAY RECAP ************************************************************************************************************************************************************************************************************

2021-04-27 20:30:13,715 p=21107 u=sysadmin | localhost : ok=432 changed=239 unreachable=0 failed=1

Test Activity
-------------
Testing upgrades

Workaround
-------------
Reinstall the iso and retry upgrade_platform.yaml

Tags:

CVE References

Dan Voiculeasa (dvoicule) on 2021-05-11

Changed in starlingx:
assignee:	nobody → Dan Voiculeasa (dvoicule)

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2021-05-11: Fix proposed to ansible-playbooks (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/ansible-playbooks/+/790863

Changed in starlingx:
status:	New → In Progress

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2021-05-11: Fix proposed to integ (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/integ/+/790864

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2021-05-12: Fix merged to ansible-playbooks (master)

Reviewed: https://review.opendev.org/c/starlingx/ansible-playbooks/+/790863
Committed: https://opendev.org/starlingx/ansible-playbooks/commit/36451c99ce76e76084ad5c68e4954bf347e8c0b7
Submitter: "Zuul (22348)"
Branch: master

commit 36451c99ce76e76084ad5c68e4954bf347e8c0b7
Author: Dan Voiculeasa <email address hidden>
Date: Tue May 11 16:24:26 2021 +0300

Add helm sql database ip to armada overrides

This will be used by tiller container to check that the container
networking is properly set up.

    Partial-Bug: 1928141
    Signed-off-by: Dan Voiculeasa <email address hidden>
    Change-Id: I177bb628497611eb64472291a04d635856c26590

Changed in starlingx:
status:	In Progress → Fix Released

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2021-05-12: Fix merged to integ (master)

Reviewed: https://review.opendev.org/c/starlingx/integ/+/790864
Committed: https://opendev.org/starlingx/integ/commit/764cac1642a8820d169576da3d8d886449d3cf73
Submitter: "Zuul (22348)"
Branch: master

commit 764cac1642a8820d169576da3d8d886449d3cf73
Author: Dan Voiculeasa <email address hidden>
Date: Tue May 11 17:04:01 2021 +0000

Armada: Fix tiller stuck connecting to postgres database

    Tiller may start executing before IPv6 network is fully initialized.
    This will result in tiller not being fully functional.
    The liveness probe will detect that tiller didn't start properly and
    restart it. But this might happen an unlimited number of times in a row.

    Wait until ping is succesful to the ip of the postgres database.
    This ensures that networking finished setting up.
    Credits to Cole Walker <email address hidden> for proposing the
    idea.

    Depends-On: I177bb628497611eb64472291a04d635856c26590
    Closes-Bug: 1928141
    Signed-off-by: Dan Voiculeasa <email address hidden>
    Change-Id: I9c5be3f30fad2650e6aa53fb80ef44f7798813ed

Ghada Khalil (gkhalil) on 2021-05-14

Changed in starlingx:
importance:	Undecided → Medium
tags:	added: stx.6.0 stx.update

Revision history for this message

Frank Miller (sensfan22) wrote on 2021-05-17:

Added stx.5.0 tab as this is impacting stx.5.0 and recommendation is to cherrypick the fix for this LP to the r/stx.5.0 branch.

tags:

added: stx.5.0

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2021-05-17: Fix proposed to integ (r/stx.5.0)

Fix proposed to branch: r/stx.5.0
Review: https://review.opendev.org/c/starlingx/integ/+/791778

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2021-05-17: Fix proposed to ansible-playbooks (r/stx.5.0)

Fix proposed to branch: r/stx.5.0
Review: https://review.opendev.org/c/starlingx/ansible-playbooks/+/791779

Ghada Khalil (gkhalil) on 2021-05-17

tags:

added: stx.cherrypickneeded

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2021-05-17: Fix merged to ansible-playbooks (r/stx.5.0)

Reviewed: https://review.opendev.org/c/starlingx/ansible-playbooks/+/791779
Committed: https://opendev.org/starlingx/ansible-playbooks/commit/64193efad88529418fa14a735cdb628fb3c6b3ec
Submitter: "Zuul (22348)"
Branch: r/stx.5.0

commit 64193efad88529418fa14a735cdb628fb3c6b3ec
Author: Dan Voiculeasa <email address hidden>
Date: Tue May 11 16:24:26 2021 +0300

Add helm sql database ip to armada overrides

This will be used by tiller container to check that the container
networking is properly set up.

    Partial-Bug: 1928141
    Signed-off-by: Dan Voiculeasa <email address hidden>
    Change-Id: I177bb628497611eb64472291a04d635856c26590
    (cherry picked from commit 36451c99ce76e76084ad5c68e4954bf347e8c0b7)

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2021-05-17: Fix merged to integ (r/stx.5.0)

Reviewed: https://review.opendev.org/c/starlingx/integ/+/791778
Committed: https://opendev.org/starlingx/integ/commit/136f63995268d8c41c5cf651ec97e37dc156f49e
Submitter: "Zuul (22348)"
Branch: r/stx.5.0

commit 136f63995268d8c41c5cf651ec97e37dc156f49e
Author: Dan Voiculeasa <email address hidden>
Date: Tue May 11 17:04:01 2021 +0000

Armada: Fix tiller stuck connecting to postgres database

    Wait until ping is succesful to the ip of the postgres database.
    This ensures that networking finished setting up.
    Credits to Cole Walker <email address hidden> for proposing the
    idea.

    Depends-On: I177bb628497611eb64472291a04d635856c26590
    Closes-Bug: 1928141
    Signed-off-by: Dan Voiculeasa <email address hidden>
    Change-Id: I9c5be3f30fad2650e6aa53fb80ef44f7798813ed
    (cherry picked from commit 764cac1642a8820d169576da3d8d886449d3cf73)

Bill Zvonar (billzvonar) on 2021-05-18

tags:

added: in-r-stx50
removed: stx.cherrypickneeded

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2021-05-19: Fix proposed to ansible-playbooks (f/centos8)

#10

Fix proposed to branch: f/centos8
Review: https://review.opendev.org/c/starlingx/ansible-playbooks/+/792195

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2021-05-31: Fix proposed to integ (f/centos8)

#11

Fix proposed to branch: f/centos8
Review: https://review.opendev.org/c/starlingx/integ/+/793754

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2021-06-02: Fix proposed to ansible-playbooks (f/centos8)

#12

Fix proposed to branch: f/centos8
Review: https://review.opendev.org/c/starlingx/ansible-playbooks/+/794324

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2021-06-02: Change abandoned on ansible-playbooks (f/centos8)

#13

Change abandoned by "Chuck Short <email address hidden>" on branch: f/centos8
Review: https://review.opendev.org/c/starlingx/ansible-playbooks/+/792195

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2021-06-03: Fix merged to ansible-playbooks (f/centos8)

#14

Download full text (52.5 KiB)

Reviewed: https://review.opendev.org/c/starlingx/ansible-playbooks/+/794324
Committed: https://opendev.org/starlingx/ansible-playbooks/commit/163ec9989cc7360dba4c572b2c43effd10306048
Submitter: "Zuul (22348)"
Branch: f/centos8

commit 4e96b762f549aadb0291cc9bcf3352ae923e94eb
Author: Mihnea Saracin <email address hidden>
Date: Sat May 22 15:48:19 2021 +0000

Revert "Restore host filesystems with collected sizes"

This reverts commit 255488739efa4ac072424b19f2dbb7a3adb0254e.

Reason for revert: Did a rework to fix https://bugs.launchpad.net/starlingx/+bug/1926591. The original problem was in puppet, and this fix in ansible was not good enough, it generated some other problems.

Change-Id: Iea79701a874effecb7fe995ac468d50081d1a84f
Depends-On: I55ae6954d24ba32e40c2e5e276ec17015d9bba44

commit c064aacc377c8bd5336ceab825d4bcbf5af0b5e8
Author: Angie Wang <email address hidden>
Date: Fri May 21 21:28:02 2021 -0400

Ensure apiserver keys are present before extract from tarball

    This is to fix the upgrade playbook issue that happens during
    AIO-SX upgrade from stx4.0 to stx5.0 which introduced by
    https://review.opendev.org/c/starlingx/ansible-playbooks/+/792093.
    The apiserver keys are not available in stx4.0 side so we need
    to ensure the keys under /etc/kubernetes/pki are present in the
    backed-up tarball before extracting, otherwise playbook fails
    because the keys are not found in the archive.

    Change-Id: I8602f07d1b1041a7fd3fff21e6f9a422b9784ab5
    Closes-Bug: 928925
    Signed-off-by: Angie Wang <email address hidden>

commit 0261f22ff7c23d2a8608fe3b51725c9f29931281
Author: Don Penney <email address hidden>
Date: Thu May 20 23:09:07 2021 -0400

Update SX to DX migration to wait for coredns config

    This commit updates the SX to DX migration playbook to wait after
    modifying the system mode to duplex until the runtime manifest that
    updates coredns config has completed. The playbook will wait for up to
    20 minutes to allow for the possibilty that sysinv has multiple
    runtime manifests queued up, each of which could take several minutes.

    Depends-On: https://review.opendev.org/c/starlingx/stx-puppet/+/792494
    Depends-On: https://review.opendev.org/c/starlingx/config/+/792496
    Change-Id: I3bf94d3493ae20eeb16b3fdcb27576ee18c0dc4d
    Closes-Bug: 1929148
    Signed-off-by: Don Penney <email address hidden>

commit 7c4f17bd0d92fc1122823211e1c9787829d206a9
Author: Daniel Safta <email address hidden>
Date: Wed May 19 09:08:16 2021 +0000

Fixed missing apiserver-etcd-client certs

    When controller-1 is the active controller
    the backup archive does not contain
    /etc/etcd/apiserver-etcd-client.{crt, key}

This change adds a new task which brings
the certs from /etc/kubernetes/pki

    Closes-bug: 1928925
    Signed-off-by: Daniel Safta <email address hidden>
    Change-Id: I3c68377603e1af9a71d104e5b1108e9582497a09

commit e221ef8fbe51aa6ca229b584fb5632fe512ad5cb
Author: David Sullivan <email address hidden>
Date: Wed May 19 16:01:27 2021 -0500

Support boo...

Reviewed:  https://review.opendev.org/c/starlingx/ansible-playbooks/+/794324
Committed: https://opendev.org/starlingx/ansible-playbooks/commit/163ec9989cc7360dba4c572b2c43effd10306048
Submitter: "Zuul (22348)"
Branch:    f/centos8

commit 4e96b762f549aadb0291cc9bcf3352ae923e94eb
Author: Mihnea Saracin <Mihnea.Saracin@windriver.com>
Date:   Sat May 22 15:48:19 2021 +0000

Revert "Restore host filesystems with collected sizes"
    
    This reverts commit 255488739efa4ac072424b19f2dbb7a3adb0254e.
    
    Reason for revert: Did a rework to fix https://bugs.launchpad.net/starlingx/+bug/1926591. The original problem was in puppet, and this fix in ansible was not good enough, it generated some other problems.
    
    Change-Id: Iea79701a874effecb7fe995ac468d50081d1a84f
    Depends-On: I55ae6954d24ba32e40c2e5e276ec17015d9bba44

commit c064aacc377c8bd5336ceab825d4bcbf5af0b5e8
Author: Angie Wang <angie.wang@windriver.com>
Date:   Fri May 21 21:28:02 2021 -0400

Ensure apiserver keys are present before extract from tarball
    
    This is to fix the upgrade playbook issue that happens during
    AIO-SX upgrade from stx4.0 to stx5.0 which introduced by
    https://review.opendev.org/c/starlingx/ansible-playbooks/+/792093.
    The apiserver keys are not available in stx4.0 side so we need
    to ensure the keys under /etc/kubernetes/pki are present in the
    backed-up tarball before extracting, otherwise playbook fails
    because the keys are not found in the archive.
    
    Change-Id: I8602f07d1b1041a7fd3fff21e6f9a422b9784ab5
    Closes-Bug: 928925
    Signed-off-by: Angie Wang <angie.wang@windriver.com>

commit 0261f22ff7c23d2a8608fe3b51725c9f29931281
Author: Don Penney <don.penney@windriver.com>
Date:   Thu May 20 23:09:07 2021 -0400

Update SX to DX migration to wait for coredns config
    
    This commit updates the SX to DX migration playbook to wait after
    modifying the system mode to duplex until the runtime manifest that
    updates coredns config has completed. The playbook will wait for up to
    20 minutes to allow for the possibilty that sysinv has multiple
    runtime manifests queued up, each of which could take several minutes.
    
    Depends-On: https://review.opendev.org/c/starlingx/stx-puppet/+/792494
    Depends-On: https://review.opendev.org/c/starlingx/config/+/792496
    Change-Id: I3bf94d3493ae20eeb16b3fdcb27576ee18c0dc4d
    Closes-Bug: 1929148
    Signed-off-by: Don Penney <don.penney@windriver.com>

commit 7c4f17bd0d92fc1122823211e1c9787829d206a9
Author: Daniel Safta <daniel.safta@windriver.com>
Date:   Wed May 19 09:08:16 2021 +0000

Fixed missing apiserver-etcd-client certs
    
    When controller-1 is the active controller
    the backup archive does not contain
    /etc/etcd/apiserver-etcd-client.{crt, key}
    
    This change adds a new task which brings
    the certs from /etc/kubernetes/pki
    
    Closes-bug: 1928925
    Signed-off-by: Daniel Safta <daniel.safta@windriver.com>
    Change-Id: I3c68377603e1af9a71d104e5b1108e9582497a09

commit e221ef8fbe51aa6ca229b584fb5632fe512ad5cb
Author: David Sullivan <david.sullivan@windriver.com>
Date:   Wed May 19 16:01:27 2021 -0500

Support bootstrap replay with networking changes
    
    Currently bootstrap playbook replay will fail if the management or
    cluster host networks are changed. To resolve this a couple of changes
    are needed:
    
    * Restart the sysinv agent and wait until it is ready. The sysinv agent
      uses the current management ip for the rabbitMQ connection and
      applying runtime manifests. The process needs to be restarted to
      resync that data.
    
    * Copy the etcd certs to the /opt/platform on replay. The etcd-server
      certs are regenerated on replay. When the cluster host network changed
      the SAN in the certs under /opt/platform were out of date resulting in
      kube-apiserver failures on controller-0 unlock.
    
    Closes-Bug: 1925668
    Signed-off-by: David Sullivan <david.sullivan@windriver.com>
    Change-Id: I228321a2540a0024cd217ed844feb54be9ae3b29

commit 41ada83e4f4486d0795eed3e7a8bbe4227ee88d8
Author: Yuxing Jiang <yuxing.jiang@windriver.com>
Date:   Wed May 19 12:53:26 2021 -0500

Bug fix: update barbican external id if the project id changes
    
    The previous commit
    https://review.opendev.org/c/starlingx/ansible-playbooks/+/790824
    unexpectedly moves the update_barbican_project_external_id out of the
    scope of the project id changes. Further, the old_id['id'] is
    duplicated with the current_project_id, it will be undeclared in the
    case of the project id not changed (by deleting the subcloud in
    powered off status and re-add it with --migrate option). This commit
    deletes the duplicated variable and moves the
    update_barbican_project_external_id to the correct scope.
    
    Tests:
    1 Delete a subcloud from a central cloud and add it back with the
    migrate option.
    2 Migrate a subcloud successfully to a different central cloud.
    3 Migrate a subcloud successfully with an extra nonlocal user to a
    different central cloud.
    
    Partial-Bug: 1928139
    Signed-off-by: Yuxing Jiang <yuxing.jiang@windriver.com>
    Change-Id: I8acf7bcea5d8e77b92877427ccd40b95e8b3515e

commit ac0c5d51a8708c8c056a736ff11ce1a0b1550c4f
Author: Dan Voiculeasa <dan.voiculeasa@windriver.com>
Date:   Thu May 13 19:53:41 2021 +0300

Create the pod_max_pids service parameter
    
    This adds a default entry to service parameters.
    
    Create the default entry taking into consideration the most hungry of
    the optional StarlingX apps. The user is free to modify the value as
    desired, using 'system service-parameter-modify'.
    
    Same can be created by the user using 'system service-parameter-add',
    but this helps the user by being transparent in service-parameter-list.
    
    If this service parameter was missing an entry, then no hieradata
    variable would have been generated, so puppet would have used
    a predefined value.
    
    Partial-Bug: 1928353
    Depends-On: I74fcf2bd405c2a3811a4f27a55b28c0d001430e1
    Signed-off-by: Dan Voiculeasa <dan.voiculeasa@windriver.com>
    Change-Id: I707ddc4ca67595fbf809c6ffc15ecd4fb21f4661

commit 64bf73c85c5de0737c8a1cf967b6b251288ee424
Author: Dan Voiculeasa <dan.voiculeasa@windriver.com>
Date:   Tue May 11 16:22:12 2021 +0300

Enable kubelet support for pod pid limit
    
    This protects the system before the unlock. This has the most meaning
    during the restore procedure, when the system is unprotected until
    unlock (until puppet generates the config file containing protection).
    
    Partial-Bug: 1928353
    Signed-off-by: Dan Voiculeasa <dan.voiculeasa@windriver.com>
    Change-Id: I09c4d4f494bc113ae8b439256655476e03b54b0e

commit cfc719b82a6f1651a2b3950b316244f907d58491
Author: Angie Wang <angie.wang@windriver.com>
Date:   Mon May 17 17:11:12 2021 -0400

Configure kubeadm to not apply the default taint
    
    The taint "node-role.kubernetes.io/master:NoSchedule" needs
    to be removed from master node so that pods can be scheduled
    on it. This is handled by a bootstrap task. However, issue
    was seen that the default taint was not removed during bootstrap
    that causes armada pod fails to be scheduled on controller-0.
    This happens on one of the subcloud when bootstrapping a batch
    of 50 subclouds.
    
    Add configuration in kubeadm to not apply the default taint
    at the beginning so it doesn't need to be removed afterwards.
    
    Tested AIO-SX, DX upgrade and a batch of 50 subclouds deployment
    
    Change-Id: I543280ddd55ec94ccf0586dc07877349baa06bdd
    Closes-Bug: 1928722
    Signed-off-by: Angie Wang <angie.wang@windriver.com>

commit 45c74db36670e9cba3475e598a4655490b744cee
Author: Yuxing Jiang <yuxing.jiang@windriver.com>
Date:   Thu May 13 10:46:02 2021 -0500

Cleanup of subcloud rehoming playbook
    
    As the ansible log is accessible for all the users, we don't want to
    expose the keystone IDs and passwords in a subcloud's ansible log.
    This commit hides the keystone IDs and passwords in the ansible log.
    
    This commit also removes un-used facts and the rehome_in_progress
    flag.
    
    Tested with successfully re-homed a SX and a DX subcloud in virtual
    box env.
    
    Story: 2008774
    Task: 42462
    
    Signed-off-by: Yuxing Jiang <yuxing.jiang@windriver.com>
    Change-Id: I6c2e83ba5b4923e9d7c82ccf94165608739e59e1

commit 0ca19e0870e06b601553ea2a9d9e1cfc0367d75f
Author: Yuxing Jiang <yuxing.jiang@windriver.com>
Date:   Tue May 11 16:02:49 2021 -0500

Clean non-local keystone user during migrating keystone IDs
    
    During the progress of re-homing a sub cloud to a new central cloud,
    insert a new user ID will be failed if there's a duplicated non-local
    user in keystone database. This commit deletes the non-local user with
    the same user id before inserting a new keystone local user into the
    database.
    
    Test:
    1. Add a non-local user in keystone database, and this user should
    have a same user ID as the new central cloud.
    2. Successfully re-home this subcloud to the new central cloud.
    3. Successfully re-home a subcloud without the duplicated user in
    keystone database.
    
    Partial-Bug: 1928139
    Signed-off-by: Yuxing Jiang <yuxing.jiang@windriver.com>
    Change-Id: Iecaac3400cb9362acc63686ae9470cf01f7eb2e1

commit 275e046280a71b0c5c8c637fdc7065d4e2e19aee
Author: Isac Souza <IsacSacchi.Souza@windriver.com>
Date:   Thu May 13 13:07:56 2021 -0300

Fix deletion of temp dir when backup fails
    
    When the /opt/backups disk is full and the backup playbook
    is executed, the Ansible task fails to create a temp directory.
    After the failure the cleanup code is executed but if the temp
    dir was not created, it fails because tempdir.path will be undefined.
    This second failure aborts the cleanup code and the
    "System Backup in progress." will stay on indefinitely, requiring
    manual intervetion.
    
    Tested by executing the backup procedure when /opt/backups
    is full.
    
    Closes-Bug: 1928365
    Signed-off-by: Isac Souza <IsacSacchi.Souza@windriver.com>
    Change-Id: I76c852b65bde4b40a22bcb9be3a81776ade86d15

commit 02b36df15ba42c4346b34b4ddce29d95b4c7fc69
Author: Tee Ngo <tee.ngo@windriver.com>
Date:   Thu May 13 12:17:06 2021 -0400

Enable ssh connection retries in Ansible playbooks
    
    In this commit:
     - Allow ssh retries in the playbooks in case of unreachable
       failures that are observed in batch subcloud deployment or
       upgrade where the system controller has low network bandwidth.
     - Increase ssh timeout to accomodate slow sudo response when
       ldap service is not yet available for the subcloud.
     - Allow docker login and images pull retries in case docker
       or docker proxy throws an exception due to slow response
       (e.g. oam network is overwhelmed).
    
    Closes-Bug: 1928357
    Change-Id: Ibc6155671b20a01340b66270c3c402174d34ab9e
    Signed-off-by: Tee Ngo <tee.ngo@windriver.com>

commit 20e44c6dd71758f89bbfe88c6204e777c17deb5d
Author: Zhipeng Liu <zhipengs.liu@intel.com>
Date:   Tue Apr 27 01:40:16 2021 +0800

Fix bootstrap replay failure when changing mgmt subnet
    
    After mgmt subnet is changed, we use previous controller_0
    address for etcd puppet apply to avoid resycing an nonexistent
    hieradata file.
    
    Change-Id: Ie31c48153af30df240237013dd51bfffea5213cd
    Closes-Bug: 1925668
    Signed-off-by: Zhipeng Liu <zhipengs.liu@intel.com>

commit 7228f60e146286d0948ac67bee159a7b8c54f704
Author: Mihnea Saracin <Mihnea.Saracin@windriver.com>
Date:   Tue May 11 18:56:03 2021 +0300

Fix restore user images playbook
    
    When running restore_user_images playbook,
    after the images will be uploaded to the local registry,
    the docker images stored in the local docker filesystem
    will be deleted by a python script called by the playbook.
    Sometimes this script will try to delete images
    that are not present into the local docker filesystem and
    will throw some errors.
    We fix this by adding a check before trying to delete an image.
    
    Also, updated the
    /playbooks/roles/common/push-docker-images/files/download_images.py
    file to do the same checks since it does similar steps
    and can also fail when deleting
    the images from the local docker cache.
    
    Closes-Bug: 1928092
    Signed-off-by: Mihnea Saracin <Mihnea.Saracin@windriver.com>
    Change-Id: I1be45b2669ad1ab209aa49f050106dd1a7759cee

commit d5460198dc0310a80580537fd8df76ae00e17f02
Author: Robert Church <robert.church@windriver.com>
Date:   Wed May 12 22:45:38 2021 -0400

Adjust armada's tiller container liveness probe
    
    With the liveness probe update in the armada helm chart to test the
    connectivity to the postgres backend, adjust the periodSeconds and
    failureThreshold to align with the minimum swact time to be expected for
    postgres switching from one controller to another.
    
    Reviewing logs from various H/W labs it appears that average postgres
    swact time ranges from 9s-20s, with the mean ~15s.
    
    Times can be observed with:
    2021-05-09T13:32:24.475 controller-1 OCF_pgsql(postgres)[396293]: info
                                         INFO: server shutting down
    2021-05-09T13:32:33.423 controller-0 OCF_pgsql(postgres)[147541]: info
                                         INFO: server starting
    
    Set the periodSeconds to 4 and the failureThreshold to 2 so that if the
    postgres server is not accessible, the tiller container will be
    restarted within the 9s minimum swact time. This will ensure that the
    next time tiller is required by Armada or used by the helmv2-cli that
    the connection to postgres backend has been re-established.
    
    Change-Id: I7454a737771d9a608d2fe69c5136d37da022007e
    Depends-On: https://review.opendev.org/c/starlingx/integ/+/791092
    Related-Bug: #1917308
    Signed-off-by: Robert Church <robert.church@windriver.com>

commit 9d4203b7e0d4b321fcccf14e69367573ef5d5093
Author: Tao Liu <tao.liu@windriver.com>
Date:   Wed May 12 20:30:37 2021 -0400

Use async_timeout for kubectl wait timeout
    
    Data migration failed for 5 subclouds during 30 subcloud upgrade.
    This failure was caused by timeout waiting for Kubernetes component,
    Networking or Armada pods to be ready.
    
    The async_timeout is set to 120s in restore mode, but the pod wait time
    is still set to 30s. This update changes the pod wait time to be the
    same as async_timeout.
    
    Tested on DC-2 with 50 subcloud parallel upgrade
    
    Closes-Bug: 1928252
    
    Signed-off-by: Tao Liu <tao.liu@windriver.com>
    Change-Id: If2b21d0cd2e0de9e84869323a43d4a249d031132

commit 36451c99ce76e76084ad5c68e4954bf347e8c0b7
Author: Dan Voiculeasa <dan.voiculeasa@windriver.com>
Date:   Tue May 11 16:24:26 2021 +0300

Add helm sql database ip to armada overrides
    
    This will be used by tiller container to check that the container
    networking is properly set up.
    
    Partial-Bug: 1928141
    Signed-off-by: Dan Voiculeasa <dan.voiculeasa@windriver.com>
    Change-Id: I177bb628497611eb64472291a04d635856c26590

commit 86fbafec14744d6630aff6e6ac2bb165c5be2be8
Author: Jessica Castelino <jessica.castelino@windriver.com>
Date:   Tue May 11 14:58:29 2021 -0400

Fix: Upgrade does not replay correctly after activate failure
    
    When the etcd runtime manifest fails during upgrade activate,
    the activation fails as expected. However when the upgrade-activate
    is attempted again, the etcd upgrade playbook is not re-run.
    Depending on when the manifest fails, this could result in secure
    etcd remaining disabled on the system.
    
    This commit makes it possible to replay the etcd upgrade playbook,
    thus solving the issue.
    
    Change-Id: I7f453d9040916381519ac96ed4567ec5fb6e7a8d
    Signed-off-by: Jessica Castelino <jessica.castelino@windriver.com>
    Closes-Bug: 1928130

commit 2dfadea581d1ac9955a7921bd2730e36a152255c
Author: Yuxing Jiang <yuxing.jiang@windriver.com>
Date:   Thu Mar 11 11:12:05 2021 -0500

Add a rehome subcloud playbook
    
    This commit creates a rehome_subcloud playbook to migrate an existing
    subcloud from the original central cloud to a new central cloud. This
    playbook is expected to be remotely play from the new central cloud
    system controller based on the subcloud's overrides. This playbook
    updates the system controller network info in the subcloud, migrates
    the keystone data and updates the subcloud's ca and certs. After
    running this playbook and unlock the subcloud controllers, the
    subcloud can be discovered online in the new central cloud, and
    can be managed and brought in-sync from the new central cloud.
    
    Test:
    Successfully migrate a AIOSX and a AIODX subcloud to the new central
    cloud with this playbook.
    
    Depends-On: https://review.opendev.org/c/starlingx/config/+/784767
    Depends-On: https://review.opendev.org/c/starlingx/stx-puppet/+/785977
    Depends-On: https://review.opendev.org/c/starlingx/config/+/786638
    Depends-On: https://review.opendev.org/c/starlingx/config/+/787213
    
    Story: 2008774
    Task: 42152
    Change-Id: Iaa6699951e855c76602bd43f71d42e64e298c786
    Signed-off-by: Yuxing Jiang <yuxing.jiang@windriver.com>

commit d23b932208b778ec73cc51bccf460310f16ceaad
Author: Bin Qian <bin.qian@windriver.com>
Date:   Mon May 3 15:32:06 2021 -0400

Ensure n3000-opae image cache is not deleted
    
    Skip removing n3000-opae image cache. This image is expected to be
    available to reset n3000 fpgas before docker local registry is ready.
    
    Closes-Bug: 1927000
    Signed-off-by: Bin Qian <bin.qian@windriver.com>
    Change-Id: I3372d89c48b394c8cac6b9da06d2528bb6afa803

commit 255488739efa4ac072424b19f2dbb7a3adb0254e
Author: Mihnea Saracin <Mihnea.Saracin@windriver.com>
Date:   Thu Apr 29 16:37:21 2021 +0300

Restore host filesystems with collected sizes
    
    Since https://review.opendev.org/c/starlingx/ansible-playbooks/+/784860,
    the host filesystems(backup, docker, kubelet, scratch) are
    no longer resized in ansible at restore and they are not using the
    collected sizes from the backup archive. Puppet will try to
    resize them when unlocking but this will generate some errors.
    
    The solution is to create the host filesystems with the
    correct sizes at restore. The sizes are taken from the
    backup archive.
    
    Closes-Bug: 1926591
    Change-Id: Id670408a518e4a1e3fc75a668eea42d26a972d66
    Signed-off-by: Mihnea Saracin <Mihnea.Saracin@windriver.com>

commit bc0fba6bbbd0182c4886e5a3ccbfc2d0973cfd70
Author: Bin Qian <bin.qian@windriver.com>
Date:   Wed Apr 28 11:52:36 2021 -0400

Remove restore subcloud admin endpoint certificate from config
    
    This change is to exclude the admin endpoint certificate from restore
    to config directory. The admin endpoint certificate is stored in k8s
    (backup) and restore as part of k8s restore. Sysinv will generate it
    into hieradata from k8s secret and puppet will genereate the pem for
    haproxy.
    
    Depends-On: https://review.opendev.org/c/starlingx/stx-puppet/+/786666
    Partial-Bug: 1923510
    Signed-off-by: Bin Qian <bin.qian@windriver.com>
    Change-Id: Iae8fb9c53e0aa6797a25b872adb0c99636c4243a

commit e88c9290a3ca2d947828a1f4e988f9fe61a1a623
Author: Melissa Wang <melissa.wang@windriver.com>
Date:   Mon Apr 26 17:45:40 2021 -0400

SX to DX migration: Check network interface config
    
    This change adds a semantic check to ensure that the cluster-host
    and management networks are not configured on loopback before
    allowing sx-to-dx migration.
    
    Task: 42375
    Story: 2008587
    
    Change-Id: I87326db222ffe9eb8bf23b69d17f676abc7c242d
    Signed-off-by: Melissa Wang <melissa.wang@windriver.com>

commit e25439d49d127779f9ab32650a4a51027242884b
Author: Bin Qian <bin.qian@windriver.com>
Date:   Wed Apr 14 14:53:56 2021 -0400

Remove creating admin endpoint cert in subcloud bootstrap
    
    This change removes creating admin endpoint cert in subcloud
    bootstrap.
    The admin endpoint cert is generated in manifest at the time when
    the controller node is unlock the first time. The cert data is
    retrieved directly from k8s secret data (where cert-manager is
    responsible to maintain it and keep it up to date).
    
    Partial-Bug: 1923510
    Depends-On: https://review.opendev.org/c/starlingx/stx-puppet/+/786666
    Signed-off-by: Bin Qian <bin.qian@windriver.com>
    Change-Id: Ie6a5c8fe159efcdebdb4c81666e981772408b82c

commit 5b286734637ca6f503a62131b309829d2f308fed
Author: David Sullivan <david.sullivan@windriver.com>
Date:   Mon Apr 5 17:01:48 2021 -0500

Remove drbd resize actions from bootstrap playbook
    
    drbd resize actions can cause failures during unlock. Avoid all drbd
    resize actions during the bootstrap playbook. Pass the correct
    filesystem sizes to the bootstrap manifest. Use the collected drbd
    filesystem sizes during the restore/upgrade bootstrap manifest.
    
    Closes-Bug: 1920245
    Signed-off-by: David Sullivan <david.sullivan@windriver.com>
    Change-Id: I235575ad40ba84298d3db5b39ce7861a143c78a8

commit 05d3d8c21f5017b26466bfd1cdd1f1e7accf266f
Author: Carmen Rata <carmen.rata@windriver.com>
Date:   Mon Apr 12 10:35:57 2021 -0400

Fix kube-apiserver pod removal
    
    This commit adds rolebinding configuration to bind
    the "privileged-psp-user" role to the kubelet user.
    It fixes the issue where the kube-apiserver pod does
    not get recreated after enabling PodSecurityPolicy
    plugin. With this fix we make sure to allow the apiserver
    pod creation by granting permission to the kubelet user
    to create that pod.
    
    Closes-Bug: 1881605
    
    Signed-off-by: Carmen Rata <carmen.rata@windriver.com>
    Change-Id: Ibdf6d4cacf2ce83dfa744455dac460027b2a6e47

commit 802afc1c23d26f44bd7954a5ec038c5badddcaa3
Author: Andrei Grosu <andrei.grosu@windriver.com>
Date:   Mon Feb 8 22:28:49 2021 +0000

Inform conductor of ansible backup actions.
    
    Let apps run custom code around backup actions inside the playbook.
    
    Story: 2007960
    Task: 40769
    Depends-On: I0ebab45f4846cbcd25fecac6bf99195d9047eb8a
    
    Signed-off-by: Andrei Grosu <andrei.grosu@windriver.com>
    Change-Id: I61156db05970aa03c96ddc8533fdd4f4a680b334

commit f6cd32b82bc01c9f1b9f53d1754aac6a45e51643
Author: Don Penney <don.penney@windriver.com>
Date:   Fri Mar 26 11:23:51 2021 -0400

Copy default-registry-key to deployment namespace
    
    After creating the deployment namespace, copy the default-registry-key
    from the kube-system namespace into it for use by pods running in that
    namespace.
    
    Story: 2007361
    Task: 42186
    Signed-off-by: Don Penney <don.penney@windriver.com>
    Change-Id: I3b431c74295cc0099f00814ee28709b1b4c56c8c

commit d61c82f555034f104e1bee8a83bb19ad448012b2
Author: Zhipeng Liu <zhipengs.liu@intel.com>
Date:   Wed Mar 31 23:12:43 2021 +0800

Change CN of etcd CA since etcd will reuse kubernetes CA
    
    Basic test pass on simplex.
    
    Closes-Bug: 1921511
    
    Change-Id: If3a7cca4a03b05ac5eb61f7f579449a7393c1644
    Signed-off-by: Zhipeng Liu <zhipengs.liu@intel.com>

commit 8a7c871a8d7c8843b404e6779eccd2e483293c2a
Author: Carmen Rata <carmen.rata@windriver.com>
Date:   Mon Mar 22 15:32:48 2021 -0400

Add default psp configuration to psp-policies.yaml
    
    Added ClusterRoleBindings configuration in psp-policies.yaml to
    enhance the current capabilities.
    The new configuration gives users the ability that by default,
    they can create pods and deployments with “restricted” capabilities
    in their namespaces.
    When psp policies are applied at bootstrap time, all tenants/users
    have at least restricted capabilities and access rights will be
    added as needed, reducing this way the manual configuration
    required.
    
    Closes-Bug: 1885716
    
    Signed-off-by: Carmen Rata <carmen.rata@windriver.com>
    Change-Id: I6118efd6832836829d31aa84b2b4305d5a1f24c4

commit f7d5491e404ba2854105278f6b9e2883b52a5206
Author: Chris Friesen <chris.friesen@windriver.com>
Date:   Tue Aug 25 12:16:22 2020 -0400

Remove CPU resource requests from platform pods
    
    In order to allow applications to make use of all "application" cores,
    we need to remove all CPU resource requests from "platform" pods.  We tried
    to make this work by modifying Kubernetes itself to track platform pod resources
    separately, but it would have required changes to kube-scheduler, which would
    involve building a custom kube-scheduler container image.
    
    Some platform pods are created by "kubeadm" during early init and are handled
    via a kubernetes code change.  For calico, multus, and sriov we modify the
    templates used to create the pods and remove any reference to CPU resources.
    
    Normally this would mean that they could get throttled by application pods, but
    since they're running on the platform cores this shouldn't be a problem.
    
    Story: 2008760
    Task: 42170
    Change-Id: Ibf73bd0d105e4040f02a4114afbb31d131fc9585
    Signed-off-by: Chris Friesen <chris.friesen@windriver.com>

commit 41de2e52db9985d84397d7ba56a59bbeaa9cf88f
Author: Zhipeng Liu <zhipengs.liu@intel.com>
Date:   Fri Mar 19 01:00:10 2021 +0800

Fix bootstrap replay fails due to not changing etcd config.
    
    When replay bootstrap, we still need to apply etcd puppet manifest.
    Test PASS for below cases
    1）System bring up with bootstrap replay
       - non cluster host network reconfiguration
    2）System bring up with bootstrap replay
       - cluster host network change
    3）AIO simplex backup and restore
    
    Closes-Bug: 1918943
    Signed-off-by: Zhipeng Liu <zhipengs.liu@intel.com>
    Change-Id: Ib892776f4b2949f9255fc4725add1f0e362956a9

commit fbd6a848d53ae6de9703466215fb85df2485ab0b
Author: Scott Little <scott.little@windriver.com>
Date:   Thu Mar 25 11:35:34 2021 -0400

Set SW_VERSION 21.05
    
    Prep for the StarlingX 5.0 release.
    SW_VERSION uses YY.MM format.
    
    Story: 2008055
    Task: 42113
    Depends-On: https://review.opendev.org/c/starlingx/utilities/+/783042
    Signed-off-by: Scott Little <scott.little@windriver.com>
    Change-Id: Ic718ed214dfdb6eb8fcfffcadbbeadfbb8f6d052

commit 2ffd9c4bff6bd736f35cc1fb2fd3b0e25c9ef8f2
Author: David Sullivan <david.sullivan@windriver.com>
Date:   Fri Mar 19 10:36:55 2021 -0500

drbd filesystems not resized during bootstrap
    
    Remove drbd resize_result check on resize2fs operation. Both operations
    should run when requested. These commands will return 0 when the disk is
    resized and will return 0 if the disk is already correctly sized. Any
    non-zero return code should fail the playbook.
    
    Closes-Bug: 1920245
    Signed-off-by: David Sullivan <david.sullivan@windriver.com>
    Change-Id: I946f63b2886b5377494e658a59586005c27ec2d2

commit 8c2580cc85b8e7a2783ece895d7c2c476db81990
Author: Don Penney <don.penney@windriver.com>
Date:   Wed Mar 10 12:13:56 2021 -0500

Introduce SX to DX migration playbook
    
    This commit introduces a migrate-subcloud.yml playbook that the user
    can run, with an overrides file to provide config values, to perform
    the migration steps for a subcloud. Once the migration playbook has
    been applied and the subcloud has recovered from the unlock (performed
    by the playbook), the second controller can be installed and
    configured.
    
    Story: 2008587
    Task: 41743
    Depends-On: https://review.opendev.org/c/starlingx/config/+/776536
    Signed-off-by: Don Penney <don.penney@windriver.com>
    Change-Id: I1d8c1219694147baaabb183ef3debe1715aaf153

commit 64feca44b59284b07d42e7100fb90b22297c0ce0
Author: Carmen Rata <carmen.rata@windriver.com>
Date:   Fri Mar 5 15:18:46 2021 -0500

Use jinja2 template for containerd config.toml
    
    File "/etc/containerd/config.toml" is set up with ansible bootstrap
    using an erb template managed by puppet.
    We replace the erb template with a jinja2 template to improve ease
    of use and reduce the need for complex regular expressions.
    
    Closes-Bug: 1892768
    Signed-off-by: Carmen Rata <carmen.rata@windriver.com>
    Change-Id: I93601321d4f554d27bb9457bcff99428351cbefd

commit df1f8e381d951a2ab6c52a32eb2467817795e083
Author: Suvro Ghosh <suvrojeet.ghosh@windriver.com>
Date:   Thu Mar 11 10:38:57 2021 -0500

Adding force flag to purge task
    
    This flag will allow ansible replay
    
    Story: 2007960
    Task: 42039
    Signed-off-by: Suvro Ghosh <suvrojeet.ghosh@windriver.com>
    Depends-On: If68d66d799addcd996da4b146d092c855b455aa3
    Change-Id: I93821965184d95a00fddd3398a1c214e3d730efa

commit ed7314d4c4c1f9d6f18dc6c144dd4b6cdf67b66e
Author: Mihnea Saracin <Mihnea.Saracin@windriver.com>
Date:   Mon Mar 8 14:36:54 2021 +0200

Fix bootstrap playbook when initializing kubernetes
    
    The etcd endpoint in the kubedeadm file is different than
    the endpoint defined in the etcd config file.
    This is because in the kubeadm file, the etcd endpoint is equal to
    the 'cluster_floating_address' variable. And in the etcd config
    file the 'default_cluster_host_start_address' variable is used.
    These 2 variables can be different when the
    'cluster_host_start_address' variable defined in the localhost.yml
    differs from the first address of the 'cluster_host_subnet'.
    
    The solution is to use 'cluster_floating_address' in both cases
    because this variable is defined in the following way:
    
    cluster_floating_address: "{{ address_pairs['cluster_host']['start'] }}"
    will be different than the 'default_cluster_host_start_address': "{{
    (cluster_host_subnet | ipaddr(1)).split('/')[0] }
    
    So it will use 'default_cluster_host_start_address' when the
    'cluster_host_start_address' is not defined.
    
    Closes-Bug: 1918130
    Change-Id: I8fecc1e5e54b5a9a9a72a54c069f79f5f2d434ba
    Signed-off-by: Mihnea Saracin <Mihnea.Saracin@windriver.com>

commit ff5c9327c1990cb2bf6454a1a156669187dc056b
Author: Babak Sarashki <babak.sarashki@windriver.com>
Date:   Wed Feb 24 05:45:30 2021 +0000

Remove container runtime interface (CRI) placeholder
    
    Change-Id Ib1dd5bd (stx-puppet) and Icc5fd16 (config) add
    support to set CRI entries for kubernetes runTimeClass.
    
    This is not needed during the initial bring-up of services
    at bootstrap time and produces incorrect config.toml causing
    bootstrap failure. Therefore, it is removed from the initial
    config.toml generated during bring up of essential services.
    
    Story: 2008434
    Task: 41928
    
    Depends-On: https://review.opendev.org/c/starlingx/config/+/776220
    Depends-On: https://review.opendev.org/c/starlingx/stx-puppet/+/776223
    
    Signed-off-by: Babak Sarashki <babak.sarashki@windriver.com>
    Change-Id: Id124653329da0ba9f990e5bba6b53faa3c88fa86

commit ae898b781539f71ea009cefec7602ed62335741b
Author: Teresa Ho <teresa.ho@windriver.com>
Date:   Thu Feb 11 13:22:39 2021 -0500

Create device_images bind directory
    
    The device images are stored in the drbd filesystem
    (/opt/platform/device_images) in the active controller.
    In order to allow the other worker hosts to retrieve the device images
    from the active controller over lighttpd, the directory
    /www/pages/device_images is created as a bind mount of the drbd
    directory.
    
    Tests performed on the following systems:
    AIO-DX, AIO-DX plus compute, Standard 2+1
    DC with AIO-DX plus subcloud
    DC with Standard subcloud
    
    Story: 2007875
    Task: 41878
    
    Change-Id: I00c75767543d3840c466df887e9f16ba75a5386d
    Signed-off-by: Teresa Ho <teresa.ho@windriver.com>

commit 9814ec0490506f222326396368f34e278ae52a0d
Author: Tee Ngo <tee.ngo@windriver.com>
Date:   Mon Feb 22 23:07:34 2021 -0500

Update Ansible bootstrap address validation
    
    The method to validate bootstrap address in previous
    commit (9c62c83536b737e731b140c95be37d74769989ff) is not
    reliable for IPv6. This commit fixes it.
    
    Task: 41800
    Story: 2008573
    Change-Id: Ibadf36e7f6c1ec31ca47514802991c92959fd138
    Signed-off-by: Tee Ngo <tee.ngo@windriver.com>

commit 9cee892563c68154714bcb7a7e173dcc13b6b237
Author: Angie Wang <angie.wang@windriver.com>
Date:   Wed Feb 24 14:27:14 2021 -0500

Update the minimum root disk size to be aligned with sysinv
    
    Change-Id: I03c8b31ab76ce8a2b8534677910763150ec1d9c0
    Closes-bug: 1916797
    Depends-On: https://review.opendev.org/c/starlingx/config/+/777465
    Signed-off-by: Angie Wang <angie.wang@windriver.com>

commit f2d20c15bbda992bec707b9dd3529f5f1fd53b83
Author: Mihnea Saracin <Mihnea.Saracin@windriver.com>
Date:   Tue Feb 2 18:09:33 2021 +0200

Fix restoring dc-vault on a central controller
    
    At this moment, when we do a restore procedure on a
    DC system controller, the /opt/dc-vault directory will be
    created under "/" filesystem. It should be created on
    a separated filesystem, but that filesystem is available
    only after an unlock of the controller.
    
    The proposed solution is to create an additional restore
    playbook for the dc-vault that will be manually run after
    unlocking controller-0. The backup playbook will create
    an additional archive with the contents of dc-vault, and
    the dc-vault directory will be removed from the platform
    backup.
    
    The new playbook will be used like this:
    
    ansible-playbook
    /usr/share/ansible/stx-ansible/playbooks/restore_dc_vault.yml -e
    "ansible_become_pass=Li69nux*" -e "admin_password=Li69nux*" -e
    "initial_backup_dir=/home/sysadmin" -e
    "backup_filename=localhost_dc_vault_backup_2021_02_02_11_46_09.tgz"
    
    Closes-Bug: 1914258
    Signed-off-by: Mihnea Saracin <Mihnea.Saracin@windriver.com>
    Change-Id: I8fdd5b678e2296cd0ce98ea4dd91e2988beb200f

commit 3babc1eed3ba861c652f082655fa284992be0859
Author: Dan Voiculeasa <dan.voiculeasa@windriver.com>
Date:   Fri Feb 19 16:45:22 2021 +0200

B&R: Fix registry backup generated when it should not
    
    Incorrect ansible variable evaluation results in the a backup file
    generated when it is required not do so.
    
    Fix the evaluation.
    
    Signed-off-by: Dan Voiculeasa <dan.voiculeasa@windriver.com>
    Depends-On: I4644784ea4164134f163d218e69dc4ceb148985a
    Closes-Bug: 1916246
    Change-Id: I2a31dcda55137a668b2e82b9a938535bdf623656

commit 151b885dc733d5094fce474ae042ee2f2ceae49e
Author: Mihnea Saracin <Mihnea.Saracin@windriver.com>
Date:   Fri Feb 19 16:21:23 2021 +0200

B&R: Fix backup hangout on IPv6 systems
    
    IPv6 addreses must be enclosed in square brackets when they are
    followed by a port number.
    
    At the "[backup/backup-system : Create etcd snapshot]"
    step, the etcd IPv6 endpoint is not wrapped in square brackets,
    so the command hangs indefinitely.
    
    We fix this by using the 'ipwrap' ansible filter which will
    wrap the address in [] brackets if it's an IPv6 one.
    
    Closes-bug: 1916053
    Change-Id: If40ed59f4e44c9f877aaefe87f6211a3e83ddfee
    Signed-off-by: Mihnea Saracin <Mihnea.Saracin@windriver.com>

commit 71db4f1aa82a36c360ada97a6d8d989a736e1133
Author: Steven Webster <steven.webster@windriver.com>
Date:   Wed Feb 17 13:54:14 2021 -0500

Uprev SR-IOV CNI image
    
    This commit uprevs the SR-IOV CNI image to pick up a few bug
    fixes.  Specifically, this commit will allow rate-limiting
    configuration on a VF to be retained after the VF has been
    used by a pod (and pod subsequently deleted).
    
    Testing:
    
    NICs:
    Ethernet Controller X710 for 10GbE SFP+
    Mellanox MT27700 Family [ConnectX-4]
    
    Functional:
    Connectivity testing (kernel + DPDK)
    Devices allocated appropriately to pod
    Rate-limiting information retained after pod deletion
    
    Closes-Bug: #1915951
    
    Signed-off-by: Steven Webster <steven.webster@windriver.com>
    Change-Id: Ie893fee61fa15e28e1994b5e766ed3bca2ff4050

commit 9c62c83536b737e731b140c95be37d74769989ff
Author: Tee Ngo <tee.ngo@windriver.com>
Date:   Tue Feb 9 23:08:32 2021 -0500

Ansible update to support remote subcloud restore
    
    In this commit:
      - Common code to validate target is updated to include additional
        host check options (system readiness, load version, bootstrap ip,
        and patches).
      - A new playbook is added to perform host check for various
        use cases (remote install, pre subcloud restore, pre subcloud
        upgrade, etc...).
      - Install playbook is updated to make use of the new playbook.
      - A new role which performs generic user input validation for all
        restore playbooks is added.
      - A new B&R parameter is added to indicate where the backup data
        can be found, on the host itself (on-box) or on another
        machine (off-box).
      - Platform, user images and openstack restore playbooks are
        updated to a) make use of the on_box_data parameter,
        b) use the same target_backup_dir for both local and remote
        playbook execution for consistent behavior.
      - Host override file is extracted only on the target.
      - Patches restore is skipped if requested by the caller. Default
        behavior is to restore patches.
      - Various subtle bugs are fixed.
      - Ansible version is specified in Zuul test requirements.
    
    Tests:
      - Deployment of a Redfish capable subcloud
      - Remote restore a simplex with various options: a) without patches
        b) skip patches restore, c) with patches restore, d) on-box backup
        tarball, e) off-box backup tarball
      - Local restore of a simplex with 2 options: a) without patches and
        b) skip patches restore
      - Simplex fresh install
      - Restore user images
      - Restore OpenStack
      - Error cases
    
    Task: 41725
    Story: 2008573
    Change-Id: Ica2b9010a73854a01216e2e16b581484d182264e
    Signed-off-by: Tee Ngo <tee.ngo@windriver.com>

commit 64af997f72d2e78615bd38d573aa13d0a59932da
Author: Yuxing Jiang <yuxing.jiang@windriver.com>
Date:   Thu Feb 4 17:55:24 2021 -0500

Fetch additional image config file for remote play
    
    When invoking the playbook remotely, the additional system images
    config file locates in the remote host may not also exist in the
    control host. This commit fetches the additional image config file to
    the control host to prevent include_vars failure.
    
    Tests:
    1 Remote backup and restore an AIOSX node with the additional system
    images config file from a control host without the config file.
    2 Build an image ISO and installed/ bootstrapped an AIOSX node.
    
    Closes-Bug: 1914611
    Signed-off-by: Yuxing Jiang <yuxing.jiang@windriver.com>
    Change-Id: I79362b00a25ca031a1fbbaa92476955e35477b73

commit 6226fa7a28ee271f92419bbeb62f2ee512e4b192
Author: Mingyuan Qi <mingyuan.qi@intel.com>
Date:   Tue Dec 8 05:52:41 2020 +0000

Add provision edgeworker playbook
    
    This playbook provisions edgeworker nodes to setup OS
    configurations as well as join the node to the Kubernetes
    platform.
    
    This playbook should be triggered manually after an edgeworker node
    added to sysinv. Make sure the edgeworker node has got the correct
    mgmt ip address from controller's dhcp server.
    
    1. Create an inventory file for edgeworker nodes on the active
    controller:
    
    tee ./edgeworker_inventory.yml << EOF
    all:
      hosts:
        localhost:
          ansible_connection: local
      children:
        edgeworker:
          hosts:
            <edgeworker hostname>:
              ansible_ssh_user: <admin username>
              ansible_ssh_pass: <ssh password>
              ansible_become_pass: <admin password>
              ansible_python_interpreter: <python path(e.g.
    /usr/bin/python3)>
            #<more edgeworker nodes>:
      vars:
        ansible_ssh_user: sysadmin
        ansible_ssh_pass: St8rlingX*
        ansible_become_pass: St8rlingX*
    EOF
    
    2. Trigger the playbook provision_edgeworker.yml:
    
    ansible-playbook \
    -i ./edgeworker_inventory.yml \
    /usr/share/ansible/stx-ansible/playbooks/provision_edgeworker.yml
    
    After the provisioning, the edgeworker node will be shown Ready in
    Kubernetes.
    
    Test:
    Edgeworker node OS: Ubuntu 18.04/20.04
    AIO-SX + edgeworker nodes: PASS
    AIO-DX + edgeworker nodes: PASS
    Standard + edgeworker nodes: PASS
    
    Story: 2008129
    Task: 40879
    
    Signed-off-by: Mingyuan Qi <mingyuan.qi@intel.com>
    Change-Id: If189f915461da3ce79dea3a688a3cf5e59d8f12c
    Depends-On: https://review.opendev.org/c/starlingx/config/+/763918

commit 2a5e37cecfde7572331b0c7f13e8ea55400c5260
Author: Douglas Henrique Koerich <douglashenrique.koerich@windriver.com>
Date:   Mon Feb 1 09:26:48 2021 -0500

Remove RPF config of tunl0 interfaces from kernel
    
    Calico project removed "strict" mode for RPF from kernel's IPv4
    configuration at version 3.12, moving back to former approach of
    setting iptables rules for security purposes (see
    https://github.com/projectcalico/felix/pull/2189). Thus, forcing
    disable of RPF at tunl0 interface as introduced by the fix of
    bug 1838801 isn't necessary anymore.
    This commit removes that forced writing of kernel configuration,
    so that calico-node pod is healthy after started, and asymmetrical
    routing still works as required.
    
    Implements: removal of postStart hook for Calico pods
    Closes-Bug: 1912807
    Signed-off-by: Douglas Henrique Koerich <douglashenrique.koerich@windriver.com>
    Change-Id: Iea8f4aa988269f99f5632fec2cf344297709eaef

commit adf542c16b71e681e47599b1e60a17311efeb6a6
Author: Dan Voiculeasa <dan.voiculeasa@windriver.com>
Date:   Thu Jan 21 09:05:22 2021 +0200

B&R: Fix restore of ceph using mds
    
    Calling /etc/init.d/ceph without specifying a component applies an
    action to all components.
    After cephfs_platform_integ topic merged, ceph mds component is also
    used.
    Mds requires a monitor to be up before starting.
    Due to the fact that the ceph script was called without specifying the
    components mds the start action was applied to mds also and it was
    applied before starting the monitor, resulting in failure.
    
    The fix is to specify the monitor and osd components.
    Mds will be start when ceph is restarted a few tasks later.
    
    Plus minor ansible changes:
    - Use failed_when instead of ignore_errors
    - Renamed task
    
    Closes-Bug: 1912488
    Signed-off-by: Dan Voiculeasa <dan.voiculeasa@windriver.com>
    Change-Id: I5388ddd32923bbd9acb205fa2d2adbc083346631

commit c62c3aa8890ef6c0d80ba8397ef64f5b0fb8c291
Author: Yuxing Jiang <yuxing.jiang@windriver.com>
Date:   Mon Jan 18 15:49:21 2021 -0500

Bug fix: static image type error
    
    A bug was introduced by commit
    587c1ff2075c518f32bb93f14f66bcd23e542259 that the bootstrap of a
    distributed cloud system controller will fail due to the RVMC image
    is a string rather than a list. This commit converts it to a single
    member list before appending it to static images list.
    
    Tested by bootstraping a distributed cloud system controller.
    
    Change-Id: Ib2e0c6cc1913e0ee44f8b288004459c497ebedb8
    Closes-Bug: 1908100
    Signed-off-by: Yuxing Jiang <yuxing.jiang@windriver.com>

commit 6cf33738447ba9f65d0f5aec1507991676522a58
Author: Sabeel Ansari <Sabeel.Ansari@windriver.com>
Date:   Tue Dec 22 12:22:04 2020 -0500

Create deployment namespace during bootstrap
    
    A generic namespace is being created during bootstrap
    called 'deployment'. This namespace will be shared by
    various deployed platform resources including platform
    certificates.
    
    Change-Id: I48bca00cc882fae691786fd1e200e28bf126496b
    Story: 2007361
    Task: 41495
    Signed-off-by: Sabeel Ansari <Sabeel.Ansari@windriver.com>

commit 587c1ff2075c518f32bb93f14f66bcd23e542259
Author: Yuxing Jiang <yuxing.jiang@windriver.com>
Date:   Mon Dec 14 11:55:32 2020 -0500

Upgrade: append additional images to the static images list
    
    Images specified in additional_local_registry_images at install time
    will not be upgraded after an upgrade is completed. This commit allows
    common/load-images-information load additional images list from
    /usr/share/additional-system-images.yml by default.
    
    In a distributed cloud system, the Redfish Virtual Media
    Controller(RVMC) image is can support remote install on Redfish
    configured hosts. This commit includes the RMVC image in the static
    images list if the host is a DC controller, enables
    download/push/update this image with other static images.
    
    Tested by installing and upgrading an AIODX central cloud with an
    AIOSX subcloud DC system.
    
    Partial-Bug: 1908100
    Change-Id: I1f927f876f4883a587098c61fbcaf408d65fdde4
    Signed-off-by: Yuxing Jiang <yuxing.jiang@windriver.com>

commit a83a13a0e8c6923ea4e9f74797269aa9ee4bf02d
Author: Tee Ngo <tee.ngo@windriver.com>
Date:   Mon Jan 11 00:46:54 2021 -0500

Ensure docker proxies are defined
    
    System triggered playbooks such as upgrade-static-images,
    will fail to pull newer images from AWS ECR if either
    docker http or https proxy is not configured. This commit
    resolves the issue.
    
    Tests:
      - Fresh install with public registries
      - Fresh install with a private registry
      - Fresh install with AWS registry, http proxy configured
      - Fresh install with AWS registry, proxies not configured
          a. Perform controller swact
          b. Perform upgrade
    
    Closes-Bug: 1910951
    Change-Id: I93c2856210dde90a66bae823178e14f140f711f4
    Signed-off by: Tee Ngo <tee.ngo@windriver.com>

commit bbcec5f2d678a0f18c02eb514e274a6187c21e32
Author: Angie Wang <angie.wang@windriver.com>
Date:   Wed Oct 14 15:56:39 2020 -0400

Configure SQL as helm storage backend
    
    Configmap is the default helmv2 storage backend to store
    release information but its 1MB resource limit prevents
    scaling up stx openstack worker nodes, so we want to use
    SQL as helm storage backend.
    
    Update armada overrides to start up tiller with SQL storage
    backend.
    
    Tested:
    - AIO-SX, AIO-DX, STD installation (IPv4 and IPv6)
    - apply stx-openstack
    - host-swact/lock/unlock
    - AIO-SX, AIO-DX upgrade with stx-openstack, stx-monitor
    - backup&restore with stx-openstack
    
    Closes-Bug: 1887677
    Depends-on: https://review.opendev.org/#/c/761647/
    Change-Id: I8ad7194973e8fc60ee2539dea5da67b15be1df4a
    Signed-off-by: Angie Wang <angie.wang@windriver.com>

commit 56e338e5c729242dd14a7ecd4d294d65ca069863
Author: Tee Ngo <tee.ngo@windriver.com>
Date:   Wed Dec 16 23:52:26 2020 -0500

Improve bootstrap config validation
    
    A misconfigured docker_registries when merged with the
    default docker_registries will lead to bootstrap failure
    with a misleading error that is difficult for the user to
    identify the root cause.
    
    Add tasks to ensure user provided registry keys are valid.
    
    Closes-Bug: 1908479
    Change-Id: I829496481708221a772a9242dc660a2462880116
    Signed-off-by: Tee Ngo <tee.ngo@windriver.com>

commit 820f347324d4fb7d17396d5d21f98eb2b674d23a
Author: Zhipeng Liu <zhipengs.liu@intel.com>
Date:   Sat Oct 31 00:32:03 2020 +0800

Enable etcd with security setting.
    
    Add etcd client/server certificate generation process in
    ansible playbook and restart etcd after etcd security config
    enabled.
    
    Test status.
    1) Deployment test on both simplex and duplex  - PASS
       check the communication status between apiserver
       and etcd by kubectl command
       check the etcd status and configuration on controllers
    2) Switch active controller - PASS
       After switching, check the communication status between
       apiserver and etcd by kubectl command
       check the etcd status and configuration on controllers
    3) Lock/unlock of a simplex controller - PASS
    4) Backup/Restore test on simplex - PASS
    5) Spontaneous reboot of a simplex controller - PASS
    6) Enable secured etcd on simplex by script after deploy
       unsecured etcd. - PASS
    7) Backup test on Duplex - PASS
    8) Restore test on Duplex - PASS
    9) Re-installing a controller host on a duplex setup and
       then swacting to it - PASS
    
    Partial-Bug: 1894870
    
    Depends-on: https://review.opendev.org/#/c/760508/
    Change-Id: I88691b84c9acc2e27f0b783d7454a873d3490072
    Signed-off-by: Zhipeng Liu <zhipengs.liu@intel.com>

commit 0df607a49b514f143a9198c6fbba1e9e846e6c4f
Author: Jim Gauld <james.gauld@windriver.com>
Date:   Tue Jul 14 14:21:14 2020 -0400

Support upgrade to Helm v3 with containerized armada
    
    This provides an upgrades playbook for containerized armada
    to launch armada using Helm v3.
    
    This refactors common code from bringup_helm.yml so that it may be
    called from either bootstrap or upgrade.
    
    Story: 2007927
    Task: 40354
    Closes-Bug: 1906554
    
    Change-Id: I061006683252a8592f07c90d6d82bdc109418451
    Signed-off-by: Jim Gauld <james.gauld@windriver.com>
    Signed-off-by: Angie Wang <angie.wang@windriver.com>

commit 936881f954f8acb2ed0b69de47f576d49ed82d2d
Author: Cole Walker <cole.walker@windriver.com>
Date:   Mon Nov 30 17:27:56 2020 -0500

Check local pods only to prevent B&R timeout
    
    This fix reworks a kubectl wait command to only wait for ready pods on
    the local controller. This fixes an issue where restoring a cluster with
    many nodes can timeout during "Start wait for armada,
    calico-kube-controllers & coredns deployments to reach Available state".
    This timeout was happening because the kubectl command was waiting for
    pods on downed nodes to become ready for 30 seconds per pod. In cases
    where there were 5 or more unreachable pods, the async timeout value of
    120 seconds would be reached and the kubectl wait commands would be
    terminated before completion. This prevented subsequent Ansible tasks
    from completing and resulted in a parsing error during "Fail if any of
    the Kubernetes component, Networking or Armada pods are not ready by
    this time"
    
    The fix here adds --field-selector spec.nodeName=$(hostname) to the
    kubectl wait command and causes only pods on the running
    controller to be checked. This also ensures that the asyncronous tasks
    will always only take 30 seconds, regardless of the number of nodes in
    the cluser.
    
    Removed the behaviour where the wait time would scale up based on the
    number of nodes, as the return time of the async tasks is now always a
    fixed 30 seconds.
    
    Closes-Bug: 1905788
    
    Change-Id: Idb7b30891d4fd00901aaa69412ef4c59913e21f3
    Signed-off-by: Cole Walker <cole.walker@windriver.com>

commit 843e77819228d1298db192d68746f2591ac3e078
Author: Andy Ning <andy.ning@windriver.com>
Date:   Fri Dec 4 10:14:25 2020 -0500

Restrict permissions on docker registry certificate file
    
    It is noticed that docker registry certificate file
    (/etc/docker/certs.d/registry.local:9001/registry-cert.crt) has
    permission set to 644. This update changes its permissions to
    400 as required, by preserving the original permissions when it
    is copied over in ansible bootstrap.
    
    Change-Id: Ic85fa5fa2595f81d5cde6b0294eb7fbbd9c7dc63
    Closes-Bug: 1906844
    Signed-off-by: Andy Ning <andy.ning@windriver.com>

commit d3e48692a5b28a2466ea9dbf8d0737bebd9df68e
Author: Cole Walker <cole.walker@windriver.com>
Date:   Tue Dec 1 16:50:41 2020 -0500

Lower negative cache TTL for coredns
    
    This change updates the coredns config map to lower the TTL for caching
    negative responses from 30 seconds down to 5 seconds. This will improve
    the response time for cases where a hostname lookup might occur before a
    given pod is created, resulting in a negative entry being cached and
    preventing pods from resolving the name for 30 seconds afterwards.
    
    The default cache size of 9984 items is unchanged, but must be
    explicitly defined in this configuration.
    
    The change also adds coredns to the upgrade-k8s-networking playbook to
    ensure that changes to coredns can be automatically deployed by
    sysinv-conductor.
    
    Closes-Bug: 1906870
    
    Change-Id: I7b44358508c32c8ca4e58b1e69d6232f1a61bfcf
    Signed-off-by: Cole Walker <cole.walker@windriver.com>

commit 1d7305eaf980ea258bcf9e2bb31406bf96ceb766
Author: Jessica Castelino <jessica.castelino@windriver.com>
Date:   Wed Nov 18 01:11:33 2020 -0500

Remove the addition of identity to shared services
    
    Identity services are no longer a shared service in DC.
    This commit removes the addition of identity to the shared service
    list for subclouds.
    
    Change-Id: I3d9d0e4df1a41142cce1ce13d4bbf7d43a626909
    Partial-Bug: 1904675
    Signed-off-by: Jessica Castelino <jessica.castelino@windriver.com>

tags:

added: in-f-centos8

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2021-06-07: Fix merged to integ (f/centos8)

#15

Download full text (37.0 KiB)

Reviewed: https://review.opendev.org/c/starlingx/integ/+/793754
Committed: https://opendev.org/starlingx/integ/commit/a13966754d4e19423874ca31bf1533f057380c52
Submitter: "Zuul (22348)"
Branch: f/centos8

commit b310077093fd567944c6a46b7d0adcabe1f2b4b9
Author: Mihnea Saracin <email address hidden>
Date: Sat May 22 18:19:54 2021 +0300

Fix resize of filesystems in puppet logical_volume

    After system reinstalls there is stale data on the disk
    and puppet fails when resizing, reporting some wrong filesystem
    types. In our case docker-lv was reported as drbd when
    it should have been xfs.

    This problem was solved in some cases e.g:
    when doing a live fs resize we wipe the last 10MB
    at the end of partition:
    https://opendev.org/starlingx/stx-puppet/src/branch/master/puppet-manifests/src/modules/platform/manifests/filesystem.pp#L146

    Our issue happened here:
    https://opendev.org/starlingx/stx-puppet/src/branch/master/puppet-manifests/src/modules/platform/manifests/filesystem.pp#L65
    Resize can happen at unlock when a bigger size is detected for the
    filesystem and the 'logical_volume' will resize it.
    To fix this we have to wipe the last 10MB of the partition after the
    'lvextend' cmd in the 'logical_volume' module.

Tested the following scenarios:

B&R on SX with default sizes of filesystems and cgts-vg.

B&R on SX with with docker-lv of size 50G, backup-lv also 50G and
cgts-vg with additional physical volumes:

    - name: cgts-vg
        physicalVolumes:
        - path: /dev/disk/by-path/pci-0000:00:0d.0-ata-1.0
        size: 50
        type: partition
        - path: /dev/disk/by-path/pci-0000:00:0d.0-ata-1.0
        size: 30
        type: partition
        - path: /dev/disk/by-path/pci-0000:00:0d.0-ata-3.0
        type: disk

B&R on DX system with backup of size 70G and cgts-vg
with additional physical volumes:

    physicalVolumes:
    - path: /dev/disk/by-path/pci-0000:00:0d.0-ata-1.0
        size: 50
        type: partition
    - path: /dev/disk/by-path/pci-0000:00:0d.0-ata-1.0
        size: 30
        type: partition
    - path: /dev/disk/by-path/pci-0000:00:0d.0-ata-3.0
        type: disk

    Closes-Bug: 1926591
    Change-Id: I55ae6954d24ba32e40c2e5e276ec17015d9bba44
    Signed-off-by: Mihnea Saracin <email address hidden>

commit 3225570530458956fd642fa06b83360a7e4e2e61
Author: Mihnea Saracin <email address hidden>
Date: Thu May 20 14:33:58 2021 +0300

Execute once the ceph services script on AIO

    The MTC client manages ceph services via ceph.sh which
    is installed on all node types in
    /etc/service.d/{controller,worker,storage}/ceph.sh

    Since the AIO controllers have both controller and worker
    personalities, the MTC client will execute the ceph script
    twice (/etc/service.d/worker/ceph.sh,
    /etc/service.d/controller/ceph.sh).
    This behavior will generate some issues.

We fix this by exiting the ceph script if it is the one from
/etc/services.d/worker on AIO systems.

Closes-Bug: 1928934
Change-Id: I3e4dc313cc3764f870b8f6c640a60338...

Reviewed:  https://review.opendev.org/c/starlingx/integ/+/793754
Committed: https://opendev.org/starlingx/integ/commit/a13966754d4e19423874ca31bf1533f057380c52
Submitter: "Zuul (22348)"
Branch:    f/centos8

commit b310077093fd567944c6a46b7d0adcabe1f2b4b9
Author: Mihnea Saracin <Mihnea.Saracin@windriver.com>
Date:   Sat May 22 18:19:54 2021 +0300

Fix resize of filesystems in puppet logical_volume
    
    After system reinstalls there is stale data on the disk
    and puppet fails when resizing, reporting some wrong filesystem
    types. In our case docker-lv was reported as drbd when
    it should have been xfs.
    
    This problem was solved in some cases e.g:
    when doing a live fs resize we wipe the last 10MB
    at the end of partition:
    https://opendev.org/starlingx/stx-puppet/src/branch/master/puppet-manifests/src/modules/platform/manifests/filesystem.pp#L146
    
    Our issue happened here:
    https://opendev.org/starlingx/stx-puppet/src/branch/master/puppet-manifests/src/modules/platform/manifests/filesystem.pp#L65
    Resize can happen at unlock when a bigger size is detected for the
    filesystem and the 'logical_volume' will resize it.
    To fix this we have to wipe the last 10MB of the partition after the
    'lvextend' cmd in the 'logical_volume' module.
    
    Tested the following scenarios:
    
    B&R on SX with default sizes of filesystems and cgts-vg.
    
    B&R on SX with with docker-lv of size 50G, backup-lv also 50G and
    cgts-vg with additional physical volumes:
    
    - name: cgts-vg
        physicalVolumes:
        - path: /dev/disk/by-path/pci-0000:00:0d.0-ata-1.0
        size: 50
        type: partition
        - path: /dev/disk/by-path/pci-0000:00:0d.0-ata-1.0
        size: 30
        type: partition
        - path: /dev/disk/by-path/pci-0000:00:0d.0-ata-3.0
        type: disk
    
    B&R on DX system with backup of size 70G and cgts-vg
    with additional physical volumes:
    
    physicalVolumes:
    - path: /dev/disk/by-path/pci-0000:00:0d.0-ata-1.0
        size: 50
        type: partition
    - path: /dev/disk/by-path/pci-0000:00:0d.0-ata-1.0
        size: 30
        type: partition
    - path: /dev/disk/by-path/pci-0000:00:0d.0-ata-3.0
        type: disk
    
    Closes-Bug: 1926591
    Change-Id: I55ae6954d24ba32e40c2e5e276ec17015d9bba44
    Signed-off-by: Mihnea Saracin <Mihnea.Saracin@windriver.com>

commit 3225570530458956fd642fa06b83360a7e4e2e61
Author: Mihnea Saracin <Mihnea.Saracin@windriver.com>
Date:   Thu May 20 14:33:58 2021 +0300

Execute once the ceph services script on AIO
    
    The MTC client manages ceph services via ceph.sh which
    is installed on all node types in
    /etc/service.d/{controller,worker,storage}/ceph.sh
    
    Since the AIO controllers have both controller and worker
    personalities, the MTC client will execute the ceph script
    twice (/etc/service.d/worker/ceph.sh,
    /etc/service.d/controller/ceph.sh).
    This behavior will generate some issues.
    
    We fix this by exiting the ceph script if it is the one from
    /etc/services.d/worker on AIO systems.
    
    Closes-Bug: 1928934
    Change-Id: I3e4dc313cc3764f870b8f6c640a6033822639926
    Signed-off-by: Mihnea Saracin <Mihnea.Saracin@windriver.com>

commit b428a5de0070c6df82536b8b5b782810ebd9efda
Author: Cole Walker <cole.walker@windriver.com>
Date:   Wed May 19 16:57:54 2021 +0000

Revert "Remove recover operations to "restart-on-reboot" pods"
    
    This reverts commit 8abcbf6fb1951b25e9964933558b75b9aff88135.
    
    Reason for revert:
    
    After performing a backup and restore on an AIO-SX system, SRIOV pods do
    not return to a running state and are instead stuck in "container
    creating". The workaround for this is to restart SRIOV pods when the
    system unlocks.
    
    Reverting this commit to allow users to label SRIOV pods and have them
    restarted by k8s-pod-recovery. Labelled pods will be restarted by
    k8s-pod-recovery and will be running after backup and restore is
    completed.
    
    This change has been tested by performing backup and restore on an
    AIO-SX system. SRIOV pods now come up correctly when labelled with
    restart-on-reboot=true
    
    Closes-Bug: 1928965
    
    Signed-off-by: Cole Walker <cole.walker@windriver.com>
    Change-Id: I9c520c0a47aabca7b96e50adf0f71742f4199c2f

commit 4e1aa82e96d9b4caeff7e7b31632733c395c6ad0
Author: Robert Church <robert.church@windriver.com>
Date:   Sat May 15 16:24:29 2021 -0400

Update postgres liveness check to support IPv6 addresses
    
    Templating will add square brackets for IPv6 addresses which are
    interpreted as an array vs. a string. Quote this so that it interpreted
    correctly.
    
    Change-Id: I2b705015a74ea2e4e914b7a83cdceed37d49b766
    Related-Bug: #1917308
    Signed-off-by: Robert Church <robert.church@windriver.com>

commit b3540ccfdfa6956fb20c62e5e5bb76af56d2ab63
Author: Robert Church <robert.church@windriver.com>
Date:   Wed May 12 22:36:23 2021 -0400

Update the liveness probe to verify postgres connectivity
    
    Change the tillerLivenessProbeTemplate to test the connectivity to the
    postgres backend. We will override the periodSeconds and
    failureThreshold when installing the helm chart to trigger a restart of
    the tiller pod over a swact when the postgres DB/server moves from one
    controller to the other.
    
    This will help guarantee that the tiller connection is always
    re-established if the connectivity to the postgres backend fails.
    
    Change-Id: I7fbed33a8c821f6c9254f58d5953e2115cf4141a
    Related-Bug: #1917308
    Signed-off-by: Robert Church <robert.church@windriver.com>

commit 03665ae745babb4524e2b9b9cc0f768eaf1e8781
Author: Angie Wang <angie.wang@windriver.com>
Date:   Mon May 10 18:54:07 2021 -0400

Add armada namespace in k8s pod recovery
    
    Update k8s pod recovery service to include armada namespace
    so armada pod that stuck in an unknown state after host
    lock/unlock or reboot could be recovered by the service.
    
    Change-Id: Iacd92637a9b4fcaf4c0076e922e1bd739f69a584
    Closes-Bug: 1928018
    Signed-off-by: Angie Wang <angie.wang@windriver.com>

commit 764cac1642a8820d169576da3d8d886449d3cf73
Author: Dan Voiculeasa <dan.voiculeasa@windriver.com>
Date:   Tue May 11 17:04:01 2021 +0000

Armada: Fix tiller stuck connecting to postgres database
    
    Tiller may start executing before IPv6 network is fully initialized.
    This will result in tiller not being fully functional.
    The liveness probe will detect that tiller didn't start properly and
    restart it. But this might happen an unlimited number of times in a row.
    
    Wait until ping is succesful to the ip of the postgres database.
    This ensures that networking finished setting up.
    Credits to Cole Walker <cole.walker@windriver.com> for proposing the
    idea.
    
    Depends-On: I177bb628497611eb64472291a04d635856c26590
    Closes-Bug: 1928141
    Signed-off-by: Dan Voiculeasa <dan.voiculeasa@windriver.com>
    Change-Id: I9c5be3f30fad2650e6aa53fb80ef44f7798813ed

commit 1974b3f570c0a21ec5e4cfe7d806c58a01a7dd0c
Author: Don Penney <don.penney@windriver.com>
Date:   Fri May 7 09:01:47 2021 -0400

Copy shim.efi to /pxeboot for UEFI pxeboot support
    
    Package a copy of the shim.efi file to /pxeboot to support UEFI secure
    boot. The recent grub2 update for CVE-2020-15705 requires the use of
    shim.efi in order to support kernel signature validation.
    
    Change-Id: If87925e1697b34d7ff1a7a770d9f13619dd9dd52
    Partial-Bug: 1927730
    Signed-off-by: Don Penney <don.penney@windriver.com>

commit b1ac60470315153dc9bc03f7f0bb1bfb221f6c5d
Author: Davlet Panech <davlet.panech@windriver.com>
Date:   Wed May 5 10:42:56 2021 -0400

Pin clearlinux/golang to v1.15.10 in Dockerfiles
    
    Upstream Dockerfiles use clearlinux/golang:latest as the base, which is
    broken as of now. Solution: change it to last known working tag before
    building.
    
    Closes-Bug: 1927153
    Signed-off-by: Davlet Panech <davlet.panech@windriver.com>
    Change-Id: Ic13973c0518eeab74ec86884036d08c2b8a4961f

commit 4850ab86da1cecca239d2ffa6dded4c0946e8a43
Author: Li Zhou <li.zhou@windriver.com>
Date:   Tue Apr 13 04:34:32 2021 -0400

systemd: Upgrade to version 219-78.el7_9.3
    
    This fixes the issue of systemd sending tons of useless
    PropertiesChanged messages when a mount happens as described in:
    https://bugzilla.redhat.com/show_bug.cgi?id=1793527
    
    Depends-On: https://review.opendev.org/c/starlingx/tools/+/786601
    Partial-Bug: #1924691
    Signed-off-by: Li Zhou <li.zhou@windriver.com>
    Change-Id: I3596303d77211a135e8559a05806395328725cde

commit 18010eb1d637c8ea3c9fdd9be5684f1b5ee8b23c
Author: Thiago Brito <thiago.brito@windriver.com>
Date:   Thu Mar 25 14:55:11 2021 -0400

Upversioning armada tarball to 7ef4b86
    
    A fix landed upstream to deal with armada waiting indefinitely for
    evicted pods, which intermittently fails stx-openstack
    application. This commit upversions the tarball version to the
    one containing that change.
    
    Removing patches 0002 and 0003 since the commits are already on
    the armada code at this version.
    
    Story: 2008645
    Task: 41906
    
    Signed-off-by: Thiago Brito <thiago.brito@windriver.com>
    Change-Id: I62caf0a403a054c30b5bbfc1a3c5bc4cf73b60a6

commit ccfeeef59d39e42b2775bb5a216732c4999f6e42
Author: Li Zhou <li.zhou@windriver.com>
Date:   Mon Apr 12 02:15:25 2021 -0400

systemd: Prevent excessive /proc/1/mountinfo reparsing
    
    Backport the patches for this issue:
    https://bugzilla.redhat.com/show_bug.cgi?id=1819868
    
    We met such an issue:
    When testing a large number of pods (> 230), occasionally observed a
    number of issues related to systemd process:
        systemd ran continually 90-100% cpu usage
        systemd memory usage started increasing rapidly (20GB/hour)
        systemctl commands would always timeout (Failed to get properties:
            Connection timed out)
        sm services failed and can't recover: open-ldap,
            registry-token-server, docker-distribution, etcd
        new pods can't start, and got stuck in state ContainerCreating
    
    Those patches work to prevent excessive /proc/1/mountinfo reparsing.
    It has been verified that those patches can improve this performance
    greatly.
    
    16 commits are listed in sequence (from [1] to [16]) at below link
    for the issue:
    https://github.com/systemd-rhel/rhel-8/pull/154/commits
    
    [16](10)core: prevent excessive /proc/self/mountinfo parsing
    [15][Dropped-6]test: add ratelimiting test
    [14](9)sd-event: add ability to ratelimit event sources
    [13](8)sd-event: increase n_enabled_child_sources just once
    [12](7)sd-event: update state at the end in event_source_enable
    [11](6)sd-event: remove earliest_index/latest_index into common part of
    event source objects
    [10][Dropped-5]sd-event: follow coding style with naming return
    parameter
    [9] [Dropped-4]sd-event: ref event loop while in sd_event_prepare() ot
    sd_event_run()
    [8] (5)sd-event: refuse running default event loops in any other thread
    than the one they are default for
    [7] [Dropped-3]sd-event: let's suffix last_run/last_log with "_usec"
    [6] [Dropped-2]sd-event: fix delays assert brain-o (#17790)
    [5] (4)sd-event: split out code to add/remove timer event sources to
    earliest/latest prioq
    [4] (3)sd-event: split clock data allocation out of sd_event_add_time()
    [3] [Dropped-1]sd-event: mention that two debug logged events are
    ignored
    [2] (2)sd-event: split out enable and disable codepaths from
    sd_event_source_set_enabled()
    [1] (1)sd-event: split out helper functions for reshuffling prioqs
    
    I ported 10 of them back (from (1) to (10)) to fix this issue
    and dropped the other 6 (from [Dropped-1] to [Dropped-6]) for those
    reasons:
    [Dropped-1]Only changes error log.
    [Dropped-2]Fixes a bug introduced in a commit which doesn't exist in
    this version.
    [Dropped-3]Only changes vars' names and there is no functional change.
    [Dropped-4]More commits are needed for merging it, while I don't see
    any help on adding the rate-limiting ability.
    [Dropped-5]Change coding style for a function which isn't really used
    by anyone.
    [Dropped-6]Add test cases.
    
    Closes-Bug: #1924686
    Signed-off-by: Li Zhou <li.zhou@windriver.com>
    Change-Id: Ia4c8f162cb1a47b40d1b26cf4d604976b97e92d6

commit e62b1a53b9148738a7c36355b19607d6e6f3d0d7
Author: David Sullivan <david.sullivan@windriver.com>
Date:   Tue Apr 20 17:32:45 2021 -0500

Unmount all targets during drbd stop
    
    When stopping drbd, we need to unmount targets from each device.
    Devices with multiple mountpoints can fail to unmount, leading to
    metadata corruption. Add --all-targets to the umount command.
    
    Closes-Bug: 1920245
    Signed-off-by: David Sullivan <david.sullivan@windriver.com>
    Change-Id: Ic1b4583c72a0dd256724b8672dbb59126273330b

commit de263f633ea359507357d3d4c53e98a71bff5afc
Author: Cole Walker <cole.walker@windriver.com>
Date:   Tue Apr 13 16:47:24 2021 -0400

Add alternative command to disable lldp agent for i40e devices
    
    LLDP information is not available for certain i40e network devices when
    running system host-lldp-neighbor-show.
    
    This is caused by the firmware lldp agent on the devices not getting
    disabled by the i40e-lldp-configure.sh script which is invoked by lldpd.
    
    The command used to disable the firmware lldp agent in the script works
    for some firmware versions found on devices, but not others. This change
    adds an ethtool command to disable the lldp agent which works for these
    other firmware versions.
    
    From testing, the ethtool method is used for firmware versions 5.05 and
    8.10. The sysfs method is used for firmware version 7.10. In all cases,
    the driver version is 2.14.13
    
    Closes-Bug: 1923665
    
    Signed-off-by: Cole Walker <cole.walker@windriver.com>
    Change-Id: Ifac34091599bd4020bf55cc1b8ba3119edccb297

commit 3924cfe7ae390678ae4df9b544acf8b373440183
Author: Marcus Secato <marcus.viniciuscarvalhosecato@windriver.com>
Date:   Thu Apr 15 17:52:58 2021 -0400

Set proper user ID for armada-api container
    
    Since armada application moved to Kubernetes cluster, processes and
    commands are not executed with the 'armada' user in armada-api
    container. Previously when armada was a separated container user was
    enforced through 'docker exec'.
    
    Closes-Bug: 1924579
    
    Signed-off-by: Marcus Secato <marcus.viniciuscarvalhosecato@windriver.com>
    Change-Id: I5600974c0b9c3ade73a58dae300e8f3b18c6aefd

commit 8abcbf6fb1951b25e9964933558b75b9aff88135
Author: Bin Qian <bin.qian@windriver.com>
Date:   Thu Apr 8 12:58:44 2021 -0400

Remove recover operations to "restart-on-reboot" pods
    
    The pods being labeled as "restart-on-reboot" is to workaround
    kubernetes restart on worker manifest. As the AIO running a
    single manifest to start kubernetes only once, the operation
    is no longer needed.
    
    Depends-On: https://review.opendev.org/c/starlingx/stx-puppet/+/785736
    Change-Id: I0d6c549199559b2bc19d8edff52f64ea0b08b50d
    Closes-Bug: 1918139
    Signed-off-by: Bin Qian <bin.qian@windriver.com>

commit 859e8eb7309f6c26f1ebbc0898e87e82d56af97b
Author: Chris Friesen <chris.friesen@windriver.com>
Date:   Tue Apr 14 20:16:48 2020 -0400

add isolcpus device plugin for kubernetes
    
    In order to minimize latency as much as possible, we want to allow
    kubernetes containers to make use of CPUs which have been specified
    as "isolated" via the kernel boot args.
    
    This commit creates an isolcpus device plugin, which detects the isolated
    CPUs and exports them to kubelet via the device plugin API.
    
    See kubernetes/plugins/isolcpus-device-plugin/files/README.md for
    more information on the behaviour and design choices for this commit.
    
    When we move to a newer version of the Intel device plugin manager we
    may be able to simplify some of this.  See the above README.md file
    for details.
    
    Change-Id: I3bfe04ab6e7fbafefa63f6dc43cb2ed79a52579f
    Story: 2008760
    Task: 42165
    Signed-off-by: Chris Friesen <chris.friesen@windriver.com>

commit 777b7d88630bae55bf130e240212a2abf288bbd3
Author: Chris Friesen <chris.friesen@windriver.com>
Date:   Mon Oct 26 17:30:00 2020 -0400

enable support for kubernetes to ignore isolcpus
    
    The normal mechanisms for allocating isolated CPUs do not allow
    a mix of isolated and exclusive CPUs in the same container.  In
    order to allow this in *very* limited cases where the pod spec
    is known in advance we will add the ability to disable the normal
    isolcpus behaviour.
    
    If the file "/etc/kubernetes/ignore_isolcpus" exists, then kubelet
    will basically forget everything it knows about isolcpus and just
    treat them like regular CPUs.
    
    The admin user can then rely on the fact that CPU allocation is
    deterministic to ensure that the isolcpus they configure end up being
    allocated to the correct pods.
    
    Story: 2008760
    Task: 42164
    Change-Id: Ie38c81209ee407ac98b4882f2581fc14622b3af1
    Signed-off-by: Chris Friesen <chris.friesen@windriver.com>

commit 4150b7a6b61365525c6201ad04eb678c96a578d5
Author: Chris Friesen <chris.friesen@windriver.com>
Date:   Mon Aug 31 11:06:59 2020 -0400

kubeadm: create platform pods with zero CPU resources
    
    We want to specify zero CPU resources when creating the manifests
    for the static platform pods, as a workaround for the lack of
    separate resource tracking for platform resources.
    
    We also specify zero CPU resources for the coredns deployment.
    manifests.go appears to be the main file for this, not sure if the
    others are used by I changed them just in case.
    
    Story: 2008760
    Task: 42163
    Change-Id: I6410b8af556d5167d1813e7545fad8baa27b1100
    Signed-off-by: Chris Friesen <chris.friesen@windriver.com>

commit 71876817aa2a0a0109d7dfe6bf34c1344b3d5f06
Author: Chris Friesen <chris.friesen@windriver.com>
Date:   Fri Apr 24 02:25:22 2020 -0400

fix exclusive CPU alloc being deleted at container restart
    
    The expectation is that exclusive CPU allocations happen at pod
    creation time. When a container restarts, it should not have its
    exclusive CPU allocations removed, and it should not need to
    re-allocate CPUs.
    
    There are a few places in the current code that look for containers
    that have exited and call CpuManager.RemoveContainer() to clean up
    the container.  This will end up deleting any exclusive CPU
    allocations for that container, and if the container restarts within
    the same pod it will end up using the default cpuset rather than
    what should be exclusive CPUs.
    
    Removing those calls and adding resource cleanup at allocation
    time should get rid of the problem.
    
    This should eventually go into upstream 1.18.1, at which point
    we can just revert this commit.
    
    Story: 2008760
    Task: 42160
    Change-Id: I61d3670805ef805e21b9c54daf0677d4c7e1bc74
    Signed-off-by: Chris Friesen <chris.friesen@windriver.com>

commit b88df951face924d0c29fa76ee03424c4546afd2
Author: Chris Friesen <chris.friesen@windriver.com>
Date:   Fri Apr 24 02:18:00 2020 -0400

Add kubernetes support for isolated cpus
    
    This introduces the concept of "isolated CPUs", which are CPUs that
    have been isolated at the kernel level via the "isolcpus" kernel boot
    parameter.
    
    When starting the kubelet process, the set of reserved CPUs (including
    both platform and isolated CPUs) will be specified via
    '--reserved-cpus'.  The isolated CPUs will be identified by looking at
    "/sys/devices/system/cpu/isolated" and treated separately from the
    platform CPUs (which are used to run infrastructure pods).
    
    A plugin (outside the scope of this commit) exposes the isolated
    CPUs to kubelet via the device plugin API.
    
    If a pod specifies some number of "isolcpus" resources, the device manager
    will allocate them.  In this code we check whether such resources have
    been allocated, and if so we set the container cpuset to the isolated
    CPUs.  This does mean that it really only makes sense to specify "isolcpus"
    resources for best-effort or burstable pods, not for guaranteed ones since
    that would throw off the accounting code.  In order to ensure the accounting
    still works as designed, if "isolcpus" are specified for guaranteed pods,
    the affinity will be set to the non-isolated CPUs.
    
    Story: 2008760
    Task: 42161
    Change-Id: I7bd2eabb4c82faea63e3ad129ef735b9d1223e11
    Signed-off-by: Chris Friesen <chris.friesen@windriver.com>

commit a436a17d711db209c2bc8802360b9f31df16c237
Author: Chris Friesen <chris.friesen@windriver.com>
Date:   Fri Apr 24 02:12:35 2020 -0400

kubelet cpumanager patches for low-latency
    
    In order to minimize latency as much as possible, kubernetes containers
    require isolation of platform CPUs, isolcpus CPUs, and shared CPUs. For
    Guaranteed pods, we also need to disable CFS quota throttling.
    
    Infrastructure pods are allowed to run on platform CPUs since they're
    basically doing platform work.  This frees up some resources on
    application CPUs for "normal" kubernetes containers.
    
    Story: 2008760
    Task: 42159
    Change-Id: I28c99565ed8081496f6d8be4aa68144a1d3578ed
    Signed-off-by: Chris Friesen <chris.friesen@windriver.com>

commit 9d60767e32ab937be2e7d1beaf86c8651dd2ac5a
Author: Li Zhou <li.zhou@windriver.com>
Date:   Wed Mar 31 23:32:19 2021 -0400

ntp: fix CVE-2020-13817
    
    Update ntp source package to:
    ntp-4.2.6p5-29.el7.centos.2.src.rpm
    In fact it is version ntp-4.2.6p5-29.el7_8.2.
    (Refer to https://git.centos.org/rpms/ntp/c/
    e9ba41e9edf8efad8f090aad24845b8f4db0668d?branch=c7)
    
    Story: 2008532
    Task: 41691
    Signed-off-by: Li Zhou <li.zhou@windriver.com>
    Change-Id: If5db6b15b9c01a20a614bb160bba575c6b578d3e

commit 7badc1dad154bd28a8d299d748854dad53606c82
Author: Babak Sarashki <babak.sarashki@windriver.com>
Date:   Wed Mar 3 12:15:52 2021 +0000

integ: add nvidia gpu-operator helm charts
    
    This commit adds nvidia gpu-operator helm charts use case for
    custom container runtime feature. To load nvidia-gpu-operator
    on starlingx:
    
    system service-parameter-add platform container_runtime \
    custom_container_runtime=\
    nvidia:/usr/local/nvidia/toolkit/nvidia-container-runtime
    
    And define  runtimeClass for nvidia gpu  pods:
    
    kind: RuntimeClass
    apiVersion: node.k8s.io/v1beta1
    metadata:
      name: nvidia
    handler: nvidia
    
    The above will direct all containerd creations of pods with nvidia
    runtimeClass to nvidia-container-runtime -- where the nvidia-conta
    iner-runtime is installed by the operator onto a hostMount.
    
    Story: 2008434
    Task: 41978
    
    Signed-off-by: Babak Sarashki <babak.sarashki@windriver.com>
    Change-Id: Ifea8cdf6eb89a159f446c53566279e72fcf0e45e

commit 9c8d4bbcfb0d2b84bfc17276f4afa906b8e97686
Author: Chris Friesen <chris.friesen@windriver.com>
Date:   Wed Jul 15 19:45:24 2020 -0400

fix net/http caching of broken persistent connections
    
    The net/http transport code is currently broken, it keeps broken
    persistent connections in the cache if a write error happens during
    h2 handshake.
    
    This is documented in the upstream bug at:
    https://github.com/golang/go/issues/40213
    
    The problem occurs because in the "go" compiler the http2 code is
    imported into http as a bundle, with an additional "http2" prefix
    applied.  This messes up the erringRoundTripper handling because
    the name doesn't match.
    
    The solution is to have the "go" compiler look for an interface
    instead, so we add a new dummy function that doesn't actually do
    anything and then the "go" compiler can check whether the specified
    RoundTripper implements the dummy function.
    
    Specifically for Kubernetes we need to update the http2 code in the
    "vendor" subdirectory.  A separate change is being made in the "go"
    compiler.
    
    Partial-Bug: 1887438
    Depends-On: https://review.opendev.org/c/starlingx/compile/+/780669
    Signed-off-by: Chris Friesen <chris.friesen@windriver.com>
    Change-Id: I95dcbda879973524cd23b2a374537a675ce9435f

commit f161f7f18e6a12592f8e807b15576d6609d5946e
Author: Jim Gauld <James.Gauld@windriver.com>
Date:   Mon Mar 29 12:31:25 2021 +0000

Revert "integ: gpu-operator helm charts"
    
    This reverts commit 41bdf53f65684b54abaa3098a5fe3acf568cdf2a.
    
    Reason for revert: gpu operator patch is breaking stx-master build.
    
    e.g.,
    08:06:44 Failed to build packages:  gpu-operator-1.6.0-0.tis.1.src.rpm; problem with:
    Patch #2 (enablement-support-on-starlingx-cloud-platform.patch):
    . .
    Skipping patch.
    1 out of 1 hunk ignored -- saving rejects to file deployments/gpu-operator/templates/operator.yaml.rej
    patching file deployments/gpu-operator/values.yaml
    error: Bad exit status from /var/tmp/rpm-tmp.VQuqLh (%prep)
    
    Change-Id: Id7a05987586582c940d605874d1e0f813333f2c3

commit 41bdf53f65684b54abaa3098a5fe3acf568cdf2a
Author: Babak Sarashki <babak.sarashki@windriver.com>
Date:   Wed Mar 3 12:15:52 2021 +0000

integ: gpu-operator helm charts
    
    This commit adds nvidia gpu-operator helm charts use case for
    custom container runtime feature. To load nvidia-gpu-operator
    on starlingx:
    
    system service-parameter-add platform container_runtime \
    custom_container_runtime=\
    nvidia:/usr/local/nvidia/toolkit/nvidia-container-runtime
    
    And define  runtimeClass for nvidia gpu  pods:
    
    kind: RuntimeClass
    apiVersion: node.k8s.io/v1beta1
    metadata:
      name: nvidia
    handler: nvidia
    
    The above will direct all containerd creations of pods with nvidia
    runtimeClass to nvidia-container-runtime -- where the nvidia-conta
    iner-runtime is installed by the operator onto a hostMount.
    
    Story: 2008434
    Task: 41978
    
    Signed-off-by: Babak Sarashki <babak.sarashki@windriver.com>
    Change-Id: I999804d4697349bc0966d0a6e653d7bce15e18fc

commit 3832fabeff1493d424593ec502d261508d9e6e75
Author: Douglas Henrique Koerich <douglashenrique.koerich@windriver.com>
Date:   Mon Mar 22 15:06:41 2021 -0400

Upgrade pf-bb-config to version 21.3
    
    Upgrade of pf-bb-config package to v21.3 is required by Intel in order
    to have better support to ACC100 (Mount Bryce) device.
    
    Story: 2008440
    Task: 42090
    Change-Id: I2af1ca9fc43ae78f41f30f4bde255afaacb56c46
    Signed-off-by: Douglas Henrique Koerich <douglashenrique.koerich@windriver.com>

commit 852ec5ed538f5091ee7e6aa604be68295c09d21b
Author: Mihnea Saracin <Mihnea.Saracin@windriver.com>
Date:   Thu Mar 4 17:36:54 2021 +0200

Add custom apps in the k8s-pod-recovery service
    
    At startup, there might be pods that are left in unknown states.
    The k8s-pod-recovery service takes care of
    recovering these unknown pods in specific namespaces.
    To fix this for custom apps that are not part of starlingx,
    we modify the service to look into the /etc/k8s-post-recovery.d
    directory for conf files. Any app that needs to be recovered by this
    service will have to create a conf file e.g the app-1 will create
    /etc/k8s-post-recovery.d/APP_1.conf which will contain the following:
    namespace=app-1-namespace
    
    Closes-Bug: 1917781
    Signed-off-by: Mihnea Saracin <Mihnea.Saracin@windriver.com>
    Change-Id: I8febdb685d506cff3c34946163612cafdab3e3a8

commit 6169cc5d81f809a1237ba341f7cb87d09fdd811e
Author: Douglas Henrique Koerich <douglashenrique.koerich@windriver.com>
Date:   Thu Mar 11 09:12:49 2021 -0500

Handle labeled pods after stabilized
    
    Pods that are in a k8s deployment, daemonset, etc can be labeled as
    restart-on-reboot="true", which will automatically cause them to be
    restarted after the worker manifest has completed in an AIO system.
    It may happen, however, that k8s-pod-recovery service is started
    before the pods are scheduled and created at the node the script is
    running on, causing them to be not restarted. The proposed solution is
    to wait for stabilization of labeled pods before restarting them.
    
    Closes-Bug: 1900920
    Signed-off-by: Douglas Henrique Koerich <douglashenrique.koerich@windriver.com>
    Change-Id: I5c73bd838ab2be070bd40bea9e315dcf3852e47f

commit cb85cff32ba0afc48fbe16ab94dd36edc979fbb4
Author: Zhixiong Chi <zhixiong.chi@windriver.com>
Date:   Wed Jan 20 21:41:20 2021 -0500

dhcp: fix CVE-2019-6470
    
    Upgrade dhcp pkg to dhcp-4.2.5-82.el7.centos.src.rpm
    
    Adjust the context of the patch to match to apply the new version.
    At the same time as the new version depends on the bind-export
    pacakges, so we also add the dependence package in tools repo.
     bind-export-libs-9.11.4-26.P2.el7.x86_64.rpm
     bind-export-devel-9.11.4-26.P2.el7.x86_64.rpm
    
    In addition, since the patch dhcp-dhclient_ipv6_prefix.patch set the
    default prefixlen to 128, which is usually the specifications call
    for host address and it doesn't include any on-link information.
    By contrast, 64 indicates that's subnet area, and this vaule is used
    frequently as usual. So we still use the previous value 64.
    As a result we don't need to modify the relevant place where every
    application code needed for the compatibility any more.
    
    Depends-On: https://review.opendev.org/c/starlingx/tools/+/772241
    
    Story: 2008532
    Task: 41638
    Change-Id: I0305711790d8e3fb1adfa69e1077468456b65d84
    Signed-off-by: Zhixiong Chi <zhixiong.chi@windriver.com>

commit 7d7fe3dc61864e38c6f183aa6ae7583844b44183
Author: Joe Slater <joe.slater@windriver.com>
Date:   Wed Feb 24 17:39:05 2021 -0500

sudo: fix CVE-2021-3156
    
    Advance to sudo-1.8.23-10.el7_9.1.src.rpm.
    
    Closes-Bug: 1916946
    Change-Id: Ibb90439c77d6f5b1badcadb37080ff9e330787d5
    Signed-off-by: Joe Slater <joe.slater@windriver.com>

commit eccff3b0e661592084d9114a9a41816761e1f9b5
Author: Steven Webster <steven.webster@windriver.com>
Date:   Wed Feb 17 12:39:52 2021 -0500

Uprev SR-IOV CNI image
    
    This commit uprevs the SR-IOV CNI image to pick up a few bug
    fixes.  Specifically, this commit will allow rate-limiting
    configuration on a VF to be retained after the VF has been
    used by a pod (and pod subsequently deleted).
    
    Testing:
    
    NICs:
    Ethernet Controller X710 for 10GbE SFP+
    Mellanox MT27700 Family [ConnectX-4]
    
    Functional:
    Connectivity testing (kernel + DPDK)
    Devices allocated appropriately to pod
    Rate-limiting information retained after pod deletion
    
    Partial-Bug: #1915951
    
    Signed-off-by: Steven Webster <steven.webster@windriver.com>
    Change-Id: I32395c4805164401519cde8bc503f040c4187250

commit 8a33372bee65b517850245b55f771e3cd6bba0ff
Author: Babak Sarashki <zbsarashki@gmail.com>
Date:   Thu Feb 4 23:21:48 2021 +0000

Add: PF Baseband Device config application for ACC100
    
    This introduces PF BBDEV (baseband device) Configuration Application
    "pf_bb_config" and inih. PF BBDEV program accesses the configuration
    space and sets the various parameters through memory-mapped IO
    read/writes. This is needed for Intel ACC100 (Mt Bryce) configuration
    and QMGR related settings.
    
    PF BBDEV requires inih for parsing .INI configuration file. This
    commit adds the inih for static linkage with PF BBDEV.
    
    Story: 2008440
    Task: 41472
    Signed-off-by: Babak Sarashki <zbsarashki@gmail.com>
    Change-Id: Idaebcac5d0021d5c11c7ab27e13176139ba66c3b

commit 7b5b3aeabfdb47b51fce5f1591d82fd3ca5d9672
Author: Zhixiong Chi <zhixiong.chi@windriver.com>
Date:   Wed Feb 10 21:00:04 2021 -0500

Revert "dhcp: fix CVE-2019-6470"
    
    This reverts commit 613fbf258f72042f912a1fde5608168b1068db36.
    
    Since this upversioned package updates the prefixlen to 128, and it
    will occur all hosts offline after booting off the controller-0.
    At the same time this issue will block the use of recent loads for
    both development and test activities. So we revert the patch firstly,
    and investigate deeply then send the new review and request of the
    upgraded patch with the appropriate offline fix.
    
    Closes-Bug: #1915050
    
    Signed-off-by: Zhixiong Chi <zhixiong.chi@windriver.com>
    Change-Id: I02ecaa1bda463efb38d9c32a47f2221d0de7f99d

commit 29dd2fd42acf3e37e53715ca4755b8b089c743cb
Author: Li Zhou <li.zhou@windriver.com>
Date:   Tue Jan 26 07:50:09 2021 +0000

openssh: fix CVE-2018-15473 from source build
    
    Upgrade to openssh-7.4p1-21 for fixing CVE.
    
    Story: 2008532
    Task: 41668
    Signed-off-by: Li Zhou <li.zhou@windriver.com>
    Change-Id: Ic3e10b3455587bba16585fe8e235c4c0655f1e3e

commit d053c675546e944d0a08fb6f8d2b831647f70663
Author: Li Zhou <li.zhou@windriver.com>
Date:   Tue Jan 26 07:21:41 2021 +0000

sudo: fix CVE-2019-18634
    
    Upgrade to sudo-1.8.23-10 for fixing CVE.
    
    Story: 2008532
    Task: 41689
    Signed-off-by: Li Zhou <li.zhou@windriver.com>
    Change-Id: I863e66ee887de40d75db7951f4ba408ad022c131

commit a0b2acecaac080345c1cd42c3ad7fc05d75ac96a
Author: Zhixiong Chi <zhixiong.chi@windriver.com>
Date:   Mon Jan 25 03:49:38 2021 -0500

grub2: fix CVE-2020-15707
    
    Avoid to the heap-based buffer overflow.
    
    Upgrade to the below package to fix the CVE issue:
     grub2-2.02-0.86.el7.centos.src.rpm
    
    At the same time adjust the context and drop
    0004-grub2-remove-32b-requirements.patch since it already had been
    included in the new version.
    
    Story: 2008532
    Task: 41664
    Change-Id: I7943127323ee28457ffe0a4ece54764633f86d9f
    Signed-off-by: Zhixiong Chi <zhixiong.chi@windriver.com>

commit 613fbf258f72042f912a1fde5608168b1068db36
Author: Zhixiong Chi <zhixiong.chi@windriver.com>
Date:   Wed Jan 20 21:41:20 2021 -0500

dhcp: fix CVE-2019-6470
    
    Upgrade dhcp pkg to dhcp-4.2.5-82.el7.centos.src.rpm
    
    At the same time since the new version depends on the bind-export
    pacakge, so we also add the dependence package in tools repo.
    
    Depends-On: https://review.opendev.org/c/starlingx/tools/+/771744
    
    Story: 2008532
    Task: 41638
    Change-Id: Ic25b4404475a6f914e5a524db7d60d7e9dcffc85
    Signed-off-by: Zhixiong Chi <zhixiong.chi@windriver.com>

commit 8ec4e97b34e49d2ad212bc16f7863fd83eff6f8e
Author: Don Penney <don.penney@windriver.com>
Date:   Thu Dec 17 13:26:44 2020 -0500

Add auto-version for remaining stx/integ packages
    
    Update remaining StarlingX packages with hardcoded TIS_PATCH_VER to
    use PKG_GITREVCOUNT where possible, with offsets as needed to ensure
    the version is incremented above the hardcoded version.
    
    Change-Id: I9b40cd7e41c0cd713b73741ac3c8cab41d358642
    Story: 2008455
    Task: 41461
    Signed-off-by: Don Penney <don.penney@windriver.com>

commit 46d8d8fdf1ec75d74df6e2f20dff5f31732b9dc7
Author: Robert Church <robert.church@windriver.com>
Date:   Sat Dec 12 01:08:54 2020 -0500

Add conditions to when RBD devices are unmounted
    
    ceph-preshutdown.sh is called as a post operation when docker is
    stopped/restarted. Based on current service dependencies, when docker is
    restarted this will also trigger a restart of containerd.
    
    Puppet manifests will restart containerd and docker for various
    operations both on system boot and during runtime operations when their
    configuration has changed.
    
    This update adds conditions to ensure that the RBD devices are only
    unmounted when the system is shutting down. This avoids the RBD backed
    persistent volumes from being forcibly removed from running pods and
    being remounted read-only during these restart scenarios.
    
    Change-Id: I7adfddf135debcc8bcaa1f93866e1a276b554c88
    Closes-Bug: #1901449
    Signed-off-by: Robert Church <robert.church@windriver.com>

commit d815cfe2f2003f4a64a5e0b5348b0e9daabb58df
Author: Nicolas Alvarez <nicolas.alvarez@windriver.com>
Date:   Thu Dec 3 17:42:31 2020 -0300

Uninstall SNMP RPM Host-Based from STX.
    
    Uninstall SNMP RPM Host-Based from starlingx/integ repo because it
    will be containerized.
    Also disable snmp from networking/lldpd/centos/lldpd.spec file.
    
    Story: 2008132
    Task: 41322
    Depends-On: https://review.opendev.org/761792
    Signed-off-by: Nicolas Alvarez <nicolas.alvarez@windriver.com>
    
    Change-Id: Ifda06a5eb3bd0ec9683823b643e6d9cc0e7c97e2

commit 39f6f92cc888b3893b4d4717fba1599056382997
Author: Angie Wang <angie.wang@windriver.com>
Date:   Mon Sep 28 11:24:12 2020 -0400

Armada: add configurations for helm sql storage backend
    
    Configmap is the default helmv2 storage backend to store
    release information but its 1MB resource limit prevents
    scaling up stx openstack workers, so we want to use sql
    as helm storage backend.
    
    Update armada chart to support sql storage backend
    configuration for helm/tiller.
    
    Upstream review: https://review.opendev.org/#/c/759899/
    
    Partial-Bug: 1887677
    Change-Id: Ifcb7f28e99413be5a0dbfddf684ca064866860f5
    Signed-off-by: Angie Wang <angie.wang@windriver.com>

Report a bug

This report contains Public information

Everyone can see this information.

You are

Subscribing...

Edit bug mail

Other bug subscribers

Remote bug watches

redhat-bugs #1793527 Edit
redhat-bugs #1819868
[CLOSED ERRATA] Edit
auto-github-golang-go #40213 Edit

Bug watches keep track of this bug in other bug trackers.