Bootstrap replay fails when changing mgmt subnet

Bug #1925668 reported by David Sullivan
This bug affects 2 people
Affects: StarlingX
Status: Fix Released
Importance: Medium
Assigned to: zhipeng liu

Bug Description

Brief Description
Replaying the bootstrap playbook with a new management subnet fails

Severity
Major

Steps to Reproduce
1. Run the bootstrap playbook with one management subnet (default or specified).
2. Specify a new management subnet.
3. Rerun the bootstrap playbook.

Expected Behavior
Playbook completes

Actual Behavior
Playbook fails

Reproducibility
Hit only once, but should be reproducible

System Configuration
Hit in AIO-SX, should be reproducible in all configurations.

Branch/Pull Time/Commit
2021-04-14_00-00-08

Last Pass
Unknown when last passed. Should be before https://opendev.org/starlingx/ansible-playbooks/commit/41de2e52db9985d84397d7ba56a59bbeaa9cf88f

Timestamp/Logs

2021-04-14 21:54:45,952 p=194441 u=sysadmin | fatal: [localhost]: FAILED! => {"changed": true, "cmd": ["/usr/local/bin/puppet-manifest-apply.sh", "/tmp/hieradata", "192.168.199.2", "controller", "runtime", "/tmp/etcd.yml"], "delta": "0:00:30.077996", "end": "2021-04-14 21:54:45.917552", "msg": "non-zero return code", "rc": 1, "start": "2021-04-14 21:54:15.839556", "stderr": "rsync: link_stat \"/tmp/hieradata/192.168.199.2.yaml\" failed: No such file or directory (2)\nrsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1178) [sender=3.1.2]\nrsync: link_stat \"/tmp/hieradata/192.168.199.2.yaml\" failed: No such file or directory (2)\nrsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1178) [sender=3.1.2]\nrsync: link_stat \"/tmp/hieradata/192.168.199.2.yaml\" failed: No such file or directory (2)\nrsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1178) [sender=3.1.2]", "stderr_lines": ["rsync: link_stat \"/tmp/hieradata/192.168.199.2.yaml\" failed: No such file or directory (2)", "rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1178) [sender=3.1.2]", "rsync: link_stat \"/tmp/hieradata/192.168.199.2.yaml\" failed: No such file or directory (2)", "rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1178) [sender=3.1.2]", "rsync: link_stat \"/tmp/hieradata/192.168.199.2.yaml\" failed: No such file or directory (2)", "rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1178) [sender=3.1.2]"], "stdout": "drwxr-xr-x 180 2021/04/14 21:54:15 hieradata\ndrwxr-xr-x 180 2021/04/14 21:54:15 hieradata\n[FAILED]\nExiting, failed to rsync hieradata", "stdout_lines": ["drwxr-xr-x 180 2021/04/14 21:54:15 hieradata", "drwxr-xr-x 180 2021/04/14 21:54:15 hieradata", "[FAILED]", "Exiting, failed to rsync hieradata"]}
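
The log shows puppet-manifest-apply.sh being called with the address 192.168.199.2 and rsync failing because /tmp/hieradata/192.168.199.2.yaml does not exist. The following is a minimal Python sketch of that failure mode, not StarlingX code. It assumes, from the log alone, that the script expects a per-host file named <address>.yaml in the hieradata directory. The helper name find_host_hieradata and the "old" address 192.168.204.2 are hypothetical.

```python
import tempfile
from pathlib import Path
from typing import Optional


def find_host_hieradata(hieradata_dir: Path, host_address: str) -> Optional[Path]:
    """Return the per-host hieradata file for host_address, or None if absent.

    Assumption taken from the log above: the file is named <address>.yaml.
    """
    candidate = hieradata_dir / f"{host_address}.yaml"
    return candidate if candidate.is_file() else None


# Recreate the situation in a scratch directory: hieradata exists only for
# the address used by the first bootstrap run (hypothetical value).
with tempfile.TemporaryDirectory() as tmp:
    hieradata = Path(tmp)
    (hieradata / "192.168.204.2.yaml").write_text("---\n")

    # The old address has a file; the address from the failing log does not.
    old_found = find_host_hieradata(hieradata, "192.168.204.2") is not None
    new_found = find_host_hieradata(hieradata, "192.168.199.2") is not None

print(old_found, new_found)
```

A check of this kind before the apply step would surface the mismatch between the requested address and the available hieradata directly, rather than through an rsync error.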

Test Activity
Feature Testing

Workaround
None

Ghada Khalil (gkhalil) wrote :

screening: marking as stx.5.0/high as this is a regression introduced by the etcd changes for that release

Changed in starlingx:
importance: Undecided → High
status: New → Triaged
assignee: nobody → zhipeng liu (zhipengs)
tags: added: stx.5.0 stx.config
OpenStack Infra (hudson-openstack) wrote : Fix proposed to ansible-playbooks (master)
Changed in starlingx:
status: Triaged → In Progress
Ghada Khalil (gkhalil) wrote :

Based on the stx release meeting (2021-05-19), agreed not to hold up the stx.5.0 release on this, as it is a reconfiguration scenario on initial bootstrap and can be avoided.

Changed in starlingx:
importance: High → Medium
tags: added: stx.6.0
removed: stx.5.0
OpenStack Infra (hudson-openstack) wrote : Fix merged to ansible-playbooks (master)

Reviewed: https://review.opendev.org/c/starlingx/ansible-playbooks/+/787942
Committed: https://opendev.org/starlingx/ansible-playbooks/commit/20e44c6dd71758f89bbfe88c6204e777c17deb5d
Submitter: "Zuul (22348)"
Branch: master

commit 20e44c6dd71758f89bbfe88c6204e777c17deb5d
Author: Zhipeng Liu <email address hidden>
Date: Tue Apr 27 01:40:16 2021 +0800

    Fix bootstrap replay failure when changing mgmt subnet

    After mgmt subnet is changed, we use previous controller_0
    address for etcd puppet apply to avoid resyncing a nonexistent
    hieradata file.

    Change-Id: Ie31c48153af30df240237013dd51bfffea5213cd
    Closes-Bug: 1925668
    Signed-off-by: Zhipeng Liu <email address hidden>
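
The address selection described in this commit message can be sketched in Python as follows. This is an illustration of the stated idea only, not the playbook change itself (the actual fix lives in the Ansible playbooks). The function name etcd_apply_address and the example addresses are hypothetical.

```python
from typing import Optional


def etcd_apply_address(current_address: str, previous_address: Optional[str]) -> str:
    """Pick the address passed to the etcd puppet apply step.

    On a replay where the management subnet changed, hieradata exists only
    for the previous controller_0 address, so that address is used instead
    of the new one.
    """
    if previous_address and previous_address != current_address:
        return previous_address
    return current_address


# First run: there is no previous address, so the current one is used.
first_run = etcd_apply_address("192.168.199.2", None)

# Replay after a subnet change: fall back to the previous address
# (192.168.204.2 is a made-up example value).
replay = etcd_apply_address("192.168.199.2", "192.168.204.2")

print(first_run, replay)
```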

Changed in starlingx:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote :

Reviewed: https://review.opendev.org/c/starlingx/ansible-playbooks/+/792271
Committed: https://opendev.org/starlingx/ansible-playbooks/commit/e221ef8fbe51aa6ca229b584fb5632fe512ad5cb
Submitter: "Zuul (22348)"
Branch: master

commit e221ef8fbe51aa6ca229b584fb5632fe512ad5cb
Author: David Sullivan <email address hidden>
Date: Wed May 19 16:01:27 2021 -0500

    Support bootstrap replay with networking changes

    Currently bootstrap playbook replay will fail if the management or
    cluster host networks are changed. To resolve this a couple of changes
    are needed:

    * Restart the sysinv agent and wait until it is ready. The sysinv agent
      uses the current management ip for the rabbitMQ connection and
      applying runtime manifests. The process needs to be restarted to
      resync that data.

    * Copy the etcd certs to the /opt/platform on replay. The etcd-server
      certs are regenerated on replay. When the cluster host network changed
      the SAN in the certs under /opt/platform were out of date resulting in
      kube-apiserver failures on controller-0 unlock.

    Closes-Bug: 1925668
    Signed-off-by: David Sullivan <email address hidden>
    Change-Id: I228321a2540a0024cd217ed844feb54be9ae3b29
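
The "restart and wait until it is ready" step in the first bullet follows a common poll-with-timeout pattern, sketched here in Python. The commit message does not say how readiness is detected, so the readiness probe is a placeholder and every name in the sketch is hypothetical.

```python
import time
from typing import Callable


def wait_until_ready(is_ready: Callable[[], bool],
                     timeout_s: float,
                     interval_s: float = 1.0) -> bool:
    """Poll is_ready() until it returns True or timeout_s elapses.

    Returns True if the service became ready in time, False otherwise.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if is_ready():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)


# Stand-in for a real probe of the restarted agent: reports ready on the
# third poll.
polls = {"count": 0}


def fake_agent_ready() -> bool:
    polls["count"] += 1
    return polls["count"] >= 3


ready = wait_until_ready(fake_agent_ready, timeout_s=5.0, interval_s=0.01)
print(ready, polls["count"])
```

Using a monotonic clock for the deadline keeps the timeout correct even if the system time is adjusted while the loop is waiting.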

OpenStack Infra (hudson-openstack) wrote : Fix proposed to ansible-playbooks (f/centos8)
OpenStack Infra (hudson-openstack) wrote : Change abandoned on ansible-playbooks (f/centos8)

Change abandoned by "Chuck Short <email address hidden>" on branch: f/centos8
Review: https://review.opendev.org/c/starlingx/ansible-playbooks/+/792829

OpenStack Infra (hudson-openstack) wrote : Fix proposed to ansible-playbooks (r/stx.5.0)
OpenStack Infra (hudson-openstack) wrote : Change abandoned on ansible-playbooks (r/stx.5.0)

Change abandoned by "zhipeng liu <email address hidden>" on branch: r/stx.5.0
Review: https://review.opendev.org/c/starlingx/ansible-playbooks/+/793042
Reason: Abandon it according to the comment from Ghada, thanks!

OpenStack Infra (hudson-openstack) wrote : Fix proposed to ansible-playbooks (f/centos8)
OpenStack Infra (hudson-openstack) wrote : Fix merged to ansible-playbooks (f/centos8)

Reviewed: https://review.opendev.org/c/starlingx/ansible-playbooks/+/794324
Committed: https://opendev.org/starlingx/ansible-playbooks/commit/163ec9989cc7360dba4c572b2c43effd10306048
Submitter: "Zuul (22348)"
Branch: f/centos8

commit 4e96b762f549aadb0291cc9bcf3352ae923e94eb
Author: Mihnea Saracin <email address hidden>
Date: Sat May 22 15:48:19 2021 +0000

    Revert "Restore host filesystems with collected sizes"

    This reverts commit 255488739efa4ac072424b19f2dbb7a3adb0254e.

    Reason for revert: Did a rework to fix https://bugs.launchpad.net/starlingx/+bug/1926591. The original problem was in puppet, and this fix in ansible was not good enough, it generated some other problems.

    Change-Id: Iea79701a874effecb7fe995ac468d50081d1a84f
    Depends-On: I55ae6954d24ba32e40c2e5e276ec17015d9bba44

commit c064aacc377c8bd5336ceab825d4bcbf5af0b5e8
Author: Angie Wang <email address hidden>
Date: Fri May 21 21:28:02 2021 -0400

    Ensure apiserver keys are present before extract from tarball

    This is to fix the upgrade playbook issue that happens during
    AIO-SX upgrade from stx4.0 to stx5.0 which was introduced by
    https://review.opendev.org/c/starlingx/ansible-playbooks/+/792093.
    The apiserver keys are not available in stx4.0 side so we need
    to ensure the keys under /etc/kubernetes/pki are present in the
    backed-up tarball before extracting, otherwise playbook fails
    because the keys are not found in the archive.

    Change-Id: I8602f07d1b1041a7fd3fff21e6f9a422b9784ab5
    Closes-Bug: 928925
    Signed-off-by: Angie Wang <email address hidden>

commit 0261f22ff7c23d2a8608fe3b51725c9f29931281
Author: Don Penney <email address hidden>
Date: Thu May 20 23:09:07 2021 -0400

    Update SX to DX migration to wait for coredns config

    This commit updates the SX to DX migration playbook to wait after
    modifying the system mode to duplex until the runtime manifest that
    updates coredns config has completed. The playbook will wait for up to
    20 minutes to allow for the possibility that sysinv has multiple
    runtime manifests queued up, each of which could take several minutes.

    Depends-On: https://review.opendev.org/c/starlingx/stx-puppet/+/792494
    Depends-On: https://review.opendev.org/c/starlingx/config/+/792496
    Change-Id: I3bf94d3493ae20eeb16b3fdcb27576ee18c0dc4d
    Closes-Bug: 1929148
    Signed-off-by: Don Penney <email address hidden>

commit 7c4f17bd0d92fc1122823211e1c9787829d206a9
Author: Daniel Safta <email address hidden>
Date: Wed May 19 09:08:16 2021 +0000

    Fixed missing apiserver-etcd-client certs

    When controller-1 is the active controller
    the backup archive does not contain
    /etc/etcd/apiserver-etcd-client.{crt, key}

    This change adds a new task which brings
    the certs from /etc/kubernetes/pki

    Closes-bug: 1928925
    Signed-off-by: Daniel Safta <email address hidden>
    Change-Id: I3c68377603e1af9a71d104e5b1108e9582497a09

commit e221ef8fbe51aa6ca229b584fb5632fe512ad5cb
Author: David Sullivan <email address hidden>
Date: Wed May 19 16:01:27 2021 -0500

    Support boo...

tags: added: in-f-centos8