DC upgrade - Subcloud upgrade failed to populate /etc/network/

Bug #2044020 reported by Andre Kantek
Affects    Status        Importance  Assigned to   Milestone
StarlingX  Fix Released  Medium      Andre Kantek

Bug Description

Brief Description

DC upgrade - Subcloud upgrade failed to populate /etc/network/

Error:

TASK [optimized-restore/restore-data : Apply network runtime manifest to populate /etc/network/] ***
Wednesday 15 November 2023 16:31:28 +0000 (0:00:00.481) 0:14:59.746 ****
fatal: [subcloud2]: FAILED! => changed=true
  cmd: puppet-manifest-apply.sh /opt/platform/puppet/22.12/hieradata fdff:719a:bf60:1098::3 worker runtime /tmp/ansible.o8mrkygu/network_runtime.yml
  delta: '0:00:21.676148'
  end: '2023-11-15 16:31:38.128618'
  msg: non-zero return code
  rc: 1
  start: '2023-11-15 16:31:16.452470'
  stderr: ''
  stderr_lines: <omitted>
  stdout: |-
    Applying puppet runtime manifest...
    [WARNING]
    Warnings found. See /var/log/puppet/2023-11-15-16-31-16_runtime/puppet.log for details
  stdout_lines: <omitted>

/var/log/puppet/2023-11-15-16-31-16_runtime/puppet.log

2023-11-15T16:31:38.087 Error: 2023-11-15 16:31:37 +0000 /Stage[pre]/Platform::Network::Apply/Exec[wait-for-tentative]/returns: change from 'notrun' to ['0'] failed: '[ $(ip -6 addr sh | grep -c inet6.*tentative) -eq 0 ]' returned 1 instead of one of [0]
2023-11-15T16:31:38.089 Notice: 2023-11-15 16:31:37 +0000 /Stage[main]/Platform::Anchors/Anchor[platform::networking]: Dependency Exec[wait-for-tentative] has failures: true

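The failing Exec runs the shell test `[ $(ip -6 addr sh | grep -c inet6.*tentative) -eq 0 ]`, which passes only when no IPv6 address line is still flagged tentative. A minimal Python sketch of the same check follows. The function name and the sample text are illustrative and were not captured from the failed subcloud; the sample assumes that an address which failed duplicate address detection is listed with both the `dadfailed` and `tentative` flags, so it keeps matching the pattern.

```python
import re

# Same pattern the Exec passes to grep: an inet6 line that mentions "tentative".
TENTATIVE = re.compile(r"inet6.*tentative")


def count_tentative(ip_output: str) -> int:
    """Count matching lines, like `grep -c inet6.*tentative`."""
    return sum(1 for line in ip_output.splitlines() if TENTATIVE.search(line))


# Illustrative `ip -6 addr sh` text; addresses, devices and flags are made up.
SAMPLE = """\
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 state UNKNOWN qlen 1000
    inet6 fd00::2/64 scope global
    inet6 ::1/128 scope host
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 state UP qlen 1000
    inet6 fd00::2/64 scope global dadfailed tentative
"""

if __name__ == "__main__":
    n = count_tentative(SAMPLE)
    print(f"tentative addresses: {n}; check passes: {n == 0}")
```

With the same address present on both the loopback and a platform interface, as in the sample, the count stays above zero and the check keeps failing.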
Severity

<Critical: System/Feature is not usable after the defect>

Steps to Reproduce

1. Deploy systemcontroller with stx8.0
2. Import inactive load (stx6.0)
3. Deploy subcloud with inactive load
4. Manage subcloud
5. Backup and prestage subcloud
6. Upgrade subcloud

Expected Behavior

The subcloud should be upgraded successfully

Actual Behavior

The subcloud upgrade failed to populate network config at /etc/network

Reproducibility

seen once

System Configuration

DC

Alarms

no alarms available

Test Activity

Regression Testing

Workaround

no workarounds

Andre Kantek (akantek)
Changed in starlingx:
assignee: nobody → Andre Kantek (akantek)
Changed in starlingx:
status: New → In Progress
Ghada Khalil (gkhalil)
tags: added: stx.9.0 stx.networking stx.update
Changed in starlingx:
importance: Undecided → Medium
OpenStack Infra (hudson-openstack) wrote : Fix merged to stx-puppet (master)

Reviewed: https://review.opendev.org/c/starlingx/stx-puppet/+/901207
Committed: https://opendev.org/starlingx/stx-puppet/commit/39f1517bc12f6e51c9733b71e4bf1d46e9040822
Submitter: "Zuul (22348)"
Branch: master

commit 39f1517bc12f6e51c9733b71e4bf1d46e9040822
Author: Andre Kantek <email address hidden>
Date: Thu Nov 16 17:45:13 2023 -0300

    Do not set floating address as deprecated during upgrade bootstrap

    During upgrade all floating addresses in /etc/hosts are configured
    in the loopback in the step "Bring up temporary addresses", located
    in
    https://opendev.org/starlingx/ansible-playbooks/src/branch/master/playbookconfig/src/playbooks/roles/optimized-restore/restore-configuration/tasks/restore-networking.yml

    Later the network will be restored at the step
    https://opendev.org/starlingx/ansible-playbooks/src/branch/master/playbookconfig/src/playbooks/roles/optimized-restore/restore-data/tasks/upgrade-networking.yml

    That step calls the class "platform::network::runtime", which in turn
    executes platform::network::network_address. This class marks the
    floating addresses as deprecated so that they are not selected as the
    source address on the definitive interfaces. Since it uses
    "ip addr replace", it adds the address even if it is not configured.

    Under certain conditions the presence of the same address on the
    loopback and on a network interface may result in a dadfailed address,
    which makes a puppet test fail and prevents the upgrade from going
    further.

    But during the AIO-SX upgrade the floating addresses just need to be
    configured for internal comms. They can stay on the loopback; after
    the first unlock post upgrade, all addressing will be corrected to the
    definitive interfaces.

    This change uses the flag /var/run/.network_upgrade_bootstrap, created
    by the upgrade bootstrap, to control the network configuration and
    avoid execution of this step.

    Test Plan
    [PASS] Upgrade an AIO-SX and validate that the network runtime puppet
           execution will pass the playbook execution.
    [PASS] lock/unlock AIO-SX to execute puppet network class outside the
           upgrade execution

    Closes-Bug: 2044020

    Change-Id: I4735d4bcb130918fad7ebb716ad5be2ba5b7e8fe
    Signed-off-by: Andre Kantek <email address hidden>
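
The gating described in the commit message above can be sketched in Python as follows. This is an illustration only, not the Puppet code from the change: the function name, the tuple layout and the use of `preferred_lft 0` to mark an address deprecated are assumptions; only the flag path and the use of "ip addr replace" come from the commit message.

```python
import os
import tempfile

# Flag path named in the commit message; everything else here is hypothetical.
UPGRADE_FLAG = "/var/run/.network_upgrade_bootstrap"


def deprecate_commands(floating, flag_path=UPGRADE_FLAG):
    """Return the `ip` commands that would mark floating addresses deprecated.

    floating: list of (address_with_prefix, device) tuples.
    Returns an empty list while the upgrade-bootstrap flag file exists,
    so the addresses are left untouched on the loopback.
    """
    if os.path.exists(flag_path):
        return []
    # Assumption: preferred_lft 0 is how the address is marked deprecated.
    return [
        ["ip", "addr", "replace", addr, "dev", dev, "preferred_lft", "0"]
        for addr, dev in floating
    ]


if __name__ == "__main__":
    addrs = [("fd00::2/64", "eth0")]  # made-up floating address
    with tempfile.NamedTemporaryFile() as flag:
        # Flag present: upgrade bootstrap in progress, nothing to run.
        print(deprecate_commands(addrs, flag_path=flag.name))
    # Flag absent: regular runtime apply, commands are generated.
    print(deprecate_commands(addrs, flag_path="/nonexistent/flag"))
```

Returning the commands instead of running them keeps the sketch free of side effects; a real implementation would execute them with the privileges the platform requires.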

Changed in starlingx:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix proposed to stx-puppet (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/stx-puppet/+/901686

Ghada Khalil (gkhalil)
Changed in starlingx:
status: Fix Released → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to stx-puppet (master)

Reviewed: https://review.opendev.org/c/starlingx/stx-puppet/+/901686
Committed: https://opendev.org/starlingx/stx-puppet/commit/972b7cfc0cc3b263cf374cf87d1fcd0d4a7e3d50
Submitter: "Zuul (22348)"
Branch: master

commit 972b7cfc0cc3b263cf374cf87d1fcd0d4a7e3d50
Author: Andre Kantek <email address hidden>
Date: Wed Nov 22 16:15:53 2023 -0300

    Do not search for tentative addresses during AIO-SX upgrade

    During upgrade the platform addresses are first configured
    in the loopback to have them available during the bootstrap.

    But later the ansible step "Apply network runtime manifest to populate
    /etc/network/" runs and activates the configured interfaces. This
    leads to a situation where the controller address of a platform
    interface may trigger a dad-failed event, because the same address is
    configured on both the interface and the loopback.

    This change skips this verification during upgrade for the
    following reasons:
    1) During the upgrade bootstrap only the OAM interface is needed
    for outside communications
    2) The other platform interfaces will not be used in this stage
    3) After bootstrap the system is automatically unlocked, and after
    the reboot the tentative validation takes effect again because the
    regular network configuration is applied.

    Test Plan:
    [PASS] AIO-SX upgrade from CentOS version to Debian
    [PASS] Execute lock/unlock to test regular execution

    Closes-Bug: 2044020

    Change-Id: I1d61ef5260e3ddbfce8c3f01127a77a55524a746
    Signed-off-by: Andre Kantek <email address hidden>
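
A rough sketch of the behaviour this second commit describes: wait until no tentative address remains, but skip the wait entirely during the upgrade bootstrap. The commit message does not say how the upgrade condition is detected or how many retries the Exec uses, so the `upgrade_bootstrap` parameter, the retry count and the delay below are all hypothetical.

```python
import time


def wait_for_no_tentative(count_tentative, tries=10, delay=1.0,
                          upgrade_bootstrap=False, sleep=time.sleep):
    """Poll until count_tentative() returns 0; return True on success.

    count_tentative: zero-argument callable returning the number of
    tentative IPv6 addresses (for example, parsed from `ip -6 addr sh`).
    When upgrade_bootstrap is True the verification is skipped and the
    callable is never invoked.
    """
    if upgrade_bootstrap:
        return True
    for attempt in range(tries):
        if count_tentative() == 0:
            return True
        if attempt < tries - 1:
            sleep(delay)  # give duplicate address detection time to finish
    return False


if __name__ == "__main__":
    stuck = lambda: 1  # a dadfailed address never leaves the tentative state
    no_sleep = lambda _seconds: None
    print(wait_for_no_tentative(stuck, tries=3, sleep=no_sleep))
    print(wait_for_no_tentative(stuck, upgrade_bootstrap=True))
```

The `sleep` parameter exists only so the loop can be exercised without real delays.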

Changed in starlingx:
status: In Progress → Fix Released