SX upgrade failed during restore with an ssh timeout error during network manifest application

Bug #2040648 reported by Lucas Ratusznei Fonseca
Affects    Status        Importance  Assigned to              Milestone
StarlingX  Fix Released  Medium      Lucas Ratusznei Fonseca

Bug Description

Brief Description
-----------------
SX upgrade from 6 to 7 failed during restore with an ssh timeout error during network manifest application.

Severity
--------
Major

Steps to Reproduce
------------------
Upgrade from 6 to 7

Reproducibility
---------------
100% reproducible

System Configuration
--------------------
AIO-SX IPv6

OpenStack Infra (hudson-openstack) wrote : Fix proposed to stx-puppet (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/stx-puppet/+/899296

Changed in starlingx:
status: New → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to stx-puppet (master)

Reviewed: https://review.opendev.org/c/starlingx/stx-puppet/+/899296
Committed: https://opendev.org/starlingx/stx-puppet/commit/580c3e0cd9d21977890e91f83cc2c9ebe77d0026
Submitter: "Zuul (22348)"
Branch: master

commit 580c3e0cd9d21977890e91f83cc2c9ebe77d0026
Author: Lucas Ratusznei Fonseca <email address hidden>
Date: Wed Oct 25 15:59:33 2023 -0300

    Prevent interfaces from being reset during upgrade

    This commit adds logic to spare interfaces that are already up from
    being brought down/up during upgrade bootstrap. This ensures that SSH
    connections are not lost and that services that depend on network
    connections don't break. Instead of resetting the interfaces, the
    script just ensures that IP addresses and routes associated to them are
    present in the kernel.

    The logic for dealing with bonding interfaces is also improved. A
    change to any interface related to a bonding will cause all of the
    related interfaces to be reset, ensuring a valid state.

    Test plan

    Systems
    - AIO-SX IPv4
    - AIO-SX IPv6

    TC1 - Ansible upgrade simulation
    --------------------------------

    This test simulates script behaviour during ansible upgrade bootstrap
    execution. The script must not reset the OAM, so the SSH connections
    don't drop.

    OAM interface scenarios
    1. Regular ethernet
    2. VLAN on top of a regular ethernet
    3. Bonding
    4. VLAN on top of a bonding

    OAM example parameters
    - IP address: 10.20.1.3/24 / fd00::a14:103/64
    - Gateway: 10.20.1.1 / fd00::1

    Setup
    1. Configure host via sysinv and unlock it, so that the OAM
       interface is up and properly configured.

    2. Prepare network runtime manifest to be applied

        mkdir -p /tmp/network_config
        cat > /tmp/network_config/network_runtime.yml <<EOF
        classes: [platform::network::runtime]
        EOF

    3. Create empty file /var/run/.network_upgrade_bootstrap, to
       simulate upgrade.

        touch /var/run/.network_upgrade_bootstrap

    Procedure
    1. Erase all IP addresses from the OAM
      > Ex.: ip address flush dev <interface>
    2. Erase default route from the OAM
      > Ex.: ip route delete default via 10.20.1.1 dev <interface>
    3. Erase interface configs, to simulate clean system
      > Ex.: rm -f /etc/network/interfaces.d/*
    4. Apply network runtime manifest
      > Ex.: /usr/local/bin/puppet-manifest-apply.sh \
             /opt/platform/puppet/22.12/hieradata/ 192.168.204.2 \
             controller runtime /tmp/network_config/network_runtime.yml
    5. Check /var/log/user.log, it must not contain the messages
       'Bringing <interface> down' and 'Bringing <interface> up' for
       the interfaces related to the OAM.

    6. Check that the OAM interface has its IP and default route properly
       configured.
      > Ex.: ip -br addr show dev <interface>
             ip route | grep <interface>

    Tests

    PASS Scenario 1 - OAM on regular ethernet
    PASS Scenario 2 - OAM on VLAN on top of a regular ethernet
    PASS Scenario 3 - OAM on bonding
    PASS Scenario 4 - ...
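
Step 5 of the procedure above can be automated with a small log check. The following is a minimal sketch, not part of the merged fix: the function name check_no_reset is hypothetical, and the message strings are assumed to appear in the log exactly as quoted in step 5 ('Bringing <interface> down' and 'Bringing <interface> up').

```shell
#!/bin/sh
# Hypothetical helper (not from stx-puppet): report whether a log file
# contains interface reset messages for a given interface.
# Assumes the messages match the wording quoted in step 5 of the test plan.
check_no_reset() {
    log_file="$1"
    iface="$2"

    # Refuse to report PASS when the log cannot be read at all.
    if [ ! -r "$log_file" ]; then
        echo "ERROR: cannot read ${log_file}"
        return 2
    fi

    # -F treats the patterns as fixed strings, so dots in interface
    # names (e.g. VLAN interfaces) are not interpreted as wildcards.
    if grep -F -e "Bringing ${iface} down" -e "Bringing ${iface} up" \
            "$log_file" >/dev/null; then
        echo "FAIL: ${iface} was reset"
        return 1
    fi

    echo "PASS: ${iface} was not reset"
    return 0
}
```

After applying the network runtime manifest, it could be called as `check_no_reset /var/log/user.log <interface>` for each interface related to the OAM.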

Changed in starlingx:
status: In Progress → Fix Released
Ghada Khalil (gkhalil)
Changed in starlingx:
assignee: nobody → Lucas Ratusznei Fonseca (lratuszn)
importance: Undecided → Medium
tags: added: stx.9.0 stx.networking stx.update
Ghada Khalil (gkhalil) wrote:

The code changes for this LP introduced an issue in the apply_network_config script that causes the first unlock to fail on AIO-DX, as reported in LP: https://bugs.launchpad.net/starlingx/+bug/2043133

A follow-up fix was merged on Nov 9. Review: https://review.opendev.org/c/starlingx/stx-puppet/+/900551
