After DOR, configuring a route causes interfaces to go down on system controller

Bug #1895693 reported by Ghada Khalil
This bug affects 1 person
Affects    Status        Importance  Assigned to  Milestone
StarlingX  Fix Released  High        Don Penney

Bug Description

Brief Description
-----------------
The DC system controllers were observed going into a reboot loop after a subcloud was added. Further investigation showed that the networking scripts on the controllers were empty, which brought down all interfaces and left the system unusable. The bug is triggered only if a DOR (Dead Office Recovery, where all nodes reboot at once) occurred on the system earlier.

From Don Penney:
The route-add runtime manifest relies on networking puppet data cached by a previous manifest apply. On a standard controller that reboots without an active controller (a simplex controller reboot or a duplex DOR), the manifests are not applied during init, so no networking puppet data is cached. When a route is then added, the config script runs, sees no interfaces in the puppet data, concludes that they have all been deleted, and shuts them all down.
This was introduced by: https://review.opendev.org/703034
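
The failure mode described above can be illustrated with a minimal sketch. It is not the actual apply_network_config.sh logic; the function name and the interface names are hypothetical, and the only assumption is that interfaces missing from the cached puppet data are treated as deleted.

```python
def stale_interfaces(on_system, in_puppet_data):
    """Return interfaces found on the system but absent from the puppet data.

    Hypothetical helper: a config script that treats this list as
    "deleted by the user" would bring every listed interface down.
    """
    return sorted(set(on_system) - set(in_puppet_data))


# Hypothetical interface names, for illustration only.
on_system = ["lo", "enp0s3", "enp0s8"]

# Normal case: the cached puppet data matches the system, nothing is removed.
print(stale_interfaces(on_system, ["lo", "enp0s3", "enp0s8"]))

# After a DOR no manifest was applied, so the cached data is empty and
# every interface, including the loopback, looks like it was deleted.
print(stale_interfaces(on_system, []))
```

With empty cached data the difference is the full interface list, which matches the observed behaviour of all networking going down.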

Severity
--------
Major

Steps to Reproduce
------------------
- Set up duplex DC system controllers (or just duplex controllers)
- Perform a DOR
- In DC, add a new subcloud, which adds a new route on the system controller
- On a non-DC system, add a route using the system CLI command

Expected Behavior
------------------
The system remains usable after the route is added

Actual Behavior
----------------
The networking scripts are removed from the system controller, resulting in it going into a reboot loop

Reproducibility
---------------
Reproducible when following the exact steps above

System Configuration
--------------------
Duplex controllers or DC system

Branch/Pull Time/Commit
-----------------------
stx master; the issue also exists in stx.4.0, since the code that causes it was introduced in that release

Last Pass
---------
This particular test was never intentionally run previously.

Timestamp/Logs
--------------

Test Activity
-------------
DC lab usage

Workaround
----------
none

Ghada Khalil (gkhalil)
Changed in starlingx:
assignee: nobody → Don Penney (dpenney)
tags: added: stx.config stx.networking
description: updated
Ghada Khalil (gkhalil)
tags: added: stx.5.0
Ghada Khalil (gkhalil) wrote :

Marking for both stx.5.0 & stx.4.0 given the system is not recoverable when the issue is hit.

tags: added: not-yet-in-r-stx40 stx.4.0
Changed in starlingx:
importance: Undecided → High
status: New → Triaged
OpenStack Infra (hudson-openstack) wrote : Fix proposed to stx-puppet (master)

Fix proposed to branch: master
Review: https://review.opendev.org/752081

Changed in starlingx:
status: Triaged → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to stx-puppet (master)

Reviewed: https://review.opendev.org/752081
Committed: https://git.openstack.org/cgit/starlingx/stx-puppet/commit/?id=4d81ca178045edd7cfd2c9bae0e12e10b77c57b3
Submitter: Zuul
Branch: master

commit 4d81ca178045edd7cfd2c9bae0e12e10b77c57b3
Author: Don Penney <email address hidden>
Date: Tue Sep 15 12:02:53 2020 -0400

    Fix route config handling for DOR

    In a DOR (Dead Office Recovery, where all nodes reboot at once), both
    controllers are rebooting at the same time. This means that there is
    no active controller from which to retrieve puppet data in order to
    apply the controller manifests. As such, we have to be careful not to
    rely on having the controller manifests run on every controller boot,
    for things like launching services or any sort of changes in a
    volatile file system like /var/run.

    The route configuration optimization changes that were added in
    https://review.opendev.org/703034 inadvertently relied on existing
    network configuration data being cached in /var/run, however. As a
    result, a route configuration change after a DOR of a system with
    standard controllers would end up running with no cached network
    config data (an AIO system would have this data generated as part of
    applying the worker manifest), and the apply_network_config.sh utility
    would think that all network interfaces have been removed from the
    system. It would then proceed to apply that config, deleting the
    interfaces and taking down all networking.

    This commit enhances the apply_network_config.sh to introduce a
    --routes option to separate the route configuration operations from
    the rest of the networking config. When a route is added or deleted,
    then, only the route config changes are processed, ignoring network
    interfaces.

    Additionally, this adds a check in the interface section of the
    apply_network_config.sh utility to verify that at least 'lo' exists.
    Since the loopback interface should always exist, its absence would
    indicate that the interface config data is missing or corrupted, and
    is unsafe to apply.

    Change-Id: I5583ec916aee8117e5686cfb10fb18ddda4806b1
    Closes-Bug: 1895693
    Signed-off-by: Don Penney <email address hidden>
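
The two safeguards described in the commit message can be sketched as follows. This is only an illustration in Python, not the shell code from the commit; the function name, the exception, and the routes_only parameter (standing in for the --routes option) are hypothetical.

```python
class UnsafeInterfaceConfig(Exception):
    """Raised when the interface config data looks missing or corrupted."""


def interfaces_to_remove(on_system, in_puppet_data, routes_only=False):
    """Decide which interfaces a config apply would take down.

    Hypothetical sketch of the two safeguards:
    - routes_only mimics the --routes option: interfaces are ignored.
    - the loopback check refuses config data that does not contain 'lo'.
    """
    if routes_only:
        # A route add or delete never touches network interfaces.
        return []
    if "lo" not in in_puppet_data:
        # 'lo' should always exist, so its absence means the data is unsafe.
        raise UnsafeInterfaceConfig("no 'lo' in interface config data")
    return sorted(set(on_system) - set(in_puppet_data))


on_system = ["lo", "enp0s3"]

# Route change after a DOR: the empty cached data is never consulted.
print(interfaces_to_remove(on_system, [], routes_only=True))

# Full interface apply with empty cached data is rejected instead of applied.
try:
    interfaces_to_remove(on_system, [])
except UnsafeInterfaceConfig as err:
    print("refused:", err)
```

Either safeguard alone would prevent the outage in this scenario; the loopback check additionally covers interface applies that run with missing data for any other reason.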

Changed in starlingx:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix proposed to stx-puppet (f/centos8)

Fix proposed to branch: f/centos8
Review: https://review.opendev.org/762919
