Installation failure if controller-0 unlocked with management on loopback interface

Bug #1830082 reported by Bart Wensley
This bug affects 1 person
Affects     Status         Importance   Assigned to   Milestone
StarlingX   Fix Released   Medium       Teresa Ho

Bug Description

Brief Description
-----------------
When installing a standard (2+2) system with ansible, if controller-0 is unlocked while the management network is assigned to the loopback interface, the installation cannot be completed. No subsequent nodes can be brought up (since there is no management interface to pxeboot from) and the management network cannot be moved to a physical interface (since the controller cannot be locked).

Severity
--------
Major: controller-0 must be re-installed to recover

Steps to Reproduce
------------------
Install controller-0 (standard config) and unlock it before moving the management network to a physical interface.

Expected Behavior
------------------
The solution is probably to allow controller-0 to be locked in this specific case so the management network can be moved to a physical interface.

Actual Behavior
----------------
See above

Reproducibility
---------------
Reproducible

System Configuration
--------------------
Multi-node system

Branch/Pull Time/Commit
-----------------------
SW_VERSION="19.01"
BUILD_TARGET="Unknown"
BUILD_TYPE="Informal"
BUILD_ID="n/a"

JOB="n/a"
BUILD_BY="bwensley"
BUILD_NUMBER="n/a"
BUILD_HOST="yow-bwensley-lx-vm2"
BUILD_DATE="2019-05-21 14:19:25 -0500"

BUILD_DIR="/"
WRS_SRC_DIR="/localdisk/designer/bwensley/starlingx-1/cgcs-root"
WRS_GIT_BRANCH="HEAD"
CGCS_SRC_DIR="/localdisk/designer/bwensley/starlingx-1/cgcs-root/stx"
CGCS_GIT_BRANCH="HEAD"

Last Pass
---------
Never

Timestamp/Logs
--------------
N/A

Test Activity
-------------
Developer Testing

Ghada Khalil (gkhalil) wrote :

Did you follow the instructions to update the mgmt interface?

system host-if-delete controller-0 lo
system host-if-modify -n mgmt -c platform --networks mgmt controller-0 eno2
system host-if-modify -m 9216 -n clusterhst -c platform --networks cluster-host controller-0 ens1f1
system host-if-modify -n oam -c platform --networks oam controller-0 eno1

Changed in starlingx:
status: New → Incomplete
tags: added: stx.config
Bart Wensley (bartwensley) wrote :

I mistakenly unlocked controller-0 without following those instructions (since those instructions do not exist on the installation wiki). Once I did that, it was impossible to recover without a re-install.

Ghada Khalil (gkhalil) wrote :

Given the system didn't recover, marking as release gating. Suggest adding a semantic check to ensure the mgmt and cluster-host networks are assigned to a valid interface.
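
A minimal sketch of the kind of semantic check suggested here, written in Python. The input shape (a list of dicts with `ifname` and `networks` keys) and the function name `find_unlock_blockers` are hypothetical and chosen only for illustration; this is not the sysinv implementation. It also assumes that, right after bootstrap, the required networks sit on an interface named `lo`.

```python
# Hypothetical sketch: not the actual sysinv code.
REQUIRED_NETWORKS = ("mgmt", "cluster-host")


def find_unlock_blockers(interfaces):
    """Return a list of reasons why a host unlock should be rejected.

    `interfaces` is an assumed shape: a list of dicts such as
    {"ifname": "lo", "networks": ["mgmt", "cluster-host"]}.
    An empty result means no blocker was found.
    """
    problems = []
    for network in REQUIRED_NETWORKS:
        # Interfaces that currently carry this network.
        owners = [i for i in interfaces if network in i.get("networks", [])]
        if not owners:
            problems.append(f"{network} network is not assigned to any interface")
        elif any(i["ifname"] == "lo" for i in owners):
            problems.append(f"{network} network is still on the loopback interface")
    return problems
```

An unlock request would then be rejected whenever the returned list is non-empty, with the list contents serving as the error message shown to the user.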

Changed in starlingx:
importance: Undecided → Medium
status: Incomplete → Triaged
tags: added: stx.2.0
Changed in starlingx:
assignee: nobody → Tee Ngo (teewrs)
tags: added: stx.networking
Changed in starlingx:
assignee: Tee Ngo (teewrs) → Teresa Ho (teresaho)
importance: Medium → Low
Tee Ngo (teewrs) wrote :

Like this LP :). In general, we certainly need to reject a host-unlock request if the required controller-0 configuration steps (post Ansible bootstrap) have not been carried out.

I finally got the green light to update the installation wikis today. They have been updated. Prior to that, the instructions had been sent to StarlingX doc team, PV and the community via email on May 9th as wiki update was disallowed at that time.

Matt Peters (mpeters-wrs) wrote :

It is not just validating that the required interfaces are configured. The user must be able to lock controller-0 as long as it is still in a simplex state (/etc/platform/simplex exists). This is required to handle reconfiguration of any host level data, including fixing possible interface configuration issues.
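
To make the condition above concrete, here is a minimal Python sketch in which locking is always permitted while the simplex flag file exists. Only the path `/etc/platform/simplex` comes from this report; the function name `lock_allowed`, its parameters, and the fallback rule for non-simplex systems are assumptions made for illustration.

```python
import os

SIMPLEX_FLAG = "/etc/platform/simplex"  # path named in this bug report


def lock_allowed(other_controller_available, simplex_flag=SIMPLEX_FLAG):
    """Decide whether the active controller may be locked (hypothetical rule).

    While the simplex flag file exists, the lock is always allowed so that
    host-level data (for example interface assignments) can be reconfigured.
    Otherwise this sketch assumes a lock needs another controller to take over.
    """
    if os.path.exists(simplex_flag):
        return True
    return other_controller_available
```

The `simplex_flag` parameter exists only so the check can be exercised against a temporary file instead of the real flag path.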

Ghada Khalil (gkhalil) wrote :

As per discussion with Matt, changing the priority to Medium given the system config error cannot be corrected once the issue occurs.

Changed in starlingx:
importance: Low → Medium
Ghada Khalil (gkhalil) wrote :

As per agreement with the community, moving all unresolved medium priority bugs from stx.2.0 to stx.3.0

tags: added: stx.3.0
removed: stx.2.0
OpenStack Infra (hudson-openstack) wrote : Fix proposed to config (master)

Fix proposed to branch: master
Review: https://review.opendev.org/679837

Changed in starlingx:
status: Triaged → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to config (master)

Reviewed: https://review.opendev.org/679837
Committed: https://git.openstack.org/cgit/starlingx/config/commit/?id=94d335c7e41271241ecb1b03723168f1e17755cb
Submitter: Zuul
Branch: master

commit 94d335c7e41271241ecb1b03723168f1e17755cb
Author: Teresa Ho <email address hidden>
Date: Wed Aug 21 10:27:32 2019 -0400

    Allow locking simplex controller

    After installing a controller, if it is unlocked before reconfiguring
    the required interfaces, the installation cannot be completed.
    This commit to allow locking the controller while it is in simplex
    state. This commit also reject unlocking controller in a duplex system
    if the management and cluster-host interfaces are on the loopback
    interface.

    After making the active controller on a duplex system return to a
    simplex state by locking and removing a standby controller, the address
    of the cluster-host interface was not unallocated properly. This commit
    corrected this problem.

    Closes-Bug: 1830082

    Change-Id: I8cb5d1dfd7d4ba73f40aefe3242ca32d91b4e7e8
    Signed-off-by: Teresa Ho <email address hidden>

Changed in starlingx:
status: In Progress → Fix Released