AIO Simplex controller degraded

Bug #1859859 reported by Cristopher Lemus
This bug affects 1 person
Affects: StarlingX
Status: Invalid
Importance: Critical
Assigned to: Don Penney
Milestone: (none)

Bug Description

Brief Description
-----------------
During the initial setup of a Simplex configuration, the controller remains in the degraded state after the unlock.

Severity
--------
Critical

Steps to Reproduce
------------------
Follow the documentation to complete a Simplex install.

Expected Behavior
------------------
After the unlock, controller-0 exits the degraded state after a few minutes.

Actual Behavior
----------------
After the unlock, controller-0 remains in the degraded state.

Reproducibility
---------------
100% reproducible on baremetal.

System Configuration
--------------------
Simplex, baremetal.

Branch/Pull Time/Commit
-----------------------
20200115T023003Z

Last Pass
---------
Passed on the build from Jan 11.

Timestamp/Logs
--------------
Some outputs: http://paste.openstack.org/show/788430/. The relevant error appears to be a failure of the 'sw-patch-agent' process. The full log is attached to this Launchpad bug (I was not able to log in to https://files.starlingx.kube.cengn.ca/).
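
When triaging a collected log for this kind of failure, it can help to pull out only the ERROR entries for one process. The following is a minimal sketch, not a StarlingX tool: it assumes log lines shaped like `<timestamp>: <process>[<pid>]: <file>(<line>): <LEVEL>: <message>`, which is the shape of the sw-patch-agent entries quoted in this report. The names `LOG_LINE` and `find_errors` are hypothetical.

```python
import re

# Assumed line shape, based on the sw-patch-agent entries quoted in this report:
#   2020-01-15T12:57:56: sw-patch-agent[1053928]: patch_agent.py(493): ERROR: message
LOG_LINE = re.compile(
    r"^(?P<timestamp>\S+): (?P<process>[\w-]+)\[(?P<pid>\d+)\]: "
    r"(?P<source>[\w.]+)\((?P<lineno>\d+)\): (?P<level>[A-Z]+): (?P<message>.*)$"
)


def find_errors(lines, process="sw-patch-agent"):
    """Return (source_file, message) pairs for ERROR entries of one process.

    Lines that do not match the assumed shape (for example traceback
    lines) are skipped.
    """
    errors = []
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if match is None:
            continue
        if match.group("process") == process and match.group("level") == "ERROR":
            errors.append((match.group("source"), match.group("message")))
    return errors
```

Feeding the function an open log file (for example `find_errors(open(path))`, with `path` pointing at the collected log) yields one tuple per ERROR entry and ignores WARNING lines.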

Test Activity
-------------
Sanity

Cristopher Lemus (cjlemusc) wrote :
Bruce Jones (brucej)
Changed in starlingx:
importance: Undecided → Critical
tags: added: stx.4.0
Don Penney (dpenney) wrote :

How are you doing this installation? From the collected logs, looking through the anaconda logs, it looks like you are maybe using the net_smallsystem_ks.cfg file as the basis for a remote installation kickstart. The net_* kickstart files are for installation of nodes from the active controller, not for the initial installation of controller-0. The initial installation kickstart (smallsystem_ks.cfg on the ISO) provides some extra steps for the installation of the first controller, including software repo setup/mirroring.

For installation from a network server, there is a pxeboot_setup.sh utility in the ISO and pxeboot_*.cfg kickstart templates that have the extra steps needed for the installation of the first controller.

The patch-agent is failing because it cannot find the required software groups, an indication that the software repos were not properly mirrored as part of the post-installation setup that is in the kickstarts of the initial controller install:

2020-01-15T12:57:56: sw-patch-agent[1053928]: base.py(415): WARNING: Failed to synchronize cache for repo 'platform-base', ignoring this repo.
2020-01-15T12:57:56: sw-patch-agent[1053928]: patch_agent.py(493): ERROR: Could not find software group: updates-controller-worker
2020-01-15T12:57:56: sw-patch-agent[1053928]: patch_functions.py(68): ERROR: Uncaught exception
Traceback (most recent call last):
  File "/usr/sbin/sw-patch-agent", line 15, in <module>
    main()
  File "/usr/lib64/python2.7/site-packages/cgcs_patch/patch_agent.py", line 870, in main
    pa.query()
  File "/usr/lib64/python2.7/site-packages/cgcs_patch/patch_agent.py", line 495, in query
    for pkg in pkggrp.packages_iter():
AttributeError: 'NoneType' object has no attribute 'packages_iter'
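
The traceback shows the general failure mode: a group lookup returned None, and the caller then invoked a method on it. The following is a minimal sketch of that pattern and of a guard against it. It is not the cgcs_patch code; `PackageGroup`, `find_group`, and `list_group_packages` are hypothetical stand-ins, and only the method name `packages_iter` is taken from the traceback.

```python
class PackageGroup:
    """Hypothetical stand-in for a software group object."""

    def __init__(self, name, packages):
        self.name = name
        self.packages = list(packages)

    def packages_iter(self):
        return iter(self.packages)


def find_group(groups, name):
    """Return the group called `name`, or None when the repo data lacks it."""
    return groups.get(name)


def list_group_packages(groups, name):
    """List the packages of a group, failing clearly if the group is missing."""
    pkggrp = find_group(groups, name)
    if pkggrp is None:
        # Without this check, pkggrp.packages_iter() raises
        # AttributeError: 'NoneType' object has no attribute 'packages_iter'
        raise LookupError("Could not find software group: %s" % name)
    return list(pkggrp.packages_iter())
```

A guard like this only turns the crash into a clearer error. It does not address the underlying cause described here, which is that the software repos were not mirrored during installation.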

You can see these additional pieces for the first controller installation in the following kickstart sections:
https://opendev.org/starlingx/metal/src/branch/master/bsp-files/kickstarts/post_usb_controller.cfg
https://opendev.org/starlingx/metal/src/branch/master/bsp-files/kickstarts/post_pxeboot_controller.cfg

I downloaded the 20200115T023003Z ISO and verified installation works as expected.

description: updated
Cristopher Lemus (cjlemusc) wrote :

Hi Don,

Thanks for the details, we were able to fix simplex install.

Based on the information that you provided, the missing packages cause the failure. Our deployment system uses network installs, so we updated it to copy the required files onto the simplex controller (following the same logic that is implemented in pxeboot_setup.sh).

The copy of these packages was already implemented for all other configurations, and it's a new requirement for simplex: we have been doing network installs since 2.0 and had not faced this issue on simplex.

Thanks a lot for the help. I assume that this bug should be marked as invalid.

Don Penney (dpenney) wrote :

This is not a new requirement for simplex - the kickstarts for installing the initial controller have always had these additional postinstall steps built in. If you are using the net_*.cfg files for initial installation of controller-0, whether it's standard or AIO, it's an invalid method. Having a manual step post-install to account for this workaround is not appropriate, because we could easily end up adding additional steps that are only for this initial controller installation and then your install method is broken again. If your intent is to use a network boot for the install, then I would highly recommend that you look at using pxeboot_setup.sh for setting up the boot source. You need to install using the correct kickstart files.
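
A deployment system could enforce the naming rule from this thread before starting an install. The following is a minimal sketch under that single assumption: kickstart files whose names start with `net_` are meant for nodes installed from the active controller and must not be used for the initial controller-0 install. The function names are hypothetical, and the check does not verify that the chosen kickstart actually contains the first-controller post-install steps.

```python
import os


def is_net_kickstart(path):
    """True for net_* kickstarts, which install nodes from the active controller."""
    return os.path.basename(path).startswith("net_")


def check_initial_controller_kickstart(path):
    """Reject net_* kickstarts for the initial controller-0 installation.

    Returns the path unchanged when it passes the naming check.
    """
    if is_net_kickstart(path):
        raise ValueError(
            "%s is a net_* kickstart; it is not valid for the initial "
            "installation of controller-0" % os.path.basename(path)
        )
    return path
```

For example, `check_initial_controller_kickstart("net_smallsystem_ks.cfg")` raises ValueError, while `smallsystem_ks.cfg` passes the check.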

Ghada Khalil (gkhalil)
Changed in starlingx:
assignee: nobody → Don Penney (dpenney)
Ghada Khalil (gkhalil) wrote :

@Don, I believe you can mark the LP as Invalid given it's a procedural issue.

tags: added: stx.config
Don Penney (dpenney)
Changed in starlingx:
status: New → Invalid