cert-manager fails to apply after controller-0 is rebooted

Bug #2003198 reported by Dan Voiculeasa
This bug affects 1 person
Affects    Status        Importance  Assigned to     Milestone
StarlingX  Fix Released  Medium      Dan Voiculeasa

Bug Description

Brief Description
-----------------
Scenario for this issue (STX.5.0 release):
On an AIO-DX system, reboot the standby controller. When it comes back up, it becomes active (the trigger for this is unknown). Sysinv starts, sees that alarms requesting an app reapply are present, and starts applying the apps. About 20 seconds after startup, while cert-manager is still applying, sysinv is killed or restarted (the trigger for this is also unknown). On the next sysinv start, the status of 'cert-manager' is reset to 'apply-failed' instead of 'uploaded'. Either status ('apply-failed' or 'uploaded') takes the app out of the auto-managed state and requires manual intervention.

Severity
--------
Major: requires manual intervention to reapply the app

Steps to Reproduce
------------------
See the brief description: it describes the scenario and notes two unknowns in how to reproduce it.
The step would be 'sudo reboot' on the standby controller, but it depends on conditions I don't know how to trigger.
For our purposes, the scenario can be emulated by raising app reapply alarms and then restarting the sysinv conductor.
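
The emulation (a pending reapply alarm followed by a conductor restart in the middle of the apply) can be pictured with a small in-memory model. This is only a sketch: `ToyConductor`, `emulate` and the `reset_to` parameter are hypothetical names, not sysinv APIs, and the rule that an automatic reapply only starts from 'applied' is an assumption drawn from the description in this report.

```python
# Toy model only: every name here is hypothetical, not a sysinv API.
IN_PROGRESS = "applying"


class ToyConductor:
    """Minimal stand-in for a conductor that tracks one app's status."""

    def __init__(self, app_status, reapply_alarm, reset_to):
        self.app_status = app_status        # e.g. 'applied'
        self.reapply_alarm = reapply_alarm  # True while a reapply alarm is raised
        self.reset_to = reset_to            # status given to an interrupted apply

    def start(self):
        # On startup, an operation left in progress is reset first.
        if self.app_status == IN_PROGRESS:
            self.app_status = self.reset_to
        # Assumption: an automatic reapply only starts from 'applied'.
        if self.reapply_alarm and self.app_status == "applied":
            self.app_status = IN_PROGRESS

    def finish_apply(self):
        # Completes the apply only if one is actually in progress.
        if self.app_status == IN_PROGRESS:
            self.app_status = "applied"
            self.reapply_alarm = False


def emulate(reset_to):
    """Replay the reported sequence and return the final app status."""
    conductor = ToyConductor("applied", reapply_alarm=True, reset_to=reset_to)
    conductor.start()         # first start: the reapply begins
    # The conductor is killed here, before finish_apply() can run.
    conductor.start()         # second start: the stuck operation is reset
    conductor.finish_apply()  # nothing is in progress, so this is a no-op
    return conductor.app_status
```

In this model both `emulate("apply-failed")` and `emulate("uploaded")` leave the app in the reset status with the alarm still raised, which mirrors the observation that either status needs manual intervention.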

Expected Behavior
------------------
First: the status of 'cert-manager' should be reset to 'uploaded', not 'apply-failed'.
Second: we may want to keep the apps auto-managed. For example, 'cert-manager' should not need manual intervention to get out of the 'uploaded' state. Currently some apps are limited in this respect because they have the auto-apply feature disabled.

Actual Behavior
----------------
cert-manager has status 'apply-failed'

Reproducibility
---------------
Seen once

System Configuration
--------------------
AIO-DX, but can affect any multi-node.

Branch/Pull Time/Commit
-----------------------
STX.5.0

Last Pass
---------
N/A

Timestamp/Logs
--------------

Test Activity
-------------
Production

Workaround
----------
Manually apply cert-manager

Changed in starlingx:
assignee: nobody → Dan Voiculeasa (dvoicule)
Dan Voiculeasa (dvoicule) wrote :

The auto-apply feature for cert-manager and nginx-ingress-controller is probably disabled because these apps are applied manually during bootstrap. We could enable auto-apply for them by checking that bootstrap is not in progress, along with other possible conditions (the framework already accounts for restore and upgrades).
platform-integ-apps has the auto-apply feature enabled.
I don't know about other apps and will assume they don't have it enabled.

OpenStack Infra (hudson-openstack) wrote : Fix proposed to config (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/config/+/870960

Changed in starlingx:
status: New → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to cert-manager-armada-app (master)

Reviewed: https://review.opendev.org/c/starlingx/cert-manager-armada-app/+/871081
Committed: https://opendev.org/starlingx/cert-manager-armada-app/commit/0a65cd0d90370c11ff7f7cf334c5f42fd9d7d3dd
Submitter: "Zuul (22348)"
Branch: master

commit 0a65cd0d90370c11ff7f7cf334c5f42fd9d7d3dd
Author: Dan Voiculeasa <email address hidden>
Date: Thu Jan 19 11:07:44 2023 +0200

    Add lifecycle hook and configure one semantic check

    If sysinv-conductor is restarted while cert-manager is running, its
    status will be changed from 'applying' to 'uploaded'[1]. 'applying'
    state can be obtained from an 'applied' state by framework
    automatically re-applying the app. In fact it is desired to keep
    the 'applied' state without manual intervention.

    Purpose of this commit is to allow this app to be auto applied from
    'uploaded' state.

    This is just extending the base lifecycle from conductor, configuring
    the semantic check for auto apply.

    Tests on AIO-SX:
    PASS: bootstrap
          cert-manager is in 'applied' state
    PASS: deploy, obtain unlocked enabled available
          cert-manager is in 'applied' state
    PASS: after unlock force uploaded state and observe app auto-applied
    PASS: system application-remove is rejected
    PASS: system application-remove --force is allowed
    PASS: emulate bootstrap by creating /var/run/.ansible_bootstrap,
          and observe app is not auto-applied from uploaded state

    [1]: https://review.opendev.org/c/starlingx/config/+/870960
    Closes-Bug: 2003198
    Signed-off-by: Dan Voiculeasa <email address hidden>
    Change-Id: I37a9c740f0b7906c9f1fb1a0be9bf0b117a6df03
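
The auto-apply semantic check described in this commit can be sketched as below. This is not the code from the review: `may_auto_apply` and its parameters are hypothetical, and only the flag path `/var/run/.ansible_bootstrap` and the 'uploaded' status come from the commit message. The restore and upgrade conditions mentioned earlier in this report are left out.

```python
import os

# Path taken from the commit's test notes; everything else is hypothetical.
BOOTSTRAP_FLAG = "/var/run/.ansible_bootstrap"


def may_auto_apply(status, bootstrap_flag=BOOTSTRAP_FLAG):
    """Return True if an app in `status` may be applied without an operator."""
    # Only an app sitting in 'uploaded' is a candidate for auto-apply.
    if status != "uploaded":
        return False
    # While the bootstrap flag file exists, do not auto-apply.
    return not os.path.exists(bootstrap_flag)
```

Passing the flag path as a parameter keeps the check easy to exercise against a temporary file, in the same spirit as the commit's test that creates the flag by hand.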

Changed in starlingx:
status: In Progress → Fix Released
Ghada Khalil (gkhalil)
Changed in starlingx:
importance: Undecided → Medium
tags: added: stx.8.0 stx.apps
OpenStack Infra (hudson-openstack) wrote : Fix merged to config (master)

Reviewed: https://review.opendev.org/c/starlingx/config/+/870960
Committed: https://opendev.org/starlingx/config/commit/500fbaa133b4a1848251b20550dc1971a7e6b3af
Submitter: "Zuul (22348)"
Branch: master

commit 500fbaa133b4a1848251b20550dc1971a7e6b3af
Author: Dan Voiculeasa <email address hidden>
Date: Wed Jan 11 17:29:12 2023 +0200

    AppFwk: Load metadata before clearing stuck application

    Restarting sysinv while an application is applying will result in a
    wrong reset status. For example cert-manager status is reset to
    'apply-failed' instead of 'uploaded'.

    When sysinv is restarted, app operations that are in progress are
    reset. When apps were decoupled from sysinv [1], a requirement to
    have the app metadata loaded was introduced.

    Tests on AIO-SX:
    PASS: deploy, unlocked enabled available
    PASS: forced 'cert-manager' to be 'applying', forced sysinv conductor
    restart, observed status was reset to 'uploaded'.

    [1]: https://review.opendev.org/c/starlingx/config/+/774292/10/sysinv/sysinv/sysinv/sysinv/conductor/kube_app.py#333
    Partial-Bug: 2003198
    Signed-off-by: Dan Voiculeasa <email address hidden>
    Change-Id: Ibefc6362c7a7f03571be3cf35b6592cf0c68bca3
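
Why the ordering matters can be shown with a toy model. The rule inside `reset_status` (only apps that the metadata marks as platform-managed fall back to 'uploaded') is an assumption made for illustration, since the report does not show the actual logic. All function and parameter names are hypothetical.

```python
# Toy model: names are hypothetical and the reset rule is an assumption.

def reset_status(app_name, status, platform_managed):
    """Pick the status for an operation interrupted by a restart."""
    if status != "applying":
        return status
    # Assumed rule: only apps known to be platform-managed go back to 'uploaded'.
    return "uploaded" if app_name in platform_managed else "apply-failed"


def startup(apps, load_metadata, metadata_first):
    """Reset stuck apps, loading metadata either before or after the reset."""
    platform_managed = set()
    if metadata_first:
        platform_managed = load_metadata()
    result = {
        name: reset_status(name, status, platform_managed)
        for name, status in apps.items()
    }
    if not metadata_first:
        # Loaded too late to influence the reset above.
        platform_managed = load_metadata()
    return result
```

With `apps = {"cert-manager": "applying"}` and a loader that returns `{"cert-manager"}`, `metadata_first=False` yields 'apply-failed' while `metadata_first=True` yields 'uploaded', which matches the before and after behaviour described in the commit message.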
