Comment 1 for bug 1997368

OpenStack Infra (hudson-openstack) wrote : Fix merged to config (master)

Submitter: "Zuul (22348)"
Branch: master

commit 00c2129a16d94669ce5ba2f9320ed3e27784a788
Author: Dan Voiculeasa <email address hidden>
Date: Mon Nov 21 14:08:12 2022 +0200

    AppFwk: Recover apply from helm operation in progress

    When a helm release is in a pending state, FluxCD cannot start another
    helm release. FluxCD does not attempt to apply the newer helm release;
    it just reports an error.

    This prevents us from applying a new helm release over, for example, a
    release whose pods are stuck in Pending state.

    When the specific message for helm operation in progress is detected,
    attempt to recover by moving the older releases to failed state.
    Move inspired by [1].
    To do so, patch the helm secret for the specific release.
    As an optimization, trigger the FluxCD HelmRelease reconciliation right
    after patching the secret.

    A possible future optimization is to run an audit that deletes the helm
    releases whose metadata status is a pending operation but whose release
    data is failed (the resources patched in this commit).

    Refactor the HelmRelease resource reconciliation trigger to reduce its
    size.

    There are upstream references related to this bug, see [2] and [3].

    Tests on Debian AIO-SX:
    PASS: unlocked enabled available
    PASS: platform-integ-apps applied
    after reproducing error:
    PASS: inspect sysinv logs, see recovery is attempted
    PASS: inspect fluxcd logs, see that HelmRelease reconciliation is
    triggered as part of recovery

    Closes-Bug: 1997368
    Signed-off-by: Dan Voiculeasa <email address hidden>
    Change-Id: I36116ce8d298cc97194062b75db64541661ce84d
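
For readers unfamiliar with the recovery step described in the commit message, the following is a minimal sketch of what "moving a release to failed state by patching the helm secret" can look like. It is not the code from the merged change. It assumes the Helm 3 secret storage layout, where the `release` value is JSON that Helm gzips and base64-encodes and Kubernetes base64-encodes once more. The function names are hypothetical, and the `reconcile.fluxcd.io/requestedAt` annotation is the one FluxCD uses for on-demand reconciliation; whether sysinv builds its patch this way is an assumption.

```python
import base64
import gzip
import json
from datetime import datetime, timezone

# Helm 3 release statuses that indicate an operation still in progress.
PENDING_STATUSES = {"pending-install", "pending-upgrade", "pending-rollback"}


def decode_release(secret_value: str) -> dict:
    """Decode the `release` value of a Helm secret as returned by the Kubernetes API."""
    helm_blob = base64.b64decode(secret_value)  # Kubernetes Secret encoding
    compressed = base64.b64decode(helm_blob)    # Helm's own base64 layer
    return json.loads(gzip.decompress(compressed))


def encode_release(release: dict) -> str:
    """Inverse of decode_release: JSON -> gzip -> base64 -> base64."""
    compressed = gzip.compress(json.dumps(release).encode("utf-8"))
    helm_blob = base64.b64encode(compressed)
    return base64.b64encode(helm_blob).decode("ascii")


def mark_failed_if_pending(secret_value: str) -> str:
    """Return a new `release` value with status 'failed' if the release was pending.

    Non-pending releases are returned unchanged.
    """
    release = decode_release(secret_value)
    if release.get("info", {}).get("status") in PENDING_STATUSES:
        release["info"]["status"] = "failed"
        return encode_release(release)
    return secret_value


def reconcile_request_patch(now=None) -> dict:
    """Build a merge-patch body that asks FluxCD to reconcile a HelmRelease now."""
    now = now or datetime.now(timezone.utc)
    return {
        "metadata": {
            "annotations": {"reconcile.fluxcd.io/requestedAt": now.isoformat()}
        }
    }
```

The value returned by `mark_failed_if_pending` would be written back to the `data.release` field of the release secret, and the dictionary from `reconcile_request_patch` would be applied to the HelmRelease resource. Both API calls are left out here because they depend on the Kubernetes client in use.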