FluxCD: Application-apply rejected with another operation (install/upgrade/rollback) is in progress

Bug #1997368 reported by Dan Voiculeasa
This bug affects 1 person
Affects    Status        Importance  Assigned to     Milestone
StarlingX  Fix Released  Medium      Dan Voiculeasa

Bug Description

Brief Description
-----------------
When applying apps, the operation is sometimes rejected with the error 'Failed during apply :Helm upgrade failed: another operation (install/upgrade/rollback) is in progress'.

Investigation:
--------------
Why Helm gets into a bad state is still a mystery, but I was able to force Helm into a bad state manually:
- Taint the node so that no pods can be scheduled.
- Run a helm upgrade command and press CTRL+C after a few seconds.
- Change a Helm chart override and apply the new FluxCD configuration using kubectl.
- Wait for, or trigger, the Flux reconciliation of the HelmRelease resource.
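
For the last step, one way to trigger the reconciliation instead of waiting is to set the reconcile.fluxcd.io/requestedAt annotation on the HelmRelease. The following is a minimal sketch, not the code used in sysinv: the helper names build_reconcile_patch and kubectl_patch_command are hypothetical, and the command is only built, not executed.

```python
import json
from datetime import datetime, timezone

# Annotation that Flux controllers watch to start an out-of-schedule reconciliation.
RECONCILE_ANNOTATION = "reconcile.fluxcd.io/requestedAt"


def build_reconcile_patch(now=None):
    """Return a merge-patch body that requests a HelmRelease reconciliation."""
    now = now or datetime.now(timezone.utc)
    # Any value that differs from the previous one is treated as a new request.
    return {"metadata": {"annotations": {RECONCILE_ANNOTATION: now.isoformat()}}}


def kubectl_patch_command(name, namespace, patch):
    """Build (but do not run) the kubectl command that applies the patch."""
    return [
        "kubectl", "patch", "helmrelease", name,
        "-n", namespace,
        "--type", "merge",
        "-p", json.dumps(patch),
    ]
```

For example, kubectl_patch_command("rbd-provisioner", "kube-system", build_reconcile_patch()) returns an argument list that can be passed to subprocess.run.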

At this point FluxCD reports an error similar to this:
2022-11-22T10:59:35.921Z DEBUG events Normal {"object": {"kind":"HelmRelease","namespace":"kube-system","name":"rbd-provisioner","uid":"ef28e8ed-206b-41bb-adc4-f16ff7894fff","apiVersion":"helm.toolkit.fluxcd.io/v2beta1","resourceVersion":"227057"}, "reason": "error", "message": "reconciliation failed: Helm upgrade failed: another operation (install/upgrade/rollback) is in progress"}
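
A log line like the one above can be checked for this condition programmatically. The sketch below assumes the event payload is the JSON object that starts at the first '{' in the line, and falls back to a plain substring match for lines without JSON (such as the sysinv log). The function names are hypothetical and do not come from sysinv or FluxCD.

```python
import json

# Message fragment taken verbatim from the Helm error quoted in this report.
IN_PROGRESS_MARKER = "another operation (install/upgrade/rollback) is in progress"


def parse_event_payload(line):
    """Return the JSON object embedded in a log line, or None if there is none."""
    start = line.find("{")
    if start == -1:
        return None
    try:
        return json.loads(line[start:])
    except json.JSONDecodeError:
        return None


def is_operation_in_progress(line):
    """True if the log line reports a Helm operation already in progress."""
    payload = parse_event_payload(line)
    # Prefer the structured "message" field; otherwise search the raw line.
    message = payload.get("message", "") if isinstance(payload, dict) else line
    return IN_PROGRESS_MARKER in message
```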

Running 'system application-update <app_with_release_in_bad_state>' now rejects the app apply.

Recovering from this requires manually deleting or rolling back the Helm releases, or removing the app itself to remove the bad Helm releases.

Severity
--------
Critical: apps require manual intervention to recover.

Steps to Reproduce
------------------
See the Investigation section above.

Expected Behavior
-----------------
See the Investigation section above.

Actual Behavior
---------------
See the Investigation section above.

Reproducibility
---------------
Intermittent: the original issue has been seen only a few times over a span of months.
100% when Helm is forced into a bad state.

System Configuration
--------------------
any

Branch/Pull Time/Commit
-----------------------
Assumed to be present since the first day of FluxCD support (taken as March 2022), up to 22 Nov 2022.

Last Pass
---------
???

Timestamp/Logs
--------------
sysinv 2022-10-27 14:28:07.347 532086 ERROR sysinv.conductor.kube_app [-] Application stx-openstack: release glance: Failed during apply :Helm upgrade failed: another operation (install/upgrade/rollback) is in progress

Last Helm logs:

preparing upgrade for osh-openstack-glance

Test Activity
-------------
Developer Testing

Workaround
----------
1) Restart the conductor, remove the app, then apply the app.
OR
2) Wait for the timeout or force the app into apply-failed, then remove the app and apply the app.

Changed in starlingx:
assignee: nobody → Dan Voiculeasa (dvoicule)
Changed in starlingx:
status: New → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to config (master)

Reviewed: https://review.opendev.org/c/starlingx/config/+/865138
Committed: https://opendev.org/starlingx/config/commit/00c2129a16d94669ce5ba2f9320ed3e27784a788
Submitter: "Zuul (22348)"
Branch: master

commit 00c2129a16d94669ce5ba2f9320ed3e27784a788
Author: Dan Voiculeasa <email address hidden>
Date: Mon Nov 21 14:08:12 2022 +0200

    AppFwk: Recover apply from helm operation in progress

    It is observed that when a helm release is in pending state, another
    helm release can't be started by FluxCD. FluxCD will not try to
    do steps to apply the newer helm release, but will just error.

    This prevents us from applying a new helm release over a release with
    pods stuck in Pending state (just an example).

    When the specific message for helm operation in progress is detected,
    attempt to recover by moving the older releases to failed state.
    Move inspired by [1].
    To do so, patch the helm secret for the specific release.
    As an optimization, trigger the FluxCD HelmRelease reconciliation right
    after.
    One future optimization we can do is run an audit to delete the helm
    releases for which metadata status is a pending operation, but release
    data is failed (resource that we patched in this commit).

    Refactor HelmRelease resource reconciliation trigger, smaller size.

    There are upstream references related to this bug, see [2] and [3].

    Tests on Debian AIO-SX:
    PASS: unlocked enabled available
    PASS: platform-integ-apps applied
    after reproducing error:
    PASS: inspect sysinv logs, see recovery is attempted
    PASS: inspect fluxcd logs, see that HelmRelease reconciliation is
    triggered part of recovery

    [1]: https://github.com/porter-dev/porter/pull/1685/files
    [2]: https://github.com/helm/helm/issues/8987
    [3]: https://github.com/helm/helm/issues/4558
    Closes-Bug: 1997368
    Signed-off-by: Dan Voiculeasa <email address hidden>
    Change-Id: I36116ce8d298cc97194062b75db64541661ce84d
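
The recovery idea in the commit message (patching the Helm secret so that a pending release is moved to the failed state) can be illustrated with a short sketch. This is not the code merged in the commit. It assumes the Helm 3 secret storage format, where the secret's release field holds base64-encoded, gzip-compressed JSON that the Kubernetes API base64-encodes once more. The function names and the PENDING_STATUSES set are introduced here for illustration only.

```python
import base64
import gzip
import json

# Helm release statuses that indicate an unfinished operation (assumed set).
PENDING_STATUSES = {"pending-install", "pending-upgrade", "pending-rollback"}


def decode_release(secret_value):
    """Decode the `release` field of a Helm 3 secret as returned by the API."""
    helm_encoded = base64.b64decode(secret_value)  # undo the Kubernetes layer
    compressed = base64.b64decode(helm_encoded)    # undo the Helm layer
    return json.loads(gzip.decompress(compressed))


def encode_release(release):
    """Inverse of decode_release: JSON -> gzip -> base64 -> base64."""
    compressed = gzip.compress(json.dumps(release).encode("utf-8"))
    helm_encoded = base64.b64encode(compressed)
    return base64.b64encode(helm_encoded).decode("ascii")


def mark_failed_if_pending(secret_value):
    """Return a re-encoded value with status 'failed', or None if not pending."""
    release = decode_release(secret_value)
    info = release.setdefault("info", {})
    if info.get("status") not in PENDING_STATUSES:
        return None  # nothing to recover for this revision
    info["status"] = "failed"
    return encode_release(release)
```

The returned string would be written back to the data.release field of the release's secret. This sketch leaves the secret's labels untouched, which matches the commit message's note that the metadata status can remain pending while the release data is failed.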

Changed in starlingx:
status: In Progress → Fix Released
Ghada Khalil (gkhalil)
Changed in starlingx:
importance: Undecided → Medium
tags: added: stx.8.0 stx.apps