AppFmwk: FluxCD to FluxCD application update doesn't clean up chart releases that are no longer managed from the previous version

Bug #2019138 reported by Bob Church
This bug affects 1 person
Affects    Status        Importance  Assigned to  Milestone
StarlingX  Fix Released  High        Bob Church

Bug Description

Brief Description
-----------------
While testing application updates, I found that if the new version of the app stops managing a helm chart, the released chart version from the previous version is not cleaned up.

Severity
--------
Major: System/Feature is usable but degraded

Steps to Reproduce
------------------
Create an application update by removing one of the charts from fluxcd-manifests/kustomization.yaml so that one fewer helm release is expected to be deployed. (I used platform-integ-apps and removed the ceph-pools-audit chart.)
With the new application tarball, perform an application-update.

Expected Behavior
-----------------
After application-update, only the helm releases that are expected to be deployed appear in the output of the helm ls -a -A command. Any helm release associated only with the old application version is removed.
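
The expected cleanup amounts to a set difference between the releases of the two application versions. A minimal sketch, assuming the release names of each version are available as plain lists; the function name is hypothetical and, apart from stx-ceph-pools-audit, the release names are illustrative only:

```python
def releases_to_remove(old_releases, new_releases):
    """Return releases deployed by the old version that the new version no longer manages."""
    return sorted(set(old_releases) - set(new_releases))


# Illustrative release lists; only stx-ceph-pools-audit comes from this report.
old_version = ["stx-ceph-pools-audit", "stx-cephfs-provisioner", "stx-rbd-provisioner"]
new_version = ["stx-cephfs-provisioner", "stx-rbd-provisioner"]

stale = releases_to_remove(old_version, new_version)
print(stale)
```

After a successful update, none of the names returned here should still be listed by helm ls -a -A.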

Actual Behavior
---------------
All helm releases associated only with the old application version are still present.

Reproducibility
---------------
100% reproducible

System Configuration
--------------------
Any (but observed on an AIO-DX)

Last Pass
---------
Previously worked with Armada based applications

Timestamp/Logs
--------------
The error produced is:
sysinv 2022-11-07 18:59:44.750 80063 ERROR sysinv.conductor.kube_app [-] Helm operation failure: Failed to delete release: Helm operation failure: Failed to delete release: Error: release: "stx-ceph-pools-audit" not found
command terminated with exit code 1
: sysinv.common.exception.HelmTillerFailure: Helm operation failure: Failed to delete release: Helm operation failure: Failed to delete release: Error: release: "stx-ceph-pools-audit" not found
command terminated with exit code 1
2022-11-07 18:59:44.750 80063 ERROR sysinv.conductor.kube_app Traceback (most recent call last):
2022-11-07 18:59:44.750 80063 ERROR sysinv.conductor.kube_app File "/usr/lib/python3/dist-packages/sysinv/helm/utils.py", line 200, in delete_helm_release
2022-11-07 18:59:44.750 80063 ERROR sysinv.conductor.kube_app raise exception.HelmTillerFailure(
2022-11-07 18:59:44.750 80063 ERROR sysinv.conductor.kube_app sysinv.common.exception.HelmTillerFailure: Helm operation failure: Failed to delete release: Error: release: "stx-ceph-pools-audit" not found
2022-11-07 18:59:44.750 80063 ERROR sysinv.conductor.kube_app command terminated with exit code 1
2022-11-07 18:59:44.750 80063 ERROR sysinv.conductor.kube_app
2022-11-07 18:59:44.750 80063 ERROR sysinv.conductor.kube_app
2022-11-07 18:59:44.750 80063 ERROR sysinv.conductor.kube_app During handling of the above exception, another exception occurred:
2022-11-07 18:59:44.750 80063 ERROR sysinv.conductor.kube_app
2022-11-07 18:59:44.750 80063 ERROR sysinv.conductor.kube_app Traceback (most recent call last):
2022-11-07 18:59:44.750 80063 ERROR sysinv.conductor.kube_app File "/usr/lib/python3/dist-packages/sysinv/conductor/kube_app.py", line 3014, in perform_app_update
2022-11-07 18:59:44.750 80063 ERROR sysinv.conductor.kube_app helm_utils.delete_helm_release(from_chart.release)
2022-11-07 18:59:44.750 80063 ERROR sysinv.conductor.kube_app File "/usr/lib/python3/dist-packages/sysinv/helm/utils.py", line 209, in delete_helm_release
2022-11-07 18:59:44.750 80063 ERROR sysinv.conductor.kube_app raise exception.HelmTillerFailure(
2022-11-07 18:59:44.750 80063 ERROR sysinv.conductor.kube_app sysinv.common.exception.HelmTillerFailure: Helm operation failure: Failed to delete release: Helm operation failure: Failed to delete release: Error: release: "stx-ceph-pools-audit" not found
2022-11-07 18:59:44.750 80063 ERROR sysinv.conductor.kube_app command terminated with exit code 1
2022-11-07 18:59:44.750 80063 ERROR sysinv.conductor.kube_app
2022-11-07 18:59:44.750 80063 ERROR sysinv.conductor.kube_app

Alarms
------
N/A

Test Activity
-------------
Developer Testing

Workaround
----------
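Use the helm delete command to clean up the old helm releases. Use kubectl to delete the corresponding helmrelease resources (the objects defined by the FluxCD HelmRelease CRD, not the CRD itself).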
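
The workaround can be scripted roughly as follows. This is a sketch, not the framework's own code: the helper names are hypothetical, and the HelmRelease object name (ceph-pools-audit) and namespace (kube-system) are assumptions that should be checked on the target system first. It defaults to a dry run that only prints the commands.

```python
import shlex
import subprocess


def build_cleanup_commands(release, helmrelease, namespace):
    """Return the manual cleanup commands as argument lists, in the workaround's order."""
    return [
        # Helm v3: "helm uninstall" ("helm delete" is an alias for it).
        ["helm", "uninstall", release, "--namespace", namespace],
        # Delete the HelmRelease object, not the CRD definition itself.
        ["kubectl", "delete", "helmrelease", helmrelease,
         "--namespace", namespace, "--ignore-not-found"],
    ]


def run_cleanup(commands, dry_run=True):
    """Print the commands (dry run) or execute them, stopping on the first failure."""
    for cmd in commands:
        if dry_run:
            print(shlex.join(cmd))
        else:
            subprocess.run(cmd, check=True)


commands = build_cleanup_commands(
    release="stx-ceph-pools-audit",  # release name taken from the log above
    helmrelease="ceph-pools-audit",  # ASSUMED HelmRelease object name
    namespace="kube-system",         # ASSUMED namespace
)
run_cleanup(commands)  # dry run: only prints the commands
```

Pass dry_run=False only after confirming the names with helm ls -a -A and kubectl get helmrelease -A.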

OpenStack Infra (hudson-openstack) wrote : Fix proposed to config (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/config/+/882947

Changed in starlingx:
status: New → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to config (master)

Reviewed: https://review.opendev.org/c/starlingx/config/+/882947
Committed: https://opendev.org/starlingx/config/commit/2785f64e54c425f103600e9a91559555cf748a03
Submitter: "Zuul (22348)"
Branch: master

commit 2785f64e54c425f103600e9a91559555cf748a03
Author: Robert Church <email address hidden>
Date: Wed May 10 10:15:08 2023 -0500

    AppFrmwk: Cleanup unique helm releases over update

    When updating an application some helm releases are unique to a specific
    application version. This requires that when an application
    successfully or unsuccessfully updates, specific helm releases must be
    removed by the framework as they will not be managed by the new (or old)
    version of the application that is being applied during update (or
    recovery).

    Changes include:
     - When helm releases are cleaned up via delete_helm_release() also
       remove the FluxCD helmrelease CRD so that the helm controller will
       not re-deploy the helm release.
     - Refactor calls to delete_helm_v3_release() to delete_helm_release()
       as helm v2 is no longer supported, so differentiation is irrelevant.
     - Refactor retrieve_helm_releases() by removing the wrapper function
       and renaming retrieve_helm_v3_releases().
     - Refactor HelmTillerFailure exception to HelmFailure. Tiller is no
       longer present in the system as helm v3 is tillerless and the Armada
       pod containing the Tiller container is no longer supported.
     - Fix issue that when an application does not specify any images in any
       chart values.yaml an exception is thrown when applying the
       application due to a null dict being written to the application
       images file.

    Test Plan:
    PASS - Build, install, deploy AIO-SX
    PASS - Build custom platform-integ-apps without the ceph audit chart.
           Perform application update and confirm that the unique helm
           release from the previous application version is properly
           cleaned up.

    Closes-Bug: #2019138
    Signed-off-by: Robert Church <email address hidden>
    Change-Id: I3a14f8f6b990351f8415a3fe3ce0b9637672dbcb
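
The last change listed in the commit message (the null dict written to the application images file) can be illustrated with a small guard. This is a sketch of the general technique only, not the code from the commit: the function names and the shape of the input are hypothetical, and JSON stands in for whatever format the images file actually uses.

```python
import json


def collect_images(chart_values):
    """Gather the 'images' section of each chart's values; always return a dict."""
    images = {}
    for chart, values in chart_values.items():
        # A chart may have empty or missing values; treat that as "no images".
        chart_images = (values or {}).get("images")
        if chart_images:
            images[chart] = chart_images
    return images


def dump_images(chart_values):
    """Serialize the collected images; an app without images yields an empty mapping."""
    return json.dumps(collect_images(chart_values), sort_keys=True)


# Hypothetical application whose charts declare no images at all.
no_images = {"chart-a": {}, "chart-b": None}
print(dump_images(no_images))
```

The point of the guard is that an empty mapping, never a null value, reaches the serializer, so later readers of the file can treat its contents as a dict without special-casing.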

Changed in starlingx:
status: In Progress → Fix Released