Apps take a long time to apply and the progress status remains at 0%

Bug #1995748 reported by Ghada Khalil
This bug affects 1 person
Affects: StarlingX
Status: Fix Released
Importance: Low
Assigned to: Dan Voiculeasa
Milestone: (none listed)

Bug Description

Brief Description
-----------------
This is a follow-up to LP: https://bugs.launchpad.net/starlingx/+bug/1994151

A specific solution was implemented for auditd to reduce the repository update interval and allow the app to apply in a timely fashion: https://review.opendev.org/c/starlingx/audit-armada-app/+/862504

However, the issue also applies to other applications in the kube-system namespace, which hosts multiple apps.

As discussed w/ Bob Church in the review above, this needs further investigation to find a more generic solution.

Severity
--------
Minor - with the above fix, auditd is applying properly. This is tracking a better solution.

Steps to Reproduce
------------------
- Revert the review above
- system apply auditd

Expected Behavior
------------------
- apply completes in a timely fashion, on the order of minutes

Actual Behavior
----------------
- apply takes more than an hour to complete

Reproducibility
---------------
Intermittent. Not seen on every apply; seems to be a timing issue related to the helm repository update interval

System Configuration
--------------------
Seen on a DX system, but unsure if that's related

Branch/Pull Time/Commit
-----------------------
reported in Debian build: 2022-09-01_18-00-06

Last Pass
---------
Intermittent

Timestamp/Logs
--------------
See https://bugs.launchpad.net/starlingx/+bug/1994151

Test Activity
-------------
Regression Testing

Workaround
----------
None

Ghada Khalil (gkhalil)
tags: added: stx.apps
description: updated
description: updated
Ghada Khalil (gkhalil)
Changed in starlingx:
importance: Undecided → Low
Ghada Khalil (gkhalil)
description: updated
Changed in starlingx:
status: New → In Progress
Changed in starlingx:
assignee: nobody → Dan Voiculeasa (dvoicule)
OpenStack Infra (hudson-openstack) wrote : Fix merged to config (master)

Reviewed: https://review.opendev.org/c/starlingx/config/+/865856
Committed: https://opendev.org/starlingx/config/commit/b83b0e70fef2073e0d56e4a18fc2fb61fd973b84
Submitter: "Zuul (22348)"
Branch: master

commit b83b0e70fef2073e0d56e4a18fc2fb61fd973b84
Author: Dan Voiculeasa <email address hidden>
Date: Mon Nov 28 16:39:21 2022 +0200

    AppFwk: Add FluxCD recovery logic for apply operation [2]

    Add some robustness to the app framework. It is observed that the
    framework can reach a state where helm charts are not uploaded to
    HelmRepository. This leads to the app framework waiting for
    reconciliation of HelmRepository to be fired. Currently the
    reconciliation interval is set to 60 minutes for every app checked.

    Issue becomes obvious when updating the app to use newer HelmCharts.
    HelmChart observed status is '''chart pull error: failed to get chart
    version for remote reference: no chart name found''' which is a
    string the recovery logic will attempt to recover from.

    Update recovery logic to trigger a HelmRepository reconciliation
    before a HelmChart reconciliation.

    Skip CentOS testing because we use the same fluxcd and kubernetes.
    The only difference is the python kubernetes library, but the
    implementation does not use any new API calls.

    Tests on AIO-SX Debian:
    PASS: AIO-SX unlocked enabled available
    PASS: inspect logs to see HelmRepository
          reconciliation is triggered by the recovery logic.

    Closes-Bug: 1995748
    Signed-off-by: Dan Voiculeasa <email address hidden>
    Change-Id: I34ae586a5a267b636164d011b5fa5d44ce8c9a6c
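
The ordering described in this commit (HelmRepository first, then HelmChart) can be sketched as below. This is a minimal illustration and not the code merged in the review: the function names, the "example-repo" and "example-chart" resource names, and the use of the reconcile.fluxcd.io/requestedAt annotation as the trigger are all assumptions made for the example. The sketch only builds the patch body and the ordered plan; it does not talk to a cluster.

```python
from datetime import datetime, timezone

# Annotation that Flux controllers watch to run an on-demand reconciliation.
# Assumption: the merged fix may use a different trigger mechanism.
RECONCILE_ANNOTATION = "reconcile.fluxcd.io/requestedAt"

# Status substring quoted in the commit message above.
NO_CHART_NAME = "no chart name found"


def reconcile_patch(now=None):
    """Build a merge-patch body that requests an immediate reconciliation."""
    now = now or datetime.now(timezone.utc)
    return {"metadata": {"annotations": {RECONCILE_ANNOTATION: now.isoformat()}}}


def plan_recovery(helmchart_status_message, repo_name, chart_name):
    """Return the ordered (kind, name) resources to reconcile, or [] if none."""
    if NO_CHART_NAME not in helmchart_status_message:
        return []
    # Repository first, so its index can list the chart before the chart
    # itself is retried.
    return [("HelmRepository", repo_name), ("HelmChart", chart_name)]


# Hypothetical resource names, for illustration only.
message = ("chart pull error: failed to get chart version for remote "
           "reference: no chart name found")
for kind, name in plan_recovery(message, "example-repo", "example-chart"):
    print(kind, name, reconcile_patch())
```

Keeping the plan separate from the patch body makes the ordering easy to check without a running cluster; applying each patch would be done with whatever Kubernetes client the framework already uses.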

Changed in starlingx:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to config (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/config/+/866862

OpenStack Infra (hudson-openstack) wrote : Related fix merged to ansible-playbooks (master)

Reviewed: https://review.opendev.org/c/starlingx/ansible-playbooks/+/866820
Committed: https://opendev.org/starlingx/ansible-playbooks/commit/b9f34317963766ca8c1492cd3b5d8a702636c371
Submitter: "Zuul (22348)"
Branch: master

commit b9f34317963766ca8c1492cd3b5d8a702636c371
Author: Dan Voiculeasa <email address hidden>
Date: Tue Dec 6 19:49:02 2022 +0200

    Upversion FluxCD release to v0.37.0

    There are several FluxCD issues which were attempted to be fixed by
    implementing recovery logic into app framework. A new issue was
    observed that may be hit more often because of the recovery logic.

    Flux won't attempt to finish updating a HelmRelease resource to
    Ready=True but only displays a message containing this signature:
    '''the object has been modified...'''. The resource must be correctly
    updated by Flux.
    Aside from this the recovery logic was put in place for HelmChart
    resources that are in Ready=False state, but this should have been
    handled by Flux.

    I believe the 2 issues described above map to [1] and [2].
    To pull in [1] and [2], update the FluxCD release to the latest
    available [3].
    Release is composed of manifests and container image used.
    Update containers used: helm-controller from v0.15.0 to v0.27.0,
    source-controller from v0.20.1 to v0.32.1.
    Update flux crds, rbac, deployment.
    Changelog claims we get k8s 1.25 support as part of the upversion.

    Disclaimer for tests:
    1) Recovery logic concerned with flipping spec.suspend was removed.
    Introduced by ongoing [4]
    2) optimization by flipping spec.suspend was removed
    Introduced by ongoing [4]
    3) cert-manager, nginx-ingress-controller, platform-integ-apps had the
    reconciliation interval decreased to 1m to allow Flux to manage the
    resources by itself in a reasonable time interval.
    There will be future commits per app updating reconciliation interval.

    Tests on AIO-SX:
    PASS: bootstrap
    PASS: unlocked enabled available
    PASS: apps applied
    PASS: tested on k8s 1.24.4 and 1.21.8; infer that versions in between
          should be OK
    PASS: delete pods and see they are recreated and no errors
    PASS: inspect flux pod logs for errors
    PASS: re-test known trigger for 1996747 and 1995748

    [1]: https://github.com/fluxcd/source-controller/pull/703
    [2]: https://github.com/fluxcd/source-controller/pull/202
    [3]: https://github.com/fluxcd/flux2/releases/tag/v0.37.0
    [4]: https://review.opendev.org/c/starlingx/config/+/866862
    Related-Bug: 1995748
    Related-Bug: 1996747
    Closes-Bug: 1999032
    Signed-off-by: Dan Voiculeasa <email address hidden>
    Change-Id: Id295599a2946f48081e7d27e2ab8e06063c3c88d
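
To see why the reconciliation interval matters, the wait for the next scheduled reconciliation can be estimated with the sketch below. It is a simplified model: it assumes reconciliation fires on a fixed schedule and that nothing else triggers it, which is not a full description of how Flux behaves. The function names are hypothetical, and only intervals written with h, m and s units are handled.

```python
import re

_UNIT_SECONDS = {"h": 3600, "m": 60, "s": 1}
_TOKEN = re.compile(r"(\d+)([hms])")


def parse_interval(text):
    """Convert an interval such as '1m', '60m' or '1h30m' to seconds."""
    tokens = _TOKEN.findall(text)
    # Reject anything the tokens do not cover exactly (for example '5x').
    if not tokens or "".join(n + u for n, u in tokens) != text:
        raise ValueError(f"unsupported interval: {text!r}")
    seconds = sum(int(n) * _UNIT_SECONDS[u] for n, u in tokens)
    if seconds <= 0:
        raise ValueError(f"interval must be positive: {text!r}")
    return seconds


def wait_for_next_reconcile(interval, seconds_since_last):
    """Seconds until the next scheduled reconciliation picks up a change."""
    period = parse_interval(interval)
    return period - (seconds_since_last % period)


# A chart uploaded 5 seconds after the last reconciliation:
print(wait_for_next_reconcile("60m", 5))
print(wait_for_next_reconcile("1m", 5))
```

Under this model, a chart that just misses a reconciliation waits almost the full interval, which matches the gap between the hour-long applies reported in this bug and the 1m interval used in the tests above.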

OpenStack Infra (hudson-openstack) wrote : Related fix merged to config (master)

Reviewed: https://review.opendev.org/c/starlingx/config/+/866862
Committed: https://opendev.org/starlingx/config/commit/85e3b47912d1aab8f771648fd3a76b7d1402bd2e
Submitter: "Zuul (22348)"
Branch: master

commit 85e3b47912d1aab8f771648fd3a76b7d1402bd2e
Author: Dan Voiculeasa <email address hidden>
Date: Mon Dec 5 21:55:51 2022 +0200

    AppFwk: Remove recovery logic based on spec.suspend

    Because of the FluxCD upversion described in [1], we don't need the
    recovery logic flipping spec.suspend. Flux is supposed to properly
    reconcile the resources.

    Remove recovery logic concerned with flipping spec.suspend.
    Remove optimizations for triggering reconciliation by flipping
    spec.suspend.

    Disclaimer for tests:
    1) This was applied on top of [1].
    2) cert-manager, nginx-ingress-controller, platform-integ-apps had the
    reconciliation interval decreased to 1m to allow Flux to manage the
    resources by itself in a reasonable time interval.
    There will be future commits per app updating reconciliation interval.

    Tests on AIO-SX:
    PASS: bootstrap
    PASS: unlocked enabled available
    PASS: apps applied
    PASS: inspect flux pod logs for errors
    PASS: re-test known trigger for 1996747 and 1995748
    PASS: re-test known trigger 1997368

    [1]: https://review.opendev.org/c/starlingx/ansible-playbooks/+/866820/
    Depends-On: https://review.opendev.org/c/starlingx/ansible-playbooks/+/866820/
    Related-Bug: 1995748
    Related-Bug: 1996747
    Related-Bug: 1997368
    Partial-Bug: 1999032
    Signed-off-by: Dan Voiculeasa <email address hidden>
    Change-Id: I932d85d8b366479b2c1d2c88a0acf7fad219b131

Ghada Khalil (gkhalil)
tags: added: stx.8.0