Misleading app status after failed override update

Bug #2053276 reported by David Barbosa Bastos
This bug affects 1 person
Affects: StarlingX
Status: Fix Released
Importance: Low
Assigned to: David Barbosa Bastos

Bug Description

Brief Description
-----------------
Application status was misleading after a failed override update with illegal values. The application should be in the failed (apply-failed) state, with an alarm raised accordingly. Instead, the status suggests the update completed successfully.

Severity
--------
Minor

Steps to Reproduce
------------------
1) The application must be in the applied status.
2) Modify the user overrides with illegal values (here, a CPU request larger than the CPU limit): system helm-override-update nginx-ingress-controller ks-ingress-nginx kube-system --set controller.resources.requests.cpu=255 --set controller.resources.limits.cpu=10
3) Reapply the app; the status changes to applied even though the helmrelease fails.

Expected Behavior
------------------
Application should be in the failed (apply-failed) state.

Actual Behavior
----------------
Within seconds of application-apply, the progress becomes 'completed' and the status 'applied'.

Reproducibility
---------------
Reproducible 100%

System Configuration
--------------------
AIO-SX, AIO-DX+, Std

Branch/Pull Time/Commit
-----------------------
SW_VERSION="23.09"
BUILD_TARGET="Host Installer"
BUILD_TYPE="Formal"
BUILD_ID="2024-01-15_19-00-12"
SRC_BUILD_ID="1699"

Last Pass
---------
n/a

Timestamp/Logs
--------------
[sysadmin@controller-0 ~(keystone_admin)]$ system helm-override-update nginx-ingress-controller ks-ingress-nginx kube-system --set controller.resources.requests.cpu=255 --set controller.resources.limits.cpu=10
+----------------+------------------+
| Property | Value |
+----------------+------------------+
| name | ks-ingress-nginx |
| namespace | kube-system |
| user_overrides | controller: |
| | resources: |
| | limits: |
| | cpu: "10" |
| | requests: |
| | cpu: "255" |
| | |
+----------------+------------------+
[sysadmin@controller-0 ~(keystone_admin)]$ system application-apply nginx-ingress-controller ; watch 'kubectl -n kube-system get hr;kubectl -n kube-system get pods -o wide;system application-list'
+---------------+-------------------------------------------+
| Property | Value |
+---------------+-------------------------------------------+
| active | True |
| app_version | 22.12-1 |
| created_at | 2023-12-01T20:48:06.848837+00:00 |
| manifest_file | fluxcd-manifests |
| manifest_name | nginx-ingress-controller-fluxcd-manifests |
| name | nginx-ingress-controller |
| progress | None |
| status | applying |
| updated_at | 2024-01-10T16:42:03.863933+00:00 |
+---------------+-------------------------------------------+
Please use 'system application-list' or 'system application-show nginx-ingress-controller' to view the current progress.
[sysadmin@controller-0 ~(keystone_admin)]$ kubectl -n kube-system get hr;kubectl -n kube-system get pods -o wide;system application-list
NAME AGE READY STATUS
ceph-pools-audit 39d True Release reconciliation succeeded
cephfs-provisioner 39d True Release reconciliation succeeded
ks-ingress-nginx 39d False upgrade retries exhausted
rbd-provisioner 39d True Release reconciliation succeeded
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
calico-kube-controllers-567d594786-9bx9z 1/1 Running 1 (39d ago) 39d 172.16.192.84 controller-0 <none> <none>
calico-node-94gr2 1/1 Running 1 (39d ago) 39d 192.168.204.2 controller-0 <none> <none>
ceph-pools-audit-28415070-kbcq6 0/1 Completed 0 14m 172.16.192.117 controller-0 <none> <none>
ceph-pools-audit-28415075-dn686 0/1 Completed 0 9m2s 172.16.192.72 controller-0 <none> <none>
ceph-pools-audit-28415080-nl8n5 0/1 Completed 0 4m2s 172.16.192.70 controller-0 <none> <none>
cephfs-nodeplugin-jhv55 2/2 Running 0 39d 192.168.204.2 controller-0 <none> <none>
cephfs-provisioner-5d558b94c9-4p8wf 5/5 Running 0 39d 172.16.192.90 controller-0 <none> <none>
cephfs-storage-init-cszql 0/1 Completed 0 39d 172.16.192.91 controller-0 <none> <none>
coredns-78dd5d75bd-ns4b5 1/1 Running 1 (39d ago) 39d 172.16.192.79 controller-0 <none> <none>
ic-nginx-ingress-ingress-nginx-controller-4k54f 1/1 Running 0 2m45s 192.168.204.2 controller-0 <none> <none>
kube-apiserver-controller-0 1/1 Running 1 (39d ago) 39d 192.168.204.2 controller-0 <none> <none>
kube-controller-manager-controller-0 1/1 Running 1 (39d ago) 39d 192.168.204.2 controller-0 <none> <none>
kube-multus-ds-amd64-mk72d 1/1 Running 1 (39d ago) 39d 192.168.204.2 controller-0 <none> <none>
kube-proxy-gr89h 1/1 Running 1 (39d ago) 39d 192.168.204.2 controller-0 <none> <none>
kube-scheduler-controller-0 1/1 Running 1 (39d ago) 39d 192.168.204.2 controller-0 <none> <none>
kube-sriov-cni-ds-amd64-7c5cm 1/1 Running 1 (39d ago) 39d 172.16.192.86 controller-0 <none> <none>
rbd-nodeplugin-w58nn 2/2 Running 0 39d 192.168.204.2 controller-0 <none> <none>
rbd-provisioner-7b9ff47b89-xq6mh 6/6 Running 0 39d 172.16.192.92 controller-0 <none> <none>
rbd-storage-init-nhltt 0/1 Completed 0 39d 172.16.192.93 controller-0 <none> <none>
+--------------------------+----------+-------------------------------------------+------------------+----------+-----------+
| application | version | manifest name | manifest file | status | progress |
+--------------------------+----------+-------------------------------------------+------------------+----------+-----------+
| cert-manager | 22.12-8 | cert-manager-fluxcd-manifests | fluxcd-manifests | applied | completed |
| metrics-server | 22.12-1 | metrics-server-fluxcd-manifests | fluxcd-manifests | applied | completed |
| nginx-ingress-controller | 22.12-1 | nginx-ingress-controller-fluxcd-manifests | fluxcd-manifests | applied | completed |
| oidc-auth-apps | 22.12-6 | oidc-auth-apps-fluxcd-manifests | fluxcd-manifests | uploaded | completed |
| platform-integ-apps | 22.12-62 | platform-integ-apps-fluxcd-manifests | fluxcd-manifests | applied | completed |
| wr-analytics | 24.03-0 | wr-analytics-fluxcd-manifests | fluxcd-manifests | applied | completed |
+--------------------------+----------+-------------------------------------------+------------------+----------+-----------+
[sysadmin@controller-0 ~(keystone_admin)]$ kubectl -n kube-system describe hr ks-ingress-nginx
Name: ks-ingress-nginx
Namespace: kube-system
Labels: chart_group=ingress-nginx
Annotations: <none>
API Version: helm.toolkit.fluxcd.io/v2beta1
Kind: HelmRelease
Metadata:
  Creation Timestamp: 2023-12-01T20:48:25Z
  Finalizers:
    finalizers.fluxcd.io
  Generation: 1
  Managed Fields:
    API Version: helm.toolkit.fluxcd.io/v2beta1
    Fields Type: FieldsV1
    fieldsV1:
      f:metadata:
        f:finalizers:
          .:
          v:"finalizers.fluxcd.io":
    Manager: helm-controller
    Operation: Update
    Time: 2023-12-01T20:48:25Z
    API Version: helm.toolkit.fluxcd.io/v2beta1
    Fields Type: FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .:
          f:kubectl.kubernetes.io/last-applied-configuration:
        f:labels:
          .:
          f:chart_group:
      f:spec:
        .:
        f:chart:
          .:
          f:spec:
            .:
            f:chart:
            f:reconcileStrategy:
            f:sourceRef:
              .:
              f:kind:
              f:name:
            f:version:
        f:install:
          .:
          f:disableHooks:
        f:interval:
        f:releaseName:
        f:test:
          .:
          f:enable:
        f:timeout:
        f:upgrade:
          .:
          f:disableHooks:
        f:valuesFrom:
    Manager: kubectl-client-side-apply
    Operation: Update
    Time: 2023-12-01T20:48:25Z
    API Version: helm.toolkit.fluxcd.io/v2beta1
    Fields Type: FieldsV1
    fieldsV1:
      f:status:
        f:conditions:
        f:failures:
        f:helmChart:
        f:lastAppliedRevision:
        f:lastAttemptedRevision:
        f:lastAttemptedValuesChecksum:
        f:lastReleaseRevision:
        f:observedGeneration:
        f:upgradeFailures:
    Manager: helm-controller
    Operation: Update
    Subresource: status
    Time: 2024-01-10T16:43:30Z
  Resource Version: 20773261
  UID: b4cd91a4-c6fa-41df-bf68-8abcaf30f282
Spec:
  Chart:
    Spec:
      Chart: ingress-nginx
      Reconcile Strategy: ChartVersion
      Source Ref:
        Kind: HelmRepository
        Name: stx-platform
      Version: 4.0.15
  Install:
    Disable Hooks: false
  Interval: 1m
  Release Name: ic-nginx-ingress
  Test:
    Enable: false
  Timeout: 30m
  Upgrade:
    Disable Hooks: false
  Values From:
    Kind: Secret
    Name: ingress-nginx-static-overrides
    Values Key: ingress-nginx-static-overrides.yaml
    Kind: Secret
    Name: ingress-nginx-system-overrides
    Values Key: ingress-nginx-system-overrides.yaml
Status:
  Conditions:
    Last Transition Time: 2024-01-10T16:43:30Z
    Message: upgrade retries exhausted
    Reason: UpgradeFailed
    Status: False
    Type: Ready
    Last Transition Time: 2024-01-10T16:43:30Z
    Message: Helm upgrade failed: cannot patch "ic-nginx-ingress-ingress-nginx-controller" with kind DaemonSet: DaemonSet.apps "ic-nginx-ingress-ingress-nginx-controller" is invalid: spec.template.spec.containers[0].resources.requests: Invalid value: "255": must be less than or equal to cpu limitLast Helm logs:Patch DaemonSet "ic-nginx-ingress-ingress-nginx-controller" in namespace kube-system
error updating the resource "ic-nginx-ingress-ingress-nginx-controller":
   cannot patch "ic-nginx-ingress-ingress-nginx-controller" with kind DaemonSet: DaemonSet.apps "ic-nginx-ingress-ingress-nginx-controller" is invalid: spec.template.spec.containers[0].resources.requests: Invalid value: "255": must be less than or equal to cpu limit
Looks like there are no changes for IngressClass "nginx"
Patch ValidatingWebhookConfiguration "ic-nginx-ingress-ingress-nginx-admission" in namespace
warning: Upgrade "ic-nginx-ingress" failed: cannot patch "ic-nginx-ingress-ingress-nginx-controller" with kind DaemonSet: DaemonSet.apps "ic-nginx-ingress-ingress-nginx-controller" is invalid: spec.template.spec.containers[0].resources.requests: Invalid value: "255": must be less than or equal to cpu limit
    Reason: UpgradeFailed
    Status: False
    Type: Released
  Failures: 9
  Helm Chart: kube-system/kube-system-ks-ingress-nginx
  Last Applied Revision: 4.0.15
  Last Attempted Revision: 4.0.15
  Last Attempted Values Checksum: 78fa2175ba00ae13a7b2234b1a1e55f6f235195f
  Last Release Revision: 8
  Observed Generation: 1
  Upgrade Failures: 1
Events:
  Type Reason Age From Message
  ---- ------ ---- ---- -------
  Normal info 5m27s (x6 over 35m) helm-controller Helm upgrade succeeded
  Normal info 3m27s (x7 over 35m) helm-controller Helm upgrade has started
  Warning error 3m21s helm-controller Helm upgrade failed: cannot patch "ic-nginx-ingress-ingress-nginx-controller" with kind DaemonSet: DaemonSet.apps "ic-nginx-ingress-ingress-nginx-controller" is invalid: spec.template.spec.containers[0].resources.requests: Invalid value: "255": must be less than or equal to cpu limitLast Helm logs:Patch DaemonSet "ic-nginx-ingress-ingress-nginx-controller" in namespace kube-system
error updating the resource "ic-nginx-ingress-ingress-nginx-controller":
   cannot patch "ic-nginx-ingress-ingress-nginx-controller" with kind DaemonSet: DaemonSet.apps "ic-nginx-ingress-ingress-nginx-controller" is invalid: spec.template.spec.containers[0].resources.requests: Invalid value: "255": must be less than or equal to cpu limit
Looks like there are no changes for IngressClass "nginx"
Patch ValidatingWebhookConfiguration "ic-nginx-ingress-ingress-nginx-admission" in namespace
warning: Upgrade "ic-nginx-ingress" failed: cannot patch "ic-nginx-ingress-ingress-nginx-controller" with kind DaemonSet: DaemonSet.apps "ic-nginx-ingress-ingress-nginx-controller" is invalid: spec.template.spec.containers[0].resources.requests: Invalid value: "255": must be less than or equal to cpu limit
  Warning error 3m21s helm-controller reconciliation failed: Helm upgrade failed: cannot patch "ic-nginx-ingress-ingress-nginx-controller" with kind DaemonSet: DaemonSet.apps "ic-nginx-ingress-ingress-nginx-controller" is invalid: spec.template.spec.containers[0].resources.requests: Invalid value: "255": must be less than or equal to cpu limit
  Warning error 9s (x8 over 3m20s) helm-controller reconciliation failed: upgrade retries exhausted

Test Activity
-------------
Testing

Workaround
----------
n/a

Changed in starlingx:
status: New → In Progress
Changed in starlingx:
assignee: nobody → David Barbosa Bastos (dbarbosa-wr)
OpenStack Infra (hudson-openstack) wrote : Fix merged to config (master)

Reviewed: https://review.opendev.org/c/starlingx/config/+/908856
Committed: https://opendev.org/starlingx/config/commit/ce4b7c1eb328c8c6bc443da4fd5b241f5384b207
Submitter: "Zuul (22348)"
Branch: master

commit ce4b7c1eb328c8c6bc443da4fd5b241f5384b207
Author: David Bastos <email address hidden>
Date: Mon Feb 12 15:51:59 2024 -0300

    Fix misleading app status after failed override update

    Application status was misleading after a failed override update with
    illegal values. Application should be in failed (apply-failed) state,
    and alarm should be raised accordingly. Instead, we're led to believe
    that the update was completed successfully.

    The solution consists of adding a default delay to the system of 60
    seconds before changing the helmrelease status. This way we ensure
    that reconciliation has already been called.

    This also ensures that any application can override this default
    value via metadata. Just create a variable with the same name with
    the amount of time that is needed.

    Test Plan:
    PASS: Build-pkgs && build-image
    PASS: Upload, apply, delete and update nginx-ingress-controller
    PASS: Upload, apply, delete and update platform-integ-apps
    PASS: Upload, apply, delete and update metrics-server
    PASS: Update user overrides (system user-override-update) with illegal
          values. When reapplying the app it should fail.
    PASS: Update user overrides (system user-override-update) with correct
          values. When reapplying the app it should complete successfully.
    PASS: If the app has the fluxcd_hr_reconcile_check_delay key in its
          metadata, the system's default delay value must be overwritten.

    Closes-Bug: 2053276

    Change-Id: I5e75745009be235e2646a79764cb4ff619a93d59
    Signed-off-by: David Bastos <email address hidden>
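The delay logic described in the commit can be sketched as follows. This is an illustrative sketch only, not the actual sysinv implementation: the function and parameter names (other than the metadata key fluxcd_hr_reconcile_check_delay and the 60-second default, which the commit states) are hypothetical.

```python
import time

# Default delay (seconds) before trusting the HelmRelease status after an
# apply, so FluxCD has had time to start reconciling the new overrides.
# The commit states apps can override this via the metadata key
# "fluxcd_hr_reconcile_check_delay"; everything else here is illustrative.
DEFAULT_RECONCILE_CHECK_DELAY = 60


def reconcile_check_delay(app_metadata):
    """Return the delay to wait before reading the HelmRelease status.

    Falls back to the system default when the app metadata does not set
    fluxcd_hr_reconcile_check_delay or sets it to a non-numeric value.
    """
    try:
        return int(app_metadata.get("fluxcd_hr_reconcile_check_delay",
                                    DEFAULT_RECONCILE_CHECK_DELAY))
    except (TypeError, ValueError):
        return DEFAULT_RECONCILE_CHECK_DELAY


def wait_then_check(hr_is_ready, app_metadata, sleep=time.sleep):
    """Sleep for the configured delay, then report the apply outcome.

    hr_is_ready is a callable that reads the HelmRelease Ready condition
    (e.g. via the Kubernetes API); injecting it keeps the sketch testable.
    """
    sleep(reconcile_check_delay(app_metadata))
    return "applied" if hr_is_ready() else "apply-failed"
```

With this ordering, a HelmRelease that fails reconciliation (as in the logs above, where Ready is False with "upgrade retries exhausted") is observed after the delay, so the app lands in apply-failed instead of being prematurely marked applied.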

Changed in starlingx:
status: In Progress → Fix Released
Ghada Khalil (gkhalil)
Changed in starlingx:
importance: Undecided → Low
tags: added: stx.9.0 stx.apps