STX-Openstack: Pods locked in Init state after node reboot

Bug #2076118 reported by Thales Elero Cervi
Affects: StarlingX
Status: Fix Released
Importance: High
Assigned to: Daniel Marques Caires

Bug Description

Brief Description
-----------------

After a node reboot, several pods were locked in the Init state, with their init containers waiting for Jobs that had already been cleaned up after completion by the TTL configuration.

Cause: the airship/kubernetes-entrypoint DEPENDENCY_JOBS definition conflicting with the Jobs' TTL configuration

In order to be able to add the "app.starlingx.io/component" spec label to stx-openstack related Jobs [1], a TTL [2] was recently added to the helm-toolkit Job specs. This goes against the openstack-helm design decision of using airship/kubernetes-entrypoint [3] in several init containers and defining DEPENDENCY_JOBS for it: once the TTL controller deletes a completed Job, any init container listing that Job as a dependency waits for it forever.
Therefore, at least for now, Jobs should not have the TTL configuration and, consequently, should not receive the spec update adding the "app.starlingx.io/component" label (unless a different mechanism is found to work around the fact that the template section of a Job is immutable and cannot be updated [4]).

[1] https://storyboard.openstack.org/#!/story/2010612
[2] https://kubernetes.io/docs/concepts/workloads/controllers/job/#clean-up-finished-jobs-automatically
[3] https://opendev.org/airship/kubernetes-entrypoint/src/branch/master/README.md
[4] https://stackoverflow.com/questions/61654433/helm-upgrade-failed-cannot-patch-with-kind-job-by-update-field-image
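The conflict can be observed directly on a deployed system. A minimal sketch, assuming stock kubectl; the TTL value shown is illustrative, and the pod/Job names are taken from the logs below:

$ kubectl -n openstack get job cinder-db-sync -o jsonpath='{.spec.ttlSecondsAfterFinished}'
3600
$ kubectl -n openstack get pod cinder-backup-6c7b65b98b-bsk52 \
    -o jsonpath='{.spec.initContainers[*].env[?(@.name=="DEPENDENCY_JOBS")].value}'
cinder-db-sync

Once the TTL expires, Kubernetes deletes the completed Job object outright, so after a reboot the kubernetes-entrypoint init container polls forever for a Job that no longer exists.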

Severity
--------
Major: System endurance was jeopardized

Steps to Reproduce
------------------
- Apply stx-openstack
- Reboot a node (see the command sketch below)
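For example (a sketch; the CLI session setup is standard StarlingX and the reboot method is illustrative):

$ source /etc/platform/openrc
$ system application-apply stx-openstack
$ kubectl -n openstack get pods   # wait until all pods are Running
$ sudo reboot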

Expected Behavior
-----------------
All pods should be Running

Actual Behavior
---------------
Some pods are locked in the Init state
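For example, the affected pods show an Init:N/M status that never progresses (pod name and column values are illustrative):

$ kubectl -n openstack get pods | grep Init
cinder-backup-6c7b65b98b-bsk52   0/1   Init:0/1   0   3h2m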

Reproducibility
---------------
Reproducible

System Configuration
--------------------
Found in AIO-SX (virtual) and AIO-DX (physical) deployments.

Timestamp/Logs
--------------
$ kubectl -n openstack logs -f pod/cinder-backup-6c7b65b98b-bsk52 -c init

Entrypoint WARNING: 2024/07/30 13:09:43 entrypoint.go:72: Resolving dependency Job cinder-db-sync in namespace openstack failed: jobs.batch "cinder-db-sync" not found .

$ kubectl -n openstack logs -f pod/cinder-volume-usage-audit-28704605-j5s9w -c init
Entrypoint WARNING: 2024/07/30 13:10:17 entrypoint.go:72: Resolving dependency Job cinder-db-sync in namespace openstack failed: jobs.batch "cinder-db-sync" not found .
Entrypoint WARNING: 2024/07/30 13:10:17 entrypoint.go:72: Resolving dependency Job cinder-ks-user in namespace openstack failed: jobs.batch "cinder-ks-user" not found .
Entrypoint WARNING: 2024/07/30 13:10:17 entrypoint.go:72: Resolving dependency Job cinder-ks-endpoints in namespace openstack failed: jobs.batch "cinder-ks-endpoints" not found .

$ kubectl -n openstack logs -f pod/heat-engine-cleaner-28704595-vth85 -c init
Entrypoint WARNING: 2024/07/30 13:11:04 entrypoint.go:72: Resolving dependency Job heat-ks-user in namespace openstack failed: jobs.batch "heat-ks-user" not found .
Entrypoint WARNING: 2024/07/30 13:11:04 entrypoint.go:72: Resolving dependency Job heat-db-sync in namespace openstack failed: jobs.batch "heat-db-sync" not found .
Entrypoint WARNING: 2024/07/30 13:11:04 entrypoint.go:72: Resolving dependency Job heat-ks-endpoints in namespace openstack failed: jobs.batch "heat-ks-endpoints" not found .

$ kubectl -n openstack logs -f pod/nova-service-cleaner-28704600-kwvr4 -c init
Entrypoint WARNING: 2024/07/29 23:41:11 entrypoint.go:72: Resolving dependency Job nova-db-sync in namespace openstack failed: jobs.batch "nova-db-sync" not found .
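Consistent with the warnings above, the dependency Jobs were already garbage-collected by their TTL, e.g.:

$ kubectl -n openstack get job cinder-db-sync
Error from server (NotFound): jobs.batch "cinder-db-sync" not found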

Alarms
------
None

Test Activity
-------------
Developer Testing

Workaround
----------
Remove and apply the application.
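A sketch using the standard StarlingX application commands:

$ source /etc/platform/openrc
$ system application-remove stx-openstack
$ system application-apply stx-openstack   # run once the removal has completed

Re-applying recreates the dependency Jobs, so the waiting init containers can resolve them and the pods move past Init.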

Changed in starlingx:
assignee: nobody → Daniel Marques Caires (daniel-caires)
importance: Undecided → High
tags: added: stx.distro.openstack
Changed in starlingx:
status: New → Confirmed
Changed in starlingx:
assignee: Daniel Marques Caires (daniel-caires) → nobody
assignee: nobody → Daniel Marques Caires (dcaires)
OpenStack Infra (hudson-openstack) wrote : Fix proposed to openstack-armada-app (master)
Changed in starlingx:
status: Confirmed → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to openstack-armada-app (master)

Reviewed: https://review.opendev.org/c/starlingx/openstack-armada-app/+/925972
Committed: https://opendev.org/starlingx/openstack-armada-app/commit/a613275054b3f53ca23bbdaed9d27f03bc27e9ff
Submitter: "Zuul (22348)"
Branch: master

commit a613275054b3f53ca23bbdaed9d27f03bc27e9ff
Author: Daniel Caires <email address hidden>
Date: Thu Aug 8 08:25:39 2024 -0300

    Revert ttl addition to STX-Openstack jobs

    The ttl label was added as a way to handle the reapply of
    the application after a user override [1][2].

    The deletion of the jobs using the ttlSecondsAfterFinished
    was causing some pods of the application to be stuck in
    init state after a host reboot.

    Some pods in STX-Openstack have job dependencies, meaning that
    they will only start if these jobs exist and are completed.

    This review removes the ttlSecondsAfterFinished from the jobs in
    the application.

    [1]: https://review.opendev.org/c/starlingx/openstack-armada-app/+/924351
    [2]: https://review.opendev.org/c/starlingx/openstack-armada-app/+/925481

    Test Plan:
    PASS: STX-Openstack is built
    PASS: STX-Openstack upload and apply
    PASS: All pods come back online after host reboot
    PASS: Remove and Delete STX-Openstack

    Partial-Bug: 2076118

    Change-Id: I0a4392701b255ea2aeb6bc942d085dc3588ca641
    Signed-off-by: Daniel Caires <email address hidden>

Changed in starlingx:
status: In Progress → Fix Released
Ghada Khalil (gkhalil)
tags: added: stx.10.0