Vault agent injector on AIO-SX should not have anti-affinity rule

Bug #2030901 reported by Michel Thebeau [WIND]
Affects    Status        Importance  Assigned to  Milestone
StarlingX  Fix Released  Low         Tae Park

Bug Description

Brief Description
-----------------

On AIO-SX, applying a vault application update that changes the vault injector Deployment resource leaves the new injector pod unschedulable: the old injector pod is still running, there is only one node, and the anti-affinity rule does not allow both pods on that node.

Severity
--------

Minor: workaround exists

Steps to Reproduce
------------------

# Where the current vault injector is using the latest image tag 1.2.1,
# tell the application to use 1.2.0 in order to prompt the injector
# pod to 'update'.

# This workflow is loosely based on the following commit, which
# is identified as an example change that causes the condition:
#   commit 198f4e51 "set images to pull from configured registries"

# This sample yaml causes the vault injector to be updated
$ cat <<EOF > vault-injector.yaml
injector:
  image:
    tag: 1.2.0
EOF

# show and update helm overrides
$ system helm-override-list vault
$ system helm-override-show vault vault vault
$ system helm-override-update vault vault vault \
  --values=vault-injector.yaml
$ system helm-override-show vault vault vault

# apply the new helm overrides
$ system application-apply vault
$ system application-list

# observe pod that is not being scheduled:
$ kubectl get pods -n vault

# examine the pod events to see the bug's symptom
$ unschedulablePod=sva-vault-agent-injector-57854f4589-zqb5r
$ kubectl describe pods -n vault $unschedulablePod \
  | grep anti-affinity

  Warning  FailedScheduling  4m28s  default-scheduler  0/1 nodes are available: 1 node(s) didn't match pod anti-affinity rules. preemption: 0/1 nodes are available: 1 No preemption victims found for incoming pod.
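
The check above can be scripted. The following is a minimal sketch, not part of the original report: `has_anti_affinity_failure` is a hypothetical helper name, and it only matches the event text shown in the output above.

```shell
# Hypothetical helper: reads `kubectl describe pods` output on stdin and
# succeeds only if it contains a FailedScheduling event caused by
# pod anti-affinity rules (the symptom of this bug).
has_anti_affinity_failure() {
  grep -q "FailedScheduling.*didn't match pod anti-affinity rules"
}

# Example usage (assumes kubectl access to the vault namespace):
#   kubectl describe pods -n vault "$unschedulablePod" \
#     | has_anti_affinity_failure && echo "blocked by anti-affinity"
```

The helper reads stdin only, so it can also be run against saved `kubectl describe` output.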

Expected Behavior
-----------------

When the vault application is updated on AIO-SX, every pod whose resources are updated can restart.

Actual Behavior
----------------

The new vault agent injector pod cannot be scheduled and reports a FailedScheduling event.

Reproducibility
---------------

100% reproducible when an app update contains a change to the vault injector Deployment resource.

System Configuration
--------------------

AIO-SX only

Branch/Pull Time/Commit
-----------------------

StarlingX master, 20230508T060000Z

Last Pass
---------

N/A; this has never passed. The defect has been present since day one. (I have not looked at older vault versions.)

Timestamp/Logs
--------------

N/A, per steps to reproduce

Test Activity
-------------

Feature test of vault for another defect, including test of auto-update functionality.

Workaround
----------

1. Before performing application-update:

###
# First, update the current deployment to disable anti-affinity
# and permit the number of running pods to be zero during update

# maxUnavailable is required to work-around current replicaset's
# configuration

cat <<EOF >injector_override.yaml
injector:
  strategy:
    rollingUpdate:
      maxUnavailable: 100%
  affinity: {}
EOF

system helm-override-update vault vault vault --values=injector_override.yaml
system helm-override-show vault vault vault
system application-apply vault

# wait for replicaset and pod to restart

###
# Second, use the helm overrides we actually want to use by default for AIO-SX

cat <<EOF >injector_override.yaml
injector:
  affinity: {}
EOF

system helm-override-update vault vault vault --values=injector_override.yaml
system helm-override-show vault vault vault
system application-apply vault

# wait for replicaset and pod to restart

###
# Finally, perform the application update that was intended

# etc.
# this assumes the new application also omits anti-affinity

2. If the application update was already run and the bug's symptom is observed:

# Within 30 minutes of the application-update beginning
Delete the replicaset of the running pod so that the new pod can run.
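
A minimal sketch of the second workaround follows. It is not from the original report: `rs_name_from_pod` is a hypothetical helper that assumes the usual Deployment naming, where a pod name is the replicaset name plus a trailing `-<suffix>`.

```shell
# Hypothetical helper: derive the replicaset name from a Deployment-managed
# pod name by stripping the final "-<suffix>" component.
rs_name_from_pod() {
  printf '%s\n' "${1%-*}"
}

# Example usage (assumes kubectl access; runningPod must be the OLD injector
# pod that is still running, not the unschedulable one):
#   runningPod=<name of the running injector pod>
#   kubectl delete replicaset -n vault "$(rs_name_from_pod "$runningPod")"
```

If the naming assumption is in doubt, the owning replicaset can instead be read from the pod itself with `kubectl get pod -n vault "$runningPod" -o jsonpath='{.metadata.ownerReferences[0].name}'`.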

Tae Park (tparkwr)
Changed in starlingx:
assignee: nobody → Tae Park (tparkwr)
Ghada Khalil (gkhalil)
tags: added: stx.9.0 stx.apps stx.security
Tae Park (tparkwr)
Changed in starlingx:
status: New → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to vault-armada-app (master)

Reviewed: https://review.opendev.org/c/starlingx/vault-armada-app/+/891091
Committed: https://opendev.org/starlingx/vault-armada-app/commit/f7a37e6ad91b7a0efa79c9cb9783af343344ad33
Submitter: "Zuul (22348)"
Branch: master

commit f7a37e6ad91b7a0efa79c9cb9783af343344ad33
Author: Tae Park <email address hidden>
Date: Thu Aug 10 14:16:40 2023 -0400

    Removing default injector anti-affinity rules

    Adding a null override over default anti-affinity rules for vault injectors. The default rule only allow one vault injector pod at a time. This is a problem because helm-override and application apply will try to schedule a new pod first before completely removing the old pod.
    This change lets a new vault agent injector pod to be scheduled without issue.

    TEST PLAN:
     - Test for AIO-SX
     - Update helm-override so that vault-injector has a different image tag than default
     - apply the new helm-override
     - There should be no FailedScheduling error in the vault pods
     - Sanity test for both AIO-SX and AIO-DX + 1 worker

    Closes-bug: 2030901

    Change-Id: I9814f502558ab1cbecad48cf37341639c964258f
    Signed-off-by: Tae Park <email address hidden>

Changed in starlingx:
status: In Progress → Fix Released
Ghada Khalil (gkhalil)
Changed in starlingx:
importance: Undecided → Low