vault-manager remains inactive when the cluster host it runs on is locked

Bug #2029375 reported by Michel Thebeau [WIND]
This bug affects 1 person
Affects    Status        Importance  Assigned to  Milestone
StarlingX  Fix Released  Low         Tae Park

Bug Description

Brief Description
-----------------
When vault is in an HA configuration (3 vault servers) and only three cluster nodes are available, locking the cluster node on which vault-manager is running causes vault-manager to be rescheduled, but it does not proceed because it waits for all three vault server pods to be running.

Severity
--------
Minor - no impact on vault function unless other conditions occur

Steps to Reproduce
------------------
Configure AIO-DX plus one worker, or standard controller with one worker. Apply and configure the vault application per the StarlingX documentation.

StarlingX Vault Reference: https://docs.starlingx.io/security/kubernetes/security-vault-overview.html

Confirm that the vault-manager and 3 vault server pods are running:
  kubectl get pods -n vault

Identify the cluster node upon which vault manager is running:
  kubectl get pods -n vault sva-vault-manager-0 -o jsonpath="{.spec.nodeName}{'\n'}"

Use 'system host-lock' command to lock the cluster node where vault-manager is running.

Wait for vault-manager pod to be rescheduled. Watch the vault-manager pod log to see that it remains in initializing state with log "Waiting for sva-vault statefulset running pods":
  kubectl logs -f -n vault sva-vault-manager-0
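
As a rough aid for this step, the sketch below counts how many times the waiting message appears in a log stream. The helper name count_waiting is hypothetical and not part of vault-manager; it only matches the message text quoted above.

```shell
# Hypothetical helper: count occurrences of the waiting message.
# Reads log text on stdin and prints the number of matching lines.
count_waiting() {
    # grep -c exits non-zero when the count is 0, so mask the status.
    grep -c 'Waiting for sva-vault statefulset running pods' || true
}

# Example use against a live cluster (not run here):
#   kubectl logs -n vault sva-vault-manager-0 | count_waiting
```

A count that keeps growing between runs suggests vault-manager is still stuck waiting.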

Expected Behavior
------------------
Since the vault cluster is initialized already, vault manager does not need to wait for the number of pods in statefulset to equal the configured replica count.
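
The expected decision can be sketched roughly as follows. This is an illustration only, not the actual vault-manager init.sh code: the function names and arguments are hypothetical, and the choice of one pod as the threshold for an initialized cluster follows the title of the fix merged for this bug ("wait for one server only when initialized").

```shell
# Hypothetical sketch of the expected wait logic (not the real init.sh).
# $1: "true" if the vault cluster is already initialized
# $2: configured replica count of the vault statefulset
required_running_pods() {
    if [ "$1" = "true" ]; then
        # Already initialized: one running server is enough to proceed.
        echo 1
    else
        # Not initialized: wait for the full replica count.
        echo "$2"
    fi
}

# $3: number of vault server pods currently running
# Returns 0 when vault-manager may proceed, non-zero when it should wait.
ready_to_proceed() {
    [ "$3" -ge "$(required_running_pods "$1" "$2")" ]
}
```

With three replicas and two running pods, ready_to_proceed true 3 2 succeeds, while ready_to_proceed false 3 2 does not.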

Actual Behavior
----------------
Vault-manager remains in initializing state until all three platform nodes are unlocked.

Reproducibility
---------------
100% with 3 cluster nodes (two controllers and one worker)

System Configuration
--------------------
AIO-DX plus one worker
Standard configuration with one worker (2+1)

Branch/Pull Time/Commit
-----------------------
starlingx master

Last Pass
---------
N/A, probably day 1 bug

Timestamp/Logs
--------------
N/A

Test Activity
-------------
Developer testing

Workaround
----------
- Evacuate vault-manager before locking the cluster node, or
- unlock all platform hosts, or
- use AIO-DX plus 2 workers, or
- use Standard controller plus 2 workers.
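
One possible way to carry out the first workaround is sketched below. It is an assumption, not a documented StarlingX procedure: it cordons the node so the pod cannot be scheduled back onto it, then deletes the vault-manager pod so that it is recreated elsewhere while all vault server pods are still running. The function name and the KUBECTL override variable are hypothetical; the cordon may need to be reverted later with kubectl uncordon.

```shell
# Hypothetical sketch: move vault-manager off its node before locking it.
# KUBECTL can be overridden, for example for a dry run.
KUBECTL="${KUBECTL:-kubectl}"

evacuate_vault_manager() {
    # Find the node currently hosting vault-manager.
    node="$("$KUBECTL" get pods -n vault sva-vault-manager-0 \
        -o jsonpath='{.spec.nodeName}')" || return 1
    [ -n "$node" ] || return 1
    # Prevent new pods from being scheduled on that node.
    "$KUBECTL" cordon "$node" || return 1
    # Delete the pod so it is recreated on another node.
    "$KUBECTL" delete pod -n vault sva-vault-manager-0
}
```

After the new vault-manager pod is running on another node, the original node can be locked with 'system host-lock'.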

Ghada Khalil (gkhalil)
Changed in starlingx:
assignee: nobody → Tae Park (tparkwr)
Changed in starlingx:
status: New → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to vault-armada-app (master)

Reviewed: https://review.opendev.org/c/starlingx/vault-armada-app/+/890256
Committed: https://opendev.org/starlingx/vault-armada-app/commit/896008fb732dc1d1541564da109275a753b6e65c
Submitter: "Zuul (22348)"
Branch: master

commit 896008fb732dc1d1541564da109275a753b6e65c
Author: Tae Park <email address hidden>
Date: Tue Aug 1 17:29:01 2023 -0400

    vault-manager wait for one server only when initialized

    Modifying the vault-manager initialization logic so that it only waits
    for pod number equal to the replica value to be active
    if the raft is not yet initialized.

    TEST PLAN:
     - In a 2 controller, 1 worker setup,
     - Upload and apply vault
     - Lock the host that vault-manager is running on
     - Vault manager should restart
     - Within the logs, there should not be a repetition of " Waiting for sva-vault statefulset running pods..."
     - Vault Sanity test in AIO-SX
     - Bashate of rendered init.sh

    Closes-bug: 2029375

    Signed-off-by: Tae Park <email address hidden>
    Change-Id: I41990b87395a5d5364ef91c048f740d0f0675d6b

Changed in starlingx:
status: In Progress → Fix Released
Ghada Khalil (gkhalil)
tags: added: stx.9.0 stx.apps
Ghada Khalil (gkhalil)
Changed in starlingx:
importance: Undecided → Low