Backup & Restore: During AIO-DX restore, ingress validating webhook pod does not terminate

Bug #1988056 reported by Joshua Kraitberg
Affects: StarlingX
Status: Fix Released
Importance: Low
Assigned to: Joshua Kraitberg

Bug Description

Brief Description
-----------------
During restore, the Ansible task that kills the ingress validating webhook pod gets stuck and blocks forever.

During restore, etcd is in a confused state. If multiple nodes were configured when the backup was taken, pods assigned to the other nodes, e.g. controller-1, can only be removed with the '--force' flag.

To observe this, run 'kubectl get pods --all-namespaces -o wide'. The output shows pods as running on the other nodes, even though those nodes have not been installed yet.

```
sysadmin@controller-0:~$ kubectl get nodes
NAME STATUS ROLES AGE VERSION
controller-0 Ready control-plane,master 45h v1.23.1
controller-1 NotReady control-plane,master 44h v1.23.1
```

```
sysadmin@controller-0:~$ kubectl get pods -n kube-system -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
...
ic-nginx-ingress-ingress-nginx-controller-sndpp 1/1 Running 0 44h 192.168.204.3 controller-1 <none> <none>
...
```
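
A minimal sketch for spotting such pods, assuming the pod list is produced with kubectl's custom-columns output as three columns (namespace, pod name, node). The helper name `pods_on_node` is hypothetical and is not part of StarlingX.

```shell
#!/bin/sh
# Hypothetical helper: print "namespace name" for every pod bound to a node.
# Reads whitespace-separated lines "<namespace> <pod> <node>" on stdin.
pods_on_node() {
    awk -v node="$1" '$3 == node { print $1, $2 }'
}

# Usage against a live cluster (not executed here):
#   kubectl get pods --all-namespaces --no-headers \
#     -o custom-columns=NS:.metadata.namespace,NAME:.metadata.name,NODE:.spec.nodeName \
#     | pods_on_node controller-1
```

Using custom columns avoids parsing the wide table, whose RESTARTS column can contain spaces.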

Severity
--------
Minor

Steps to Reproduce
------------------
Run a restore using a backup from AIO-DX.

Expected Behavior
------------------
The restore completes successfully.

Actual Behavior
----------------
The restore gets stuck forever because the pod cannot be killed.

Reproducibility
---------------
100%.

System Configuration
--------------------
Multi-node system.

Branch/Pull Time/Commit
-----------------------
N/A.

Last Pass
---------
N/A.

Timestamp/Logs
--------------
2022-08-24 01:59:18,869 p=1986 u=sysadmin n=ansible | TASK [common/armada-helm : If on system restore mode, kill ingress validating webhook pod so it can be recreated]

Test Activity
-------------
Developer Testing

Workaround
----------
Add the '--force' flag when deleting the pods.
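
As a rough illustration of the workaround, the sketch below wraps the delete command in a hypothetical helper, `force_delete_pod`. The `KUBECTL` override is an assumption added here so the command can be previewed without touching a cluster; it is not something the playbook provides.

```shell
#!/bin/sh
# Hypothetical helper: force-delete one pod.
# $1 = namespace, $2 = pod name.
# Set KUBECTL=echo to print the command instead of running it.
force_delete_pod() {
    ${KUBECTL:-kubectl} delete pod "$2" --namespace "$1" --force
}

# Preview (not executed here):
#   KUBECTL=echo force_delete_pod kube-system ic-nginx-ingress-ingress-nginx-controller-sndpp
```

Force deletion removes the pod object from the API without waiting for confirmation from the node, which is why it succeeds for pods bound to a node that is not installed yet.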

Changed in starlingx:
assignee: nobody → Joshua Kraitberg (jkraitbe-wr)
OpenStack Infra (hudson-openstack) wrote : Fix proposed to ansible-playbooks (master)
Changed in starlingx:
status: New → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to ansible-playbooks (master)

Reviewed: https://review.opendev.org/c/starlingx/ansible-playbooks/+/855037
Committed: https://opendev.org/starlingx/ansible-playbooks/commit/3a5a56cf82e0251e6c34461de4d92a9568552bc4
Submitter: "Zuul (22348)"
Branch: master

commit 3a5a56cf82e0251e6c34461de4d92a9568552bc4
Author: Joshua Kraitberg <email address hidden>
Date: Mon Aug 29 10:00:09 2022 -0400

    Force delete ic-nginx-ingress-controller during restore

    During restore, etcd will be in a confused state. If multiple nodes
    were configured during backup, pods assigned to those other nodes,
    eg. controller-1, will only be removable with '--force' flag.

    To preview this, run 'kubectl get pods --all-namespaces'.
    This will show that pods are running on other nodes,
    despite the nodes not being installed yet.

    TEST PLAN
    Backup & Restore on virtual AIO-DX (Debian)

    Closes-Bug: 1988056
    Signed-off-by: Joshua Kraitberg <email address hidden>
    Change-Id: Ibeb1592d83d4612c52ad69cda77d326fa19b7d12

Changed in starlingx:
status: In Progress → Fix Released
Joshua Kraitberg (jkraitbe-wr) wrote :

The previous fix introduced a race condition in a subsequent step. Depending on how the pods are restarted after being deleted with '--force', the system state may not be registered as ready.

Changed in starlingx:
status: Fix Released → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix proposed to ansible-playbooks (master)
OpenStack Infra (hudson-openstack) wrote : Change abandoned on ansible-playbooks (master)

Change abandoned by "Joshua Kraitberg <email address hidden>" on branch: master
Review: https://review.opendev.org/c/starlingx/ansible-playbooks/+/855392

OpenStack Infra (hudson-openstack) wrote : Fix merged to ansible-playbooks (master)

Reviewed: https://review.opendev.org/c/starlingx/ansible-playbooks/+/855274
Committed: https://opendev.org/starlingx/ansible-playbooks/commit/ac6a57bd584cf03a47b6cd6a5211d5252c050988
Submitter: "Zuul (22348)"
Branch: master

commit ac6a57bd584cf03a47b6cd6a5211d5252c050988
Author: Joshua Kraitberg <email address hidden>
Date: Tue Aug 30 17:47:24 2022 +0000

    Wait for ic-nginx-ingress-controller only on controller-0

    There is an intermittent failure when waiting for these pods to be
    restarted during a restore. Depending on how the pods are
    restarted by kubernetes, the wait command may or may not return a failure.

    TEST PLAN
    Deploy virtual AIO-DX (Debian)
    Backup virtual AIO-DX, then
    Repeat several times with clean VMs: Restore virtual AIO-DX (Debian)

    Closes-Bug: 1988056
    Signed-off-by: Joshua Kraitberg <email address hidden>
    Change-Id: Ibb98b6ab0d0de58b29974c9238abf3dcd89d4811
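
One way to picture the idea in this commit is the sketch below, which waits only for pods scheduled on a given node. It is an illustration under assumptions, not the playbook's actual task: the helper name `wait_ready_on_node`, the `KUBECTL` override, the label selector argument, and the choice of the `Ready` condition are all hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: wait for readiness of pods on a single node only.
# $1 = namespace, $2 = node name, $3 = label selector, $4 = timeout (e.g. 120s).
wait_ready_on_node() {
    # List only the pods that are bound to the requested node.
    pods=$(${KUBECTL:-kubectl} get pods --namespace "$1" --selector "$3" \
        --field-selector "spec.nodeName=$2" --output name) || return 1
    # Nothing scheduled on that node yet: report failure so the caller can retry.
    [ -n "$pods" ] || return 1
    # $pods is intentionally unquoted so each "pod/<name>" becomes one argument.
    ${KUBECTL:-kubectl} wait --namespace "$1" --for=condition=Ready \
        --timeout="$4" $pods
}

# Example call (not executed here; the selector value is a placeholder):
#   wait_ready_on_node kube-system controller-0 '<label-selector>' 120s
```

Restricting the wait to controller-0 keeps pods that etcd still assigns to controller-1 from influencing the result.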

Changed in starlingx:
status: In Progress → Fix Released
Ghada Khalil (gkhalil)
Changed in starlingx:
importance: Undecided → Low
tags: added: stx.8.0 stx.update