StarlingX

Backup & Restore: During AIO-DX restore, ingress validating webhook pod does not terminate

Bug #1988056 reported by Joshua Kraitberg on 2022-08-29

This bug affects 1 person

Affects		Status	Importance	Assigned to	Milestone
	StarlingX	Fix Released	Low	Joshua Kraitberg

Bug Description

Brief Description
-----------------
During restore, the step that kills the ingress validating webhook pod hangs and blocks indefinitely.

During restore, etcd is in an inconsistent state. If multiple nodes were configured at backup time, pods assigned to the other nodes, e.g. controller-1, can only be removed with the '--force' flag.

To see this, run 'kubectl get pods --all-namespaces'. The output shows pods running on other nodes even though those nodes have not been installed yet.

```
sysadmin@controller-0:~$ kubectl get nodes
NAME STATUS ROLES AGE VERSION
controller-0 Ready control-plane,master 45h v1.23.1
controller-1 NotReady control-plane,master 44h v1.23.1
```

```
sysadmin@controller-0:~$ kubectl get pods -n kube-system
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
...
ic-nginx-ingress-ingress-nginx-controller-sndpp 1/1 Running 0 44h 192.168.204.3 controller-1 <none> <none>
...
```
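
The check above can also be done programmatically. The following is a minimal sketch, not part of the fix, that lists pods still bound to nodes that are not Ready. It assumes the JSON produced by 'kubectl get nodes -o json' and 'kubectl get pods --all-namespaces -o json'; the function names are hypothetical.

```python
import json


def not_ready_nodes(nodes_json):
    """Return the names of nodes whose Ready condition is not "True"."""
    names = set()
    for node in json.loads(nodes_json)["items"]:
        conditions = node.get("status", {}).get("conditions", [])
        ready = any(
            c.get("type") == "Ready" and c.get("status") == "True"
            for c in conditions
        )
        if not ready:
            names.add(node["metadata"]["name"])
    return names


def pods_on_nodes(pods_json, node_names):
    """Return (namespace, pod, node) for every pod bound to one of node_names."""
    found = []
    for pod in json.loads(pods_json)["items"]:
        node = pod.get("spec", {}).get("nodeName")
        if node in node_names:
            meta = pod["metadata"]
            found.append((meta["namespace"], meta["name"], node))
    return found
```

On a system in the state shown above, passing the result of not_ready_nodes() to pods_on_nodes() would be expected to report the ic-nginx-ingress pod on controller-1.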

Severity
--------
Minor

Steps to Reproduce
------------------
Run a restore using a backup from AIO-DX.

Expected Behavior
------------------
Restore works.

Actual Behavior
----------------
Restore gets stuck forever because the pod cannot be killed.

Reproducibility
---------------
100%.

System Configuration
--------------------
Multi-node system.

Branch/Pull Time/Commit
-----------------------
N/A.

Last Pass
---------
N/A.

Timestamp/Logs
--------------
2022-08-24 01:59:18,869 p=1986 u=sysadmin n=ansible | TASK [common/armada-helm : If on system restore mode, kill ingress validating webhook pod so it can be recreated]

Test Activity
-------------
Developer Testing

Workaround
----------
Add the '--force' flag when deleting pods.
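
As an illustration of the workaround only, the sketch below builds and runs a 'kubectl delete pod ... --force' command. It is not the playbook task from the merged change; the helper names and the injectable runner parameter are assumptions made for this example.

```python
import subprocess


def force_delete_command(pod, namespace):
    """Build the argv for force-deleting a single pod."""
    return ["kubectl", "delete", "pod", pod, "--namespace", namespace, "--force"]


def force_delete(pod, namespace, runner=subprocess.run):
    """Run the force delete; raises CalledProcessError if kubectl fails.

    The runner argument is hypothetical and exists only so the call can be
    exercised without a cluster.
    """
    return runner(force_delete_command(pod, namespace), check=True)
```

For the pod shown in the description, the call would be force_delete("ic-nginx-ingress-ingress-nginx-controller-sndpp", "kube-system").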

Tags:

Joshua Kraitberg (jkraitbe-wr) on 2022-08-29

Changed in starlingx:
assignee:	nobody → Joshua Kraitberg (jkraitbe-wr)

OpenStack Infra (hudson-openstack) wrote on 2022-08-29: Fix proposed to ansible-playbooks (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/ansible-playbooks/+/855037

Changed in starlingx:
status:	New → In Progress

OpenStack Infra (hudson-openstack) wrote on 2022-08-30: Fix merged to ansible-playbooks (master)

Reviewed: https://review.opendev.org/c/starlingx/ansible-playbooks/+/855037
Committed: https://opendev.org/starlingx/ansible-playbooks/commit/3a5a56cf82e0251e6c34461de4d92a9568552bc4
Submitter: "Zuul (22348)"
Branch: master

commit 3a5a56cf82e0251e6c34461de4d92a9568552bc4
Author: Joshua Kraitberg <email address hidden>
Date: Mon Aug 29 10:00:09 2022 -0400

Force delete ic-nginx-ingress-controller during restore

    During restore, etcd will be in a confused state. If multiple nodes
    were configured during backup, pods assigned to those other nodes,
    eg. controller-1, will only be removable with '--force' flag.

    To preview this, run 'kubectl get pods --all-namespaces'.
    This will show that pods are running on other nodes,
    despite the nodes not being installed yet.

    TEST PLAN
    Backup & Restore on virtual AIO-DX (Debian)

    Closes-Bug: 1988056
    Signed-off-by: Joshua Kraitberg <email address hidden>
    Change-Id: Ibeb1592d83d4612c52ad69cda77d326fa19b7d12

Changed in starlingx:
status:	In Progress → Fix Released

Joshua Kraitberg (jkraitbe-wr) wrote on 2022-08-30:

The previous fix introduced a race condition in a subsequent step. Depending on how the pods are restarted after being deleted with '--force', the system state may not be registered as ready.

Changed in starlingx:
status:	Fix Released → In Progress

OpenStack Infra (hudson-openstack) wrote on 2022-08-30: Fix proposed to ansible-playbooks (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/ansible-playbooks/+/855274

OpenStack Infra (hudson-openstack) wrote on 2022-08-31: Fix proposed to ansible-playbooks (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/ansible-playbooks/+/855392

OpenStack Infra (hudson-openstack) wrote on 2022-08-31: Change abandoned on ansible-playbooks (master)

Change abandoned by "Joshua Kraitberg <email address hidden>" on branch: master
Review: https://review.opendev.org/c/starlingx/ansible-playbooks/+/855392

OpenStack Infra (hudson-openstack) wrote on 2022-09-01: Fix merged to ansible-playbooks (master)

Reviewed: https://review.opendev.org/c/starlingx/ansible-playbooks/+/855274
Committed: https://opendev.org/starlingx/ansible-playbooks/commit/ac6a57bd584cf03a47b6cd6a5211d5252c050988
Submitter: "Zuul (22348)"
Branch: master

commit ac6a57bd584cf03a47b6cd6a5211d5252c050988
Author: Joshua Kraitberg <email address hidden>
Date: Tue Aug 30 17:47:24 2022 +0000

Wait for ic-nginx-ingress-controller only on controller-0

    There is an intermittent failure when waiting for these pods to be
    restarted during a restore. Depending on how the pods are
    restarted by kubernetes, the wait command may or may not return a failure.

    TEST PLAN
    Deploy virtual AIO-DX (Debian)
    Backup virtual AIO-DX, then
    Repeat several times with clean VMs: Restore virtual AIO-DX (Debian)

    Closes-Bug: 1988056
    Signed-off-by: Joshua Kraitberg <email address hidden>
    Change-Id: Ibb98b6ab0d0de58b29974c9238abf3dcd89d4811
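
The idea of waiting only on controller-0 can be sketched as follows. This is an illustration under stated assumptions, not the code from the commit: it assumes pod data from 'kubectl get pods -n kube-system -o json', takes the pod name prefix from the output in the bug description, and uses hypothetical function names. Stale pods still bound to controller-1 are ignored, so they cannot make the wait fail.

```python
import json
import time


def ingress_pods_ready(pods_json, node, name_prefix):
    """True if at least one matching pod is bound to node and all such pods are Ready."""
    def is_ready(pod):
        conditions = pod.get("status", {}).get("conditions", [])
        return any(
            c.get("type") == "Ready" and c.get("status") == "True"
            for c in conditions
        )

    matching = [
        p for p in json.loads(pods_json)["items"]
        if p["metadata"]["name"].startswith(name_prefix)
        and p.get("spec", {}).get("nodeName") == node
    ]
    return bool(matching) and all(is_ready(p) for p in matching)


def wait_until(check, timeout_s=60.0, interval_s=1.0,
               sleep=time.sleep, clock=time.monotonic):
    """Poll check() until it returns True or timeout_s elapses."""
    deadline = clock() + timeout_s
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

A caller would pass wait_until() a function that fetches fresh pod JSON and evaluates ingress_pods_ready(..., "controller-0", "ic-nginx-ingress-ingress-nginx-controller").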

Changed in starlingx:
status:	In Progress → Fix Released

Ghada Khalil (gkhalil) on 2022-09-10

Changed in starlingx:
importance:	Undecided → Low
tags:	added: stx.8.0 stx.update
