Backup & Restore: Restore fails with error launching Armada with Helm v3

Bug #1978899 reported by João Pedro Alexandroni Cordova de Sousa
This bug affects 1 person

Affects:     StarlingX
Status:      Fix Released
Importance:  Medium
Assigned to: João Pedro Alexandroni Cordova de Sousa

Bug Description

Brief Description
-----------------
Restore fails when executing the Ansible restore playbook on a Storage system configured with IPv4.

Severity
--------
Critical: System/Feature is not usable after the defect

Steps to Reproduce
------------------
- Install a duplex StarlingX system on IPv4
- Run the backup Ansible playbook from controller-0
- Install a clean StarlingX image on the system with wipedisk=false
- Run the restore Ansible playbook with the backup file saved above (illustrative commands below)
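
For reference, a typical invocation of these playbooks looks roughly like the commands below; the playbook paths and extra-vars are the usual StarlingX ones and may differ per release and lab setup:

    # Create the platform backup on the running system (paths and variables are typical, adjust as needed)
    ansible-playbook /usr/share/ansible/stx-ansible/playbooks/backup.yml \
        -e "ansible_become_pass=<sysadmin password>" \
        -e "admin_password=<admin password>"

    # After reinstalling with wipedisk=false, restore the platform from the saved archive
    ansible-playbook /usr/share/ansible/stx-ansible/playbooks/restore_platform.yml \
        -e "initial_backup_dir=/home/sysadmin" \
        -e "backup_filename=<platform backup tarball>" \
        -e "ansible_become_pass=<sysadmin password>" \
        -e "admin_password=<admin password>"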

Expected Behavior
------------------
The Ansible restore playbook completes and controller-0 and controller-1 unlock successfully.

Actual Behavior
----------------
Ansible restore playbook fails.

Reproducibility
---------------
Reproducible 2/2

System Configuration
--------------------
AIO-SX
AIO-DX
Storage

Last Pass
---------
This test had not been run on Storage systems for a long time.

On duplex systems:
SW_VERSION="22.02"
BUILD_TARGET="Host Installer"
BUILD_TYPE="Formal"
BUILD_ID="2022-04-05_20-00-06"
SRC_BUILD_ID="1227"

On AIO-SX:
SW_VERSION="22.02"
BUILD_TARGET="Host Installer"
BUILD_TYPE="Formal"
BUILD_ID="2022-04-03_20-00-06"
SRC_BUILD_ID="1225"

Timestamp/Logs
--------------
E TASK [common/armada-helm : Launch Armada with Helm v3] ***************************************************************************************************************************************************************************************************
E Wednesday 04 May 2022 11:52:01 +0000 (0:00:01.308) 0:19:47.823 *********
E fatal: [localhost]: FAILED! => changed=true 
E  cmd:
E  - /sbin/helm
E  - upgrade
E  - --install
E  - armada
E  - stx-platform/armada
E  - --namespace
E  - armada
E  - --values
E  - /tmp/armada-overrides.yaml
E  - --debug
E  delta: '0:00:00.480501'
E  end: '2022-05-04 11:52:02.174253'
E  msg: non-zero return code
E  rc: 1
E  start: '2022-05-04 11:52:01.693752'
E  stderr: |-
E  history.go:52: [debug] getting history for release armada
E  install.go:159: [debug] Original chart version: ""
E  install.go:176: [debug] CHART PATH: /home/sysadmin/.cache/helm/repository/armada-0.1.0.tgz
E  
E  client.go:108: [debug] creating 10 resource(s)
E  Error: Internal error occurred: failed calling webhook "validate.nginx.ingress.kubernetes.io": Post "https://ic-nginx-ingress-ingress-nginx-controller-admission.kube-system.svc:443/networking/v1beta1/ingresses?timeout=10s": dial tcp 10.102.163.201:443: connect: connection refused
E  helm.go:84: [debug] Internal error occurred: failed calling webhook "validate.nginx.ingress.kubernetes.io": Post "https://ic-nginx-ingress-ingress-nginx-controller-admission.kube-system.svc:443/networking/v1beta1/ingresses?timeout=10s": dial tcp 10.102.163.201:443: connect: connection refused
E  stderr_lines:
E  - 'history.go:52: [debug] getting history for release armada'
E  - 'install.go:159: [debug] Original chart version: ""'
E  - 'install.go:176: [debug] CHART PATH: /home/sysadmin/.cache/helm/repository/armada-0.1.0.tgz'
E  - ''
E  - 'client.go:108: [debug] creating 10 resource(s)'
E  - 'Error: Internal error occurred: failed calling webhook "validate.nginx.ingress.kubernetes.io": Post "https://ic-nginx-ingress-ingress-nginx-controller-admission.kube-system.svc:443/networking/v1beta1/ingresses?timeout=10s": dial tcp 10.102.163.201:443: connect: connection refused'
E  - 'helm.go:84: [debug] Internal error occurred: failed calling webhook "validate.nginx.ingress.kubernetes.io": Post "https://ic-nginx-ingress-ingress-nginx-controller-admission.kube-system.svc:443/networking/v1beta1/ingresses?timeout=10s": dial tcp 10.102.163.201:443: connect: connection refused'
E  stdout: Release "armada" does not exist. Installing it now.
E  stdout_lines: <omitted>
E
E PLAY RECAP
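
The error above shows helm's ingress creation being rejected because the nginx ingress admission webhook (service ic-nginx-ingress-ingress-nginx-controller-admission in kube-system, per the message) is not reachable yet. A rough way to confirm that by hand is to check whether the webhook service has endpoints and whether the controller pod is actually up; the label selector is an assumption based on the usual ingress-nginx chart labels:

    # Does the admission webhook service have any endpoints? (service name taken from the error above)
    kubectl -n kube-system get endpoints ic-nginx-ingress-ingress-nginx-controller-admission

    # Is the ingress controller pod actually running and ready? (label selector is an assumption)
    kubectl -n kube-system get pods -l app.kubernetes.io/name=ingress-nginx -o wide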

Test Activity
-------------
Regression Testing

Workaround
----------
For VMs: exit and reconnect to the VM. Kubernetes becomes accessible again, and the restore can be completed after deleting /etc/platform.restore*.
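
A rough sketch of that workaround, assuming the marker path quoted above and that the restore playbook is simply re-run with the same arguments as before (both may differ on a given setup):

    # After reconnecting to the VM, confirm the Kubernetes API answers again
    kubectl get nodes

    # Remove the restore marker(s) mentioned above, then re-run the restore playbook
    # with the same arguments used previously
    sudo rm -f /etc/platform.restore*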

Changed in starlingx:
assignee: nobody → João Pedro Alexandroni Cordova de Sousa (alexandroni)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to ansible-playbooks (master)
Changed in starlingx:
status: New → In Progress
description: updated
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on ansible-playbooks (master)

Change abandoned by "João Pedro Alexandroni Cordova de Sousa <email address hidden>" on branch: master
Review: https://review.opendev.org/c/starlingx/ansible-playbooks/+/846080
Reason: the new solution is on the review 846471

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to ansible-playbooks (master)

Reviewed: https://review.opendev.org/c/starlingx/ansible-playbooks/+/846471
Committed: https://opendev.org/starlingx/ansible-playbooks/commit/77d3f04ee4ab8236313b08aa6f0cca8864629382
Submitter: "Zuul (22348)"
Branch: master

commit 77d3f04ee4ab8236313b08aa6f0cca8864629382
Author: Heitor Matsui <email address hidden>
Date: Sat Jun 18 14:43:04 2022 -0300

    Wait for nginx pod and service on restore

    While trying to install armada on a restore scenario, helm
    may return a connection refused error because the ingress
    resource creation triggers a webhook that may not be ready at
    that moment. The webhook is related to nginx pod and services.

    This commit adds a step to check if nginx pod and service are
    ready and, if not, wait up to 60s for them to be up and ready.

    Test Plan
    PASS: run AIO-SX backup and restore successfully
    PASS: run AIO-SX fresh install, bootstrap and unlock successfully
    PASS: run AIO-SX upgrade successfully

    Closes-bug: 1978899
    Change-Id: Iffbfd1b06b61b057fd5b6b9cfcebc1f4fab6cb36
    Signed-off-by: Heitor Matsui <email address hidden>
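
The merged change implements that wait as an Ansible task; done by hand, the check it performs is roughly the following (the label selector and exact commands are assumptions, the real logic is in the ansible-playbooks review linked above):

    # Wait up to 60s for the ingress-nginx controller pod in kube-system to report Ready
    kubectl -n kube-system wait --for=condition=ready pod \
        -l app.kubernetes.io/name=ingress-nginx --timeout=60s

    # Also confirm the nginx services exist before launching Armada
    kubectl -n kube-system get svc | grep ic-nginx-ingress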

Changed in starlingx:
status: In Progress → Fix Released
Revision history for this message
Heitor Matsui (heitormatsui) wrote :

Moving manually to Fix Released because https://review.opendev.org/c/starlingx/ansible-playbooks/+/846471 merged on 21/jun/2022

Ghada Khalil (gkhalil)
Changed in starlingx:
importance: Undecided → Medium
tags: added: stx.7.0 stx.update
Revision history for this message
Thiago Paiva Brito (outbrito) wrote (last edit):

Reopening this bug since I'm seeing this problem repeatedly on a VBox setup. From my investigation, once k8s comes up after the etcd restore, there is a span of time (around 20s) during which it takes the old configuration from the backup as the truth and reports that the ic-nginx-ingress-ingress-nginx-controller-XXXX pod is "Ready", but it is not... in several instances during my tests, the pod was restarted 3-10 seconds after the task failed (checked the timestamp of the failure against the first log line of the new pod's container).

Happened 7/10 times on VBox AIO-SX.
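
A rough way to repeat that timing comparison (the pod name is whatever the current controller pod is; the jsonpath expression is just one illustrative way to read the container start time):

    # When the current ingress controller container actually started
    kubectl -n kube-system get pod <controller pod name> \
        -o jsonpath='{.status.containerStatuses[0].state.running.startedAt}'

    # First log lines of the new container, to compare against the failed task's timestamp
    kubectl -n kube-system logs <controller pod name> | head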

tags: added: stx.8
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to ansible-playbooks (master)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to ansible-playbooks (master)

Reviewed: https://review.opendev.org/c/starlingx/ansible-playbooks/+/852677
Committed: https://opendev.org/starlingx/ansible-playbooks/commit/c2e5db4305bca4f39a3391afd136b46216cb7d3f
Submitter: "Zuul (22348)"
Branch: master

commit c2e5db4305bca4f39a3391afd136b46216cb7d3f
Author: Thiago Brito <email address hidden>
Date: Tue Aug 9 18:34:43 2022 -0300

    Deleting ic-nginx-ingress-controller at restore

    Once k8s comes up after the etcd restore, there is a span of time
    (around 20s) that the pod states have not been updated and are reported
    as they were at the point in time where the backup was taken. This
    returns that the ic-nginx-ingress-ingress-nginx-controller-XXX pod is
    "Ready", but it is not... in several instances during my tests, the pod
    was restarted 3-10 seconds after the task "Launch Armada with Helm v3"
    failed due to not being able to call the webhook. The proposed solution
    is to delete the pod preemptively and wait for it to be recreated and
    "Ready".

    TEST PLAN
    PASS restore on virtual AIO-SX (CentOS)

    Closes-Bug: #1978899
    Signed-off-by: Thiago Brito <email address hidden>
    Change-Id: I20bec1fbbf809bfcf5d515ef55c6d47ab968dbf3
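
Done by hand, that change corresponds roughly to the following; the actual logic is an Ansible task in the linked review, and the label selector and timeout here are assumptions:

    # Preemptively delete the ingress controller pod so its reported state cannot be
    # stale data carried over from the backup
    kubectl -n kube-system delete pod -l app.kubernetes.io/name=ingress-nginx

    # Wait for the replacement pod to be recreated and actually become Ready
    # (if it has not been scheduled yet, the wait may need to be retried until it appears)
    kubectl -n kube-system wait --for=condition=ready pod \
        -l app.kubernetes.io/name=ingress-nginx --timeout=120s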
