Backup & Restore: Restore fails with error launching Armada with Helm v3

Bug #1978899 reported by João Pedro Alexandroni Cordova de Sousa
This bug affects 1 person

Affects:     StarlingX
Status:      Fix Released
Importance:  Medium
Assigned to: João Pedro Alexandroni Cordova de Sousa

Bug Description

Brief Description
-----------------
Restore fails when executing the Ansible restore playbook on a Storage system configured with IPv4.

Severity
--------
Critical: System/Feature is not usable after the defect

Steps to Reproduce
------------------
- Install a duplex StarlingX system on IPv4
- Run the backup Ansible playbook from controller-0
- Install a clean StarlingX image on the system with wipedisk=false
- Run the restore Ansible playbook with the backup file saved above (illustrative commands below)
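
For reference, a typical invocation of these playbooks looks roughly like the commands below; the playbook paths and extra-vars are the usual StarlingX ones and may differ per release and lab setup:

    # Create the platform backup on the running system (paths and variables are typical, adjust as needed)
    ansible-playbook /usr/share/ansible/stx-ansible/playbooks/backup.yml \
        -e "ansible_become_pass=<sysadmin password>" \
        -e "admin_password=<admin password>"

    # After reinstalling with wipedisk=false, restore the platform from the saved archive
    ansible-playbook /usr/share/ansible/stx-ansible/playbooks/restore_platform.yml \
        -e "initial_backup_dir=/home/sysadmin" \
        -e "backup_filename=<platform backup tarball>" \
        -e "ansible_become_pass=<sysadmin password>" \
        -e "admin_password=<admin password>"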

Expected Behavior
------------------
The Ansible restore playbook completes and controller-0 and controller-1 unlock successfully.

Actual Behavior
----------------
Ansible restore playbook fails.

Reproducibility
---------------
Reproducible 2/2

System Configuration
--------------------
AIO-SX
AIO-DX
Storage

Last Pass
---------
This test had not been run on Storage systems for a long time.

On duplex systems:
SW_VERSION="22.02"
BUILD_TARGET="Host Installer"
BUILD_TYPE="Formal"
BUILD_ID="2022-04-05_20-00-06"
SRC_BUILD_ID="1227"

On AIO-SX:
SW_VERSION="22.02"
BUILD_TARGET="Host Installer"
BUILD_TYPE="Formal"
BUILD_ID="2022-04-03_20-00-06"
SRC_BUILD_ID="1225"

Timestamp/Logs
--------------
E TASK [common/armada-helm : Launch Armada with Helm v3] ***************************************************************************************************************************************************************************************************
E Wednesday 04 May 2022 11:52:01 +0000 (0:00:01.308) 0:19:47.823 *********
E fatal: [localhost]: FAILED! => changed=true 
E  cmd:
E  - /sbin/helm
E  - upgrade
E  - --install
E  - armada
E  - stx-platform/armada
E  - --namespace
E  - armada
E  - --values
E  - /tmp/armada-overrides.yaml
E  - --debug
E  delta: '0:00:00.480501'
E  end: '2022-05-04 11:52:02.174253'
E  msg: non-zero return code
E  rc: 1
E  start: '2022-05-04 11:52:01.693752'
E  stderr: |-
E  history.go:52: [debug] getting history for release armada
E  install.go:159: [debug] Original chart version: ""
E  install.go:176: [debug] CHART PATH: /home/sysadmin/.cache/helm/repository/armada-0.1.0.tgz
E  
E  client.go:108: [debug] creating 10 resource(s)
E  Error: Internal error occurred: failed calling webhook "validate.nginx.ingress.kubernetes.io": Post "https://ic-nginx-ingress-ingress-nginx-controller-admission.kube-system.svc:443/networking/v1beta1/ingresses?timeout=10s": dial tcp 10.102.163.201:443: connect: connection refused
E  helm.go:84: [debug] Internal error occurred: failed calling webhook "validate.nginx.ingress.kubernetes.io": Post "https://ic-nginx-ingress-ingress-nginx-controller-admission.kube-system.svc:443/networking/v1beta1/ingresses?timeout=10s": dial tcp 10.102.163.201:443: connect: connection refused
E  stderr_lines:
E  - 'history.go:52: [debug] getting history for release armada'
E  - 'install.go:159: [debug] Original chart version: ""'
E  - 'install.go:176: [debug] CHART PATH: /home/sysadmin/.cache/helm/repository/armada-0.1.0.tgz'
E  - ''
E  - 'client.go:108: [debug] creating 10 resource(s)'
E  - 'Error: Internal error occurred: failed calling webhook "validate.nginx.ingress.kubernetes.io": Post "https://ic-nginx-ingress-ingress-nginx-controller-admission.kube-system.svc:443/networking/v1beta1/ingresses?timeout=10s": dial tcp 10.102.163.201:443: connect: connection refused'
E  - 'helm.go:84: [debug] Internal error occurred: failed calling webhook "validate.nginx.ingress.kubernetes.io": Post "https://ic-nginx-ingress-ingress-nginx-controller-admission.kube-system.svc:443/networking/v1beta1/ingresses?timeout=10s": dial tcp 10.102.163.201:443: connect: connection refused'
E  stdout: Release "armada" does not exist. Installing it now.
E  stdout_lines: <omitted>
E
E PLAY RECAP
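
The error above shows helm's ingress creation being rejected because the nginx ingress admission webhook (service ic-nginx-ingress-ingress-nginx-controller-admission in kube-system, per the message) is not reachable yet. A rough way to confirm that by hand is to check whether the webhook service has endpoints and whether the controller pod is actually up; the label selector is an assumption based on the usual ingress-nginx chart labels:

    # Does the admission webhook service have any endpoints? (service name taken from the error above)
    kubectl -n kube-system get endpoints ic-nginx-ingress-ingress-nginx-controller-admission

    # Is the ingress controller pod actually running and ready? (label selector is an assumption)
    kubectl -n kube-system get pods -l app.kubernetes.io/name=ingress-nginx -o wide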

Test Activity
-------------
Regression Testing

Workaround
----------
For VMs: exit and reconnect to the VM. Kubernetes becomes accessible again, and the restore can be completed after deleting /etc/platform.restore*.
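
A rough sketch of that workaround, assuming the marker path quoted above and that the restore playbook is simply re-run with the same arguments as before (both may differ on a given setup):

    # After reconnecting to the VM, confirm the Kubernetes API answers again
    kubectl get nodes

    # Remove the restore marker(s) mentioned above, then re-run the restore playbook
    # with the same arguments used previously
    sudo rm -f /etc/platform.restore*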

Changed in starlingx:
assignee: nobody → João Pedro Alexandroni Cordova de Sousa (alexandroni)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to ansible-playbooks (master)
Changed in starlingx:
status: New → In Progress
description: updated
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on ansible-playbooks (master)

Change abandoned by "João Pedro Alexandroni Cordova de Sousa <email address hidden>" on branch: master
Review: https://review.opendev.org/c/starlingx/ansible-playbooks/+/846080
Reason: the new solution is on the review 846471

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to ansible-playbooks (master)

Reviewed: https://review.opendev.org/c/starlingx/ansible-playbooks/+/846471
Committed: https://opendev.org/starlingx/ansible-playbooks/commit/77d3f04ee4ab8236313b08aa6f0cca8864629382
Submitter: "Zuul (22348)"
Branch: master

commit 77d3f04ee4ab8236313b08aa6f0cca8864629382
Author: Heitor Matsui <email address hidden>
Date: Sat Jun 18 14:43:04 2022 -0300

    Wait for nginx pod and service on restore

    While trying to install armada on a restore scenario, helm
    may return a connection refused error because the ingress
    resource creation triggers a webhook that may not be ready at
    that moment. The webhook is related to nginx pod and services.

    This commit adds a step to check if nginx pod and service are
    ready and, if not, wait up to 60s for them to be up and ready.

    Test Plan
    PASS: run AIO-SX backup and restore successfully
    PASS: run AIO-SX fresh install, bootstrap and unlock successfully
    PASS: run AIO-SX upgrade successfully

    Closes-bug: 1978899
    Change-Id: Iffbfd1b06b61b057fd5b6b9cfcebc1f4fab6cb36
    Signed-off-by: Heitor Matsui <email address hidden>
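
The merged change implements that wait as an Ansible task; done by hand, the check it performs is roughly the following (the label selector and exact commands are assumptions, the real logic is in the ansible-playbooks review linked above):

    # Wait up to 60s for the ingress-nginx controller pod in kube-system to report Ready
    kubectl -n kube-system wait --for=condition=ready pod \
        -l app.kubernetes.io/name=ingress-nginx --timeout=60s

    # Also confirm the nginx services exist before launching Armada
    kubectl -n kube-system get svc | grep ic-nginx-ingress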

Changed in starlingx:
status: In Progress → Fix Released
Revision history for this message
Heitor Matsui (heitormatsui) wrote :

Moving manually to Fix Released because https://review.opendev.org/c/starlingx/ansible-playbooks/+/846471 merged on 21/jun/2022

Ghada Khalil (gkhalil)
Changed in starlingx:
importance: Undecided → Medium
tags: added: stx.7.0 stx.update
Revision history for this message
Thiago Paiva Brito (outbrito) wrote (last edit):

Reopening this bug since I'm seeing this problem repeatedly on a VBox setup. From my investigation, once k8s comes up after the etcd restore, there is a span of time (around 20s) during which it takes the old configuration from the backup as the truth and reports that the ic-nginx-ingress-ingress-nginx-controller-XXXX pod is "Ready", but it is not... in several instances during my tests, the pod was restarted 3-10 seconds after the task failed (checked the timestamp of the failure against the first log line of the new pod's container).

Happened 7/10 times on VBox AIO-SX.
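
A rough way to repeat that timing comparison (the pod name is whatever the current controller pod is; the jsonpath expression is just one illustrative way to read the container start time):

    # When the current ingress controller container actually started
    kubectl -n kube-system get pod <controller pod name> \
        -o jsonpath='{.status.containerStatuses[0].state.running.startedAt}'

    # First log lines of the new container, to compare against the failed task's timestamp
    kubectl -n kube-system logs <controller pod name> | head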

tags: added: stx.8
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to ansible-playbooks (master)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to ansible-playbooks (master)

Reviewed: https://review.opendev.org/c/starlingx/ansible-playbooks/+/852677
Committed: https://opendev.org/starlingx/ansible-playbooks/commit/c2e5db4305bca4f39a3391afd136b46216cb7d3f
Submitter: "Zuul (22348)"
Branch: master

commit c2e5db4305bca4f39a3391afd136b46216cb7d3f
Author: Thiago Brito <email address hidden>
Date: Tue Aug 9 18:34:43 2022 -0300

    Deleting ic-nginx-ingress-controller at restore

    Once k8s comes up after the etcd restore, there is a span of time
    (around 20s) that the pod states have not been updated and are reported
    as they were at the point in time where the backup was taken. This
    returns that the ic-nginx-ingress-ingress-nginx-controller-XXX pod is
    "Ready", but it is not... in several instances during my tests, the pod
    was restarted 3-10 seconds after the task "Launch Armada with Helm v3"
    failed due to not being able to call the webhook. The proposed solution
    is to delete the pod preemptively and wait for it to be recreated and
    "Ready".

    TEST PLAN
    PASS restore on virtual AIO-SX (CentOS)

    Closes-Bug: #1978899
    Signed-off-by: Thiago Brito <email address hidden>
    Change-Id: I20bec1fbbf809bfcf5d515ef55c6d47ab968dbf3
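
Done by hand, that change corresponds roughly to the following; the actual logic is an Ansible task in the linked review, and the label selector and timeout here are assumptions:

    # Preemptively delete the ingress controller pod so its reported state cannot be
    # stale data carried over from the backup
    kubectl -n kube-system delete pod -l app.kubernetes.io/name=ingress-nginx

    # Wait for the replacement pod to be recreated and actually become Ready
    # (if it has not been scheduled yet, the wait may need to be retried until it appears)
    kubectl -n kube-system wait --for=condition=ready pod \
        -l app.kubernetes.io/name=ingress-nginx --timeout=120s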
