Backup & Restore: DX system - Ansible Bootstrap failed executing restore command when backup is run from controller-1

Bug #1955162 reported by Mihnea Saracin
Affects: StarlingX
Status: Fix Released
Importance: Medium
Assigned to: Mihnea Saracin

Bug Description

Brief Description
-----------------

Restore failed while bringing up essential services during bootstrap. The backup was taken from controller-1.

Severity
--------

Major: System/Feature is usable but degraded

Steps to Reproduce
------------------

Install a duplex system with stx master.
Swact from controller-0 to controller-1.
Run the backup Ansible playbook.
Install a clean stx image on the system.
Run the restore Ansible playbook with the backup file saved above.
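For reference, the backup and restore steps above are normally driven with the stock StarlingX playbooks. A sketch of the invocations (playbook paths and extra-vars follow the usual StarlingX B&R procedure and may differ by release; the passwords and tarball name are placeholders):

```shell
# On the active controller (controller-1 after the swact), take the backup:
ansible-playbook /usr/share/ansible/stx-ansible/playbooks/backup.yml \
    -e "ansible_become_pass=<sysadmin-password> admin_password=<admin-password>"

# After reinstalling a clean stx image, run the restore with the saved tarball:
ansible-playbook /usr/share/ansible/stx-ansible/playbooks/restore_platform.yml \
    -e "initial_backup_dir=/home/sysadmin backup_filename=<backup-tarball>.tgz \
        ansible_become_pass=<sysadmin-password> admin_password=<admin-password>"
```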
Expected Behavior
-----------------

The restore Ansible playbook runs successfully.

Actual Behavior
----------------

The restore Ansible playbook fails to run successfully.

Reproducibility
---------------

Reproducible

System Configuration
--------------------

Duplex system

Default configuration.

Branch/Pull Time/Commit
--------------------

SW_VERSION="21.12"
BUILD_TARGET="Host Installer"
BUILD_TYPE="Formal"
BUILD_ID="2021-10-21_00-00-06"

Last Pass
--------------------

This is the first time this test case has been run for the latest release.

Previously verified in a DC lab using the 2021-05-22_23-32-17 build:
backup taken from controller-1, restored on controller-0.

Timestamp/Logs
--------------------

TASK [bootstrap/bringup-essential-services : Get wait tasks results] ***************************************************************************
changed: [localhost] => (item={'_ansible_parsed': True, '_ansible_item_result': True, '_ansible_item_label': u'k8s-app=kube-proxy', u'ansible_job_id': u'443674968421.142067', 'failed': False, u'started': 1, 'changed': True, 'item': u'k8s-app=kube-proxy', u'finished': 0, u'results_file': u'/root/.ansible_async/443674968421.142067', '_ansible_ignore_errors': None, '_ansible_no_log': False})
changed: [localhost] => (item={'_ansible_parsed': True, '_ansible_item_result': True, '_ansible_item_label': u'app=multus', u'ansible_job_id': u'464206363239.142123', 'failed': False, u'started': 1, 'changed': True, 'item': u'app=multus', u'finished': 0, u'results_file': u'/root/.ansible_async/464206363239.142123', '_ansible_ignore_errors': None, '_ansible_no_log': False})
changed: [localhost] => (item={'_ansible_parsed': True, '_ansible_item_result': True, '_ansible_item_label': u'app=sriov-cni', u'ansible_job_id': u'941260772048.142247', 'failed': False, u'started': 1, 'changed': True, 'item': u'app=sriov-cni', u'finished': 0, u'results_file': u'/root/.ansible_async/941260772048.142247', '_ansible_ignore_errors': None, '_ansible_no_log': False})
changed: [localhost] => (item={'_ansible_parsed': True, '_ansible_item_result': True, '_ansible_item_label': u'component=kube-apiserver', u'ansible_job_id': u'887621538031.142316', 'failed': False, u'started': 1, 'changed': True, 'item': u'component=kube-apiserver', u'finished': 0, u'results_file': u'/root/.ansible_async/887621538031.142316', '_ansible_ignore_errors': None, '_ansible_no_log': False})
changed: [localhost] => (item={'_ansible_parsed': True, '_ansible_item_result': True, '_ansible_item_label': u'component=kube-controller-manager', u'ansible_job_id': u'27963140421.142391', 'failed': False, u'started': 1, 'changed': True, 'item': u'component=kube-controller-manager', u'finished': 0, u'results_file': u'/root/.ansible_async/27963140421.142391', '_ansible_ignore_errors': None, '_ansible_no_log': False})
changed: [localhost] => (item={'_ansible_parsed': True, '_ansible_item_result': True, '_ansible_item_label': u'component=kube-scheduler', u'ansible_job_id': u'685428121851.142449', 'failed': False, u'started': 1, 'changed': True, 'item': u'component=kube-scheduler', u'finished': 0, u'results_file': u'/root/.ansible_async/685428121851.142449', '_ansible_ignore_errors': None, '_ansible_no_log': False})
FAILED - RETRYING: Get wait tasks results (40 retries left).
FAILED - RETRYING: Get wait tasks results (39 retries left).
FAILED - RETRYING: Get wait tasks results (38 retries left).
FAILED - RETRYING: Get wait tasks results (37 retries left).
FAILED - RETRYING: Get wait tasks results (36 retries left).
FAILED - RETRYING: Get wait tasks results (35 retries left).
FAILED - RETRYING: Get wait tasks results (34 retries left).
FAILED - RETRYING: Get wait tasks results (33 retries left).
FAILED - RETRYING: Get wait tasks results (32 retries left).
FAILED - RETRYING: Get wait tasks results (31 retries left).
FAILED - RETRYING: Get wait tasks results (30 retries left).
FAILED - RETRYING: Get wait tasks results (29 retries left).
FAILED - RETRYING: Get wait tasks results (28 retries left).
FAILED - RETRYING: Get wait tasks results (27 retries left).
FAILED - RETRYING: Get wait tasks results (26 retries left).
FAILED - RETRYING: Get wait tasks results (25 retries left).
FAILED - RETRYING: Get wait tasks results (24 retries left).
FAILED - RETRYING: Get wait tasks results (23 retries left).
FAILED - RETRYING: Get wait tasks results (22 retries left).
changed: [localhost] => (item={'_ansible_parsed': True, '_ansible_item_result': True, '_ansible_item_label': {u'namespace': u'kube-system', u'deployment': u'calico-kube-controllers'}, u'ansible_job_id': u'947219856717.142512', 'failed': False, u'started': 1, 'changed': True, 'item': {u'namespace': u'kube-system', u'deployment': u'calico-kube-controllers'}, u'finished': 0, u'results_file': u'/root/.ansible_async/947219856717.142512', '_ansible_ignore_errors': None, '_ansible_no_log': False})
changed: [localhost] => (item={'_ansible_parsed': True, '_ansible_item_result': True, '_ansible_item_label': {u'namespace': u'flux-helm', u'deployment': u'helm-controller'}, u'ansible_job_id': u'124310910849.142684', 'failed': False, u'started': 1, 'changed': True, 'item': {u'namespace': u'flux-helm', u'deployment': u'helm-controller'}, u'finished': 0, u'results_file': u'/root/.ansible_async/124310910849.142684', '_ansible_ignore_errors': None, '_ansible_no_log': False})
changed: [localhost] => (item={'_ansible_parsed': True, '_ansible_item_result': True, '_ansible_item_label': {u'namespace': u'flux-helm', u'deployment': u'source-controller'}, u'ansible_job_id': u'233493705004.142800', 'failed': False, u'started': 1, 'changed': True, 'item': {u'namespace': u'flux-helm', u'deployment': u'source-controller'}, u'finished': 0, u'results_file': u'/root/.ansible_async/233493705004.142800', '_ansible_ignore_errors': None, '_ansible_no_log': False})
changed: [localhost] => (item={'_ansible_parsed': True, '_ansible_item_result': True, '_ansible_item_label': {u'namespace': u'armada', u'deployment': u'armada-api'}, u'ansible_job_id': u'998341509452.142878', 'failed': False, u'started': 1, 'changed': True, 'item': {u'namespace': u'armada', u'deployment': u'armada-api'}, u'finished': 0, u'results_file': u'/root/.ansible_async/998341509452.142878', '_ansible_ignore_errors': None, '_ansible_no_log': False})
FAILED - RETRYING: Get wait tasks results (40 retries left).
changed: [localhost] => (item={'_ansible_parsed': True, '_ansible_item_result': True, '_ansible_item_label': {u'namespace': u'kube-system', u'deployment': u'coredns'}, u'ansible_job_id': u'915942839727.142954', 'failed': False, u'started': 1, 'changed': True, 'item': {u'namespace': u'kube-system', u'deployment': u'coredns'}, u'finished': 0, u'results_file': u'/root/.ansible_async/915942839727.142954', '_ansible_ignore_errors': None, '_ansible_no_log': False})

TASK [bootstrap/bringup-essential-services : Fail if any of the Kubernetes component, Networking or Armada pods are not ready by this time] ****
failed: [localhost] (item={'_ansible_parsed': True, 'stderr_lines': [u'error: timed out waiting for the condition on deployments/calico-kube-controllers'], u'changed': True, u'stderr': u'error: timed out waiting for the condition on deployments/calico-kube-controllers', u'ansible_job_id': u'947219856717.142512', u'stdout': u'', '_ansible_item_result': True, u'invocation': {u'module_args': {u'creates': None, u'executable': None, u'_uses_shell': False, u'_raw_params': u'kubectl --kubeconfig=/etc/kubernetes/admin.conf wait --namespace=kube-system --for=condition=Available deployment calico-kube-controllers --timeout=120s', u'removes': None, u'argv': None, u'warn': True, u'chdir': None, u'stdin': None}}, 'attempts': 20, u'delta': u'0:02:00.080321', 'stdout_lines': [], 'failed_when_result': False, '_ansible_no_log': False, u'end': u'2021-10-27 13:54:44.320076', '_ansible_item_label': {'_ansible_parsed': True, '_ansible_item_result': True, '_ansible_item_label': {u'namespace': u'kube-system', u'deployment': u'calico-kube-controllers'}, u'ansible_job_id': u'947219856717.142512', 'item': {u'namespace': u'kube-system', u'deployment': u'calico-kube-controllers'}, u'started': 1, 'changed': True, 'failed': False, u'finished': 0, u'results_file': u'/root/.ansible_async/947219856717.142512', '_ansible_ignore_errors': None, '_ansible_no_log': False}, u'start': u'2021-10-27 13:52:44.239755', u'cmd': [u'kubectl', u'--kubeconfig=/etc/kubernetes/admin.conf', u'wait', u'--namespace=kube-system', u'--for=condition=Available', u'deployment', u'calico-kube-controllers', u'--timeout=120s'], u'finished': 1, u'failed': False, 'item': {'_ansible_parsed': True, '_ansible_item_result': True, '_ansible_no_log': False, u'ansible_job_id': u'947219856717.142512', 'item': {u'namespace': u'kube-system', u'deployment': u'calico-kube-controllers'}, u'started': 1, 'changed': True, 'failed': False, u'finished': 0, u'results_file': u'/root/.ansible_async/947219856717.142512', '_ansible_ignore_errors': None, '_ansible_item_label': {u'namespace': u'kube-system', u'deployment': u'calico-kube-controllers'}}, u'rc': 1, u'msg': u'non-zero return code', '_ansible_ignore_errors': None}) => {"changed": false, "item": {"ansible_job_id": "947219856717.142512", "attempts": 20, "changed": true, "cmd": ["kubectl", "--kubeconfig=/etc/kubernetes/admin.conf", "wait", "--namespace=kube-system", "--for=condition=Available", "deployment", "calico-kube-controllers", "--timeout=120s"], "delta": "0:02:00.080321", "end": "2021-10-27 13:54:44.320076", "failed": false, "failed_when_result": false, "finished": 1, "invocation": {"module_args": {"_raw_params": "kubectl --kubeconfig=/etc/kubernetes/admin.conf wait --namespace=kube-system --for=condition=Available deployment calico-kube-controllers --timeout=120s", "_uses_shell": false, "argv": null, "chdir": null, "creates": null, "executable": null, "removes": null, "stdin": null, "warn": true}}, "item": {"ansible_job_id": "947219856717.142512", "changed": true, "failed": false, "finished": 0, "item": {"deployment": "calico-kube-controllers", "namespace": "kube-system"}, "results_file": "/root/.ansible_async/947219856717.142512", "started": 1}, "msg": "non-zero return code", "rc": 1, "start": "2021-10-27 13:52:44.239755", "stderr": "error: timed out waiting for the condition on deployments/calico-kube-controllers", "stderr_lines": ["error: timed out waiting for the condition on deployments/calico-kube-controllers"], "stdout": "", "stdout_lines": []}, "msg": "Pod {u'namespace': u'kube-system', u'deployment': u'calico-kube-controllers'} is still not ready."}
failed: [localhost] (item={'_ansible_parsed': True, 'stderr_lines': [u'error: timed out waiting for the condition on deployments/armada-api'], u'changed': True, u'stderr': u'error: timed out waiting for the condition on deployments/armada-api', u'ansible_job_id': u'998341509452.142878', u'stdout': u'', '_ansible_item_result': True, u'invocation': {u'module_args': {u'creates': None, u'executable': None, u'_uses_shell': False, u'_raw_params': u'kubectl --kubeconfig=/etc/kubernetes/admin.conf wait --namespace=armada --for=condition=Available deployment armada-api --timeout=120s', u'removes': None, u'argv': None, u'warn': True, u'chdir': None, u'stdin': None}}, 'attempts': 1, u'delta': u'0:02:00.072974', 'stdout_lines': [], 'failed_when_result': False, '_ansible_no_log': False, u'end': u'2021-10-27 13:54:47.596154', '_ansible_item_label': {'_ansible_parsed': True, '_ansible_item_result': True, '_ansible_item_label': {u'namespace': u'armada', u'deployment': u'armada-api'}, u'ansible_job_id': u'998341509452.142878', 'item': {u'namespace': u'armada', u'deployment': u'armada-api'}, u'started': 1, 'changed': True, 'failed': False, u'finished': 0, u'results_file': u'/root/.ansible_async/998341509452.142878', '_ansible_ignore_errors': None, '_ansible_no_log': False}, u'start': u'2021-10-27 13:52:47.523180', u'cmd': [u'kubectl', u'--kubeconfig=/etc/kubernetes/admin.conf', u'wait', u'--namespace=armada', u'--for=condition=Available', u'deployment', u'armada-api', u'--timeout=120s'], u'finished': 1, u'failed': False, 'item': {'_ansible_parsed': True, '_ansible_item_result': True, '_ansible_no_log': False, u'ansible_job_id': u'998341509452.142878', 'item': {u'namespace': u'armada', u'deployment': u'armada-api'}, u'started': 1, 'changed': True, 'failed': False, u'finished': 0, u'results_file': u'/root/.ansible_async/998341509452.142878', '_ansible_ignore_errors': None, '_ansible_item_label': {u'namespace': u'armada', u'deployment': u'armada-api'}}, u'rc': 1, u'msg': u'non-zero return code', '_ansible_ignore_errors': None}) => {"changed": false, "item": {"ansible_job_id": "998341509452.142878", "attempts": 1, "changed": true, "cmd": ["kubectl", "--kubeconfig=/etc/kubernetes/admin.conf", "wait", "--namespace=armada", "--for=condition=Available", "deployment", "armada-api", "--timeout=120s"], "delta": "0:02:00.072974", "end": "2021-10-27 13:54:47.596154", "failed": false, "failed_when_result": false, "finished": 1, "invocation": {"module_args": {"_raw_params": "kubectl --kubeconfig=/etc/kubernetes/admin.conf wait --namespace=armada --for=condition=Available deployment armada-api --timeout=120s", "_uses_shell": false, "argv": null, "chdir": null, "creates": null, "executable": null, "removes": null, "stdin": null, "warn": true}}, "item": {"ansible_job_id": "998341509452.142878", "changed": true, "failed": false, "finished": 0, "item": {"deployment": "armada-api", "namespace": "armada"}, "results_file": "/root/.ansible_async/998341509452.142878", "started": 1}, "msg": "non-zero return code", "rc": 1, "start": "2021-10-27 13:52:47.523180", "stderr": "error: timed out waiting for the condition on deployments/armada-api", "stderr_lines": ["error: timed out waiting for the condition on deployments/armada-api"], "stdout": "", "stdout_lines": []}, "msg": "Pod {u'namespace': u'armada', u'deployment': u'armada-api'} is still not ready."}
failed: [localhost] (item={'_ansible_parsed': True, 'stderr_lines': [u'error: timed out waiting for the condition on deployments/coredns'], u'changed': True, u'stderr': u'error: timed out waiting for the condition on deployments/coredns', u'ansible_job_id': u'915942839727.142954', u'stdout': u'', '_ansible_item_result': True, u'invocation': {u'module_args': {u'creates': None, u'executable': None, u'_uses_shell': False, u'_raw_params': u'kubectl --kubeconfig=/etc/kubernetes/admin.conf wait --namespace=kube-system --for=condition=Available deployment coredns --timeout=120s', u'removes': None, u'argv': None, u'warn': True, u'chdir': None, u'stdin': None}}, 'attempts': 2, u'delta': u'0:02:00.069320', 'stdout_lines': [], 'failed_when_result': False, '_ansible_no_log': False, u'end': u'2021-10-27 13:54:48.689142', '_ansible_item_label': {'_ansible_parsed': True, '_ansible_item_result': True, '_ansible_item_label': {u'namespace': u'kube-system', u'deployment': u'coredns'}, u'ansible_job_id': u'915942839727.142954', 'item': {u'namespace': u'kube-system', u'deployment': u'coredns'}, u'started': 1, 'changed': True, 'failed': False, u'finished': 0, u'results_file': u'/root/.ansible_async/915942839727.142954', '_ansible_ignore_errors': None, '_ansible_no_log': False}, u'start': u'2021-10-27 13:52:48.619822', u'cmd': [u'kubectl', u'--kubeconfig=/etc/kubernetes/admin.conf', u'wait', u'--namespace=kube-system', u'--for=condition=Available', u'deployment', u'coredns', u'--timeout=120s'], u'finished': 1, u'failed': False, 'item': {'_ansible_parsed': True, '_ansible_item_result': True, '_ansible_no_log': False, u'ansible_job_id': u'915942839727.142954', 'item': {u'namespace': u'kube-system', u'deployment': u'coredns'}, u'started': 1, 'changed': True, 'failed': False, u'finished': 0, u'results_file': u'/root/.ansible_async/915942839727.142954', '_ansible_ignore_errors': None, '_ansible_item_label': {u'namespace': u'kube-system', u'deployment': u'coredns'}}, u'rc': 1, u'msg': u'non-zero return code', '_ansible_ignore_errors': None}) => {"changed": false, "item": {"ansible_job_id": "915942839727.142954", "attempts": 2, "changed": true, "cmd": ["kubectl", "--kubeconfig=/etc/kubernetes/admin.conf", "wait", "--namespace=kube-system", "--for=condition=Available", "deployment", "coredns", "--timeout=120s"], "delta": "0:02:00.069320", "end": "2021-10-27 13:54:48.689142", "failed": false, "failed_when_result": false, "finished": 1, "invocation": {"module_args": {"_raw_params": "kubectl --kubeconfig=/etc/kubernetes/admin.conf wait --namespace=kube-system --for=condition=Available deployment coredns --timeout=120s", "_uses_shell": false, "argv": null, "chdir": null, "creates": null, "executable": null, "removes": null, "stdin": null, "warn": true}}, "item": {"ansible_job_id": "915942839727.142954", "changed": true, "failed": false, "finished": 0, "item": {"deployment": "coredns", "namespace": "kube-system"}, "results_file": "/root/.ansible_async/915942839727.142954", "started": 1}, "msg": "non-zero return code", "rc": 1, "start": "2021-10-27 13:52:48.619822", "stderr": "error: timed out waiting for the condition on deployments/coredns", "stderr_lines": ["error: timed out waiting for the condition on deployments/coredns"], "stdout": "", "stdout_lines": []}, "msg": "Pod {u'namespace': u'kube-system', u'deployment': u'coredns'} is still not ready."}

PLAY RECAP *************************************************************************************************************************************
localhost : ok=410 changed=223 unreachable=0 failed=1
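A quick way to triage output like the above is to pull out which deployments timed out. A minimal sketch (the helper name is illustrative; it assumes the `error: timed out waiting for the condition on deployments/<name>` pattern shown in the log):

```shell
#!/bin/sh
# Extract the deployment names that failed the readiness wait from
# kubectl/Ansible output lines of the form:
#   error: timed out waiting for the condition on deployments/<name>
extract_timed_out() {
    grep -o 'deployments/[a-z0-9.-]*' | sed 's|deployments/||' | sort -u
}

# Sample stderr lines copied from the failure above.
extract_timed_out <<'EOF'
error: timed out waiting for the condition on deployments/calico-kube-controllers
error: timed out waiting for the condition on deployments/armada-api
error: timed out waiting for the condition on deployments/coredns
EOF
# prints: armada-api, calico-kube-controllers, coredns (one per line)
```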

Alarms
------

[sysadmin@controller-0 ~(keystone_admin)]$ fm alarm-list
+----------+---------------------------------------------------------------------+-------------------+----------+----------------------------+
| Alarm ID | Reason Text                                                         | Entity ID         | Severity | Time Stamp                 |
+----------+---------------------------------------------------------------------+-------------------+----------+----------------------------+
| 200.001  | controller-0 was administratively locked to take it out-of-service. | host=controller-0 | warning  | 2021-10-27T13:49:55.015118 |
+----------+---------------------------------------------------------------------+-------------------+----------+----------------------------+

Test Activity
--------------------

Testing

Workaround
--------------------

None.

Changed in starlingx:
assignee: nobody → Mihnea Saracin (msaracin)
OpenStack Infra (hudson-openstack) wrote: Fix proposed to ansible-playbooks (master)
Changed in starlingx:
status: New → In Progress
Ghada Khalil (gkhalil) wrote:

screening: stx.7.0 / medium - fix in master branch; no requirement to include for stx.6.0 as it's close to being released. Mitigation is to issue B&R w/ controller-0 as active

Changed in starlingx:
importance: Undecided → Medium
tags: added: stx.7.0 stx.update
OpenStack Infra (hudson-openstack) wrote: Fix merged to ansible-playbooks (master)

Reviewed: https://review.opendev.org/c/starlingx/ansible-playbooks/+/822130
Committed: https://opendev.org/starlingx/ansible-playbooks/commit/9a790282e588a066f28c6a42f7683146f137b558
Submitter: "Zuul (22348)"
Branch: master

commit 9a790282e588a066f28c6a42f7683146f137b558
Author: Mihnea Saracin <email address hidden>
Date: Fri Dec 17 17:36:01 2021 +0200

    Fix Backup&Restore when backup is taken on controller-1

    There are 2 main problems when restoring a backup from controller-1:

    - The certificates that are generated by k8s can only be used on
      controller-1. The fix for this is to let k8s regenerate those
      when restoring a backup taken from controller-1.

      In kube-controller-manager and kube-scheduler I've seen logs like:
      error retrieving resource lock kube-system/kube-controller-manager:
      Get
      "https://192.168.205.2:6443/api/v1/namespaces/kube-system/endpoints/kube-controller-manager?timeout=10s":
      x509: certificate is valid for 10.96.0.1, 192.168.205.3, 192.168.205.1,
      127.0.0.1, 128.224.49.105, 128.224.48.105, 128.224.48.106, not
      192.168.205.2

      Where the 192.168.205.2 ip is the controller-0-cluster-host.

    - The ceph.conf from controller-1 can no longer
      be used on controller-0 when restoring (due to recent Ceph changes).
      To fix this, when we take a backup on controller-1
      we also back up ceph.conf from controller-0 and use it at restore.

    Test Plan:

     PASS: AIO-SX bootstrap
     PASS: AIO-DX bootstrap
     PASS: STANDARD bootstrap
     PASS: B&R on AIO-SX
     PASS: B&R on AIO-DX with backup taken from both controllers
     PASS: B&R on STANDARD with backup taken from both controllers

    Closes-Bug: 1955162
    Change-Id: I2e9c7d81113d04782d91efaaa568d9b2bdd20672
    Signed-off-by: Mihnea Saracin <email address hidden>
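The x509 mismatch quoted in the commit message can be decoded mechanically. A small sketch (the variable names are illustrative and not part of the fix; it parses the exact error string shown above):

```shell
#!/bin/sh
# Decode the kube-controller-manager x509 error: list the addresses the
# certificate is valid for, and the address that was actually requested.
err='x509: certificate is valid for 10.96.0.1, 192.168.205.3, 192.168.205.1, 127.0.0.1, 128.224.49.105, 128.224.48.105, 128.224.48.106, not 192.168.205.2'

# Everything between "valid for " and ", not " is the certificate's SAN list.
valid_sans=$(printf '%s\n' "$err" | sed 's/.*valid for \(.*\), not .*/\1/')
# The address after ", not " is the one the client actually dialed.
missing=$(printf '%s\n' "$err" | sed 's/.*, not \(.*\)$/\1/')

echo "certificate valid for: $valid_sans"
echo "requested but not covered: $missing"   # controller-0's cluster-host IP
```

Here the SAN list contains controller-1's cluster-host address (192.168.205.3) but not controller-0's (192.168.205.2), which is why regenerating the certificates on restore resolves the failure.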

Changed in starlingx:
status: In Progress → Fix Released