Backup & Restore: All apps stuck in restore-requested - fluxcd folder was not restored.

Bug #1979124 reported by Thiago Paiva Brito
Affects: StarlingX
Status: Fix Released
Importance: Medium
Assigned to: Thiago Paiva Brito

Bug Description

Brief Description
-----------------
After backup and restore, and after unlocking controller-0, all the apps stay in the "restore-requested" status with the progress message "failed to download one or more image(s)". According to sysinv.log, this is because the YAML files from the fluxcd folder are missing.

Severity
--------
<Major: System/Feature is usable but degraded>

Steps to Reproduce
------------------
 * Install a simplex system with StarlingX master
 * Run the Backup Ansible playbook from controller-0
 * Install a clean image of StarlingX master in the system with wipedisk=false
 * Run the restore Ansible playbook locally with the backup file saved above
 * Unlock controller-0 and check the application status with "system application-list"

Expected Behavior
------------------
All the applications are in the applied/uploaded state

Actual Behavior
----------------
All the applications are in the restore-requested state

Reproducibility
---------------
3/3

System Configuration
--------------------
SX and DX (might happen on Standard/Storage as well), IPv4

Branch/Pull Time/Commit
-----------------------
Branch and the time when code was pulled or git commit or cengn load info

Last Pass
---------
Before nginx-ingress-controller, platform-integ-apps and cert-manager were migrated to fluxcd

Timestamp/Logs
--------------
{code:java}
controller-0:~$ source /etc/platform/openrc
[sysadmin@controller-0 ~(keystone_admin)]$ system host-list
+----+--------------+-------------+----------------+-------------+--------------+
| id | hostname     | personality | administrative | operational | availability |
+----+--------------+-------------+----------------+-------------+--------------+
| 1  | controller-0 | controller  | unlocked       | enabled     | available    |
+----+--------------+-------------+----------------+-------------+--------------+
[sysadmin@controller-0 ~(keystone_admin)]$ system application-list
+--------------------------+---------+-------------------------------------------+------------------+-------------------+------------------------------------------+
| application              | version | manifest name                             | manifest file    | status            | progress                                 |
+--------------------------+---------+-------------------------------------------+------------------+-------------------+------------------------------------------+
| cert-manager             | 1.0-34  | cert-manager-fluxcd-manifests             | fluxcd-manifests | restore-requested | failed to download one or more image(s). |
| nginx-ingress-controller | 1.1-34  | nginx-ingress-controller-fluxcd-manifests | fluxcd-manifests | restore-requested | failed to download one or more image(s). |
| oidc-auth-apps           | 1.0-66  | oidc-auth-apps-fluxcd-manifests           | fluxcd-manifests | uploaded          | completed                                |
| platform-integ-apps      | 1.0-52  | platform-integ-apps-fluxcd-manifests      | fluxcd-manifests | restore-requested | failed to download one or more image(s). |
| rook-ceph-apps           | 1.0-17  | rook-ceph-manifest                        | manifest.yaml    | uploaded          | completed                                |
+--------------------------+---------+-------------------------------------------+------------------+-------------------+------------------------------------------+ {code}

sysinv.log

{code:java}
sysinv 2022-06-16 00:31:10.165 108519 INFO sysinv.conductor.manager [-] Restore in progress - defer platform managed application activity
sysinv 2022-06-16 00:31:10.166 108519 ERROR sysinv.conductor.manager [-] [Errno 2] No such file or directory: '/opt/platform/fluxcd/22.06/nginx-ingress-controller/1.1-34/nginx-ingress-controller-images.yaml': IOError: [Errno 2] No such file or directory: '/opt/platform/fluxcd/22.06/nginx-ingress-controller/1.1-34/nginx-ingress-controller-images.yaml'
2022-06-16 00:31:10.166 108519 ERROR sysinv.conductor.manager Traceback (most recent call last):
2022-06-16 00:31:10.166 108519 ERROR sysinv.conductor.manager   File "/usr/lib64/python2.7/site-packages/sysinv/conductor/manager.py", line 6634, in _restore_download_images
2022-06-16 00:31:10.166 108519 ERROR sysinv.conductor.manager     self._app.download_images(rapp)
2022-06-16 00:31:10.166 108519 ERROR sysinv.conductor.manager   File "/usr/lib64/python2.7/site-packages/sysinv/conductor/kube_app.py", line 867, in download_images
2022-06-16 00:31:10.166 108519 ERROR sysinv.conductor.manager     saved_images_list = self._retrieve_images_list(app.sync_imgfile)
2022-06-16 00:31:10.166 108519 ERROR sysinv.conductor.manager   File "/usr/lib64/python2.7/site-packages/sysinv/conductor/kube_app.py", line 855, in _retrieve_images_list
2022-06-16 00:31:10.166 108519 ERROR sysinv.conductor.manager     with io.open(app_images_file, 'r', encoding='utf-8') as f:
2022-06-16 00:31:10.166 108519 ERROR sysinv.conductor.manager IOError: [Errno 2] No such file or directory: '/opt/platform/fluxcd/22.06/nginx-ingress-controller/1.1-34/nginx-ingress-controller-images.yaml'
2022-06-16 00:31:10.166 108519 ERROR sysinv.conductor.manager
sysinv 2022-06-16 00:31:27.241 109357 INFO sysinv.api.controllers.v1.host [-] controller-0 ihost_patch_start_2022-06-16-00-31-27 patch
sysinv 2022-06-16 00:31:27.242 109357 INFO sysinv.api.controllers.v1.host [-] controller-0 ihost_patch_end.  No changes from mtce/1.0.
sysinv 2022-06-16 00:31:56.590 109357 WARNING keystonemiddleware.auth_token [-] Authorization failed for token: InvalidToken: Token authorization failed
sysinv 2022-06-16 00:32:10.136 108519 INFO sysinv.common.rest_api [-] GET cmd:http://localhost:30001/nfvi-plugins/v1/sw-update hdr:{'Content-type': 'application/json', 'User-Agent': 'sysinv/1.0'} payload:None
sysinv 2022-06-16 00:32:10.137 108519 INFO sysinv.common.rest_api [-] Response={u'status': u'success', u'in-progress': None, u'sw-update-type': None}
sysinv 2022-06-16 00:32:10.145 108519 INFO sysinv.conductor.manager [-] Request downloading images for cert-manager:
sysinv 2022-06-16 00:32:10.155 108519 ERROR sysinv.conductor.manager [-] [Errno 2] No such file or directory: '/opt/platform/fluxcd/22.06/cert-manager/1.0-34/cert-manager-images.yaml': IOError: [Errno 2] No such file or directory: '/opt/platform/fluxcd/22.06/cert-manager/1.0-34/cert-manager-images.yaml'
2022-06-16 00:32:10.155 108519 ERROR sysinv.conductor.manager Traceback (most recent call last):
2022-06-16 00:32:10.155 108519 ERROR sysinv.conductor.manager   File "/usr/lib64/python2.7/site-packages/sysinv/conductor/manager.py", line 6634, in _restore_download_images
2022-06-16 00:32:10.155 108519 ERROR sysinv.conductor.manager     self._app.download_images(rapp)
2022-06-16 00:32:10.155 108519 ERROR sysinv.conductor.manager   File "/usr/lib64/python2.7/site-packages/sysinv/conductor/kube_app.py", line 867, in download_images
2022-06-16 00:32:10.155 108519 ERROR sysinv.conductor.manager     saved_images_list = self._retrieve_images_list(app.sync_imgfile)
2022-06-16 00:32:10.155 108519 ERROR sysinv.conductor.manager   File "/usr/lib64/python2.7/site-packages/sysinv/conductor/kube_app.py", line 855, in _retrieve_images_list
2022-06-16 00:32:10.155 108519 ERROR sysinv.conductor.manager     with io.open(app_images_file, 'r', encoding='utf-8') as f:
2022-06-16 00:32:10.155 108519 ERROR sysinv.conductor.manager IOError: [Errno 2] No such file or directory: '/opt/platform/fluxcd/22.06/cert-manager/1.0-34/cert-manager-images.yaml'
2022-06-16 00:32:10.155 108519 ERROR sysinv.conductor.manager
sysinv 2022-06-16 00:32:10.161 108519 INFO sysinv.conductor.manager [-] Request downloading images for platform-integ-apps:
sysinv 2022-06-16 00:32:10.169 108519 ERROR sysinv.conductor.manager [-] [Errno 2] No such file or directory: '/opt/platform/fluxcd/22.06/platform-integ-apps/1.0-52/platform-integ-apps-images.yaml': IOError: [Errno 2] No such file or directory: '/opt/platform/fluxcd/22.06/platform-integ-apps/1.0-52/platform-integ-apps-images.yaml'
2022-06-16 00:32:10.169 108519 ERROR sysinv.conductor.manager Traceback (most recent call last):
2022-06-16 00:32:10.169 108519 ERROR sysinv.conductor.manager   File "/usr/lib64/python2.7/site-packages/sysinv/conductor/manager.py", line 6634, in _restore_download_images
2022-06-16 00:32:10.169 108519 ERROR sysinv.conductor.manager     self._app.download_images(rapp)
2022-06-16 00:32:10.169 108519 ERROR sysinv.conductor.manager   File "/usr/lib64/python2.7/site-packages/sysinv/conductor/kube_app.py", line 867, in download_images
2022-06-16 00:32:10.169 108519 ERROR sysinv.conductor.manager     saved_images_list = self._retrieve_images_list(app.sync_imgfile)
2022-06-16 00:32:10.169 108519 ERROR sysinv.conductor.manager   File "/usr/lib64/python2.7/site-packages/sysinv/conductor/kube_app.py", line 855, in _retrieve_images_list
2022-06-16 00:32:10.169 108519 ERROR sysinv.conductor.manager     with io.open(app_images_file, 'r', encoding='utf-8') as f:
2022-06-16 00:32:10.169 108519 ERROR sysinv.conductor.manager IOError: [Errno 2] No such file or directory: '/opt/platform/fluxcd/22.06/platform-integ-apps/1.0-52/platform-integ-apps-images.yaml'
2022-06-16 00:32:10.169 108519 ERROR sysinv.conductor.manager
sysinv 2022-06-16 00:32:10.172 108519 INFO sysinv.conductor.manager [-] Request downloading images for nginx-ingress-controller:
sysinv 2022-06-16 00:32:10.177 108519 INFO sysinv.conductor.manager [-] Restore in progress - defer platform managed application activity
sysinv 2022-06-16 00:32:10.178 108519 ERROR sysinv.conductor.manager [-] [Errno 2] No such file or directory: '/opt/platform/fluxcd/22.06/nginx-ingress-controller/1.1-34/nginx-ingress-controller-images.yaml': IOError: [Errno 2] No such file or directory: '/opt/platform/fluxcd/22.06/nginx-ingress-controller/1.1-34/nginx-ingress-controller-images.yaml'
2022-06-16 00:32:10.178 108519 ERROR sysinv.conductor.manager Traceback (most recent call last):
2022-06-16 00:32:10.178 108519 ERROR sysinv.conductor.manager   File "/usr/lib64/python2.7/site-packages/sysinv/conductor/manager.py", line 6634, in _restore_download_images
2022-06-16 00:32:10.178 108519 ERROR sysinv.conductor.manager     self._app.download_images(rapp)
2022-06-16 00:32:10.178 108519 ERROR sysinv.conductor.manager   File "/usr/lib64/python2.7/site-packages/sysinv/conductor/kube_app.py", line 867, in download_images
2022-06-16 00:32:10.178 108519 ERROR sysinv.conductor.manager     saved_images_list = self._retrieve_images_list(app.sync_imgfile)
2022-06-16 00:32:10.178 108519 ERROR sysinv.conductor.manager   File "/usr/lib64/python2.7/site-packages/sysinv/conductor/kube_app.py", line 855, in _retrieve_images_list
2022-06-16 00:32:10.178 108519 ERROR sysinv.conductor.manager     with io.open(app_images_file, 'r', encoding='utf-8') as f:
2022-06-16 00:32:10.178 108519 ERROR sysinv.conductor.manager IOError: [Errno 2] No such file or directory: '/opt/platform/fluxcd/22.06/nginx-ingress-controller/1.1-34/nginx-ingress-controller-images.yaml'
2022-06-16 00:32:10.178 108519 ERROR sysinv.conductor.manager {code}
The fluxcd folder is missing from /opt/platform; it was not restored.

{code:java}
controller-0:~$ ls /opt/platform/
armada  config  device_images  extra  helm  helm_charts  keystone  lost+found  nfv  puppet  sysinv {code}
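
The check above can also be scripted per application. The following is a minimal sketch, assuming the directory layout seen in the sysinv.log paths (/opt/platform/fluxcd/<sw_version>/<app>/<app_version>/<app>-images.yaml). The helper names images_file and missing_images_files are hypothetical and are not part of sysinv.

```python
import os

# Layout inferred from the sysinv.log paths in this report (an assumption):
#   /opt/platform/fluxcd/<sw_version>/<app>/<app_version>/<app>-images.yaml
FLUXCD_BASE = "/opt/platform/fluxcd"


def images_file(app, app_version, sw_version="22.06", base=FLUXCD_BASE):
    """Return the images file path that sysinv tried to open for this app."""
    return os.path.join(base, sw_version, app, app_version, app + "-images.yaml")


def missing_images_files(apps, sw_version="22.06", base=FLUXCD_BASE):
    """Return the expected images files that do not exist on disk."""
    paths = [images_file(name, ver, sw_version, base) for name, ver in apps]
    return [p for p in paths if not os.path.isfile(p)]


if __name__ == "__main__":
    # Apps and versions taken from the application-list output in this report.
    affected = [
        ("cert-manager", "1.0-34"),
        ("nginx-ingress-controller", "1.1-34"),
        ("platform-integ-apps", "1.0-52"),
    ]
    for path in missing_images_files(affected):
        print("missing:", path)
```

On an affected system, this would be expected to list the same three files that appear in the IOError messages.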

Test Activity
-------------
Regression Testing

Workaround
----------
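- Copy the fluxcd folder from another lab to /opt/platform/fluxcd
- Reset the application status in the sysinv database, for example for cert-manager: sudo -u postgres psql -U postgres -d sysinv -c "update kube_app set status='uploaded' where name='cert-manager'"
- Lock and unlock the controller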
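
The database step is shown for cert-manager only. The sketch below builds the same psql command for each application left in restore-requested. Applying it to the other two apps is an assumption, not something stated in the workaround, and the helper reset_status_command is hypothetical. The script only prints the commands and does not run them.

```python
import re
import shlex

# Conservative pattern for application names before they are placed in SQL.
_APP_NAME = re.compile(r"^[a-z0-9][a-z0-9-]*$")


def reset_status_command(app_name):
    """Build the psql command from the workaround for one application."""
    if not _APP_NAME.match(app_name):
        raise ValueError("unexpected application name: %r" % app_name)
    sql = "update kube_app set status='uploaded' where name='%s'" % app_name
    return ["sudo", "-u", "postgres", "psql", "-U", "postgres",
            "-d", "sysinv", "-c", sql]


if __name__ == "__main__":
    # Apps reported in restore-requested in the application-list output.
    for app in ("cert-manager", "nginx-ingress-controller", "platform-integ-apps"):
        cmd = reset_status_command(app)
        # Print only; review each command before running it by hand.
        print(" ".join(shlex.quote(part) for part in cmd))
```

Validating the name before formatting it into the statement keeps a mistyped or malformed value out of the SQL string.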

Changed in starlingx:
assignee: nobody → Thiago Paiva Brito (outbrito)
Changed in starlingx:
status: New → In Progress
Ghada Khalil (gkhalil)
tags: added: stx.7.0 stx.update
Changed in starlingx:
importance: Undecided → Medium
OpenStack Infra (hudson-openstack) wrote : Fix merged to ansible-playbooks (master)

Reviewed: https://review.opendev.org/c/starlingx/ansible-playbooks/+/846244
Committed: https://opendev.org/starlingx/ansible-playbooks/commit/97a6c9b64f0b574ce298af3c28aeaee8312e07a0
Submitter: "Zuul (22348)"
Branch: master

commit 97a6c9b64f0b574ce298af3c28aeaee8312e07a0
Author: Thiago Brito <email address hidden>
Date: Thu Jun 16 22:55:33 2022 -0400

    Backup and restore fluxcd folder

    Include /opt/platform/fluxcd folder in the list of folders to backup.
    Also, checks if the armada folder is contained in the backup now that we
    might not have armada apps installed on the system anymore.

    TEST PLAN (AIO-SX)
    PASS Backup & Restore stx.7.0 with fluxcd and armada (rook-ceph) apps
    PASS Backup & Restore stx.7.0 with fluxcd apps only
    PASS Upgrade stx.6.0 to stx.7.0 with only armada apps, passes restore
         but seeing an unrelated problem during upgrade-activate

    Closes-Bug: 1979124
    Signed-off-by: Thiago Brito <email address hidden>
    Change-Id: I9924113aa0099ec6e703ce61d70132dfed0fbe23
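
The commit message describes checking whether a folder is contained in the backup before restoring it. The actual fix is implemented as Ansible tasks in the linked review. The Python sketch below only illustrates the idea; the archive member paths (opt/platform/fluxcd, opt/platform/armada) and the function names are assumptions.

```python
import tarfile


def _normalize(name):
    """Drop leading './' segments and surrounding slashes from a member name."""
    while name.startswith("./"):
        name = name[2:]
    return name.strip("/")


def contains_dir(member_names, directory):
    """True if any archive member is the directory itself or lives under it."""
    target = _normalize(directory)
    prefix = target + "/"
    for name in member_names:
        normalized = _normalize(name)
        if normalized == target or normalized.startswith(prefix):
            return True
    return False


def backup_dirs_present(tar_path, directories):
    """Map each directory to whether the backup archive contains it."""
    with tarfile.open(tar_path) as tar:
        names = tar.getnames()
    return {d: contains_dir(names, d) for d in directories}


if __name__ == "__main__":
    # Hypothetical member list: a backup with fluxcd apps only.
    members = ["opt/platform/fluxcd/22.06/cert-manager/1.0-34/cert-manager-images.yaml"]
    for folder in ("opt/platform/fluxcd", "opt/platform/armada"):
        print(folder, contains_dir(members, folder))
```

A restore step guarded this way can skip the armada folder when the backup was taken on a system with fluxcd apps only.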

Changed in starlingx:
status: In Progress → Fix Released