Locking a host while a backup is in progress causes future upgrades to fail

Bug #1990984 reported by Luis Eduardo Angelini Marquitti
This bug affects 1 person
Affects: StarlingX
Status: Fix Released
Importance: Medium
Assigned to: Luis Eduardo Angelini Marquitti
Milestone: -

Bug Description

Brief Description
-----------------
Orchestrated upgrade failed on a subcloud because the last backup run did not complete correctly.
The /etc/platform/.backup_in_progress flag was not cleared because the user locked the host while performing the backup.

Severity
--------
Major

Steps to Reproduce
------------------
1. Start a backup process
2. Lock the host before the backup process ends
3. Try performing an upgrade or another backup on the host

Expected Behavior
------------------
Prevent the user from locking the host while the backup is running

Actual Behavior
----------------
User can lock host while backup is running, causing future backups and upgrades to fail

Reproducibility
---------------
Reproducible

System Configuration
--------------------
AIO-SX
SX Subcloud

Branch/Pull Time/Commit
-----------------------
-

Last Pass
---------
-

Timestamp/Logs
--------------
TASK [backup/prepare-env : Check if backup is in progress] *********************
ok: [localhost]

TASK [backup/prepare-env : Fail if backup is already in progress] **************
fatal: [localhost]: FAILED! => {"changed": false, "msg": "Backup is already in progress!"}

PLAY RECAP *********************************************************************
localhost : ok=6 changed=2 unreachable=0 failed=1
sysinv 2022-08-11 04:04:26.135 12124 INFO sysinv.agent.manager [-] Exception during simplex upgrade data collection
sysinv 2022-08-11 04:04:26.136 12124 ERROR sysinv.agent.manager [-] Command '['ansible-playbook', '-e', 'platform_backup_file=upgrade_data_2022-08-11T040422_e742fc22-18d8-4623-9cab-7df342dbf7d1.tgz docker_local_registry_backup_file=upgrade_images_data_2022-08-11T040422_e742fc22-18d8-4623-9cab-7df342dbf7d1.tgz backup_user_local_registry=true backup_dir=/opt/platform-backup', '/usr/share/ansible/stx-ansible/playbooks/backup.yml']' returned non-zero exit status 2: CalledProcessError: Command '['ansible-playbook', '-e', 'platform_backup_file=upgrade_data_2022-08-11T040422_e742fc22-18d8-4623-9cab-7df342dbf7d1.tgz docker_local_registry_backup_file=upgrade_images_data_2022-08-11T040422_e742fc22-18d8-4623-9cab-7df342dbf7d1.tgz backup_user_local_registry=true backup_dir=/opt/platform-backup', '/usr/share/ansible/stx-ansible/playbooks/backup.yml']' returned non-zero exit status 2
2022-08-11 04:04:26.136 12124 ERROR sysinv.agent.manager Traceback (most recent call last):
2022-08-11 04:04:26.136 12124 ERROR sysinv.agent.manager File "/usr/lib64/python2.7/site-packages/sysinv/agent/manager.py", line 1819, in create_simplex_backup
2022-08-11 04:04:26.136 12124 ERROR sysinv.agent.manager upgrades_management.create_simplex_backup(software_upgrade)
2022-08-11 04:04:26.136 12124 ERROR sysinv.agent.manager File "/usr/lib64/python2.7/site-packages/controllerconfig/upgrades/management.py", line 248, in create_simplex_backup
2022-08-11 04:04:26.136 12124 ERROR sysinv.agent.manager raise subprocess.CalledProcessError(proc.returncode, args)
2022-08-11 04:04:26.136 12124 ERROR sysinv.agent.manager CalledProcessError: Command '['ansible-playbook', '-e', 'platform_backup_file=upgrade_data_2022-08-11T040422_e742fc22-18d8-4623-9cab-7df342dbf7d1.tgz docker_local_registry_backup_file=upgrade_images_data_2022-08-11T040422_e742fc22-18d8-4623-9cab-7df342dbf7d1.tgz backup_user_local_registry=true backup_dir=/opt/platform-backup', '/usr/share/ansible/stx-ansible/playbooks/backup.yml']' returned non-zero exit status 2
2022-08-11 04:04:26.136 12124 ERROR sysinv.agent.manager

Test Activity
-------------
Backup or Upgrade execution

Workaround
----------
Manually delete the /etc/platform/.backup_in_progress flag and rerun the backup or upgrade
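
The workaround above could be scripted roughly as follows. This is a minimal sketch, not part of StarlingX: the function name `clear_stale_backup_flag` is hypothetical, only the flag path comes from this report, and it assumes the operator has already confirmed that no backup is actually running.

```python
import errno
import os

# Flag path as given in this bug report.
BACKUP_IN_PROGRESS_FLAG = "/etc/platform/.backup_in_progress"


def clear_stale_backup_flag(flag_path=BACKUP_IN_PROGRESS_FLAG):
    """Delete the flag file if it exists.

    Hypothetical helper: returns True if the flag was removed and
    False if it was already absent. Any other OS error (for example
    a permission problem) is re-raised.
    """
    try:
        os.remove(flag_path)
    except OSError as exc:
        if exc.errno == errno.ENOENT:
            return False
        raise
    return True
```

After the flag is removed, the backup or upgrade still has to be rerun by hand, as the workaround states.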

Changed in starlingx:
assignee: nobody → Luis Eduardo Angelini Marquitti (leduard1)
OpenStack Infra (hudson-openstack) wrote : Fix proposed to config (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/config/+/859441

Changed in starlingx:
status: New → In Progress
Ghada Khalil (gkhalil)
tags: added: stx.8.0 stx.update
Changed in starlingx:
importance: Undecided → Medium
OpenStack Infra (hudson-openstack) wrote : Fix merged to config (master)

Reviewed: https://review.opendev.org/c/starlingx/config/+/859441
Committed: https://opendev.org/starlingx/config/commit/6421b57e7fd8fee550579a2652c47e8303c2e68e
Submitter: "Zuul (22348)"
Branch: master

commit 6421b57e7fd8fee550579a2652c47e8303c2e68e
Author: Luis Eduardo Angelini Marquitti <email address hidden>
Date: Tue Sep 27 10:37:47 2022 -0400

    Handling backup_in_progress flag

    Added a semantic check that prevents the user from locking a host during
    a backup by checking whether the flag /etc/platform/.backup_in_progress
    is present.
    Locking a host while a backup is in progress may cause the backup to
    fail to complete correctly and leave the flag uncleared at the end of
    the process. This can prevent future backup or SX upgrade processes
    from starting.
    On an SX, when a backup did not finish correctly, there is no backup in
    progress alarm, and the flag is still present, the flag is removed at
    upgrade start, before the backup process that runs during upgrade start
    begins. This also avoids upgrade start failures in orchestrated
    upgrades on SX subclouds.

    Test plan:

    PASS: Tried to lock a host during a backup run and received a message
    that the lock action was not performed because there is a backup in
    progress;
    PASS: Tried to lock a host after backup failed and flag was not cleared,
    and received a message that the lock action was not performed because
    there is a backup in progress
    PASS: After system-health-query-upgrade and fm-alarm list did not show
    a backup in progress alarm, but with the flag present, an upgrade was
    attempted. The flag was cleared and the upgrade started.

    Closes-Bug: 1990984
    Change-Id: I747d79f41cff41b4990cc40cff2150a73c10b056
    Signed-off-by: Luis Eduardo Angelini Marquitti <email address hidden>
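
The two behaviors described in the commit message can be illustrated with a short sketch. This is not the merged sysinv code: `BackupInProgressError`, `check_lock_allowed`, and `clear_flag_before_upgrade` are hypothetical names, and the `backup_alarm_active` argument stands in for whatever alarm query the real change performs. Only the flag path is taken from this report.

```python
import os

# Flag path as given in this bug report.
BACKUP_IN_PROGRESS_FLAG = "/etc/platform/.backup_in_progress"


class BackupInProgressError(Exception):
    """Hypothetical error raised when a lock request is rejected."""


def check_lock_allowed(flag_path=BACKUP_IN_PROGRESS_FLAG):
    """Semantic check: reject a host lock while the flag is present."""
    if os.path.exists(flag_path):
        raise BackupInProgressError(
            "Lock rejected: a backup is in progress (%s exists)" % flag_path
        )


def clear_flag_before_upgrade(backup_alarm_active,
                              flag_path=BACKUP_IN_PROGRESS_FLAG):
    """At SX upgrade start, remove a leftover flag when no alarm is raised.

    Returns True if a stale flag was removed, False otherwise. The flag
    is left alone while a backup in progress alarm is active.
    """
    if backup_alarm_active or not os.path.exists(flag_path):
        return False
    os.remove(flag_path)
    return True
```

In this sketch a flag left behind by a failed backup still blocks lock requests, which matches the second PASS case in the test plan; only the upgrade-start path clears it.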

Changed in starlingx:
status: In Progress → Fix Released