Subsequent subcloud backup fails after powering off subcloud during backup

Bug #2002128 reported by Guilherme Schons
Affects: StarlingX
Status: Fix Released
Importance: Medium
Assigned to: Guilherme Schons

Bug Description

Brief Description
-----------------
dcmanager fails when retrying a subcloud backup after the subcloud was powered off during a previous backup.

Severity
--------
Major

Steps to Reproduce
------------------
Create subcloud backup

dcmanager subcloud-backup create --subcloud subcloud1 --sysadmin-password Li69nux*

Watch Ansible logs and power off the subcloud via IPMI after reaching the following task: "Run subcloud1 backup playbook"

The backup fails because it cannot connect to the subcloud (this takes a few minutes):
'Failed to connect to the host via ssh: ssh: connect to host *** port 22: No route to host'

Check that the backup status goes to the 'failed' state

Power on the subcloud again

Once subcloud is online/in-sync, retry the subcloud backup creation:
dcmanager subcloud-backup create --subcloud subcloud1 --sysadmin-password Li69nux*

The backup operation fails with the following error:
msg: Backup is already in progress!
Retrying produces the same failure.

Expected Behavior
------------------
The operator is able to back up the subcloud again after a power-off failure

Actual Behavior
----------------
The operator can no longer back up a subcloud if the subcloud goes down during a backup

Reproducibility
---------------
100% - Interrupt the backup process and try again, or manually create the .backup_in_progress file.
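
The failure mode can be modelled with a small sketch. This is not the playbook's code, only an illustration under the assumption that the backup creates a flag file at the start and removes it on normal completion. The names `run_backup`, `interrupted` and `demo` are hypothetical, and a temporary directory stands in for the subcloud's filesystem.

```python
import os
import tempfile


class BackupInProgressError(RuntimeError):
    """Raised when a flag file from an earlier run is still present."""


def run_backup(flag_path, do_backup):
    # Refuse to start if the flag exists, mirroring the
    # "Backup is already in progress!" failure in this report.
    if os.path.exists(flag_path):
        raise BackupInProgressError("Backup is already in progress!")
    open(flag_path, "w").close()  # mark the backup as running
    do_backup()                   # an interruption here skips the cleanup
    os.remove(flag_path)          # reached only on normal completion


def interrupted():
    # Stand-in for the subcloud being powered off mid-backup.
    raise ConnectionError("No route to host")


def demo():
    """Return (flag_left_behind, retry_error_message)."""
    with tempfile.TemporaryDirectory() as tmp:
        flag = os.path.join(tmp, ".backup_in_progress")
        try:
            run_backup(flag, interrupted)
        except ConnectionError:
            pass  # first backup fails, as in the reproduction steps
        flag_left_behind = os.path.exists(flag)
        try:
            run_backup(flag, lambda: None)
            retry_error = None
        except BackupInProgressError as exc:
            retry_error = str(exc)
        return flag_left_behind, retry_error
```

In this model `demo()` reports that the flag is left behind and that the retry is rejected, which matches both reproduction paths: interrupting a backup, or creating the flag file by hand.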

System Configuration
--------------------
Distributed Cloud

Last Pass
---------
Never tested before.

Timestamp/Logs
--------------
TASK [subcloud-bnr/backup : Remove subcloud overrides file on subcloud1] *******
Tuesday 01 November 2022 18:03:36 +0000 (0:00:00.015) 0:08:36.261 ******
fatal: [subcloud1]: UNREACHABLE! => changed=false
  msg: 'Failed to connect to the host via ssh: ssh: connect to host ***::1016 port 22: No route to host'
  unreachable: true

| 8 | subcloud1 | managed | offline | complete | unknown | failed | None |
+----+-----------+------------+--------------+---------------+---------+---------------+-----------------+

// Powered on Subcloud1 and initiated backup process again

| 8 | subcloud1 | managed | online | complete | in-sync | failed | None |
+----+-----------+------------+--------------+---------------+---------+---------------+-----------------+

$ dcmanager subcloud-backup create --subcloud subcloud1 --sysadmin-password Li69nux*

| 8 | subcloud1 | managed | online | complete | in-sync | backing-up | None |

// Failed as previous backup was still in-progress

| 8 | subcloud1 | managed | online | complete | in-sync | failed | None |

 TASK [backup/prepare-env : Fail if backup is already in progress] **************
    Tuesday 01 November 2022 18:34:37 +0000 (0:00:00.293) 0:00:03.329 ******
    fatal: [localhost]: FAILED! => changed=false
      msg: Backup is already in progress!

Test Activity
-------------
Feature testing

Workaround
----------
SSH into the subcloud and remove the hidden flag file /etc/platform/.backup_in_progress

OpenStack Infra (hudson-openstack) wrote : Fix proposed to ansible-playbooks (master)
Changed in starlingx:
status: New → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to ansible-playbooks (master)

Reviewed: https://review.opendev.org/c/starlingx/ansible-playbooks/+/869478
Committed: https://opendev.org/starlingx/ansible-playbooks/commit/2b5a143b189d91432d84df80e13c7f37bce50c2c
Submitter: "Zuul (22348)"
Branch: master

commit 2b5a143b189d91432d84df80e13c7f37bce50c2c
Author: Guilherme Schons <email address hidden>
Date: Fri Jan 6 11:30:57 2023 -0300

    Fix fail to start backup after stopping a backup

    When there is an abrupt disconnect (target is shut down,
    network glitch), the backup_in_progress flag remains,
    resulting in error to start a new backup process.

    This commit adds a new condition to check if backup_in_progress
    flag was created less than 20 minutes ago, so block the process
    otherwise, ignore the flag and continue the backup process.

    Test Plan:
    - PASS: Run backup process AIO-SX without backup_in_progress flag
    - PASS: Run backup process AIO-SX with backup_in_progress flag
      (before 20 minutes since the last backup interrupted)
    - PASS: Run backup process AIO-SX with backup_in_progress flag
      (after 20 minutes since the last backup interrupted)

    Closes-Bug: 2002128
    Signed-off-by: Guilherme Schons <email address hidden>
    Change-Id: I3f51c41005208e96d747b1c89931471b218f709b
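
The age check described in the commit message can be sketched as follows. This is a hedged illustration, not the merged Ansible change: the function name `backup_is_blocked` is hypothetical, and using the file's modification time as the flag's creation time is an assumption made for the example.

```python
import os
import time

STALE_AFTER_SECONDS = 20 * 60  # 20 minutes, per the commit message


def backup_is_blocked(flag_path, now=None, stale_after=STALE_AFTER_SECONDS):
    """Return True if a recent flag file should block a new backup."""
    try:
        # Assumption: mtime approximates when the flag was created.
        flag_mtime = os.path.getmtime(flag_path)
    except FileNotFoundError:
        return False  # no flag, so no backup is in progress
    if now is None:
        now = time.time()
    # A flag younger than the threshold blocks the backup; an older
    # flag is treated as left over from an interrupted run and ignored.
    return (now - flag_mtime) < stale_after
```

The three cases correspond to the commit's test plan: no flag, a flag younger than 20 minutes, and a flag older than 20 minutes. Passing `now` explicitly keeps the check easy to exercise without waiting.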

Changed in starlingx:
status: In Progress → Fix Released
Ghada Khalil (gkhalil)
Changed in starlingx:
importance: Undecided → Medium
tags: added: stx.8.0 stx.distcloud
Changed in starlingx:
assignee: nobody → Guilherme Schons (gdossant)