Backup & Restore: AIO-SX restore fails - Recover ceph data fails

Bug #1869428 reported by Senthil Mukundakumar
Affects: StarlingX
Status: Fix Released
Importance: Medium
Assigned to: Mihnea Saracin

Bug Description

Brief Description
-----------------

AIO-SX Platform restore fails during recover ceph data.

E
E TASK [recover-ceph-data : Recover ceph-data] *************************************************************************************************************************************************************************************************************
E fatal: [localhost]: FAILED! => {"changed": true, "msg": "non-zero return code", "rc": 1, "stderr": "Traceback (most recent call last):\n File \"/tmp/.ansible-sysadmin/tmp/ansible-tmp-1585335334.07-179451876563963/recover_ceph_data.py\", line 37, in <module>\n recover_ceph_data()\n File \"/tmp/.ansible-sysadmin/tmp/ansible-tmp-1585335334.07-179451876563963/recover_ceph_data.py\", line 33, in recover_ceph_data\n stderr=fnull)\n File \"/usr/lib64/python2.7/subprocess.py\", line 575, in check_output\n raise CalledProcessError(retcode, cmd, output=output)\nsubprocess.CalledProcessError: Command '['ceph-monstore-tool', '/tmp/mon-store', 'rebuild']' returned non-zero exit status 22\n", "stderr_lines": ["Traceback (most recent call last):", " File \"/tmp/.ansible-sysadmin/tmp/ansible-tmp-1585335334.07-179451876563963/recover_ceph_data.py\", line 37, in <module>", " recover_ceph_data()", " File \"/tmp/.ansible-sysadmin/tmp/ansible-tmp-1585335334.07-179451876563963/recover_ceph_data.py\", line 33, in recover_ceph_data", " stderr=fnull)", " File \"/usr/lib64/python2.7/subprocess.py\", line 575, in check_output", " raise CalledProcessError(retcode, cmd, output=output)", "subprocess.CalledProcessError: Command '['ceph-monstore-tool', '/tmp/mon-store', 'rebuild']' returned non-zero exit status 22"], "stdout": "Rebuilding monitor data.\n", "stdout_lines": ["Rebuilding monitor data."]}
E
E PLAY RECAP ***********************************************************************************************************************************************************************************************************************************************
E localhost : ok=419  changed=241  unreachable=0 failed=1 

tc_sysinstall/fresh_install/restore_system/restore_helper.py:159: AssertionError

Severity
--------

Major

Steps to Reproduce
------------------
1. Make sure the AIO-SX system is up and active
2. Do a backup from the active controller
3. Re-install active controller
4. scp the backup file to the controller
5. Restore the active controller from backup file
ansible-playbook /usr/share/ansible/stx-ansible/playbooks/restore_platform.yml -e "initial_backup_dir=/home/sysadmin ansible_become_pass=Li69nux* admin_password=Li69nux* backup_filename=localhost_platform_backup_2020_03_27_14_44_55.tgz"

Expected Behavior
------------------
AIO-SX controller should be restored and ready to be unlocked.

Actual Behavior
----------------
Controller restore fails during the recover-ceph-data task

Reproducibility
---------------
First time this issue has been seen in sanity

System Configuration
--------------------
AIO-SX

Branch/Pull Time/Commit
-----------------------
2020-03-26_19-39-18

Last Pass
---------
2020-03-25_21-02-05

Test Activity
-------------
Regression

Logs:
/folk/cgts-pv/bnr/152/localhost_platform_backup_2020_03_27_14_44_55.tgz

Revision history for this message
Ovidiu Poncea (ovidiuponcea) wrote :

The problem seems to be caused by /tmp/mon-store not being present (a long-shot guess at this point).
That store is created only when at least one OSD is detected and scanned. A scan logs "Scanning osd-0", but that never happens for us: we only get the message that comes after the OSD scan, "Rebuilding monitor data.\n"
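
For reference, a simplified outline of the two-phase recovery involved (illustrative only; the assumed OSD data-path layout under /var/lib/ceph/osd may differ from what the real script uses). Phase 1 populates /tmp/mon-store from each OSD it scans; phase 2 rebuilds the monitor DB from that store, so if phase 1 scans nothing the store stays missing or empty and phase 2 fails:

    import glob
    import subprocess

    mon_store = "/tmp/mon-store"
    # Phase 1: harvest monitor data from every OSD we can find.
    for osd in glob.glob("/var/lib/ceph/osd/ceph-*"):  # assumed layout
        print("Scanning %s" % osd)
        subprocess.check_call(["ceph-objectstore-tool", "--data-path", osd,
                               "--op", "update-mon-db",
                               "--mon-store-path", mon_store])
    # Phase 2: rebuild the monitor store from the harvested data.
    print("Rebuilding monitor data.")
    subprocess.check_call(["ceph-monstore-tool", mon_store, "rebuild"])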

Some questions come out of this: could it be related to the NVMe drives used on this setup? If so, why does it not reproduce every time?

To be sure we need:
1. The system left in that state
2. Improved logging for the next time we hit this or a similar issue, since this is the first time it has happened in sanity.

Details:
The problem is a failure of 'ceph-monstore-tool /tmp/mon-store rebuild', executed from recover_ceph_data.py.

The return code is 22; looking over the ceph-monstore-tool source code, it could be returned from almost anywhere. Still, one likely candidate is EINVAL (22, "Invalid argument"), which usually happens when an argument is wrong, such as a missing file.
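
A quick sanity check of that errno mapping, on any Python on Linux:

    # Confirm that return code 22 corresponds to EINVAL.
    import errno
    print(errno.errorcode[22])  # -> 'EINVAL'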

Looking over the code that calls this command, I see we have no logging:
            subprocess.check_output(["ceph-objectstore-tool", "--data-path",
                                     osd, "--op", "update-mon-db",
                                     "--mon-store-path",
                                     mon_store], stderr=fnull)
        print("Rebuilding monitor data.")
        subprocess.check_output(["ceph-monstore-tool", mon_store, "rebuild"],
                                stderr=fnull)

Since this is hard to reproduce, we should add better logging that includes both stdout and stderr for these two commands, and, before the "Rebuilding monitor data" log, we should check that the mon_store path is present.
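
As a minimal sketch of what that could look like (Python 2.7, matching the script's environment; not necessarily what the eventual fix does), the rebuild call could be wrapped so that output is captured and the store is verified first:

    import os
    import subprocess

    MON_STORE = "/tmp/mon-store"  # path used by recover_ceph_data.py

    def run_logged(cmd):
        # Capture stdout and stderr together so a failure carries its
        # diagnostics into the ansible output.
        proc = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                                stderr=subprocess.STDOUT)
        out, _ = proc.communicate()
        print("CMD: %s\nOUTPUT:\n%s" % (" ".join(cmd), out))
        if proc.returncode != 0:
            raise subprocess.CalledProcessError(proc.returncode, cmd,
                                                output=out)
        return out

    # Verify the OSD scan actually populated the store before rebuilding.
    if not os.path.isdir(MON_STORE) or not os.listdir(MON_STORE):
        raise RuntimeError("%s missing or empty; OSD scan likely failed"
                           % MON_STORE)
    run_logged(["ceph-monstore-tool", MON_STORE, "rebuild"])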

Revision history for this message
Ghada Khalil (gkhalil) wrote :

stx.4.0 / medium priority - intermittent issue; needs further investigation

tags: added: stx.4.0 stx.storage
Changed in starlingx:
importance: Undecided → Medium
status: New → Triaged
assignee: nobody → Mihnea Saracin (msaracin)
Ghada Khalil (gkhalil)
tags: added: stx.retestneeded
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to ansible-playbooks (master)

Fix proposed to branch: master
Review: https://review.opendev.org/739203

Changed in starlingx:
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to ansible-playbooks (master)

Reviewed: https://review.opendev.org/739203
Committed: https://git.openstack.org/cgit/starlingx/ansible-playbooks/commit/?id=d1134785afb0196b7c39774a04e18762c7175383
Submitter: Zuul
Branch: master

commit d1134785afb0196b7c39774a04e18762c7175383
Author: Mihnea Saracin <email address hidden>
Date: Fri Jul 3 13:26:53 2020 +0300

    Improve logging at B&R when we recover ceph data

    Add more logging to 'recover_ceph_data.py' script when
    we do a restore of the system.
    Also added a check to abort the restore operation
    if 'wipe_ceph_osds=false' but the ceph osds
    partition is not present on the system.

    Change-Id: Ib32b6573689b1d442a01575c49aea6d75e66c35d
    Partial-Bug: 1869428
    Signed-off-by: Mihnea Saracin <email address hidden>
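
For illustration, a hedged sketch of the kind of guard the commit message describes (the merged change in review 739203 may implement it differently; the GUID below is the standard Ceph OSD GPT partition-type identifier):

    import subprocess

    # Standard Ceph OSD GPT partition-type GUID.
    CEPH_OSD_GUID = "4fbd7e29-9d25-41b8-afd0-062c0ceff05d"

    def ceph_osd_partition_present():
        # List partition-type GUIDs for all block devices.
        out = subprocess.check_output(
            ["lsblk", "-n", "-o", "PARTTYPE"]).decode()
        return CEPH_OSD_GUID in out.lower()

    wipe_ceph_osds = False  # restore playbook parameter
    if not wipe_ceph_osds and not ceph_osd_partition_present():
        raise SystemExit("wipe_ceph_osds=false but no Ceph OSD partition "
                         "was found; aborting restore")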

Revision history for this message
Frank Miller (sensfan22) wrote :

Removed the stx.4.0 tag and added stx.5.0: This is an enhancement to logging and as such does not need to be ported to the r/stx.4.0 branch.

tags: added: stx.5.0
removed: stx.4.0
Revision history for this message
Austin Sun (sunausti) wrote :

https://review.opendev.org/#/c/739203/ has merged; the status should be changed to Fix Released.

Changed in starlingx:
status: In Progress → Fix Released
Ghada Khalil (gkhalil)
tags: removed: stx.retestneeded