Backup & Restore: AIO-SX restore fails - Recover ceph data fails

Bug #1869428 reported by Senthil Mukundakumar
Affects: StarlingX
Status: Fix Released
Importance: Medium
Assigned to: Mihnea Saracin

Bug Description

Brief Description
-----------------

AIO-SX Platform restore fails during recover ceph data.

E
E TASK [recover-ceph-data : Recover ceph-data] *************************************************************************************************************************************************************************************************************
E fatal: [localhost]: FAILED! => {"changed": true, "msg": "non-zero return code", "rc": 1, "stderr": "Traceback (most recent call last):\n File \"/tmp/.ansible-sysadmin/tmp/ansible-tmp-1585335334.07-179451876563963/recover_ceph_data.py\", line 37, in <module>\n recover_ceph_data()\n File \"/tmp/.ansible-sysadmin/tmp/ansible-tmp-1585335334.07-179451876563963/recover_ceph_data.py\", line 33, in recover_ceph_data\n stderr=fnull)\n File \"/usr/lib64/python2.7/subprocess.py\", line 575, in check_output\n raise CalledProcessError(retcode, cmd, output=output)\nsubprocess.CalledProcessError: Command '['ceph-monstore-tool', '/tmp/mon-store', 'rebuild']' returned non-zero exit status 22\n", "stderr_lines": ["Traceback (most recent call last):", " File \"/tmp/.ansible-sysadmin/tmp/ansible-tmp-1585335334.07-179451876563963/recover_ceph_data.py\", line 37, in <module>", " recover_ceph_data()", " File \"/tmp/.ansible-sysadmin/tmp/ansible-tmp-1585335334.07-179451876563963/recover_ceph_data.py\", line 33, in recover_ceph_data", " stderr=fnull)", " File \"/usr/lib64/python2.7/subprocess.py\", line 575, in check_output", " raise CalledProcessError(retcode, cmd, output=output)", "subprocess.CalledProcessError: Command '['ceph-monstore-tool', '/tmp/mon-store', 'rebuild']' returned non-zero exit status 22"], "stdout": "Rebuilding monitor data.\n", "stdout_lines": ["Rebuilding monitor data."]}
E
E PLAY RECAP ***********************************************************************************************************************************************************************************************************************************************
E localhost : ok=419  changed=241  unreachable=0 failed=1 

tc_sysinstall/fresh_install/restore_system/restore_helper.py:159: AssertionError

Severity
--------

Major

Steps to Reproduce
------------------
1. Make sure the AIO-SX system is up and active
2. Do a backup from the active controller
3. Re-install active controller
4. scp the backup file to the controller
5. Restore the active controller from backup file
ansible-playbook /usr/share/ansible/stx-ansible/playbooks/restore_platform.yml -e "initial_backup_dir=/home/sysadmin ansible_become_pass=Li69nux* admin_password=Li69nux* backup_filename=localhost_platform_backup_2020_03_27_14_44_55.tgz"

Expected Behavior
------------------
AIO-SX controller should be restored and ready to be unlocked.

Actual Behavior
----------------
Controller restore fails during the recover-ceph-data task

Reproducibility
---------------
First time this issue has been seen in sanity

System Configuration
--------------------
AIO-SX

Branch/Pull Time/Commit
-----------------------
2020-03-26_19-39-18

Last Pass
---------
2020-03-25_21-02-05

Test Activity
-------------
Regression

Logs:
/folk/cgts-pv/bnr/152/localhost_platform_backup_2020_03_27_14_44_55.tgz

Revision history for this message
Ovidiu Poncea (ovidiuponcea) wrote :

The problem seems to be caused by /tmp/mon-store not being present (a long-shot guess at this point).
That store is created only when at least one OSD is detected and scanned. A scan logs "Scanning osd-0", but that never happens for us: we only get the message that comes after the OSD scan, "Rebuilding monitor data.\n"
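
For reference, a simplified outline of the two-phase recovery involved (illustrative only; the assumed OSD data-path layout under /var/lib/ceph/osd may differ from what the real script uses). Phase 1 populates /tmp/mon-store from each OSD it scans; phase 2 rebuilds the monitor DB from that store, so if phase 1 scans nothing the store stays missing or empty and phase 2 fails:

    import glob
    import subprocess

    mon_store = "/tmp/mon-store"
    # Phase 1: harvest monitor data from every OSD we can find.
    for osd in glob.glob("/var/lib/ceph/osd/ceph-*"):  # assumed layout
        print("Scanning %s" % osd)
        subprocess.check_call(["ceph-objectstore-tool", "--data-path", osd,
                               "--op", "update-mon-db",
                               "--mon-store-path", mon_store])
    # Phase 2: rebuild the monitor store from the harvested data.
    print("Rebuilding monitor data.")
    subprocess.check_call(["ceph-monstore-tool", mon_store, "rebuild"])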

Some questions come out of this: could it be related to the NVMe drives used on this setup? If so, why does it not reproduce every time?

To be sure we need:
1. The system left in that state
2. Improved logging for the next time we hit this or a similar issue, since this is the first time it has happened in sanity.

Details:
The problem is a failure of 'ceph-monstore-tool /tmp/mon-store rebuild', executed from recover_ceph_data.py.

The return code is 22; looking over the ceph-monstore-tool source code, it could be returned from almost anywhere. Still, one likely candidate is EINVAL (22, "Invalid argument"), which usually happens when an argument is wrong, such as a missing file.
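
A quick sanity check of that errno mapping, on any Python on Linux:

    # Confirm that return code 22 corresponds to EINVAL.
    import errno
    print(errno.errorcode[22])  # -> 'EINVAL'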

Looking over the code that calls this command, I see we have no logging:
            subprocess.check_output(["ceph-objectstore-tool", "--data-path",
                                     osd, "--op", "update-mon-db",
                                     "--mon-store-path",
                                     mon_store], stderr=fnull)
        print("Rebuilding monitor data.")
        subprocess.check_output(["ceph-monstore-tool", mon_store, "rebuild"],
                                stderr=fnull)

Since this is hard to reproduce, we should add better logging that includes both stdout and stderr for these two commands, and, before the "Rebuilding monitor data" log, we should check that the mon_store path is present.
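
As a minimal sketch of what that could look like (Python 2.7, matching the script's environment; not necessarily what the eventual fix does), the rebuild call could be wrapped so that output is captured and the store is verified first:

    import os
    import subprocess

    MON_STORE = "/tmp/mon-store"  # path used by recover_ceph_data.py

    def run_logged(cmd):
        # Capture stdout and stderr together so a failure carries its
        # diagnostics into the ansible output.
        proc = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                                stderr=subprocess.STDOUT)
        out, _ = proc.communicate()
        print("CMD: %s\nOUTPUT:\n%s" % (" ".join(cmd), out))
        if proc.returncode != 0:
            raise subprocess.CalledProcessError(proc.returncode, cmd,
                                                output=out)
        return out

    # Verify the OSD scan actually populated the store before rebuilding.
    if not os.path.isdir(MON_STORE) or not os.listdir(MON_STORE):
        raise RuntimeError("%s missing or empty; OSD scan likely failed"
                           % MON_STORE)
    run_logged(["ceph-monstore-tool", MON_STORE, "rebuild"])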

Revision history for this message
Ghada Khalil (gkhalil) wrote :

stx.4.0 / medium priority - intermittent issue; needs further investigation

tags: added: stx.4.0 stx.storage
Changed in starlingx:
importance: Undecided → Medium
status: New → Triaged
assignee: nobody → Mihnea Saracin (msaracin)
Ghada Khalil (gkhalil)
tags: added: stx.retestneeded
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to ansible-playbooks (master)

Fix proposed to branch: master
Review: https://review.opendev.org/739203

Changed in starlingx:
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to ansible-playbooks (master)

Reviewed: https://review.opendev.org/739203
Committed: https://git.openstack.org/cgit/starlingx/ansible-playbooks/commit/?id=d1134785afb0196b7c39774a04e18762c7175383
Submitter: Zuul
Branch: master

commit d1134785afb0196b7c39774a04e18762c7175383
Author: Mihnea Saracin <email address hidden>
Date: Fri Jul 3 13:26:53 2020 +0300

    Improve logging at B&R when we recover ceph data

    Add more logging to 'recover_ceph_data.py' script when
    we do a restore of the system.
    Also added a check to abort the restore operation
    if 'wipe_ceph_osds=false' but the ceph osds
    partition is not present on the system.

    Change-Id: Ib32b6573689b1d442a01575c49aea6d75e66c35d
    Partial-Bug: 1869428
    Signed-off-by: Mihnea Saracin <email address hidden>
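
For illustration, a hedged sketch of the kind of guard the commit message describes (the merged change in review 739203 may implement it differently; the GUID below is the standard Ceph OSD GPT partition-type identifier):

    import subprocess

    # Standard Ceph OSD GPT partition-type GUID.
    CEPH_OSD_GUID = "4fbd7e29-9d25-41b8-afd0-062c0ceff05d"

    def ceph_osd_partition_present():
        # List partition-type GUIDs for all block devices.
        out = subprocess.check_output(
            ["lsblk", "-n", "-o", "PARTTYPE"]).decode()
        return CEPH_OSD_GUID in out.lower()

    wipe_ceph_osds = False  # restore playbook parameter
    if not wipe_ceph_osds and not ceph_osd_partition_present():
        raise SystemExit("wipe_ceph_osds=false but no Ceph OSD partition "
                         "was found; aborting restore")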

Revision history for this message
Frank Miller (sensfan22) wrote :

Removed the stx.4.0 tag and added stx.5.0: This is an enhancement to logging and as such does not need to be ported to the r/stx.4.0 branch.

tags: added: stx.5.0
removed: stx.4.0
Revision history for this message
Austin Sun (sunausti) wrote :

https://review.opendev.org/#/c/739203/ has merged; the status should be changed to Fix Released.

Changed in starlingx:
status: In Progress → Fix Released
Ghada Khalil (gkhalil)
tags: removed: stx.retestneeded