Brief Description
-----------------
AIO-SX Platform restore fails during recover ceph data.
E
E TASK [recover-ceph-data : Recover ceph-data] *************************************************************************************************************************************************************************************************************
E fatal: [localhost]: FAILED! => {"changed": true, "msg": "non-zero return code", "rc": 1, "stderr": "Traceback (most recent call last):\n File \"/tmp/.ansible-sysadmin/tmp/ansible-tmp-1585335334.07-179451876563963/recover_ceph_data.py\", line 37, in <module>\n recover_ceph_data()\n File \"/tmp/.ansible-sysadmin/tmp/ansible-tmp-1585335334.07-179451876563963/recover_ceph_data.py\", line 33, in recover_ceph_data\n stderr=fnull)\n File \"/usr/lib64/python2.7/subprocess.py\", line 575, in check_output\n raise CalledProcessError(retcode, cmd, output=output)\nsubprocess.CalledProcessError: Command '['ceph-monstore-tool', '/tmp/mon-store', 'rebuild']' returned non-zero exit status 22\n", "stderr_lines": ["Traceback (most recent call last):", " File \"/tmp/.ansible-sysadmin/tmp/ansible-tmp-1585335334.07-179451876563963/recover_ceph_data.py\", line 37, in <module>", " recover_ceph_data()", " File \"/tmp/.ansible-sysadmin/tmp/ansible-tmp-1585335334.07-179451876563963/recover_ceph_data.py\", line 33, in recover_ceph_data", " stderr=fnull)", " File \"/usr/lib64/python2.7/subprocess.py\", line 575, in check_output", " raise CalledProcessError(retcode, cmd, output=output)", "subprocess.CalledProcessError: Command '['ceph-monstore-tool', '/tmp/mon-store', 'rebuild']' returned non-zero exit status 22"], "stdout": "Rebuilding monitor data.\n", "stdout_lines": ["Rebuilding monitor data."]}
E
E PLAY RECAP ***********************************************************************************************************************************************************************************************************************************************
E localhost : ok=419 changed=241 unreachable=0 failed=1
tc_sysinstall/fresh_install/restore_system/restore_helper.py:159: AssertionError
Severity
--------
Major
Steps to Reproduce
------------------
1. Make sure the AIO_SX system is UP & ACTIVE
2. Do a backup from the active controller
3. Re-install active controller
4. scp the backup file to the controller
5. Restore the active controller from backup file
ansible-playbook /usr/share/ansible/stx-ansible/playbooks/restore_platform.yml -e "initial_backup_dir=/home/sysadmin ansible_become_pass=Li69nux* admin_password=Li69nux* backup_filename=localhost_platform_backup_2020_03_27_14_44_55.tgz"
Expected Behavior
------------------
AIO-SX controller should be restored and ready to be unlocked.
Actual Behavior
----------------
Controller restore fails
Reproducibility
---------------
First time this issue has been seen in sanity
System Configuration
--------------------
AIO-SX
Branch/Pull Time/Commit
-----------------------
2020-03-26_19-39-18
Last Pass
---------
2020-03-25_21-02-05
Test Activity
-------------
Regression
Logs:
/folk/cgts-pv/bnr/152/localhost_platform_backup_2020_03_27_14_44_55.tgz
The problem seems to be caused by /tmp/mon-store not being present (this is a long-shot guess at this point).
This directory is populated when at least one OSD is detected and scanned. In that case the logs print "Scanning osd-0", but that doesn't happen for us: we only get the message that comes after the OSD scan, "Rebuilding monitor data.\n".
Some questions come out of this: could it be related to the fact that we use NVMe drives on this setup? If so, why does it not reproduce every time?
To be sure we need:
1. The system left in that state
2. Since this is the first time it happened in sanity, we also need improved logging for the next time we hit this or a similar issue.
Details: the failure is in 'ceph-monstore-tool /tmp/mon-store rebuild', executed from recover_ceph_data.py.
The return code is 22. Looking over the ceph-monstore-tool source code it could be returned from almost anywhere, but one likely candidate is "EINVAL 22 Invalid argument", which usually happens when an argument is wrong, such as a missing file.
Looking over the code that calls this command, I see we have no logging:
subprocess.check_output(["ceph-objectstore-tool", "--data-path",
                         osd, "--op", "update-mon-db",
                         "--mon-store-path",
                         mon_store], stderr=fnull)
print("Rebuilding monitor data.")
subprocess.check_output(["ceph-monstore-tool", mon_store, "rebuild"],
                        stderr=fnull)
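To illustrate why the stderr=fnull pattern hurts here (a standalone sketch, not the actual recover_ceph_data.py code; FAILING_CMD is a stand-in for ceph-monstore-tool failing with EINVAL): subprocess.check_output raises CalledProcessError carrying the return code, but anything the tool wrote to a discarded stderr is lost, whereas folding stderr into the captured output preserves the diagnostic.

```python
import subprocess

# A stand-in command that writes a diagnostic to stderr and exits 22,
# mimicking ceph-monstore-tool failing with EINVAL.
FAILING_CMD = ["python3", "-c",
               "import sys; sys.stderr.write('bad arg\\n'); sys.exit(22)"]

def run(cmd, stderr):
    """Run cmd via check_output; return (returncode, captured output)."""
    try:
        return 0, subprocess.check_output(cmd, stderr=stderr)
    except subprocess.CalledProcessError as e:
        return e.returncode, e.output

# With stderr discarded (the playbook's stderr=fnull pattern), only the
# return code survives -- the tool's diagnostic is gone.
rc, out = run(FAILING_CMD, subprocess.DEVNULL)
print(rc, out)    # 22 b''

# With stderr folded into the captured output, the diagnostic is kept.
rc, out = run(FAILING_CMD, subprocess.STDOUT)
print(rc, out)    # 22 b'bad arg\n'
```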
Since this is hard to reproduce, we should add better logging that captures both stdout and stderr for these two commands, and before printing "Rebuilding monitor data." we should check that the mon_store directory is present and non-empty.
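A minimal sketch of what that could look like (run_logged and rebuild_mon_store are illustrative names, not existing recover_ceph_data.py helpers, and the guard on mon_store is the check proposed above):

```python
import os
import subprocess

def run_logged(cmd):
    """Run cmd, capturing stdout and stderr so failures are diagnosable."""
    try:
        out = subprocess.check_output(cmd, stderr=subprocess.STDOUT)
        print("%s succeeded:\n%s" % (" ".join(cmd), out.decode()))
        return out
    except subprocess.CalledProcessError as e:
        print("%s failed with rc=%s:\n%s"
              % (" ".join(cmd), e.returncode, e.output.decode()))
        raise

def rebuild_mon_store(osds, mon_store="/tmp/mon-store"):
    for osd in osds:
        print("Scanning %s" % osd)
        run_logged(["ceph-objectstore-tool", "--data-path", osd,
                    "--op", "update-mon-db",
                    "--mon-store-path", mon_store])
    # Guard: fail with a clear message instead of letting
    # ceph-monstore-tool return an opaque EINVAL on a missing store.
    if not os.path.isdir(mon_store) or not os.listdir(mon_store):
        raise RuntimeError("%s missing or empty; was any OSD scanned?"
                           % mon_store)
    print("Rebuilding monitor data.")
    run_logged(["ceph-monstore-tool", mon_store, "rebuild"])
```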