Brief Description
-----------------
The Ceph monitor and OSDs fail to come up during the restore of an AIOSX system.
Severity
--------
Critical - AIOSX fails to restore from backup data. This failure also blocks the development of AIOSX subcloud restore.
Steps to reproduce
------------------
Back up an AIOSX system, either locally or remotely.
Restore the AIOSX system from the backup, either locally or remotely.
Expected Behavior
-----------------
The system is restored from the backup data successfully
Actual Behavior
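---------------
The AIOSX restore fails during Ceph bring-up: the "/etc/init.d/ceph start" command returns a non-zero exit code after about 50 minutes, and ceph-mds fails to start. This might be related to the CephFS feature introduced in late 2020.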
TASK [recover-ceph-data : Bring up ceph Monitor and OSDs] **************************************************
fatal: [wcp111]: FAILED! => {"changed": true, "cmd": ["/etc/init.d/ceph", "start"], "delta": "0:50:02.034969", "end": "2021-01-19 19:47:18.128137", "msg": "non-zero return code", "rc": 1, "start": "2021-01-19 18:57:16.093168", "stderr": "failed to fetch mon config (--no-mon-config to skip)\n2021-01-19 19:47:16.858 7f34896d3140 -1 mon.controller-0@-1(probing).mgr e1 Failed to load mgr commands: (2) No such file or directory\n2021-01-19 19:47:17.916 7ff86a0091c0 -1 journal do_read_entry(3395584): bad header magic\n2021-01-19 19:47:17.916 7ff86a0091c0 -1 journal do_read_entry(3395584): bad header magic\n2021-01-19 19:47:17.997 7ff86a0091c0 -1 osd.0 19 log_to_monitors {default=true}", "stderr_lines": ["failed to fetch mon config (--no-mon-config to skip)", "2021-01-19 19:47:16.858 7f34896d3140 -1 mon.controller-0@-1(probing).mgr e1 Failed to load mgr commands: (2) No such file or directory", "2021-01-19 19:47:17.916 7ff86a0091c0 -1 journal do_read_entry(3395584): bad header magic", "2021-01-19 19:47:17.916 7ff86a0091c0 -1 journal do_read_entry(3395584): bad header magic", "2021-01-19 19:47:17.997 7ff86a0091c0 -1 osd.0 19 log_to_monitors {default=true}"], "stdout": "=== mds.controller-0 === \nStarting Ceph mds.controller-0 on controller-0...\nfailed: 'ulimit -n 32768; TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES=134217728 /usr/bin/ceph-mds -i controller-0 --pid-file /var/run/ceph/mds.controller-0.pid -c /etc/ceph/ceph.conf --cluster ceph '\n=== mon.controller-0 === \nStarting Ceph mon.controller-0 on controller-0...\n=== osd.0 === \nStarting Ceph osd.0 on controller-0...\nstarting osd.0 at - osd_data /var/lib/ceph/osd/ceph-0 /var/lib/ceph/osd/ceph-0/journal", "stdout_lines": ["=== mds.controller-0 === ", "Starting Ceph mds.controller-0 on controller-0...", "failed: 'ulimit -n 32768; TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES=134217728 /usr/bin/ceph-mds -i controller-0 --pid-file /var/run/ceph/mds.controller-0.pid -c /etc/ceph/ceph.conf --cluster ceph '", "=== mon.controller-0 === ", "Starting Ceph mon.controller-0 on controller-0...", "=== osd.0 === ", "Starting Ceph osd.0 on controller-0...", "starting osd.0 at - osd_data /var/lib/ceph/osd/ceph-0 /var/lib/ceph/osd/ceph-0/journal"]}
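The failure payload above is hard to read as a single line. As a triage aid, the following is a minimal sketch (not part of the playbooks) that pulls the host, return code, elapsed time, and flagged Ceph lines out of an Ansible "fatal" line. The function name summarize_failure and the SAMPLE_LINE constant are hypothetical; the sample is abridged from the output above, and treating the " -1 " marker as an error indicator is an assumption based on the lines shown in this report.

```python
import json
from datetime import datetime

# Abridged copy of the "fatal" line from this report (most fields omitted).
SAMPLE_LINE = (
    'fatal: [wcp111]: FAILED! => {"changed": true, '
    '"cmd": ["/etc/init.d/ceph", "start"], "rc": 1, '
    '"start": "2021-01-19 18:57:16.093168", '
    '"end": "2021-01-19 19:47:18.128137", '
    '"stderr_lines": ['
    '"failed to fetch mon config (--no-mon-config to skip)", '
    '"2021-01-19 19:47:17.916 7ff86a0091c0 -1 journal do_read_entry(3395584): bad header magic"'
    ']}'
)

TIME_FORMAT = "%Y-%m-%d %H:%M:%S.%f"


def summarize_failure(fatal_line: str) -> dict:
    """Summarize an Ansible 'fatal: [host]: FAILED! => {...}' line."""
    # Everything after "=> " is the JSON result of the failed task.
    prefix, _, payload = fatal_line.partition("=> ")
    result = json.loads(payload)

    # The host name sits between the first pair of square brackets.
    host = prefix[prefix.index("[") + 1 : prefix.index("]")]

    start = datetime.strptime(result["start"], TIME_FORMAT)
    end = datetime.strptime(result["end"], TIME_FORMAT)

    # Assumption: Ceph lines carrying " -1 " are the ones worth reading first.
    flagged = [line for line in result.get("stderr_lines", []) if " -1 " in line]

    return {
        "host": host,
        "cmd": " ".join(result["cmd"]),
        "rc": result["rc"],
        "elapsed": end - start,
        "flagged": flagged,
    }


if __name__ == "__main__":
    summary = summarize_failure(SAMPLE_LINE)
    print(f'{summary["host"]}: "{summary["cmd"]}" rc={summary["rc"]} '
          f'after {summary["elapsed"]}')
    for line in summary["flagged"]:
        print("  " + line)
```

Run against the sample, the elapsed time matches the "delta" field reported by Ansible, which makes the roughly 50-minute wait in the Ceph start command easy to spot.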
Reproducibility
---------------
Tried only once, but likely reproducible.
System Configuration
--------------------
AIOSX, but the issue may also affect the restore of AIODX and standard configurations.
Branch/Pull Time/Commit
-----------------------
Jan. 18th master load
Last Pass
---------
Last successful restore test was conducted by the test team on Dec. 12th, 2020 with AIODX.
Timestamp/Logs
--------------
See attached
Test Activity
-------------
Developer Testing
Workaround
----------
Fix posted: https://review.opendev.org/c/starlingx/ansible-playbooks/+/771764/