B&R: Running 'wipedisk' before a restore breaks Ceph OSDs leading to data loss

Bug #1885560 reported by Ovidiu Poncea
Affects: StarlingX
Status: Fix Released
Importance: Medium
Assigned to: Mihnea Saracin

Bug Description

Brief Description
-----------------
Running wipedisk before a restore wipes the Ceph journal data. Without a valid journal, Ceph fails to restore.
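
For context, with the filestore backend each OSD's journal is normally a symlink to a raw partition. A minimal sketch of how the damage can be spotted by hand, assuming the default paths of an AIO-SX deployment (the paths and the inspected offset are assumptions, not taken from this report):

    # The OSD journal is a symlink to a raw partition.
    ls -l /var/lib/ceph/osd/ceph-0/journal        # e.g. -> /dev/disk/by-partuuid/<uuid>
    # The fsid the OSD expects to find in the journal header:
    cat /var/lib/ceph/osd/ceph-0/fsid
    # Peek at the start of the journal partition; after wipedisk it reads back as
    # zeros, which matches the "ondisk fsid 00000000-..." error shown further down.
    sudo hexdump -C -n 64 "$(readlink -f /var/lib/ceph/osd/ceph-0/journal)"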

Severity
--------
Major: System/Feature is usable but degraded

Steps to Reproduce
------------------
1. Configure an AIO-SX
2. Backup the system
3. Run wipedisk
4. Reinstall ISO
5. Run the ansible restore playbook (illustrative commands for these steps are sketched below)
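
Roughly, steps 2-5 map onto the commands below. The playbook paths and extra-vars are assumptions for a typical StarlingX deployment, not values taken from this report:

    # 2. Back up the platform
    ansible-playbook /usr/share/ansible/stx-ansible/playbooks/backup.yml \
        -e "ansible_become_pass=<sysadmin password> admin_password=<admin password>"

    # 3. Wipe the disks on the host (the step that destroys the Ceph journals)
    sudo wipedisk

    # 4. Reinstall the ISO, then
    # 5. run the platform restore playbook
    ansible-playbook /usr/share/ansible/stx-ansible/playbooks/restore_platform.yml \
        -e "initial_backup_dir=/home/sysadmin backup_filename=<platform backup>.tgz"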

Expected Behavior
------------------
The ansible restore playbook should complete successfully.

Actual Behavior
----------------
Ansible fails with:

Error message:
2020-06-24 15:12:20,711 p=11425 u=sysadmin | TASK [recover-ceph-data : Bring up ceph Monitor and OSDs] **********************
2020-06-24 15:12:22,963 p=11425 u=sysadmin | fatal: [localhost]: FAILED! => {"changed": true, "cmd": ["/etc/init.d/ceph", "start"], "delta": "0:00:02.003026", "end": "2020-06-24 15:12:22.933556", "msg": "non-zero return code", "rc": 1, "start": "2020-06-24 15:12:20.930530", "stderr": "2020-06-24 15:12:21.790 7f4307d66140 -1 mon.controller-0@-1(probing).mgr e1 Failed to load mgr commands: (2) No such file or directory\n2020-06-24 15:12:22.901 7fd4c2d011c0 -1 journal FileJournal::open: ondisk fsid 00000000-0000-0000-0000-000000000000 doesn't match expected ff145265-25f5-4986-87ca-398fcad6fd5b, invalid (someone else's?) journal\n2020-06-24 15:12:22.902 7fd4c2d011c0 -1 filestore(/var/lib/ceph/osd/ceph-0) mount(1871): failed to open journal /var/lib/ceph/osd/ceph-0/journal: (22) Invalid argument\n2020-06-24 15:12:22.902 7fd4c2d011c0 -1 osd.0 0 OSD:init: unable to mount object store\n2020-06-24
[….]

Reproducibility
---------------
Reproducible

System Configuration
--------------------
AIO-SX, AIO-DX, and Standard configurations with Ceph enabled.

Branch/Pull Time/Commit
-----------------------
master

Test Activity
-------------
Developer Testing

Workaround
----------
Avoid running wipedisk during B&R (see the sketch below).
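
In practice that means skipping the wipedisk step between the backup and the ISO reinstall, then restoring while the OSD data and journals are preserved. A sketch, assuming the same restore playbook path as above and the wipe_ceph_osds extra-var referenced in the fix below (both assumptions):

    # Do NOT run 'wipedisk' between the backup and the ISO reinstall.
    # After reinstalling the ISO, restore while keeping the Ceph OSDs intact:
    ansible-playbook /usr/share/ansible/stx-ansible/playbooks/restore_platform.yml \
        -e "initial_backup_dir=/home/sysadmin backup_filename=<platform backup>.tgz" \
        -e "wipe_ceph_osds=false"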

Revision history for this message
Ovidiu Poncea (ovidiuponcea) wrote :
summary: - Running wipedisk before a restore breaks Ceph OSDs leading to data loss
+ B&R: Running 'wipedisk' before a restore breaks Ceph OSDs leading to
+ data loss
Revision history for this message
Ovidiu Poncea (ovidiuponcea) wrote :

Although https://review.opendev.org/#/c/729599/ fixes an issue, it also introduces this subtle problem.

The issue is caused by the fact that, before the ansible restore is executed, we don't yet know that the system is doing a B&R, so we can't avoid wiping Ceph's journals. Therefore the solution is to move this fix to the kickstarts, as they have more information about the ongoing operation (normal install, upgrade, or B&R).

Note that an ansible change to wipe_osds.sh will also be needed, to extend the wiping of Ceph journals to the reinstall case.
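
A rough sketch of what such a wipe_osds.sh extension could look like; this is not the actual change, and the journal type GUID (the one conventionally used in GPT for Ceph journal partitions) and the 100MB wipe size are assumptions:

    #!/bin/bash
    # Sketch only: wipe Ceph journal partitions in addition to OSD data
    # partitions when a full wipe is requested (e.g. on reinstall).
    CEPH_JOURNAL_GUID="45b0969e-9b03-4f30-b4c6-b4b80ceff106"

    for disk in $(lsblk -dno NAME); do
        # Walk the partitions on this disk and inspect their GPT type GUIDs.
        for part in $(lsblk -lno NAME "/dev/${disk}" | tail -n +2); do
            partnum="${part##*[!0-9]}"
            [ -n "${partnum}" ] || continue
            guid=$(sgdisk -i "${partnum}" "/dev/${disk}" 2>/dev/null |
                   awk '/Partition GUID code/ {print tolower($4)}')
            if [ "${guid}" = "${CEPH_JOURNAL_GUID}" ]; then
                echo "Wiping Ceph journal partition /dev/${part}"
                # Zero the start of the partition so the FileJournal header
                # (and the fsid stored in it) is destroyed.
                dd if=/dev/zero of="/dev/${part}" bs=1M count=100 conv=fsync
            fi
        done
    done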

Ghada Khalil (gkhalil)
tags: added: stx.storage
description: updated
Frank Miller (sensfan22)
Changed in starlingx:
assignee: nobody → Mihnea Saracin (msaracin)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to metal (master)

Fix proposed to branch: master
Review: https://review.opendev.org/741413

Changed in starlingx:
status: New → In Progress
Revision history for this message
Ghada Khalil (gkhalil) wrote :

stx.5.0 - issue can be avoided by not wiping the disk before a restore

tags: added: stx.5.0
Changed in starlingx:
importance: Undecided → Medium
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to metal (master)

Reviewed: https://review.opendev.org/741413
Committed: https://git.openstack.org/cgit/starlingx/metal/commit/?id=ff4ba7f84fa4627413af4b5099212b7791d75d77
Submitter: Zuul
Branch: master

commit ff4ba7f84fa4627413af4b5099212b7791d75d77
Author: Mihnea Saracin <email address hidden>
Date: Thu Jul 16 12:03:07 2020 +0300

    Fix wipedisk to not break Ceph OSDs during B&R

    Before the ansible restore playbook is executed,
    we don't know that the system is doing the
    B&R procedure so we can't avoid wiping Ceph journals
    even if `wipe_ceph_osds` variable is set to false.
    So we fix this by moving the code that
    is handling the removal of the Ceph journals
    in the kickstart files because there
    we can check if the system is doing B&R.

    Closes-Bug: 1885560
    Change-Id: Ie9676de1eb4b6e3868db724f04f647b33660b842
    Signed-off-by: Mihnea Saracin <email address hidden>
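
Conceptually, the kickstart side of the merged fix boils down to a guard like the one below. This is illustrative only; both helper functions are hypothetical stubs standing in for the real kickstart logic in the review above:

    #!/bin/bash
    # Hypothetical stand-in for however the kickstart detects an ongoing B&R,
    # e.g. a marker left on disk or an install-time parameter.
    system_is_doing_backup_restore() {
        false
    }

    # Stub for the code that removes Ceph OSD and journal partitions.
    wipe_ceph_partitions() {
        echo "wiping Ceph OSD and journal partitions"
    }

    if system_is_doing_backup_restore; then
        echo "B&R in progress: leaving Ceph OSD and journal partitions untouched"
    else
        wipe_ceph_partitions
    fi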

Changed in starlingx:
status: In Progress → Fix Released