race condition causes reformat of osd - ceph-osd processes and charm config-changed hook upon boot

Bug #1698154 reported by Gábor Mészáros
This bug affects 6 people
Affects: Ceph OSD Charm
Status: Fix Released
Importance: Critical
Assigned to: Frode Nordahl

Bug Description

Current behavior:
When a host runs several ceph-osd processes, some may be slower to start at boot, which causes the unit's config-changed hook to mistakenly consider those OSDs down.

Expected behavior:
The config-changed hook should wait while ceph-osd daemons are still starting or respawning.

Actual result:
If the osd-reformat option is set, this behaviour causes some OSDs to be re-osdized during the boot process instead of waiting for the ceph-osd daemon to finish starting. With the option disabled, all OSDs eventually come up, but we lose the advantages of having osd-reformat set.

Steps to reproduce:
Cleanly reboot the node. The problem does not occur on every boot, nor always with the same OSD.
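
For illustration, the kind of guard the expected behaviour describes could look like the following Python sketch (not charm code; it assumes systemd-managed OSDs with unit names like ceph-osd@<id>):

    import subprocess
    import time

    def wait_for_osds_to_settle(timeout=300, interval=5):
        """Block until no ceph-osd systemd unit is still starting up."""
        deadline = time.time() + timeout
        while time.time() < deadline:
            out = subprocess.check_output(
                ['systemctl', 'list-units', '--plain', '--no-legend',
                 'ceph-osd@*'], universal_newlines=True)
            # list-units columns: UNIT LOAD ACTIVE SUB DESCRIPTION;
            # SUB is e.g. 'running', 'start-pre' or 'auto-restart'.
            subs = [l.split()[3] for l in out.splitlines() if l.strip()]
            if all(s == 'running' for s in subs):
                return True
            time.sleep(interval)
        return False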

Tags: 4010 sts
Revision history for this message
James Page (james-page) wrote :

Urgh that's pretty ugly - I think we'll need to record which OSD block devices have already been initialised on a specific unit, and exclude those from any future osd init scans/runs that might run on a config-changed hook execution.
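
For illustration, that bookkeeping could live in the unit's local key-value store; a minimal sketch assuming charmhelpers' unitdata (the helper names here are hypothetical, not the charm's actual API):

    from charmhelpers.core import unitdata

    def osdized_devices():
        """Return the block devices this unit has already initialised."""
        return unitdata.kv().get('osd-devices', [])

    def mark_osdized(device):
        """Record a device so future config-changed runs skip it."""
        db = unitdata.kv()
        devices = db.get('osd-devices', [])
        if device not in devices:
            devices.append(device)
            db.set('osd-devices', devices)
            db.flush()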

Changed in charm-ceph-osd:
status: New → Triaged
importance: Undecided → High
milestone: none → 18.02
Ryan Beisner (1chb1n)
Changed in charm-ceph-osd:
milestone: 18.02 → 18.05
Felipe Reyes (freyes)
tags: added: sts
Revision history for this message
Frode Nordahl (fnordahl) wrote :

I believe this bug is fixed by the commit in https://review.openstack.org/#/c/559964/

Changed in charm-ceph-osd:
assignee: nobody → James Page (james-page)
status: Triaged → Fix Committed
Revision history for this message
Felipe Reyes (freyes) wrote :

This doesn't seem to address the problem for already-deployed clusters; those environments won't have osd-devices populated.

During the upgrade-charm hook, some code is needed to populate the key-value store with the devices that have already been osdized.
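
A sketch of what that upgrade-charm step could do, assuming OSD data directories under /var/lib/ceph/osd and the findmnt utility (purely illustrative); the resulting list could then be written to the same key-value store sketched above:

    import glob
    import subprocess

    def devices_backing_mounted_osds():
        """Find the block devices behind already-mounted OSD directories,
        suitable for seeding the charm's key-value store on upgrade-charm."""
        devices = []
        for osd_dir in glob.glob('/var/lib/ceph/osd/ceph-*'):
            try:
                # findmnt prints the source device of a mount point.
                dev = subprocess.check_output(
                    ['findmnt', '-n', '-o', 'SOURCE', osd_dir],
                    universal_newlines=True).strip()
            except subprocess.CalledProcessError:
                continue  # directory exists but is not a mount point
            if dev:
                devices.append(dev)
        return devices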

Changed in charm-ceph-osd:
status: Fix Committed → New
Frode Nordahl (fnordahl)
Changed in charm-ceph-osd:
assignee: James Page (james-page) → nobody
Revision history for this message
Trent Lloyd (lathiat) wrote :

The currently committed scheme to keep track of which osd-devices were already formatted is likely not a sufficient safeguard and needs further improvement. Reasons are:

(1) Device letters change (i.e. when adding a new disk, but it comes up earlier in the device list order). This is very common on production hardware, not a side case. Devices are also not guaranteed to come up in the same order. It would be easy for an existing OSD to get pushed off into a device name that is not listed or previously initialised and accidentally get re-initialised.

(2) New OSDs are added to the host manually with ceph commands (in a way this is similar to the above). This has happened on production deployments when the add-disk action command was broken. However, even if you use the add-disk action, the device may not be listed in the osd-devices config option if the action was run before the charm stored the list of previously initialised osd-devices (the action uses osdize(), so new deployments would populate it - but old deployments won't). Even if we added code to populate this value on existing clusters from the osd-devices config option, OSDs that are in use but not in the config option won't be added to the previously-initialised list and thus are still vulnerable to being reformatted.

(3) This is slightly outside the charms' intended use cases, but it is something I can still see people doing, and it would be ideal to avoid data loss from it: moving an existing OSD from one machine to another to recover it. That is explicitly supported and generally recommended as a perfectly reasonable action within the Ceph community.

In all of the above situations, even with the improved commit, data loss could still occur. Plus a fix is still required for already deployed clusters.

An improvement on this situation would be to have the stored osd-devices list record the OSD fsid (stored in the OSDPATH/fsid file), the device ID path (e.g. /dev/disk/by-id) or similar instead of the block device path, but that won't help with the OSD-movement scenario or the manual ceph-disk initialisation scenario.
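
For the device-path half of that suggestion, resolving a kernel name to its persistent alias is straightforward; a sketch (illustrative only):

    import os

    def stable_device_path(device):
        """Map e.g. /dev/sdb to its persistent /dev/disk/by-id symlink,
        which survives device-letter reshuffles across boots."""
        by_id = '/dev/disk/by-id'
        target = os.path.realpath(device)
        for name in sorted(os.listdir(by_id)):
            link = os.path.join(by_id, name)
            if os.path.realpath(link) == target:
                return link
        return device  # no alias found; fall back to the kernel name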

Otherwise, one way we could potentially handle this better (and which would automatically cover already-deployed OSDs): Ceph also stores the cluster "fsid", which is unique to the current ceph cluster, on the OSD in the OSDPATH/ceph_fsid file. This would be a much better way to determine whether an OSD is part of the current cluster or not. I am unsure whether there is an easy way to detect this value without first mounting the OSD, though.
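
As a sketch of that membership test (assuming the OSD data directory is already mounted, which is exactly the caveat raised above):

    def osd_belongs_to_cluster(osd_path, cluster_fsid):
        """Compare the cluster fsid recorded on an OSD with the fsid of
        the cluster this unit is part of."""
        try:
            with open('{}/ceph_fsid'.format(osd_path)) as f:
                return f.read().strip() == cluster_fsid
        except IOError:
            # No ceph_fsid file: not an initialised OSD.
            return False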

Lastly, as a minor note, one place where storing the OSD UUID is likely to go wrong in future is bluestore conversions, which require reformatting the OSD and thus give it a new UUID. That workflow may need to be considered. Bluestore and Luminous also format and partition the disk differently (in particular, LVM may be used instead of GPT in some cases), and this may change or influence any code designed to detect this information.

Having said all of that, I think that this option should be removed and instead replaced with a juju action to one-shot reformat OSDs.

In the first instance it seems like this option would be used ...


Revision history for this message
Felipe Reyes (freyes) wrote : Re: [Bug 1698154] Re: race condition causes reformat of osd - ceph-osd processes and charm config-changed hook upon boot

On Tue, May 29, 2018 at 05:57:56AM -0000, Trent Lloyd wrote:
> The currently committed scheme to keep track of which osd-devices were
> already formatted is likely not a sufficient safeguard and needs
> further improvement. Reasons are:
>
> (1) Device letters change (i.e. when adding a new disk, but it comes up
> earlier in the device list order). This is very common on production
> hardware, not a side case. Devices are also not guaranteed to come up in
> the same order. It would be easy for an existing OSD to get pushed off
> into a device name that is not listed or previously initialised and
> accidentally get re-initialised.

If a user is concerned about this, they should use /dev/disk/{by-path,by-id} in the osd-devices option. I don't believe we have any restriction on the usage of those paths; we just check the '/dev/' prefix to determine whether the argument is a directory or a disk device.

>
> (2) New OSDs are added to the host but manually with ceph commands (in a
> way this is similar to the above). This has happened on production
> deployments when the add-disk action command was broken.

We could add an action to mark a device as osdized, allowing an operator to merge the charm's view with what the system is actually running.
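
Such an action could be a thin wrapper over the charm's bookkeeping; a hypothetical sketch (neither this action nor these helpers exist in the charm today), assuming charmhelpers' hookenv and unitdata:

    from charmhelpers.core import hookenv, unitdata

    def mark_osdized():
        """Hypothetical 'mark-osdized' action: trust the operator that a
        device is already an OSD and record it in the kv store."""
        device = hookenv.action_get('device')
        db = unitdata.kv()
        recorded = db.get('osd-devices', [])
        if device not in recorded:
            db.set('osd-devices', recorded + [device])
            db.flush()
        hookenv.action_set({'recorded': device})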

> However even if you use the add-disk action, the device may not be
> listed in the osd-devices config option if the action was taken
> before the charm stored the list of previously initialised osd-devices
> (the action uses osdize(), so new deployments would populate it -
> but old deployments won't).

As I mentioned in comment #3, the upgrade-charm hook needs to populate the osd-devices list.

> Even if we added code to populate this value on existing clusters from
> the osd-devices config option, OSDs that are in use but not in the
> config option won't be added to the previously initialised list and
> thus are still vulnerable to being reformatted.

osd-reformat is not an option someone should have set in this case, and in conjunction with the action I suggested previously, a user could get to a state where the charm is in sync with the running system.

>
> (3) This is slightly out of use case for the charms, but something I can
> still see people doing and it would be ideal to avoid data loss from is
> also moving an existing OSD from one machine to another to recover it,
> which is something explicitly supported and generally recommended as a
> perfectly reasonable action within the Ceph community.

In this case the user should have osd-reformat set to false (or "").

[...]
> Having said all of that, I think that this option should be removed and instead replaced with a juju action to one-shot reformat OSDs.

This has been discussed, but the lack of support in bundles for running actions at the end of a deployment makes it not an option.

Revision history for this message
Ryan Beisner (1chb1n) wrote :

We'll be designing and implementing an action-driven approach to handle cases when a formatted device is presented for reformat. The charm will require operator decision and input in those cases.

This brings a long-standing issue with fsid re-use back to the forefront. We will need to evaluate the usefulness of explicit fsid charm config in the current world, where the charms handle that automatically. The issue with fsid re-use is that a formatted block device may contain ceph artifacts from a prior deployment; when the fsid was statically configured and matches the current deployment, ceph tries (unsuccessfully) to re-use the device without formatting it.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to charm-ceph-osd (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/571486

Ryan Beisner (1chb1n)
Changed in charm-ceph-osd:
importance: High → Critical
status: New → In Progress
assignee: nobody → Frode Nordahl (fnordahl)
Revision history for this message
Ryan Beisner (1chb1n) wrote :

This topic link represents the WIP reviews for those tracking progress:

https://review.openstack.org/#/q/topic:bug/1698154+(status:open+OR+status:merged)

Revision history for this message
Edward Hope-Morley (hopem) wrote :

I just noticed this bug - https://bugs.launchpad.net/charm-ceph-osd/+bug/1746118 - which is similar to but not the same as this one, and I believe they can/should both leverage the same fix, with a slight adjustment. 1746118 does a udevadm settle after having formatted and started OSDs, to ensure they don't get reformatted by a subsequent hook firing before all the new OSDs have been mounted. The trouble is that the settle is done only after the formatting; we should really be doing it before as well (maybe just if osd-reformat=True), to ensure there are no devices still waiting to be mounted when the code checks for unmounted devices.
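
Concretely, that would mean settling the udev queue on both sides of the format step; a sketch (timeout chosen arbitrarily):

    import subprocess

    def udev_settle(timeout=120):
        """Wait for the udev event queue to drain so freshly formatted or
        newly detected OSD devices finish mounting before we inspect them."""
        subprocess.check_call(
            ['udevadm', 'settle', '--timeout={}'.format(timeout)])

    udev_settle()   # before scanning for unmounted devices (this bug)
    # ... osdize/format and start OSDs here ...
    udev_settle()   # after formatting (as the bug 1746118 fix does)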

Revision history for this message
Edward Hope-Morley (hopem) wrote :

Addendum to my previous comment: I do agree that moving this to an action is a good idea, but if we plan to keep the osd-reformat config option then we should fix it as well.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to charm-ceph-osd (master)

Reviewed: https://review.openstack.org/571486
Committed: https://git.openstack.org/cgit/openstack/charm-ceph-osd/commit/?id=487658abe0d098bcedfd3eb2ea297ae7f6067923
Submitter: Zuul
Branch: master

commit 487658abe0d098bcedfd3eb2ea297ae7f6067923
Author: Chris MacNaughton <email address hidden>
Date: Thu May 31 15:50:06 2018 +0200

    Add action to zap disk(s)

    This action includes configuration for disk(s) to
    zap, as well as an additional required flag for
    the administrator to acknowledge pending data loss

    Change-Id: I3106e2f10cf132a628aad025f73161b04215598e
    Related-Bug: #1698154
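
The safety pattern described in this commit, refusing to act without explicit operator acknowledgement, looks roughly like the following sketch (the real action's code and parameter names may differ):

    from charmhelpers.core import hookenv

    def zap_disk():
        """Zap the given devices only after explicit acknowledgement."""
        if not hookenv.action_get('i-really-mean-it'):
            hookenv.action_fail(
                'i-really-mean-it is required: zapping destroys all data '
                'on the listed devices')
            return
        for device in hookenv.action_get('devices').split():
            # The actual wipe (e.g. sgdisk --zap-all / wipefs) is omitted.
            hookenv.log('zapping {}'.format(device))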

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-ceph-osd (master)

Reviewed: https://review.openstack.org/571452
Committed: https://git.openstack.org/cgit/openstack/charm-ceph-osd/commit/?id=22ce311b0b0b3a88abc92f7e5241bf9fd0ab947d
Submitter: Zuul
Branch: master

commit 22ce311b0b0b3a88abc92f7e5241bf9fd0ab947d
Author: Ryan Beisner <email address hidden>
Date: Thu May 31 07:00:33 2018 -0500

    No reformat

    Do not reformat devices. A subsequent change will be necessary
    to account for conditions where a reformat is still desired,
    such as a set of blocking states and user-driven actions.

    Partial-bug: #1698154

    Depends-On: I90a866aa138d18e4242783c42d4c7c587f696d7d
    Change-Id: I3a41ab38e7a1679cf4f5380a7cc56556da3aaf2b

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Reviewed: https://review.openstack.org/571690
Committed: https://git.openstack.org/cgit/openstack/charm-ceph-osd/commit/?id=352d6993870be2547f37463e4e3cffc7605f749c
Submitter: Zuul
Branch: master

commit 352d6993870be2547f37463e4e3cffc7605f749c
Author: Frode Nordahl <email address hidden>
Date: Fri Jun 1 12:15:01 2018 +0200

    Add pre-flight check for device pristinity

    Add `non-pristine` key to `list-disks` action.

    No longer attempt to do initialization of `osd-journal` devices.

    Make py27 test noop

    Flip pep8 test to py3

    Partial-Bug: #1698154
    Change-Id: I0ca574fa7f0683b4e8a693b9f62fbf6b39689789
    Depends-On: I90a866aa138d18e4242783c42d4c7c587f696d7d
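
The pristinity check is conceptually simple: a device carrying any recognisable data in its first sectors is reported under `non-pristine` instead of being touched. A minimal sketch of the idea (not the charm's actual implementation):

    def is_pristine_disk(dev, nbytes=2048):
        """Heuristic: a disk is pristine only if the start of the device
        is all zeroes (no partition table or filesystem signature)."""
        with open(dev, 'rb') as f:
            data = f.read(nbytes)
        return data == b'\0' * len(data)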

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to charm-ceph-osd (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/572696

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to charm-ceph-osd (master)

Reviewed: https://review.openstack.org/572696
Committed: https://git.openstack.org/cgit/openstack/charm-ceph-osd/commit/?id=1f68a5874e66061e61dea03f8e8aa0aded7385c9
Submitter: Zuul
Branch: master

commit 1f68a5874e66061e61dea03f8e8aa0aded7385c9
Author: Frode Nordahl <email address hidden>
Date: Wed Jun 6 09:59:32 2018 +0200

    Update tests to not set `fsid` and `monitor-secret`

    Get `fsid` from leader settings on ceph-mon unit where needed
    for validation.

    Change-Id: I751ecff76873a599c0d03ec1308e30e615e38aa8
    Related-Bug: #1698154

Frode Nordahl (fnordahl)
Changed in charm-ceph-osd:
status: In Progress → Fix Committed
David Ames (thedac)
Changed in charm-ceph-osd:
status: Fix Committed → Fix Released