race condition causes reformat of osd - ceph-osd processes and charm config-changed hook upon boot

Bug #1698154 reported by Gábor Mészáros
This bug affects 6 people
Affects: Ceph OSD Charm
Status: Fix Released
Importance: Critical
Assigned to: Frode Nordahl

Bug Description

Current behavior:
When a host runs several ceph-osd processes, some may be slower to start at boot, which causes the unit's config-changed hook to mistakenly consider those OSDs down.

Expected behavior:
The config-changed hook should wait while ceph-osd daemons are still starting or respawning.

Actual result:
If the osd-reformat option is set, this behaviour causes some OSDs to be re-osdized during the boot process instead of waiting for the ceph-osd daemon to finish starting. With the option disabled, all OSDs eventually come up, but we lose the advantages of having osd-reformat set.

Steps to reproduce:
Cleanly reboot the node. The problem does not occur on every boot, nor always with the same OSD.
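
For illustration, the kind of guard the expected behaviour describes could look like the following Python sketch (not charm code; it assumes systemd-managed OSDs with unit names like ceph-osd@<id>):

    import subprocess
    import time

    def wait_for_osds_to_settle(timeout=300, interval=5):
        """Block until no ceph-osd systemd unit is still starting up."""
        deadline = time.time() + timeout
        while time.time() < deadline:
            out = subprocess.check_output(
                ['systemctl', 'list-units', '--plain', '--no-legend',
                 'ceph-osd@*'], universal_newlines=True)
            # list-units columns: UNIT LOAD ACTIVE SUB DESCRIPTION;
            # SUB is e.g. 'running', 'start-pre' or 'auto-restart'.
            subs = [l.split()[3] for l in out.splitlines() if l.strip()]
            if all(s == 'running' for s in subs):
                return True
            time.sleep(interval)
        return False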

Tags: 4010 sts
Revision history for this message
James Page (james-page) wrote :

Urgh that's pretty ugly - I think we'll need to record which OSD block devices have already been initialised on a specific unit, and exclude those from any future osd init scans/runs that might run on a config-changed hook execution.
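
For illustration, that bookkeeping could live in the unit's local key-value store; a minimal sketch assuming charmhelpers' unitdata (the helper names here are hypothetical, not the charm's actual API):

    from charmhelpers.core import unitdata

    def osdized_devices():
        """Return the block devices this unit has already initialised."""
        return unitdata.kv().get('osd-devices', [])

    def mark_osdized(device):
        """Record a device so future config-changed runs skip it."""
        db = unitdata.kv()
        devices = db.get('osd-devices', [])
        if device not in devices:
            devices.append(device)
            db.set('osd-devices', devices)
            db.flush()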

Changed in charm-ceph-osd:
status: New → Triaged
importance: Undecided → High
milestone: none → 18.02
Ryan Beisner (1chb1n)
Changed in charm-ceph-osd:
milestone: 18.02 → 18.05
Felipe Reyes (freyes)
tags: added: sts
Revision history for this message
Frode Nordahl (fnordahl) wrote :

I believe this bug is fixed by the commit in https://review.openstack.org/#/c/559964/

Changed in charm-ceph-osd:
assignee: nobody → James Page (james-page)
status: Triaged → Fix Committed
Revision history for this message
Felipe Reyes (freyes) wrote :

This doesn't seem to address the problem for already-deployed clusters; those environments won't have osd-devices populated.

During the upgrade-charm hook, some code is needed to populate the key-value store with the devices that have already been osdized.
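
A sketch of what that upgrade-charm step could do, assuming OSD data directories under /var/lib/ceph/osd and the findmnt utility (purely illustrative); the resulting list could then be written to the same key-value store sketched above:

    import glob
    import subprocess

    def devices_backing_mounted_osds():
        """Find the block devices behind already-mounted OSD directories,
        suitable for seeding the charm's key-value store on upgrade-charm."""
        devices = []
        for osd_dir in glob.glob('/var/lib/ceph/osd/ceph-*'):
            try:
                # findmnt prints the source device of a mount point.
                dev = subprocess.check_output(
                    ['findmnt', '-n', '-o', 'SOURCE', osd_dir],
                    universal_newlines=True).strip()
            except subprocess.CalledProcessError:
                continue  # directory exists but is not a mount point
            if dev:
                devices.append(dev)
        return devices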

Changed in charm-ceph-osd:
status: Fix Committed → New
Frode Nordahl (fnordahl)
Changed in charm-ceph-osd:
assignee: James Page (james-page) → nobody
Revision history for this message
Trent Lloyd (lathiat) wrote :

The currently committed scheme to keep track of which osd-devices were already formatted is likely not a sufficient safeguard and needs further improvement. Reasons are:

(1) Device letters change (i.e. when adding a new disk, but it comes up earlier in the device list order). This is very common on production hardware, not a side case. Devices are also not guaranteed to come up in the same order. It would be easy for an existing OSD to get pushed off into a device name that is not listed or previously initialised and accidentally get re-initialised.

(2) New OSDs are added to the host manually with ceph commands (in a way this is similar to the above). This has happened on production deployments when the add-disk action command was broken. However, even if you use the add-disk action, the device may not be listed in the osd-devices config option if the action was run before the charm stored the list of previously initialised osd-devices (the action uses osdize(), so new deployments would populate it - but old deployments won't). Even if we added code to populate this value on existing clusters from the osd-devices config option, OSDs that are in use but not in the config option won't be added to the previously-initialised list and thus are still vulnerable to being reformatted.

(3) This is slightly outside the charms' intended use cases, but it is something I can still see people doing, and it would be ideal to avoid data loss from it: moving an existing OSD from one machine to another to recover it. That is explicitly supported and generally recommended as a perfectly reasonable action within the Ceph community.

In all of the above situations, even with the improved commit, data loss could still occur. Plus a fix is still required for already deployed clusters.

An improvement on this situation would be to have the stored osd-devices list record the OSD fsid (stored in the OSDPATH/fsid file), the device ID path (e.g. /dev/disk/by-id) or similar instead of the block device path, but that won't help with the OSD-movement scenario or the manual ceph-disk initialisation scenario.
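
For the device-path half of that suggestion, resolving a kernel name to its persistent alias is straightforward; a sketch (illustrative only):

    import os

    def stable_device_path(device):
        """Map e.g. /dev/sdb to its persistent /dev/disk/by-id symlink,
        which survives device-letter reshuffles across boots."""
        by_id = '/dev/disk/by-id'
        target = os.path.realpath(device)
        for name in sorted(os.listdir(by_id)):
            link = os.path.join(by_id, name)
            if os.path.realpath(link) == target:
                return link
        return device  # no alias found; fall back to the kernel name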

Otherwise, one way we could potentially handle this better (and which would automatically cover already-deployed OSDs): Ceph also stores the cluster "fsid", which is unique to the current ceph cluster, on the OSD in the OSDPATH/ceph_fsid file. This would be a much better way to determine whether an OSD is part of the current cluster or not. I am unsure whether there is an easy way to detect this value without first mounting the OSD, though.
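
As a sketch of that membership test (assuming the OSD data directory is already mounted, which is exactly the caveat raised above):

    def osd_belongs_to_cluster(osd_path, cluster_fsid):
        """Compare the cluster fsid recorded on an OSD with the fsid of
        the cluster this unit is part of."""
        try:
            with open('{}/ceph_fsid'.format(osd_path)) as f:
                return f.read().strip() == cluster_fsid
        except IOError:
            # No ceph_fsid file: not an initialised OSD.
            return False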

Lastly, as a minor note, one place where storing the OSD UUID is likely to go wrong in future is bluestore conversions, which require reformatting the OSD and thus give it a new UUID. That workflow may need to be considered. Bluestore and Luminous also format and partition the disk differently (in particular, LVM may be used instead of GPT in some cases), and this may change or influence any code designed to detect this information.

Having said all of that, I think that this option should be removed and instead replaced with a juju action to one-shot reformat OSDs.

In the first instance it seems like this option would be used ...


Revision history for this message
Felipe Reyes (freyes) wrote : Re: [Bug 1698154] Re: race condition causes reformat of osd - ceph-osd processes and charm config-changed hook upon boot

On Tue, May 29, 2018 at 05:57:56AM -0000, Trent Lloyd wrote:
> The currently committed scheme to keep track of which osd-devices were
> already formatted is likely not a sufficient safeguard and needs
> further improvement. Reasons are:
>
> (1) Device letters change (i.e. when adding a new disk, but it comes up
> earlier in the device list order). This is very common on production
> hardware, not a side case. Devices are also not guaranteed to come up in
> the same order. It would be easy for an existing OSD to get pushed off
> into a device name that is not listed or previously initialised and
> accidentally get re-initialised.

If a user is concerned about this, they should use /dev/disk/{by-path,by-id} in the osd-devices option. I don't believe we have any restriction on the usage of those paths; we just check the '/dev/' prefix to determine whether the argument is a directory or a disk device.

>
> (2) New OSDs are added to the host but manually with ceph commands (in a
> way this is similar to the above). This has happened on production
> deployments when the add-disk action command was broken.

We could add an action to mark a device as osdized, allowing an operator to merge the charm's view with what the system is actually running.
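
Such an action could be a thin wrapper over the charm's bookkeeping; a hypothetical sketch (neither this action nor these helpers exist in the charm today), assuming charmhelpers' hookenv and unitdata:

    from charmhelpers.core import hookenv, unitdata

    def mark_osdized():
        """Hypothetical 'mark-osdized' action: trust the operator that a
        device is already an OSD and record it in the kv store."""
        device = hookenv.action_get('device')
        db = unitdata.kv()
        recorded = db.get('osd-devices', [])
        if device not in recorded:
            db.set('osd-devices', recorded + [device])
            db.flush()
        hookenv.action_set({'recorded': device})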

> However even if you use the add-disk action, the device may not be
> listed in the osd-devices config option if the action was taken
> before the charm stored the list of previously initialised osd-devices
> (the action uses osdize(), so new deployments would populate it -
> but old deployments won't).

As I mentioned in comment #3, the upgrade-charm hook needs to populate the osd-devices list.

> Even if we added code to populate this value on existing clusters from
> the osd-devices config option, OSDs that are in use but not in the
> config option won't be added to the previously initialised list and
> thus are still vulnerable to being reformatted.

osd-reformat is not an option someone should have set in this case, and in conjunction with the action I suggested previously, a user could get to a state where the charm is in sync with the running system.

>
> (3) This is slightly out of use case for the charms, but something I can
> still see people doing and it would be ideal to avoid data loss from is
> also moving an existing OSD from one machine to another to recover it,
> which is something explicitly supported and generally recommended as a
> perfectly reasonable action within the Ceph community.

In this case the user should have osd-reformat set to false (or "").

[...]
> Having said all of that, I think that this option should be removed and instead replaced with a juju action to one-shot reformat OSDs.

This has been discussed, but the lack of support in bundles for running actions at the end of a deployment makes it not an option.

Revision history for this message
Ryan Beisner (1chb1n) wrote :

We'll be designing and implementing an action-driven approach to handle cases when a formatted device is presented for reformat. The charm will require operator decision and input in those cases.

This brings a long-standing issue with fsid re-use back to the forefront. We will need to evaluate the usefulness of explicit fsid charm config in the current world, where the charms handle that automatically. The issue with fsid re-use is that a formatted block device may contain ceph artifacts from a prior deployment; when the fsid was statically configured and matches the current deployment, ceph tries (unsuccessfully) to re-use the device without formatting it.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to charm-ceph-osd (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/571486

Ryan Beisner (1chb1n)
Changed in charm-ceph-osd:
importance: High → Critical
status: New → In Progress
assignee: nobody → Frode Nordahl (fnordahl)
Revision history for this message
Ryan Beisner (1chb1n) wrote :

This topic link represents the WIP reviews for those tracking progress:

https://review.openstack.org/#/q/topic:bug/1698154+(status:open+OR+status:merged)

Revision history for this message
Edward Hope-Morley (hopem) wrote :

I just noticed this bug - https://bugs.launchpad.net/charm-ceph-osd/+bug/1746118 - which is similar to but not the same as this one, and I believe they can/should both leverage the same fix, with a slight adjustment. 1746118 does a udevadm settle after having formatted and started OSDs, to ensure they don't get reformatted by a subsequent hook firing before all the new OSDs have been mounted. The trouble is that the settle is done only after the formatting; we should really be doing it before as well (maybe just if osd-reformat=True), to ensure there are no devices still waiting to be mounted when the code checks for unmounted devices.
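
Concretely, that would mean settling the udev queue on both sides of the format step; a sketch (timeout chosen arbitrarily):

    import subprocess

    def udev_settle(timeout=120):
        """Wait for the udev event queue to drain so freshly formatted or
        newly detected OSD devices finish mounting before we inspect them."""
        subprocess.check_call(
            ['udevadm', 'settle', '--timeout={}'.format(timeout)])

    udev_settle()   # before scanning for unmounted devices (this bug)
    # ... osdize/format and start OSDs here ...
    udev_settle()   # after formatting (as the bug 1746118 fix does)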

Revision history for this message
Edward Hope-Morley (hopem) wrote :

Addendum to my previous comment: I do agree that moving this to an action is a good idea, but if we plan to keep the osd-reformat config option then we should fix it as well.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to charm-ceph-osd (master)

Reviewed: https://review.openstack.org/571486
Committed: https://git.openstack.org/cgit/openstack/charm-ceph-osd/commit/?id=487658abe0d098bcedfd3eb2ea297ae7f6067923
Submitter: Zuul
Branch: master

commit 487658abe0d098bcedfd3eb2ea297ae7f6067923
Author: Chris MacNaughton <email address hidden>
Date: Thu May 31 15:50:06 2018 +0200

    Add action to zap disk(s)

    This action includes configuration for disk(s) to
    zap, as well as an additional required flag for
    the administrator to acknowledge pending data loss

    Change-Id: I3106e2f10cf132a628aad025f73161b04215598e
    Related-Bug: #1698154
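
The safety pattern described in this commit, refusing to act without explicit operator acknowledgement, looks roughly like the following sketch (the real action's code and parameter names may differ):

    from charmhelpers.core import hookenv

    def zap_disk():
        """Zap the given devices only after explicit acknowledgement."""
        if not hookenv.action_get('i-really-mean-it'):
            hookenv.action_fail(
                'i-really-mean-it is required: zapping destroys all data '
                'on the listed devices')
            return
        for device in hookenv.action_get('devices').split():
            # The actual wipe (e.g. sgdisk --zap-all / wipefs) is omitted.
            hookenv.log('zapping {}'.format(device))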

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-ceph-osd (master)

Reviewed: https://review.openstack.org/571452
Committed: https://git.openstack.org/cgit/openstack/charm-ceph-osd/commit/?id=22ce311b0b0b3a88abc92f7e5241bf9fd0ab947d
Submitter: Zuul
Branch: master

commit 22ce311b0b0b3a88abc92f7e5241bf9fd0ab947d
Author: Ryan Beisner <email address hidden>
Date: Thu May 31 07:00:33 2018 -0500

    No reformat

    Do not reformat devices. A subsequent change will be necessary
    to account for conditions where a reformat is still desired,
    such as a set of blocking states and user-driven actions.

    Partial-bug: #1698154

    Depends-On: I90a866aa138d18e4242783c42d4c7c587f696d7d
    Change-Id: I3a41ab38e7a1679cf4f5380a7cc56556da3aaf2b

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Reviewed: https://review.openstack.org/571690
Committed: https://git.openstack.org/cgit/openstack/charm-ceph-osd/commit/?id=352d6993870be2547f37463e4e3cffc7605f749c
Submitter: Zuul
Branch: master

commit 352d6993870be2547f37463e4e3cffc7605f749c
Author: Frode Nordahl <email address hidden>
Date: Fri Jun 1 12:15:01 2018 +0200

    Add pre-flight check for device pristinity

    Add `non-pristine` key to `list-disks` action.

    No longer attempt to do initialization of `osd-journal` devices.

    Make py27 test noop

    Flip pep8 test to py3

    Partial-Bug: #1698154
    Change-Id: I0ca574fa7f0683b4e8a693b9f62fbf6b39689789
    Depends-On: I90a866aa138d18e4242783c42d4c7c587f696d7d
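
The pristinity check is conceptually simple: a device carrying any recognisable data in its first sectors is reported under `non-pristine` instead of being touched. A minimal sketch of the idea (not the charm's actual implementation):

    def is_pristine_disk(dev, nbytes=2048):
        """Heuristic: a disk is pristine only if the start of the device
        is all zeroes (no partition table or filesystem signature)."""
        with open(dev, 'rb') as f:
            data = f.read(nbytes)
        return data == b'\0' * len(data)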

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to charm-ceph-osd (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/572696

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to charm-ceph-osd (master)

Reviewed: https://review.openstack.org/572696
Committed: https://git.openstack.org/cgit/openstack/charm-ceph-osd/commit/?id=1f68a5874e66061e61dea03f8e8aa0aded7385c9
Submitter: Zuul
Branch: master

commit 1f68a5874e66061e61dea03f8e8aa0aded7385c9
Author: Frode Nordahl <email address hidden>
Date: Wed Jun 6 09:59:32 2018 +0200

    Update tests to not set `fsid` and `monitor-secret`

    Get `fsid` from leader settings on ceph-mon unit where needed
    for validation.

    Change-Id: I751ecff76873a599c0d03ec1308e30e615e38aa8
    Related-Bug: #1698154

Frode Nordahl (fnordahl)
Changed in charm-ceph-osd:
status: In Progress → Fix Committed
David Ames (thedac)
Changed in charm-ceph-osd:
status: Fix Committed → Fix Released