Need for per-unit blacklist of osd-devices

Bug #1730267 reported by Frode Nordahl
This bug affects 2 people
Affects: Ceph OSD Charm
Status: Fix Released
Importance: Medium
Assigned to: Frode Nordahl
Milestone: 17.11

Bug Description

Over time, nodes running Ceph OSDs will eventually develop bad disks. While Ceph itself handles the bulk of this problem domain, the charm plays an important part in the operational handling of it.

A node with a bad disk that is still visible to the operating system can, in some circumstances, complicate Juju and ceph-osd charm operation on that node.

While handling 'config-changed' events, the ceph-osd charm attempts to initialize and format any currently inactive disk devices listed in the 'osd-devices' config option. When this operation fails because of a bad disk, the charm ends up in an error state, leaving the node inoperable through Juju.

At initial deployment time, getting an error for unsuccessful initialization is useful and expected. Having a ceph-osd unit enter an error state due to a bad disk later in its life is not desirable. Note that it may not make operational sense to swap the physical disk immediately, and the node should remain operable even with a bad disk.

There currently exist three config options that could have an effect on this behaviour: 'osd-reformat', 'ignore-device-errors' and 'osd-devices'.

However, config options are set at the application level in the Juju model, and in a large cluster it may not be desirable to change any of these cluster-wide, as that would affect how the rest of the cluster is managed and operated.

Suggestion:
- Add device blacklist handling to ceph-osd charm
- The list could be managed using actions 'blacklist-add-disk', 'blacklist-remove-disk'
- The blacklisted disks could be listed under the 'blacklisted' key returned by the existing 'list-disks' action
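
The suggestion above could be sketched roughly as follows. This is a minimal illustration of a unit-local blacklist that add and remove actions might maintain, not the charm's actual implementation: the class name DeviceBlacklist, the JSON file format and the storage path are all assumptions made for the example.

```python
import json
import os
import tempfile


class DeviceBlacklist:
    """Unit-local list of devices to skip.

    Hypothetical helper for illustration; not the charm's real API.
    """

    def __init__(self, path):
        # Assumed storage: a JSON file private to this unit.
        self.path = path

    def list(self):
        """Return the blacklisted devices, or an empty list if none are stored."""
        if not os.path.exists(self.path):
            return []
        with open(self.path) as f:
            return json.load(f)

    def _save(self, devices):
        # Store a sorted, de-duplicated list so repeated adds are harmless.
        with open(self.path, "w") as f:
            json.dump(sorted(set(devices)), f)

    def add(self, devices):
        """What a 'blacklist-add-disk' style action might do."""
        self._save(self.list() + list(devices))

    def remove(self, devices):
        """What a 'blacklist-remove-disk' style action might do."""
        drop = set(devices)
        self._save(d for d in self.list() if d not in drop)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        blacklist = DeviceBlacklist(os.path.join(tmp, "blacklist.json"))
        blacklist.add(["/dev/sdb", "/dev/sdc"])
        blacklist.remove(["/dev/sdc"])
        print(blacklist.list())  # ['/dev/sdb']
```

Because the list lives with the unit rather than in application-level config, changing it on one node leaves the rest of the cluster untouched, which is the point of the proposal.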

Tags: sts
Frode Nordahl (fnordahl)
tags: added: sts
Frode Nordahl (fnordahl)
Changed in charm-ceph-osd:
assignee: nobody → Frode Nordahl (fnordahl)
Frode Nordahl (fnordahl)
description: updated
Frode Nordahl (fnordahl)
Changed in charm-ceph-osd:
importance: Undecided → Medium
status: New → Triaged
OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-ceph-osd (master)

Fix proposed to branch: master
Review: https://review.openstack.org/517989

Changed in charm-ceph-osd:
status: Triaged → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-ceph-osd (master)

Reviewed: https://review.openstack.org/517989
Committed: https://git.openstack.org/cgit/openstack/charm-ceph-osd/commit/?id=c4d4e42c1a8cbda797834c8c0dd7d264c6abfdd5
Submitter: Zuul
Branch: master

commit c4d4e42c1a8cbda797834c8c0dd7d264c6abfdd5
Author: Frode Nordahl <email address hidden>
Date: Mon Nov 6 15:03:24 2017 +0100

    Add actions to blacklist osd-devices

    The blacklist actions allow for adding and removing devices
    to a unit-local list of devices to be skipped during osd
    initialization. This list will be used to override the
    application level, and thereby deployment wide, 'osd-devices'
    configuration option on a individual unit basis.

    The pre-existing list-disk action is extended to return
    list of blacklisted devices under the 'blacklist' key.

    Change-Id: I28a3c5d6076fb496dead3fe3387d9bbbbe9ec083
    Closes-Bug: #1730267
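
The override described in the commit message can be pictured as a simple filter. The following is a hedged sketch only: the function name devices_to_initialize is hypothetical, and it assumes 'osd-devices' is a whitespace-separated string of device paths. The merged change in the review linked above is the authoritative implementation.

```python
def devices_to_initialize(osd_devices, blacklist):
    """Return the configured devices minus the unit-local blacklist.

    osd_devices: assumed whitespace-separated string from the
                 application-level 'osd-devices' option.
    blacklist:   iterable of device paths this unit should skip.
    The configured order is preserved.
    """
    skipped = set(blacklist)
    return [dev for dev in osd_devices.split() if dev not in skipped]


if __name__ == "__main__":
    configured = "/dev/sdb /dev/sdc /dev/sdd"
    # The unit with a bad /dev/sdc skips only that device.
    print(devices_to_initialize(configured, ["/dev/sdc"]))
```

A unit with an empty blacklist sees the full configured list, so the deployment-wide option keeps its meaning everywhere else.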

Changed in charm-ceph-osd:
status: In Progress → Fix Committed
Frode Nordahl (fnordahl)
Changed in charm-ceph-osd:
milestone: none → 17.11
James Page (james-page)
Changed in charm-ceph-osd:
status: Fix Committed → Fix Released