after zap-disk of all ceph paths, remove-unit will re-configure ceph-osd services/mounts

Bug #1885195 reported by Drew Freiberger
This bug affects 3 people
Affects: Ceph OSD Charm
Status: Fix Committed
Importance: Medium
Assigned to: Unassigned

Bug Description

In a bionic distro/queens environment with the ceph-osd-294 charm, with nova-compute and ceph-osd hulk-smashed on each metal, I am working through a process to remove vault OSD encryption from the deployment.

My rough process is:

juju config ceph-osd --reset osd-encrypt

Then, for each ceph-osd node:

ceph osd out $id (for each osd device on a host)
Wait for ceph to finish rebalancing
ceph osd purge $id --yes-i-really-mean-it
juju zap-disk ceph-osd/$unit_id zap-devices="$(juju config ceph-osd osd-devices)" yes-i-really-mean-it=true
juju remove-unit ceph-osd/$unit_id
ceph osd crush remove $hostname
upgrade kernel on the machine
reboot the machine
juju add-unit ceph-osd --to $machine_id

When I get to 'ceph osd crush remove $hostname', I find that the host still exists in the OSD tree, and that the OSD devices have been re-added by the ceph-osd charm during juju remove-unit, since the osd-devices were clean and ready to be re-formatted. I think this is caused by the reactive framework not recognizing the intent to remove the unit, seeing the unconfigured disks, and re-configuring its relations with ceph-mon, etc.

I believe some investigation may be necessary to trap hooks like mon-relation-departed/-broken and stop, to ensure that the hooks firing during unit removal do not reconfigure ceph services upon departure of the juju agent from the host.

As I'm working through this change on a site with many hosts, I'll try to capture either a workaround or a clean process, along with logs, to determine why this may be happening.

I suspect that just zapping the disks and running a config-changed may be more prudent than removing and re-adding the unit, as the disks were re-added without encryption anyway.

Tags: scaleback
Revision history for this message
Drew Freiberger (afreiberger) wrote :

The mon-relation-departed hook triggers a ceph bootstrap and disk rescan:

unit-ceph-osd-15: 18:29:05 INFO unit.ceph-osd/15.juju-log mon:50: ceph bootstrapped, rescanning disks

https://pastebin.canonical.com/p/Cf6Q79xVSS/

Revision history for this message
Drew Freiberger (afreiberger) wrote :

I've added field-medium, as this is affecting production maintenance to remove encrypted ceph-osd units and re-install unencrypted ceph-osd units.

Revision history for this message
Ryan Beisner (1chb1n) wrote :

Thank you for the scenario and details. As you've noted, #scaleback is essentially a NotImplemented gap. The bug is tagged appropriately to triage into the product development backlog. Any additional information or findings will of course be appreciated.

Changed in charm-ceph-osd:
importance: Undecided → Wishlist
Revision history for this message
Ryan Beisner (1chb1n) wrote :

For the record, the established working assumption is that ceph charms will be deployed onto clean systems, and that any decommissioned systems will be cleaned up outside of the scope of the charm.

Revision history for this message
James Troup (elmo) wrote :

For the record, that's not a valid or realistic assumption.

Ryan Beisner (1chb1n)
Changed in charm-ceph-osd:
importance: Wishlist → Medium
importance: Medium → Wishlist
importance: Wishlist → Medium
Revision history for this message
Billy Olsen (billy-olsen) wrote :

I can't think of a single logical reason that an OSD *should* be bootstrapped on the mon-relation-departed hook. Confusingly, in the charm-ceph-osd code [0] the bootstrap is gated on get_fsid(), get_auth(), and relation_get('osd_bootstrap_key') all evaluating to a truthy value. get_fsid() and get_auth() do, because the files they reference are still persisted on disk, which is not a surprise. And since this is a relation-departed hook, the charm is still able to read data from the relation, as the relation has not yet been removed completely (that only happens at relation-broken).

This can easily be resolved by removing mon-relation-departed from the decorator on this hook.
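
As a rough sketch of the shape involved (simplified and not the verbatim charm source; the stub bodies below are placeholders for the real helpers named above):

    # Simplified sketch only: stubs stand in for the real charm helpers; the
    # point is the decorator wiring and the truthiness gate described above.
    from charmhelpers.core.hookenv import Hooks, log, relation_get

    hooks = Hooks()

    def get_fsid():
        # Placeholder: reads the fsid from the ceph config still persisted on disk.
        return 'deadbeef-fsid'

    def get_auth():
        # Placeholder: reads the auth scheme from the persisted config.
        return 'cephx'

    def prepare_disks_and_activate():
        # Placeholder for the disk rescan/bootstrap path
        # (the source of the "rescanning disks" message in the unit log above).
        log('ceph bootstrapped, rescanning disks')

    # Before the fix the decorator also listed 'mon-relation-departed', so a
    # departing mon relation (e.g. during remove-unit) re-ran the bootstrap
    # path. Dropping that event from the decorator is the proposed resolution:
    @hooks.hook('mon-relation-changed')   # 'mon-relation-departed' removed
    def mon_relation():
        # On -departed the relation data is still readable, so all three
        # conditions evaluated truthy and the disks were re-prepared.
        if get_fsid() and get_auth() and relation_get('osd_bootstrap_key'):
            prepare_disks_and_activate()
        else:
            log('mon cluster has not yet provided conf')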

Changed in charm-ceph-osd:
status: New → Triaged
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-ceph-osd (master)
Changed in charm-ceph-osd:
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-ceph-osd (master)

Reviewed: https://review.opendev.org/c/openstack/charm-ceph-osd/+/841790
Committed: https://opendev.org/openstack/charm-ceph-osd/commit/82f576ac30ad7d8e0da4d9212b29ecbeb76294a9
Submitter: "Zuul (22348)"
Branch: master

commit 82f576ac30ad7d8e0da4d9212b29ecbeb76294a9
Author: Billy Olsen <email address hidden>
Date: Fri May 13 11:40:39 2022 -0700

    Don't bootstrap osds on mon-relation-departed hook

    The charm attempts to bootstrap OSDs on both the mon-relation-changed
    and the mon-relation-departed hooks. There is no logical reason that
    the OSDs should be bootstrapped in the -departed hook.

    Change-Id: I79a790291b0e361d2748d6bed8c989d16ad36daf
    Closes-Bug: #1885195

Changed in charm-ceph-osd:
status: In Progress → Fix Committed