after zap-disk of all ceph paths, remove-unit will re-configure ceph-osd services/mounts

Bug #1885195 reported by Drew Freiberger
This bug affects 3 people
Affects: Ceph OSD Charm
Status: Fix Committed
Importance: Medium
Assigned to: Unassigned

Bug Description

In a bionic distro/queens environment with the ceph-osd-294 charm, with nova-compute and ceph-osd hulk-smashed on each metal, I am working through a process to remove vault OSD encryption from the deployment.

My rough process is:

juju config ceph-osd --reset osd-encrypt

Then, for each ceph-osd node:

ceph osd out $id (for each osd device on a host)
Wait for ceph to finish rebalancing
ceph osd purge $id --yes-i-really-mean-it
juju zap-disk ceph-osd/$unit_id zap-devices="$(juju config ceph-osd osd-devices)" yes-i-really-mean-it=true
juju remove-unit ceph-osd/$unit_id
ceph osd crush remove $hostname
upgrade kernel on the machine
reboot the machine
juju add-unit ceph-osd --to $machine_id

When I get to 'ceph osd crush remove $hostname', I find that the host still exists in the OSD tree, and that the OSD devices have been re-added by the ceph-osd charm during juju remove-unit, since the osd-devices were clean and ready to be re-formatted. I think this is caused by the reactive framework not recognizing the intent to remove the unit, seeing the unconfigured disks, and re-configuring its relations with ceph-mon, etc.

I believe some investigation may be necessary to trap hooks like mon-relation-departed/-broken and stop, to ensure that the hooks firing during unit removal do not reconfigure ceph services upon departure of the juju agent from the host.

As I'm working through this change on a site with many hosts, I'll try to capture either a workaround or a clean process, along with logs, to determine why this may be happening.

I suspect that just zapping the disks and running a config-changed may be more prudent than removing and re-adding the unit, as the disks were re-added without encryption anyway.

Tags: scaleback
Revision history for this message
Drew Freiberger (afreiberger) wrote :

The mon-relation-departed hook triggers a ceph bootstrap and disk rescan:

unit-ceph-osd-15: 18:29:05 INFO unit.ceph-osd/15.juju-log mon:50: ceph bootstrapped, rescanning disks

https://pastebin.canonical.com/p/Cf6Q79xVSS/

Revision history for this message
Drew Freiberger (afreiberger) wrote :

I've added field-medium, as this is affecting production maintenance to remove encrypted ceph-osd units and re-install unencrypted ceph-osd units.

Revision history for this message
Ryan Beisner (1chb1n) wrote :

Thank you for the scenario and details. As you've noted, #scaleback is essentially a NotImplemented gap. The bug is tagged appropriately to triage into the product development backlog. Any additional information or findings will of course be appreciated.

Changed in charm-ceph-osd:
importance: Undecided → Wishlist
Revision history for this message
Ryan Beisner (1chb1n) wrote :

For the record, the established working assumption is that ceph charms will be deployed onto clean systems, and that any decommissioned systems will be cleaned up outside of the scope of the charm.

Revision history for this message
James Troup (elmo) wrote :

For the record, that's not a valid or realistic assumption.

Ryan Beisner (1chb1n)
Changed in charm-ceph-osd:
importance: Wishlist → Medium
importance: Medium → Wishlist
importance: Wishlist → Medium
Revision history for this message
Billy Olsen (billy-olsen) wrote :

I can't think of a single logical reason that an OSD *should* be bootstrapped on the mon-relation-departed hook. Confusingly, in the charm-ceph-osd code [0] the bootstrap is gated on get_fsid(), get_auth(), and relation_get('osd_bootstrap_key') all evaluating to a truthy value. get_fsid() and get_auth() do, because the files they reference are still persisted on disk, which is not a surprise. And since this is a relation-departed hook, the charm is still able to read data from the relation, as the relation has not yet been removed completely (that only happens at relation-broken).

This can easily be resolved by removing mon-relation-departed from the decorator on this hook.
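
As a rough sketch of the shape involved (simplified and not the verbatim charm source; the stub bodies below are placeholders for the real helpers named above):

    # Simplified sketch only: stubs stand in for the real charm helpers; the
    # point is the decorator wiring and the truthiness gate described above.
    from charmhelpers.core.hookenv import Hooks, log, relation_get

    hooks = Hooks()

    def get_fsid():
        # Placeholder: reads the fsid from the ceph config still persisted on disk.
        return 'deadbeef-fsid'

    def get_auth():
        # Placeholder: reads the auth scheme from the persisted config.
        return 'cephx'

    def prepare_disks_and_activate():
        # Placeholder for the disk rescan/bootstrap path
        # (the source of the "rescanning disks" message in the unit log above).
        log('ceph bootstrapped, rescanning disks')

    # Before the fix the decorator also listed 'mon-relation-departed', so a
    # departing mon relation (e.g. during remove-unit) re-ran the bootstrap
    # path. Dropping that event from the decorator is the proposed resolution:
    @hooks.hook('mon-relation-changed')   # 'mon-relation-departed' removed
    def mon_relation():
        # On -departed the relation data is still readable, so all three
        # conditions evaluated truthy and the disks were re-prepared.
        if get_fsid() and get_auth() and relation_get('osd_bootstrap_key'):
            prepare_disks_and_activate()
        else:
            log('mon cluster has not yet provided conf')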

Changed in charm-ceph-osd:
status: New → Triaged
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-ceph-osd (master)
Changed in charm-ceph-osd:
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-ceph-osd (master)

Reviewed: https://review.opendev.org/c/openstack/charm-ceph-osd/+/841790
Committed: https://opendev.org/openstack/charm-ceph-osd/commit/82f576ac30ad7d8e0da4d9212b29ecbeb76294a9
Submitter: "Zuul (22348)"
Branch: master

commit 82f576ac30ad7d8e0da4d9212b29ecbeb76294a9
Author: Billy Olsen <email address hidden>
Date: Fri May 13 11:40:39 2022 -0700

    Don't bootstrap osds on mon-relation-departed hook

    The charm attempts to bootstrap OSDs on both the mon-relation-changed
    and the mon-relation-departed hooks. There is no logical reason that
    the OSDs should be bootstrapped in the -departed hook.

    Change-Id: I79a790291b0e361d2748d6bed8c989d16ad36daf
    Closes-Bug: #1885195

Changed in charm-ceph-osd:
status: In Progress → Fix Committed