After replacing ceph-osd disk, blank directories /var/lib/ceph/osd/ceph-X that are unmounted cause hangup in ceph-osd upgrade

Bug #1934938 reported by Drew Freiberger
This bug affects 3 people
Affects: Ceph OSD Charm
Status: Fix Released
Importance: High
Assigned to: Luciano Lo Giudice
Milestone: 21.10

Bug Description

One unit performing a luminous->mimic upgrade on charm version 21.04 got stuck in an infinite loop, repeating the following in the unit debug-logs:

unit-ceph-osd-13: 20:42:16 DEBUG unit.ceph-osd/13.juju-log Command '['ceph', 'daemon', '/var/run/ceph/ceph-osd.51.asok', 'status']' returned non-zero exit status 22.
unit-ceph-osd-13: 20:42:16 DEBUG unit.ceph-osd/13.config-changed admin_socket: exception getting command descriptions: [Errno 2] No such file or directory

Exit status 22 here indicates that the OSD doesn't exist or isn't running: the admin socket for osd.51 cannot be found.

When investigating, I found the following /var/lib/ceph/osd/ directories and mounts:

myhost:/var/lib/ceph/osd# ls -al
total 16
drwxr-xr-x 9 ceph ceph 4096 Jul 2 2020 .
drwxr-x--- 11 ceph ceph 4096 Sep 9 2019 ..
drwxr-xr-x 2 ceph ceph 4096 Sep 9 2019 ceph-51
drwxrwxrwt 2 ceph ceph 200 Jun 28 02:35 ceph-56
drwxrwxrwt 2 ceph ceph 200 Jun 28 02:35 ceph-60
drwxrwxrwt 2 ceph ceph 200 Jun 28 02:35 ceph-64
drwxrwxrwt 2 ceph ceph 200 Jun 28 02:35 ceph-68
drwxrwxrwt 2 ceph ceph 200 Jun 28 02:35 ceph-74
drwxrwxrwt 2 ceph ceph 200 Jun 28 02:35 ceph-89
-rw------- 1 ceph ceph 69 Sep 9 2019 ceph.client.osd-upgrade.keyring

myhost:/var/lib/ceph/osd# mount|grep ceph
tmpfs on /var/lib/ceph/osd/ceph-74 type tmpfs (rw,relatime)
tmpfs on /var/lib/ceph/osd/ceph-56 type tmpfs (rw,relatime)
tmpfs on /var/lib/ceph/osd/ceph-68 type tmpfs (rw,relatime)
tmpfs on /var/lib/ceph/osd/ceph-64 type tmpfs (rw,relatime)
tmpfs on /var/lib/ceph/osd/ceph-60 type tmpfs (rw,relatime)
tmpfs on /var/lib/ceph/osd/ceph-89 type tmpfs (rw,relatime)

Note that ceph-51 is the only OSD directory without a tmpfs mount. Running rmdir /var/lib/ceph/osd/ceph-51, then killing and restarting the agent, allowed the upgrade to continue. I think the charm needs better logic for determining which OSDs are configured/available, so that it copes with poor cleanup after an OSD is replaced.
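
For anyone hitting the same symptom, a quick way to spot leftover directories like ceph-51 is to list the OSD directories that are not mount points. The following is only a diagnostic sketch, not part of the charm: the function name find_unmounted_osd_dirs is made up for illustration, and it assumes the default /var/lib/ceph/osd layout shown above.

```python
import os
import re

# Matches directory names such as "ceph-51" and captures the OSD id.
OSD_DIR_RE = re.compile(r"^ceph-(\d+)$")


def find_unmounted_osd_dirs(osd_root="/var/lib/ceph/osd"):
    """Return the ids of OSD directories under osd_root that are not mounted.

    Hypothetical helper: a directory that matches ceph-<id> but is not a
    mount point is a candidate leftover from a replaced disk.
    """
    stale = []
    for name in os.listdir(osd_root):
        match = OSD_DIR_RE.match(name)
        path = os.path.join(osd_root, name)
        # Skip non-OSD entries such as the keyring file.
        if match and os.path.isdir(path) and not os.path.ismount(path):
            stale.append(int(match.group(1)))
    return sorted(stale)


if __name__ == "__main__":
    for osd_id in find_unmounted_osd_dirs():
        print("unmounted OSD directory: ceph-%d" % osd_id)
```

On the host above this would be expected to report only ceph-51. Check that the directory really is empty and unused before removing it.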

tags: added: openstack-upgrade
Changed in charm-ceph-osd:
status: New → Triaged
importance: Undecided → High
OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-ceph-osd (master)
Changed in charm-ceph-osd:
status: Triaged → In Progress
Changed in charm-ceph-osd:
assignee: nobody → Luciano Lo Giudice (lmlogiudice)
OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-ceph-osd (master)

Reviewed: https://review.opendev.org/c/openstack/charm-ceph-osd/+/806644
Committed: https://opendev.org/openstack/charm-ceph-osd/commit/93e9885aa7cc4c4e0c35e065abbdcad4d7c6279d
Submitter: "Zuul (22348)"
Branch: master

commit 93e9885aa7cc4c4e0c35e065abbdcad4d7c6279d
Author: Luciano Lo Giudice <email address hidden>
Date: Mon Aug 30 18:13:43 2021 -0300

    Only consider mounted OSD directories

    When gathering the list of local OSD ids, the charm would consider
    the entries under '/var/lib/ceph/osd/ceph-XXX' where 'XXX' was the
    OSD id. However, if an entry under that directory isn't mounted,
    then the OSD that would represent that entry should be discarded,
    as it's no longer active. This patchset thus filters those entries
    by looking for them in the mount points.

    Closes-Bug: #1934938
    Change-Id: I69c6356e450cc0c96de4afe571b438d4a2ea5177
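
The filtering described in the commit message can be sketched roughly as follows. This is an illustration of the idea only and is not the code from the merged change: the function names are hypothetical, and the mount table is assumed to be in /proc/mounts format (device, mount point, filesystem type, ...), passed in as text so the logic can be exercised without a Ceph host.

```python
import re

# Mount points of the form /var/lib/ceph/osd/ceph-<id>.
OSD_MOUNT_RE = re.compile(r"^/var/lib/ceph/osd/ceph-(\d+)$")
# Directory names of the form ceph-<id>.
OSD_DIR_RE = re.compile(r"^ceph-(\d+)$")


def mounted_osd_ids(mounts_text):
    """Collect OSD ids that appear as mount points in /proc/mounts-style text."""
    ids = set()
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue  # ignore blank or malformed lines
        match = OSD_MOUNT_RE.match(fields[1])
        if match:
            ids.add(int(match.group(1)))
    return ids


def filter_local_osd_ids(dir_names, mounts_text):
    """Keep only the OSD ids whose directory is also a mount point."""
    mounted = mounted_osd_ids(mounts_text)
    found = []
    for name in dir_names:
        match = OSD_DIR_RE.match(name)
        if match and int(match.group(1)) in mounted:
            found.append(int(match.group(1)))
    return sorted(found)


if __name__ == "__main__":
    import os

    with open("/proc/mounts") as handle:
        mounts = handle.read()
    print(filter_local_osd_ids(os.listdir("/var/lib/ceph/osd"), mounts))
```

With the directory listing and mounts from the bug description, ceph-51 is dropped because it has no matching mount entry, while the six tmpfs-backed OSDs are kept.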

Changed in charm-ceph-osd:
status: In Progress → Fix Committed
Felipe Reyes (freyes)
Changed in charm-ceph-osd:
milestone: none → 21.10
Changed in charm-ceph-osd:
status: Fix Committed → Fix Released