charm is blocked when non-pristine devices are detected

Bug #1988088 reported by Luciano Lo Giudice
This bug affects 7 people
Affects: Ceph OSD Charm
Status: Fix Committed
Importance: Medium
Assigned to: Peter Sabaini

Bug Description

The charm can enter the blocked status with the message that non-pristine disks have been detected, and it stays that way until an operator manually intervenes. This happens because once the blocked status is reached, successive reassessments are made with a broader list of devices than the one originally checked, i.e.: https://opendev.org/openstack/charm-ceph-osd/src/branch/master/hooks/ceph_hooks.py#L859

The above check doesn't take the list of configured devices into account. A proposed fix would thus be that, in addition to not considering the journal devices as is done above, the check should _also_ not consider the devices listed in the `osd-devices` config parameter.
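For context, "non-pristine" here means a device that already carries partitions, filesystem signatures, or LVM metadata. The following is only an operator-side sketch using standard tools to inspect what a unit's devices look like; it is not the charm's actual check, and /dev/sdb is a placeholder:

# read-only inspection of a candidate OSD device
lsblk -f /dev/sdb        # shows partitions and any filesystem/LVM signatures
sudo wipefs -n /dev/sdb  # dry run: lists signatures wipefs would erase, removes nothing
sudo pvs                 # lists LVM physical volumes, including any already claimed by Ceph VGs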

Changed in charm-ceph-osd:
status: New → Triaged
importance: Undecided → Medium
Revision history for this message
Chris MacNaughton (chris.macnaughton) wrote :

Until this is resolved in the charms, running status-set on the affected unit with a message that doesn't start with 'Non-pristine' should resolve the blocked state

Revision history for this message
Chris MacNaughton (chris.macnaughton) wrote :

Running hooks/assess-status on the affected unit after setting the status should clear it up and get the unit back to the correct state

Revision history for this message
Chris MacNaughton (chris.macnaughton) wrote :

And, for the record, this is purely a display issue in the informational Juju status that doesn't affect the OSD functionality

Revision history for this message
Andre Ruiz (andre-ruiz) wrote (last edit ):

Considering that this is just a charm status problem, and the real devices were correctly configured and are working, a quick workaround is to run "status-set" on the affected units to change their status to "active" so that a deployment with FCE can continue (it uses juju-wait between all steps, and any blocked charm will break the flow).

juju run --unit ceph-osd/2 "status-set active '<optional message>'"

This of course will not last long, since the problem will come back on the next update-status run, but it may be enough to get going in a test.

Adding ceph-osd to wait_exclude temporarily may be a better option.

Revision history for this message
Andre Ruiz (andre-ruiz) wrote :

Adding field-medium to this bug. It's not a blocker because of the above workaround, but it definitely needs some priority.

Revision history for this message
Andre Ruiz (andre-ruiz) wrote (last edit ):

Chris, what is "Running hooks/assess-status"? It does not seem to be a hook that I can run with juju run .... "hooks/..."

Revision history for this message
Billy Olsen (billy-olsen) wrote :

I think hooks/assess-status should be hooks/update-status

Revision history for this message
Chris MacNaughton (chris.macnaughton) wrote :

indeed - update-status! Sorry for the confusion!
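
Putting the corrected workaround together, the full sequence looks like this (the unit name and status message are placeholders, and this assumes the Juju 2.x `juju run --unit` syntax used elsewhere in this report):

juju run --unit ceph-osd/2 "status-set active 'cleared by operator'"
juju run --unit ceph-osd/2 "hooks/update-status"

The first command replaces the 'Non-pristine ...' message so the charm stops re-checking the broader device list, and the second re-runs the normal status assessment.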

Revision history for this message
Chris Johnston (cjohnston) wrote :

I end up with:

ceph-osd/1* blocked idle 4 1.2.3.4 No block devices detected using current configuration

Revision history for this message
Andre Ruiz (andre-ruiz) wrote :

Changing from field-medium to field-high. Will not raise to field-critical since there is a workaround, but this is making things very complicated during automated deployments with FCE and also when handing over clouds to support.

Revision history for this message
Jeff Hillman (jhillman) wrote :

I followed the `status-set active` and then `hooks/update-status` steps and ended up in the same boat as Chris, with 'No block devices detected using current configuration'

I went a step further and found that there were some residual ceph LVM configs on the 3 nodes and manually removed them.

Next I zapped the disks and then ran the add-disk action; it got to the point of attempting to initialize the disks but ended up back at 'no block devices detected'.
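
For reference, the zap and re-add steps can be driven through the charm's actions. The device path and parameter names below are assumptions based on typical ceph-osd charm revisions, so verify them with `juju actions ceph-osd --schema` first:

juju run-action ceph-osd/1 zap-disk devices=/dev/sdb i-really-mean-it=true --wait
juju run-action ceph-osd/1 add-disk osd-devices=/dev/sdb --wait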

I can see that all disks now have LVM configs on all the nodes. But nothing more.

This is a new installation, using the Ceph machines with the manual provider (a small microcloud with Juju bootstrapped into an LXD cluster across the 3 Ceph nodes).

This is blocking a customer deployment.

Revision history for this message
Nobuto Murata (nobuto) wrote :

"non-pristine devices" could be caused by many reasons.

The potential causes we know about so far are:

1. Insufficient resources or misconfiguration
e.g. "mon-relation-changed Volume group "ceph-db-XYZ" has insufficient free space (255 extents): 256 required."

2. bluefs_buffered_io related
e.g. "mon-relation-changed stderr: (...) -1 OSD::mkfs: ObjectStore::mkfs failed with error (5) Input/output error"

For 2., there are some upstream reports, but I'm not personally familiar with exactly which conditions trigger it:
- https://tracker.ceph.com/issues/54019
- https://tracker.ceph.com/issues/51034#note-12
- https://github.com/ceph/ceph/pull/49431

Setting config-flags: '{"osd": {"bluefs_buffered_io": false}}' can be used to confirm whether buffered I/O is the cause when you see the error above.
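
As a concrete form of that override (assuming the application is named ceph-osd):

juju config ceph-osd config-flags='{"osd": {"bluefs_buffered_io": false}}'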

In any case, logs would be important to see what's going on.

Revision history for this message
Jeff Hillman (jhillman) wrote :

I was able to get this working. I think it was because I use the manual provider and did not reprovision the machines, which is what gave me issues.

my steps to get a clean build working were:

remove all LVs, VGs, and PVs left behind by Ceph
apt purge all Ceph packages
rm -rf /var/lib/ceph
dd if=/dev/zero of=/dev/{osd-devices} bs=1M count=500

A bit extreme, but faster than reprovisioning. A concrete sketch of these commands is below.
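
The sketch below spells those steps out; the VG filter, package list, and device paths are assumptions that will differ per deployment, so review it before running anything:

# remove LVM volumes created by Ceph (VG names typically start with "ceph-")
for vg in $(sudo vgs --noheadings -o vg_name | grep ceph); do sudo vgremove -ff -y "$vg"; done
sudo pvremove /dev/sdb /dev/sdc        # placeholder OSD devices
# purge the Ceph packages and on-disk state
sudo apt purge -y ceph ceph-osd ceph-common
sudo rm -rf /var/lib/ceph
# clobber the start of each former OSD device
sudo dd if=/dev/zero of=/dev/sdb bs=1M count=500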

Revision history for this message
Pedro Victor Lourenço Fragola (pedrovlf) wrote :

I used the workaround from comment #2 and it fixed the issue.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-ceph-osd (master)
Changed in charm-ceph-osd:
status: Triaged → In Progress
Changed in charm-ceph-osd:
assignee: nobody → Peter Sabaini (peter-sabaini)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-ceph-osd (master)

Reviewed: https://review.opendev.org/c/openstack/charm-ceph-osd/+/877353
Committed: https://opendev.org/openstack/charm-ceph-osd/commit/22dfd2cc8f9f9a1bc14bfcb827fad47a9c97ac17
Submitter: "Zuul (22348)"
Branch: master

commit 22dfd2cc8f9f9a1bc14bfcb827fad47a9c97ac17
Author: Peter Sabaini <email address hidden>
Date: Tue Mar 14 10:55:08 2023 +0100

    Fix pristine status

    Only check configured devices instead of all system devices and don't check already processed devices when computing pristine status

    Closes-Bug: #1988088
    Change-Id: Ia6bf7a5b7abddb72c3ec61fd9e02daf42e94c2da
    func-test-pr: https://github.com/openstack-charmers/zaza-openstack-tests/pull/1025

Changed in charm-ceph-osd:
status: In Progress → Fix Committed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-ceph-osd (stable/quincy.2)

Fix proposed to branch: stable/quincy.2
Review: https://review.opendev.org/c/openstack/charm-ceph-osd/+/879099

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-ceph-osd (stable/quincy.2)

Reviewed: https://review.opendev.org/c/openstack/charm-ceph-osd/+/879099
Committed: https://opendev.org/openstack/charm-ceph-osd/commit/d081e23de9a5ce5e055c275f61b45cb48ec19ff6
Submitter: "Zuul (22348)"
Branch: stable/quincy.2

commit d081e23de9a5ce5e055c275f61b45cb48ec19ff6
Author: Peter Sabaini <email address hidden>
Date: Tue Mar 14 10:55:08 2023 +0100

    Fix pristine status

    Only check configured devices instead of all system devices and don't check already processed devices when computing pristine status

    Closes-Bug: #1988088
    Change-Id: Ia6bf7a5b7abddb72c3ec61fd9e02daf42e94c2da
    func-test-pr: https://github.com/openstack-charmers/zaza-openstack-tests/pull/1032
    (cherry picked from commit 22dfd2cc8f9f9a1bc14bfcb827fad47a9c97ac17)
