--force on erasure code profiles causes weird behavior and causes the ceph cluster to fail upgrade

Bug #1897517 reported by Pedro Guimarães
This bug affects 1 person
Affects                       Status         Importance   Assigned to   Milestone
Ceph Monitor Charm            Fix Released   High         James Page    20.10
Charm Helpers                 Fix Released   High         James Page
OpenStack Ceph-Proxy Charm    Fix Released   High         James Page    20.10

Bug Description

Hi OpenStack team,

This troubleshooting was done by billy-olsen, hillpd and me.

I am using the following bundle: https://pastebin.canonical.com/p/Y49S9qP48S/

For the upgrade, we move to the equivalent 20.08 charms for ceph-mon, nova-compute and cinder-ceph (which are the 20.08/stable branch plus the EC pool commit).
Then we run the upgrade to Train.

That fails: ceph-mon is not able to cluster back again and the ceph-osds never finish coming back up.

Looking closer at the ceph-osds, one can see that they fail with a core dump.
Here is a snippet of the core dump: https://pastebin.ubuntu.com/p/SrRDDKqVrp/

The line that drew the team's attention was:
Sep 26 00:06:19 node02 ceph-osd[1874627]: /build/ceph-LYYpRh/ceph-14.2.11/src/osd/ECUtil.h: 34: FAILED ceph_assert(stripe_width % stripe_size == 0)
which should not fail unless stripe_size ends up as 0.

In that ceph-osd log, we could see:
https://pastebin.ubuntu.com/p/WJ4jrrzFxJ/

Now, pg 18 was the EC pool's PG, and it was missing 2 of its 11 shards (K+M=11, but I had only 9 OSDs). I had changed that value in the charm config post-deployment, then removed and re-added the relation with ceph-mon.

I believe that was the moment the --force was applied, via this line of code:
https://github.com/juju/charm-helpers/blob/master/charmhelpers/contrib/storage/linux/ceph.py#L1213

That is what caused this issue later, at upgrade time.
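
To make the failure mode concrete, the command that effectively gets run when the profile already exists looks something like this (a sketch only; the profile name and the k/m split are placeholders, with K+M=11 as in my setup):

    # Sketch of the command the helper ends up running when the profile
    # already exists; profile name and the k/m split are illustrative.
    from subprocess import check_call

    check_call(['ceph', '--id', 'admin', 'osd', 'erasure-code-profile', 'set',
                'my-ec-profile',   # hypothetical profile name
                'k=8', 'm=3',      # K+M=11, exact split not important here
                '--force'])        # overwrites the profile even while in use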

Full logs of ceph-osd unit: https://drive.google.com/file/d/1kWDbwiiUJERZRu58vf-R_XQksflDL2fW/view?usp=sharing

--------------------------------------

My proposal:
We make --force a boolean option on
https://github.com/juju/charm-helpers/blob/master/charmhelpers/contrib/storage/linux/ceph.py#L1065
Then we pass force through as a config parameter on ceph-mon, e.g. allow-force-update-ec-profile.

Either way, I believe the best practice here is to have the force option default to false, to avoid any weird errors later when using EC pools.
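
A rough sketch of what I have in mind (not actual charm code; the 'force' keyword on create_erasure_profile() and the allow-force-update-ec-profile config option would be the new additions, and the other argument names follow the helper's current signature as far as I can tell):

    # Sketch of the proposal only. 'force' and the config option below do not
    # exist today; they are the suggested additions.
    from charmhelpers.core.hookenv import config
    from charmhelpers.contrib.storage.linux.ceph import create_erasure_profile

    def configure_ec_profile(service, profile_name, k, m):
        # Only allow an existing, in-use profile to be overwritten when the
        # operator has explicitly opted in via charm config.
        allow_force = bool(config('allow-force-update-ec-profile'))
        create_erasure_profile(service, profile_name,
                               data_chunks=k, coding_chunks=m,
                               force=allow_force)  # hypothetical keyword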

Revision history for this message
James Page (james-page) wrote :

I'm concerned that '--force' is something we should never do. I can't find good documentation on whether changing a profile that is in use is even a supported operation, or whether we are always going to hit issues like the ones described here.

Maybe we should just make the function in the ceph helper treat erasure profiles as idempotent - i.e. if a profile already exists, don't try to change it.

Revision history for this message
Edward Hope-Morley (hopem) wrote :

I think I also prefer not to have a force option, and instead have the charm log a message, e.g. "profile already exists - skipping update", since a profile is not supposed to be changed once it exists; if it does need to change, that should require operator intervention.
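
A minimal sketch of that behaviour, assuming the existing erasure_profile_exists() helper and hookenv logging (not an actual patch):

    from subprocess import check_call
    from charmhelpers.core.hookenv import log, WARNING
    from charmhelpers.contrib.storage.linux.ceph import erasure_profile_exists

    def ensure_erasure_profile(service, profile_name, **profile_opts):
        # Treat an existing profile as immutable: warn and return instead of
        # re-running 'erasure-code-profile set' with --force.
        if erasure_profile_exists(service, profile_name):
            log('EC profile {} already exists - skipping update'.format(
                profile_name), level=WARNING)
            return
        cmd = ['ceph', '--id', service, 'osd', 'erasure-code-profile', 'set',
               profile_name]
        cmd.extend('{}={}'.format(k, v) for k, v in profile_opts.items())
        check_call(cmd)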

Revision history for this message
James Page (james-page) wrote :
Changed in charm-helpers:
status: New → In Progress
assignee: nobody → James Page (james-page)
Changed in charm-ceph-proxy:
status: New → In Progress
assignee: nobody → James Page (james-page)
Changed in charm-ceph-mon:
status: New → In Progress
assignee: nobody → James Page (james-page)
Changed in charm-helpers:
importance: Undecided → High
Changed in charm-ceph-proxy:
importance: Undecided → High
Changed in charm-ceph-mon:
importance: Undecided → High
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-ceph-mon (master)

Fix proposed to branch: master
Review: https://review.opendev.org/754686

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-ceph-proxy (master)

Fix proposed to branch: master
Review: https://review.opendev.org/754687

James Page (james-page)
Changed in charm-helpers:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-ceph-proxy (master)

Reviewed: https://review.opendev.org/754687
Committed: https://git.openstack.org/cgit/openstack/charm-ceph-proxy/commit/?id=8e74666ff3b61df27f03b8ceffbec274617b1485
Submitter: Zuul
Branch: master

commit 8e74666ff3b61df27f03b8ceffbec274617b1485
Author: James Page <email address hidden>
Date: Mon Sep 28 11:50:28 2020 +0100

    Make EC profiles immutable

    Changing an existing EC profile can have some nasty side effects,
    including crashing OSDs (which is why it's guarded with a --force).

    Update the ceph helper to log a warning and return if an EC profile
    already exists, effectively making them immutable and avoiding
    any related issues.

    Reconfiguration of a pool would be undertaken using actions:

      - create new EC profile
      - create new pool using new EC profile
      - copy data from old pool to new pool
      - rename old pool
      - rename new pool to original pool name

    This obviously requires an outage in the consuming application.

    Change-Id: I630f6b6c5e3c6dd252a85cd373d7e204b9e77245
    Closes-Bug: 1897517
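
For illustration, the equivalent raw Ceph CLI steps for that migration would look roughly like the following (pool/profile names and EC parameters are placeholders; client I/O to the pool must be stopped before the copy):

    # Rough CLI equivalent of the migration steps in the commit message above;
    # names and k/m values are placeholders only.
    from subprocess import check_call

    OLD, NEW, PROFILE = 'mypool', 'mypool-new', 'ec-profile-new'

    check_call(['ceph', 'osd', 'erasure-code-profile', 'set', PROFILE,
                'k=4', 'm=2', 'crush-failure-domain=host'])
    check_call(['ceph', 'osd', 'pool', 'create', NEW, '32', '32',
                'erasure', PROFILE])
    check_call(['rados', 'cppool', OLD, NEW])           # copy data across
    check_call(['ceph', 'osd', 'pool', 'rename', OLD, OLD + '-old'])
    check_call(['ceph', 'osd', 'pool', 'rename', NEW, OLD])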

Changed in charm-ceph-proxy:
status: In Progress → Fix Committed
Changed in charm-ceph-mon:
status: In Progress → Fix Committed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-ceph-mon (master)

Reviewed: https://review.opendev.org/754686
Committed: https://git.openstack.org/cgit/openstack/charm-ceph-mon/commit/?id=56946244f9cb53c63bef2fcddabfe85b99b95b87
Submitter: Zuul
Branch: master

commit 56946244f9cb53c63bef2fcddabfe85b99b95b87
Author: James Page <email address hidden>
Date: Mon Sep 28 11:46:51 2020 +0100

    Make EC profiles immutable

    Changing an existing EC profile can have some nasty side effects,
    including crashing OSDs (which is why it's guarded with a --force).

    Update the ceph helper to log a warning and return if an EC profile
    already exists, effectively making them immutable and avoiding
    any related issues.

    Reconfiguration of a pool would be undertaken using actions:

      - create new EC profile
      - create new pool using new EC profile
      - copy data from old pool to new pool
      - rename old pool
      - rename new pool to original pool name

    This obviously requires an outage in the consuming application.

    Change-Id: Ifb3825750f0299589f404e06103d79e393d608f3
    Closes-Bug: 1897517

James Page (james-page)
Changed in charm-ceph-proxy:
milestone: none → 20.10
Changed in charm-ceph-mon:
milestone: none → 20.10
Changed in charm-ceph-mon:
status: Fix Committed → Fix Released
Changed in charm-ceph-proxy:
status: Fix Committed → Fix Released