--force on erasure code profiles causes weird behavior and causes the ceph cluster to fail upgrade

Bug #1897517 reported by Pedro Guimarães
This bug affects 1 person
Affects                       Status         Importance   Assigned to   Milestone
Ceph Monitor Charm            Fix Released   High         James Page    20.10
Charm Helpers                 Fix Released   High         James Page
OpenStack Ceph-Proxy Charm    Fix Released   High         James Page    20.10

Bug Description

Hi OpenStack team,

This troubleshooting was done by billy-olsen, hillpd and me.

I am using the following bundle: https://pastebin.canonical.com/p/Y49S9qP48S/

For the upgrade, we move to the equivalent 20.08 charms for ceph-mon, nova-compute and cinder-ceph (which are the 20.08/stable branch plus the EC pool commit).
Then we run the upgrade to Train.

That fails: ceph-mon is not able to cluster back again and the ceph-osds never finish coming back up.

Looking closer at the ceph-osds, one can see that they fail with a core dump.
Here is a snippet of the core dump: https://pastebin.ubuntu.com/p/SrRDDKqVrp/

The line that drew the team's attention was:
Sep 26 00:06:19 node02 ceph-osd[1874627]: /build/ceph-LYYpRh/ceph-14.2.11/src/osd/ECUtil.h: 34: FAILED ceph_assert(stripe_width % stripe_size == 0)
which should not fail unless stripe_size ends up as 0.

In that ceph-osd log, we could see:
https://pastebin.ubuntu.com/p/WJ4jrrzFxJ/

Now, pg 18 was the EC pool's PG, and it was missing 2 of its 11 shards (K+M=11, but I had only 9 OSDs). I had changed that value in the charm config post-deployment, then removed and re-added the relation with ceph-mon.

I believe that was the moment the --force was applied, via this line of code:
https://github.com/juju/charm-helpers/blob/master/charmhelpers/contrib/storage/linux/ceph.py#L1213

That is what caused this issue later, at upgrade time.
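
To make the failure mode concrete, the command that effectively gets run when the profile already exists looks something like this (a sketch only; the profile name and the k/m split are placeholders, with K+M=11 as in my setup):

    # Sketch of the command the helper ends up running when the profile
    # already exists; profile name and the k/m split are illustrative.
    from subprocess import check_call

    check_call(['ceph', '--id', 'admin', 'osd', 'erasure-code-profile', 'set',
                'my-ec-profile',   # hypothetical profile name
                'k=8', 'm=3',      # K+M=11, exact split not important here
                '--force'])        # overwrites the profile even while in use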

Full logs of ceph-osd unit: https://drive.google.com/file/d/1kWDbwiiUJERZRu58vf-R_XQksflDL2fW/view?usp=sharing

--------------------------------------

My proposal:
We make --force a boolean option on
https://github.com/juju/charm-helpers/blob/master/charmhelpers/contrib/storage/linux/ceph.py#L1065
Then we pass force through as a config parameter on ceph-mon, e.g. allow-force-update-ec-profile.

Either way, I believe the best practice here is to have the force option default to false, to avoid any weird errors later when using EC pools.
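
A rough sketch of what I have in mind (not actual charm code; the 'force' keyword on create_erasure_profile() and the allow-force-update-ec-profile config option would be the new additions, and the other argument names follow the helper's current signature as far as I can tell):

    # Sketch of the proposal only. 'force' and the config option below do not
    # exist today; they are the suggested additions.
    from charmhelpers.core.hookenv import config
    from charmhelpers.contrib.storage.linux.ceph import create_erasure_profile

    def configure_ec_profile(service, profile_name, k, m):
        # Only allow an existing, in-use profile to be overwritten when the
        # operator has explicitly opted in via charm config.
        allow_force = bool(config('allow-force-update-ec-profile'))
        create_erasure_profile(service, profile_name,
                               data_chunks=k, coding_chunks=m,
                               force=allow_force)  # hypothetical keyword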

Revision history for this message
James Page (james-page) wrote :

I'm concerned that '--force' is something we should never do. I can't find good documentation on whether changing a profile that is in use is even a supported operation, or whether we are always going to hit issues like the ones described here.

Maybe we should just make the function in the ceph helper treat erasure profiles as idempotent - i.e. if a profile already exists, don't try to change it.

Revision history for this message
Edward Hope-Morley (hopem) wrote :

I think I also prefer not to have a force option, and instead have the charm log a message, e.g. "profile already exists - skipping update", since a profile is not supposed to be changed once it exists; if it does need to change, that should require operator intervention.
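
A minimal sketch of that behaviour, assuming the existing erasure_profile_exists() helper and hookenv logging (not an actual patch):

    from subprocess import check_call
    from charmhelpers.core.hookenv import log, WARNING
    from charmhelpers.contrib.storage.linux.ceph import erasure_profile_exists

    def ensure_erasure_profile(service, profile_name, **profile_opts):
        # Treat an existing profile as immutable: warn and return instead of
        # re-running 'erasure-code-profile set' with --force.
        if erasure_profile_exists(service, profile_name):
            log('EC profile {} already exists - skipping update'.format(
                profile_name), level=WARNING)
            return
        cmd = ['ceph', '--id', service, 'osd', 'erasure-code-profile', 'set',
               profile_name]
        cmd.extend('{}={}'.format(k, v) for k, v in profile_opts.items())
        check_call(cmd)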

Revision history for this message
James Page (james-page) wrote :
Changed in charm-helpers:
status: New → In Progress
assignee: nobody → James Page (james-page)
Changed in charm-ceph-proxy:
status: New → In Progress
assignee: nobody → James Page (james-page)
Changed in charm-ceph-mon:
status: New → In Progress
assignee: nobody → James Page (james-page)
Changed in charm-helpers:
importance: Undecided → High
Changed in charm-ceph-proxy:
importance: Undecided → High
Changed in charm-ceph-mon:
importance: Undecided → High
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-ceph-mon (master)

Fix proposed to branch: master
Review: https://review.opendev.org/754686

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-ceph-proxy (master)

Fix proposed to branch: master
Review: https://review.opendev.org/754687

James Page (james-page)
Changed in charm-helpers:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-ceph-proxy (master)

Reviewed: https://review.opendev.org/754687
Committed: https://git.openstack.org/cgit/openstack/charm-ceph-proxy/commit/?id=8e74666ff3b61df27f03b8ceffbec274617b1485
Submitter: Zuul
Branch: master

commit 8e74666ff3b61df27f03b8ceffbec274617b1485
Author: James Page <email address hidden>
Date: Mon Sep 28 11:50:28 2020 +0100

    Make EC profiles immutable

    Changing an existing EC profile can have some nasty side effects,
    including crashing OSDs (which is why it's guarded with a --force).

    Update the ceph helper to log a warning and return if an EC profile
    already exists, effectively making them immutable and avoiding
    any related issues.

    Reconfiguration of a pool would be undertaken using actions:

      - create new EC profile
      - create new pool using new EC profile
      - copy data from old pool to new pool
      - rename old pool
      - rename new pool to original pool name

    This obviously requires an outage in the consuming application.

    Change-Id: I630f6b6c5e3c6dd252a85cd373d7e204b9e77245
    Closes-Bug: 1897517
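
For illustration, the equivalent raw Ceph CLI steps for that migration would look roughly like the following (pool/profile names and EC parameters are placeholders; client I/O to the pool must be stopped before the copy):

    # Rough CLI equivalent of the migration steps in the commit message above;
    # names and k/m values are placeholders only.
    from subprocess import check_call

    OLD, NEW, PROFILE = 'mypool', 'mypool-new', 'ec-profile-new'

    check_call(['ceph', 'osd', 'erasure-code-profile', 'set', PROFILE,
                'k=4', 'm=2', 'crush-failure-domain=host'])
    check_call(['ceph', 'osd', 'pool', 'create', NEW, '32', '32',
                'erasure', PROFILE])
    check_call(['rados', 'cppool', OLD, NEW])           # copy data across
    check_call(['ceph', 'osd', 'pool', 'rename', OLD, OLD + '-old'])
    check_call(['ceph', 'osd', 'pool', 'rename', NEW, OLD])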

Changed in charm-ceph-proxy:
status: In Progress → Fix Committed
Changed in charm-ceph-mon:
status: In Progress → Fix Committed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-ceph-mon (master)

Reviewed: https://review.opendev.org/754686
Committed: https://git.openstack.org/cgit/openstack/charm-ceph-mon/commit/?id=56946244f9cb53c63bef2fcddabfe85b99b95b87
Submitter: Zuul
Branch: master

commit 56946244f9cb53c63bef2fcddabfe85b99b95b87
Author: James Page <email address hidden>
Date: Mon Sep 28 11:46:51 2020 +0100

    Make EC profiles immutable

    Changing an existing EC profile can have some nasty side effects,
    including crashing OSDs (which is why it's guarded with a --force).

    Update the ceph helper to log a warning and return if an EC profile
    already exists, effectively making them immutable and avoiding
    any related issues.

    Reconfiguration of a pool would be undertaken using actions:

      - create new EC profile
      - create new pool using new EC profile
      - copy data from old pool to new pool
      - rename old pool
      - rename new pool to original pool name

    This obviously requires an outage in the consuming application.

    Change-Id: Ifb3825750f0299589f404e06103d79e393d608f3
    Closes-Bug: 1897517

James Page (james-page)
Changed in charm-ceph-proxy:
milestone: none → 20.10
Changed in charm-ceph-mon:
milestone: none → 20.10
Changed in charm-ceph-mon:
status: Fix Committed → Fix Released
Changed in charm-ceph-proxy:
status: Fix Committed → Fix Released