During expansion of environments, ceph-mon charm should warn about or manage PG counts

Bug #1870588 reported by Drew Freiberger
Affects: Ceph Monitor Charm
Status: Triaged
Importance: Low
Assigned to: Unassigned

Bug Description

Several clouds have experienced PG-per-OSD starvation while nodes were being added to expand the cluster.

The ceph-mon charm sizes pools from the expected utilisation percentage of each pool, the expected-osd-count, and the replica count (typically 3), but it does not revisit these numbers after the pools are initially created.

Imagine a healthy cloud built with expected-osd-count = 100 that later expands to 400 OSDs by adding 4x the initial number of nodes. By the time the 300th-400th OSDs are being added, Luminous will start going into HEALTH_ERR due to PG starvation at fewer than 20 PGs per OSD, whereas the cloud may have had 80-120 PGs per OSD when it was built with 100 OSDs.
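For illustration, here is a rough pgcalc-style sizing sketch showing how a fixed pg_num dilutes as OSDs are added. The pool weights, pool names, and the 100-PG-per-OSD target below are hypothetical, and the actual charm code differs in detail:

import math

def initial_pg_num(expected_osds, percent_data, replicas=3, target_pgs_per_osd=100):
    """Approximate pg_num for a new pool, rounded to the nearest power of two."""
    raw = expected_osds * target_pgs_per_osd * percent_data / replicas
    return 2 ** round(math.log2(raw))

# Pools are sized once, against expected-osd-count = 100:
weights = {'cinder-ceph': 0.40, 'glance': 0.30, 'nova': 0.20, 'radosgw': 0.10}
total_pg_num = sum(initial_pg_num(100, w) for w in weights.values())  # 2816

# pg_num is never revisited, so adding OSDs only dilutes the ratio:
for osds in (100, 400):
    print(osds, round(total_pg_num * 3 / osds))  # 100 -> 84, 400 -> 21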

As ceph-osd nodes are added to a stable cloud, ceph-mon could warn the juju operator before allowing the additional OSDs to join the cluster, so that ceph doesn't go into a critical state from PG starvation.

The proper workaround is for the operator to pre-expand pg_num and pgp_num on the pools expected to grow, based on the new OSD count, before the additional ceph-osd nodes are added.

Further details about pg_num configuration for pools can be found at https://ceph.com/pgcalc/.
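A minimal sketch of that workaround, assuming hypothetical pool names and a pg_num target worked out from the calculator above for the post-expansion OSD count (pg_num has to be raised before pgp_num):

import subprocess

def expand_pool(pool, new_pg_num):
    # Raise pg_num first; raising pgp_num afterwards lets the data rebalance.
    for key in ('pg_num', 'pgp_num'):
        subprocess.check_call(
            ['ceph', 'osd', 'pool', 'set', pool, key, str(new_pg_num)])

for pool in ('cinder-ceph', 'glance'):  # hypothetical pool names
    expand_pool(pool, 4096)             # target derived from pgcalc for the new OSD count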

James Hebden (ec0)
tags: added: canonical-bootstack
Revision history for this message
Drew Freiberger (afreiberger) wrote:

Besides the starvation issue of adding OSDs, we should also warn about creating pools that would add too many PGs per OSD, such as having two different cinder-ceph charms set to pool weights of 40% each. If you add another cinder-ceph charm after deployment without considering the pool weight, it can easily spike the number of PGs per OSD beyond what's recommended.
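As a hypothetical illustration using the same sizing rule as above (100 OSDs, three replicas, a 100-PG-per-OSD target), two cinder-ceph pools weighted at 40% each on top of the usual pools over-commit the cluster:

# Hypothetical over-commit: pool weights that sum to roughly 120% of the cluster.
osds, replicas, target = 100, 3, 100
weights = [0.40, 0.40, 0.20, 0.20]  # two cinder-ceph pools at 40% each, plus glance/nova
total_pg_num = sum(round(osds * target * w / replicas) for w in weights)
print(total_pg_num * replicas / osds)  # ~120 PGs per OSD, before power-of-two rounding pushes it higher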

This becomes even more complicated to prevent when using class-based replication policies, with one cinder-ceph charm for a density pool and a separate cinder-ceph charm for a performance pool. The ceph-mon charm has no way to distinguish the separate CRUSH policies and calculate against the differing OSD counts behind each pool.

It may be best to approach this as a charm maintenance status or a nagios alert that flags any OSD over 200 PGs per OSD or under 50 PGs per OSD, so that operators can react to misconfigured PG counts before going live with new environments.
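A rough sketch of such a nagios check, assuming `ceph osd df --format json` exposes a per-OSD "pgs" field (worth verifying on the target Ceph release):

import json
import subprocess
import sys

LOW, HIGH = 50, 200  # thresholds suggested above

def check_pgs_per_osd():
    out = subprocess.check_output(['ceph', 'osd', 'df', '--format', 'json'])
    bad = [(osd['name'], osd['pgs']) for osd in json.loads(out)['nodes']
           if not LOW <= osd['pgs'] <= HIGH]
    if bad:
        print('CRITICAL: PG count per OSD out of range: %s' % bad)
        return 2
    print('OK: all OSDs carry between %d and %d PGs' % (LOW, HIGH))
    return 0

if __name__ == '__main__':
    sys.exit(check_pgs_per_osd())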

Revision history for this message
James Page (james-page) wrote:

How does the behaviour of ceph change when the PG autoscaler is enabled? The autoscaler is enabled on new installations running Nautilus or later.

I'd like to hope that the autoscaler takes into account the actual size of the cluster and tunes the PG counts to avoid PG starvation on the OSDs, but I might be wrong.
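For anyone testing that theory, a minimal sketch of enabling the autoscaler and inspecting its recommendations on Nautilus or later (the pool name here is hypothetical):

import subprocess

# Enable the mgr module, opt one pool in, then dump the autoscaler's view.
subprocess.check_call(['ceph', 'mgr', 'module', 'enable', 'pg_autoscaler'])
subprocess.check_call(['ceph', 'osd', 'pool', 'set', 'cinder-ceph',
                       'pg_autoscale_mode', 'on'])
print(subprocess.check_output(['ceph', 'osd', 'pool', 'autoscale-status'],
                              text=True))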

Revision history for this message
James Page (james-page) wrote:

Triaging this bug with Low priority - I can see how this might be an issue, especially on older Ceph versions. I will spend some time to see how the autoscaler might help in this scenario for Nautilus and later.

Changed in charm-ceph-mon:
status: New → Triaged
importance: Undecided → Low