Add "osd op queue cut off" charm setting, default to high
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Ceph OSD Charm | New | Undecided | Unassigned |
Bug Description
[Impact]
OSD heartbeat and map updates can get starved during backfill and recovery scenarios. This issue becomes significantly more pronounced with fast storage devices. The OSD's peers effectively DDoS the strict priority queue, stalling critical time-sensitive operations. This can result in the OSD being marked down when it is unable to respond to MON heartbeat requests.
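The starvation mechanism can be illustrated with a toy sketch (not Ceph code): in a strict priority queue, any backlog of higher-priority items is fully drained before a lower-priority item is served, so a flood of recovery ops delays time-critical messages such as heartbeats indefinitely.

```python
import heapq

# Toy illustration of strict-priority starvation (not Ceph code).
# Lower number = higher priority, served first.
queue = []
for i in range(5):
    heapq.heappush(queue, (0, f"recovery-{i}"))  # flood of recovery ops
heapq.heappush(queue, (1, "heartbeat"))          # time-critical, lower priority

served = [heapq.heappop(queue)[1] for _ in range(6)]
print(served)
# The heartbeat is only served after every queued recovery op; with a
# continuous stream of recovery ops it would never be served at all.
```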
The solution is to change "osd op queue cut off" to "high", shifting recovery operations out of the strict queue and into the weighted priority queue. Recovery ops are still serviced at a higher priority, but no longer starve critical cluster communications.
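For reference, a sketch of the resulting ceph.conf fragment, assuming the charm renders its config flags into the [osd] section (the queue type line reflects the 'wpq' default noted below):

```ini
# Assumed ceph.conf rendering of the charm setting (illustrative only)
[osd]
osd op queue = wpq
osd op queue cut off = high
```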
[Test Case]
This was encountered while attempting to perform an upgrade from 12.2.11 to 12.2.12 on an all-SSD cluster while under heavy workloads.
[Other Info]
The "high" cut-off was intended to be the default in Luminous. The author of the weighted priority queue discusses the cut-off in a ceph-users thread [0]. The "osd op queue cut off" default was set to "high" by upstream in the Octopus 15.2.0 release [1].
[0] https:/
[1] https:/
The default osd op queue is 'wpq' for Luminous and above. With the weighted priority queue set, you can use the following workaround to address this issue:
Update the juju configuration to persist the new cut-off setting:

juju config ceph-osd config-flags='{"osd": {"osd op queue cut off": "high"}}'
Take care when modifying config-flags to check for any existing flags that may have been set. The new flag must be merged with any that are already configured.
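The merge step can be sketched as follows; the existing_flags value is a hypothetical example of flags already set on the application:

```python
import json

# Hypothetical existing config-flags value, as returned by:
#   juju config ceph-osd config-flags
existing_flags = '{"osd": {"osd max backfills": "1"}}'

flags = json.loads(existing_flags)
# Merge in the new setting without clobbering other osd options
flags.setdefault("osd", {})["osd op queue cut off"] = "high"

merged = json.dumps(flags)
print(merged)
# The merged JSON string is what gets passed back to:
#   juju config ceph-osd config-flags="$merged"
```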
Then modify the run-time config by running the following command on the ceph-mon:

sudo ceph tell osd.* config set osd_op_queue_cut_off high