Recovery operations take higher priority than client I/O with mclock scheduler

Bug #2013960 reported by Ponnuvel Palaniyappan
This bug affects 3 people
Affects: ceph (Ubuntu)
Status: Fix Released
Importance: Undecided
Assigned to: Unassigned

Bug Description

Starting with Quincy, the mclock_scheduler is used as the default for the OSD op queue. However, the default recovery settings are so high that the impact on client I/O can be severe, depending on the amount of recovery work that needs to be done. This is a bug and has been fixed in the 'main' branch and backported to Quincy [0][1].
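For context, you can check which scheduler and mclock profile a running OSD is using with standard `ceph config` commands (a quick check; `osd.0` below is just an example daemon):
```
# Which op queue scheduler is osd.0 running with? (Quincy default: mclock_scheduler)
ceph config show osd.0 osd_op_queue

# Which mclock profile is in effect? (Quincy default: high_client_ops)
ceph config show osd.0 osd_mclock_profile
```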

There's no upstream Quincy release with this fix yet. 17.2.6 will include it and is undergoing QA at the moment.

Workaround:

There are a couple of ways this can be mitigated in Quincy.

1. Use 'wpq' as the osd_op_queue. This was the default in previous releases and works just fine. It requires restarting the OSDs; a full command sequence is sketched after the steps below.
Steps:
i. Change osd_op_queue to 'wpq': `sudo ceph config set osd osd_op_queue wpq`
ii. Rolling restart of all the OSDs (with `noout` & `norebalance` flags set)
iii. Check that 'wpq' is now set: `ceph tell osd.* config get osd_op_queue`
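A sketch of the full sequence for a systemd-based (non-cephadm) deployment; adjust the restart step to your environment:
```
# 1. Switch the op queue scheduler back to wpq cluster-wide.
sudo ceph config set osd osd_op_queue wpq

# 2. Set flags to avoid unnecessary data movement during the restarts.
sudo ceph osd set noout
sudo ceph osd set norebalance

# 3. Restart each OSD in turn, waiting for it to come back up
#    before moving on (shown here for osd.0).
sudo systemctl restart ceph-osd@0

# 4. Once all OSDs have been restarted, clear the flags.
sudo ceph osd unset noout
sudo ceph osd unset norebalance

# 5. Verify that every OSD now reports 'wpq'.
ceph tell osd.* config get osd_op_queue
```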

2. Stick with the mclock scheduler but use the custom mclock profile. This allows users to modify the following recovery parameters (a command sketch follows the list):
```
osd_mclock_scheduler_background_recovery_res
osd_mclock_scheduler_background_recovery_wgt
osd_mclock_scheduler_background_recovery_lim
```
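A sketch of how this might look; the values below are purely illustrative (in Quincy the reservation and limit are IOPS-based), so tune them for your hardware:
```
# Switch from the built-in profile to 'custom' so the recovery
# parameters become user-modifiable.
sudo ceph config set osd osd_mclock_profile custom

# Illustrative values only; pick numbers appropriate for your devices.
sudo ceph config set osd osd_mclock_scheduler_background_recovery_res 20
sudo ceph config set osd osd_mclock_scheduler_background_recovery_wgt 1
sudo ceph config set osd osd_mclock_scheduler_background_recovery_lim 100
```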
To be able to use this option, 17.2.4 or later is required due to another bug [2]. So it's probably simpler and more straightforward to stick with 'wpq' until the fix for [0] is available, i.e. until 17.2.6 is out.

NB: This affects the Quincy release only. Older releases (Pacific, Octopus, etc.) use 'wpq', so their recovery parameters can be modified as usual; this behaviour changed only starting with Quincy.

[0] https://tracker.ceph.com/issues/57529
[1] https://github.com/ceph/ceph/pull/48226
[2] https://tracker.ceph.com/issues/55153

Tags: sts
Ponnuvel Palaniyappan (pponnuvel) wrote (last edit):

Ceph 17.2.6, which has the fix for this issue, has now been released:

https://ceph.io/en/news/blog/2023/v17-2-6-quincy-released/

Launchpad Janitor (janitor) wrote:

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in ceph (Ubuntu):
status: New → Confirmed
Ponnuvel Palaniyappan (pponnuvel) wrote:

The Quincy point release 17.2.6 (which has the fix for this) is being tracked via https://bugs.launchpad.net/ubuntu/+source/ceph/+bug/2018929

Ponnuvel Palaniyappan (pponnuvel) wrote:

Quincy 17.2.6, which has the fix for this issue, has been released with the following packages:
```
 ceph | 17.2.6-0ubuntu0.22.04.1~cloud0 | yoga | focal-updates | source
 ceph | 17.2.6-0ubuntu0.22.04.1 | jammy-updates | source, amd64, arm64, armhf, ppc64el, riscv64, s390x
 ceph | 17.2.6-0ubuntu0.22.10.1 | kinetic-updates | source, amd64, arm64, armhf, ppc64el, riscv64, s390x
 ceph | 17.2.6-0ubuntu0.23.04.1 | lunar-updates | source, amd64, arm64, armhf, ppc64el, riscv64, s390x
 ceph | 17.2.6-0ubuntu1 | mantic | source, amd64, arm64, armhf, ppc64el, riscv64, s390x
```

So marking this as 'fix released'.

Changed in ceph (Ubuntu):
status: Confirmed → Fix Released