Please backport fixes from 10.2.3 and tip for RadosGW
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Ubuntu Cloud Archive | Invalid | Undecided | Unassigned |
Mitaka | Fix Released | Critical | James Page |
ceph (Ubuntu) | Fix Released | Critical | James Page |
Xenial | Fix Released | Critical | James Page |
Yakkety | Fix Released | Critical | James Page |
Bug Description
[Impact]
In Ceph deployments with large numbers of objects (typically generated by use of RadosGW for object storage), during recovery operations when servers or disks fail, it is quite possible for OSDs recovering data to hit their suicide timeout and shut down, because of the number of objects each was trying to recover in a single chunk between heartbeats. As a result, clusters go read-only due to reduced data availability.
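For context, the shutdown in question is the OSD's own suicide timeout (an internal watchdog that aborts the daemon when a worker thread stalls for too long), not an ordinary crash. A minimal sketch of the ceph.conf settings involved, assuming roughly Jewel-era defaults, looks like:

    [osd]
    # how often an OSD heartbeats its peers (seconds)
    osd heartbeat interval = 6
    # how long peers wait before reporting an OSD down
    osd heartbeat grace = 20
    # a worker thread stuck longer than this aborts the whole OSD daemon
    osd op thread suicide timeout = 150

Actual values should be checked on the running cluster (e.g. with 'ceph daemon osd.<id> config show') rather than taken from this sketch.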
[Test Case]
Non-trivial to reproduce - see original bug report.
[Regression Potential]
Medium; the fix for this problem is to reduce the number of operations per chunk to 64000, limiting the chance that an OSD will miss its heartbeats and hit its suicide timeout as a result. This limit is configurable, so it can be tuned on a per-environment basis.
The patch has been accepted into the Ceph master branch, but is not currently targeted as a stable fix for Jewel.
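As an illustration of the per-environment tuning mentioned above (a sketch only: the option name is assumed from the upstream "limit omap data in push op" change, where it appears as osd_recovery_max_omap_entries_per_chunk, and 8096 is an arbitrary example value, not a recommendation), the chunk limit could be lowered either in ceph.conf or at runtime:

    # ceph.conf, [osd] section
    osd recovery max omap entries per chunk = 8096

    # or injected into running OSDs without a restart
    ceph tell osd.* injectargs '--osd_recovery_max_omap_entries_per_chunk 8096'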
>> Original Bug Report <<
We've run into significant issues with RadosGW at scale; we have a
customer who has ½ billion objects in ~20 TB of data, and whenever they
lost an OSD for whatever reason, even for a very short period of time,
Ceph was taking hours and hours to recover. The whole time it was
recovering, requests to RadosGW were hanging.
I ended up cherry-picking three patches: two from 10.2.3 and one from trunk:
* d/p/fix-
* d/p/only-
* d/p/limit-
The two from 10.2.3 are included because pg_temp was implicated in one of the
longer outages we had.
The last one is what I think actually got us to the point where Ceph was
stable; I found it via the following URL chain:
http://
-> http://
-> https:/
-> https:/
With these three patches applied the customer has been stable for 4 days
now, but I've yet to restart the entire cluster (only the stuck OSDs),
so it's hard to be completely sure that all our issues are resolved,
or which of the patches fixed things.
I've attached the debdiff I used for reference.
Changed in ceph (Ubuntu):
status: New → Triaged
importance: Undecided → Critical
Changed in cloud-archive:
status: New → Invalid
Changed in ceph (Ubuntu Xenial):
status: New → Triaged
importance: Undecided → Critical
assignee: nobody → James Page (james-page)
Changed in ceph (Ubuntu Yakkety):
assignee: nobody → James Page (james-page)
description: updated
tags: added: sts
tags: added: verification-done; removed: verification-needed
tags: added: verification-mitaka-done; removed: verification-mitaka-needed
For what it's worth, we're running Ubuntu 14.04 + Mitaka from the Ubuntu Cloud Archive.