Activity log for bug #1628750

Date Who What changed Old value New value Message
2016-09-29 03:01:07 James Troup bug added bug
2016-09-29 03:02:48 James Troup attachment added ceph.debdiff https://bugs.launchpad.net/ubuntu/+source/ceph/+bug/1628750/+attachment/4750653/+files/ceph.debdiff
2016-09-29 03:03:21 James Troup bug added subscriber The Canonical Sysadmins
2016-09-29 04:33:13 Ubuntu Foundations Team Bug Bot tags patch
2016-09-29 04:33:23 Ubuntu Foundations Team Bug Bot bug added subscriber Ubuntu Sponsors Team
2016-09-29 07:06:11 James Page ceph (Ubuntu): status New Triaged
2016-09-29 07:06:15 James Page ceph (Ubuntu): importance Undecided Critical
2016-09-29 07:06:28 James Page nominated for series Ubuntu Yakkety
2016-09-29 07:06:28 James Page bug task added ceph (Ubuntu Yakkety)
2016-09-29 07:06:28 James Page nominated for series Ubuntu Xenial
2016-09-29 07:06:28 James Page bug task added ceph (Ubuntu Xenial)
2016-09-29 07:10:43 James Page bug task added cloud-archive
2016-09-29 07:10:52 James Page nominated for series cloud-archive/mitaka
2016-09-29 07:10:52 James Page bug task added cloud-archive/mitaka
2016-09-29 07:10:59 James Page cloud-archive: status New Invalid
2016-09-29 07:11:04 James Page cloud-archive/mitaka: status New Triaged
2016-09-29 07:11:06 James Page ceph (Ubuntu Xenial): status New Triaged
2016-09-29 07:11:09 James Page cloud-archive/mitaka: importance Undecided Critical
2016-09-29 07:11:11 James Page ceph (Ubuntu Xenial): importance Undecided Critical
2016-09-29 12:03:59 Edward Hope-Morley bug added subscriber Edward Hope-Morley
2016-09-29 12:58:25 James Page cloud-archive/mitaka: assignee James Page (james-page)
2016-09-29 12:58:27 James Page ceph (Ubuntu Xenial): assignee James Page (james-page)
2016-09-29 12:58:29 James Page ceph (Ubuntu Yakkety): assignee James Page (james-page)
2016-09-30 09:50:53 James Page description updated (old value: the original bug report, repeated verbatim under ">> Original Bug Report <<" in the new value below)

New value:

[Impact]

In Ceph deployments with large numbers of objects (typically generated by use of radosgw for object storage), during recovery operations when servers or disks fail, it is quite possible for OSDs recovering data to hit their suicide timeout and shut down because of the number of objects each is trying to recover in a single chunk between heartbeats. As a result, clusters go read-only due to data unavailability.

[Test Case]

Non-trivial to reproduce; see the original bug report below.

[Regression Potential]

Medium; the fix for this problem is to reduce the number of operations per chunk to 64000, limiting the chance that an OSD will fail to heartbeat and suicide itself as a result. This is configurable, so it can be tuned on a per-environment basis (see the tuning sketch after this log). The patch has been accepted into the Ceph master branch, but is not currently targeted as a stable fix for Jewel.

>> Original Bug Report <<

We've run into significant issues with RadosGW at scale; we have a customer who has ½ billion objects in ~20 TB of data, and whenever they lost an OSD for whatever reason, even for a very short period of time, Ceph took hours and hours to recover. The whole time it was recovering, requests to RadosGW were hanging.

I ended up cherry-picking three patches, two from 10.2.3 and one from trunk:

* d/p/fix-pg-temp.patch: cherry-pick 56bbcb1aa11a2beb951de396b0de9e3373d91c57 from jewel.
* d/p/only-update-up_thru-if-newer.patch: 6554d462059b68ab983c0c8355c465e98ca45440 from jewel.
* d/p/limit-omap-data-in-push-op.patch: 38609de1ec5281602d925d20c392ba4094fdf9d3 from master.

The two from 10.2.3 are there because pg_temp was implicated in one of the longer outages we had. The last one is what I think actually got us to a point where Ceph was stable; I found it via the following URL chain: http://lists.opennebula.org/pipermail/ceph-users-ceph.com/2016-June/010230.html -> http://tracker.ceph.com/issues/16128 -> https://github.com/ceph/ceph/pull/9894 -> https://github.com/ceph/ceph/commit/38609de1ec5281602d925d20c392ba4094fdf9d3

With these three patches applied the customer has been stable for four days now, but I've yet to restart the entire cluster (only the stuck OSDs), so it's hard to be completely sure that all our issues are resolved, or which of the patches fixed things. I've attached the debdiff I used for reference.
2016-09-30 09:51:45 James Page bug added subscriber Ubuntu Stable Release Updates Team
2016-10-01 03:47:48 Launchpad Janitor ceph (Ubuntu Yakkety): status Triaged Fix Released
2016-10-05 15:03:28 Brian Murray ceph (Ubuntu Xenial): status Triaged Fix Committed
2016-10-05 15:03:32 Brian Murray bug added subscriber SRU Verification
2016-10-05 15:03:36 Brian Murray tags patch patch verification-needed
2016-10-14 19:44:47 Frode Nordahl tags patch verification-needed patch sts verification-needed
2016-10-19 14:52:57 James Page cloud-archive/mitaka: status Triaged Fix Committed
2016-10-19 14:53:00 James Page tags patch sts verification-needed patch sts verification-mitaka-needed verification-needed
2016-11-22 23:05:35 Brian Murray removed subscriber Ubuntu Sponsors Team
2016-11-28 11:20:22 James Page tags patch sts verification-mitaka-needed verification-needed patch sts verification-done verification-mitaka-needed
2016-11-30 12:42:46 Robie Basak removed subscriber Ubuntu Stable Release Updates Team
2016-11-30 12:52:45 Launchpad Janitor ceph (Ubuntu Xenial): status Fix Committed Fix Released
2017-02-28 13:52:01 Frode Nordahl tags patch sts verification-done verification-mitaka-needed patch sts verification-done verification-mitaka-done
2017-02-28 13:52:12 Frode Nordahl cloud-archive/mitaka: status Fix Committed Fix Released
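
For reference, the [Regression Potential] section above describes the fix as a configurable per-chunk recovery limit. Below is a minimal sketch of lowering that limit at runtime across all OSDs. It assumes the tunable introduced by the referenced upstream commit (38609de) is named osd_recovery_max_omap_entries_per_chunk and that 64000 is the value mentioned in the description; neither the name nor the value is confirmed by this log itself, so treat both as assumptions.

#!/usr/bin/env python3
"""Sketch: lower the per-chunk omap recovery limit on every running OSD.

Assumes the Ceph CLI is installed and the caller has admin access to the
cluster. The option name (osd_recovery_max_omap_entries_per_chunk) is
taken from the upstream commit referenced in the bug description; it is
an assumption, not something confirmed by this activity log.
"""
import subprocess


def set_omap_chunk_limit(entries: int = 64000) -> None:
    # 'ceph tell osd.* injectargs' pushes a config change to all running
    # OSDs without a restart; the change does not survive an OSD restart.
    subprocess.run(
        ["ceph", "tell", "osd.*", "injectargs",
         f"--osd_recovery_max_omap_entries_per_chunk {entries}"],
        check=True,
    )


if __name__ == "__main__":
    set_omap_chunk_limit()

To make the setting persistent, the equivalent line would go in the [osd] section of ceph.conf (e.g. osd recovery max omap entries per chunk = 64000), followed by an OSD restart; again, the exact option name is an assumption taken from the upstream change.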