Activity log for bug #1628750

Date Who What changed Old value New value Message
2016-09-29 03:01:07 James Troup bug added bug
2016-09-29 03:02:48 James Troup attachment added ceph.debdiff https://bugs.launchpad.net/ubuntu/+source/ceph/+bug/1628750/+attachment/4750653/+files/ceph.debdiff
2016-09-29 03:03:21 James Troup bug added subscriber The Canonical Sysadmins
2016-09-29 04:33:13 Ubuntu Foundations Team Bug Bot tags patch
2016-09-29 04:33:23 Ubuntu Foundations Team Bug Bot bug added subscriber Ubuntu Sponsors Team
2016-09-29 07:06:11 James Page ceph (Ubuntu): status New Triaged
2016-09-29 07:06:15 James Page ceph (Ubuntu): importance Undecided Critical
2016-09-29 07:06:28 James Page nominated for series Ubuntu Yakkety
2016-09-29 07:06:28 James Page bug task added ceph (Ubuntu Yakkety)
2016-09-29 07:06:28 James Page nominated for series Ubuntu Xenial
2016-09-29 07:06:28 James Page bug task added ceph (Ubuntu Xenial)
2016-09-29 07:10:43 James Page bug task added cloud-archive
2016-09-29 07:10:52 James Page nominated for series cloud-archive/mitaka
2016-09-29 07:10:52 James Page bug task added cloud-archive/mitaka
2016-09-29 07:10:59 James Page cloud-archive: status New Invalid
2016-09-29 07:11:04 James Page cloud-archive/mitaka: status New Triaged
2016-09-29 07:11:06 James Page ceph (Ubuntu Xenial): status New Triaged
2016-09-29 07:11:09 James Page cloud-archive/mitaka: importance Undecided Critical
2016-09-29 07:11:11 James Page ceph (Ubuntu Xenial): importance Undecided Critical
2016-09-29 12:03:59 Edward Hope-Morley bug added subscriber Edward Hope-Morley
2016-09-29 12:58:25 James Page cloud-archive/mitaka: assignee James Page (james-page)
2016-09-29 12:58:27 James Page ceph (Ubuntu Xenial): assignee James Page (james-page)
2016-09-29 12:58:29 James Page ceph (Ubuntu Yakkety): assignee James Page (james-page)
2016-09-30 09:50:53 James Page description updated (old value: the original bug report, repeated verbatim under ">> Original Bug Report <<" in the new value below)

New value:

[Impact]

In Ceph deployments with large numbers of objects (typically generated by use of radosgw for object storage), during recovery operations when servers or disks fail, it is quite possible for OSDs recovering data to hit their suicide timeout and shut down because of the number of objects each is trying to recover in a single chunk between heartbeats. As a result, clusters go read-only due to data unavailability.

[Test Case]

Non-trivial to reproduce; see the original bug report below.

[Regression Potential]

Medium; the fix for this problem is to reduce the number of operations per chunk to 64000, limiting the chance that an OSD will fail to heartbeat and suicide itself as a result. This is configurable, so it can be tuned on a per-environment basis (see the tuning sketch after this log). The patch has been accepted into the Ceph master branch, but is not currently targeted as a stable fix for Jewel.

>> Original Bug Report <<

We've run into significant issues with RadosGW at scale; we have a customer who has ½ billion objects in ~20 TB of data, and whenever they lost an OSD for whatever reason, even for a very short period of time, Ceph took hours and hours to recover. The whole time it was recovering, requests to RadosGW were hanging.

I ended up cherry-picking three patches, two from 10.2.3 and one from trunk:

* d/p/fix-pg-temp.patch: cherry-pick 56bbcb1aa11a2beb951de396b0de9e3373d91c57 from jewel.
* d/p/only-update-up_thru-if-newer.patch: 6554d462059b68ab983c0c8355c465e98ca45440 from jewel.
* d/p/limit-omap-data-in-push-op.patch: 38609de1ec5281602d925d20c392ba4094fdf9d3 from master.

The two from 10.2.3 are there because pg_temp was implicated in one of the longer outages we had. The last one is what I think actually got us to a point where Ceph was stable; I found it via the following URL chain: http://lists.opennebula.org/pipermail/ceph-users-ceph.com/2016-June/010230.html -> http://tracker.ceph.com/issues/16128 -> https://github.com/ceph/ceph/pull/9894 -> https://github.com/ceph/ceph/commit/38609de1ec5281602d925d20c392ba4094fdf9d3

With these three patches applied the customer has been stable for four days now, but I've yet to restart the entire cluster (only the stuck OSDs), so it's hard to be completely sure that all our issues are resolved, or which of the patches fixed things. I've attached the debdiff I used for reference.
2016-09-30 09:51:45 James Page bug added subscriber Ubuntu Stable Release Updates Team
2016-10-01 03:47:48 Launchpad Janitor ceph (Ubuntu Yakkety): status Triaged Fix Released
2016-10-05 15:03:28 Brian Murray ceph (Ubuntu Xenial): status Triaged Fix Committed
2016-10-05 15:03:32 Brian Murray bug added subscriber SRU Verification
2016-10-05 15:03:36 Brian Murray tags patch patch verification-needed
2016-10-14 19:44:47 Frode Nordahl tags patch verification-needed patch sts verification-needed
2016-10-19 14:52:57 James Page cloud-archive/mitaka: status Triaged Fix Committed
2016-10-19 14:53:00 James Page tags patch sts verification-needed patch sts verification-mitaka-needed verification-needed
2016-11-22 23:05:35 Brian Murray removed subscriber Ubuntu Sponsors Team
2016-11-28 11:20:22 James Page tags patch sts verification-mitaka-needed verification-needed patch sts verification-done verification-mitaka-needed
2016-11-30 12:42:46 Robie Basak removed subscriber Ubuntu Stable Release Updates Team
2016-11-30 12:52:45 Launchpad Janitor ceph (Ubuntu Xenial): status Fix Committed Fix Released
2017-02-28 13:52:01 Frode Nordahl tags patch sts verification-done verification-mitaka-needed patch sts verification-done verification-mitaka-done
2017-02-28 13:52:12 Frode Nordahl cloud-archive/mitaka: status Fix Committed Fix Released
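
For reference, the [Regression Potential] section above describes the fix as a configurable per-chunk recovery limit. Below is a minimal sketch of lowering that limit at runtime across all OSDs. It assumes the tunable introduced by the referenced upstream commit (38609de) is named osd_recovery_max_omap_entries_per_chunk and that 64000 is the value mentioned in the description; neither the name nor the value is confirmed by this log itself, so treat both as assumptions.

#!/usr/bin/env python3
"""Sketch: lower the per-chunk omap recovery limit on every running OSD.

Assumes the Ceph CLI is installed and the caller has admin access to the
cluster. The option name (osd_recovery_max_omap_entries_per_chunk) is
taken from the upstream commit referenced in the bug description; it is
an assumption, not something confirmed by this activity log.
"""
import subprocess


def set_omap_chunk_limit(entries: int = 64000) -> None:
    # 'ceph tell osd.* injectargs' pushes a config change to all running
    # OSDs without a restart; the change does not survive an OSD restart.
    subprocess.run(
        ["ceph", "tell", "osd.*", "injectargs",
         f"--osd_recovery_max_omap_entries_per_chunk {entries}"],
        check=True,
    )


if __name__ == "__main__":
    set_omap_chunk_limit()

To make the setting persistent, the equivalent line would go in the [osd] section of ceph.conf (e.g. osd recovery max omap entries per chunk = 64000), followed by an OSD restart; again, the exact option name is an assumption taken from the upstream change.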