Please backport fixes from 10.2.3 and tip for RadosGW
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Ubuntu Cloud Archive | Invalid | Undecided | Unassigned |
Mitaka | Fix Released | Critical | James Page |
ceph (Ubuntu) | Fix Released | Critical | James Page |
Xenial | Fix Released | Critical | James Page |
Yakkety | Fix Released | Critical | James Page |
Bug Description
[Impact]
In Ceph deployments with large numbers of objects (typically generated by use of RadosGW for object storage), during recovery operations when servers or disks fail, it is quite possible for OSDs recovering data to hit their suicide timeout and shut down, because of the number of objects each was trying to recover in a single chunk between heartbeats. As a result, clusters go read-only due to reduced data availability.
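For context, the shutdown in question is the OSD's own suicide timeout (an internal watchdog that aborts the daemon when a worker thread stalls for too long), not an ordinary crash. A minimal sketch of the ceph.conf settings involved, assuming roughly Jewel-era defaults, looks like:

    [osd]
    # how often an OSD heartbeats its peers (seconds)
    osd heartbeat interval = 6
    # how long peers wait before reporting an OSD down
    osd heartbeat grace = 20
    # a worker thread stuck longer than this aborts the whole OSD daemon
    osd op thread suicide timeout = 150

Actual values should be checked on the running cluster (e.g. with 'ceph daemon osd.<id> config show') rather than taken from this sketch.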
[Test Case]
Non-trivial to reproduce - see original bug report.
[Regression Potential]
Medium; the fix for this problem is to reduce the number of operations per chunk to 64000, limiting the chance that an OSD will miss its heartbeats and hit its suicide timeout as a result. This limit is configurable, so it can be tuned on a per-environment basis.
The patch has been accepted into the Ceph master branch, but is not currently targeted as a stable fix for Jewel.
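As an illustration of the per-environment tuning mentioned above (a sketch only: the option name is assumed from the upstream "limit omap data in push op" change, where it appears as osd_recovery_max_omap_entries_per_chunk, and 8096 is an arbitrary example value, not a recommendation), the chunk limit could be lowered either in ceph.conf or at runtime:

    # ceph.conf, [osd] section
    osd recovery max omap entries per chunk = 8096

    # or injected into running OSDs without a restart
    ceph tell osd.* injectargs '--osd_recovery_max_omap_entries_per_chunk 8096'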
>> Original Bug Report <<
We've run into significant issues with RadosGW at scale; we have a
customer who has ½ billion objects in ~20 TB of data, and whenever they
lost an OSD for whatever reason, even for a very short period of time,
Ceph was taking hours and hours to recover. The whole time it was
recovering, requests to RadosGW were hanging.
I ended up cherry-picking three patches: two from 10.2.3 and one from trunk:
* d/p/fix-
* d/p/only-
* d/p/limit-
The two from 10.2.3 are included because pg_temp was implicated in one of the
longer outages we had.
The last one is what I think actually got us to the point where Ceph was
stable; I found it via the following URL chain:
http://
-> http://
-> https:/
-> https:/
With these three patches applied the customer has been stable for 4 days
now, but I've yet to restart the entire cluster (only the stuck OSDs),
so it's hard to be completely sure that all our issues are resolved,
or which of the patches fixed things.
I've attached the debdiff I used for reference.
Changed in ceph (Ubuntu):
status: New → Triaged
importance: Undecided → Critical
Changed in cloud-archive:
status: New → Invalid
Changed in ceph (Ubuntu Xenial):
status: New → Triaged
importance: Undecided → Critical
assignee: nobody → James Page (james-page)
Changed in ceph (Ubuntu Yakkety):
assignee: nobody → James Page (james-page)
description: updated
tags: added: sts
tags: added: verification-done; removed: verification-needed
tags: added: verification-mitaka-done; removed: verification-mitaka-needed
For what it's worth, we're running Ubuntu 14.04 + Mitaka from the Ubuntu Cloud Archive.