Please backport fixes from 10.2.3 and tip for RadosGW

Bug #1628750 reported by James Troup
This bug affects 1 person
Affects               Status        Importance  Assigned to  Milestone
Ubuntu Cloud Archive  Invalid       Undecided   Unassigned
Mitaka                Fix Released  Critical    James Page
ceph (Ubuntu)         Fix Released  Critical    James Page
Xenial                Fix Released  Critical    James Page
Yakkety               Fix Released  Critical    James Page

Bug Description

[Impact]
In Ceph deployments with large numbers of objects (typically generated by use of radosgw for object storage), during recovery operations when servers or disks fail, it is quite possible for OSDs recovering data to hit their suicide timeout and shut down because of the number of objects each was trying to recover in a single chunk between heartbeats. As a result, clusters go read-only because data becomes unavailable.

[Test Case]
Non-trivial to reproduce - see original bug report.

[Regression Potential]
Medium; the fix for this problem is to reduce the number of operations per chunk to 64000, limiting the chance that an OSD will fail to heartbeat and suicide as a result. This is configurable, so it can be tuned on a per-environment basis.
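
For illustration only, here is a minimal Python sketch of the idea behind the limit described above: cap how many entries go into one recovery chunk so that a heartbeat can happen between chunks. This is not the Ceph implementation; the names MAX_ENTRIES_PER_CHUNK, split_into_chunks, recover, send_chunk and heartbeat are all hypothetical, and treating a limit of 0 as "no cap" is an assumption of the sketch.

```python
from typing import Callable, Dict, Iterator, List, Tuple

# Hypothetical cap, mirroring the 64000 figure quoted above.
MAX_ENTRIES_PER_CHUNK = 64000

Chunk = List[Tuple[str, bytes]]


def split_into_chunks(
    entries: Dict[str, bytes],
    max_entries: int = MAX_ENTRIES_PER_CHUNK,
) -> Iterator[Chunk]:
    """Yield sorted (key, value) pairs in chunks of at most max_entries.

    Assumption of this sketch: max_entries <= 0 disables the cap.
    """
    items = sorted(entries.items())
    if not items:
        return
    if max_entries <= 0:
        yield items
        return
    for start in range(0, len(items), max_entries):
        yield items[start:start + max_entries]


def recover(
    entries: Dict[str, bytes],
    send_chunk: Callable[[Chunk], None],
    heartbeat: Callable[[], None],
    max_entries: int = MAX_ENTRIES_PER_CHUNK,
) -> None:
    """Push entries chunk by chunk, heartbeating after every chunk."""
    for chunk in split_into_chunks(entries, max_entries):
        send_chunk(chunk)
        heartbeat()  # work between heartbeats is bounded by max_entries
```

With an uncapped chunk, an object carrying millions of entries would be pushed in one step with no opportunity to heartbeat; with the cap, the work between heartbeats is bounded by the chunk size, which is why a smaller value trades recovery throughput for liveness.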

The patch has been accepted into the Ceph master branch, but is not currently targeted as a stable fix for Jewel.

>> Original Bug Report <<

We've run into significant issues with RadosGW at scale; we have a
customer who has ½ billion objects in ~20TB of data and whenever they
lose an OSD for whatever reason, even for a very short period of time,
ceph was taking hours and hours to recover. The whole time it was
recovering, requests to RadosGW were hanging.

I ended up cherry picking 3 patches; 2 from 10.2.3 and one from trunk:

  * d/p/fix-pg-temp.patch: cherry pick 56bbcb1aa11a2beb951de396b0de9e3373d91c57 from jewel.
  * d/p/only-update-up_thru-if-newer.patch: 6554d462059b68ab983c0c8355c465e98ca45440 from jewel.
  * d/p/limit-omap-data-in-push-op.patch: 38609de1ec5281602d925d20c392ba4094fdf9d3 from master.

The 2 from 10.2.3 are because pg_temp was implicated in one of the
longer outages we had.

The last one is what I think actually got us to a point where ceph was
stable and I found it via the following URL chain:

http://lists.opennebula.org/pipermail/ceph-users-ceph.com/2016-June/010230.html
-> http://tracker.ceph.com/issues/16128
-> https://github.com/ceph/ceph/pull/9894
-> https://github.com/ceph/ceph/commit/38609de1ec5281602d925d20c392ba4094fdf9d3

With these 3 patches applied the customer has been stable for 4 days
now, but I've yet to restart the entire cluster (only the stuck OSDs),
so it's hard to be completely sure that all our issues are resolved,
or which of the patches fixed things.

I've attached the debdiff I used for reference.

James Troup (elmo) wrote :

For what it's worth we're running Ubuntu 14.04 + Mitaka from the Ubuntu Cloud Archive.

James Troup (elmo) wrote :
Ubuntu Foundations Team Bug Bot (crichton) wrote :

The attachment "ceph.debdiff" seems to be a debdiff. The ubuntu-sponsors team has been subscribed to the bug report so that they can review and hopefully sponsor the debdiff. If the attachment isn't a patch, please remove the "patch" flag from the attachment, remove the "patch" tag, and if you are a member of ~ubuntu-sponsors, unsubscribe the team.

[This is an automated message performed by a Launchpad user owned by ~brian-murray, for any issue please contact him.]

tags: added: patch
James Page (james-page)
Changed in ceph (Ubuntu):
status: New → Triaged
importance: Undecided → Critical
Changed in cloud-archive:
status: New → Invalid
Changed in ceph (Ubuntu Xenial):
status: New → Triaged
importance: Undecided → Critical
James Page (james-page) wrote :

10.2.3 tracked under bug 1628809

James Page (james-page)
Changed in ceph (Ubuntu Xenial):
assignee: nobody → James Page (james-page)
Changed in ceph (Ubuntu Yakkety):
assignee: nobody → James Page (james-page)
James Page (james-page)
description: updated
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package ceph - 10.2.3-0ubuntu1

---------------
ceph (10.2.3-0ubuntu1) yakkety; urgency=medium

  * New upstream point release (LP: #1628809):
    - d/p/rocksdb-flags.patch: Dropped, included upstream.
    - d/p/*: Refreshed.
    - d/p/32bit-ftbfs.patch: Cherry pick fix for 32bit arch compat.
    - d/ceph-{fs-common,fuse}.install: Fix install locations
      for mount{.fuse}.ceph.
  * Limit the amount of data per chunk in omap push operations to 64k,
    ensuring that OSD threads don't hit timeouts during recovery
    operations (LP: #1628750):
    - d/p/osd-limit-omap-data-in-push-op.patch: Cherry pick fix from
      upstream master branch.

 -- James Page <email address hidden> Thu, 29 Sep 2016 21:44:33 +0100

Changed in ceph (Ubuntu Yakkety):
status: Triaged → Fix Released
Brian Murray (brian-murray) wrote : Please test proposed package

Hello James, or anyone else affected,

Accepted ceph into xenial-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/ceph/10.2.3-0ubuntu0.16.04.1 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-needed to verification-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

Changed in ceph (Ubuntu Xenial):
status: Triaged → Fix Committed
tags: added: verification-needed
James Troup (elmo) wrote :

FYI, I can't test the xenial-proposed upload on the customer site as they're running Trusty with Mitaka providing Ceph. I'll try and find a Xenial deployment to test this on, but we don't have many customers using RadosGW.

Frode Nordahl (fnordahl) wrote :

Working on a lab+test to attempt to verify this on Xenial. Will report progress as soon as I have predictable and recurring results.

@elmo: How is the patch holding up in your Trusty+Mitaka environment so far?

Frode Nordahl (fnordahl)
tags: added: sts
James Page (james-page) wrote :

Hello James, or anyone else affected,

Accepted ceph into mitaka-proposed. The package will build now and be available in the Ubuntu Cloud Archive in a few hours, and then in the -proposed repository.

Please help us by testing this new package. To enable the -proposed repository:

  sudo add-apt-repository cloud-archive:mitaka-proposed
  sudo apt-get update

Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-mitaka-needed to verification-mitaka-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-mitaka-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

tags: added: verification-mitaka-needed
Brian Murray (brian-murray) wrote :

Hello James, or anyone else affected,

Accepted ceph into xenial-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/ceph/10.2.3-0ubuntu0.16.04.2 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-needed to verification-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

James Page (james-page)
tags: added: verification-done
removed: verification-needed
Robie Basak (racb) wrote : Update Released

The verification of the Stable Release Update for ceph has completed successfully and the package has now been released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package ceph - 10.2.3-0ubuntu0.16.04.2

---------------
ceph (10.2.3-0ubuntu0.16.04.2) xenial; urgency=medium

  * rgw: Fixes for creation times for buckets (LP: #1587261):
    - d/p/rgw_rados-creation_time.patch: Backport fix from upstream master.
      Fix logic error that leads to creation time being 0 instead of current
      time when creating buckets.

ceph (10.2.3-0ubuntu0.16.04.1) xenial; urgency=medium

  * New upstream stable release (LP: #1628809).
    - d/p/*: Refresh.
    - d/p/rocksdb-flags.patch: Dropped, accepted upstream.
    - d/p/32bit-ftbfs.patch: Cherry pick fix for 32bit arch compat.
    - d/ceph-{fs-common,fuse}.install: Fix install locations
      for mount{.fuse}.ceph.
  * Limit the amount of data per chunk in omap push operations to 64k,
    ensuring that OSD threads don't hit timeouts during recovery
    operations (LP: #1628750):
    - d/p/osd-limit-omap-data-in-push-op.patch: Cherry pick fix from
      upstream master branch.

 -- Frode Nordahl <email address hidden> Fri, 28 Oct 2016 13:50:40 +0200

Changed in ceph (Ubuntu Xenial):
status: Fix Committed → Fix Released
Frode Nordahl (fnordahl)
tags: added: verification-mitaka-done
removed: verification-mitaka-needed