mongo aggregation pipeline for resource retrieval fails with excessive memory use

Bug #1262571 reported by Eoghan Glynn on 2013-12-19
This bug affects 3 people
Affects     Importance  Assigned to
Ceilometer  High        Eoghan Glynn
Havana      High        Eoghan Glynn

Bug Description

The mongodb storage driver currently uses an aggregation pipeline over the meter collection in order to construct a list of resources adorned with first & last sample timestamps etc.

The problem with this approach is that the mongodb aggregation framework performs sorting in-memory, in this case operating over a potentially very large collection (particularly if the GET /v2/resources request was not constrained with query params, e.g. to limit results to a single tenant).

It turns out the mongodb innards are hardcoded to abort any sorts in an aggregation pipeline that will consume more than 10% of physical memory. The net result is that we see failures in production such as:

ERROR wsme.api [-] Server-side error: "command SON([('aggregate', u'meter'), ('pipeline', [{'$match': {}},
{'$sort': {'timestamp': -1, 'project_id': -1, 'user_id': -1}}, {'$group': {'meters_unit': {'$push': '$counter_unit'},
'source': {'$first': '$source'}, 'project_id': {'$first': '$project_id'},
'user_id': {'$first': '$user_id'}, 'last_sample_timestamp': {'$max': '$timestamp'},
'meters_name': {'$push': '$counter_name'}, 'first_sample_timestamp': {'$min': '$timestamp'},
'meters_type': {'$push': '$counter_type'}, '_id': '$resource_id', 'metadata': {'$first': '$resource_metadata'}}}])])
failed: exception: terminating request: request heap use exceeded 10% of physical RAM"

Discussion of the fossil record on gerrit indicates that the use of the aggregation framework in this context was primarily for convenience:

  https://review.openstack.org/35297

Switching over to storing the first and last timestamps in the resource collection directly (and updating these on every sample insert) is not a workable approach, as there are no universal first and last timestamps for a resource that will always be applicable regardless of the constraints on the resource query.
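
To illustrate why a single stored pair of timestamps cannot serve every query, here is a minimal sketch in plain Python with no mongodb involved. The sample data and the first_last helper are hypothetical and are not part of the storage driver; the start/end parameters stand in for whatever constraints a resource query might carry.

```python
from datetime import datetime

# Hypothetical samples for a single resource (illustration only).
samples = [
    {"resource_id": "r1", "timestamp": datetime(2013, 12, 1)},
    {"resource_id": "r1", "timestamp": datetime(2013, 12, 5)},
    {"resource_id": "r1", "timestamp": datetime(2013, 12, 9)},
]


def first_last(samples, start=None, end=None):
    """Return (first, last) timestamps of the samples inside [start, end)."""
    stamps = [
        s["timestamp"]
        for s in samples
        if (start is None or s["timestamp"] >= start)
        and (end is None or s["timestamp"] < end)
    ]
    return (min(stamps), max(stamps)) if stamps else (None, None)


# The same resource yields different first/last values per query constraint.
unconstrained = first_last(samples)
windowed = first_last(samples, start=datetime(2013, 12, 4))
```

The unconstrained call and the windowed call disagree on the first timestamp, so a value cached on the resource document would be right for at most one of them.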

Hence the workable approaches to resolving this issue are:

1. avoid the need for sorting in-memory by ensuring sufficient indices exist on the meter collection (currently the sort instructions for resource retrieval default to timestamp, project_id, user_id all descending)

2. avoid the aggregation framework altogether and instead revert to the equivalent map-reduce

Note that resource retrieval is the only case where the aggregation framework is currently used by the mongodb storage driver.
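
The shape of approach 2 can be sketched in plain Python as below. This is only an illustration of the map and reduce steps under assumed names (map_sample, reduce_values and resources are hypothetical); it is not the code that was committed, and it runs in-process rather than inside mongodb. The point it shows is that the per-resource first and last timestamps can be folded together pairwise, without sorting the whole meter collection first.

```python
from collections import defaultdict
from functools import reduce


def map_sample(sample):
    """Map step: emit (resource_id, partial summary) for one sample."""
    ts = sample["timestamp"]
    meter = (sample["counter_name"], sample["counter_type"], sample["counter_unit"])
    return sample["resource_id"], {
        "first_sample_timestamp": ts,
        "last_sample_timestamp": ts,
        "meters": {meter},
        "metadata": sample["resource_metadata"],
    }


def reduce_values(a, b):
    """Reduce step: merge two partial summaries for the same resource."""
    # Keep the metadata from whichever side saw the most recent sample.
    latest = a if a["last_sample_timestamp"] >= b["last_sample_timestamp"] else b
    return {
        "first_sample_timestamp": min(a["first_sample_timestamp"],
                                      b["first_sample_timestamp"]),
        "last_sample_timestamp": max(a["last_sample_timestamp"],
                                     b["last_sample_timestamp"]),
        "meters": a["meters"] | b["meters"],
        "metadata": latest["metadata"],
    }


def resources(samples):
    """Group mapped values by resource_id and fold each group."""
    grouped = defaultdict(list)
    for sample in samples:
        key, value = map_sample(sample)
        grouped[key].append(value)
    return {key: reduce(reduce_values, values) for key, values in grouped.items()}
```

Because reduce_values takes and returns the same structure and does not depend on the order of its inputs, it can be applied to partial results in any grouping, which is the property a map-reduce reduce function needs.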

Eoghan Glynn (eglynn) on 2014-01-07
Changed in ceilometer:
milestone: none → icehouse-2
assignee: nobody → Eoghan Glynn (eglynn)
importance: Undecided → High
status: New → In Progress

Reviewed: https://review.openstack.org/65962
Committed: https://git.openstack.org/cgit/openstack/ceilometer/commit/?id=ba6641afacfc52e7391d2095751ee96d62a64c25
Submitter: Jenkins
Branch: master

commit ba6641afacfc52e7391d2095751ee96d62a64c25
Author: Eoghan Glynn <email address hidden>
Date: Thu Jan 9 16:30:10 2014 +0000

    Replace mongo aggregation with plain ol' map-reduce

    Fixes bug 1262571

    Previously, the mongodb storage driver used an aggregation pipeline
    over the meter collection in order to construct a list of resources
    adorned with first & last sample timestamps etc.

    However mongodb aggregation framework performs sorting in-memory,
    in this case operating over a potentially very large collection.
    It is also hardcoded to abort any sorts in an aggregation pipeline
    that will consume more than 10% of physical memory, which is
    observed in this case.

    Now, we avoid the aggregation framework altogether and instead
    use an equivalent map-reduce.

    Change-Id: Ibef4a95acada411af385ff75ccb36c5724068b59

Changed in ceilometer:
status: In Progress → Fix Committed
Eoghan Glynn (eglynn) on 2014-01-15
tags: added: havana-backport-potential
Thierry Carrez (ttx) on 2014-01-22
Changed in ceilometer:
status: Fix Committed → Fix Released
Alan Pevec (apevec) on 2014-02-04
tags: removed: havana-backport-potential
Thierry Carrez (ttx) on 2014-04-17
Changed in ceilometer:
milestone: icehouse-2 → 2014.1