gnocchi dispatcher still sending far too many resource updates

Bug #1483634 reported by Chris Dent
This bug affects 1 person
Affects     Status        Importance  Assigned to    Milestone
Ceilometer  Fix Released  High        Chris Dent
Gnocchi     Fix Released  High        Julien Danjou

Bug Description

I've just started a devstack and the ceilometer-collector is sending update_resource on the order of some hundreds per second, but neither the notification agent nor the two polling agents are pushing measures at anything close to that rate. It appears the collector is trapped in some kind of loop after first receiving some meters for image data.

More data will be put here as I continue my investigations.

Revision history for this message
Chris Dent (cdent) wrote :

It looks like there was a queue of some kind and it has finally caught up, but is still sending many updates per cycle.

I reckon we're still not quite there on the dispatcher, will update the title to reflect.

summary: - gnocchi dispatcher constantly resending measure updates even though new
- metrics not being processed
+ gnocchi dispatcher still sending far too many resource updates
Revision history for this message
Chris Dent (cdent) wrote :

I recognize this is a horrible bug I've not had a chance to update with actual data. Will. Promise!

Revision history for this message
gordon chung (chungg) wrote :

i put your promise on the record.

Changed in ceilometer:
importance: Undecided → High
status: New → Incomplete
status: Incomplete → Triaged
assignee: nobody → Chris Dent (cdent)
Revision history for this message
Chris Dent (cdent) wrote :

So it turns out that the dispatcher is sending resource posts that the server rejects because the ids are no good: http://paste.openstack.org/show/412855/

This error is not handled in a useful way and a few bad things happen:

* the meters for that resource (with the bad id) are still posted but are associated with no resource and no metric name
* since the resource never makes it to the server, when we get more measures from the same resource in the collector, the dispatcher sends it off to the server again, creating more orphaned metrics

There seem to be a few problems here:

* we need a bad resource_id to good resource_id translation layer
* we shouldn't send measures and then send a resource. we should send a resource first, and if it is rejected for having bad form (rather than for already being there) we shouldn't send the associated measures, as they become orphans

On top of all this, we PATCH resources that have not changed. We ought to be able to avoid this.
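
The ordering proposed above (resource first, measures only if the resource is usable) could look roughly like the sketch below. It is not the dispatcher's actual code: the exception classes, `dispatch()` and `FakeClient` are all hypothetical names, and the "id must parse as a UUID" rule is an assumption standing in for whatever the server really rejects.

```python
import uuid


class Conflict(Exception):
    """Hypothetical: the server answered 409, the resource already exists."""


class BadRequest(Exception):
    """Hypothetical: the server answered 400, e.g. an unusable resource id."""


def dispatch(client, resource, measures):
    """Create the resource first; post measures only if it is usable.

    Returns True if the measures were posted, False if they were dropped.
    """
    try:
        client.create_resource(resource)
    except Conflict:
        # The resource is already there, so measures can be attached to it.
        pass
    except BadRequest:
        # Rejected for bad form: posting measures now would orphan them.
        return False
    client.post_measures(resource["id"], measures)
    return True


class FakeClient:
    """In-memory stand-in for the API server, for illustration only."""

    def __init__(self):
        self.resources = {}
        self.measures = []

    def create_resource(self, resource):
        try:
            uuid.UUID(resource["id"])  # assumed validity rule for ids
        except ValueError:
            raise BadRequest(resource["id"])
        if resource["id"] in self.resources:
            raise Conflict(resource["id"])
        self.resources[resource["id"]] = resource

    def post_measures(self, resource_id, measures):
        self.measures.append((resource_id, measures))
```

With this shape a rejected resource costs one failed request per sample instead of leaving measures behind with no resource to hang them on.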

Julien Danjou (jdanjou)
Changed in gnocchi:
status: New → Triaged
importance: Undecided → High
assignee: nobody → Julien Danjou (jdanjou)
no longer affects: gnocchi
Revision history for this message
Julien Danjou (jdanjou) wrote :

I'm still having a hard time finding where these orphaned resource are created in the first place.

Rubber duck mode on.

The resources are created by _process_resource() when posting measures fails. So posting measures fails the first time we see a resource, and so the dispatcher tries to create the resource via _ensure_resource_and_metric().

_ensure_resource_and_metric() calls create_resource() which supposedly ends up raising UnexpectedError because the Gnocchi API server returns "400 - your resource_id is wrong".

So no metrics are created at all… in theory.

Revision history for this message
Julien Danjou (jdanjou) wrote :

s/orphaned resource/orphaned metrics/

Revision history for this message
Julien Danjou (jdanjou) wrote :

This is actually a bad bug in Gnocchi. What happens is that when you update/create a resource in the API, the code always calls convert_metric_list(). That function converts a list of metrics, given either as UUIDs or as archive policy names, into metric UUIDs, so it might create new metrics.

The dispatcher uses that when creating a resource, as it passes a list of metrics to create such as {"image.size": {"archive_policy_name": "low"}}. When you start the collector, suddenly you have 64 threads pumping all the samples from the queue and trying to create the same resource.

One will succeed and get a 201 Created, all the other ones will get a 409.

But all the calls that triggered the 409 will have called convert_metric_list() anyway, so new orphaned metrics have been created, and are never cleaned up. That's because create/update of the resource in the API is a two-pass process:

1. Convert the metric list and create metrics if needed
2. Create/update the resource with those metrics

If 2. fails, we might return a 409, but we never roll back 1.
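
To illustrate the failure mode and the obvious remedy, here is a small sketch that does both passes inside one transaction, so that a failing step 2 also undoes step 1. It uses an in-memory SQLite database purely for demonstration; the table layout and the `create_resource()` helper are assumptions, not Gnocchi's indexer schema or API.

```python
import sqlite3
import uuid


def make_db():
    """Build a throwaway database with a hypothetical two-table layout."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE resource (id TEXT PRIMARY KEY)")
    db.execute(
        "CREATE TABLE metric (id TEXT PRIMARY KEY, resource_id TEXT, name TEXT)"
    )
    return db


def create_resource(db, resource_id, metric_names):
    """Create the metrics and the resource in a single transaction.

    Returns 201 on success and 409 if the resource already exists.
    """
    try:
        with db:  # commits on success, rolls back if an exception escapes
            # Pass 1: create the metrics.
            for name in metric_names:
                db.execute(
                    "INSERT INTO metric VALUES (?, ?, ?)",
                    (str(uuid.uuid4()), resource_id, name),
                )
            # Pass 2: create the resource; fails if the id is already taken.
            db.execute("INSERT INTO resource VALUES (?)", (resource_id,))
    except sqlite3.IntegrityError:
        # The rollback has already discarded the metrics from pass 1.
        return 409
    return 201
```

Calling `create_resource()` twice with the same id yields a 201 and then a 409, and the metric table still holds only the metrics from the successful call.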

Changed in gnocchi:
importance: Undecided → High
assignee: nobody → Julien Danjou (jdanjou)
status: New → Confirmed
milestone: none → 1.2.0
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to gnocchi (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/214121

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on gnocchi (master)

Change abandoned by Julien Danjou (<email address hidden>) on branch: master
Review: https://review.openstack.org/214121
Reason: Duplicate.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to gnocchi (master)

Fix proposed to branch: master
Review: https://review.openstack.org/215235

Changed in gnocchi:
status: Confirmed → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to gnocchi (master)

Reviewed: https://review.openstack.org/213803
Committed: https://git.openstack.org/cgit/openstack/gnocchi/commit/?id=f86dcb13f3409c92b14f0ced82d6a705cdc935d1
Submitter: Jenkins
Branch: master

commit f86dcb13f3409c92b14f0ced82d6a705cdc935d1
Author: Julien Danjou <email address hidden>
Date: Mon Aug 17 16:19:24 2015 +0200

    storage: remove create_metric()

    This changes the storage driver API so that there's no need to create a
    metric before adding measures. We let the indexer be responsible for
    whether a metric exists or not, and we allow the storage driver to
    create metrics on the fly.

    Pros:

    1. Speeds up the metric creation process as we do not need to reach the
       storage to create a container + empty archives
    2. Spares us a COMMIT/ROLLBACK step that we would have to handle in
       the metric creation process: if creating a metric failed in either
       the storage or the indexer, we would have to roll back the
       creation that did not fail.

    Cons:

    1. We now acknowledge that it's more difficult to have an autonomous
       storage driver working without an indexer. While this was not obvious
       before Gnocchi 1.0.0, it's getting pretty clear now that we don't
       want to bypass the indexer as it's anyway responsible for things such
       as RBAC. And we can cache it anyway.

    This will help fix bug #1483634 since we'll be able to create metrics
    only in the indexer, doing that in one single transaction.

    This change also correctly handles deletion of unprocessed measures for
    metrics that have been deleted.

    Change-Id: I81ff5ca5540a8a02d378c289bed611a1329f9325
    Related-Bug: #1483634

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to ceilometer (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/217017

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to gnocchi (master)

Reviewed: https://review.openstack.org/215235
Committed: https://git.openstack.org/cgit/openstack/gnocchi/commit/?id=3a73b911a1ef53a13f826f621c1c053df9db29a7
Submitter: Jenkins
Branch: master

commit 3a73b911a1ef53a13f826f621c1c053df9db29a7
Author: Julien Danjou <email address hidden>
Date: Tue Aug 18 13:02:59 2015 +0200

    rest: remove convert_metric()

    This function logic is now moved inside the indexer itself, so it can
    create the resource and metric in only one pass and one transaction that
    can be easily rolled-back on errors.

    Change-Id: I0b57adf44246bb8e84ff0e567a30667fae75f3f6
    Closes-Bug: #1483634

Changed in gnocchi:
status: In Progress → Fix Committed
Julien Danjou (jdanjou)
Changed in gnocchi:
status: Fix Committed → Fix Released
Changed in ceilometer:
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to ceilometer (master)

Reviewed: https://review.openstack.org/203109
Committed: https://git.openstack.org/cgit/openstack/ceilometer/commit/?id=2511cfb6e48c5d03cd198ecf9f09f36db3caced8
Submitter: Jenkins
Branch: master

commit 2511cfb6e48c5d03cd198ecf9f09f36db3caced8
Author: Chris Dent <email address hidden>
Date: Mon Nov 9 16:31:45 2015 +0000

    A dogpile cache of gnocchi resources

    What this does is store a key value pair in oslo_cache where the key
    is the resource id and the value is a hash of the frozenset of
    the attributes of the resource less the defined metrics[1]. When it
    is time to create or update a resource we ask the cache:

      Are the resource attributes I'm about to store the same as the
      last ones stored for this id?

    If the answer is yes we don't need to store the resource. That's all
    it does and that is all it needs to do because if the cache fails
    to have the correct information that's the same as the cache not
    existing in the first place.

    To get this to work in the face of eventlet's eager beavering we
    need to lock around create_resource and update_resource so that
    we have a chance to write the cache before another *_resource is
    called in this process. Superficial investigation shows that this
    works out pretty well because when, for example, you start a new
    instance the collector will all of a sudden try several
    _create_resources, only one of which actually needs to happen.
    The lock makes sure only that one happens when there is just
    one collector. Where there are several collectors that won't be
    the case but _some_ of them will be stopped. And that's the point
    here: better not perfect.

    The cache is implemented using oslo_cache which can be configured
    via oslo_config with an entry such as:

        [cache]
        backend = dogpile.cache.redis
        backend_argument = url:redis://localhost:6379
        backend_argument = db:0
        backend_argument = distributed_lock:True
        backend_argument = redis_expiration_time:600

    The cache is exercised most for resource updates (as you might
    expect) but does still sometimes get engaged for resource creates
    (as described above).

    A cache_key_mangler is used to ensure that keys generated by the
    gnocchi dispatcher are in their own namespace.

    [1] Metrics are not included because they are represented as
    sub-dicts which are not hashable and thus cannot go in the
    frozenset. Since the metrics are fairly static (coming from a yaml
    file near you, soon) this shouldn't be a problem. If it is then we
    can come up with a way to create a hash that can deal with
    sub-dicts.

    Closes-Bug: #1483634
    Change-Id: I1f2da145ca87712cd2ff5b8afecf1bca0ba53788
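
The idea in the commit message above can be sketched in a few lines. This is only an illustration under stated assumptions: a plain dict stands in for oslo_cache, a `threading.Lock` stands in for whatever lock the dispatcher uses, and `attribute_hash()` and `store_if_changed()` are hypothetical names rather than functions from Ceilometer.

```python
import threading

_cache = {}  # stand-in for oslo_cache: resource id -> attribute hash
_lock = threading.Lock()


def attribute_hash(resource):
    """Hash the resource attributes, leaving out the unhashable metrics."""
    attrs = {k: v for k, v in resource.items() if k != "metrics"}
    return hash(frozenset(attrs.items()))


def store_if_changed(resource, store):
    """Call store(resource) only if its attributes differ from the last
    ones stored for this id. Returns True if store() was called.
    """
    new_hash = attribute_hash(resource)
    # Hold the lock across the store and the cache write so that a second
    # caller sees the updated cache instead of storing the same thing again.
    with _lock:
        if _cache.get(resource["id"]) == new_hash:
            return False
        store(resource)
        _cache[resource["id"]] = new_hash
        return True
```

Note that Python's built-in `hash()` of strings varies between processes unless `PYTHONHASHSEED` is fixed, so this sketch only suits an in-process cache; a cache shared between collectors would need a stable digest instead.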

Changed in ceilometer:
status: In Progress → Fix Committed
Revision history for this message
Thierry Carrez (ttx) wrote : Fix included in openstack/ceilometer 6.0.0.0b1

This issue was fixed in the openstack/ceilometer 6.0.0.0b1 development milestone.

Thierry Carrez (ttx)
Changed in ceilometer:
status: Fix Committed → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to ceilometer (master)

Reviewed: https://review.openstack.org/217017
Committed: https://git.openstack.org/cgit/openstack/ceilometer/commit/?id=c940ccc964be455dedd73128e2ba2bae85aabaea
Submitter: Jenkins
Branch: master

commit c940ccc964be455dedd73128e2ba2bae85aabaea
Author: Mehdi Abaakouk <email address hidden>
Date: Wed Aug 26 07:50:57 2015 +0200

    gnocchi: use events to end Gnocchi resource

    This change introduces a way to handle resources with the events
    subsystem.

    This also needs to be supported for each type of resource.
    To support it, etc/ceilometer/gnocchi_resources.yaml needs to be
    updated to add which event creates/updates/deletes the resource and
    the mapping between event traits and gnocchi resource attributes.

    This change adds the handling code and the support for the 'instance'
    and 'image' resources.

    Only the delete event is supported for now.

    Related-bug: #1483634
    Change-Id: Icd77137a74bccb5b2be078f206f153f0e9aa86c5
