ceilometer service startup ordering impacts safety of sample publishing

Bug #1355809 reported by Chris Dent
This bug affects 1 person
Affects: devstack
Status: Fix Released
Importance: Undecided
Assigned to: Chris Dent
Milestone: none listed

Bug Description

lib/ceilometer starts the compute agent before the collector. Under some circumstances this can lead to the loss of samples from pollsters that publish using the rpc: publisher. The fix is to change the startup ordering. More detail follows:

When running grenade + javelin2 tests, I noticed that compute agent samples were not being collected, despite the agent showing it was publishing them. The collector was not collecting.

When I tried to reproduce the problem in isolation, the following sequence did not trigger it:

* stopping the compute agent
* stopping the collector
* starting the compute agent (causing an immediate poll)
* starting the collector

These samples were collected. However, it was possible to replicate the problem by:

* stopping the compute agent
* stopping the collector
* stopping rabbitmq
* starting rabbitmq
* starting the compute agent
* starting the collector

These samples were not collected. rabbitmq is keeping some state on the exchanges(?) that the agent and collector are using. When the collector is restarted without a rabbitmq restart, those exchanges are rejoined and the messages are collected. When rabbitmq is restarted, the exchanges are lost and the collector has nothing to rejoin, so it creates them anew. Anything the compute agent published before that point is lost.
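The behaviour described above can be sketched with a toy in-memory model. This is not RabbitMQ's or oslo.messaging's API. It assumes the lost state is a non-durable queue that only the collector declares, and that a message published with no queue bound is silently dropped (standard AMQP behaviour when the mandatory flag is not set). The class `ToyBroker`, the function `run`, and the routing key `metering` are hypothetical names chosen for illustration.

```python
class ToyBroker:
    """In-memory stand-in for an AMQP broker (hypothetical, for illustration)."""

    def __init__(self):
        self.queues = {}  # routing key -> list of pending messages

    def restart(self):
        # Assumption: queues are not durable, so a restart forgets them.
        self.queues = {}

    def declare_queue(self, key):
        # Idempotent: an existing queue keeps its pending messages.
        self.queues.setdefault(key, [])

    def publish(self, key, message):
        # With no queue bound for the key, the message is dropped.
        if key not in self.queues:
            return False
        self.queues[key].append(message)
        return True

    def consume_all(self, key):
        pending = self.queues.get(key, [])
        if key in self.queues:
            self.queues[key] = []
        return pending


def run(broker, order):
    """Start services in the given order; return what the collector receives."""
    for service in order:
        if service == "collector":
            broker.declare_queue("metering")
        elif service == "compute-agent":
            broker.publish("metering", "sample-1")  # immediate poll on startup
    return broker.consume_all("metering")


broker = ToyBroker()
broker.declare_queue("metering")  # left behind by an earlier collector run

# Services restarted, broker left running: agent-first still works.
kept = run(broker, ["compute-agent", "collector"])

# Broker restarted, agent first: the sample has nowhere to go.
broker.restart()
lost = run(broker, ["compute-agent", "collector"])

# Broker restarted, collector first (the proposed ordering).
broker.restart()
fixed = run(broker, ["collector", "compute-agent"])

print(kept, lost, fixed)  # ['sample-1'] [] ['sample-1']
```

In this model the agent-first ordering only fails after a broker restart, which matches the two reproduction attempts listed above.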

In other words, the collector is the arbiter of endpoints, at least with casts. This is perhaps why people want to use notifications rather than casts for publishing samples.

In any case, changing the startup ordering so that the collector starts before the compute agent fixes it. Patch forthcoming.

Chris Dent (cdent)
Changed in devstack:
assignee: nobody → Chris Dent (chdent)
OpenStack Infra (hudson-openstack) wrote : Fix proposed to devstack (master)

Fix proposed to branch: master
Review: https://review.openstack.org/113522

Changed in devstack:
status: New → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to devstack (master)

Reviewed: https://review.openstack.org/113522
Committed: https://git.openstack.org/cgit/openstack-dev/devstack/commit/?id=4922bfa84674aa8f84f8c65dd5123153495b2717
Submitter: Jenkins
Branch: master

commit 4922bfa84674aa8f84f8c65dd5123153495b2717
Author: Chris Dent <email address hidden>
Date: Tue Aug 12 14:27:58 2014 +0100

    Change ordering of ceilometer service startup

    If the compute-agent starts before the collector directly
    after a start or restart of the AMQP service, samples published from
    the compute-agent can be lost before the collector has had a chance
    to establish connections. These lost samples impact the reliability
    of tests which run immediately after the service [re]start.

    Note: if there is a restart of the ceilo service, but not the AMQP
    service, the problem does not present itself because the messaging
    service maintains some state on the exchanges it keeps.

    Change-Id: I1c06d0511fbf93050cda56d9d2de0ff00813dfb6
    Closes-bug: 1355809

Changed in devstack:
status: In Progress → Fix Released