ceilometer service startup ordering impacts safety of sample publishing
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
devstack | Fix Released | Undecided | Chris Dent |
Bug Description
lib/ceilometer starts the compute agent before the collector. Under some circumstances this can lead to loss of samples from pollsters that publish samples using the rpc: publisher. The fix is to change the ordering. More detail follows:
When running grenade + javelin2 tests, I noticed that compute agent samples were not being collected, despite the agent showing it was publishing them. The collector was not collecting.
Attempting to test the problem in isolation, I could not replicate it by:
* stopping the compute agent
* stopping the collector
* starting the compute agent (causing an immediate poll)
* starting the collector
These samples were collected. However, it was possible to replicate by:
* stopping the compute agent
* stopping the collector
* stopping rabbitmq
* starting rabbitmq
* starting the compute agent
* starting the collector
These samples were not collected. rabbitmq is keeping some state on the exchanges(?) that the agent and collector are using. When the collector is restarted without a rabbitmq restart, those exchanges are rejoined and the messages are collected. When rabbitmq is restarted, the exchanges are lost, so the collector has nothing to rejoin and creates them anew; anything the agent published before that point is not delivered.
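To make the two restart sequences concrete, here is a toy in-memory model of the behaviour guessed at above. It is only a sketch: `ToyBroker`, `run` and the `"metering"` key are hypothetical names for illustration, and it assumes (as the "(?)" above suggests, without confirming it) that the broker drops messages that have no queue to land in and that its state does not survive a restart.

```python
class ToyBroker:
    """In-memory stand-in for a broker; not a model of real RabbitMQ internals."""

    def __init__(self):
        # routing key -> list of pending messages
        self.queues = {}

    def declare_queue(self, key):
        # Idempotent: re-declaring keeps messages that are already queued.
        self.queues.setdefault(key, [])

    def publish(self, key, message):
        # Assumption: a message with no matching queue is silently dropped.
        if key in self.queues:
            self.queues[key].append(message)
            return True
        return False

    def consume(self, key):
        # Hand over everything pending and leave the queue empty.
        messages, self.queues[key] = self.queues[key], []
        return messages

    def restart(self):
        # Assumption: nothing is durable, so a restart wipes all state.
        self.queues.clear()


def run(restart_broker):
    """Replay 'agent first, collector second', optionally after a broker restart."""
    broker = ToyBroker()
    broker.declare_queue("metering")        # left over from an earlier collector run
    if restart_broker:
        broker.restart()                    # stop/start rabbitmq
    broker.publish("metering", "sample-1")  # compute agent starts and polls at once
    broker.declare_queue("metering")        # collector starts afterwards
    return broker.consume("metering")


without_restart = run(restart_broker=False)  # sample is still waiting in the queue
with_restart = run(restart_broker=True)      # sample was dropped before the collector arrived
```

In this model the only difference between the two runs is whether the queue exists at the moment the agent publishes, which matches the observation that the problem only shows up after rabbitmq itself has been restarted.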
In other words the collector is the arbiter of endpoints, at least with casts. This is perhaps why people want to use notifications for publishing samples, not cast.
In any case, changing the ordering of startup fixes it. Patch forthcoming.
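The ordering rule behind the fix can be sketched as "start consumers before producers". The snippet below is not the devstack patch (lib/ceilometer is shell); it is a small Python illustration, and `startup_order`, the role labels and the service names are all hypothetical.

```python
def startup_order(services):
    """Return service names with consumers first, producers after.

    `services` is a list of (name, role) pairs, where role is
    "consumer" or "producer". sorted() is stable, so services with
    the same role keep their original relative order.
    """
    ordered = sorted(services, key=lambda svc: svc[1] != "consumer")
    return [name for name, _role in ordered]


# The ordering described in this bug: agent listed before collector.
current = [("compute-agent", "producer"), ("collector", "consumer")]
fixed = startup_order(current)
```

With the collector started first, its queues exist before the agent's first poll, so the first samples have somewhere to go.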
Changed in devstack:
assignee: nobody → Chris Dent (chdent)
Fix proposed to branch: master
Review: https://review.openstack.org/113522