Collector continuously re-queues sample when dispatcher reports persistent error when requeue is enabled

Bug #1434322 reported by Rohit Jaiswal
Affects: Ceilometer
Status: Won't Fix
Importance: Medium
Assigned to: Unassigned
Milestone: (none)

Bug Description

When requeue_sample_on_dispatcher_error is enabled in ceilometer.conf, the collector will try to requeue the sample if it gets an error from the dispatcher.

When the sample gets requeued, it will be picked up again and dispatched to the storage layer. If the same underlying error condition prevails, the error is raised back to the collector and the message is requeued again. This cycle continues until the error goes away (in which case the sample is not requeued again) or the collector/RabbitMQ are restarted.

In this scenario, when there is a persistent error condition, the frequent retrying puts extra load on the storage database and the messaging layer (RabbitMQ) and wastes collector CPU cycles, since the message, and potentially more samples behind it, keeps getting requeued and is never cleared from the queue.

It does not make sense to keep retrying continuously in case of a persistent error condition.

There should be a configurable upper limit to cap the number of times the collector requeues a sample/event in case of a dispatcher error.

e.g. requeue_sample_on_dispatcher_error_max_retries
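
For illustration, a minimal sketch of how such an option might be declared with oslo.config (the option does not exist today; the name, default, and group are assumptions taken from this report):

    # Hypothetical option sketch -- not an existing Ceilometer setting.
    from oslo_config import cfg

    OPTS = [
        cfg.IntOpt('requeue_sample_on_dispatcher_error_max_retries',
                   default=-1,
                   help='Maximum number of times a sample is requeued on '
                        'dispatcher error; -1 keeps the current behavior '
                        'of retrying forever.'),
    ]

    cfg.CONF.register_opts(OPTS, group='collector')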

Changed in ceilometer:
assignee: nobody → Rohit Jaiswal (rohit-jaiswal-3)
summary: - Collector keeps on requeueing a message in case of a persistent error
- from dispatcher when requeueing is enabled
+ Collector keeps on requeueing and retrying a message in case of a
+ persistent error from dispatcher when requeueing is enabled
description: updated
description: updated
description: updated
description: updated
summary: - Collector keeps on requeueing and retrying a message in case of a
- persistent error from dispatcher when requeueing is enabled
+ Collector continuously re-queues sample when dispatcher reports
+ persistent error when requeue is enabled
Revision history for this message
gordon chung (chungg) wrote :

what's the proposed solution here? i get the feeling this has multiple different solutions all with pros/cons.

Revision history for this message
Rohit Jaiswal (rohit-jaiswal-3) wrote :

Use a new configuration param, requeue_max_retries (in ceilometer.conf under the collector section), which controls how many times the collector tries to requeue a sample/event in case of a repetitive failure. Once the retries max out, the sample will be dropped or can be put on a metering.error queue (maybe a durable queue), so that we don't lose it entirely and the collector is freed from having to retry it. The assumption is that the percentage of these error cases is low, so we don't end up with a bloated error queue.

By default, requeue_max_retries will have a value of -1, which indicates infinite retries, i.e. the current behavior.
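
A rough sketch of how the collector's dispatch path could apply such a cap; record_metering_data is the dispatcher entry point, but the error publisher, the retry-count store, and the helper names are placeholders for illustration, not actual collector code:

    # Illustrative only; names other than record_metering_data are placeholders.
    import hashlib
    import json

    def sample_key(sample):
        # stable digest of the sample payload, used as the retry-count key
        return hashlib.sha256(json.dumps(sample, sort_keys=True).encode()).hexdigest()

    def process_sample(sample, dispatcher, error_publisher, retry_counts, max_retries):
        key = sample_key(sample)
        try:
            dispatcher.record_metering_data(sample)
            retry_counts.pop(key, None)            # success: forget any retry count
        except Exception:
            count = retry_counts.get(key, 0) + 1
            retry_counts[key] = count
            if max_retries != -1 and count > max_retries:
                error_publisher.publish(sample)    # park it on e.g. a metering.error queue
                retry_counts.pop(key, None)
            else:
                raise  # re-raise so the messaging layer requeues the sample

As the following comments point out, an in-process dict like retry_counts only works for a single worker; a store shared across workers is needed.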

Revision history for this message
gordon chung (chungg) wrote :

the reason i asked is: for the above case, how would you know how many times you've retried a sample/event? are we marking the datapoints somehow? if you requeue it, there's really no way to know how many times you've tried something unless you keep a cache... and then you'd need a global cache to sync across multiple workers...

Revision history for this message
Rohit Jaiswal (rohit-jaiswal-3) wrote :

I agree that a global cache will be needed so that multiple collector worker processes can synchronize access to, and updates of, the retry count. Thanks for describing this. I think we can use the distributed locking in Tooz with Redis to cache the retry count. Workers will acquire the lock to update the retry count, so this might impact performance, but it only applies when requeueing is enabled and is only needed to update the cache, so the impact should be insignificant. I think this sounds more like a new feature than a bug. Let me know what you think.
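
For concreteness, a minimal sketch of the Tooz/Redis locking idea (connection URL, member id, and lock name are assumptions):

    # Sketch of the proposed Tooz coordination; URL and names are assumptions.
    from tooz import coordination

    coordinator = coordination.get_coordinator('redis://localhost:6379',
                                                b'collector-worker-1')
    coordinator.start()

    lock = coordinator.get_lock(b'sample-retry-counts')
    with lock:
        # read and update the shared retry count while holding the lock
        pass

    coordinator.stop()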

Revision history for this message
Rohit Jaiswal (rohit-jaiswal-3) wrote :

Thinking a bit more, we would also need a way to identify each sample that gets requeued, so as to maintain its retry count in the cache. This could be done in Redis using key/value pairs. We could use the unique sample id, or hash a sample to generate a unique id, which we then use to look up the retry count in Redis. That way we can maintain multiple sample requeue retry counts in the cache and use Tooz to synchronize access to this structure by multiple workers, which may be retrying the same or different sample(s).
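
A sketch of that idea with redis-py (the key prefix and hashing scheme are assumptions, not an agreed design):

    # Sketch: derive a stable id for a sample and track its retry count in Redis.
    import hashlib
    import json

    import redis

    r = redis.Redis(host='localhost', port=6379)

    def sample_hash(sample):
        # canonical JSON of the payload -> stable digest usable as a key
        return hashlib.sha256(json.dumps(sample, sort_keys=True).encode()).hexdigest()

    def increment_retry_count(sample):
        return r.incr('requeue-retries:' + sample_hash(sample))

    def clear_retry_count(sample):
        r.delete('requeue-retries:' + sample_hash(sample))

Note that Redis INCR is atomic, so the counter update itself would not strictly need a separate distributed lock.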

Revision history for this message
gordon chung (chungg) wrote :

i'm not sure sample_id is unique -- the hash technique sounds good though... or whatever we're setting as unique id in sql.

tooz could work? we could also look at dogpile.cache, which has some expiration capabilities.

Changed in ceilometer:
status: New → Triaged
Revision history for this message
Rohit Jaiswal (rohit-jaiswal-3) wrote :

dogpile.cache seems like a good fit; keystone already uses it. I don't think ceilometer has a cache layer, so that, plus the new requirement of limiting the requeueing, makes this more of a new feature for the ceilometer collector. Does this deserve a blueprint?
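
A minimal dogpile.cache sketch of what such a retry-count layer could look like (backend, TTL, and key format are assumptions):

    # Sketch: retry counts in a dogpile.cache region with an expiration time.
    from dogpile.cache import make_region
    from dogpile.cache.api import NO_VALUE

    region = make_region().configure(
        'dogpile.cache.redis',
        expiration_time=3600,                       # assumed TTL; stale counts age out
        arguments={'host': 'localhost', 'port': 6379},
    )

    def bump_retry_count(sample_id):
        key = 'retries:%s' % sample_id
        count = region.get(key)
        count = 0 if count is NO_VALUE else count
        region.set(key, count + 1)                  # get/set is not atomic by itself
        return count + 1

The get/set pair here is not atomic, so in a multi-worker collector it would still need to sit behind something like the Tooz lock discussed above.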

ZhiQiang Fan (aji-zqfan)
Changed in ceilometer:
status: Triaged → New
assignee: Rohit Jaiswal (rohit-jaiswal-3) → nobody
gordon chung (chungg)
Changed in ceilometer:
status: New → Triaged
importance: Undecided → Medium
Revision history for this message
gordon chung (chungg) wrote :

n/a. storage is dead. long live gnocchi (or whatever proprietary/open solution you like)

Changed in ceilometer:
status: Triaged → Won't Fix