Error in rpcpublisher

Bug #1211736 reported by Mehdi Abaakouk
This bug affects 3 people
Affects: Ceilometer
Status: Fix Released
Importance: Critical
Assigned to: Mehdi Abaakouk

Bug Description

Hi,

I sometimes get the following backtrace:

2013-08-12 15:34:14.752 29702 ERROR ceilometer.pipeline [-] Pipeline cpu_pipeline: Continue after error from publisher <ceilometer.publisher.rpc.RPCPublisher object at 0x39cee90>
2013-08-12 15:34:14.752 29702 TRACE ceilometer.pipeline Traceback (most recent call last):
2013-08-12 15:34:14.752 29702 TRACE ceilometer.pipeline File "/vagrant/stack-master/ceilometer/ceilometer/pipeline.py", line 214, in _publish_counters
2013-08-12 15:34:14.752 29702 TRACE ceilometer.pipeline p.publish_counters(ctxt, transformed_counters)
2013-08-12 15:34:14.752 29702 TRACE ceilometer.pipeline File "/vagrant/stack-master/ceilometer/ceilometer/publisher/rpc.py", line 178, in publish_counters
2013-08-12 15:34:14.752 29702 TRACE ceilometer.pipeline self.flush()
2013-08-12 15:34:14.752 29702 TRACE ceilometer.pipeline File "/vagrant/stack-master/ceilometer/ceilometer/publisher/rpc.py", line 224, in flush
2013-08-12 15:34:14.752 29702 TRACE ceilometer.pipeline self.local_queue.pop(0)
2013-08-12 15:34:14.752 29702 TRACE ceilometer.pipeline IndexError: pop from empty list
2013-08-12 15:34:14.752 29702 TRACE ceilometer.pipeline

Cheers,

gordon chung (chungg) wrote :

I see this as well. Do you think your patch to control publisher calls (https://review.openstack.org/#/c/39510/) will resolve this? It seems to work for me, but it could just be hiding the real bug.

Mehdi Abaakouk (sileht) wrote :

No, this patch doesn't fix this issue. When I wrote the code, I assumed that publish_counters could not be called concurrently, so the RPCPublisher.flush method is not thread-safe.

But (I haven't checked yet) I think that gevent can switch away during the rpc.cast call (because it does some I/O) to handle another publish_counters call. Because the RPCPublisher.flush method is not thread-safe, self.local_queue can be overwritten by the second publish_counters call, and when gevent resumes the first call, local_queue no longer has the expected content.

I set this issue to Critical because in this case some samples are lost.
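
To make the suspected interleaving concrete, here is a minimal sketch that reproduces the same IndexError without any real scheduler. It is not the ceilometer code: NaivePublisher, cast and reproduce are hypothetical names, and the "switch" during I/O is simulated by having the first cast call run a second flush to completion before returning.

```python
class NaivePublisher:
    """Hypothetical stand-in for a publisher with an unguarded local queue."""

    def __init__(self):
        self.local_queue = []
        self.sent = []
        self._reentered = False

    def cast(self, msg):
        # Stand-in for an RPC cast that does I/O. On the first call we
        # simulate the scheduler switching to a second caller, which runs
        # its own flush() to completion before the first call resumes.
        if not self._reentered:
            self._reentered = True
            self.flush()
        self.sent.append(msg)

    def flush(self):
        while self.local_queue:
            msg = self.local_queue[0]
            self.cast(msg)           # may "yield" to another flush()
            self.local_queue.pop(0)  # assumes the head item is still there


def reproduce():
    """Return the error message raised by the interleaved flush, if any."""
    pub = NaivePublisher()
    pub.local_queue.extend(["sample-1", "sample-2"])
    try:
        pub.flush()
    except IndexError as exc:
        return str(exc)
    return None


if __name__ == "__main__":
    print(reproduce())
```

In this sketch the nested flush drains the queue while the outer one is still inside cast, so the outer pop(0) hits an empty list. The first sample is also sent twice, which shows that the shared queue is inconsistent in more ways than the exception alone suggests.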

Changed in ceilometer:
assignee: nobody → Mehdi Abaakouk (sileht)
importance: Undecided → Critical
Mehdi Abaakouk (sileht) wrote :

s/gevent/eventlet/g :)

OpenStack Infra (hudson-openstack) wrote : Fix proposed to ceilometer (master)

Fix proposed to branch: master
Review: https://review.openstack.org/42587

Changed in ceilometer:
status: New → In Progress
Thierry Carrez (ttx)
Changed in ceilometer:
milestone: none → havana-3
OpenStack Infra (hudson-openstack) wrote : Fix merged to ceilometer (master)

Reviewed: https://review.openstack.org/42587
Committed: http://github.com/openstack/ceilometer/commit/e7e74f7dc311fa2695514ab32584f849aecc6a1d
Submitter: Jenkins
Branch: master

commit e7e74f7dc311fa2695514ab32584f849aecc6a1d
Author: Mehdi Abaakouk <email address hidden>
Date: Mon Aug 19 10:48:18 2013 +0200

    Make RPCPublisher flush method threadsafe

    This change allow concurrent access to the local queue of the
    RPCPublisher

    Fixes bug #1211736

    Change-Id: I13371329d40e43f42c357a0893e4023f343a5efa
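
One way to make such a flush safe under cooperative scheduling is to detach the pending items from the shared attribute before doing any I/O, then put back whatever could not be sent. The sketch below illustrates that pattern only and is not taken from the commit: SwappingPublisher, _process and demo are hypothetical names, and OSError stands in for whatever a failed cast would raise.

```python
class SwappingPublisher:
    """Hypothetical publisher whose flush tolerates re-entrant calls."""

    def __init__(self, cast):
        self.local_queue = []
        self._cast = cast

    def publish(self, msg):
        self.local_queue.append(msg)
        self.flush()

    def flush(self):
        # Take ownership of the pending items before any I/O happens, so a
        # concurrent publish() only ever sees a fresh, separate list.
        queue, self.local_queue = self.local_queue, []
        unsent = self._process(queue)
        # Re-queue failures ahead of anything added in the meantime.
        self.local_queue = unsent + self.local_queue

    def _process(self, queue):
        while queue:
            try:
                self._cast(queue[0])
            except OSError:
                return queue  # keep the unsent tail for a later flush
            queue.pop(0)
        return []


def demo():
    """Run a flush whose first cast triggers a second publish()."""
    sent = []
    state = {"reentered": False}
    pub = SwappingPublisher(cast=None)

    def cast(msg):
        # Simulate a switch to another caller during the first cast.
        if not state["reentered"]:
            state["reentered"] = True
            pub.publish("sample-3")
        sent.append(msg)

    pub._cast = cast
    pub.local_queue.extend(["sample-1", "sample-2"])
    pub.flush()
    return sent, pub.local_queue
```

This relies on the swap itself not being interrupted, which holds for cooperative green threads that only switch on I/O. With preemptive threads, the swap and the re-queue step would need a lock such as threading.Lock.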

Changed in ceilometer:
status: In Progress → Fix Committed
Thierry Carrez (ttx)
Changed in ceilometer:
status: Fix Committed → Fix Released
Thierry Carrez (ttx)
Changed in ceilometer:
milestone: havana-3 → 2013.2