group membership not updated with redis driver and member death

Bug #1386684 reported by Chris Dent
This bug affects 1 person
Affects: tooz
Status: Fix Released
Importance: High
Assigned to: Unassigned

Bug Description

When using the redis driver as the coordinator with ceilometer-compute-agent, group membership is not updated when an existing member departs the group (for example by being manually terminated).

  redis server 2.6.16
  python-redis 2.10.3
  tooz 0.8
  ceilometer-compute-agent from master, running under devstack master

The reproduction scenario goes something like this:

* Update pipeline.yaml so that readings happen frequently enough not to get bored (I chose a 30-second interval)
* Set these config options:

  [coordination]
  backend_url = redis://localhost:6379
  #backend_url = memcached://localhost

  [compute]
  workload_partitioning = True

* boot several (10-ish) instances with nova
* run one ceilometer-compute-agent
* tail -f ceilometer-agent-compute.log | grep coord  # watch group membership
* run another ceilometer-compute-agent
* see membership change and partitioning update in log
* wait a while
* kill one of the agents
* see membership and partitioning not update
* try again with memcached, see it work okay
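Workload partitioning divides the polled resources among the live group members, so a stale membership list means a dead agent keeps "owning" its share and those resources simply go unpolled. A simplified, hypothetical sketch of hash-based partitioning (illustrative only, not ceilometer's actual implementation):

```python
import hashlib


def my_share(resources, members, me):
    """Deterministically assign each resource to exactly one group member.

    Every agent runs the same computation over the same sorted member
    list, so the shares are disjoint and together cover all resources --
    but only while everyone agrees on who the members are.
    """
    ring = sorted(members)

    def owner(resource):
        digest = hashlib.sha256(resource.encode("utf-8")).hexdigest()
        return ring[int(digest, 16) % len(ring)]

    return [r for r in resources if owner(r) == me]
```

If an agent dies but never leaves the members list, its share is never reassigned, which is exactly the symptom above: kill an agent, and partitioning does not update.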

Is this a version specific problem or a mistake in the driver?

Revision history for this message
Chris Dent (cdent) wrote :

Confirmed on 2.8.17

Also: https://github.com/stackforge/tooz/blob/master/tooz/drivers/redis.py#L42 (there is no TimeoutError in redis.exceptions in python-redis 2.10.3)

(mentioning it here as I thought it might be a contributing factor, but apparently not)
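For reference, a version-tolerant import guard is the usual way to cope with exception classes that exist in some python-redis releases but not others (the names below are my own, not tooz's):

```python
# Guarded import: TimeoutError is not present in every python-redis
# release, so probe for it rather than assume it exists.
try:
    from redis.exceptions import TimeoutError as RedisTimeoutError
except ImportError:  # also raised when python-redis itself is absent
    RedisTimeoutError = None

# Build the tuple of "retryable" exceptions from whatever is available;
# ConnectionError here is the Python builtin, used as a stand-in.
RETRYABLE = tuple(exc for exc in (ConnectionError, RedisTimeoutError) if exc)
```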

Revision history for this message
Chris Dent (cdent) wrote :

that's "confirmed on redis-server 2.8.17"

Revision history for this message
Chris Dent (cdent) wrote :

The problem with TimeoutError was a red herring: I had confused python-redis versions.

Revision history for this message
Joshua Harlow (harlowja) wrote :

Not sure why this didn't post here:

https://review.openstack.org/#/c/131533/
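The gist of the fix is that each member's group entry carries a heartbeat-refreshed expiry, so a member that stops heartbeating drops out of the listing on its own. A redis-free toy model of that idea (class and method names are illustrative, not tooz's internals):

```python
import time


class ExpiringGroup:
    """Toy model of TTL-based membership: each heartbeat refreshes a
    per-member deadline, and members past their deadline are dropped."""

    def __init__(self, ttl):
        self.ttl = ttl
        self._deadlines = {}

    def heartbeat(self, member, now=None):
        # Refresh this member's liveness deadline.
        now = time.time() if now is None else now
        self._deadlines[member] = now + self.ttl

    def members(self, now=None):
        # Only members whose deadline is still in the future are listed.
        now = time.time() if now is None else now
        return sorted(m for m, d in self._deadlines.items() if d > now)
```

A killed agent never heartbeats again, so the next listing after its TTL elapses no longer contains it, and partitioning can rebalance.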

Revision history for this message
Chris Dent (cdent) wrote :

I can confirm that the proposed fix seems to fix the problem. Adding and removing agents impacts the partitioning as expected.

A related observation: I'm not certain, but the catch-up (at least on a 30-second polling cycle) appears a bit slower than in similar tests I've previously done with memcached. I'm guessing the timeout handling differs? We don't need to expect consistency between drivers on this front; just noting it for reference.

Julien Danjou (jdanjou)
Changed in python-tooz:
status: New → Fix Committed
importance: Undecided → High
Julien Danjou (jdanjou)
Changed in python-tooz:
milestone: none → 0.10
status: Fix Committed → Fix Released