workload_partitioning inconsistently reports group members
| Affects | Status | Importance | Assigned to | Milestone |
|---|---|---|---|---|
| Ceilometer | Fix Released | Critical | gordon chung | |
| Liberty | Fix Committed | Undecided | gordon chung | |
| tooz | Invalid | Undecided | Unassigned | |
Bug Description
Using both the redis and memcache drivers, the group members returned by the coordinator are often incorrect and sometimes do not even include the calling agent itself.
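For orientation, here is a minimal sketch of the call pattern being reported against, assuming tooz's redis driver on localhost and made-up member/group ids; it is not ceilometer's actual code.

```python
# Minimal sketch of the pattern under discussion: join a coordination group
# and ask the coordinator for its members. The backend URL, member id and
# group name are made-up values, not taken from the bug report.
from tooz import coordination

coordinator = coordination.get_coordinator('redis://localhost:6379',
                                            b'notification-agent-1')
coordinator.start()

group = b'ceilometer.notification'
try:
    coordinator.create_group(group).get()
except coordination.GroupAlreadyExist:
    pass
coordinator.join_group(group).get()

# The reported bug: this set is often missing members, and sometimes does
# not even contain the calling agent itself.
print(coordinator.get_members(group).get())

coordinator.stop()
```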
gordon chung (chungg) wrote : | #1 |
There appears to be some race happening: if you start multiple agents at the same time, there is a chance that one join will overwrite another, and because of this members are inadvertently removed from the group.
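For illustration only, the lost-update pattern being described looks roughly like this; it is a toy in-process example, not tooz's implementation (whose drivers are supposed to make the join atomic).

```python
# Toy illustration of a lost-update race on group membership: two "agents"
# do a non-atomic read-modify-write of the same membership set, so one join
# can silently overwrite the other. Not tooz code.
import threading

store = {'members': set()}   # stands in for the backend's group key


def join(member_id):
    current = set(store['members'])   # read
    current.add(member_id)            # modify
    store['members'] = current        # write-back may clobber a concurrent join


threads = [threading.Thread(target=join, args=(m,))
           for m in (b'agent-1', b'agent-2')]
for t in threads:
    t.start()
for t in threads:
    t.join()

# With unlucky scheduling only one member survives.
print(store['members'])
```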
Changed in ceilometer: | |
status: | New → Triaged |
importance: | Undecided → Critical |
assignee: | nobody → gordon chung (chungg) |
Julien Danjou (jdanjou) wrote : | #2 |
Joining a group with the memcached or redis driver is atomic, so that shouldn't be happening. We really need more info at this point.
Changed in python-tooz: | |
status: | New → Incomplete |
Fix proposed to branch: master
Review: https:/
Changed in ceilometer: | |
status: | Triaged → In Progress |
Reviewed: https:/
Committed: https:/
Submitter: Jenkins
Branch: master
commit e84a10882a9b682
Author: gordon chung <email address hidden>
Date: Thu Jan 14 09:47:33 2016 -0500
better support notification coordination
when launching multiple agents at the same time, there is a chance that
agents will miss the registration of another agent. this is possible
because there is a lot of overhead involved when starting up agents,
specifically with initialising managers.
this change makes it so the agent only joins the group AFTER it has
done all setup that does not require coordination. after it joins,
we start listening right away for other changes to group membership.
additionally, this adds a lock to pipeline queue setup so only one
event at any time can trigger a reconfiguration.
Change-Id: I8100160a3aa83a
Closes-Bug: #1533787
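A hedged sketch of the ordering this commit message describes, with invented class and helper names (the real change is in the ceilometer notification agent):

```python
# Sketch of the fix's ordering: finish all coordination-free setup, join the
# group last, immediately watch membership changes, and serialize pipeline
# queue reconfiguration behind a lock. Class and helper names are invented.
import threading


class NotificationAgent(object):
    def __init__(self, coordinator, group_id):
        self._coordinator = coordinator
        self._group_id = group_id
        self._reconfigure_lock = threading.Lock()

    def start(self):
        self._setup_managers_and_pipelines()   # heavy, coordination-free work first
        self._coordinator.join_group(self._group_id).get()
        # listen right away so membership changes are not missed
        self._coordinator.watch_join_group(self._group_id, self._refresh)
        self._coordinator.watch_leave_group(self._group_id, self._refresh)

    def _refresh(self, event):
        # only one membership event may reconfigure the queues at a time
        with self._reconfigure_lock:
            self._reconfigure_pipeline_queues()

    def _setup_managers_and_pipelines(self):
        pass   # placeholder for manager/pipeline initialisation

    def _reconfigure_pipeline_queues(self):
        pass   # placeholder for listener re-wiring
```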
Changed in ceilometer: | |
status: | In Progress → Fix Released |
Fix proposed to branch: stable/liberty
Review: https:/
Rohit Jaiswal (rohit-jaiswal-3) wrote : | #6 |
Can this issue also occur with the ZK backend?
In our deployment, we enable workload_partitioning with:
check_watchers=10.0
heartbeat=1.0
But we don't see any samples being published; ls /tooz/ceilomete
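One way to cross-check what the ZooKeeper backend actually reports is to query it through tooz directly rather than via zkCli; the backend URL and group name below are assumptions, not taken from the deployment in question.

```python
# Diagnostic sketch: connect to the same ZooKeeper ensemble with tooz and
# print the members the coordinator sees for the group.
from tooz import coordination

coordinator = coordination.get_coordinator('kazoo://127.0.0.1:2181',
                                            b'debug-client')
coordinator.start()
print(coordinator.get_members(b'ceilometer.notification').get())
coordinator.stop()
```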
gordon chung (chungg) wrote : | #7 |
hey Rohit,
i'm still trying to test this. going to try and test with just tooz... do you see consumers on main service queues?
one interesting test run i've seen is: http://
basically both agents failed to join the group. https:/
Rohit Jaiswal (rohit-jaiswal-3) wrote : | #8 |
gordc: i see consumers for the main queues but not for the pipeline queues. Will having a retry in the member retrieval logic help here? We could max out a configurable number of retries before raising the tooz.coordinati
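A sketch of the retry being suggested here (nothing like this was merged for this bug); the retry count and delay are arbitrary example values.

```python
# Sketch: retry member retrieval a configurable number of times before
# giving up and raising.
import time

from tooz import coordination


def get_members_with_retry(coordinator, group_id, retries=5, delay=2.0):
    for attempt in range(retries):
        try:
            members = coordinator.get_members(group_id).get()
            if members:
                return members
        except coordination.ToozError:
            if attempt == retries - 1:
                raise
        time.sleep(delay)
    raise coordination.ToozError('no members found in group %r' % (group_id,))
```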
gordon chung (chungg) wrote : | #9 |
hmm.. this is weirder:
1st worker joins and sees itself:
http://
2nd worker joins, sees only itself (not 1st agent):
http://
1st worker tries to pull tasks again and finds nothing:
http://
2nd worker tries to pull tasks again and finds nothing:
http://
gordon chung (chungg) wrote : | #10 |
sigh. i think it's something related to this: https:/
Fix proposed to branch: master
Review: https:/
Changed in ceilometer: | |
status: | Fix Released → In Progress |
Changed in python-tooz: | |
status: | Incomplete → Invalid |
Fix proposed to branch: stable/liberty
Review: https:/
Fix proposed to branch: stable/kilo
Review: https:/
Change abandoned by gordon chung (<email address hidden>) on branch: stable/kilo
Review: https:/
Reviewed: https:/
Committed: https:/
Submitter: Jenkins
Branch: master
commit 919af096f79c886
Author: gordon chung <email address hidden>
Date: Thu Jan 28 17:09:29 2016 -0500
do not configure worker specific items in init
this further corrects some issues with coordination. the basic issue
is described in a bug i fixed a while ago[1]. basically, everything
defined in init is sort of shared by workers; if it's unique to a worker,
it should not be defined in init.
the weird addition to stop() is to maintain bug 1418793[2]
[1] I2ad05e2085c0c0
[2] Ied2f086e1f5095
Change-Id: I979fdcd350c9a3
Closes-Bug: #1533787
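A hedged sketch of the pattern that commit message describes, with invented names; the point is simply that the coordinator is built per worker in run(), not in __init__, where forked workers would end up sharing it.

```python
# Sketch: anything unique to a worker (member id, coordinator connection) is
# created when the worker runs, not in __init__, which executes in the parent
# process before workers are forked. Names and config access are invented.
import uuid

from tooz import coordination


class AgentService(object):
    def __init__(self, backend_url):
        # runs once in the parent; state here is inherited by every worker
        self.backend_url = backend_url
        self.partition_coordinator = None   # deliberately not created here

    def run(self):
        # runs in each worker: per-worker member id and connection
        member_id = uuid.uuid4().hex.encode()
        self.partition_coordinator = coordination.get_coordinator(
            self.backend_url, member_id)
        self.partition_coordinator.start()

    def stop(self):
        if self.partition_coordinator is not None:
            self.partition_coordinator.stop()
```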
Changed in ceilometer: | |
status: | In Progress → Fix Released |
Reviewed: https:/
Committed: https:/
Submitter: Jenkins
Branch: stable/liberty
commit 67e47cda8e7e0d2
Author: gordon chung <email address hidden>
Date: Thu Jan 14 09:47:33 2016 -0500
better support notification coordination
when launching multiple agents at the same time, there is a chance that
agents will miss the registration of another agent. this is possible
because there is a lot of overhead involved when starting up agents,
specifically with initialising managers.
this change makes it so the agent only joins the group AFTER it has
done all setup that does not require coordination. after it joins,
we start listening right away for other changes to group membership.
additionally, this adds a lock to pipeline queue setup so only one
event at any time can trigger a reconfiguration.
Change-Id: I8100160a3aa83a
Closes-Bug: #1533787
This issue was fixed in the openstack/
This issue was fixed in the openstack/
Reviewed: https:/
Committed: https:/
Submitter: Jenkins
Branch: stable/liberty
commit 7dbaf4c207253d2
Author: gordon chung <email address hidden>
Date: Thu Jan 28 17:09:29 2016 -0500
do not configure worker specific items in init
this further corrects some issues with coordination. the basic issue
is described in a bug i fixed a while ago[1]. basically, everything
defined in init is sort of shared by workers; if it's unique to a worker,
it should not be defined in init.
the weird addition to stop() is to maintain bug 1418793[2]
[1] I2ad05e2085c0c0
[2] Ied2f086e1f5095
Change-Id: I979fdcd350c9a3
Closes-Bug: #1533787
This issue was fixed in the openstack/