workload_partitioning inconsistently reports group members

Bug #1533787 reported by gordon chung on 2016-01-13
This bug affects 1 person
Affects      Status         Importance  Assigned to   Milestone
Ceilometer   Fix Released   Critical    gordon chung
Liberty      Fix Committed  Undecided   gordon chung
tooz                        Undecided   Unassigned

Bug Description

using both redis and memcache drivers, the group members returned by the coordinator are often incorrect and sometimes do not even include the calling agent itself.

gordon chung (chungg) wrote :

there appears to be a race happening. basically, if you start multiple agents at the same time, there is a chance that one join will overwrite another join. because of this, members are inadvertently removed from the group.

Changed in ceilometer:
status: New → Triaged
importance: Undecided → Critical
assignee: nobody → gordon chung (chungg)
Julien Danjou (jdanjou) wrote :

Joining a group with the memcached or redis driver is atomic, so that shouldn't be happening. We really need more info at this point.

Changed in python-tooz:
status: New → Incomplete

Fix proposed to branch: master
Review: https://review.openstack.org/267625

Changed in ceilometer:
status: Triaged → In Progress

Reviewed: https://review.openstack.org/267625
Committed: https://git.openstack.org/cgit/openstack/ceilometer/commit/?id=e84a10882a9b682ff41c84e8bf4ee2497e7e7a31
Submitter: Jenkins
Branch: master

commit e84a10882a9b682ff41c84e8bf4ee2497e7e7a31
Author: gordon chung <email address hidden>
Date: Thu Jan 14 09:47:33 2016 -0500

    better support notification coordination

    when launching multiple agents at same time, there is a chance that
    agents will miss the registry of another agent. this is possible
    because there is a lot of overhead involved when starting up agents,
    specifically with initialising managers.

    this change makes it so the agent only joins the group AFTER it has
    done all setup that does not require coordination. after it joins,
    we start listening right away for other changes to group membership

    additionally, this adds a lock to pipeline queue setup so only one
    event at any time can trigger a reconfiguration.

    Change-Id: I8100160a3aa83a190c4110e6e8be9b26aef8fd1c
    Closes-Bug: #1533787
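The ordering described in this commit message can be sketched roughly as follows. This is a minimal illustration, not the actual Ceilometer change: FakeCoordinator, NotificationAgent, and every method name here are hypothetical stand-ins, and the in-memory coordinator only mimics a group membership API.

```python
import threading


class FakeCoordinator:
    """Hypothetical in-memory stand-in for a group coordinator."""

    def __init__(self):
        self._members = set()
        self._lock = threading.Lock()

    def join_group(self, member_id):
        with self._lock:
            self._members.add(member_id)

    def get_members(self):
        with self._lock:
            return sorted(self._members)


class NotificationAgent:
    """Hypothetical agent; names are illustrative, not Ceilometer's."""

    def __init__(self, member_id, coordinator):
        self.member_id = member_id
        self.coordinator = coordinator
        self.events = []  # records the order of startup steps
        self.reconfigure_lock = threading.Lock()
        self.reconfigurations = 0

    def _setup_managers(self):
        # Slow setup that needs no coordination runs before joining.
        self.events.append("setup")

    def _watch_group(self):
        # Start listening for membership changes right after joining.
        self.events.append("watch")

    def start(self):
        self._setup_managers()
        self.coordinator.join_group(self.member_id)
        self.events.append("join")
        self._watch_group()

    def reconfigure_pipeline_queues(self):
        # Only one membership event at a time may trigger a reconfiguration.
        with self.reconfigure_lock:
            self.reconfigurations += 1
            return self.coordinator.get_members()
```

The point of the ordering is that the window between joining and watching stays small, so an agent that starts at the same time as others is less likely to miss their registration.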

Changed in ceilometer:
status: In Progress → Fix Released
Rohit Jaiswal (rohit-jaiswal-3) wrote :

Can this issue also occur with the ZK backend?

In our deployment, we enable workload_partitioning with the following conf:

check_watchers=10.0
heartbeat=1.0

But we don't see any samples being published. Running ls /tooz/ceilometer.notification in the ZK shell returns an empty set, and a grep for pipeline consumers/listeners returns nothing from rabbit.

gordon chung (chungg) wrote :

hey Rohit,

i'm still trying to test this. going to try and test with just tooz... do you see consumers on main service queues?

one interesting test run i've seen is: http://logs.openstack.org/92/268292/4/check/gate-tempest-dsvm-ceilometer-mysql-neutron-full/a7ee5b0/logs/screen-ceilometer-anotification.txt.gz?level=INFO

basically both agents failed to join the group (https://review.openstack.org/#/c/268292/4), so no one is processing the pipeline queues.

Rohit Jaiswal (rohit-jaiswal-3) wrote :

gordc: i see consumers for main queues but not for pipeline queues. would a retry in the member retrieval logic help here? we could allow a configurable maximum number of retries before raising the tooz.coordination.MemberNotJoined exception.
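A bounded retry of the kind suggested here might look like the sketch below. It is only an illustration: get_members_with_retry, max_retries, and delay are hypothetical names, and the local MemberNotJoined class stands in for tooz.coordination.MemberNotJoined so the example runs without tooz installed.

```python
import time


class MemberNotJoined(Exception):
    """Local stand-in for tooz.coordination.MemberNotJoined."""


def get_members_with_retry(get_members, max_retries=3, delay=0.0):
    """Call get_members(), retrying up to max_retries times.

    get_members is any zero-argument callable (an assumption made for
    this sketch). The last MemberNotJoined is re-raised once the retry
    budget is used up.
    """
    attempt = 0
    while True:
        try:
            return get_members()
        except MemberNotJoined:
            if attempt >= max_retries:
                raise
            attempt += 1
            time.sleep(delay)
```

Note that a retry only helps if the agent eventually joins; if the join never happened, the exception still surfaces after the last attempt.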

gordon chung (chungg) wrote :

sigh. i think it's something related to this: https://review.openstack.org/#/c/83140/

gordon chung (chungg) on 2016-01-29
Changed in ceilometer:
status: Fix Released → In Progress
gordon chung (chungg) on 2016-01-29
Changed in python-tooz:
status: Incomplete → Invalid

Change abandoned by gordon chung (<email address hidden>) on branch: stable/kilo
Review: https://review.openstack.org/274070

Reviewed: https://review.openstack.org/273792
Committed: https://git.openstack.org/cgit/openstack/ceilometer/commit/?id=919af096f79c8861ee315c9577344853d271ae15
Submitter: Jenkins
Branch: master

commit 919af096f79c8861ee315c9577344853d271ae15
Author: gordon chung <email address hidden>
Date: Thu Jan 28 17:09:29 2016 -0500

    do not configure worker specific items in init

    this further corrects some issues with coordination. the basic issue
    is described in a bug i fixed a while ago[1]. basically everything
    defined in init is sort of shared by workers. if it's unique to a worker
    it should not be defined in init.

    the weird addition to stop() is to maintain bug1418793[2]

    [1] I2ad05e2085c0c0f78653c6354d301d18b8dee121
    [2] Ied2f086e1f50950b430095ae7ee89036fd4a89d9

    Change-Id: I979fdcd350c9a3a0fb6c02e831f1a450133ef76a
    Closes-Bug: #1533787
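The init-versus-worker distinction in this commit message can be illustrated with a small sketch. Everything in it is an assumption made for illustration: the class names are hypothetical, the member id is just one example of worker-specific state, and copy.copy() stands in for the copy of the parent object that each forked worker receives.

```python
import copy
import uuid


class SharedInInit:
    """Anti-pattern: the member id is created once, before workers exist."""

    def __init__(self):
        self.member_id = str(uuid.uuid4())

    def start(self):
        return self.member_id


class PerWorker:
    """Worker-specific state is created in start(), once per worker."""

    def __init__(self):
        self.member_id = None

    def start(self):
        self.member_id = str(uuid.uuid4())
        return self.member_id


def launch_workers(service, count):
    # copy.copy() simulates each worker inheriting the parent's object.
    return [copy.copy(service).start() for _ in range(count)]
```

With SharedInInit every simulated worker reports the same id, which to a coordinator would look like a single member; with PerWorker each worker gets its own.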

Changed in ceilometer:
status: In Progress → Fix Released

Reviewed: https://review.openstack.org/272693
Committed: https://git.openstack.org/cgit/openstack/ceilometer/commit/?id=67e47cda8e7e0d2649fef334a6e0db2826d5fbd1
Submitter: Jenkins
Branch: stable/liberty

commit 67e47cda8e7e0d2649fef334a6e0db2826d5fbd1
Author: gordon chung <email address hidden>
Date: Thu Jan 14 09:47:33 2016 -0500

    better support notification coordination

    when launching multiple agents at same time, there is a chance that
    agents will miss the registry of another agent. this is possible
    because there is a lot of overhead involved when starting up agents,
    specifically with initialising managers.

    this change makes it so the agent only joins the group AFTER it has
    done all setup that does not require coordination. after it joins,
    we start listening right away for other changes to group membership

    additionally, this adds a lock to pipeline queue setup so only one
    event at any time can trigger a reconfiguration.

    Change-Id: I8100160a3aa83a190c4110e6e8be9b26aef8fd1c
    Closes-Bug: #1533787

This issue was fixed in the openstack/ceilometer 6.0.0.0b3 development milestone.

This issue was fixed in the openstack/ceilometer 5.0.3 release.

Reviewed: https://review.openstack.org/274069
Committed: https://git.openstack.org/cgit/openstack/ceilometer/commit/?id=7dbaf4c207253d24ba7a9c9f62996562a1800098
Submitter: Jenkins
Branch: stable/liberty

commit 7dbaf4c207253d24ba7a9c9f62996562a1800098
Author: gordon chung <email address hidden>
Date: Thu Jan 28 17:09:29 2016 -0500

    do not configure worker specific items in init

    this further corrects some issues with coordination. the basic issue
    is described in a bug i fixed a while ago[1]. basically everything
    defined in init is sort of shared by workers. if it's unique to a worker
    it should not be defined in init.

    the weird addition to stop() is to maintain bug1418793[2]

    [1] I2ad05e2085c0c0f78653c6354d301d18b8dee121
    [2] Ied2f086e1f50950b430095ae7ee89036fd4a89d9

    Change-Id: I979fdcd350c9a3a0fb6c02e831f1a450133ef76a
    Closes-Bug: #1533787

This issue was fixed in the openstack/ceilometer 5.0.4 release.
