Bug #1533787: workload_partitioning inconsistently reports group members (Ceilometer)

gordon chung (chungg) wrote on 2016-01-13:

#1

there appears to be a race happening. basically, if you start multiple agents at the same time, there is a chance that one join will overwrite another join. because of this, members are inadvertently removed from the group.

Changed in ceilometer:
status:	New → Triaged
importance:	Undecided → Critical
assignee:	nobody → gordon chung (chungg)

Julien Danjou (jdanjou) wrote on 2016-01-14:

#2

Joining a group with the memcached or redis driver is atomic, so that shouldn't be happening. We really need more info at this point.

Changed in python-tooz:
status:	New → Incomplete

OpenStack Infra (hudson-openstack) wrote on 2016-01-14: Fix proposed to ceilometer (master)

#3

Fix proposed to branch: master
Review: https://review.openstack.org/267625

Changed in ceilometer:
status:	Triaged → In Progress

OpenStack Infra (hudson-openstack) wrote on 2016-01-26: Fix merged to ceilometer (master)

#4

Reviewed: https://review.openstack.org/267625
Committed: https://git.openstack.org/cgit/openstack/ceilometer/commit/?id=e84a10882a9b682ff41c84e8bf4ee2497e7e7a31
Submitter: Jenkins
Branch: master

commit e84a10882a9b682ff41c84e8bf4ee2497e7e7a31
Author: gordon chung <email address hidden>
Date: Thu Jan 14 09:47:33 2016 -0500

    better support notification coordination

    when launching multiple agents at same time, there is a chance that
    agents will miss the registry of another agent. this is possible
    because there is a lot of overhead involved when starting up agents,
    specifically with initialising managers.

    this change makes it so the agent only joins the group AFTER it has
    done all setup that does not require coordination. after it joins,
    we start listening right away for other changes to group membership

    additionally, this adds a lock to pipeline queue setup so only one
    event at any time can trigger a reconfiguration.

    Change-Id: I8100160a3aa83a190c4110e6e8be9b26aef8fd1c
    Closes-Bug: #1533787
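
The ordering described in this commit message can be sketched roughly as follows. This is only an illustration under stated assumptions, not the ceilometer code: `FakeCoordinator`, `NotificationAgent`, and all of their methods are hypothetical names, and an in-memory set stands in for the coordination backend's group.

```python
import threading


class FakeCoordinator:
    """Hypothetical in-memory stand-in for a group membership backend."""

    def __init__(self):
        self._members = set()
        self._lock = threading.Lock()

    def join(self, member_id):
        with self._lock:
            self._members.add(member_id)

    def members(self):
        with self._lock:
            return sorted(self._members)


class NotificationAgent:
    """Hypothetical agent: slow setup first, join the group last."""

    def __init__(self, member_id, coordinator):
        self.member_id = member_id
        self.coordinator = coordinator
        self.events = []
        self.seen_members = []
        # Only one event at a time may rebuild the pipeline queue listeners.
        self._reconfigure_lock = threading.Lock()

    def start(self):
        # 1. Expensive setup that needs no coordination (e.g. managers).
        self.events.append("setup")
        # 2. Join only once the agent is ready to react to membership changes.
        self.coordinator.join(self.member_id)
        self.events.append("join")
        # 3. Configure pipeline queues for the current membership right away.
        self.reconfigure()

    def reconfigure(self):
        # Serialized so two membership events cannot interleave their setup.
        with self._reconfigure_lock:
            self.seen_members = self.coordinator.members()
            self.events.append("reconfigure")


coordinator = FakeCoordinator()
agents = [NotificationAgent("agent-%d" % i, coordinator) for i in range(2)]
threads = [threading.Thread(target=agent.start) for agent in agents]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()

# A membership-change callback would trigger this on every agent.
for agent in agents:
    agent.reconfigure()

print(coordinator.members())
```

The point of the sketch is the order inside `start()`: the join happens after the slow setup, so the window in which an agent is a group member but not yet watching membership stays small.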

Changed in ceilometer:
status:	In Progress → Fix Released

OpenStack Infra (hudson-openstack) wrote on 2016-01-26: Fix proposed to ceilometer (stable/liberty)

#5

Fix proposed to branch: stable/liberty
Review: https://review.openstack.org/272693

Rohit Jaiswal (rohit-jaiswal-3) wrote on 2016-01-28:

#6

Can this issue also occur with the ZK backend?

In our deployment, we enable workload_partitioning with the below conf:

check_watchers=10.0
heartbeat=1.0

But we don't see any samples being published. ls /tooz/ceilometer.notification in the ZK shell returns an empty set, and a grep for pipeline consumers/listeners returns nothing from rabbit.

gordon chung (chungg) wrote on 2016-01-28:

#7

hey Rohit,

i'm still trying to test this. going to try and test with just tooz... do you see consumers on main service queues?

one interesting test run i've seen is: http://logs.openstack.org/92/268292/4/check/gate-tempest-dsvm-ceilometer-mysql-neutron-full/a7ee5b0/logs/screen-ceilometer-anotification.txt.gz?level=INFO

basically both agents failed to join the group (https://review.openstack.org/#/c/268292/4), and so no one is processing the pipeline queues

Rohit Jaiswal (rohit-jaiswal-3) wrote on 2016-01-28:

#8

gordc: i see consumers for main queues but not for pipeline queues. will having a retry in the member retrieval logic help here? We could max out a configurable number of retries before raising the tooz.coordination.MemberNotJoined exception.
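
A retry of the kind suggested here might look roughly like the sketch below. It is an assumption-laden illustration: `get_members_with_retry`, `max_retries`, `delay`, and `flaky_get_members` are hypothetical names, and the `MemberNotJoined` class defined here is a local stand-in for the tooz exception named above so that the example runs without tooz installed.

```python
import time


class MemberNotJoined(Exception):
    """Local stand-in for the tooz exception named in the comment above."""


def get_members_with_retry(get_members, max_retries=3, delay=0.0):
    """Call get_members(), retrying up to max_retries times on MemberNotJoined."""
    attempt = 0
    while True:
        try:
            return get_members()
        except MemberNotJoined:
            if attempt >= max_retries:
                # Retries exhausted: let the original exception propagate.
                raise
            attempt += 1
            time.sleep(delay)


# Hypothetical backend whose first two lookups fail before the join is visible.
calls = {"count": 0}


def flaky_get_members():
    calls["count"] += 1
    if calls["count"] <= 2:
        raise MemberNotJoined()
    return ["agent-0", "agent-1"]


members = get_members_with_retry(flaky_get_members, max_retries=3)
print(members, calls["count"])
```

A retry like this only papers over a join that becomes visible late. It would not help if the join itself never happens or is overwritten.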

gordon chung (chungg) wrote on 2016-01-28:

#9

hmm.. this is weirder:

http://logs.openstack.org/53/273253/2/gate/gate-heat-dsvm-functional-convg-mysql/faab21e/logs/screen-ceilometer-anotification.txt.gz

1st worker joins, sees itself:
http://logs.openstack.org/28/272728/3/check/gate-tempest-dsvm-ceilometer-mongodb-full/bd2cab3/logs/screen-ceilometer-anotification.txt.gz#_2016-01-28_18_46_36_735

2nd worker joins, sees only itself (not 1st agent):
http://logs.openstack.org/28/272728/3/check/gate-tempest-dsvm-ceilometer-mongodb-full/bd2cab3/logs/screen-ceilometer-anotification.txt.gz#_2016-01-28_18_46_37_268

1st worker tries to pull tasks again and finds nothing:
http://logs.openstack.org/28/272728/3/check/gate-tempest-dsvm-ceilometer-mongodb-full/bd2cab3/logs/screen-ceilometer-anotification.txt.gz#_2016-01-28_18_46_39_870

2nd worker tries to pull tasks again and finds nothing:
http://logs.openstack.org/28/272728/3/check/gate-tempest-dsvm-ceilometer-mongodb-full/bd2cab3/logs/screen-ceilometer-anotification.txt.gz#_2016-01-28_18_46_39_872

gordon chung (chungg) wrote on 2016-01-28:

#10

sigh. i think it's something related to this: https://review.openstack.org/#/c/83140/

OpenStack Infra (hudson-openstack) wrote on 2016-01-28: Fix proposed to ceilometer (master)

#11

Fix proposed to branch: master
Review: https://review.openstack.org/273792

gordon chung (chungg) on 2016-01-29

Changed in ceilometer:
status:	Fix Released → In Progress

gordon chung (chungg) on 2016-01-29

Changed in python-tooz:
status:	Incomplete → Invalid

OpenStack Infra (hudson-openstack) wrote on 2016-01-29: Fix proposed to ceilometer (stable/liberty)

#12

Fix proposed to branch: stable/liberty
Review: https://review.openstack.org/274069

OpenStack Infra (hudson-openstack) wrote on 2016-01-29: Fix proposed to ceilometer (stable/kilo)

#13

Fix proposed to branch: stable/kilo
Review: https://review.openstack.org/274070

OpenStack Infra (hudson-openstack) wrote on 2016-02-03: Change abandoned on ceilometer (stable/kilo)

#14

Change abandoned by gordon chung (<email address hidden>) on branch: stable/kilo
Review: https://review.openstack.org/274070

OpenStack Infra (hudson-openstack) wrote on 2016-02-04: Fix merged to ceilometer (master)

#15

Reviewed: https://review.openstack.org/273792
Committed: https://git.openstack.org/cgit/openstack/ceilometer/commit/?id=919af096f79c8861ee315c9577344853d271ae15
Submitter: Jenkins
Branch: master

commit 919af096f79c8861ee315c9577344853d271ae15
Author: gordon chung <email address hidden>
Date: Thu Jan 28 17:09:29 2016 -0500

    do not configure worker specific items in init

    this further corrects some issues with coordination. the basic issue
    is described in a bug i fixed a while ago[1]. basically everything
    defined in init is sort of shared by workers. if it's unique to a worker
    it should not be defined in init.

    the weird addition to stop() is to maintain bug1418793[2]

    [1] I2ad05e2085c0c0f78653c6354d301d18b8dee121
    [2] Ied2f086e1f50950b430095ae7ee89036fd4a89d9

    Change-Id: I979fdcd350c9a3a0fb6c02e831f1a450133ef76a
    Closes-Bug: #1533787
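
The problem this commit message describes can be illustrated with a small sketch. It assumes that each worker starts from a copy of the parent service object (as happens when workers are forked after the service is constructed) and that the worker-specific item is a group member id. Both are assumptions made for illustration, and `SharedIdService`, `PerWorkerIdService`, and `run_workers` are hypothetical names. `copy.copy` stands in for the fork.

```python
import copy
import uuid


class SharedIdService:
    """Anti-pattern: per-worker identity is created in __init__."""

    def __init__(self):
        self.member_id = str(uuid.uuid4())

    def start(self, group):
        group.add(self.member_id)


class PerWorkerIdService:
    """Per-worker identity is created in start(), after the worker exists."""

    def __init__(self):
        self.member_id = None

    def start(self, group):
        self.member_id = str(uuid.uuid4())
        group.add(self.member_id)


def run_workers(service, count):
    """Model forking: each worker starts from a copy of the parent's state."""
    group = set()
    for _ in range(count):
        worker = copy.copy(service)
        worker.start(group)
    return group


shared = run_workers(SharedIdService(), 2)
per_worker = run_workers(PerWorkerIdService(), 2)
print(len(shared), len(per_worker))
```

With the id created in `__init__`, both workers join under the same id and the group ends up with a single member. With the id created in `start()`, each worker registers separately.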

Changed in ceilometer:
status:	In Progress → Fix Released

OpenStack Infra (hudson-openstack) wrote on 2016-02-09: Fix merged to ceilometer (stable/liberty)

#16

Reviewed: https://review.openstack.org/272693
Committed: https://git.openstack.org/cgit/openstack/ceilometer/commit/?id=67e47cda8e7e0d2649fef334a6e0db2826d5fbd1
Submitter: Jenkins
Branch: stable/liberty

commit 67e47cda8e7e0d2649fef334a6e0db2826d5fbd1
Author: gordon chung <email address hidden>
Date: Thu Jan 14 09:47:33 2016 -0500

    better support notification coordination

    when launching multiple agents at same time, there is a chance that
    agents will miss the registry of another agent. this is possible
    because there is a lot of overhead involved when starting up agents,
    specifically with initialising managers.

    this change makes it so the agent only joins the group AFTER it has
    done all setup that does not require coordination. after it joins,
    we start listening right away for other changes to group membership

    additionally, this adds a lock to pipeline queue setup so only one
    event at any time can trigger a reconfiguration.

    Change-Id: I8100160a3aa83a190c4110e6e8be9b26aef8fd1c
    Closes-Bug: #1533787

Thierry Carrez (ttx) wrote on 2016-03-03: Fix included in openstack/ceilometer 6.0.0.0b3

#17

This issue was fixed in the openstack/ceilometer 6.0.0.0b3 development milestone.

Doug Hellmann (doug-hellmann) wrote on 2016-05-23: Fix included in openstack/ceilometer 5.0.3

#19

This issue was fixed in the openstack/ceilometer 5.0.3 release.

OpenStack Infra (hudson-openstack) wrote on 2016-06-30: Fix merged to ceilometer (stable/liberty)

#20

Reviewed: https://review.openstack.org/274069
Committed: https://git.openstack.org/cgit/openstack/ceilometer/commit/?id=7dbaf4c207253d24ba7a9c9f62996562a1800098
Submitter: Jenkins
Branch: stable/liberty

commit 7dbaf4c207253d24ba7a9c9f62996562a1800098
Author: gordon chung <email address hidden>
Date: Thu Jan 28 17:09:29 2016 -0500

    do not configure worker specific items in init

    this further corrects some issues with coordination. the basic issue
    is described in a bug i fixed a while ago[1]. basically everything
    defined in init is sort of shared by workers. if it's unique to a worker
    it should not be defined in init.

    the weird addition to stop() is to maintain bug1418793[2]

    [1] I2ad05e2085c0c0f78653c6354d301d18b8dee121
    [2] Ied2f086e1f50950b430095ae7ee89036fd4a89d9

    Change-Id: I979fdcd350c9a3a0fb6c02e831f1a450133ef76a
    Closes-Bug: #1533787

Davanum Srinivas (DIMS) (dims-v) wrote on 2016-07-07: Fix included in openstack/ceilometer 5.0.4

#21

This issue was fixed in the openstack/ceilometer 5.0.4 release.

workload_partitioning inconsistently reports group members

Affects	Status	Importance	Assigned to
Ceilometer	Fix Released	Critical	gordon chung
Liberty	Fix Committed	Undecided	gordon chung
tooz	Invalid	Undecided	Unassigned