notification agent does not refresh

Bug #1729617 reported by gordon chung
Affects: Ceilometer
Status: Fix Released | Importance: Critical | Assigned to: gordon chung

Affects: ceilometer (Ubuntu)
Status: Invalid | Importance: Undecided | Assigned to: Unassigned

Bug Description

when we switched to partitioning in tooz, we broke the refresh of notification agents, so they no longer pick up new pipelines
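
For context, the refresh mechanism the fix restores relies on tooz group-membership watch callbacks plus a heartbeat. Below is a minimal sketch of that pattern, assuming a redis coordination backend; the group name, member id and rebuild_listeners() callback are illustrative, not the actual ceilometer agent code:

import time

from tooz import coordination

GROUP = b'ceilometer.notification'

# illustrative member id and backend url
coord = coordination.get_coordinator('redis://localhost:6379', b'agent-1')
coord.start(start_heart=True)  # background heartbeat so the member does not expire

try:
    coord.create_group(GROUP).get()
except coordination.GroupAlreadyExist:
    pass
coord.join_group(GROUP).get()


def rebuild_listeners(event):
    # placeholder for what the agent does on a change: re-read membership,
    # recompute its share of the pipeline queues and restart its listeners
    members = coord.get_members(GROUP).get()
    print('group changed (%s), now %d members' % (event, len(members)))


# the part that regressed: without these watches (and a periodic run_watchers
# call) an agent never notices joins/leaves and never refreshes
coord.watch_join_group(GROUP, rebuild_listeners)
coord.watch_leave_group(GROUP, rebuild_listeners)

while True:
    coord.run_watchers()  # fires the callbacks above when membership changed
    time.sleep(10)        # roughly the role of [coordination]/check_watchers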

Changed in ceilometer:
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to ceilometer (master)

Reviewed: https://review.openstack.org/517337
Committed: https://git.openstack.org/cgit/openstack/ceilometer/commit/?id=75cc518c2f86afd02a7e60df150148c8a0f2e813
Submitter: Zuul
Branch: master

commit 75cc518c2f86afd02a7e60df150148c8a0f2e813
Author: gord chung <email address hidden>
Date: Thu Nov 2 14:49:00 2017 +0000

    refresh agent if group membership changes

    this broke when we switched to tooz partitioner
    - ensure we trigger refresh if group changes
    - ensure we have heartbeat or else members will just die.

    - remove retain_common_targets tests because it doesn't make sense.
    it was originally designed for when we had listener per pipeline
    but that was changed 726b2d4d67ada3df07f36ecfd81b0cf72881e159
    - remove testing workload partitioning path in standard notification
    agent tests
    - correct test_unique test to properly validate a single target
    rather than the number of listeners we have.
    - add test to ensure group_state is updated when a member joins
    - add test to verify that listener assigned topics based on hashring

    Closes-Bug: #1729617
    Change-Id: I5039c93e6845a148c24094f755a78870d49ec19f

Changed in ceilometer:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to ceilometer (stable/pike)

Fix proposed to branch: stable/pike
Review: https://review.openstack.org/518401

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to ceilometer (stable/pike)

Reviewed: https://review.openstack.org/518401
Committed: https://git.openstack.org/cgit/openstack/ceilometer/commit/?id=124d03bf9d0a6628abf171698d94a7e17112f4ee
Submitter: Zuul
Branch: stable/pike

commit 124d03bf9d0a6628abf171698d94a7e17112f4ee
Author: gord chung <email address hidden>
Date: Thu Nov 2 14:49:00 2017 +0000

    refresh agent if group membership changes

    this broke when we switched to tooz partitioner
    - ensure we trigger refresh if group changes
    - ensure we have heartbeat or else members will just die.

    - remove retain_common_targets tests because it doesn't make sense.
    it was originally designed for when we had listener per pipeline
    but that was changed 726b2d4d67ada3df07f36ecfd81b0cf72881e159
    - remove testing workload partitioning path in standard notification
    agent tests
    - correct test_unique test to properly validate a single target
    rather than the number of listeners we have.
    - add test to ensure group_state is updated when a member joins
    - add test to verify that listener assigned topics based on hashring

    Closes-Bug: #1729617
    Change-Id: I5039c93e6845a148c24094f755a78870d49ec19f
    (cherry picked from commit 75cc518c2f86afd02a7e60df150148c8a0f2e813)

tags: added: in-stable-pike
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/ceilometer 9.0.2

This issue was fixed in the openstack/ceilometer 9.0.2 release.

Revision history for this message
György Szombathelyi (gyurco) wrote :

Applied the patch to 9.0.1, but some queues still have more than one consumer:

Two agents are running:

rabbitmqctl list_queues name consumers messages_ready messages_unacknowledged -p ceilometer
Listing queues ...
ceilometer-pipe-cpu_source:cpu_sink-0.sample 2 0 501
ceilometer-pipe-cpu_source:cpu_sink-1.sample 1 0 40
ceilometer-pipe-cpu_source:cpu_sink-2.sample 1 0 546
ceilometer-pipe-cpu_source:cpu_sink-3.sample 1 0 273
ceilometer-pipe-cpu_source:cpu_sink-4.sample 1 0 20
ceilometer-pipe-cpu_source:cpu_sink-5.sample 1 0 312
ceilometer-pipe-cpu_source:cpu_sink-6.sample 2 0 385
ceilometer-pipe-cpu_source:cpu_sink-7.sample 2 0 424
ceilometer-pipe-cpu_source:cpu_sink-8.sample 1 0 32
ceilometer-pipe-cpu_source:cpu_sink-9.sample 2 0 139

Three agents:
rabbitmqctl list_queues name consumers messages_ready messages_unacknowledged -p ceilometer
Listing queues ...
ceilometer-pipe-cpu_source:cpu_sink-0.sample 2 0 505
ceilometer-pipe-cpu_source:cpu_sink-1.sample 1 0 40
ceilometer-pipe-cpu_source:cpu_sink-2.sample 1 0 560
ceilometer-pipe-cpu_source:cpu_sink-3.sample 1 0 280
ceilometer-pipe-cpu_source:cpu_sink-4.sample 1 0 20
ceilometer-pipe-cpu_source:cpu_sink-5.sample 2 0 320
ceilometer-pipe-cpu_source:cpu_sink-6.sample 2 0 388
ceilometer-pipe-cpu_source:cpu_sink-7.sample 3 0 31
ceilometer-pipe-cpu_source:cpu_sink-8.sample 1 0 32
ceilometer-pipe-cpu_source:cpu_sink-9.sample 3 0 41

In Ocata, there was only 1 consumer per queue.
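
(For what it's worth, the expected end state of one consumer per pipeline queue follows from the hashring assignment the fix tests: every agent that agrees on the group membership maps each queue name onto the same single member. A rough sketch using tooz's HashRing, with hypothetical member ids:

from tooz import hashring

# hypothetical member ids for two notification agents
ring = hashring.HashRing({'agent-1', 'agent-2'})

for i in range(10):
    queue = b'ceilometer-pipe-cpu_source:cpu_sink-%d.sample' % i
    # each queue name hashes onto exactly one member, so once the agents
    # have resynced, every queue should be left with a single consumer
    print(queue, ring.get_nodes(queue))
)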

Revision history for this message
gordon chung (chungg) wrote :

do you have [coordination]/check_watchers set? it will basically resync every check_watchers seconds.

i have redis as my tooz coordinator and it redistributes to one consumer per queue ~1min after start up
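
for reference, check_watchers lives in the [coordination] section of ceilometer.conf; a minimal example, with the backend URL only as an illustration of a redis coordinator:

[coordination]
backend_url = redis://localhost:6379
check_watchers = 10.0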

Revision history for this message
György Szombathelyi (gyurco) wrote :

No, I don't have it, but as far as I can see it is 10.0 (seconds?) by default.

Revision history for this message
György Szombathelyi (gyurco) wrote :

Ok, it seems that if the queues don't have a long backlog, they sort themselves out eventually.

Revision history for this message
gordon chung (chungg) wrote :

it is 10s by default but for everything to be fully sync'd it sometimes takes more than a single 10s cycle.

just to be clear, when you say "they'll sort them out finally", does it eventually become 1 consumer per queue? i did notice you did not set:

[oslo_messaging_rabbit]
rabbit_qos_prefetch_count = 256

this means you're grabbing the entire queue (even though you only process 'batch_size' messages at a time).
this will cause significant memory usage if your system is backed up, and it is probably why your system didn't redistribute consumers: each agent first handles all the messages it grabbed (the entire queue without prefetch) and only then redistributes, to avoid losing messages.
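
in other words, the prefetch count caps how many messages each listener pulls off a queue at once, independent of how many it processes per batch. a rough example of the two settings together; the [notification] batching values are illustrative, not recommendations:

[oslo_messaging_rabbit]
rabbit_qos_prefetch_count = 256

[notification]
# illustrative batching values; the point is that without the prefetch cap
# above, a listener pulls the whole queue even though it only processes
# batch_size messages at a time
batch_size = 100
batch_timeout = 5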

affects: ceilometer → ceilometer (Ubuntu)
affects: ceilometer (Ubuntu) → ceilometer
Revision history for this message
György Szombathelyi (gyurco) wrote :

No, I did not set rabbit_qos_prefetch_count = 256, good to know about it :)
But it became 1 consumer/queue, since the queues were fully consumed after a while (while experimenting with bug #1729865).

Revision history for this message
György Szombathelyi (gyurco) wrote :

Well, I had something in my long-term memory about the prefetch count and bug #1551667, and this was it:
https://review.openstack.org/#/c/385079/
So it seems it is no longer necessary to set the prefetch count; that code is still in oslo.messaging.

Revision history for this message
James Page (james-page) wrote :

Marking Ubuntu task as Invalid; Ubuntu will pickup any changes in Ceilometer through the current development release and any stable point releases.

Changed in ceilometer (Ubuntu):
status: New → Invalid
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/ceilometer 10.0.0

This issue was fixed in the openstack/ceilometer 10.0.0 release.
