notification agent: a lot of partition coordinator warnings

Bug #1566887 reported by gordon chung
This bug affects 1 person
Affects     Status        Importance  Assigned to     Milestone
Ceilometer  Fix Released  Medium      Mehdi Abaakouk

Bug Description

The following warnings show up pretty consistently.

This is with workload partitioning on and 10 pipeline_queues.

2016-04-06 10:08:12.900 13227 WARNING oslo.service.loopingcall [req-7e2ed8f3-fb96-4425-89e1-915f3f2cf662 - - - - -] Function 'ceilometer.coordination.PartitionCoordinator.run_watchers' run outlasted interval by 109.66 sec
2016-04-06 10:08:12.914 13227 WARNING ceilometer.coordination [req-7e2ed8f3-fb96-4425-89e1-915f3f2cf662 - - - - -] Cannot extract tasks because agent failed to join group properly. Rejoining group.
2016-04-06 10:08:12.925 13227 INFO ceilometer.coordination [req-7e2ed8f3-fb96-4425-89e1-915f3f2cf662 - - - - -] Joined partitioning group ceilometer.notification
2016-04-06 10:11:12.875 13228 WARNING oslo.service.loopingcall [req-141a4468-9459-4ea0-84f5-da3ee6c03e1a - - - - -] Function 'ceilometer.coordination.PartitionCoordinator.run_watchers' run outlasted interval by 160.03 sec
2016-04-06 10:11:13.120 13228 WARNING ceilometer.coordination [req-141a4468-9459-4ea0-84f5-da3ee6c03e1a - - - - -] Cannot extract tasks because agent failed to join group properly. Rejoining group.
2016-04-06 10:11:13.170 13228 INFO ceilometer.coordination [req-141a4468-9459-4ea0-84f5-da3ee6c03e1a - - - - -] Joined partitioning group ceilometer.notification
2016-04-06 10:13:33.042 13227 WARNING oslo.service.loopingcall [req-7e2ed8f3-fb96-4425-89e1-915f3f2cf662 - - - - -] Function 'ceilometer.coordination.PartitionCoordinator.run_watchers' run outlasted interval by 120.08 sec
2016-04-06 10:13:33.051 13227 WARNING ceilometer.coordination [req-7e2ed8f3-fb96-4425-89e1-915f3f2cf662 - - - - -] Cannot extract tasks because agent failed to join group properly. Rejoining group.
2016-04-06 10:13:33.060 13227 INFO ceilometer.coordination [req-7e2ed8f3-fb96-4425-89e1-915f3f2cf662 - - - - -] Joined partitioning group ceilometer.notification

Revision history for this message
gordon chung (chungg) wrote :

Name: tooz
Version: 1.34.0

Revision history for this message
gordon chung (chungg) wrote :

Name: oslo.messaging
Version: 4.6.1

Revision history for this message
ZhiQiang Fan (aji-zqfan) wrote :

This happens because when a new process/member joins the group, the tasks are redistributed across all members, so the existing members need to kill some listeners because their share of the tasks has shrunk.

However, we kill a listener by calling the blocking method wait(), which gets worse if that listener holds resources it has to clean up. In my devstack env, with no load at all, some listeners take dozens of seconds to exit.

The heartbeat has no chance to run while the listeners are being killed, so the agent drops out of the group automatically. After all listeners exit, it finds that it is no longer in the group and rejoins, which puts it back at the first stage .....
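
The sequence described above can be sketched as a small simulation. This is only an illustration, not ceilometer or tooz code: FakeGroup, run_agent, the member name "agent-1", and the timeout and interval values are all hypothetical, and a fake clock stands in for real time. It shows how a single-threaded agent that blocks while stopping listeners misses its heartbeats, expires from the group, and has to rejoin.

```python
class FakeGroup:
    """Hypothetical stand-in for a coordination backend with member expiry."""

    def __init__(self, timeout):
        self.timeout = timeout
        self.last_beat = {}  # member name -> time of last heartbeat

    def join(self, member, now):
        self.last_beat[member] = now

    def heartbeat(self, member, now):
        # Only current members can refresh their liveness.
        if member in self.last_beat:
            self.last_beat[member] = now

    def is_member(self, member, now):
        # Drop every member whose last heartbeat is older than the timeout.
        for name, beat in list(self.last_beat.items()):
            if now - beat > self.timeout:
                del self.last_beat[name]
        return member in self.last_beat


def run_agent(listener_stop_seconds, heartbeat_interval=1.0, timeout=10.0):
    """Simulate one rebalance in a single-threaded agent.

    Returns (still_member, rejoined) as seen right after the rebalance.
    """
    group = FakeGroup(timeout)
    now = 0.0
    group.join("agent-1", now)

    # Normal operation: the heartbeat fires on schedule.
    for _ in range(5):
        now += heartbeat_interval
        group.heartbeat("agent-1", now)

    # Rebalance: each blocking wait() advances the clock, and no
    # heartbeat can run in between because the same thread is busy.
    for stop_time in listener_stop_seconds:
        now += stop_time

    still_member = group.is_member("agent-1", now)
    rejoined = False
    if not still_member:
        # Matches the "Rejoining group" step seen in the log.
        group.join("agent-1", now)
        rejoined = True
    return still_member, rejoined
```

With these assumed values, run_agent([2, 3]) stays in the group, while run_agent([20, 15]) is expired and has to rejoin.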

Changed in ceilometer:
assignee: nobody → ZhiQiang Fan (aji-zqfan)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to ceilometer (master)

Fix proposed to branch: master
Review: https://review.openstack.org/304054

Changed in ceilometer:
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: master
Review: https://review.openstack.org/308190

Changed in ceilometer:
assignee: ZhiQiang Fan (aji-zqfan) → Mehdi Abaakouk (sileht)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on ceilometer (master)

Change abandoned by ZhiQiang Fan (<email address hidden>) on branch: master
Review: https://review.openstack.org/304054
Reason: better solution is on the way

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to ceilometer (master)

Reviewed: https://review.openstack.org/308190
Committed: https://git.openstack.org/cgit/openstack/ceilometer/commit/?id=24f75dae1219e8b82ffe26571a5ebe3e6b746701
Submitter: Jenkins
Branch: master

commit 24f75dae1219e8b82ffe26571a5ebe3e6b746701
Author: Mehdi Abaakouk <email address hidden>
Date: Wed Apr 20 09:33:25 2016 +0200

    notification: Remove eventlet timers

    This change removes usage of eventlet timers.

    This allows coordinator heartbeat/watchers to work correctly when
    the main thread is stuck for any reason (IO, time.sleep, ...).

    Change-Id: I847aebb0d0166c2b46505061a15a06e3ce1b5eb2
    Closes-Bug: #1566887
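
A rough sketch of the general idea in the commit message, assuming a plain OS thread is used for the heartbeat. This is not the code from the commit: HeartbeatThread and demo are hypothetical names, and the intervals are arbitrary. It only shows that a heartbeat on its own thread keeps firing while the main thread is stuck in a blocking call.

```python
import threading
import time


class HeartbeatThread(threading.Thread):
    """Call `beat` every `interval` seconds on a dedicated thread."""

    def __init__(self, beat, interval):
        super().__init__(daemon=True)
        self._beat = beat
        self._interval = interval
        self._stop_event = threading.Event()

    def run(self):
        # Event.wait() returns False on timeout and True once stop() is called.
        while not self._stop_event.wait(self._interval):
            self._beat()

    def stop(self):
        self._stop_event.set()
        self.join()


def demo(block_seconds=0.5, interval=0.05):
    """Block the main thread and count heartbeats sent meanwhile."""
    beats = []
    heartbeat = HeartbeatThread(lambda: beats.append(time.monotonic()), interval)
    heartbeat.start()
    time.sleep(block_seconds)  # stands in for a blocking wait() on a listener
    heartbeat.stop()
    return len(beats)
```

Calling demo() should return a count well above zero, since the heartbeat does not depend on the main thread yielding.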

Changed in ceilometer:
status: In Progress → Fix Released
Revision history for this message
Davanum Srinivas (DIMS) (dims-v) wrote : Fix included in openstack/ceilometer 7.0.0.0b2

This issue was fixed in the openstack/ceilometer 7.0.0.0b2 development milestone.
