notification agent never idles

Bug #1478135 reported by gordon chung
This bug affects 2 people
Affects          Status      Importance   Assigned to   Milestone
Ceilometer       Won't Fix   Medium       Unassigned
oslo.messaging   Expired     Undecided    Unassigned

Bug Description

running current master with no polling agents, the notification agent maintains a constant CPU load which hovers around 6-12%... nothing is on the queue, so i'm not sure what is using the CPU.

Revision history for this message
Chris Dent (cdent) wrote :

Something, unclear thus far what, is causing epoll_wait() in a very tight loop, rather than blocking.

Revision history for this message
gordon chung (chungg) wrote :

the strange thing is that it idles at 0% to 1% after startup. only after it receives a sample does it jump and hover around ~10%

Revision history for this message
gordon chung (chungg) wrote :

i don't think it's entirely the listeners... the collector has listeners as well (one per worker) and its idle load stays below 1%.

when i turn off coordination, the notification agent has essentially the same number of listeners per agent, but it still hovers at 10%.

gordon chung (chungg)
Changed in ceilometer:
importance: Undecided → High
Revision history for this message
gordon chung (chungg) wrote :

this relates to either the pipelines or the notifiers. if i disable events (and the event pipeline), the CPU load drops and hovers around 5%... if i disable the pipeline manager but keep the main queue listeners, the cpu load hovers at/near zero.

so it's hard to tell whether that's because the pipelines are disabled or because the notifiers are disabled...

Revision history for this message
gordon chung (chungg) wrote :

taking a break... i have no idea what is happening.

my latest findings:

when i disable publishing with everything else unchanged from master (https://github.com/openstack/ceilometer/blob/master/ceilometer/publisher/messaging.py#L216-L217), everything seems fine.

but no matter which publisher i use (notifier, rpc, udp), once it sends something it starts consuming CPU.

when i change to a single pipeline instead of the default 4, the idle CPU load drops... but it still never fully idles.
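
For context, the publish path referenced above ultimately hands each batch of samples to an oslo.messaging Notifier. A minimal sketch of that kind of call, assuming a notifier-style publisher; the transport setup, topic and function names here are illustrative, not the exact Ceilometer code:

    # hedged sketch: roughly what a notifier-based publisher does per batch.
    # Notifier and its sample() method are oslo.messaging API; 'metering.sample',
    # 'telemetry.publisher' and publish() are illustrative names only.
    from oslo_config import cfg
    import oslo_messaging

    transport = oslo_messaging.get_transport(cfg.CONF)
    notifier = oslo_messaging.Notifier(transport,
                                       driver='messaging',
                                       publisher_id='telemetry.publisher',
                                       topics=['metering'])

    def publish(samples):
        # per the observation above, once a call like this has gone out,
        # the agent stops idling regardless of which publisher driver is used
        notifier.sample({}, event_type='metering.sample', payload=samples)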

Revision history for this message
Aaron DH (aaron.d) wrote :

first, i am new to ceilometer.

i think the pipeline is loaded at startup, but when there are no samples the pipeline does nothing.

maybe we can try stopping the listeners to see how the cpu usage trends.

Revision history for this message
gordon chung (chungg) wrote :

welcome Aaron!

thanks for the suggestion... let me try killing the listeners after it publishes a few items to see if anything changes.

Revision history for this message
gordon chung (chungg) wrote :

adding oslo.messaging as an affected project.

after more playing, i've noticed this:
if a notifier is enabled at any point in the notification agent (whether via the notifier/rpc publishers or workload_partitioning), the notification agent will constantly consume cpu.

BUT

if a notifier is not enabled at any point in the notification agent (udp/test publisher and no workload partitioning), all is fine.

i'm not sure why you can't have a listener and a notifier in the same service without it consuming constant cpu... my only guess right now is that it has something to do with a shared transport?
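
To make that combination concrete, here is a minimal sketch, assuming one process running both a notification listener and a notifier on a shared transport, which is roughly the shape being described; topic, endpoint and publisher names are illustrative, not the agent's actual wiring:

    # hedged reproducer sketch: a single process with both a listener and a
    # notifier on one transport, the combination reported above as never idling.
    # assumes cfg.CONF carries a transport_url pointing at a rabbit broker.
    import eventlet
    eventlet.monkey_patch()  # the notification agent runs under eventlet

    import time

    from oslo_config import cfg
    import oslo_messaging

    transport = oslo_messaging.get_transport(cfg.CONF)

    class SampleEndpoint(object):
        def sample(self, ctxt, publisher_id, event_type, payload, metadata):
            pass  # consume and discard incoming samples

    listener = oslo_messaging.get_notification_listener(
        transport,
        [oslo_messaging.Target(topic='notifications')],
        [SampleEndpoint()],
        executor='eventlet')
    notifier = oslo_messaging.Notifier(transport,
                                       driver='messaging',
                                       publisher_id='repro',
                                       topics=['metering'])

    listener.start()
    notifier.sample({}, event_type='metering.sample', payload={'value': 1})

    time.sleep(60)  # watch this process's CPU usage while it sits otherwise idle

    listener.stop()
    listener.wait()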

Revision history for this message
Kun Huang (academicgareth) wrote :

Hi gordon, you could use https://github.com/brendangregg/FlameGraph to see where the cpu time is spent :)

gordon chung (chungg)
Changed in ceilometer:
status: New → Triaged
status: Triaged → New
Revision history for this message
gordon chung (chungg) wrote :

this is related to heartbeat_timeout_threshold. when it is set to zero, the agent idles fine.
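
For reference, the workaround amounts to one option in the agent's oslo.messaging rabbit configuration; setting it to zero disables the rabbit driver's AMQP heartbeat thread (shown here as it would appear in e.g. ceilometer.conf):

    # disable AMQP heartbeats; with this in place the notification agent
    # idles normally, per the comments above
    [oslo_messaging_rabbit]
    heartbeat_timeout_threshold = 0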

Changed in ceilometer:
importance: High → Medium
status: New → Triaged
Revision history for this message
Davanum Srinivas (DIMS) (dims-v) wrote :

Gordon, is this still a problem? What would be the right fix?

Changed in oslo.messaging:
status: New → Incomplete
Revision history for this message
gordon chung (chungg) wrote :

@dims, i'm not sure if the above is the right fix, but setting heartbeat_timeout_threshold = 0 does fix the issue. i was originally told about the timeout as a way to improve listener shutdown performance[1]. i just happened to realise that it also fixed the idling issue.

[1] http://eavesdrop.openstack.org/irclogs/%23openstack-ceilometer/%23openstack-ceilometer.2015-09-16.log.html#t2015-09-16T14:43:28

Revision history for this message
Dean Daskalantonakis (ddaskal) wrote :

I recommend that we close this bug, since setting heartbeat_timeout_threshold = 0 fixes the issue and there is no code to commit.

Revision history for this message
gordon chung (chungg) wrote :

sure. we can reopen if not.

to fix the issue, do this: https://bugs.launchpad.net/ceilometer/+bug/1478135/comments/13

Changed in ceilometer:
status: Triaged → Won't Fix
Changed in oslo.messaging:
status: Incomplete → Invalid
Revision history for this message
gordon chung (chungg) wrote :

re-opening, since it was brought up that maybe we should return the default for heartbeat_timeout_threshold back to 0.

Changed in oslo.messaging:
status: Invalid → Confirmed
Revision history for this message
Steve Lewis (steve-lewis) wrote :

subscribing, as i'm interested in the root cause and in ensuring the right defaults are in use.

Revision history for this message
Davanum Srinivas (DIMS) (dims-v) wrote :

Does this still happen? *please* let us know so we can fix it for Mitaka.

Revision history for this message
gordon chung (chungg) wrote :

this doesn't seem to be present anymore in oslo.messaging 4.5.0. it doesn't idle completely, but it's no longer constantly sitting at ~10% CPU... it just spikes randomly to ~1-2%.

Revision history for this message
Davanum Srinivas (DIMS) (dims-v) wrote :

Looks like the changes in eventlet (0.18.x) have helped. Let's mark it as incomplete for now until we can recreate it.

Changed in oslo.messaging:
status: Confirmed → Incomplete
Revision history for this message
Launchpad Janitor (janitor) wrote :

[Expired for oslo.messaging because there has been no activity for 60 days.]

Changed in oslo.messaging:
status: Incomplete → Expired