notification agent never idles

Bug #1478135 reported by gordon chung
This bug affects 2 people
Affects          Status      Importance  Assigned to   Milestone
Ceilometer       Won't Fix   Medium      Unassigned
oslo.messaging   Expired     Undecided   Unassigned

Bug Description

running current master with no polling agents, the notification agent maintains a constant CPU load hovering around 6-12%... nothing is on the queue, so i'm not sure what is using the CPU.

Revision history for this message
Chris Dent (cdent) wrote :

Something, unclear thus far what, is causing epoll_wait() in a very tight loop, rather than blocking.

Revision history for this message
gordon chung (chungg) wrote :

the strange thing is that it idles at 0% to 1% after startup. only after it receives a sample does it jump and hover around ~10%

Revision history for this message
gordon chung (chungg) wrote :

i don't think it's entirely the listeners... the collector has listeners as well (one per worker) and it does not idle at above 1%.

when i turn off coordination, the notification agent will essentially have the same amount of listeners per agent but it will hover at 10%.
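
For reference, "turn off coordination" here maps to the notification agent's workload-partitioning switch in ceilometer.conf; a sketch, assuming the option name used at the time:

    [notification]
    workload_partitioning = False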

gordon chung (chungg)
Changed in ceilometer:
importance: Undecided → High
Revision history for this message
gordon chung (chungg) wrote :

this relates to either pipeline or notifiers. if i disable events (and the event pipeline), the CPU load drops and hovers around 5%... if i disable pipeline manager but keep main queue listeners, the cpu load hovers at/near zero.

so it seems it's either because the pipelines are disabled or because the notifiers are disabled...

Revision history for this message
gordon chung (chungg) wrote :

taking a break... i have no idea what is happening.

my latest findings:

when i disable publishing with everything else as master (https://github.com/openstack/ceilometer/blob/master/ceilometer/publisher/messaging.py#L216-L217), everything seems fine.

but no matter which publisher i use (notifier, rpc, udp), once it sends something it'll start consuming CPU.

when i change to a single pipeline instead of the default 4, the idle CPU load drops... but it still never fully idles.
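
For context, "a single pipeline" means collapsing pipeline.yaml down to one source/sink pair. A rough sketch of what that might look like; the meter selection, interval and udp publisher are illustrative rather than the stock defaults:

    sources:
        - name: meter_source
          interval: 600
          meters:
              - "*"
          sinks:
              - meter_sink
    sinks:
        - name: meter_sink
          transformers:
          publishers:
              - udp://localhost:4952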

Revision history for this message
Aaron DH (aaron.d) wrote :

first, i am new to ceilometer.

i think the pipeline is loaded at startup, but when there are no samples, the pipeline does nothing.

maybe we can try stopping the listeners to see how the cpu usage trends.

Revision history for this message
gordon chung (chungg) wrote :

welcome Aaron!

thanks for the suggestion... let me try killing the listeners after it publishes a few items to see if anything changes.

Revision history for this message
gordon chung (chungg) wrote :

adding oslo.messaging.

after more playing, i've noticed this:
if a notifier is enabled at any point in the notification agent (whether via the notifier/rpc publishers or workload_partitioning), the notification agent will constantly consume cpu.

BUT

if no notifier is enabled anywhere in the notification agent (udp/test publisher and no workload partitioning), all is fine.

i'm not sure why you can't have a listener and notifier in the same service without having it consume constant cpu... my only guess right now is something to do with a shared transport?
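
To make the suspected combination concrete, here is a minimal, hypothetical sketch (not ceilometer code) of a single process that holds both a notification listener and a Notifier on one shared oslo.messaging transport; the broker URL, topics and publisher_id are placeholders:

    import time

    from oslo_config import cfg
    import oslo_messaging

    conf = cfg.CONF
    conf([], project='repro')

    # Placeholder broker URL; the listener and the notifier share this
    # one transport, which is the combination under suspicion.
    transport = oslo_messaging.get_notification_transport(
        conf, url='rabbit://guest:guest@localhost:5672/')

    # Stand-in for the agent's notifier publisher.
    notifier = oslo_messaging.Notifier(
        transport, publisher_id='repro', driver='messaging',
        topics=['metering'])


    class Endpoint(object):
        def info(self, ctxt, publisher_id, event_type, payload, metadata):
            # Re-publish whatever we consume, loosely mimicking the
            # notification agent's pipeline publishers.
            notifier.sample(ctxt, event_type, payload)


    listener = oslo_messaging.get_notification_listener(
        transport,
        [oslo_messaging.Target(topic='notifications')],
        [Endpoint()],
        executor='threading')  # the real agent ran under eventlet

    listener.start()
    try:
        # Watch the process with top/strace; per the comments above, CPU
        # stays busy once a single message has gone through.
        time.sleep(600)
    finally:
        listener.stop()
        listener.wait()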

Revision history for this message
Kun Huang (academicgareth) wrote :

Hi gordon, you could use https://github.com/brendangregg/FlameGraph to see where the cpu time is spent :)

gordon chung (chungg)
Changed in ceilometer:
status: New → Triaged
status: Triaged → New
Revision history for this message
gordon chung (chungg) wrote :

this is related to heartbeat_timeout_threshold. when it is set to zero, the agent idles fine.
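
For reference, the workaround amounts to one line of configuration; a sketch, assuming the rabbit driver (the option lives in the [oslo_messaging_rabbit] group of e.g. ceilometer.conf, and 0 disables the AMQP heartbeat entirely):

    [oslo_messaging_rabbit]
    heartbeat_timeout_threshold = 0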

Changed in ceilometer:
importance: High → Medium
status: New → Triaged
Revision history for this message
Davanum Srinivas (DIMS) (dims-v) wrote :

Gordon, is this still a problem? What would be the right fix?

Changed in oslo.messaging:
status: New → Incomplete
Revision history for this message
gordon chung (chungg) wrote :

@dims, i'm not sure the above is the right fix, but setting heartbeat_timeout_threshold = 0 does fix the issue. i was originally pointed at the timeout option as a way to improve listener shutdown performance[1]; i just happened to realise that it also fixed the idling issue.

[1] http://eavesdrop.openstack.org/irclogs/%23openstack-ceilometer/%23openstack-ceilometer.2015-09-16.log.html#t2015-09-16T14:43:28

Revision history for this message
Dean Daskalantonakis (ddaskal) wrote :

I recommend that we close this bug, since "setting heartbeat_timeout_threshold = 0 does fix the issue" and there is no code to commit.

Revision history for this message
gordon chung (chungg) wrote :

sure. we can reopen if not.

to fix the issue, do this: https://bugs.launchpad.net/ceilometer/+bug/1478135/comments/13

Changed in ceilometer:
status: Triaged → Won't Fix
Changed in oslo.messaging:
status: Incomplete → Invalid
Revision history for this message
gordon chung (chungg) wrote :

re-opening since it was brought up that maybe we should return the default for heartbeat_timeout_threshold back to 0

Changed in oslo.messaging:
status: Invalid → Confirmed
Revision history for this message
Steve Lewis (steve-lewis) wrote :

subscribed as interested in the root cause and ensuring the right defaults are in use

Revision history for this message
Davanum Srinivas (DIMS) (dims-v) wrote :

Does this still happen? *please* let us know so we can fix it for Mitaka.

Revision history for this message
gordon chung (chungg) wrote :

this doesn't seem to be present anymore in oslo.messaging 4.5.0. it won't idle completely, but it's not constantly spiking to ~10% CPU... it just randomly spikes to ~1-2%.

Revision history for this message
Davanum Srinivas (DIMS) (dims-v) wrote :

Looks like the changes in eventlet (0.18.x) have helped; let's mark this as incomplete for now until we get a recreate.

Changed in oslo.messaging:
status: Confirmed → Incomplete
Revision history for this message
Launchpad Janitor (janitor) wrote :

[Expired for oslo.messaging because there has been no activity for 60 days.]

Changed in oslo.messaging:
status: Incomplete → Expired