notification agent never idles

Bug #1478135 reported by gordon chung
This bug affects 2 people
Affects          Status      Importance   Assigned to   Milestone
Ceilometer       Won't Fix   Medium       Unassigned
oslo.messaging   Expired     Undecided    Unassigned

Bug Description

running current master with no polling agents, the notification agent maintains a constant CPU load which hovers around 6-12%... nothing is on the queue, so i'm not sure what is using the CPU.

Revision history for this message
Chris Dent (cdent) wrote :

Something, unclear thus far what, is causing epoll_wait() in a very tight loop, rather than blocking.

Revision history for this message
gordon chung (chungg) wrote :

the strange thing is that it idles at 0% to 1% after startup. only after it receives a sample does it jump and hover around ~10%

Revision history for this message
gordon chung (chungg) wrote :

i don't think it's entirely the listeners... the collector has listeners as well (one per worker) and its idle load stays below 1%.

when i turn off coordination, the notification agent has essentially the same number of listeners per agent, but it still hovers at 10%.

gordon chung (chungg)
Changed in ceilometer:
importance: Undecided → High
Revision history for this message
gordon chung (chungg) wrote :

this relates to either the pipelines or the notifiers. if i disable events (and the event pipeline), the CPU load drops and hovers around 5%... if i disable the pipeline manager but keep the main queue listeners, the cpu load hovers at/near zero.

so it's hard to tell whether that's because the pipelines are disabled or because the notifiers are disabled...

Revision history for this message
gordon chung (chungg) wrote :

taking a break... i have no idea what is happening.

my latest findings:

when i disable publishing with everything else unchanged from master (https://github.com/openstack/ceilometer/blob/master/ceilometer/publisher/messaging.py#L216-L217), everything seems fine.

but no matter which publisher i use (notifier, rpc, udp), once it sends something it starts consuming CPU.

when i change to a single pipeline instead of the default 4, the idle CPU load drops... but it still never fully idles.
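
For context, the publish path referenced above ultimately hands each batch of samples to an oslo.messaging Notifier. A minimal sketch of that kind of call, assuming a notifier-style publisher; the transport setup, topic and function names here are illustrative, not the exact Ceilometer code:

    # hedged sketch: roughly what a notifier-based publisher does per batch.
    # Notifier and its sample() method are oslo.messaging API; 'metering.sample',
    # 'telemetry.publisher' and publish() are illustrative names only.
    from oslo_config import cfg
    import oslo_messaging

    transport = oslo_messaging.get_transport(cfg.CONF)
    notifier = oslo_messaging.Notifier(transport,
                                       driver='messaging',
                                       publisher_id='telemetry.publisher',
                                       topics=['metering'])

    def publish(samples):
        # per the observation above, once a call like this has gone out,
        # the agent stops idling regardless of which publisher driver is used
        notifier.sample({}, event_type='metering.sample', payload=samples)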

Revision history for this message
Aaron DH (aaron.d) wrote :

first, i am new to ceilometer.

i think the pipeline is loaded at startup, but when there are no samples the pipeline does nothing.

maybe we can try stopping the listeners to see how the cpu usage trends.

Revision history for this message
gordon chung (chungg) wrote :

welcome Aaron!

thanks for the suggestion... let me try killing the listeners after it publishes a few items to see if anything changes.

Revision history for this message
gordon chung (chungg) wrote :

adding oslo.messaging as an affected project.

after more playing, i've noticed this:
if a notifier is enabled at any point in the notification agent (whether via the notifier/rpc publishers or workload_partitioning), the notification agent will constantly consume cpu.

BUT

if a notifier is not enabled at any point in the notification agent (udp/test publisher and no workload partitioning), all is fine.

i'm not sure why you can't have a listener and a notifier in the same service without it consuming constant cpu... my only guess right now is that it has something to do with a shared transport?
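
To make that combination concrete, here is a minimal sketch, assuming one process running both a notification listener and a notifier on a shared transport, which is roughly the shape being described; topic, endpoint and publisher names are illustrative, not the agent's actual wiring:

    # hedged reproducer sketch: a single process with both a listener and a
    # notifier on one transport, the combination reported above as never idling.
    # assumes cfg.CONF carries a transport_url pointing at a rabbit broker.
    import eventlet
    eventlet.monkey_patch()  # the notification agent runs under eventlet

    import time

    from oslo_config import cfg
    import oslo_messaging

    transport = oslo_messaging.get_transport(cfg.CONF)

    class SampleEndpoint(object):
        def sample(self, ctxt, publisher_id, event_type, payload, metadata):
            pass  # consume and discard incoming samples

    listener = oslo_messaging.get_notification_listener(
        transport,
        [oslo_messaging.Target(topic='notifications')],
        [SampleEndpoint()],
        executor='eventlet')
    notifier = oslo_messaging.Notifier(transport,
                                       driver='messaging',
                                       publisher_id='repro',
                                       topics=['metering'])

    listener.start()
    notifier.sample({}, event_type='metering.sample', payload={'value': 1})

    time.sleep(60)  # watch this process's CPU usage while it sits otherwise idle

    listener.stop()
    listener.wait()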

Revision history for this message
Kun Huang (academicgareth) wrote :

Hi gordon, you could use https://github.com/brendangregg/FlameGraph to see where the cpu time is spent :)

gordon chung (chungg)
Changed in ceilometer:
status: New → Triaged
status: Triaged → New
Revision history for this message
gordon chung (chungg) wrote :

this is related to heartbeat_timeout_threshold. when it is set to zero, the agent idles fine.
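
For reference, the workaround amounts to one option in the agent's oslo.messaging rabbit configuration; setting it to zero disables the rabbit driver's AMQP heartbeat thread (shown here as it would appear in e.g. ceilometer.conf):

    # disable AMQP heartbeats; with this in place the notification agent
    # idles normally, per the comments above
    [oslo_messaging_rabbit]
    heartbeat_timeout_threshold = 0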

Changed in ceilometer:
importance: High → Medium
status: New → Triaged
Revision history for this message
Davanum Srinivas (DIMS) (dims-v) wrote :

Gordon, is this still a problem? What would be the right fix?

Changed in oslo.messaging:
status: New → Incomplete
Revision history for this message
gordon chung (chungg) wrote :

@dims, i'm not sure if the above is the right fix, but setting heartbeat_timeout_threshold = 0 does fix the issue. i was originally told about the timeout as a way to improve listener shutdown performance[1]. i just happened to realise that it also fixed the idling issue.

[1] http://eavesdrop.openstack.org/irclogs/%23openstack-ceilometer/%23openstack-ceilometer.2015-09-16.log.html#t2015-09-16T14:43:28

Revision history for this message
Dean Daskalantonakis (ddaskal) wrote :

I recommend that we close this bug, since setting heartbeat_timeout_threshold = 0 fixes the issue and there is no code to commit.

Revision history for this message
gordon chung (chungg) wrote :

sure. we can reopen if not.

to fix the issue, do this: https://bugs.launchpad.net/ceilometer/+bug/1478135/comments/13

Changed in ceilometer:
status: Triaged → Won't Fix
Changed in oslo.messaging:
status: Incomplete → Invalid
Revision history for this message
gordon chung (chungg) wrote :

re-opening, since it was brought up that maybe we should return the default for heartbeat_timeout_threshold back to 0.

Changed in oslo.messaging:
status: Invalid → Confirmed
Revision history for this message
Steve Lewis (steve-lewis) wrote :

subscribing, as i'm interested in the root cause and in ensuring the right defaults are in use.

Revision history for this message
Davanum Srinivas (DIMS) (dims-v) wrote :

Does this still happen? *please* let us know so we can fix it for Mitaka.

Revision history for this message
gordon chung (chungg) wrote :

this doesn't seem to be present anymore in oslo.messaging 4.5.0. it doesn't idle completely, but it's no longer constantly sitting at ~10% CPU... it just spikes randomly to ~1-2%.

Revision history for this message
Davanum Srinivas (DIMS) (dims-v) wrote :

Looks like the changes in eventlet (0.18.x) have helped. Let's mark it as incomplete for now until we can recreate it.

Changed in oslo.messaging:
status: Confirmed → Incomplete
Revision history for this message
Launchpad Janitor (janitor) wrote :

[Expired for oslo.messaging because there has been no activity for 60 days.]

Changed in oslo.messaging:
status: Incomplete → Expired