Collectors consuming all memory reading everything from the queue

Bug #1551667 reported by Luis Pigueiras
This bug affects 7 people
Affects          Status        Importance  Assigned to      Milestone
Ceilometer       Fix Released  Undecided   gordon chung
oslo.messaging   Fix Released  Undecided   Mehdi Abaakouk

Bug Description

Hello,

After upgrading my collectors to Liberty and starting to use oslo.messaging 2.5.0, the collectors started to consume all the RAM. After some deeper investigation, it seems that with the Liberty collectors + oslo.messaging 2.5, the collector consumes messages from the queue until it runs out of memory (bad if there are a lot of messages in the queue, because my machines become unresponsive and cannot dispatch anything). If I try ceilometer 5.0.0 + oslo.messaging 1.X, messages are consumed only up to a certain point. Am I missing a new configuration option needed to avoid this behavior, or is it a bug?

Cheers,
Luis.

P.S.: I'm not sure whether this problem belongs to ceilometer or to oslo.messaging, so please let me know if I should move it somewhere else.

Revision history for this message
Davanum Srinivas (DIMS) (dims-v) wrote :

Let's put this on ceilometer team first :)

Changed in oslo.messaging:
status: New → Incomplete
Revision history for this message
Nadya Privalova (nprivalova) wrote :

I faced a similar problem with the notification agents. I fixed it by adding this to [DEFAULT]:

executor_thread_pool_size = 1

By default, this option equals 64. @Luis, it would be great if you could check that in your environment. If that's not possible, I will try to reproduce the problem in mine.
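For reference, a minimal sketch of where this setting lives, assuming the collector reads /etc/ceilometer/ceilometer.conf (the path is only an illustration):

# /etc/ceilometer/ceilometer.conf  (illustrative path)
[DEFAULT]
# Number of green threads the oslo.messaging executor may run concurrently;
# the library default is 64, so 1 effectively serializes message processing.
executor_thread_pool_size = 1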

Revision history for this message
gordon chung (chungg) wrote :

do you have a constant backlog in your MQ?

this patch looks relevant: https://github.com/openstack/oslo.messaging/commit/c5a6bfdca30a5111e641ebe4b2eac40b21b8ce74

maybe try oslo 3.0.0?

Revision history for this message
Luis Pigueiras (lpigueiras) wrote :

@nprivalova Your suggestion doesn't solve anything in my case because it still fills the RAM completely (just more slowly than with 64). If for some reason the database is not working or is too slow, I don't want the collector to read messages until it crashes...

@chungg I'm using RabbitMQ with the following configuration, so I'm not sure that commit will solve anything for me :(

[
  {rabbit, [
    {cluster_nodes, {['XXXXXX', 'YYYYYY'], disc}},
    {cluster_partition_handling, ignore},
    {tcp_listen_options,
         [binary,
         {packet, raw},
         {reuseaddr, true},
         {backlog, 128},
         {nodelay, true},
         {exit_on_close, false}]
    },
    {reverse_dns_lookups, true},
    {default_user, ...},
    {default_pass, ...}
  ]},
  {kernel, [
    {inet_dist_listen_max, 41055},
    {inet_dist_listen_min, 41055}
  ]}
,
  {rabbitmq_management, [
    {listener, [
      {port, XXXX}
    ]}
  ]}
].

Revision history for this message
gordon chung (chungg) wrote :

hi, so i'm asking whether your message queue has a constant backlog, which would mean the system would constantly be polling queues to pull messages

also, i'm not exactly sure why your RabbitMQ configuration would affect eventlet code.

Revision history for this message
Luis Pigueiras (lpigueiras) wrote :

Hello, I'm afraid I don't understand what you are asking me. The queue is growing because metering data is being generated and cannot be processed: the agents are trying to consume everything, filling the RAM completely and stopping them from dispatching data to the DB.

Sorry, I'm not an expert and maybe I'm missing something. What do you mean by the system constantly polling the queues?

Revision history for this message
gordon chung (chungg) wrote :

the collector or notification agent constantly checks the queue, whenever it is free, to see if there are messages to consume -- sorry, didn't mean to confuse you with the ceilometer polling agent.

i would still suggest you update your oslo.messaging to the latest supported version

gordon chung (chungg)
Changed in ceilometer:
status: New → Incomplete
Revision history for this message
Aleš Křivák (aleskrivak) wrote :

I can confirm this problem exists with oslo.messaging 2.5.0 and 3.0.0. In my case a network problem prevented ceilometer-collector/oslo.messaging from processing messages for a few days until I noticed it was down. After this period I had over one million messages waiting to be processed, and when trying to run the collector, it would try to read all of them until OOM killed its process. I could simply purge the queue (it's only a testing environment), but this could be a much more serious problem if it happened in production, where you don't want to lose samples...

I use ceilometer 5.0.2 with RabbitMQ for messages (Liberty). I tried oslo.messaging 2.5.0 (the default version for Liberty) and 3.0.0; I will also try the 4.* versions, but I need to figure out how to deal with the dependencies first.

Revision history for this message
Aleš Křivák (aleskrivak) wrote :

I was also able to reproduce this problem with oslo.messaging 1.7.1 and 4.0.0 (I haven't checked 4.1.0+ as there are too many dependency conflicts to test it with Liberty).

I tried to examine this more closely, and the problem is in fact not in the AMQP listener but in the dispatcher (for version 2.5.0, notify/dispatcher.py), where there is no check on how many incoming messages are being processed.

This is what happens:
A message is obtained from the listener and handed to the endpoint executor (in my case both mongo and gnocchi), but since we use eventlet (and there is usually plenty of room to interrupt the current "thread" inside the executors), control passes to another "thread", which fetches yet another message. This results in the sequence: get message - start processing message - get message - start processing message - ...all other messages in the queue... - finish processing - acknowledge - finish processing - acknowledge; which obviously fills the RAM when there are too many messages in the queue.

I managed to work around this by adding a simple counter to NotificationDispatcher.__call__ that keeps calling eventlet.sleep() while too many messages are already being processed.
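A minimal sketch of that throttling idea, using hypothetical names (MAX_IN_FLIGHT, dispatch) rather than the actual oslo.messaging 2.5.0 dispatcher code:

import eventlet

MAX_IN_FLIGHT = 64   # illustrative cap, not a real oslo.messaging option
_in_flight = 0       # messages currently being processed

def dispatch(message, process):
    """Back-pressure wrapper: wait while too many messages are in flight."""
    global _in_flight
    while _in_flight >= MAX_IN_FLIGHT:
        # Yield to the eventlet hub so in-flight handlers can finish
        # before we pull yet another message off the queue.
        eventlet.sleep(0.1)
    _in_flight += 1
    try:
        process(message)   # endpoint work, e.g. writing the sample to the DB
    finally:
        _in_flight -= 1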

I'm not sure whether this bug is still present in current master - I haven't noticed anything in the code that would deal with this situation, but the code in master has evolved a lot since 2.5.0 and I could easily have overlooked something.

Changed in oslo.messaging:
status: Incomplete → Confirmed
Revision history for this message
gordon chung (chungg) wrote :

thanks for looking at this. we don't use the eventlet driver as of Mitaka in Ceilometer. eventlet is completely removed in Ceilometer master

Revision history for this message
Dale Dude (daledude) wrote :

I just had the same problem in Mitaka with oslo.messaging 4.0.5. The metering.sample rabbit queue had a backlog of over 140000 messages because we hadn't started the collector for about a month. Once we did, the server's memory and swap were immediately consumed by the collector.

Revision history for this message
Julien Danjou (jdanjou) wrote :

Yeah, the problem is that there are too many threads running at the same time by default. I'll take care of putting in a better default.

Changed in oslo.messaging:
assignee: nobody → Julien Danjou (jdanjou)
Changed in ceilometer:
assignee: nobody → Julien Danjou (jdanjou)
Revision history for this message
Mehdi Abaakouk (sileht) wrote :

We have two scenarios in Ceilometer, with and without batching enabled:

* Without batching, I think the oslo.messaging rabbit_qos_prefetch_count default (0, which reads ahead a ton of messages) is not good for ceilometer. It's configurable only since Newton (see the config sketch below).
  So even if you reduce the executor threads, if your queue is really big you can still see high memory usage because of prefetching.

* With batching enabled, the oslo.messaging defaults are currently set as follows:
  executor threads = 1 and prefetch = batch_size, so memory usage should be fine in this case.
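To make the first scenario concrete, here is a hedged example of overriding that prefetch default on Newton or later (the value 10 is purely illustrative):

# collector / notification agent config  (illustrative)
[oslo_messaging_rabbit]
# Default is 0, i.e. unlimited prefetch; configurable only since Newton.
# A small positive value caps how many messages are read ahead per consumer.
rabbit_qos_prefetch_count = 10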

Revision history for this message
Julien Danjou (jdanjou) wrote :

Mehdi, so which default should we change, and where, so that we avoid this catastrophic scenario "by default"?

Changed in ceilometer:
status: Incomplete → Confirmed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to ceilometer (master)

Fix proposed to branch: master
Review: https://review.openstack.org/379149

Changed in ceilometer:
assignee: Julien Danjou (jdanjou) → Jake Yip (waipengyip)
status: Confirmed → In Progress
Revision history for this message
Jake Yip (waipengyip) wrote :

Hi all, I've got a patch up that seems to solve this problem for us. Can anyone else verify if it works?

Revision history for this message
Julien Danjou (jdanjou) wrote :

Where's your patch at?

Revision history for this message
Jake Yip (waipengyip) wrote :

Hi Julien,

It's the fix proposed in comment #15, also at https://review.openstack.org/379149

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on ceilometer (master)

Change abandoned by Jake Yip (<email address hidden>) on branch: master
Review: https://review.openstack.org/379149
Reason: I've found that setting rabbit_qos_prefetch_count=x in [oslo_messaging_rabbit] limits the number of messages for the rabbit driver.

Abandoning this change, sorry for the trouble.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to oslo.messaging (master)

Fix proposed to branch: master
Review: https://review.openstack.org/385079

Changed in oslo.messaging:
assignee: Julien Danjou (jdanjou) → Mehdi Abaakouk (sileht)
status: Confirmed → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to oslo.messaging (master)

Reviewed: https://review.openstack.org/385079
Committed: https://git.openstack.org/cgit/openstack/oslo.messaging/commit/?id=c881baed29db49c5710795496cb07907e35ceaba
Submitter: Jenkins
Branch: master

commit c881baed29db49c5710795496cb07907e35ceaba
Author: Mehdi Abaakouk <email address hidden>
Date: Tue Oct 11 18:03:32 2016 +0200

    rabbit: Don't prefetch when batch_size is set

    When the application sets batch_size, we don't need to prefetch more
    messages, especially for notifications. Notification queues can get
    really big when the consumer has disappeared for a long period, and
    when it comes back, kombu/pyamqp will fetch all the messages it can.
    So we override the qos prefetch value.

    Change-Id: I601e10cf94310b9f96f7acb9942959aaafad7994
    Closes-bug: #1551667
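In essence, the change boils down to the following simplified sketch (not the actual driver code):

def effective_prefetch(batch_size, rabbit_qos_prefetch_count):
    """Choose the QoS prefetch value the rabbit driver requests from the broker."""
    if batch_size:
        # A batching consumer only needs batch_size messages in flight,
        # so don't let kombu/pyamqp read ahead through the whole backlog.
        return batch_size
    # Fall back to the configured value (0 means unlimited prefetch).
    return rabbit_qos_prefetch_count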

Changed in oslo.messaging:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/oslo.messaging 5.11.0

This issue was fixed in the openstack/oslo.messaging 5.11.0 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to ceilometer (master)

Fix proposed to branch: master
Review: https://review.openstack.org/392696

Changed in ceilometer:
assignee: Jake Yip (waipengyip) → gordon chung (chungg)
gordon chung (chungg)
Changed in ceilometer:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on ceilometer (master)

Change abandoned by gordon chung (<email address hidden>) on branch: master
Review: https://review.openstack.org/392696
Reason: meh, i'll just close bug then.
