rabbitmq queue fills up and celery stops executing tasks when upload_dhcp_leases is done every minute (by default)

Bug #1153077 reported by Mark Stillwell
This bug affects 4 people
Affects          Status    Importance    Assigned to    Milestone
MAAS             Expired   High          Unassigned
maas (Ubuntu)    Expired   High          Unassigned

Bug Description

With the default MAAS install I was having a problem where celery would become unresponsive and stop running new tasks (including powering compute nodes on/off) after a while, and eventually the system would run into out-of-memory errors. The problem seems to have been that the queue was filling up with upload_dhcp_leases tasks (I checked on this by getting tasks out with the amqp tools...). This problem goes away if the timedelta in /usr/share/maas/celeryconfig_cluster.py is changed from 1 minute to 5 minutes.
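
For reference, the workaround described above amounts to changing the periodic task interval in the celerybeat schedule. A minimal sketch of what such an entry might look like, assuming a Celery 3.x style config (the exact contents of the shipped /usr/share/maas/celeryconfig_cluster.py may differ):

# Sketch of a celerybeat schedule entry; the real file shipped by MAAS may
# structure this differently.
from datetime import timedelta

CELERYBEAT_SCHEDULE = {
    'upload-dhcp-leases': {
        'task': 'provisioningserver.tasks.upload_dhcp_leases',
        # Default is timedelta(minutes=1); raising it to 5 minutes is the
        # workaround mentioned in the description.
        'schedule': timedelta(minutes=5),
    },
}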


Revision history for this message
Mark Stillwell (marklee) wrote :

As an update: the problem does not actually go away, it just takes longer to develop. So, for some reason a task is getting hung, causing all tasks behind it to go unexecuted and filling the queue.

Robie Basak (racb)
summary: - rabbitmq queue fills up and salary stops executing tasks when
+ rabbitmq queue fills up and celery stops executing tasks when
upload_dhcp_leases is done every minute (by default)
Revision history for this message
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in maas (Ubuntu):
status: New → Confirmed
Revision history for this message
Erick Turnquist (jhujhiti) wrote :

I think I'm seeing the same issue. I have a celeryd process that appears to be stuck in a loop and a queue in the vhost /maas_workers named c1d352af-8060-4bf4-b820-f468b22a97fb. One thing that stands out to me is that the queue has 4 unacknowledged messages at all times. New ones seem to come in and get acked, but the 4 remain.

Here are a couple of example messages. The content type is application/x-python-serialize:

(dp1
S'retries'
p2
I0
sS'task'
p3
S'provisioningserver.tasks.upload_dhcp_leases'
p4
sS'kwargs'
p5
(dp6
sS'eta'
p7
NsS'args'
p8
(lp9
sS'id'
p10
S'ac7c0755-43ab-40b8-bc39-5315236a3639'
p11
sS'expires'
p12
NsS'utc'
p13
I00
s.

------

(dp1
S'retries'
p2
I0
sS'task'
p3
S'provisioningserver.tasks.report_boot_images'
p4
sS'kwargs'
p5
(dp6
sS'eta'
p7
NsS'args'
p8
(lp9
sS'id'
p10
S'd5447c64-8317-4d13-a5ce-d1c818c2efc0'
p11
sS'expires'
p12
NsS'utc'
p13
I00
s.
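
For anyone who wants to inspect such messages themselves: application/x-python-serialize just means the body is an ordinary Python pickle, so a minimal sketch for decoding a drained message looks like this (message.dump is a hypothetical file holding one raw message body):

import pickle

# Raw body of one message pulled off the /maas_workers queue with the amqp
# tools; the payload is a plain pickle (protocol 0 in the dumps above).
with open('message.dump', 'rb') as f:
    task = pickle.loads(f.read())

print(task['task'])     # e.g. 'provisioningserver.tasks.upload_dhcp_leases'
print(task['id'])       # the task UUID
print(task['retries'])  # 0 for the stuck messages shown here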

Changed in maas (Ubuntu):
importance: Undecided → Critical
Changed in maas:
importance: Undecided → High
Changed in maas (Ubuntu):
importance: Critical → High
Revision history for this message
Julian Edwards (julian-edwards) wrote :

I suspect this is caused by task messages getting generated before a celeryd listener is connected/registered. I saw it once myself (but only once).

Changed in maas:
status: New → Triaged
importance: High → Critical
Revision history for this message
Raphaël Badin (rvb) wrote :

Adding a TTL to the celery queues will avoid the out-of-memory errors. But if we want to fix the core problem, we need to understand why the tasks are piling up without being processed by the workers.
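
One way to attach such a TTL, sketched here with illustrative names rather than the queues MAAS actually declares, is to pass a per-queue x-message-ttl argument through kombu when defining CELERY_QUEUES:

from kombu import Exchange, Queue

# Sketch: declare a worker queue with a RabbitMQ per-queue message TTL so
# stale periodic tasks are dropped instead of piling up. The queue name,
# exchange, and 5-minute value are assumptions, not MAAS's actual settings.
CELERY_QUEUES = (
    Queue('cluster_worker',
          Exchange('cluster_worker', type='direct'),
          routing_key='cluster_worker',
          queue_arguments={'x-message-ttl': 5 * 60 * 1000}),  # milliseconds
)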

Revision history for this message
Raphaël Badin (rvb) wrote :

If someone can recreate this problem, it would be good to check the celery logs (/var/log/maas/*celery*.log) for an indication of what's wrong with the workers.

Revision history for this message
Raphaël Badin (rvb) wrote :

> I suspect this is caused by task messages getting generated before a celeryd listener is connected/registered. I saw it once myself (but only once).

Did the tasks start piling up right from the start or did the system work for a while, then stopped working with the tasks piling up?

Revision history for this message
Raphaël Badin (rvb) wrote :

This is pretty difficult to diagnose without access to a system that shows the problem… If you're seeing this bug, could you please run a couple of commands and paste the results:

sudo rabbitmqctl list_queues -p /maas_workers name messages_unacknowledged messages memory
service maas-cluster-celery status
service maas-region-celery status
# pastebin the content of /var/log/maas/celery.log
# pastebin the content of /var/log/maas/celery-region.log
# pastebin the content of /var/log/maas/maas.log

Revision history for this message
Mark Stillwell (marklee) wrote :

Unfortunately, due in part to this bug, we aren't currently using MAAS.

Revision history for this message
Erick Turnquist (jhujhiti) wrote :

I've just hit this again overnight. Logs are attached, command output is below.

$ sudo rabbitmqctl list_queues -p /maas_workers name messages_unacknowledged
Listing queues ...
c1d352af-8060-4bf4-b820-f468b22a97fb 4
celery 0
maas-juju.celeryd.pidbox 0
master 0
...done.
$ sudo service maas-cluster-celery status
maas-cluster-celery start/running, process 6499
$ sudo service maas-region-celery status
maas-region-celery start/running, process 6555

Revision history for this message
Gavin Panella (allenap) wrote :

That maas.log from jhujhiti is packed with OAuthUnauthorized exceptions.

Right at the end of celery.log there's:

[2013-05-29 10:33:56,514: INFO/MainProcess] Got task from broker: provisioningserver.tasks.refresh_secrets[842cc6d7-7e55-4098-a31e-221b87b0b545]

and a little time later:

[2013-05-29 10:41:28,385: INFO/PoolWorker-3] Not sending DHCP leases to server: not all required knowledge received from server yet. Missing: api_credentials, nodegroup_uuid

which suggests that refresh_secrets has not fully worked.

Revision history for this message
Raphaël Badin (rvb) wrote :

jhujhiti told me that the celery worker is eating up all the CPU and most of the memory.

In celery.log, I see: [2013-05-29 10:41:28,386: INFO/MainProcess] Task provisioningserver.tasks.upload_dhcp_leases[11a33416-5a33-4530-8fdf-e0825b62d616] succeeded in *451.886018991s*: None

That seemed like a long time to upload the DHCP lease file, but then jhujhiti told me that the lease file is 26 MB.

Revision history for this message
Erick Turnquist (jhujhiti) wrote :

Leases file is attached. I am working on figuring out why there are so many abandoned leases, but hopefully MAAS can be made to not choke on such a large file.

Revision history for this message
Julian Edwards (julian-edwards) wrote :

Have any of you seen this again lately?

Revision history for this message
Mark Stillwell (marklee) wrote : Re: [Bug 1153077] Re: rabbitmq queue fills up and celery stops executing tasks when upload_dhcp_leases is done every minute (by default)

I haven't used MAAS for about 6 months precisely because of this
issue. It may have been fixed in the interim.

On Tue, Oct 15, 2013 at 6:43 AM, Julian Edwards
<email address hidden> wrote:
> Have any of you seen this again lately?

--
Mark Lee Stillwell
<email address hidden>

Revision history for this message
Julian Edwards (julian-edwards) wrote :

On Tuesday 15 Oct 2013 08:48:08 you wrote:
> I haven't used MAAS for about 6 months precisely because of this
> issue. It may have been fixed in the interim.

Are you able to try it again? I have not seen the problem in a long time,
although it may have been a RabbitMQ bug in precise.

Revision history for this message
Mark Stillwell (marklee) wrote :

For the moment, at least, the system I had tried to install MAAS on is
in use. I may look at it again during the winter break.

On Tue, Oct 15, 2013 at 10:01 AM, Julian Edwards
<email address hidden> wrote:
> On Tuesday 15 Oct 2013 08:48:08 you wrote:
>> I haven't used MAAS for about 6 months precisely because of this
>> issue. It may have been fixed in the interim.
>
> Are you able to try it again? I have not seen the problem in a long time,
> although it may have been a RabbitMQ bug in precise.

--
Mark Lee Stillwell
<email address hidden>

Revision history for this message
Gavin Panella (allenap) wrote :

Erick, thanks for uploading your leases file. I've improved the code so that it parses the file in under a couple of seconds, and there's plenty of scope for improving that further. I'll try to polish it and land it later this week.

Revision history for this message
Julian Edwards (julian-edwards) wrote :

Downgrading: nobody I know has experienced this lately.

Changed in maas:
importance: Critical → High
Revision history for this message
Julian Edwards (julian-edwards) wrote :

I'm going to close this bug unless anyone can confirm it's still a problem. Nobody I know has seen this happen for at least the whole of the last development cycle.

Changed in maas:
status: Triaged → Incomplete
Changed in maas (Ubuntu):
status: Confirmed → Incomplete
Revision history for this message
Gavin Panella (allenap) wrote :

I've recently seen a somewhat similar situation:

- Celery went down for a while.

- Celerybeat jobs accumulated in RabbitMQ.

- When Celery came back up, it spun manically, going through hundreds of old jobs.

However, neither Celery nor RabbitMQ showed signs of falling over. They were just busy.

It's a separate bug, but we should ensure that Celerybeat jobs have a very short lifetime. Maybe RabbitMQ or Celery has a mechanism for ignoring old jobs. If not, we can add a timestamp check to the job implementations themselves.
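
A manual version of that timestamp check could look roughly like this (a sketch only; the cut-off value and the enqueued_at argument are assumptions, not part of the actual MAAS tasks):

from datetime import datetime, timedelta

from celery import shared_task

MAX_TASK_AGE = timedelta(minutes=5)  # assumed cut-off, not a MAAS setting

@shared_task
def upload_dhcp_leases(enqueued_at=None):
    # Skip the work if the message sat in the queue for too long; the
    # scheduler would have to pass enqueued_at=datetime.utcnow() when
    # queuing the task for this check to apply.
    if enqueued_at is not None and datetime.utcnow() - enqueued_at > MAX_TASK_AGE:
        return
    # ... actual lease upload would happen here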

Revision history for this message
Raphaël Badin (rvb) wrote :

> Maybe RabbitMQ or Celery have a mechanism for ignoring old jobs. If not, we can add a timestamp check to the job
> implementations themselves.

That's a good point; there is support in Celery for this:
http://celery.github.io/celery/userguide/executing.html#expiration
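
In practice that means passing an expires value when the task is sent; a short sketch (the 300-second value is just an example, and an expired task is revoked by the worker instead of being executed):

from provisioningserver.tasks import upload_dhcp_leases

# Sketch: if no worker picks this task up within 5 minutes, it expires and
# the worker marks it revoked rather than running it.
upload_dhcp_leases.apply_async(expires=300)  # seconds; a datetime also works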

Revision history for this message
Julian Edwards (julian-edwards) wrote :

On Tuesday 25 Feb 2014 10:17:25 you wrote:
> - Celery went down for a while.
>
> - Celerybeat jobs accumulated in RabbitMQ.

That's weird, the beat stuff is integrated into celeryd, isn't it?

> It's a separate bug, but we should ensure that Celerybeat jobs have a
> very short lifetime. Maybe RabbitMQ or Celery have a mechanism for
> ignoring old jobs. If not, we can add a timestamp check to the job
> implementations themselves.

Yep, we need some protection against this.

Revision history for this message
Launchpad Janitor (janitor) wrote :

[Expired for maas (Ubuntu) because there has been no activity for 60 days.]

Changed in maas (Ubuntu):
status: Incomplete → Expired
Revision history for this message
Launchpad Janitor (janitor) wrote :

[Expired for MAAS because there has been no activity for 60 days.]

Changed in maas:
status: Incomplete → Expired