rabbitmq queue fills up and celery stops executing tasks when upload_dhcp_leases is done every minute (by default)

Bug #1153077 reported by Mark Stillwell on 2013-03-09
This bug affects 4 people
Affects: MAAS (Importance: High, Assigned to: Unassigned)
Affects: maas (Ubuntu) (Importance: High, Assigned to: Unassigned)

Bug Description

With the default MAAS install I was having a problem where celery would become unresponsive and stop running new tasks (including powering compute nodes on/off) after a while, and eventually the system would hit out-of-memory errors. The problem seems to have been that the queue was filling up with upload_dhcp_leases tasks (I checked on this by pulling tasks out with the amqp tools). This problem goes away if the timedelta in /usr/share/maas/celeryconfig_cluster.py is changed from 1 minute to 5 minutes.
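The workaround described above can be sketched as a celerybeat schedule entry. This is an illustration, not the actual contents of /usr/share/maas/celeryconfig_cluster.py; the entry name is made up, though the task path is the one named in this report.

```python
# Hedged sketch of the reporter's workaround: widen the celerybeat
# interval for the lease-upload task from 1 minute to 5 minutes so the
# queue drains faster than beat refills it.  The entry name is
# illustrative; the real MAAS config may be structured differently.
from datetime import timedelta

CELERYBEAT_SCHEDULE = {
    'upload-dhcp-leases': {
        'task': 'provisioningserver.tasks.upload_dhcp_leases',
        'schedule': timedelta(minutes=5),  # default was timedelta(minutes=1)
    },
}
```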


Mark Stillwell (marklee) wrote :

As an update, the problem does not actually go away; it just takes longer to develop. So, for some reason a task is getting hung, causing all tasks behind it to go unexecuted and the queue to fill.

Robie Basak (racb) on 2013-03-19
summary: - rabbitmq queue fills up and salary stops executing tasks when
+ rabbitmq queue fills up and celery stops executing tasks when
upload_dhcp_leases is done every minute (by default)
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in maas (Ubuntu):
status: New → Confirmed
Erick Turnquist (jhujhiti) wrote :

I think I'm seeing the same issue. I have a celeryd process that appears to be stuck in a loop and a queue in the vhost /maas_workers named c1d352af-8060-4bf4-b820-f468b22a97fb. One thing that stands out to me is that the queue has 4 unacknowledged messages at all times. New ones seem to come in and get acked, but the 4 remain.

Here are a couple of example messages. The content type is application/x-python-serialize:

(dp1
S'retries'
p2
I0
sS'task'
p3
S'provisioningserver.tasks.upload_dhcp_leases'
p4
sS'kwargs'
p5
(dp6
sS'eta'
p7
NsS'args'
p8
(lp9
sS'id'
p10
S'ac7c0755-43ab-40b8-bc39-5315236a3639'
p11
sS'expires'
p12
NsS'utc'
p13
I00
s.

------

(dp1
S'retries'
p2
I0
sS'task'
p3
S'provisioningserver.tasks.report_boot_images'
p4
sS'kwargs'
p5
(dp6
sS'eta'
p7
NsS'args'
p8
(lp9
sS'id'
p10
S'd5447c64-8317-4d13-a5ce-d1c818c2efc0'
p11
sS'expires'
p12
NsS'utc'
p13
I00
s.
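The application/x-python-serialize content type means these bodies are Python pickles (protocol 0 here), so they can be inspected directly. A sketch, using a byte-for-byte copy of the first message above:

```python
# Decode one of the stuck messages.  The payload is the first pickle dump
# pasted above, reassembled as bytes; pickle.loads turns it back into the
# Celery task-message dict.
import pickle

payload = (
    b"(dp1\nS'retries'\np2\nI0\nsS'task'\np3\n"
    b"S'provisioningserver.tasks.upload_dhcp_leases'\np4\n"
    b"sS'kwargs'\np5\n(dp6\nsS'eta'\np7\nNsS'args'\np8\n(lp9\n"
    b"sS'id'\np10\nS'ac7c0755-43ab-40b8-bc39-5315236a3639'\np11\n"
    b"sS'expires'\np12\nNsS'utc'\np13\nI00\ns."
)

message = pickle.loads(payload)
print(message['task'])     # which task the stuck message carries
print(message['retries'])  # retry count so far
print(message['expires'])  # None: the message never expires on its own
```

Note that 'expires' is None, which is why these messages can sit in the queue indefinitely; that detail becomes relevant in the later comments about task expiration.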

Changed in maas (Ubuntu):
importance: Undecided → Critical
Changed in maas:
importance: Undecided → High
Changed in maas (Ubuntu):
importance: Critical → High
Julian Edwards (julian-edwards) wrote :

I suspect this is caused by task messages getting generated before a celeryd listener is connected/registered. I saw it once myself (but only once).

Changed in maas:
status: New → Triaged
importance: High → Critical
Raphaël Badin (rvb) wrote :

Adding a TTL to the celery queues will avoid the out-of-memory error. But if we want to fix the core problem we need to understand why the tasks are piling up without being processed by the workers.
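The TTL idea can be sketched via RabbitMQ's per-queue x-message-ttl argument, which makes the broker drop messages that sit in a queue longer than the limit. The dict below is the raw AMQP queue-declare argument; with kombu the same dict would be passed as Queue(..., queue_arguments=...). The five-minute value is illustrative, not a MAAS default.

```python
# Sketch (not MAAS's actual code): declare the worker queue with a
# per-queue TTL so stale celerybeat messages are discarded by RabbitMQ
# instead of accumulating until the box runs out of memory.
QUEUE_TTL_MS = 5 * 60 * 1000  # drop messages older than five minutes

queue_declare_arguments = {
    # RabbitMQ extension: per-queue message time-to-live, in milliseconds.
    'x-message-ttl': QUEUE_TTL_MS,
}
```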

Raphaël Badin (rvb) wrote :

If someone can recreate this problem, it would be good to see in the celery logs (/var/log/maas/*celery*.log) if there isn't an indication of what's wrong with the workers.

Raphaël Badin (rvb) wrote :

> I suspect this is caused by task messages getting generated before a celeryd listener is connected/registered. I saw it once myself (but only once).

Did the tasks start piling up right from the start or did the system work for a while, then stopped working with the tasks piling up?

Raphaël Badin (rvb) wrote :

This is pretty difficult to diagnose without access to a system that shows the problem… If you're seeing this bug, could you please run a couple of commands and paste the results:

sudo rabbitmqctl list_queues -p /maas_workers name messages_unacknowledged messages memory
service maas-cluster-celery status
service maas-region-celery status
# pastebin the content of /var/log/maas/celery.log
# pastebin the content of /var/log/maas/celery-region.log
# pastebin the content of /var/log/maas/maas.log

Mark Stillwell (marklee) wrote :

Unfortunately, due in part to this bug, we aren't currently using MAAS.

Erick Turnquist (jhujhiti) wrote :

I've just hit this again overnight. Logs are attached, command output is below.

$ sudo rabbitmqctl list_queues -p /maas_workers name messages_unacknowledged
Listing queues ...
c1d352af-8060-4bf4-b820-f468b22a97fb 4
celery 0
maas-juju.celeryd.pidbox 0
master 0
...done.
$ sudo service maas-cluster-celery status
maas-cluster-celery start/running, process 6499
$ sudo service maas-region-celery status
maas-region-celery start/running, process 6555

Gavin Panella (allenap) wrote :

That maas.log from jhujhiti is packed with OAuthUnauthorized exceptions.

Right at the end of celery.log there's:

[2013-05-29 10:33:56,514: INFO/MainProcess] Got task from broker: provisioningserver.tasks.refresh_secrets[842cc6d7-7e55-4098-a31e-221b87b0b545]

and a little time later:

[2013-05-29 10:41:28,385: INFO/PoolWorker-3] Not sending DHCP leases to server: not all required knowledge received from server yet. Missing: api_credentials, nodegroup_uuid

which suggests that refresh_secrets has not fully worked.

Raphaël Badin (rvb) wrote :

jhujhiti told me that the celery worker is eating up all the cpu and most of the memory.

In celery.log, I see: [2013-05-29 10:41:28,386: INFO/MainProcess] Task provisioningserver.tasks.upload_dhcp_leases[11a33416-5a33-4530-8fdf-e0825b62d616] succeeded in *451.886018991s*: None

I thought that it was a long time to upload the dhcp lease file but then jhujhiti told me that the lease file is 26M.

Erick Turnquist (jhujhiti) wrote :

Leases file is attached. I am working on figuring out why there are so many abandoned leases, but hopefully MAAS can be made to not choke on such a large file.

Julian Edwards (julian-edwards) wrote :

Have any of you seen this again lately?

I haven't used MAAS for about 6 months precisely because of this
issue. It may have been fixed in the interim.

On Tue, Oct 15, 2013 at 6:43 AM, Julian Edwards
<email address hidden> wrote:
> Have any you seen this again lately?

--
Mark Lee Stillwell
<email address hidden>

On Tuesday 15 Oct 2013 08:48:08 you wrote:
> I haven't used MAAS for about 6 months precisely because of this
> issue. It may have been fixed in the interim.

Are you able to try it again? I have not seen the problem in a long time,
although it may have been a RabbitMQ bug in precise.

Mark Stillwell (marklee) wrote :

For the moment at least the system I had tried to install MAAS on is
in use. I may look at it again during the winter break.

On Tue, Oct 15, 2013 at 10:01 AM, Julian Edwards
<email address hidden> wrote:
> On Tuesday 15 Oct 2013 08:48:08 you wrote:
>> I haven't used MAAS for about 6 months precisely because of this
>> issue. It may have been fixed in the interim.
>
> Are you able to try it again? I have not seen the problem in a long time,
> although it may have been a RabbitMQ bug in precise.

--
Mark Lee Stillwell
<email address hidden>

Gavin Panella (allenap) wrote :

Erick, thanks for uploading your leases file. I've improved the code so that it parses it in less than a couple of seconds, and there's plenty of scope for improving that. I'll try to polish it and land it later this week.
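Gavin's actual fix isn't shown in this thread; as an illustration of the kind of single-pass approach that keeps even a 26M leases file fast, here is a minimal sketch. It is not MAAS's parser, and the ISC dhcpd lease grammar is heavily simplified (no nested blocks, only the hardware ethernet field):

```python
# Hedged sketch: map each leased IP to the MAC that last held it, in one
# regex pass over the file text.  dhcpd appends records, so the last
# entry for an IP is the current one and simply overwrites earlier ones.
import re

LEASE_RE = re.compile(r"lease\s+(?P<ip>\d+\.\d+\.\d+\.\d+)\s*{(?P<body>[^}]*)}")
MAC_RE = re.compile(r"hardware\s+ethernet\s+(?P<mac>[0-9a-fA-F:]+);")

def parse_leases(text):
    """Return {ip: mac} for the most recent lease record of each IP."""
    leases = {}
    for match in LEASE_RE.finditer(text):
        mac = MAC_RE.search(match.group('body'))
        if mac is not None:
            leases[match.group('ip')] = mac.group('mac')
    return leases

sample = """
lease 192.168.1.10 {
  starts 3 2013/05/29 10:00:00;
  hardware ethernet 00:16:3e:aa:bb:cc;
}
lease 192.168.1.10 {
  starts 3 2013/05/29 11:00:00;
  hardware ethernet 00:16:3e:dd:ee:ff;
}
"""
print(parse_leases(sample))
```

Compiling the patterns once and avoiding per-line Python work is what makes this scale; a file full of abandoned leases mostly costs regex time, not interpreter time.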

Downgrading - nobody I know has experienced this lately.

Changed in maas:
importance: Critical → High

I'm going to close this bug unless anyone can confirm it's still a problem. Nobody I know has seen this happen for at least the whole of the last development cycle.

Changed in maas:
status: Triaged → Incomplete
Changed in maas (Ubuntu):
status: Confirmed → Incomplete
Gavin Panella (allenap) wrote :

I've recently seen a somewhat similar situation:

- Celery went down for a while.

- Celerybeat jobs accumulated in RabbitMQ.

- When Celery came back up, it spun manically, going through hundreds of old jobs.

However, neither Celery nor RabbitMQ showed signs of falling over. They were just busy.

It's a separate bug, but we should ensure that Celerybeat jobs have a very short lifetime. Maybe RabbitMQ or Celery have a mechanism for ignoring old jobs. If not, we can add a timestamp check to the job implementations themselves.

Raphaël Badin (rvb) wrote :

> Maybe RabbitMQ or Celery have a mechanism for ignoring old jobs. If not, we can add a timestamp check to the job
> implementations themselves.

That's a good point, there is support in Celery for this:
http://celery.github.io/celery/userguide/executing.html#expiration
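The mechanism Raphaël links to can be sketched in the beat schedule itself: Celery passes a schedule entry's 'options' dict through to apply_async, and 'expires' tells the worker to discard the message if it is picked up past the deadline. The entry name and the 60-second value are illustrative.

```python
# Hedged sketch: give the periodic lease-upload task an expiry so a
# backlog of identical beat jobs is dropped instead of replayed.
from datetime import timedelta

CELERYBEAT_SCHEDULE = {
    'upload-dhcp-leases': {
        'task': 'provisioningserver.tasks.upload_dhcp_leases',
        'schedule': timedelta(minutes=1),
        # If no worker runs this within 60 seconds, discard it: a stale
        # lease upload is superseded by the next one anyway.
        'options': {'expires': 60},
    },
}
```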

On Tuesday 25 Feb 2014 10:17:25 you wrote:
> - Celery went down for a while.
>
> - Celerybeat jobs accumulated in RabbitMQ.

That's weird, the beat stuff is integrated into celeryd, isn't it?

> It's a separate bug, but we should ensure that Celerybeat jobs have a
> very short lifetime. Maybe RabbitMQ or Celery have a mechanism for
> ignoring old jobs. If not, we can add a timestamp check to the job
> implementations themselves.

Yep, we need some protection against this.

Launchpad Janitor (janitor) wrote :

[Expired for maas (Ubuntu) because there has been no activity for 60 days.]

Changed in maas (Ubuntu):
status: Incomplete → Expired
Launchpad Janitor (janitor) wrote :

[Expired for MAAS because there has been no activity for 60 days.]

Changed in maas:
status: Incomplete → Expired