Celery comes to a halt after approx. 8500 tasks

Bug #881894 reported by Muharem Hrnjadovic
This bug affects 1 person
Affects: OpenQuake (deprecated)
Status: Fix Released
Importance: High
Assigned to: Muharem Hrnjadovic
Milestone: 0.4.6

Bug Description

Restarting one of the worker processes on gemsun01/3/4 sometimes helps, but not reliably.

Changed in openquake:
status: New → In Progress
importance: Undecided → High
assignee: nobody → Muharem Hrnjadovic (al-maisan)
milestone: none → 0.4.5
Revision history for this message
Muharem Hrnjadovic (al-maisan) wrote :

= Introduction =

I have been running OpenQuake jobs using the input model in [6] and the
packages from [1] as follows:

    - gemsun02 in the control node role, also running the postgresql,
      rabbitmq and redis server daemons (configuration: [2])
    - gemsun01/gemsun03/gemsun04 in the worker node role, running only
      celeryd (configuration: [3])
    - on gemsun02 the patch shown in [4] has been applied against
      /usr/share/pyshared/openquake (in essence the same as [5])

This job calculates 148011 sites; we pack one site into a single celery
task since higher sites-per-celery-task ratios result in *long* (hazard
curve) calculation times on the workers (e.g. 3.6 hours for 72 sites).

When packing one site per celery task I am observing computation
times of around 25 seconds per site (i.e. 72 sites take approx. 30
minutes). That issue should be analysed separately; bug [7] was filed
to track it.

= Observations =

celeryd is started on the worker machines as follows:

    cd /usr/openquake && nohup celeryd --time-limit 300 --purge -l DEBUG -B -c 14 > /tmp/celeryd.log 2>&1 3>&1 &
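
For reference, a rough gloss of the celeryd flags used above (descriptive
only, based on celeryd's documented options of that era):

    # --time-limit 300   hard per-task time limit, in seconds
    # --purge            discard any task messages still waiting in the queue at startup
    # -l DEBUG           log at DEBUG verbosity
    # -B                 run an embedded celerybeat scheduler inside the worker
    # -c 14              run 14 worker processes on each machine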

At some point we hit the wall, i.e.
    - all worker machines finish calculating their current tasks and
      become idle (the load is zero)
    - the control node stops sending new tasks (creating per-task
      queues)

Initially we hit the wall after approx. 8500 tasks; by now the upper
limit is approx. 25000 tasks. It is not clear why that limit increased.

    - when we do hit the wall, RabbitMQ appears blocked, i.e. no new
      messages can be enqueued and message producers are blocked.
      This makes it impossible for celeryd workers to report results
      back to the control node. The latter hence refrains from
      pushing more tasks into the compute network.

    - I am running

        while ((1)); do date; celeryctl inspect active; sleep 60; done

      on the side and it works perfectly until we hit the wall; after
      that point the tool is blocked. Note that celeryctl itself works
      by sending and receiving messages over the broker. This is
      another indication that RabbitMQ is hosed.

    - The patch that pushed us beyond 8500 tasks on Thursday morning
      (27-Oct-2011) (see [8]) probably only did so because it reduced
      the overall message volume.
      However, with patch [8] applied the calculations run 11.45 times
      slower. This comparison is based on the time it took to calculate
      the first 18000 tasks with the patch applied and reverted,
      respectively.
      It is unclear how and why that performance degradation comes
      about.

= Guesses =

Given what I have seen so far, I am guessing that RabbitMQ is the
culprit. We should find a way to assess whether it is alive and healthy
when we hit the wall (a rough health-check sketch follows below). In
case RabbitMQ *is* hosed we need to find out
    - why
    - whether and how it can be (re-)configured so that it performs
      at the level needed (disabling the high memory watermark made
      no difference)
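
One possible way to check on the broker at the moment the system wedges,
using standard rabbitmqctl and /proc tooling (a sketch only; exact output
columns vary across RabbitMQ versions, and the pgrep pattern is an
assumption about this setup; run as root):

    # Is the broker still responding to control commands at all?
    rabbitmqctl status

    # Are client connections stuck in a blocked state (e.g. resource alarm)?
    rabbitmqctl list_connections state

    # Queue depths and consumer counts: are messages piling up unconsumed?
    rabbitmqctl list_queues name messages consumers

    # File descriptor usage vs. limit of the Erlang VM running RabbitMQ
    RABBIT_PID=$(pgrep -f beam | head -n 1)
    ls /proc/$RABBIT_PID/fd | wc -l
    grep 'Max open files' /proc/$RABBIT_PID/limits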

= Other ideas =

    - upgrade to the most recent RabbitMQ release and see whether the
      problem persists (Ask Solem, the author of python-celery,
      considered RabbitMQ rev. 2.3.1 (what we use) already a bit a...


John Tarter (toh2)
Changed in openquake:
milestone: 0.4.5 → 0.4.6
Revision history for this message
Muharem Hrnjadovic (al-maisan) wrote :

RabbitMQ *was* the culprit: it would get stuck after a while due to a low maximum number of open files (1024 by default).

The following fixes the issue *altogether*:

    root@gemsun02:~# ulimit -n 32768
    root@gemsun02:~# /etc/init.d/rabbitmq-server restart
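
Note that a ulimit raised in an interactive shell does not survive a
reboot. One way to make it persistent on Debian/Ubuntu is to raise the
limit in /etc/default/rabbitmq-server, which the packaged init script
sources on startup (path and mechanism are assumptions about this
particular setup):

    # /etc/default/rabbitmq-server is sourced by the rabbitmq-server init
    # script, so a limit raised here applies on every daemon (re)start.
    echo 'ulimit -n 32768' >> /etc/default/rabbitmq-server
    /etc/init.d/rabbitmq-server restart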

tags: added: celery devop rabbitmq
Changed in openquake:
status: In Progress → Fix Committed
Changed in openquake:
status: Fix Committed → Fix Released