Celery comes to a halt after approx. 8500 tasks

Bug #881894 reported by Muharem Hrnjadovic
This bug affects 1 person
Affects: OpenQuake (deprecated)
Status: Fix Released
Importance: High
Assigned to: Muharem Hrnjadovic
Milestone: 0.4.6

Bug Description

Restarting one of the worker processes on gemsun01/3/4 sometimes helps, but not reliably.

Changed in openquake:
status: New → In Progress
importance: Undecided → High
assignee: nobody → Muharem Hrnjadovic (al-maisan)
milestone: none → 0.4.5
Revision history for this message
Muharem Hrnjadovic (al-maisan) wrote :

= Introduction =

I have been running OpenQuake jobs using the input model in [6] and the
packages from [1] as follows:

    - gemsun02 in the control node role, also running the postgresql,
      rabbitmq and redis server daemons (configuration: [2])
    - gemsun01/gemsun03/gemsun04 in the worker node role, running only
      celeryd (configuration: [3])
    - on gemsun02 the patch shown in [4] has been applied against
      /usr/share/pyshared/openquake (in essence the same as [5])

This job calculates 148011 sites; we pack one site into a single celery
task since higher sites-per-celery-task ratios result in *long* (hazard
curve) calculation times on the workers (e.g. 3.6 hours for 72 sites).

When packing one site per celery task I am observing computation
times of around 25 seconds per site (i.e. 72 sites take approx. 30
minutes). That issue should be analysed separately; bug [7] was filed
to track it.

= Observations =

celeryd is started on the worker machines as follows:

    cd /usr/openquake && nohup celeryd --time-limit 300 --purge -l DEBUG -B -c 14 > /tmp/celeryd.log 2>&1 3>&1 &
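
For reference, a rough gloss of the celeryd flags used above (descriptive
only, based on celeryd's documented options of that era):

    # --time-limit 300   hard per-task time limit, in seconds
    # --purge            discard any task messages still waiting in the queue at startup
    # -l DEBUG           log at DEBUG verbosity
    # -B                 run an embedded celerybeat scheduler inside the worker
    # -c 14              run 14 worker processes on each machine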

At some point we hit the wall, i.e.
    - all worker machines finish calculating their current tasks and
      become idle (the load is zero)
    - the control node stops sending new tasks (creating per-task
      queues)

Initially we hit the wall after approx. 8500 tasks; by now the upper
limit is approx. 25000 tasks. It is not clear why that limit increased.

    - when we do hit the wall, RabbitMQ appears blocked, i.e. no new
      messages can be enqueued and message producers are blocked.
      This makes it impossible for celeryd workers to report results
      back to the control node. The latter hence refrains from
      pushing more tasks into the compute network.

    - I am running

        while ((1)); do date; celeryctl inspect active; sleep 60; done

      on the side and it works perfectly until we hit the wall; after
      that point the tool is blocked. Note that celeryctl itself works
      by sending and receiving messages over the broker. This is
      another indication that RabbitMQ is hosed.

    - The patch that pushed us beyond 8500 tasks on Thursday morning
      (27-Oct-2011) (see [8]) probably only did so because it reduced
      the overall message volume.
      However, with patch [8] applied the calculations run 11.45 times
      slower. This comparison is based on the time it took to calculate
      the first 18000 tasks with the patch applied and reverted,
      respectively.
      It is unclear how and why that performance degradation comes
      about.

= Guesses =

Given what I have seen so far, I am guessing that RabbitMQ is the
culprit. We should find a way to assess whether it is alive and healthy
when we hit the wall (a rough health-check sketch follows below). In
case RabbitMQ *is* hosed we need to find out
    - why
    - whether and how it can be (re-)configured so that it performs
      at the level needed (disabling the high memory watermark made
      no difference)
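
One possible way to check on the broker at the moment the system wedges,
using standard rabbitmqctl and /proc tooling (a sketch only; exact output
columns vary across RabbitMQ versions, and the pgrep pattern is an
assumption about this setup; run as root):

    # Is the broker still responding to control commands at all?
    rabbitmqctl status

    # Are client connections stuck in a blocked state (e.g. resource alarm)?
    rabbitmqctl list_connections state

    # Queue depths and consumer counts: are messages piling up unconsumed?
    rabbitmqctl list_queues name messages consumers

    # File descriptor usage vs. limit of the Erlang VM running RabbitMQ
    RABBIT_PID=$(pgrep -f beam | head -n 1)
    ls /proc/$RABBIT_PID/fd | wc -l
    grep 'Max open files' /proc/$RABBIT_PID/limits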

= Other ideas =

    - upgrade to the most recent RabbitMQ release and see whether the
      problem persists (Ask Solem, the author of python-celery,
      considered RabbitMQ rev. 2.3.1 (what we use) already a bit a...


John Tarter (toh2)
Changed in openquake:
milestone: 0.4.5 → 0.4.6
Revision history for this message
Muharem Hrnjadovic (al-maisan) wrote :

RabbitMQ *was* the culprit: it would get stuck after a while due to a low maximum number of open files (1024 by default).

The following fixes the issue *altogether*:

    root@gemsun02:~# ulimit -n 32768
    root@gemsun02:~# /etc/init.d/rabbitmq-server restart
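
Note that a ulimit raised in an interactive shell does not survive a
reboot. One way to make it persistent on Debian/Ubuntu is to raise the
limit in /etc/default/rabbitmq-server, which the packaged init script
sources on startup (path and mechanism are assumptions about this
particular setup):

    # /etc/default/rabbitmq-server is sourced by the rabbitmq-server init
    # script, so a limit raised here applies on every daemon (re)start.
    echo 'ulimit -n 32768' >> /etc/default/rabbitmq-server
    /etc/init.d/rabbitmq-server restart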

tags: added: celery devop rabbitmq
Changed in openquake:
status: In Progress → Fix Committed
Changed in openquake:
status: Fix Committed → Fix Released