Investigate why worker usage seems to decline over time with large calculations

Bug #1092050 reported by Lars Butler
This bug affects 1 person.

Affects: OpenQuake (deprecated)
Status: Fix Released
Importance: Critical
Assigned to: Lars Butler

Bug Description

With large calculations, spanning the whole of Europe for example, we've observed that over time many workers sit idle. There is plenty of work left to be done, but no tasks are being assigned to the idle workers.

I'm betting there is a subtle bug with the task distribution code. What is supposed to happen is the following:

- fill up the queue with tasks
- wait for tasks to complete
- when we receive a task completion notice, queue one additional task

My hunch is that this somehow breaks in some cases, and we queue only 1 new task when > 1 tasks have completed. Over time, this leaves many worker processes sitting idle.
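The suspected failure mode can be shown with a small simulation (purely illustrative; `simulate`, `refill_per_notice`, and the batch model are my own names, not engine code): if the dispatcher queues one new task per *wake-up* rather than one per completion notice, every batch of completions permanently idles some workers.

```python
def simulate(num_tasks, concurrent_tasks, completion_batches, refill_per_notice):
    """Track how many tasks are in flight after each completion batch.

    completion_batches: sizes of the groups of completion notices that the
        dispatcher processes together (i.e. how many tasks finished before
        it reacted).
    refill_per_notice: True  -> queue one new task per completed task
                       False -> queue only one new task per batch (the bug)
    """
    pending = num_tasks
    in_flight = min(concurrent_tasks, pending)  # fill up the queue
    pending -= in_flight

    history = []
    for batch in completion_batches:
        done = min(batch, in_flight)
        in_flight -= done
        refill = min(done if refill_per_notice else min(1, done), pending)
        in_flight += refill
        pending -= refill
        history.append(in_flight)
    return history

# Correct refill keeps all 272 slots busy; the buggy variant leaks one per batch:
simulate(544, 272, [2] * 5, True)   # -> [272, 272, 272, 272, 272]
simulate(544, 272, [2] * 5, False)  # -> [271, 270, 269, 268, 267]
```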

Revision history for this message
Lars Butler (lars-butler) wrote :

Here are my testing notes. I ran eight different scenarios. In summary, increasing `concurrent_tasks` to 2x the number of worker processes and setting `CELERY_ACKS_LATE = True` and `CELERYD_PREFETCH_MULTIPLIER = 1` in celeryconfig.py seems to solve the problem.
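For reference, that combination as a celeryconfig.py fragment (these are the old-style Celery setting names in use at the time; the comments are my own reading of their effect):

```python
# celeryconfig.py
CELERY_ACKS_LATE = True          # acknowledge a task only after it has run
CELERYD_PREFETCH_MULTIPLIER = 1  # reserve at most one task per worker process
```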

All tests use the same job configuration.
Tasks = 544

----------
Test 1
----------

Test:

All machines (272 cores).
`concurrent_tasks` = 320.

Result:

Result showed that bs04, gm01, and gm02
were under-utilized from the start of the calculation.

----------
Test 2
----------

Test:

Test run with only bs04 and gm0{1,2}.

Result:

Test shows full core utilization from the start (48, 48, and 48).

----------
Test 3
----------

Test:

All machines again, this time with `CELERYD_PREFETCH_MULTIPLIER = 1`.

Result:

Result was the same as Test 1; the 48-core machines are under-utilized.

----------
Test 4
----------

Test:

`CELERYD_PREFETCH_MULTIPLIER = 1` and also set the `concurrent_tasks` parameter in
openquake.cfg to 272 (down from 320) to match the number of workers.

Result:

This gave similar results to Tests 1 and 3, except the initial utilization of bs04
and gm0{1,2} was even worse: only 39 cores were used.

----------
Test 5
----------

Test:

`CELERYD_PREFETCH_MULTIPLIER = 1`, `concurrent_tasks` set to double the amount of cores
(272 * 2 = 544).

Result:

This gave full utilization from the start (48, 48, 48, 32, 32, 32, 32). Distribution of
work was pretty even throughout the entire calculation.

----------
Test 6
----------

Test:

Remove CELERYD_PREFETCH_MULTIPLIER, reset to default.
`concurrent_tasks = 544` (same as Test 5).

Result:

The result was about the same as Test 5. It seems that changing
CELERYD_PREFETCH_MULTIPLIER doesn't make a difference (at least with
the values used thus far).

----------
Test 7
----------

Test:

Increase task count to 3 * 272 = 816.
`concurrent_tasks = 544` (2 * 272)

Result:

The result was basically the same as Tests 5 and 6. I note that
the larger machines (bs04, gms) finished tasks more quickly
and still became idle sooner than the gs machines. We would probably
benefit from reducing CELERYD_PREFETCH_MULTIPLIER to 1.

----------
Test 8
----------

Test:

Same as Test 1, but start workers in a different order (first the gs machines, then the
other 3).

Result:

No significant differences from Test 1.

Changed in openquake:
status: New → In Progress
importance: Undecided → Critical
assignee: nobody → Lars Butler (lars-butler)
milestone: none → 0.9.0
Revision history for this message
Lars Butler (lars-butler) wrote :

The conclusion here is that in order to maximize utilization of a compute cluster, the following steps should be taken:

- In celeryconfig.py, set CELERY_ACKS_LATE=True, CELERYD_PREFETCH_MULTIPLIER=1. This ensures that each worker will only take one task at a time and will not greedily prefetch. This is the recommended configuration for long-running tasks, as advised by asksol, the Celery author. See also: http://docs.celeryproject.org/en/master/userguide/optimizing.html#prefetch-limits
- In openquake.cfg, the recommended number of `concurrent_tasks` is 2x the number of worker processes.
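As a sanity check, the sizing rule from the second bullet (the helper name is mine, not an engine API):

```python
def recommended_concurrent_tasks(worker_processes):
    """Task-queue size that kept all cores busy in Tests 5-7:
    twice the number of worker processes."""
    return 2 * worker_processes

recommended_concurrent_tasks(272)  # -> 544, the value used in Tests 5-7
```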

These changes ensure optimal task distribution, which is most important for large calculations.

Related code changes: https://github.com/gem/oq-engine/pull/984

Changed in openquake:
status: In Progress → Fix Committed
Changed in openquake:
status: Fix Committed → Fix Released