Investigate why worker usage seems to decline over time with large calculations
Bug #1092050 reported by Lars Butler
This bug affects 1 person
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
OpenQuake (deprecated) | Fix Released | Critical | Lars Butler | —
Bug Description
With large calculations, spanning the whole of Europe for example, we've observed that over time many workers are sitting idle. There is plenty of work left to be done, but there are simply no tasks being assigned to idle workers.
I'm betting there is a subtle bug with the task distribution code. What is supposed to happen is the following:
- fill up the queue with tasks
- wait for tasks to complete
- when we receive a task completion notice, queue one additional task
My hunch is that this somehow breaks in some cases and we queue only 1 new task when more than 1 task has completed. Over time, this results in many worker processes sitting idle.
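To make the intended behavior concrete, here is a minimal sketch of that loop using only the standard library; the real code uses celery, and the names (`do_work`, `inputs`) are illustrative, not OpenQuake's actual API:

```python
from concurrent.futures import FIRST_COMPLETED, ProcessPoolExecutor, wait

def distribute(do_work, inputs, concurrent_tasks, workers):
    pending = iter(inputs)
    with ProcessPoolExecutor(max_workers=workers) as pool:
        # Step 1: fill up the queue with tasks.
        in_flight = set()
        for _ in range(concurrent_tasks):
            try:
                in_flight.add(pool.submit(do_work, next(pending)))
            except StopIteration:
                break

        # Steps 2 and 3: wait for completions and, for EVERY completed
        # task, queue exactly one replacement. The suspected bug is
        # queuing a single replacement when more than one task has
        # completed, so the number of in-flight tasks shrinks over time.
        while in_flight:
            done, in_flight = wait(in_flight, return_when=FIRST_COMPLETED)
            for _ in done:
                try:
                    in_flight.add(pool.submit(do_work, next(pending)))
                except StopIteration:
                    break
```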
Changed in openquake:
status: Fix Committed → Fix Released
Here are my testing notes. I ran eight different scenarios. In summary, increasing `concurrent_tasks` to 2× the number of worker processes and setting `CELERY_ACKS_LATE = True` and `CELERYD_PREFETCH_MULTIPLIER = 1` in celeryconfig.py seems to solve the problem.
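For reference, the relevant celeryconfig.py lines would look like the following (these are the pre-4.0 Celery setting names; the values are the ones from the summary above):

```python
# celeryconfig.py (excerpt) -- settings used in the successful tests.

# Acknowledge a task only after it has run, so an unacknowledged task
# can be redelivered if its worker is busy or dies.
CELERY_ACKS_LATE = True

# Reserve one task per worker process at a time instead of prefetching
# a batch, so idle workers can pick up waiting tasks.
CELERYD_PREFETCH_MULTIPLIER = 1
```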
All tests use the same job configuration.
Tasks = 544
----------
Test 1
----------
Test:
All machines (272 cores).
`concurrent_tasks` = 320.
Result:
bs04, gm01, and gm02 were under-utilized from the start of the calculation.
----------
Test 2
----------
Test:
Test run with only bs04 and gm0{1,2}.
Result:
Test shows full core utilization from the start (48, 48, and 48).
----------
Test 3
----------
Test:
All machines again, this time with `CELERYD_PREFETCH_MULTIPLIER = 1`.
Result:
Result was the same as Test 1; the 48-core machines are under-utilized.
----------
Test 4
----------
Test:
`CELERYD_PREFETCH_MULTIPLIER = 1` and also set the `concurrent_tasks` parameter in
openquake.cfg to 272 (down from 320) to match the number of workers.
Result:
This gave similar results to Tests 1 and 3, except the initial utilization of bs04
and gm0{1,2} was even worse: only 39 cores were used.
----------
Test 5
----------
Test:
`CELERYD_PREFETCH_MULTIPLIER = 1`, `concurrent_tasks` set to double the number of cores
(272 × 2 = 544).
Result:
This gave full utilization from the start (48, 48, 48, 32, 32, 32, 32). Distribution of
work was pretty even throughout the entire calculation.
----------
Test 6
----------
Test:
Remove `CELERYD_PREFETCH_MULTIPLIER`, resetting it to the default.
`concurrent_tasks = 544` (same as Test 5).
Result:
The result was about the same as Test 5. It seems that changing `CELERYD_PREFETCH_MULTIPLIER` doesn't make a difference (at least with the values used thus far).
----------
Test 7
----------
Test:
Increase task count to 3 × 272 = 816.
`concurrent_tasks = 544` (2 × 272).
Result:
The result was basically the same as Tests 5 and 6. I note that the larger machines (bs04, gms) finished tasks quicker and became idle sooner than the gs machines. We would probably benefit from reducing `CELERYD_PREFETCH_MULTIPLIER` to 1.
----------
Test 8
----------
Test:
Same as Test 1, but start workers in a different order (first the gs machines, then the
other 3).
Result:
No significant differences from Test 1.