Here are my testing notes. I ran eight different scenarios. In summary, increasing `concurrent_tasks` to twice the number of worker processes and setting `CELERY_ACKS_LATE = True` and `CELERYD_PREFETCH_MULTIPLIER = 1` in celeryconfig.py seems to solve the problem.
All tests use the same job configuration.
Tasks = 544
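For reference, the two Celery settings named in the summary would look roughly like this in celeryconfig.py. This is a minimal sketch showing only those two lines; the rest of the file (broker settings, imports, and so on) is not shown in these notes and is left out here. As I understand the Celery documentation, the prefetch multiplier controls how many messages each worker process reserves in advance, and late acknowledgement means a task is acknowledged after it runs rather than just before.

```python
# celeryconfig.py (fragment, old-style setting names as used in these notes)

# Acknowledge a task only after it has finished executing,
# instead of just before execution starts.
CELERY_ACKS_LATE = True

# Let each worker process reserve only one task at a time,
# rather than prefetching several.
CELERYD_PREFETCH_MULTIPLIER = 1
```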
----------
Test 1
----------
Test:
All machines (272 cores).
`concurrent_tasks` = 320.
Result:
bs04, gm01, and gm02 were under-utilized from the start of the calculation.
----------
Test 2
----------
Test:
Test run with only bs04 and gm0{1,2}.
Result:
Test shows full core utilization from the start (48, 48, and 48).
----------
Test 3
----------
Test:
All machines again, this time with `CELERYD_PREFETCH_MULTIPLIER = 1`.
Result:
Result was the same as Test 1; the 48-core machines were under-utilized.
----------
Test 4
----------
Test:
`CELERYD_PREFETCH_MULTIPLIER = 1` and also set the `concurrent_tasks` parameter in
openquake.cfg to 272 (down from 320) to match the number of workers.
Result:
This gave similar results to Tests 1 and 3, except the initial utilization of bs04
and gm0{1,2} was even worse: only 39 cores were used.
----------
Test 5
----------
Test:
`CELERYD_PREFETCH_MULTIPLIER = 1`, `concurrent_tasks` set to double the number of cores
(272 * 2 = 544).
Result:
This gave full utilization from the start (48, 48, 48, 32, 32, 32, 32). Distribution of
work was pretty even throughout the entire calculation.
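The arithmetic behind the 544 figure can be sketched as follows. The per-machine core counts are the ones reported in the result above; the gs host names are hypothetical (the notes only refer to "the gs machines"), and the helper function is just an illustration of the "2 * cores" rule of thumb, not something taken from OpenQuake.

```python
# Core counts per machine, as reported in the Test 5 result.
# The gs host names are hypothetical placeholders.
CORES = {
    "bs04": 48, "gm01": 48, "gm02": 48,
    "gs01": 32, "gs02": 32, "gs03": 32, "gs04": 32,
}


def suggested_concurrent_tasks(cores_per_host, factor=2):
    """Return factor * total cores, the rule of thumb from these notes."""
    return factor * sum(cores_per_host.values())


total_cores = sum(CORES.values())
print(total_cores)                        # 272
print(suggested_concurrent_tasks(CORES))  # 544
```

With `factor=1` this gives 272, the value used in Test 4.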
----------
Test 6
----------
Test:
Remove CELERYD_PREFETCH_MULTIPLIER, reset to default.
`concurrent_tasks = 544` (same as Test 5).
Result:
The result was about the same as Test 5. It seems that changing the
CELERYD_PREFETCH_MULTIPLIER doesn't make a difference (at least with
the values used thus far).
----------
Test 7
----------
Test:
Increase task count to 3 * 272 = 816.
`concurrent_tasks = 544` (2 * 272)
Result:
The result was basically the same as Tests 5 and 6. I note that
the larger machines (bs04 and the gm machines) finished tasks more quickly
and still became idle sooner than the gs machines. We will probably
benefit from reducing the CELERYD_PREFETCH_MULTIPLIER to 1.
----------
Test 8
----------
Test:
Same as Test 1, but start workers in a different order (first the gs machines, then the
other 3).
Result:
No significant differences from Test 1.