Comment 14 for bug 1341420

Paul Murray (pmurray) wrote :

@lifeless - Do any of those consensus answers scale well and handle failure gracefully? ;)

I think the issue here is that there are two different sets of problems at opposite ends of the scale, both in the number of machines and in what can fit on a single machine. At the "we have so many resources you can consider them infinite" end of the world an approximate view is the optimal one because it scales and there is a low likelihood of being wrong. In the case presented in this bug report, where the problem is that the system is approaching resource starvation, the optimal solution would be to have a globally consistent view of resources.

The current design is aimed at the first of these two.

Improving the retry mechanism does make sense, but I would make sure that we do not sacrifice scalability and graceful failure handling in the process.

So anything that can shorten the delay in getting up-to-date information to the scheduler is good. Anything that introduces any kind of synchronisation is bad - including additional lookups in the database.
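
To make the trade-off concrete, here is a rough sketch of the kind of retry I have in mind. It is an illustration only, not the actual scheduler code: Host, claim() and schedule() are hypothetical names, and memory stands in for all resources. The scheduler works from a possibly stale view, the host makes the final decision locally, and a refused claim feeds the host's current value straight back into the view before the next attempt. There is no global lock and no extra database lookup.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Host:
    """Hypothetical compute host; the host itself is the source of truth."""
    name: str
    free_mb: int

    def claim(self, mb: int) -> bool:
        # Local check only: no global lock and no database lookup.
        if mb > self.free_mb:
            return False
        self.free_mb -= mb
        return True


def schedule(view: Dict[str, int], hosts: Dict[str, Host], mb: int,
             max_attempts: int = 3) -> Optional[str]:
    """Place a request using a possibly stale view, retrying on a refused claim."""
    for _ in range(max_attempts):
        # Filter on the scheduler's approximate view of free memory.
        candidates = [name for name, free in view.items() if free >= mb]
        if not candidates:
            return None
        # Prefer the host that looks emptiest according to the view.
        name = max(candidates, key=lambda n: view[n])
        host = hosts[name]
        claimed = host.claim(mb)
        # Either way the host's reply carries fresh data, so update this
        # one entry of the view instead of re-reading global state.
        view[name] = host.free_mb
        if claimed:
            return name
    return None
```

With a view that wrongly believes host "a" has 4096 MB free when it really has 512 MB, a 1024 MB request is refused by "a", the view entry for "a" is corrected, and the second attempt lands on another host. The retry costs one extra round trip, but only when the view was actually wrong.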