Comment 17 for bug 1341420

Robert Collins (lifeless) wrote :

@PaulMurray, @Sylvain

This particular bug occurs with *one* scheduler instance and two resources under subscription, and it's a clear example of bad data hygiene - eventual consistency doesn't imply the poor behaviour we have today.

Trivially, caching the scheduling decisions we've made locally and folding them into our view of each host until the hypervisor reports the updated usage would fix this bug with no additional synchronisation overhead.
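To make that concrete, here is a minimal sketch (plain Python, not Nova's actual code) of what "fold locally cached claims into the host view" could look like. The names (HostState, LocalClaimCache, pick_host), the RAM-only accounting, and the TTL are illustrative assumptions, not the real scheduler API.

import threading
import time


class HostState(object):
    """Capacity as last reported by the compute node (illustrative)."""
    def __init__(self, host, free_ram_mb):
        self.host = host
        self.free_ram_mb = free_ram_mb


class LocalClaimCache(object):
    """Claims this scheduler has made that the hypervisor has not yet reported back."""

    def __init__(self, ttl=60.0):
        self._lock = threading.Lock()
        self._claims = []   # (host, ram_mb, timestamp)
        self._ttl = ttl     # assume compute usage reports arrive well within this

    def claim(self, host, ram_mb):
        with self._lock:
            self._claims.append((host, ram_mb, time.time()))

    def pending_ram(self, host):
        now = time.time()
        with self._lock:
            # Age out claims the hypervisor should have reported by now.
            self._claims = [c for c in self._claims if now - c[2] < self._ttl]
            return sum(ram for h, ram, _ in self._claims if h == host)


def pick_host(hosts, ram_mb, cache):
    """Place a request using reported capacity minus locally cached claims."""
    for state in hosts:
        if state.free_ram_mb - cache.pending_ram(state.host) >= ram_mb:
            cache.claim(state.host, ram_mb)
            return state.host
    return None


cache = LocalClaimCache()
hosts = [HostState('node1', free_ram_mb=4096)]
print(pick_host(hosts, 4096, cache))  # node1
print(pick_host(hosts, 4096, cache))  # None - the first claim is folded in,
                                      # so the second request is not double-placed

The point is simply that the state needed to avoid this race already lives inside the single scheduler process; no cross-scheduler synchronisation is required to stop it.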

I appreciate that everyone is concerned about the scheduler, but before saying we can't use a technique, we need to be clear about what our success criteria are. One of the big scheduling problems we have is that we have no specific success criteria, and it's broken with bugs like this, so we get design pushback not on the basis of actual success or failure, but on fear that it will be worse.

For instance, if we define success as 'be able to handle up to 10,000 placements/second across up to 1M resources, with no more than 10 seconds of downtime in the event of scheduler failure', then we have a metric against which to assess scheduling implementations. For example, a single fast scheduler with HA via fail-over on a 5s heartbeat and a 15s warmup time could meet this. So could a distributed scheduler with sharding, or possibly a consensus scheduler with sync-on-over-subscribe [e.g. only one retry ever triggered].
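As a back-of-the-envelope illustration of why such a fail-over design can fit a 10-second downtime budget: this assumes the standby is warmed in advance (my reading of the 15s warmup figure), and the specific numbers below are mine, not measurements.

# Rough downtime budget for an active/passive scheduler fail-over (assumed numbers).
HEARTBEAT = 5.0          # seconds between liveness probes
MISSED_BEATS = 1         # beats missed before declaring the primary dead
STANDBY_WARMUP = 15.0    # paid when the standby starts, before any failure (hot standby)
FAILOVER_SWITCH = 1.0    # assumed time to promote the standby once failure is detected

downtime = HEARTBEAT * MISSED_BEATS + FAILOVER_SWITCH  # warmup is off the critical path
print(downtime, downtime <= 10.0)                      # 6.0 True, under these assumptions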