Comment 21 for bug 1818239

Dan Smith (danms) wrote :

With the weigher, you shouldn't be able to "take down" anything. You may stack a lot more instances on the non-error-reporting hosts, but once those are full, the scheduler will try one of the hosts reporting errors, and as soon as one build succeeds there, that host's score resets to zero. So can you clarify "took down" in this context?
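
To make that sequence concrete, here is a toy sketch of the behavior described above. It is not nova's code: the `Host` class, `schedule`, `report_build`, and the slot-based capacity model are all hypothetical names and simplifications chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Host:
    # Hypothetical stand-in for a compute node, not a nova class.
    name: str
    free_slots: int
    failed_builds: int = 0


def schedule(hosts):
    """Prefer the host with the fewest recent failures that still has room."""
    candidates = [h for h in hosts if h.free_slots > 0]
    if not candidates:
        return None
    return min(candidates, key=lambda h: h.failed_builds)


def report_build(host, succeeded):
    """A success consumes a slot and clears the failure count; a failure adds to it."""
    if succeeded:
        host.free_slots -= 1
        host.failed_builds = 0
    else:
        host.failed_builds += 1


healthy = Host("healthy", free_slots=2)
flaky = Host("flaky", free_slots=2, failed_builds=4)
hosts = [healthy, flaky]

placements = []
for _ in range(3):
    host = schedule(hosts)
    placements.append(host.name)
    report_build(host, succeeded=True)
```

In this sketch the first two builds stack on the healthy host; once it is full, the third goes to the previously failing host, and its failure count returns to zero after that one success.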

Also, the weight given to this weigher, like all others, is configurable. If you have no desire to deprioritize failing hosts, you can set it to zero, and if you want it to have a smaller impact, you can change the weight to something smaller. The default weight was carefully chosen to cause a failing host to have a lower weight than others, all else being equal. Since the disk weigher scales by free bytes (or whatever), if you're a new compute node that has no instances (and thus a lot of free space) and a bad config that will cause you to fail every boot, the fail weigher has to have an impactful score, else it really will have no effect.
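
A rough illustration of why the multiplier has to be large, under simplifying assumptions: the score here is just free disk minus a penalty per failed build, which is a toy model and not nova's actual weighing code. The names `Host`, `pick_host`, and `failure_multiplier`, and all the numbers, are made up for the example.

```python
from dataclasses import dataclass


@dataclass
class Host:
    # Hypothetical stand-in for a compute node, not a nova class.
    name: str
    free_disk_mb: int
    failed_builds: int = 0


def pick_host(hosts, disk_multiplier=1.0, failure_multiplier=0.0):
    """Return the host with the highest combined score (toy model)."""
    def score(h):
        return disk_multiplier * h.free_disk_mb - failure_multiplier * h.failed_builds
    return max(hosts, key=score)


hosts = [
    # New, empty node with a bad config: lots of free space, every build fails.
    Host("broken-new", free_disk_mb=1_000_000, failed_builds=3),
    Host("healthy-a", free_disk_mb=200_000),
    Host("healthy-b", free_disk_mb=150_000),
]

# Failure weight disabled: the empty, broken node wins on free space.
disabled = pick_host(hosts, failure_multiplier=0).name

# A small failure weight is drowned out by the free-space term.
small = pick_host(hosts, failure_multiplier=10).name

# A large failure weight pushes the broken node below the healthy ones.
large = pick_host(hosts, failure_multiplier=1_000_000).name
```

With these made-up numbers, only the large multiplier changes the outcome; a value of zero or ten still sends the build to the broken node.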

I've nearly lost the will to even argue about this issue, so I'm not sure what my opinion is on setting the default to zero, other than to say that the converse argument is also true... If you have one compute node with a broken config (or even just something preventing it from talking to neutron), it will attract all builds in the scheduler, fail them, and the cloud is effectively down until a human is paged to remedy the situation. That is the case this weigher was trying to mitigate, in both its original and evolved forms.