nova scheduler race condition

Bug #1488986 reported by hougangliu
Affects                    Status    Importance  Assigned to  Milestone
OpenStack Compute (nova)   Opinion   Wishlist    Unassigned

Bug Description

a) The nova compute service updates the compute node's info by running update_available_resource every CONF.update_resources_interval (60s by default).
b) For every scheduler request (see the sketch after this list):
1. select_destinations is called and gets all HostStates (if the compute node record in the DB is newer than the local HostState info, based on updated_at, the HostState is refreshed with the compute info from the DB).
2. The hosts are checked one by one against the instance's requirements, with the HostState resources updated iteratively; if a host fits, a build_and_run_instance cast RPC is sent to the corresponding compute node.
3. The compute service accepts the AMQP message, consumes the instance's resources, and writes the new compute info into the DB.
4. The compute service tries to spawn the instance; if that fails, step 3 is rolled back.
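
For illustration, a heavily simplified sketch of steps a), 1 and 2 (hypothetical, not actual nova code; update_available_resource, select_destinations and consume_from_instance are the nova names referenced in this report, every other name is made up):

    def update_available_resource(db, node):
        # Step a): the compute node's periodic task writes its current usage
        # to the DB, refreshing the row's updated_at timestamp.
        db.save_compute_node(node.hostname, node.collect_local_usage())

    def select_destinations(db, host_states, request):
        # Step 1: refresh each in-memory HostState from the DB if the DB row
        # is newer, based on updated_at.
        for hostname, db_row in db.get_all_compute_nodes().items():
            state = host_states[hostname]
            if db_row.updated_at > state.updated_at:
                state.refresh_from(db_row)

        # Step 2: pick a host per requested instance, consuming the HostState
        # iteratively so the next pick already sees this request's usage.
        chosen = []
        for _ in range(request.num_instances):
            host = next(h for h in host_states.values() if fits(h, request.flavor))
            host.consume_from_instance(request.flavor)
            chosen.append(host.hostname)
        return chosen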

My question:
If a user sets CONF.update_resources_interval to 1s, the compute node service writes its compute info into the DB every second.
Consider this case: the user sends multiple nova boot requests. The first boot request reaches step 2 while the compute node service runs the periodic task update_available_resource at the same time. The second boot request then reaches step 1 before the first request has reached step 3, so the second request gets a HostState that does not include the first instance's consumption, and the scheduler will pick a host for it without accounting for the first instance. The following requests repeat this pattern (a small stand-alone demo follows below).

Can this race condition occur?
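
A small stand-alone demo of the suspected overwrite (hypothetical and heavily simplified; HostState, consume_from_instance and the refresh-if-newer check follow the description in this report, nothing here is actual nova code):

    class HostState:
        def __init__(self, free_ram_mb, updated_at):
            self.free_ram_mb = free_ram_mb
            self.updated_at = updated_at

        def refresh_from(self, db_row):
            # Step 1: take the DB values when the row looks newer.
            self.free_ram_mb = db_row["free_ram_mb"]
            self.updated_at = db_row["updated_at"]

        def consume_from_instance(self, ram_mb):
            # Step 2: decrement the in-memory view.
            self.free_ram_mb -= ram_mb

    state = HostState(free_ram_mb=4096, updated_at=100)

    # Request A reaches step 2: its consumption exists only in memory.
    state.consume_from_instance(2048)

    # The periodic task then writes the *old* usage (request A has not yet
    # claimed anything on the compute node) with a newer timestamp.
    db_row = {"free_ram_mb": 4096, "updated_at": 101}

    # Request B reaches step 1 and refreshes the HostState from the DB,
    # wiping out request A's in-memory consumption.
    if db_row["updated_at"] > state.updated_at:
        state.refresh_from(db_row)

    print(state.free_ram_mb)  # 4096 -- the scheduler has forgotten request A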

Tags: scheduler
Revision history for this message
hougangliu (liuhoug) wrote :

Even ignoring the periodic task update_available_resource, I think there is still a race condition in the scheduler.
If the first boot request uses the --max-count option, say 10, and the ten instances are scheduled to the same compute node, the corresponding compute node service will try to spawn the instances one by one. While it is handling the 3rd one, the compute info in the DB only accounts for 3 instances' consumption. If the second boot request is being handled by the scheduler at that moment, its HostState is refreshed with info from the DB that only accounts for 3 instances' consumption, so the second boot can be scheduled wrongly (see the sketch below).
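
As a rough illustration with hypothetical numbers (not nova code):

    # First request: 10 instances of a 2048 MB flavor on a 32768 MB node,
    # spawned one by one; only 3 claims are in the DB when the second
    # request hits the scheduler.
    node_ram_mb   = 32768
    flavor_ram_mb = 2048
    claims_in_db  = 3

    db_free_ram_mb   = node_ram_mb - claims_in_db * flavor_ram_mb  # 26624
    real_free_ram_mb = node_ram_mb - 10 * flavor_ram_mb            # 12288

    # A second request for 8 such instances passes a RAM check against the
    # stale DB view but would over-commit the node once all 10 claims land.
    second_request_mb = 8 * flavor_ram_mb                          # 16384
    print(second_request_mb <= db_free_ram_mb)    # True  (scheduler accepts)
    print(second_request_mb <= real_free_ram_mb)  # False (node over-committed)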

Revision history for this message
jichenjc (jichenjc) wrote :

This is a known issue. I can't find the exact bp, but a lot of work is ongoing :)
In case someone can find and share the links here... FYI

tags: added: scheduler
Revision history for this message
Sylvain Bauza (sylvain-bauza) wrote :

So, it's pretty hard to explain in one small comment how the model is behaving, but please consider that we have 'sort of' two-phase commits when booting an instance.

When a request comes in, you're right, the instances are elected iteratively by decrementing the resource usage of the elected node in HostState.consume_from_instance(). That means that when you're asking for 10 instances of the same type, the corresponding HostState(s) will be decremented before the next filter pass, which should provide a good way of ensuring consistency. Only once all 10 instances are elected does the scheduler give the answer back to the *conductor*, which calls the respective compute managers (i.e. your step #3 is no longer accurate since Juno). A simplified sketch follows below.
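
For illustration, a simplified sketch of that hand-off (hypothetical; only select_destinations, consume_from_instance and build_and_run_instance are names taken from this thread, the rest is made up):

    def build_instances(scheduler, compute_rpc, request):
        # Conductor-side view: the scheduler answers only after *all*
        # requested instances have a host, decrementing HostState between
        # picks via consume_from_instance().
        hosts = scheduler.select_destinations(request)  # one host per instance

        # The conductor, not the scheduler, then casts build_and_run_instance
        # to each chosen compute manager.
        for instance, host in zip(request.instances, hosts):
            compute_rpc.build_and_run_instance(instance, host)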

Now, that HostState model is something kept in-memory and only refreshed when a new request comes in. That means that if you have two schedulers running separately (or when you have 2 concurrent requests coming in), then yes you could have race conditions.

That's not really a problem in general, because if your cloud is adequately sized, the request will go to the compute manager, which uses a context manager called "instance_claim()" to ensure that its *OWN* internal representation is correct (and that method is thread-safe in that context). If the scheduler decision was incorrect, it raises an error which is caught by the compute manager, which then calls the conductor again to ask for a reschedule (excluding the wrong host). See the sketch below.
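
A rough sketch of that claim-and-retry behaviour (hypothetical; instance_claim and build_and_run_instance are the names used above, everything else is made up and not actual nova code):

    import threading

    class ComputeResourcesUnavailable(Exception):
        """Raised when the local claim check fails."""

    class ResourceTracker:
        def __init__(self, free_ram_mb):
            self.free_ram_mb = free_ram_mb
            self._lock = threading.Lock()

        def instance_claim(self, ram_mb):
            # The compute node checks its OWN usage view under a lock, so two
            # concurrent builds on the same node cannot both over-commit it.
            with self._lock:
                if ram_mb > self.free_ram_mb:
                    raise ComputeResourcesUnavailable()
                self.free_ram_mb -= ram_mb

    def build_and_run_instance(tracker, conductor, instance, hostname):
        try:
            tracker.instance_claim(instance["ram_mb"])
        except ComputeResourcesUnavailable:
            # The scheduler's decision was stale: hand the request back to the
            # conductor for a reschedule, excluding this host from the retry.
            conductor.reschedule(instance, exclude_hosts=[hostname])
            return
        # ...spawn the instance; on failure the claim would be rolled back...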

So, see, when we have races, we have retries (that's the 2PC I mentioned). That's not perfect, in particular when the cloud is a bit full, and that's why we're working towards resolving it through multiple possibilities:

https://review.openstack.org/#/c/192260/7/doc/source/scheduler_evolution.rst,cm

To be honest, I don't see clear actionable items in your bug report. I'd rather propose that you join the scheduler meetings happening every Monday at 1400 UTC if you wish to help us and contribute.

Changed in nova:
status: New → Opinion
importance: Undecided → Wishlist