Comment 12 for bug 1372049

Oleg Bondarev (obondarev) wrote :

So the result of my analysis is the following: the root cause is ultimately in https://github.com/openstack/oslo.messaging/blob/master/oslo/messaging/_executors/impl_eventlet.py
In particular, rpc_thread_pool_size, which is 64 by default.
When spawning >= 64 instances at the same time, n-cpu ends up with 64 threads blocked waiting for network-vif-plugged events.
Then, when the network-vif-plugged events arrive at n-cpu from n-api over rpc (neutron -> n-api -> n-cpu), there are no available threads in the thread pool to handle them.
Once instances start to fail with timeouts, threads become available and start handling network-vif-plugged events, so the rest of the instances become active (right before their own timeout occurs).

So we have a 1:1 relationship between rpc_thread_pool_size and the number of instances that can be spawned simultaneously.

One possible fix I can think of is to set a priority for rpc messages and keep a set of "reserved" threads that can be used only for high-priority messages (network-vif-plugged, for example).
Another way (maybe a bit simpler) is to monitor the number of available threads in the pool and provide extra threads when the pool becomes empty.
Not sure how this can be fixed in Nova.
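
A rough sketch of the "reserved" threads idea from above, only to show the shape of it. Everything here is hypothetical: oslo.messaging is not assumed to expose message priorities, the reserved threads are modelled as a second stdlib ThreadPoolExecutor rather than as part of one eventlet pool, and PriorityDispatcher, HIGH_PRIORITY and run_demo are invented names.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# Hypothetical set of message names treated as high priority.
HIGH_PRIORITY = {"network-vif-plugged"}


class PriorityDispatcher:
    """Route high-priority messages to workers that spawns cannot occupy."""

    def __init__(self, pool_size, reserved):
        self._normal = ThreadPoolExecutor(max_workers=pool_size)
        self._reserved = ThreadPoolExecutor(max_workers=reserved)

    def dispatch(self, name, handler, *args):
        # High-priority messages never queue behind blocked spawns.
        pool = self._reserved if name in HIGH_PRIORITY else self._normal
        return pool.submit(handler, *args)

    def shutdown(self):
        self._normal.shutdown()
        self._reserved.shutdown()


def run_demo(pool_size=4, reserved=1, n_instances=4, wait_timeout=0.5):
    """Return how many spawns got their event before timing out."""
    dispatcher = PriorityDispatcher(pool_size, reserved)
    events = [threading.Event() for _ in range(n_instances)]

    # Spawns block on their event and fill the normal pool.
    spawns = [
        dispatcher.dispatch("spawn", events[i].wait, wait_timeout)
        for i in range(n_instances)
    ]
    # Event deliveries go to the reserved worker(s).
    for i in range(n_instances):
        dispatcher.dispatch("network-vif-plugged", events[i].set)

    results = [f.result() for f in spawns]
    dispatcher.shutdown()
    return sum(results)
```

In this sketch run_demo(4, 1, 4) lets all four spawns receive their event even though the normal pool is fully occupied, because the single reserved worker is never held by a spawn.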