There is a race condition between nova-compute booting an instance and the l3-agent processing the DVR (local) router on the compute node.
This issue shows up when a large number of instances are booted on the same host and the instances belong to different DVR routers, so the l3-agent on that host has to process all of these routers concurrently.
Although we have a green pool of 8 greenlets for the router ResourceProcessingQueue,
https://github.com/openstack/neutron/blob/master/neutron/agent/l3/agent.py#L642
some of these routers can still be left waiting in the queue. Even worse, the router processing procedure includes time-consuming actions, for instance installing ARP entries, iptables rules, route rules, etc.
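To get a feel for the queueing effect, here is a rough back-of-the-envelope model (my own illustration, not Neutron code): with a fixed pool, routers are effectively processed in waves, so the last router waits roughly ceil(N / pool_size) waves.

```python
import math

def worst_case_wait(n_routers, pool_size, secs_per_router):
    """Rough wall-clock time until the last router is configured,
    assuming each router takes about the same time and the pool
    drains the queue in waves of `pool_size` routers."""
    waves = math.ceil(n_routers / pool_size)
    return waves * secs_per_router

# 64 routers, the default pool of 8, ~10s of ARP/iptables/route work each:
print(worst_case_wait(64, 8, 10))   # 80 seconds before the last router is ready
print(worst_case_wait(64, 16, 10))  # 40 seconds with a bigger pool
```

The 10-second per-router figure is an assumption for illustration; the real cost depends on how many ARP entries and rules each router needs.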
So when the VM comes up, it tries to fetch its metadata via the local proxy hosted by the DVR router, but the router is not ready yet on that host. As a result, those instances fail to apply some of their configuration in the guest OS.
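On the guest side the usual symptom is cloud-init timing out against the metadata URL. The retry behaviour can be sketched like this (`fetch_fn` is a stand-in for the HTTP GET; real cloud-init has its own retry logic):

```python
import time

METADATA_URL = "http://169.254.169.254/openstack/latest/meta_data.json"

def fetch_metadata(fetch_fn, retries=5, delay=1.0, sleep=time.sleep):
    """Try to fetch metadata, retrying while the router-hosted
    proxy is not up yet. Raises the last error if the router
    never becomes ready within the retry budget."""
    for attempt in range(1, retries + 1):
        try:
            return fetch_fn(METADATA_URL)
        except ConnectionError:
            if attempt == retries:
                raise
            sleep(delay)
```

If the router takes longer to set up than the guest's retry budget, the instance boots without its metadata, which matches the behaviour described above.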
Some potential solutions:
(1) increase the size of that green pool
(2) keep a provisioning block on the VM port so it is not set to ACTIVE until the DVR router is up on that host (at least for the first port on the router)
I think (1) and (2) will both help, with the provisioning block probably covering more cases.
The other change is batching the DVR ARP entry processing; there is another bug tracking that, and it has been proposed a couple of times, but nothing has merged yet. I believe that is one of the longer operations when creating the router.
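One way the batching could work is to feed all neighbour entries to a single `ip -batch` invocation instead of spawning one `ip neigh replace` process per entry. A sketch under that assumption (the helper names are mine, not from the proposed patches):

```python
import subprocess

def build_neigh_batch(entries, dev):
    """Render one `ip -batch` input line per (ip, mac) pair."""
    return "\n".join(
        f"neigh replace {ip} lladdr {mac} dev {dev} nud permanent"
        for ip, mac in entries
    ) + "\n"

def apply_arp_entries(entries, dev, netns=None):
    """Apply all ARP entries with a single process instead of N of them."""
    cmd = ["ip"]
    if netns:
        cmd = ["ip", "netns", "exec", netns] + cmd
    # `ip -batch -` reads the commands from stdin.
    subprocess.run(cmd + ["-batch", "-"],
                   input=build_neigh_batch(entries, dev),
                   text=True, check=True)
```

Replacing N process forks with one should shrink the per-router processing time considerably for routers with many ports.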
Maybe figuring out where most of the time is spent would be a good first step, so we can prioritize which area to address first.
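As a first measuring step, even a simple per-phase timer around the router processing code would show whether ARP, iptables, or route programming dominates. A sketch (my own utility, not something that exists in Neutron; the clock is injectable so it can be tested deterministically):

```python
import time
from contextlib import contextmanager

class PhaseTimer:
    """Record how long each named phase of router processing takes."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self.durations = {}

    @contextmanager
    def phase(self, name):
        start = self._clock()
        try:
            yield
        finally:
            self.durations[name] = self._clock() - start

    def slowest(self):
        """Return the name of the phase that took the longest."""
        return max(self.durations, key=self.durations.get)
```

Wrapping the ARP, iptables, and route sections in `timer.phase("arp")` etc. and logging `timer.durations` for a busy host would tell us which of the proposed fixes to prioritize.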