In large-scale environments, instances can fail to get their metadata. Tests were performed in a 100-compute-node environment creating 4000 VMs. 15-20 VMs fail all 20 metadata request attempts. This has been reproduced multiple times with similar results. All of the VMs successfully obtain a private IP and are pingable, but a small number of them are not reachable via SSH. Increasing the number of metadata request attempts in a Cirros test image shows that the metadata requests eventually succeed; it just takes a long time in some cases. These tests were performed using stable/liberty with a small number of patches cherry-picked from Mitaka that were intended to improve performance.
Below is a list of instance UUIDs I collected from one test, each followed by the number of metadata request attempts it took to succeed.
705c3582-f39b-482d-9a6e-d78bc033d3e7 5
27f93574-19fe-4b88-ad6e-c518022ef66a 2
ff668db8-196e-4ec3-82d9-f7ab5a302305 57
b3f97acb-6374-4406-9474-7bacfc3486cd 42
80c19187-7c19-4adc-ad3a-51342f00d799 51
071f60d5-2a9a-4448-b14b-9016c9eee4eb 47
d39f336e-0fb4-4934-b835-e791661d60f1 36
a5627d9f-fd2d-48b0-ada2-f519a97849ee 5
3c24145e-8e11-4e79-8618-fca0416ea030 41
a36ab8fd-4e53-4265-a2bf-6945ac5d8811 46
a9400361-8941-4f03-b11d-0940b5499b4b 37
7449efbd-1df6-4fcc-88d5-e4e355ae7963 24
45c6a108-c18b-4284-9ede-3e5f8d7851be 30
fbe7c6fc-6aec-464c-87b7-0800836f7754 7
cb5a3a49-45b9-40de-8c62-903bee1925f4 37
0c7151ce-79dc-4d55-a617-7f4182cb2194 14
0f1c24a0-3b97-4d56-8feb-b30d67cf6852 44
8c359465-198f-4654-84bb-f334f0400d58 10
b3a5a3df-28c4-40c3-adba-856a0fcbd29e 55
38ee6525-441e-4640-a998-ad89b8d3f8be 2
07ecde16-c274-481e-8169-4febb15c7273 48
f77cd7aa-89e2-4d2c-a89f-e19ff430e5a4 31
b9acdba1-1794-4fa8-bbe3-ffb94f86d19b 3
30824aa6-3df5-4a43-a701-dd33da7f704f 13
5216ffc0-4a8d-4a3e-a4e3-5473b96ca47b 40
999512ff-70e3-4cfd-9cb4-c5788a02fee6 4
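As a rough way to read the list above, the following Python sketch tallies how many of the listed instances needed more than the 20 attempts mentioned in the report. The attempt counts are copied from the list in order; the names `attempts`, `ATTEMPT_LIMIT`, and `summarize` are illustrative and not part of any tool used in the tests.

```python
# Attempt counts copied from the list above, in the same order.
attempts = [
    5, 2, 57, 42, 51, 47, 36, 5, 41, 46, 37, 24, 30,
    7, 37, 14, 44, 10, 55, 2, 48, 31, 3, 13, 40, 4,
]

# Attempt limit mentioned in the report ("fail all 20 metadata request attempts").
ATTEMPT_LIMIT = 20


def summarize(counts, limit):
    """Return how many instances exceeded the attempt limit, plus basic totals."""
    over_limit = [c for c in counts if c > limit]
    return {
        "total": len(counts),
        "over_limit": len(over_limit),
        "max": max(counts),
    }


summary = summarize(attempts, ATTEMPT_LIMIT)
print(summary)
```

With these values, 16 of the 26 listed instances needed more than 20 attempts, and the slowest needed 57. That count is in line with the 15-20 failures seen when the limit is 20.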
I'm trying to assess the importance of this bug.
Could you provide some more information about how the 4000 VMs were created? All at once? As quickly as possible? If you took more time to create the 4000 instances, would you expect all 4000 to succeed?
Also, to determine whether this gets worse with load, let me ask this: if I booted 400 VMs in the same way on the same infrastructure, would you expect 1-2 VMs to fail in the same way, or would you expect them all to succeed? If I booted 8000, would you expect 30-40 failures, or do you think it would blow up to a higher percentage of failures?
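For reference, the figures in these questions are a straight linear extrapolation of the reported 15-20 failures per 4000 VMs. A minimal Python sketch of that arithmetic follows. It assumes the failure rate stays constant with scale, which is exactly the open question; the names are illustrative.

```python
# Baseline taken from the report: 15-20 failures when creating 4000 VMs.
BASELINE_VMS = 4000
BASELINE_FAILURES = (15, 20)


def linear_estimate(vm_count):
    """Scale the baseline failure range linearly to a different VM count."""
    low, high = BASELINE_FAILURES
    return (low * vm_count / BASELINE_VMS, high * vm_count / BASELINE_VMS)


for n in (400, 8000):
    print(n, linear_estimate(n))
```

Failure counts well above these estimates at 8000 VMs would suggest that the problem grows faster than linearly with load.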
Are Neutron routers in use? Are they DVR or non-DVR?
Is there any indication why the request fails? Do the failed requests not reach the metadata server or is the metadata server too busy to handle the request? Any more information about the failure mode for a request could help.