VMs fail to get metadata in large scale environments

Bug #1526855 reported by Ed Bak
This bug affects 5 people
Affects: neutron
Status: Won't Fix
Importance: Medium
Assigned to: Swaminathan Vasudevan
Milestone: (none)

Bug Description

In large-scale environments, instances can fail to get their metadata. Tests were performed in a 100-compute-node environment creating 4000 VMs; 15-20 of those VMs fail all 20 metadata request attempts. This has been reproduced multiple times with similar results. All of the VMs successfully obtain a private IP and are pingable, but a small number of them are not reachable via SSH. Increasing the number of metadata request attempts in a Cirros test image shows that the requests eventually succeed; it just takes a long time in some cases. Below is a list of instance UUIDs collected from one test, along with the number of metadata request attempts each needed to succeed. These tests were performed using stable/liberty with a small number of patches cherry-picked from Mitaka that were intended to improve performance.

705c3582-f39b-482d-9a6e-d78bc033d3e7 5
27f93574-19fe-4b88-ad6e-c518022ef66a 2
ff668db8-196e-4ec3-82d9-f7ab5a302305 57
b3f97acb-6374-4406-9474-7bacfc3486cd 42
80c19187-7c19-4adc-ad3a-51342f00d799 51
071f60d5-2a9a-4448-b14b-9016c9eee4eb 47
d39f336e-0fb4-4934-b835-e791661d60f1 36
a5627d9f-fd2d-48b0-ada2-f519a97849ee 5
3c24145e-8e11-4e79-8618-fca0416ea030 41
a36ab8fd-4e53-4265-a2bf-6945ac5d8811 46
a9400361-8941-4f03-b11d-0940b5499b4b 37
7449efbd-1df6-4fcc-88d5-e4e355ae7963 24
45c6a108-c18b-4284-9ede-3e5f8d7851be 30
fbe7c6fc-6aec-464c-87b7-0800836f7754 7
cb5a3a49-45b9-40de-8c62-903bee1925f4 37
0c7151ce-79dc-4d55-a617-7f4182cb2194 14
0f1c24a0-3b97-4d56-8feb-b30d67cf6852 44
8c359465-198f-4654-84bb-f334f0400d58 10
b3a5a3df-28c4-40c3-adba-856a0fcbd29e 55
38ee6525-441e-4640-a998-ad89b8d3f8be 2
07ecde16-c274-481e-8169-4febb15c7273 48
f77cd7aa-89e2-4d2c-a89f-e19ff430e5a4 31
b9acdba1-1794-4fa8-bbe3-ffb94f86d19b 3
30824aa6-3df5-4a43-a701-dd33da7f704f 13
5216ffc0-4a8d-4a3e-a4e3-5473b96ca47b 40
999512ff-70e3-4cfd-9cb4-c5788a02fee6 4
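
For context, a minimal sketch of the kind of guest-side retry loop these counts come from. The real check in the Cirros image is a shell loop, not Python; the 169.254.169.254 endpoint is the standard metadata address, and the retry count and sleep interval here are assumptions:

import time
import urllib.request

METADATA_URL = "http://169.254.169.254/2009-04-04/meta-data/instance-id"
MAX_ATTEMPTS = 20      # the Cirros image gives up after 20 attempts
SLEEP_SECONDS = 3      # assumed pause between attempts

def probe_metadata():
    """Return the attempt number on which metadata was reached, else None."""
    for attempt in range(1, MAX_ATTEMPTS + 1):
        try:
            with urllib.request.urlopen(METADATA_URL, timeout=10) as resp:
                instance_id = resp.read().decode().strip()
                print("attempt %d: got instance-id %s" % (attempt, instance_id))
                return attempt
        except OSError:
            # Connection refused or timed out: the router namespace or the
            # metadata proxy is not reachable yet, so wait and retry.
            time.sleep(SLEEP_SECONDS)
    print("metadata still unreachable after %d attempts" % MAX_ATTEMPTS)
    return None

if __name__ == "__main__":
    probe_metadata()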

Revision history for this message
Carl Baldwin (carl-baldwin) wrote :

I'm trying to assess the importance of this bug.

Could you provide some more information about how the 4000 vms were created? All at once? As quickly as possible? If you take more time to create 4000 instances, do you expect that all 4000 would succeed?

Also, to determine whether this gets worse with load, let me ask this: if I booted 400 vms in the same way on the same infrastructure, would you expect 1-2 vms to fail in the same way, or would you expect them all to succeed? If I booted 8000, would you expect 30-40 failures, or do you think it would blow up to a higher percentage of failures?

Are Neutron routers in use? Are they DVR or non-DVR?

Is there any indication why the request fails? Do the failed requests not reach the metadata server or is the metadata server too busy to handle the request? Any more information about the failure mode for a request could help.

Changed in neutron:
status: New → Incomplete
Revision history for this message
Ed Bak (ed-bak2) wrote :

The 4000 vms are created using 10 simultaneous clients. Each client issues a nova boot command, waits for the vm to become ACTIVE, and then associates a floating ip. There is a 20 second sleep between each nova boot.
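
For illustration, a minimal sketch of what one of these boot clients could look like, driving the nova CLI via subprocess. The image, flavor, network, and floating-IP values are placeholders, and the real harness may differ in its details:

import subprocess
import time

IMAGE = "cirros"                 # placeholder image name
FLAVOR = "m1.tiny"               # placeholder flavor
NET_ID = "<private-net-uuid>"    # placeholder network UUID
BOOT_SLEEP = 20                  # 20 second pause between boots, as in the test

def wait_for_active(name, timeout=600):
    """Poll the server status until it is ACTIVE (or ERROR/timeout)."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        status = subprocess.check_output(
            ["openstack", "server", "show", name, "-f", "value", "-c", "status"],
            text=True).strip()
        if status == "ACTIVE":
            return True
        if status == "ERROR":
            return False
        time.sleep(5)
    return False

def boot_one(name, floating_ip):
    """Boot a single VM, wait for ACTIVE, then associate a floating IP."""
    subprocess.check_call(
        ["nova", "boot", "--image", IMAGE, "--flavor", FLAVOR,
         "--nic", "net-id=" + NET_ID, name])
    if wait_for_active(name):
        subprocess.check_call(["nova", "floating-ip-associate", name, floating_ip])

# Each of the 10 clients would loop over its share of the 4000 names,
# sleeping BOOT_SLEEP seconds between consecutive boots.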

I'll have to report back on the results of slowing down the rate of vm creation.

I have seen failures at lower numbers of vms although this is not as common.

I can only speculate what would happen at 8000 vms. If I can get data at that level, I'll report back.

DVR Neutron routers are in use.

I don't see the failed metadata request attempts getting logged by the metadata proxy. Once the proxy accepts the connection from the instance, I see the "accepted" message and the GET request, and the metadata response is successfully returned.

Revision history for this message
Ed Bak (ed-bak2) wrote :

I reproduced this issue in a smaller (10 compute node) environment than the one where it was first observed.

Using 10 simultaneous clients with a 20 second sleep between nova boot commands, 14 of 4000 instances failed to retrieve their metadata and, as a result, were not accessible via SSH. This was with a DVR router.

Slowing down the rate of VM creation to 5 simultaneous clients with a 30 second sleep between each nova boot did not improve the situation; once again this was with a DVR router. Small numbers of metadata retries were observed in some of the VM console logs with fewer than 1000 VMs in the environment. Retries first exceeded the 20-retry limit at around 2300 VMs.

Using a non-distributed/centralized router, I did not observe the issue. All VMs booted successfully and obtained their metadata on the first attempt.

Based on the log files, the instances are unable to reach the metadata proxy for extended periods of time. Once the connection to the metadata proxy is accepted, the metadata is always successfully returned.

Revision history for this message
Swaminathan Vasudevan (swaminathan-vasudevan) wrote :

Hi Ed, there was a metadata-related fix from Oleg that recently went into Neutron:

https://review.openstack.org/#/c/255261/

Along with this patch, could you also try the patch below, which should be merging soon?

https://review.openstack.org/#/c/253685/

Revision history for this message
Ed Bak (ed-bak2) wrote :

With the patches suggested above by Swami, I am now seeing the following set of VMs retrying for metadata more than the default 20 retries. This is from a test which creates 4000 instances; the issues began somewhere after ~2000 VMs. This is with a distributed router. I verified once again that the issue does not occur with a legacy router (distributed=False). This is a little better, but I'm still seeing some failures.

Instance UUID # of metadata retries
7e19baed-e4e5-47be-8e37-9a81cac94943 31
9d5236d4-bc11-4b01-a216-2d37e2c7a006 2
aba89ecc-f089-40ed-9a3b-3c18e6abaa92 35
4b704e14-dc7d-40a5-880e-4474c7eb510d 41
1fd548a2-e82a-4f55-971f-6e51683596f9 21
a636492f-395d-4256-9418-e490644d9db7 52
8466262e-82aa-479e-a6af-f5ff0d20aa2a 36
27a005cc-03aa-4106-9de0-8541a399dfcb 47
58f0610f-4ae8-42f7-9a67-5623ecc921bd 40

tags: added: l3-dvr-backlog
Changed in neutron:
status: Incomplete → Confirmed
status: Confirmed → New
tags: added: loadimpact
Changed in neutron:
status: New → Confirmed
Changed in neutron:
assignee: nobody → Swaminathan Vasudevan (swaminathan-vasudevan)
Changed in neutron:
importance: Undecided → Medium
tags: added: scale
Revision history for this message
Swaminathan Vasudevan (swaminathan-vasudevan) wrote :

One other question I have on this bug: given the dynamic nature of DVR routers, which are created when a service port pops up, is there any delay before the VMs can reach the proxy?

Does this happen only for new routers attached to the VM's subnet, or do pre-existing routers also cause issues?

Revision history for this message
Swaminathan Vasudevan (swaminathan-vasudevan) wrote :

It seems the delay happens only for routers that have not yet been created on the compute node when the service VMs pop up. Once the router is in place, there is no delay in reaching the metadata.

In that case we need to figure out how long it takes for a new router to appear on a compute node after the VM port has been created there, and whether that time delta fits within the VM's 20 metadata retry attempts.
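
A rough sketch of how that delta could be measured on a compute node. It assumes the standard qrouter-<router-id> namespace naming and a Liberty-era neutron-ns-metadata-proxy process name; both assumptions may not hold on other releases:

import subprocess
import time

def namespace_exists(router_id):
    """Check whether the qrouter-<router-id> namespace exists on this host."""
    out = subprocess.check_output(["ip", "netns", "list"], text=True)
    return ("qrouter-" + router_id) in out

def metadata_proxy_running(router_id):
    """Check for a metadata proxy process serving this router (assumed name)."""
    out = subprocess.check_output(["ps", "-ef"], text=True)
    return any("neutron-ns-metadata-proxy" in line and router_id in line
               for line in out.splitlines())

def time_until_ready(router_id, poll=1.0, timeout=300):
    """Seconds until both the namespace and the proxy are present, else None."""
    start = time.time()
    while time.time() - start < timeout:
        if namespace_exists(router_id) and metadata_proxy_running(router_id):
            return time.time() - start
        time.sleep(poll)
    return None

# Example: start this right after booting the first VM on the host and compare
# the result against the VM's metadata retry window (20 attempts by default).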

Revision history for this message
Brian Haley (brian-haley) wrote :

I am going to close this bug, partly because it is so old without any updates, but also because there have been a number of scaling improvements over the past few cycles, so this is probably not as much of an issue any more.

Changed in neutron:
status: Confirmed → Won't Fix
Revision history for this message
Mithil Arun (arun-mithil) wrote :

@brian-haley, we continue to see this on Rocky. Are there specific commits that you think might have fixed this issue? I'd like to cherry-pick those fixes back if they are available on later releases.

Revision history for this message
Oleg Bondarev (obondarev) wrote :

@arun-mithil could you please describe the test you're performing? Does it only create new VMs, or does it create and delete them periodically? In what batches are the VMs created? What are the results?

Revision history for this message
Mithil Arun (arun-mithil) wrote :

@obondarev We hit this during our daily integration tests, where we create and delete VMs on a test network attached to a DVR router. We almost always run into it on the first VM created on a given host that is attached to the test network.

We narrowed this down to the fact that neutron-l3-agent spins down any unused DVR routers on hosts until they are needed again. When the first VM on the network is scheduled to run on such a host, the router namespace and the associated metadata-proxy process are spun up in parallel. Unfortunately, the VM comes up and sends metadata requests well before the router namespace is set up, causing cloud-init to fail.

Interestingly, we were able to work around this by deploying cirros VMs on all hosts to ensure that the router namespaces are never cleaned up. That's quite ugly though.
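
For reference, a minimal sketch of that workaround under assumed names (placeholder image/flavor/network; host targeting with --availability-zone nova:<host> requires admin credentials):

import subprocess

IMAGE = "cirros"              # placeholder image name
FLAVOR = "m1.tiny"            # placeholder flavor
NET_ID = "<test-net-uuid>"    # placeholder network UUID

def list_compute_hosts():
    """List hosts running nova-compute via the openstack CLI."""
    out = subprocess.check_output(
        ["openstack", "compute", "service", "list",
         "--service", "nova-compute", "-f", "value", "-c", "Host"],
        text=True)
    return [host for host in out.split() if host]

def pin_namespace_on(host):
    """Boot a small keepalive VM on the given host so the DVR router
    namespace and metadata proxy are never torn down there."""
    subprocess.check_call(
        ["nova", "boot", "--image", IMAGE, "--flavor", FLAVOR,
         "--nic", "net-id=" + NET_ID,
         "--availability-zone", "nova:" + host,
         "keepalive-" + host])

for host in list_compute_hosts():
    pin_namespace_on(host)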
