Metadata responses are very slow sometimes

Bug #1836642 reported by Slawek Kaplonski on 2019-07-15
This bug affects 2 people
Affects                    Importance   Assigned to
OpenStack Compute (nova)   Undecided    Unassigned
neutron                    High         Slawek Kaplonski

Bug Description

It happens from time to time in CI that a VM spawned in a test doesn't get public-keys from the metadata service. Because of that, SSH to the instance using the ssh key is not possible, so the test fails.

Example of such failures:

http://logs.openstack.org/09/666409/7/check/tempest-full/08f4c53/testr_results.html.gz
http://logs.openstack.org/35/521035/8/check/tempest-full/031b0b9/testr_results.html.gz
http://logs.openstack.org/45/645645/8/gate/tempest-full/4d7874e/testr_results.html.gz

In each of those cases it looks like the neutron metadata agent was waiting more than 10 seconds for a response from the n-api-meta service:

http://logs.openstack.org/09/666409/7/check/tempest-full/08f4c53/controller/logs/screen-n-api-meta.txt.gz#_Jul_11_23_43_16_704979 ~ 16 seconds,

http://logs.openstack.org/35/521035/8/check/tempest-full/031b0b9/controller/logs/screen-n-api-meta.txt.gz#_Jul_09_17_23_47_871054 ~ 13 seconds,

http://logs.openstack.org/45/645645/8/gate/tempest-full/4d7874e/controller/logs/screen-n-api-meta.txt.gz#_Jul_09_01_43_56_549916 ~ 17 seconds.

I have no idea why nova is responding so slowly, but it would be worth it if someone from the Nova team took a look at this.

Logstash query which can help to find another examples: http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%22failed%20to%20get%20http%3A%2F%2F169.254.169.254%2F2009-04-04%2Fmeta-data%2Fpublic-keys%5C%22

It is possible that in some of the failed tests the reason for the failure is different, but the problem described above is quite common in those failed tests IMO.

Changed in neutron:
assignee: nobody → Slawek Kaplonski (slaweq)
importance: Undecided → High
sean mooney (sean-k-mooney) wrote :

As part of building the instance metadata, nova calls neutron to retrieve the security groups for an instance.
https://github.com/openstack/nova/blob/master/nova/api/metadata/base.py#L145
That function first looks up all the ports associated with an instance, then looks up the security groups for each port:
https://github.com/openstack/nova/blob/46a3bcd80b41e99ec4923c7cf3d0f8dd8505e97c/nova/network/security_group/neutron_driver.py#L374-L409
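The per-port lookup pattern described above can be sketched roughly as follows. This is a hypothetical illustration of the call pattern, not nova's actual code; `FakeNeutron` is a stand-in for the real neutron client, and the point is that the number of round-trips grows with the number of ports and security groups:

```python
# Sketch (assumption, not nova's real implementation): one ports query per
# instance, then one security-group query per port/group, so each extra
# group is an extra round-trip to neutron.
class FakeNeutron:
    """Stand-in for the neutron client; counts API round-trips."""
    def __init__(self):
        self.calls = 0

    def list_ports(self, device_id):
        self.calls += 1
        return {"ports": [{"id": "p1", "security_groups": ["sg1", "sg2"]}]}

    def show_security_group(self, sg_id):
        self.calls += 1  # each security group costs another round-trip
        return {"security_group": {"id": sg_id}}


def get_instance_security_groups(neutron, instance_uuid):
    ports = neutron.list_ports(device_id=instance_uuid)["ports"]
    groups = []
    for port in ports:
        for sg_id in port.get("security_groups", []):
            groups.append(
                neutron.show_security_group(sg_id)["security_group"])
    return groups


neutron = FakeNeutron()
sgs = get_instance_security_groups(
    neutron, "f123a13b-f41f-4b32-a117-dd9ba2114832")
print(len(sgs), neutron.calls)  # 2 groups cost 3 API calls
```

If any one of those per-group calls stalls on the neutron side, the whole metadata build stalls with it, which matches the 12-second gap seen in the logs below.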

In the case of the metadata request that took 16s to respond:
http://logs.openstack.org/09/666409/7/check/tempest-full/08f4c53/controller/logs/screen-n-api-meta.txt.gz#_Jul_11_23_43_16_704979

If we look 16 seconds earlier in the log, we find this request being dispatched to the api worker with pid 9232; it is the same request that took 16 seconds:
http://logs.openstack.org/09/666409/7/check/tempest-full/08f4c53/controller/logs/screen-n-api-meta.txt.gz#_Jul_11_23_43_00_523076

From that request we can extract the instance uuid, which is 'f123a13b-f41f-4b32-a117-dd9ba2114832'.

This is the first request received by that api worker, and it took about 1.5 seconds to set up the db connection, which completed at
http://logs.openstack.org/09/666409/7/check/tempest-full/08f4c53/controller/logs/screen-n-api-meta.txt.gz#_Jul_11_23_43_01_357100

If we look at the neutron log, we can see the request from nova to look up the instance's ports here:
http://logs.openstack.org/09/666409/7/check/tempest-full/08f4c53/controller/logs/screen-q-svc.txt.gz#_Jul_11_23_43_04_463505

We know this is the request because the port device_id matches the instance uuid.
Immediately after, there is a request for the security groups of that port:
http://logs.openstack.org/09/666409/7/check/tempest-full/08f4c53/controller/logs/screen-q-svc.txt.gz#_Jul_11_23_43_04_575879

That request does not complete for 12.166260 seconds:
http://logs.openstack.org/09/666409/7/check/tempest-full/08f4c53/controller/logs/screen-q-svc.txt.gz#_Jul_11_23_43_16_660751

So neutron completes the security group request at 23:43:16.660751, and nova completes
rendering all the metadata 0.044228 seconds later, at 23:43:16.704979.
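As a sanity check of the timeline above, the elapsed time between the request being dispatched to the api worker (23:43:00.523076) and nova finishing the metadata render (23:43:16.704979) can be computed directly from the two log timestamps:

```python
from datetime import datetime

# Both timestamps are taken from the log lines quoted above (same day,
# so wall-clock subtraction is safe here).
fmt = "%H:%M:%S.%f"
dispatched = datetime.strptime("23:43:00.523076", fmt)
completed = datetime.strptime("23:43:16.704979", fmt)
elapsed = (completed - dispatched).total_seconds()
print(elapsed)  # 16.181903
```

That ~16.18s total is dominated by the 12.17s neutron security-group call plus the ~1.5s db connection setup.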

In the case of http://logs.openstack.org/45/645645/8/gate/tempest-full/4d7874e/controller/logs/screen-n-api-meta.txt.gz#_Jul_09_01_43_56_549916, the instance uuid was 1a88f13c-5609-4a4e-9182-33654a7d4188.

I was not able to find a similar directly related issue; however, between the start and the end of the metadata request, a 12.5s network delete request was interleaved:
http://logs.openstack.org/45/645645/8/gate/tempest-full/4d7874e/controller/logs/screen-q-svc.txt.gz#_Jul_09_01_43_51_804071

We get the same warning for the other case I was not able to root-cause,
http://logs.openstack.org/09/666409/7/check/tempest-full/08f4c53/controller/logs/screen-n-super-cond.txt.gz#_Jul_11_23_42_39_020917, at about the right time.

I'm not sure there is much more we can see from the logs, so we would need a reliable way to reproduce this to debug it further.

For http://logs.openstack.org/35/521035/8/check/tempest-full/031b0b9/controller/logs/screen-n-api-meta.txt.gz#_Jul_09_17_23_47_871054 the ...


Changed in nova:
status: New → Incomplete
tags: added: gate-failure loadim
tags: added: loadimpact
removed: loadim
Hongbin Lu (hongbin.lu) on 2019-07-17
Changed in neutron:
status: New → Confirmed
Slawek Kaplonski (slaweq) wrote :

Hi Sean,

I was looking at another example of a similar failure (the 13-second response pointed out by LIU in his last comment).
It is here: http://logs.openstack.org/35/521035/8/check/tempest-full/031b0b9/controller/logs/screen-q-meta.txt.gz#_Jul_09_17_23_47_872908

In n-meta-api logs it's here: http://logs.openstack.org/35/521035/8/check/tempest-full/031b0b9/controller/logs/screen-n-api-meta.txt.gz#_Jul_09_17_23_47_871054

And in neutron, related calls to get e.g. security groups are quite fast: http://logs.openstack.org/35/521035/8/check/tempest-full/031b0b9/controller/logs/screen-q-svc.txt.gz#_Jul_09_17_23_38_007083 and http://logs.openstack.org/35/521035/8/check/tempest-full/031b0b9/controller/logs/screen-q-svc.txt.gz#_Jul_09_17_23_38_010680

So it's not a neutron issue this time, and nova is still responding very slowly.

Also, there is one thing I can't understand: why does nova need to get an instance's security groups when the request made from the VM is GET /2009-04-04/meta-data/public-keys? That doesn't require security groups at all. Isn't that a bug in nova?

Change abandoned by Slawek Kaplonski (<email address hidden>) on branch: master
Review: https://review.opendev.org/671300

sean mooney (sean-k-mooney) wrote :

Nova builds the full set of metadata and then caches it, rather than building the minimal set for a given endpoint. This is required for config drive, so no, I don't believe it's a bug that nova is getting the security groups here, as it only does so when there is no metadata object for that instance in the cache.
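The behaviour Sean describes can be sketched minimally as follows. This is an assumption-labelled illustration, not nova's actual implementation: the expensive full-metadata build (which includes the neutron security-group lookups) runs only on a cache miss, and every later request for any endpoint of the same instance is served from the cache:

```python
# Sketch (hypothetical, not nova's real code): full metadata is built once
# per instance and cached; the endpoint requested doesn't matter.
_cache = {}
build_count = 0  # stands in for the slow neutron/db round-trips


def build_full_metadata(instance_uuid):
    global build_count
    build_count += 1  # this is the expensive step
    return {
        "instance-id": instance_uuid,
        "public-keys": ["ssh-rsa AAAA... (placeholder)"],
        "security-groups": ["default"],
    }


def get_metadata(instance_uuid, key):
    if instance_uuid not in _cache:  # miss: build everything, slowly
        _cache[instance_uuid] = build_full_metadata(instance_uuid)
    return _cache[instance_uuid][key]  # hit: no neutron call at all


uuid = "f123a13b-f41f-4b32-a117-dd9ba2114832"
get_metadata(uuid, "instance-id")   # cache miss: triggers the full build
get_metadata(uuid, "public-keys")   # cache hit: served from cache
print(build_count)  # 1
```

Under this model, only the first request per instance should be slow, which is exactly why Slawek's next observation (a slow second request) is surprising.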

Slawek Kaplonski (slaweq) wrote :

@Sean: so, assuming that this data is cached in nova, then it's even more strange to me.

Let's take the example from my previous comment. In the neutron metadata agent's log we see the request for /public-keys, which took 13 seconds:

http://logs.openstack.org/35/521035/8/check/tempest-full/031b0b9/controller/logs/screen-q-meta.txt.gz#_Jul_09_17_23_47_872908

but just before this call there is also a call for instance-id:

http://logs.openstack.org/35/521035/8/check/tempest-full/031b0b9/controller/logs/screen-q-meta.txt.gz#_Jul_09_17_23_47_351149

which took about 2 seconds. So my understanding is that nova should have fetched and cached all data related to this instance when the first call (for instance-id) was made, right, or am I misunderstanding something here? And if my understanding is correct, what could have taken nova 13 seconds to process the second call?

Slawek Kaplonski (slaweq) wrote :

After some talk with Sean on IRC, here is another idea to fix this in the gate: https://review.opendev.org/672715

Matt Riedemann (mriedem) wrote :

This is probably a duplicate of bug 1808010.

This issue was fixed in the openstack/tempest 22.0.0 release.

