Metadata responses are very slow sometimes
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
OpenStack Compute (nova) |
Expired
|
Undecided
|
Unassigned |
Bug Description
It happens from time to time in CI that VM spawned in test don't get public-keys from metadata service. Due to that SSH to instance using ssh-key is not possible thus test fails.
Example of such failures:
http://
http://
http://
In each of those cases it looks that neutron metadata agent was waiting more than 10 seconds for response from n-api-meta service:
I have no idea why nova is responding so slow but it would be worth if someone from Nova team would take a look into that.
Logstash query which can help to find another examples: http://
It is possible that in some cases of failed tests, reason of failure may be different but problem described above is quite common in those failed tests IMO.
Changed in neutron: | |
assignee: | nobody → Slawek Kaplonski (slaweq) |
importance: | Undecided → High |
tags: | added: gate-failure loadim |
tags: |
added: loadimpact removed: loadim |
Changed in neutron: | |
status: | New → Confirmed |
no longer affects: | neutron |
As part of building the instnace metadata nova calls neutron to retirve the security groups for an instnace. /github. com/openstack/ nova/blob/ master/ nova/api/ metadata/ base.py# L145 /github. com/openstack/ nova/blob/ 46a3bcd80b41e99 ec4923c7cf3d0f8 dd8505e97c/ nova/network/ security_ group/neutron_ driver. py#L374- L409
https:/
that function first looks up all the ports associated with an instance then looks up the security groups for each port
https:/
in the case of the metadata request that took 16s to respond logs.openstack. org/09/ 666409/ 7/check/ tempest- full/08f4c53/ controller/ logs/screen- n-api-meta. txt.gz# _Jul_11_ 23_43_16_ 704979
http://
if we look 16 second earlier in the log we fine this request that was dispatched to the api worker with pid 9232 logs.openstack. org/09/ 666409/ 7/check/ tempest- full/08f4c53/ controller/ logs/screen- n-api-meta. txt.gz# _Jul_11_ 23_43_00_ 523076
which is the same as the request that took 16 seconds.
http://
from that request we can extrtract the instance uuid which is 'f123a13b- f41f-4b32- a117-dd9ba21148 32'
this is the first request that was recived by that api worker and it took about 1.5 seconds for it to setup the db connect which completeed at logs.openstack. org/09/ 666409/ 7/check/ tempest- full/08f4c53/ controller/ logs/screen- n-api-meta. txt.gz# _Jul_11_ 23_43_01_ 357100
http://
if we look at the neutron log we can see the request form nova to look up the instace ports here logs.openstack. org/09/ 666409/ 7/check/ tempest- full/08f4c53/ controller/ logs/screen- q-svc.txt. gz#_Jul_ 11_23_43_ 04_463505
http://
we know this is the request because the port device_id matches the instace uuid logs.openstack. org/09/ 666409/ 7/check/ tempest- full/08f4c53/ controller/ logs/screen- q-svc.txt. gz#_Jul_ 11_23_43_ 04_575879
then imeditely after we have request for the security groups of that port
http://
that request does not complete for 12.166260 seconds logs.openstack. org/09/ 666409/ 7/check/ tempest- full/08f4c53/ controller/ logs/screen- q-svc.txt. gz#_Jul_ 11_23_43_ 16_660751
http://
so neutron completes the security group request at 23:43:16.660751 and nova completes
rendering all the metadata 0.044228 seconds later at 23:43:16.704979
in the case of http:// logs.openstack. org/45/ 645645/ 8/gate/ tempest- full/4d7874e/ controller/ logs/screen- n-api-meta. txt.gz# _Jul_09_ 01_43_56_ 549916 the instance uuid was 1a88f13c- 5609-4a4e- 9182-33654a7d41 88
i was not able to find a similar directly related issue however between starting the metadata request and finsihing it there was a 12.5s network delete request interleaved logs.openstack. org/45/ 645645/ 8/gate/ tempest- full/4d7874e/ controller/ logs/screen- q-svc.txt. gz#_Jul_ 09_01_43_ 51_804071
http://
we get the same waring for the other case i was not able to root cause logs.openstack. org/09/ 666409/ 7/check/ tempest- full/08f4c53/ controller/ logs/screen- n-super- cond.txt. gz#_Jul_ 11_23_42_ 39_020917 at about the right time.
http://
im not sure there is much more that we can see from the logs so we would need a reliable way to repoduce this to debug it futher.
for http:// logs.openstack. org/35/ 521035/ 8/check/ tempest- full/031b0b9/ controller/ logs/screen- n-api-meta. txt.gz# _Jul_09_ 17_23_47_ 871054 the ...