intermittent ssh failures in various scenario tests

Bug #1821912 reported by Slawek Kaplonski on 2019-03-27
Affects: neutron | Importance: High | Assigned to: LIU Yulong
tags: added: gate-failure tempest
removed: gate
tags: added: l3-ha
tags: added: l3-dvr-backlog
removed: l3-ha
Miguel Lavalle (minsel) wrote :

This bug is meant to be a team effort, since we don't have a pattern that fits all the intermittent ssh failures we see in our tests. So, if you want to help, just use the logstash query in the bug description and find something to debug. Please keep us informed of what you are working on by writing updates in this bug. Thanks :-)

Slawek Kaplonski (slaweq) wrote :

From what I was checking, in all those cases (or almost all) the instance was able to reach the metadata server, so connectivity from the instance to the router was fine.
So I think the issue is somewhere between the "tempest runner" and the router, or maybe the FIP is badly configured in the router itself, but of that I'm not sure.

We should try to find some way to do additional debugging when the issue happens, something like dumping the configuration of the interfaces in the router when it occurs.
Also, maybe we could do some DNM patch that adds tcpdump on every qg-XXX and qr-XXX interface in the router's namespace, to check later where packets are dropped.

Those are only some ideas of how to start debugging this issue, but if someone has other ideas, that's of course also great :)
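The DNM-patch idea can be sketched as a small helper that builds one tcpdump command per qg-/qr- interface in each qrouter namespace. This is an illustrative sketch, not neutron code; the caller supplies the namespace and interface names (e.g. collected from `ip netns list`):

```python
def capture_commands(router_namespaces, pcap_dir="/tmp"):
    """Build one tcpdump command per qg-/qr- interface so packet drops
    can be localised later.  router_namespaces maps a qrouter namespace
    name to the list of interfaces found inside it."""
    cmds = []
    for ns, devs in sorted(router_namespaces.items()):
        for dev in devs:
            # Only the gateway (qg-) and internal (qr-) router ports matter here.
            if dev.startswith(("qg-", "qr-")):
                cmds.append(["ip", "netns", "exec", ns,
                             "tcpdump", "-ni", dev,
                             "-w", f"{pcap_dir}/{ns}-{dev}.pcap"])
    return cmds
```

Running each command (as root) during a test run would show whether packets die on the qr- (internal) or qg- (gateway) side of the router.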

LIU Yulong (dragon889) wrote :

http://logs.openstack.org/04/637304/7/check/neutron-tempest-plugin-scenario-linuxbridge/e562437/testr_results.html.gz
http://logs.openstack.org/57/651357/1/check/neutron-tempest-dvr-ha-multinode-full/3a0f4ee/testr_results.html.gz
http://logs.openstack.org/82/643882/4/check/tempest-slow-py3/3432a15/testr_results.html.gz

You may have noticed these are 3 different checks, but they have similar failure logs. Here is what I found:
(1) all of them have the following floating IP DOWN state log:

Response - Headers: {'content-type': 'application/json', 'content-length': '875', 'x-openstack-request-id': 'req-51260e4b-e58a-4010-89f9-8b721cfbf274', 'date': 'Mon, 25 Mar 2019 15:14:41 GMT', 'connection': 'close', 'status': '201', 'content-location': 'http://158.69.67.5:9696/v2.0/floatingips'}
        Body: b'{"floatingip": {"id": "2819b03c-efc1-4b9c-b3f1-709be52d3226", "tenant_id": "3a06df8878b94df5888472572bad7661", "floating_ip_address": "172.24.5.232", "floating_network_id": "beea0fd8-378a-4c91-a820-dde4f9759c8d", "router_id": "60850e98-af58-4c4c-8f02-e613e74d070a", "port_id": "dd095439-8cd1-4273-a9d4-638171bacb61", "fixed_ip_address": "10.1.0.3", "status": "DOWN", "project_id": "3a06df8878b94df5888472572bad7661", "description": "", "qos_policy_id": null, "port_details": {"name": "tempest-port-1415223031", "network_id": "97a59130-9f64-469b-acc2-e8e5c7f101ec", "mac_address": "fa:16:3e:77:8f:f5", "admin_state_up": true, "status": "ACTIVE", "device_id": "703c47eb-fb28-432e-bc02-c6b7308bbf22", "device_owner": "compute:nova"}, "dns_domain": "", "dns_name": "", "tags": [], "created_at": "2019-03-25T15:14:36Z", "updated_at": "2019-03-25T15:14:36Z", "revision_number": 0}}'

Response - Headers: {u'content-type': 'application/json', u'server': 'Apache/2.4.29 (Ubuntu)', u'date': 'Tue, 09 Apr 2019 19:59:03 GMT', u'content-length': '794', 'status': '200', u'x-openstack-request-id': 'req-dd2f3b1f-cf38-48d9-89fc-fcf01f74191d', 'content-location': 'https://149.202.189.177:9696/v2.0/floatingips/58373a20-471b-4c07-943f-003b5406d1e3', u'connection': 'close'}
        Body: {"floatingip": {"router_id": "830816db-b9ed-4683-8e13-24f5e3d07cad", "status": "DOWN", "description": "", "tags": [], "tenant_id": "7f9af55aaed34654b39faeaa75df5159", "created_at": "2019-04-09T19:58:38Z", "updated_at": "2019-04-09T19:59:02Z", "floating_network_id": "5dc501ef-3da6-4d88-8b72-189d5b1f75e4", "port_details": {"status": "ACTIVE", "name": "", "admin_state_up": true, "network_id": "012b87c3-3dff-4929-83c6-df007b6b4ffe", "device_owner": "compute:nova", "mac_address": "fa:16:3e:1e:cd:83", "device_id": "fe17980f-c80a-40ac-b300-27e09a470c91"}, "fixed_ip_address": "10.1.0.8", "floating_ip_address": "172.24.5.88", "revision_number": 1, "project_id": "7f9af55aaed34654b39faeaa75df5159", "port_id": "24431c68-dbe1-4546-88ff-40457f3992f3", "id": "58373a20-471b-4c07-943f-003b5406d1e3"}}

2019-04-04 08:30:08,077 28397 INFO [tempest.scenario.manager] FloatingIP: {'id': 'aab6996a-a477-4b7f-a8f6-dda02091d18b', 'tenant_id': 'e590fad8c3984c84adfe3bad5f3830ac', 'floating_ip_address': '172.24.5.121', 'floating_network_id': '27f82c48-3f0d-4bc6-8839-dc5498191f63', 'router_id': '9709370a-babb-4891-aa9f-2ff51a84b7...


LIU Yulong (dragon889) wrote :

I also noticed that some routers were not actually fully set up while nova was already booting the VM. This may be related to the bug:
[L3] DVR router in compute node was not up but nova port needs its functionality
https://bugs.launchpad.net/neutron/+bug/1813787
Mostly, the ssh keypair setup may rely on this.

Slawek Kaplonski (slaweq) wrote :

@LIU: please note that in your 3rd example (http://logs.openstack.org/82/643882/4/check/tempest-slow-py3/3432a15/testr_results.html.gz) it was a different (and known) issue: the VM got a kernel panic. So that case can easily be excluded from our analysis :)

This finding of the FIP in DOWN state is very interesting. I wonder why it is in such a state.
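One way to make the DOWN state visible in the tests themselves is a small wait loop that polls the floating IP status before the first SSH attempt. A minimal sketch, where `fetch_status` stands in for a real neutron client call (illustrative, not actual tempest code):

```python
import time

def wait_for_fip_active(fetch_status, timeout=60, interval=2,
                        clock=time.monotonic, sleep=time.sleep):
    """Poll fetch_status() until it returns 'ACTIVE' or the timeout
    expires.  Returns the last observed status either way, so a failure
    message can say 'FIP stuck in DOWN' instead of 'SSH timed out'."""
    deadline = clock() + timeout
    status = fetch_status()
    while status != "ACTIVE" and clock() < deadline:
        sleep(interval)
        status = fetch_status()
    return status
```

In the failed runs above, such a loop would have reported that the FIP never left DOWN, turning a generic SSH timeout into an actionable status.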

LIU Yulong (dragon889) wrote :

http://logs.openstack.org/04/637304/7/check/neutron-tempest-plugin-scenario-linuxbridge/e562437/testr_results.html.gz
http://logs.openstack.org/57/651357/1/check/neutron-tempest-dvr-ha-multinode-full/3a0f4ee/testr_results.html.gz

Both have failure logs related to metadata requests:
(1)Lease of 10.10.210.139 obtained, lease time 86400
cirros-ds 'net' up at 14.64
checking http://169.254.169.254/2009-04-04/instance-id
failed 1/20: up 15.12. request failed
.....
failed 20/20: up 97.28. request failed
failed to read iid from metadata. tried 20

This may be because the router is not UP (bug/1813787).

(2) Lease of 10.1.0.8 obtained, lease time 86400
route: SIOCADDRT: File exists
WARN: failed: route add -net "0.0.0.0/0" gw "10.1.0.1"
checking http://169.254.169.254/2009-04-04/instance-id
successful after 1/20 tries: up 49.77. iid=i-00000004
failed to get http://169.254.169.254/2009-04-04/meta-data/public-keys
warning: no ec2 metadata for public-keys
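The cirros boot output above comes from a simple retry loop in its datasource script. A rough Python equivalent of that behaviour, where `fetch` stands in for the HTTP request to 169.254.169.254 (illustrative, not the actual cirros code):

```python
def fetch_with_retries(fetch, tries=20, interval=4, sleep=lambda s: None):
    """Mimic the cirros datasource: try `tries` times, pausing between
    attempts, and report which attempt succeeded.  Returns
    (attempt_number, value) on success or (None, None) if every attempt
    failed -- the 'failed 20/20' case seen in the console log."""
    for attempt in range(1, tries + 1):
        try:
            return attempt, fetch()
        except OSError:  # request failed; wait and retry
            sleep(interval)
    return None, None
```

This bounds the guest's patience to roughly tries × interval seconds, which is why a router that comes up too late leaves the instance permanently without its keypair.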

LIU Yulong (dragon889) wrote :

message:"line 107, in _get_ssh_connection" AND build_status:"FAILURE" AND build_branch:"master" AND NOT build_name:"neutron-tempest-plugin-dvr-multinode-scenario" AND project:"openstack/neutron" AND NOT build_name:"networking-ovn-tempest-dsvm-ovs-release"

Let's use this search to skip the `ovn` tempest logs if necessary.

LIU Yulong (dragon889) wrote :

An example: tempest tries to ssh to the VM, but cirros is still booting.
http://logs.openstack.org/56/648456/4/check/tempest-full/ad806b2/testr_results.html.gz

Related fix proposed to branch: master
Review: https://review.opendev.org/654815

Slawek Kaplonski (slaweq) wrote :

Today I spent some time to investigate one of the problems which I found.

It was http://logs.openstack.org/47/653947/3/gate/tempest-slow-py3/629abee/testr_results.html.gz and, as can be checked there, the test failed due to paramiko.ssh_exception.AuthenticationException, which means that the problem was not with the connection to the instance but with authentication to the guest VM as the cirros user.

So I looked into the instance's console log and found something like:

Sending discover...
Sending select for 10.1.0.10...
Lease of 10.1.0.10 obtained, lease time 86400
route: SIOCADDRT: File exists
WARN: failed: route add -net "0.0.0.0/0" gw "10.1.0.1"
checking http://169.254.169.254/2009-04-04/instance-id
successful after 1/20 tries: up 19.66. iid=i-00000010
failed to get http://169.254.169.254/2009-04-04/meta-data/public-keys
warning: no ec2 metadata for public-keys
failed to get http://169.254.169.254/2009-04-04/user-data
warning: no ec2 metadata for user-data

So it looks like the reason for the problem was that public-keys wasn't provided by the metadata service to the instance.
What is also interesting is that connectivity to metadata worked overall, as the instance-id was fetched from there properly.

So let's now check the neutron-metadata-agent logs. I found there that the request for public-keys arrived:
http://logs.openstack.org/47/653947/3/gate/tempest-slow-py3/629abee/controller/logs/screen-q-meta.txt.gz#_Apr_25_14_56_39_349166
And the neutron metadata agent properly passed it on to the nova metadata server.

Let's then check the nova metadata api logs: http://logs.openstack.org/47/653947/3/gate/tempest-slow-py3/629abee/controller/logs/screen-n-api-meta.txt.gz#_Apr_25_14_56_39_902289
The request arrives there from the neutron-metadata agent, but I don't see any response to it.
So maybe in this case it is some issue on nova's side? I think we should ask someone from nova to take a look at this.

@Slaweq: I think the public-keys query was answered by nova but it was delayed. I think this is the answer http://logs.openstack.org/47/653947/3/gate/tempest-slow-py3/629abee/controller/logs/screen-n-api-meta.txt.gz#_Apr_25_14_56_39_902289

Apr 25 14:56:52.399759 ubuntu-bionic-ovh-gra1-0005704258 <email address hidden>[16939]: [pid: 16940|app: 0|req: 154/313] 149.202.178.30 () {40 vars in 705 bytes} [Thu Apr 25 14:56:39 2019] GET /2009-04-04/meta-data/public-keys => generated 49 bytes in 12514 msecs (HTTP/1.1 200) 3 headers in 99 bytes (1 switches on core 0)

For some reason it took 12.5 seconds for nova to answer the request. But I don't see the direct reason why.

There is a sizable gap in the nova logs between Apr 25 14:56:39.908640 - Apr 25 14:56:51.978379 where the nova metadata code does not log anything.

Around the same time in the nova-api logs there is also a hiccup visible.
http://logs.openstack.org/47/653947/3/gate/tempest-slow-py3/629abee/controller/logs/screen-n-api.txt.gz#_Apr_25_14_56_37_968444

"AMQP server on 149.202.178.30:5672 is unreachable"

Could this be a hiccup of the VM hosting the devstack, or a hiccup of the underlying cloud?
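The 12-second gap above can be found mechanically instead of by eyeballing the log. A small sketch (an illustrative helper, not an existing tool) that parses the journald-style timestamps used in these logs and reports the longest silence between consecutive lines:

```python
from datetime import datetime

def largest_gap(lines, year=2019):
    """Return (gap_seconds, line_before_gap) for the longest pause
    between consecutive log lines whose prefix looks like
    'Apr 25 14:56:39.908640' (journald output has no year, so one
    is supplied by the caller)."""
    stamped = []
    for line in lines:
        try:
            ts = datetime.strptime(f"{year} {line[:22]}", "%Y %b %d %H:%M:%S.%f")
        except ValueError:
            continue  # line has no parseable timestamp prefix
        stamped.append((ts, line))
    best = (0.0, None)
    for (t1, l1), (t2, _) in zip(stamped, stamped[1:]):
        gap = (t2 - t1).total_seconds()
        if gap > best[0]:
            best = (gap, l1)
    return best
```

Run over the screen-n-api-meta log this reports the ~12 s stall and the last line before it, which is a quick way to correlate hiccups across services.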

Reviewed: https://review.opendev.org/654815
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=9d60716cf1d61286a684f20ef8e05c77a0df5aa3
Submitter: Zuul
Branch: master

commit 9d60716cf1d61286a684f20ef8e05c77a0df5aa3
Author: LIU Yulong <email address hidden>
Date: Tue Apr 23 15:27:02 2019 +0800

    Add update_id for ResourceUpdate

    Add a unique id for resource update, then we can calculate
    the resource processing time and track it.

    Related-Bug: #1825152
    Related-Bug: #1824911
    Related-Bug: #1821912
    Related-Bug: #1813787

    Change-Id: Ib4d197c6c180c32860964440882393794aabb6ef

Slawek Kaplonski (slaweq) wrote :

I focused on analysing the issue with the authentication failures.
First I tried to reproduce this issue locally, to check if a slow answer from nova-metadata-api can really cause this problem. So I added a simple time.sleep(15) in https://github.com/openstack/nova/blob/e5b819fb75320f192f79e92384bac8be043e9600/nova/api/metadata/handler.py#L94 and I got exactly the same error from the "ec2metadata" script which is in the cirros image.
So it looks like some kind of slowdown in nova-metadata-api really can cause this problem.

The next step was to check other similar failures. I found them with a query like http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%22failed%20to%20get%20http%3A%2F%2F169.254.169.254%2F2009-04-04%2Fmeta-data%2Fpublic-keys%5C%22

I checked the logs from http://logs.openstack.org/66/656666/1/check/tempest-multinode-full/4d77549/job-output.txt and saw that there were also about 12 seconds of "nothing" in the nova-metadata logs here: http://logs.openstack.org/66/656666/1/check/tempest-multinode-full/4d77549/controller/logs/screen-n-api-meta.txt.gz#_May_01_19_37_47_835271 (it matched the time when the broken instance was trying to get public-keys from metadata).
I also found that at almost the same time there was a gap in the nova-api logs: http://logs.openstack.org/66/656666/1/check/tempest-multinode-full/4d77549/controller/logs/screen-n-api.txt.gz#_May_01_19_37_44_331631 and after that some errors with the rabbitmq connection. Maybe those are related.

In the other case which I checked, there wasn't any rabbitmq error in the nova logs, but the request for public-keys took more than 17s in the neutron-metadata logs: http://logs.openstack.org/91/655791/4/check/neutron-tempest-linuxbridge/1ce0a65/logs/screen-q-meta.txt.gz#_May_02_21_35_14_860194

So the conclusion is that those errors definitely come from a slowdown in the metadata service, on either neutron's or nova's side. But for now I still don't know what the reason (or reasons) for the slowdown is.
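The sleep-injection experiment can be reproduced without a devstack: a local HTTP server with an artificial delay, and a client whose timeout is shorter than that delay, fails in the same way the guest's fetch did. All names here are illustrative; the real injection point was nova's metadata handler:

```python
import threading
import time
import urllib.request
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

DELAY = 1.0  # artificial server-side slowdown, like the injected time.sleep()

class SlowMetadataHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        time.sleep(DELAY)  # simulate the slow nova-metadata-api answer
        body = b"i-00000004"
        try:
            self.send_response(200)
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        except OSError:
            pass  # client already gave up and closed the connection

    def log_message(self, *args):  # keep output quiet
        pass

def fetch_instance_id(url, timeout):
    """Return the body, or None if the request timed out (the guest's view)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.read()
    except OSError:  # covers socket timeouts and URLError
        return None

def demo():
    server = ThreadingHTTPServer(("127.0.0.1", 0), SlowMetadataHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    url = f"http://127.0.0.1:{server.server_address[1]}/instance-id"
    impatient = fetch_instance_id(url, timeout=0.2)  # guest-style short timeout
    patient = fetch_instance_id(url, timeout=5.0)    # long enough to succeed
    server.shutdown()
    return impatient, patient
```

The short-timeout call returns None while the long-timeout call succeeds, which matches the thread's finding: nova did answer, just 12.5 s too late for the guest.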

Fix proposed to branch: master
Review: https://review.opendev.org/659009

Changed in neutron:
assignee: nobody → LIU Yulong (dragon889)
status: Confirmed → In Progress
LIU Yulong (dragon889) wrote :

Interim summary:
1. the floating IP is in DOWN status while tempest is trying to ssh
2. the VM is still booting while tempest is trying to ssh
3. the VM port is still in DOWN status while tempest is trying to ssh
4. router processing is time-consuming, which may cause the instance to fail to retrieve metadata

Reviewed: https://review.opendev.org/660758
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=fe4fc33f1c1bbb655e99733edf762e3e9debdd3e
Submitter: Zuul
Branch: stable/stein

commit fe4fc33f1c1bbb655e99733edf762e3e9debdd3e
Author: LIU Yulong <email address hidden>
Date: Tue Apr 23 15:27:02 2019 +0800

    Add update_id for ResourceUpdate

    Add a unique id for resource update, then we can calculate
    the resource processing time and track it.

    Related-Bug: #1825152
    Related-Bug: #1824911
    Related-Bug: #1821912
    Related-Bug: #1813787

    Change-Id: Ib4d197c6c180c32860964440882393794aabb6ef
    (cherry picked from commit 9d60716cf1d61286a684f20ef8e05c77a0df5aa3)

tags: added: in-stable-stein