tempest volume scenarios periodically fail /w SSHTimeout

Bug #1355573 reported by Adam Gandelman
This bug affects 10 people
Affects                   Status     Importance  Assigned to  Milestone
OpenStack Compute (nova)  Won't Fix  Undecided   Unassigned
grenade                   Invalid    Undecided   Unassigned

Bug Description

Not sure if this is a duplicate, but I've noticed tempest volume scenarios failing as part of grenade tests, e.g.:

http://logs.openstack.org/07/112707/1/check/check-grenade-dsvm-partial-ncpu/43dbf72/logs/testr_results.html.gz

Traceback (most recent call last):
  File "tempest/test.py", line 128, in wrapper
    return f(self, *func_args, **func_kwargs)
  File "tempest/scenario/test_volume_boot_pattern.py", line 163, in test_volume_boot_pattern
    keypair)
  File "tempest/scenario/test_volume_boot_pattern.py", line 116, in _ssh_to_server
    private_key=keypair.private_key)
  File "tempest/scenario/manager.py", line 484, in get_remote_client
    linux_client.validate_authentication()
  File "tempest/common/utils/linux/remote_client.py", line 53, in validate_authentication
    self.ssh_client.test_connection_auth()
  File "tempest/common/ssh.py", line 150, in test_connection_auth
    connection = self._get_ssh_connection()
  File "tempest/common/ssh.py", line 87, in _get_ssh_connection
    password=self.password)
SSHTimeout: Connection to the 172.24.4.2 via SSH timed out.
User: cirros, Password: None

Checking logstash, these seem to be happening frequently during grenade jobs, failing both the test_snapshot_pattern and test_volume_boot_pattern tests: 141 failures over the last 7 days.

http://logstash.openstack.org/#eyJzZWFyY2giOiJtZXNzYWdlOlwiU1NIVGltZW91dDogQ29ubmVjdGlvbiB0byB0aGUgMTcyLjI0LjQuMiB2aWEgU1NIIHRpbWVkIG91dC5cIiIsImZpZWxkcyI6W10sIm9mZnNldCI6MCwidGltZWZyYW1lIjoiNjA0ODAwIiwiZ3JhcGhtb2RlIjoiY291bnQiLCJ0aW1lIjp7InVzZXJfaW50ZXJ2YWwiOjB9LCJzdGFtcCI6MTQwNzgxMzYzNjg1Nn0=
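
For reference, the part of that logstash URL after '#' is base64-encoded JSON describing the saved search. A small sketch to decode it locally (plain Python, nothing OpenStack-specific; the script name is just illustrative):

import base64
import json
import sys

# Usage: python decode_logstash_url.py '<logstash URL>'
# The fragment after '#' in logstash.openstack.org dashboard URLs is
# base64-encoded JSON describing the saved search.
url = sys.argv[1]
fragment = url.split('#', 1)[1]
# Re-add any '=' padding that was lost when the URL was copied around.
fragment += '=' * (-len(fragment) % 4)
query = json.loads(base64.b64decode(fragment).decode('utf-8'))
print(json.dumps(query, indent=2))

For the URL above this should print a message: search for the SSHTimeout string with a 604800-second (7-day) timeframe, matching the failure count quoted above.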

Tags: testing
Revision history for this message
Matt Riedemann (mriedem) wrote :

Yeah, this is a known thing; the ssh timeout failure is generic, so it's hard to classify in elastic-recheck:

http://logs.openstack.org/71/179971/2/gate/gate-grenade-dsvm/eb2fb74/logs/grenade.sh.txt.gz#_2015-05-20_07_26_25_655

I think we'll need a change to tempest to log a specific error for the test in this case so we can track this bug.
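
Something along these lines in tempest's scenario manager would give elastic-recheck a unique string to key on. This is an illustrative sketch only; the names are approximate and it is not necessarily what the actual tempest change does:

# Sketch of a method on tempest.scenario.manager.ScenarioTest -- illustrative
# only, not a copy of the real tempest code.
import logging

from tempest.common.utils.linux import remote_client

LOG = logging.getLogger(__name__)


def get_remote_client(self, ip_address, username, private_key, server=None):
    linux_client = remote_client.RemoteClient(ip_address, username,
                                              pkey=private_key)
    try:
        linux_client.validate_authentication()
    except Exception as e:
        # Emit one specific, greppable message per failure so an
        # elastic-recheck query can match it, and dump the guest console
        # to help with debugging.
        LOG.exception('Initializing SSH connection to %s failed. Error: %s',
                      ip_address, e)
        if server is not None:
            self._log_console_output(servers=[server])
        raise
    return linux_client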

Revision history for this message
Matt Riedemann (mriedem) wrote :

Here is a change to tempest to add some more contextual logging around the error:

https://review.openstack.org/#/c/184815/

Revision history for this message
Matt Riedemann (mriedem) wrote :
Sean Dague (sdague)
summary: - grenade volume scenarios periodically fail /w SSHTimeout
+ tempest volume scenarios periodically fail /w SSHTimeout
Revision history for this message
Tony Breeds (o-tony) wrote :

Added to nova as it's not grenade-specific and I saw it in a tempest-dsvm-full job.

tags: added: testing
Changed in nova:
status: New → Confirmed
Sean Dague (sdague)
Changed in nova:
status: Confirmed → Won't Fix
Changed in grenade:
status: New → Invalid
Revision history for this message
Sean Dague (sdague) wrote :

None of the links in this bug are valid any more. If this is still relevant, a new bug can be filed.

Revision history for this message
Ken'ichi Ohmichi (oomichi) wrote :

This report is old, but I am temporarily writing down some debugging info on it here.

The latest issue is http://logs.openstack.org/72/282872/2/check/gate-tempest-dsvm-postgres-full/db55de4/logs/

The following tests failed:
- test_minimum_basic_scenario
- test_shelve_instance
- test_shelve_volume_backed_instance
- test_snapshot_pattern
- test_volume_boot_pattern
- test_volume_boot_pattern

All the tests failed in the same way:

Traceback (most recent call last):
  File "tempest/test.py", line 113, in wrapper
    return f(self, *func_args, **func_kwargs)
  File "tempest/scenario/test_shelve_instance.py", line 98, in test_shelve_instance
    self._create_server_then_shelve_and_unshelve()
  File "tempest/scenario/test_shelve_instance.py", line 82, in _create_server_then_shelve_and_unshelve
    private_key=keypair['private_key'])
  File "tempest/scenario/manager.py", line 622, in create_timestamp
    private_key=private_key)
  File "tempest/scenario/manager.py", line 377, in get_remote_client
    linux_client.validate_authentication()
  File "tempest/common/utils/linux/remote_client.py", line 51, in validate_authentication
    self.ssh_client.test_connection_auth()
  File "/opt/stack/new/tempest/.tox/full/local/lib/python2.7/site-packages/tempest_lib/common/ssh.py", line 173, in test_connection_auth
    connection = self._get_ssh_connection()
  File "/opt/stack/new/tempest/.tox/full/local/lib/python2.7/site-packages/tempest_lib/common/ssh.py", line 88, in _get_ssh_connection
    password=self.password)
tempest_lib.exceptions.SSHTimeout: Connection to the 172.24.5.10 via SSH timed out.
User: cirros, Password: None

Revision history for this message
Ken'ichi Ohmichi (oomichi) wrote :

Picking up tempest.scenario.test_volume_boot_pattern.TestVolumeBootPattern.test_volume_boot_pattern here:

date: 2016-02-21 16:55:35.858
server-id: d8bc1ba6-1b57-41c3-bf8b-8c4981f0c6fd
ip-address: 172.24.5.9

Confirmed the IP address is allocated to the server:
2016-02-21 16:52:10,047 18343 INFO [tempest_lib.common.rest_client] Request (TestVolumeBootPattern:test_volume_boot_pattern): 202 POST http://127.0.0.1:8774/v2.1/b18fde6446224a348fb327cccbc0bc30/servers/d8bc1ba6-1b57-41c3-bf8b-8c4981f0c6fd/action 0.945s
2016-02-21 16:52:10,048 18343 DEBUG [tempest_lib.common.rest_client] Request - Headers: {'Accept': 'application/json', 'X-Auth-Token': '<omitted>', 'Content-Type': 'application/json'}
        Body: {"addFloatingIp": {"address": "172.24.5.9"}}

Before the timeout, the following message was output several times:
2016-02-21 16:55:35.857 18343 ERROR tempest_lib.common.ssh Traceback (most recent call last):
2016-02-21 16:55:35.857 18343 ERROR tempest_lib.common.ssh File "/opt/stack/new/tempest/.tox/full/local/lib/python2.7/site-packages/tempest_lib/common/ssh.py", line 75, in _get_ssh_connection
2016-02-21 16:55:35.857 18343 ERROR tempest_lib.common.ssh timeout=self.channel_timeout, pkey=self.pkey)
2016-02-21 16:55:35.857 18343 ERROR tempest_lib.common.ssh File "/opt/stack/new/tempest/.tox/full/local/lib/python2.7/site-packages/paramiko/client.py", line 325, in connect
2016-02-21 16:55:35.857 18343 ERROR tempest_lib.common.ssh t.start_client()
2016-02-21 16:55:35.857 18343 ERROR tempest_lib.common.ssh File "/opt/stack/new/tempest/.tox/full/local/lib/python2.7/site-packages/paramiko/transport.py", line 492, in start_client
2016-02-21 16:55:35.857 18343 ERROR tempest_lib.common.ssh raise e
2016-02-21 16:55:35.857 18343 ERROR tempest_lib.common.ssh SSHException: Error reading SSH protocol banner
2016-02-21 16:55:35.857 18343 ERROR tempest_lib.common.ssh
2016-02-21 16:55:35,858 18343 ERROR [tempest.scenario.manager] (TestVolumeBootPattern:test_volume_boot_pattern) Initializing SSH connection to 172.24.5.9 failed. Error: Connection to the 172.24.5.9 via SSH timed out.
User: cirros, Password: None

Revision history for this message
Ken'ichi Ohmichi (oomichi) wrote :

2016-02-21 16:54:40,286 18343

ERROR [paramiko.transport] Exception: Error reading SSH protocol banner[Errno 104] Connection reset by peer
ERROR [paramiko.transport] Traceback (most recent call last):
ERROR [paramiko.transport] File "/opt/stack/new/tempest/.tox/full/local/lib/python2.7/site-packages/paramiko/transport.py", line 1710, in run
ERROR [paramiko.transport] self._check_banner()
ERROR [paramiko.transport] File "/opt/stack/new/tempest/.tox/full/local/lib/python2.7/site-packages/paramiko/transport.py", line 1858, in _check_banner
ERROR [paramiko.transport] raise SSHException('Error reading SSH protocol banner' + str(e))
ERROR [paramiko.transport] SSHException: Error reading SSH protocol banner[Errno 104] Connection reset by peer
ERROR [paramiko.transport]
WARNING [tempest_lib.common.ssh] Failed to establish authenticated ssh connection to cirros@172.24.5.9 (Error reading SSH protocol banner[Errno 104] Connection reset by peer). Number attempts: 16. Retry after 17 seconds.
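
The retry behaviour behind that WARNING is roughly the following (a paraphrased sketch, not the verbatim tempest_lib code): keep reconnecting with a slowly growing delay until the overall ssh timeout expires, then give up with SSHTimeout.

import socket
import time

import paramiko


def get_ssh_connection(host, username, pkey, timeout=300, sleep=1.5, backoff=1):
    # Paraphrased sketch of tempest_lib.common.ssh.Client._get_ssh_connection.
    client = paramiko.SSHClient()
    client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    start = time.time()
    attempts = 0
    delay = sleep
    while True:
        try:
            client.connect(host, username=username, pkey=pkey,
                           look_for_keys=False, timeout=10)
            return client
        except (EOFError, socket.error, paramiko.SSHException) as exc:
            attempts += 1
            delay += backoff
            print('Failed to establish authenticated ssh connection to '
                  '%s@%s (%s). Number attempts: %s. Retry after %s seconds.'
                  % (username, host, exc, attempts, delay))
            if time.time() - start > timeout:
                raise  # tempest_lib re-raises this as SSHTimeout
            time.sleep(delay)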

paramiko/transport.py
 1844     def _check_banner(self):
 1845         # this is slow, but we only have to do it once
 1846         for i in range(100):
 1847             # give them 15 seconds for the first line, then just 2 seconds
 1848             # each additional line. (some sites have very high latency.)
 1849             if i == 0:
 1850                 timeout = self.banner_timeout
 1851             else:
 1852                 timeout = 2
 1853             try:
 1854                 buf = self.packetizer.readline(timeout)
 1855             except ProxyCommandFailure:
 1856                 raise
 1857             except Exception as e:
 1858                 raise SSHException('Error reading SSH protocol banner' + str(e))

The original exception is [Errno 104] Connection reset by peer.
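
What paramiko is waiting for in _check_banner can be reproduced at the socket level with a probe like this (illustrative sketch; the address is the floating IP from this run):

import socket


def read_ssh_banner(host, port=22, timeout=15):
    # Open a TCP connection and read the SSH identification string, the
    # same thing paramiko waits for before key exchange.
    sock = socket.create_connection((host, port), timeout=timeout)
    try:
        return sock.recv(256)  # normally something like 'SSH-2.0-dropbear_...'
    finally:
        sock.close()


try:
    print(read_ssh_banner('172.24.5.9'))
except socket.error as exc:
    # In the failing runs this is where "[Errno 104] Connection reset by
    # peer" appears: the TCP connection is answered (possibly by the
    # floating-ip NAT) but reset before any banner arrives.
    print('no banner: %s' % exc)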

Revision history for this message
Ken'ichi Ohmichi (oomichi) wrote :

screen-c-api.txt.gz when booting from volume:

2016-02-21 16:52:01.832 INFO cinder.api.openstack.wsgi [req-1c9f1cd4-022c-4a8d-9b15-2cc092eb9522 tempest-TestVolumeBootPattern-258726986] POST http://127.0.0.1:8776/v2/b18fde6446224a348fb327cccbc0bc30/volumes/9bed7b85-4472-44cd-9a5b-375010a5363a/action
2016-02-21 16:52:01.832 DEBUG cinder.api.openstack.wsgi [req-1c9f1cd4-022c-4a8d-9b15-2cc092eb9522 tempest-TestVolumeBootPattern-258726986] Action body: {"os-attach": {"instance_uuid": "d8bc1ba6-1b57-41c3-bf8b-8c4981f0c6fd", "mountpoint": "/dev/vda", "mode": "rw"}} get_method /opt/stack/new/cinder/cinder/api/openstack/wsgi.py:1093
2016-02-21 16:52:01.958 DEBUG object [req-1c9f1cd4-022c-4a8d-9b15-2cc092eb9522 tempest-TestVolumeBootPattern-258726986] Cinder object Volume has no attribute named: type get /opt/stack/new/cinder/cinder/objects/base.py:327
[..]
2016-02-21 16:52:03.108 INFO eventlet.wsgi.server [req-1c9f1cd4-022c-4a8d-9b15-2cc092eb9522 tempest-TestVolumeBootPattern-258726986] 127.0.0.1 "POST /v2/b18fde6446224a348fb327cccbc0bc30/volumes/9bed7b85-4472-44cd-9a5b-375010a5363a/action HTTP/1.1" status: 202 len: 211 time: 1.2800739

Revision history for this message
Ken'ichi Ohmichi (oomichi) wrote :

The volume: 9bed7b85-4472-44cd-9a5b-375010a5363a
The server: d8bc1ba6-1b57-41c3-bf8b-8c4981f0c6fd
The network: 3acd1838-7435-48bb-bc86-2da830b5a1c4
The target address: 172.24.5.9 (Floating ip)

16:51:57.655: (nova-api) Received the boot-from-volume request.
16:51:58.615: (nova-api) Succeeded in booting from volume.
16:52:03.108: (cinder-api) Succeeded in attaching the volume to the server.
16:52:10.050: (nova-api) Succeeded in adding the target floating IP address to the server.
16:52:13.046: (tempest) First failure to establish an authenticated ssh connection.
16:55:35.858: (tempest) tempest.scenario.test_volume_boot_pattern.TestVolumeBootPattern.test_volume_boot_pattern failed.

Revision history for this message
Ken'ichi Ohmichi (oomichi) wrote :

The volume: 9bed7b85-4472-44cd-9a5b-375010a5363a
The server: d8bc1ba6-1b57-41c3-bf8b-8c4981f0c6fd
The network: 3acd1838-7435-48bb-bc86-2da830b5a1c4
The fixed-ip: 10.1.0.9
The target address: 172.24.5.9 (Floating ip)

16:51:57.655: (nova-api) Received the boot-from-volume request.
16:51:58.615: (nova-api) Succeeded in booting from volume.
16:52:00.618: (nova-net) Allocated a fixed IP to the server.
16:52:03.108: (cinder-api) Succeeded in attaching the volume to the server.
16:52:10.050: (nova-api) Succeeded in adding the target floating IP address to the server.
16:52:13.046: (tempest) First failure to establish an authenticated ssh connection.
16:55:35.858: (tempest) tempest.scenario.test_volume_boot_pattern.TestVolumeBootPattern.test_volume_boot_pattern failed.

The console log contains the following:

cirros-ds 'local' up at 8.49
no results found for mode=local. up 8.95. searched: nocloud configdrive ec2
Starting network...
udhcpc (v1.20.1) started
Sending discover...
Sending select for 10.1.0.9...
Lease of 10.1.0.9 obtained, lease time 86400
cirros-ds 'net' up at 10.49
checking http://169.254.169.254/2009-04-04/instance-id
failed 1/20: up 10.82. request failed
failed 2/20: up 13.55. request failed
failed 3/20: up 15.76. request failed
failed 4/20: up 18.08. request failed
failed 5/20: up 20.29. request failed
failed 6/20: up 22.75. request failed
failed 7/20: up 24.95. request failed
failed 8/20: up 27.13. request failed
failed 9/20: up 29.56. request failed
failed 10/20: up 31.78. request failed
failed 11/20: up 33.95. request failed
failed 12/20: up 36.13. request failed
failed 13/20: up 38.52. request failed
failed 14/20: up 40.93. request failed
failed 15/20: up 43.13. request failed
failed 16/20: up 45.41. request failed
failed 17/20: up 47.79. request failed
failed 18/20: up 49.98. request failed
failed 19/20: up 52.17. request failed
failed 20/20: up 54.41. request failed
failed to read iid from metadata. tried 20
no results found for mode=net. up 56.82. searched: nocloud configdrive ec2
failed to get instance-id of datasource
Starting dropbear sshd: generating rsa key... generating dsa key... OK
=== system information ===
Platform: OpenStack Foundation OpenStack Nova

The difference between a normal run and this failing run is:

---------------------
$ diff -u normal.txt problem.txt
--- normal.txt 2016-02-23 09:22:14.920761998 -0800
+++ problem.txt 2016-02-23 09:22:46.568761998 -0800
@@ -1,11 +1,35 @@
-cirros-ds 'local' up at 9.56
-found datasource (configdrive, local)
+cirros-ds 'local' up at 8.49
+no results found for mode=local. up 8.95. searched: nocloud configdrive ec2
 Starting network...
 udhcpc (v1.20.1) started
 Sending discover...
-Sending select for 10.1.8.4...
-Lease of 10.1.8.4 obtained, lease time 86400
+Sending select for 10.1.0.9...
+Lease of 10.1.0.9 obtained, lease time 86400
+cirros-ds 'net' up at 10.49
+checking http://169.254.169.254/2009-04-04/instance-id
+failed 1/20: up 10.82. request failed
+failed 2/20: up 13.55. request failed
+failed 3/20: up 15.76. request failed
+failed 4/20: up 18.08. request failed
+failed 5/20: up 20.29. request failed
+failed 6/20: up 22.75. request failed
+failed 7/20: up 24.95. request...

