tempest volume scenarios periodically fail /w SSHTimeout

Bug #1355573 reported by Adam Gandelman
This bug affects 10 people
Affects                   Status     Importance  Assigned to  Milestone
OpenStack Compute (nova)  Won't Fix  Undecided   Unassigned
grenade                   Invalid    Undecided   Unassigned

Bug Description

Not sure if this is a duplicate, but I've noticed tempest volume scenarios failing as part of grenade tests, e.g.:

http://logs.openstack.org/07/112707/1/check/check-grenade-dsvm-partial-ncpu/43dbf72/logs/testr_results.html.gz

Traceback (most recent call last):
  File "tempest/test.py", line 128, in wrapper
    return f(self, *func_args, **func_kwargs)
  File "tempest/scenario/test_volume_boot_pattern.py", line 163, in test_volume_boot_pattern
    keypair)
  File "tempest/scenario/test_volume_boot_pattern.py", line 116, in _ssh_to_server
    private_key=keypair.private_key)
  File "tempest/scenario/manager.py", line 484, in get_remote_client
    linux_client.validate_authentication()
  File "tempest/common/utils/linux/remote_client.py", line 53, in validate_authentication
    self.ssh_client.test_connection_auth()
  File "tempest/common/ssh.py", line 150, in test_connection_auth
    connection = self._get_ssh_connection()
  File "tempest/common/ssh.py", line 87, in _get_ssh_connection
    password=self.password)
SSHTimeout: Connection to the 172.24.4.2 via SSH timed out.
User: cirros, Password: None

Checking logstash, these seem to be happening frequently during grenade jobs, failing both the test_snapshot_pattern and test_volume_boot_pattern tests: 141 failures over the last 7 days.

http://logstash.openstack.org/#eyJzZWFyY2giOiJtZXNzYWdlOlwiU1NIVGltZW91dDogQ29ubmVjdGlvbiB0byB0aGUgMTcyLjI0LjQuMiB2aWEgU1NIIHRpbWVkIG91dC5cIiIsImZpZWxkcyI6W10sIm9mZnNldCI6MCwidGltZWZyYW1lIjoiNjA0ODAwIiwiZ3JhcGhtb2RlIjoiY291bnQiLCJ0aW1lIjp7InVzZXJfaW50ZXJ2YWwiOjB9LCJzdGFtcCI6MTQwNzgxMzYzNjg1Nn0=
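
For reference, the part of that logstash URL after '#' is base64-encoded JSON describing the saved search. A small sketch to decode it locally (plain Python, nothing OpenStack-specific; the script name is just illustrative):

import base64
import json
import sys

# Usage: python decode_logstash_url.py '<logstash URL>'
# The fragment after '#' in logstash.openstack.org dashboard URLs is
# base64-encoded JSON describing the saved search.
url = sys.argv[1]
fragment = url.split('#', 1)[1]
# Re-add any '=' padding that was lost when the URL was copied around.
fragment += '=' * (-len(fragment) % 4)
query = json.loads(base64.b64decode(fragment).decode('utf-8'))
print(json.dumps(query, indent=2))

For the URL above this should print a message: search for the SSHTimeout string with a 604800-second (7-day) timeframe, matching the failure count quoted above.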

Tags: testing
Revision history for this message
Matt Riedemann (mriedem) wrote :

Yeah, this is a known thing; the ssh timeout failure is generic, so it's hard to classify in elastic-recheck:

http://logs.openstack.org/71/179971/2/gate/gate-grenade-dsvm/eb2fb74/logs/grenade.sh.txt.gz#_2015-05-20_07_26_25_655

I think we'll need a change to tempest to log a specific error for the test in this case so we can track this bug.
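
Something along these lines in tempest's scenario manager would give elastic-recheck a unique string to key on. This is an illustrative sketch only; the names are approximate and it is not necessarily what the actual tempest change does:

# Sketch of a method on tempest.scenario.manager.ScenarioTest -- illustrative
# only, not a copy of the real tempest code.
import logging

from tempest.common.utils.linux import remote_client

LOG = logging.getLogger(__name__)


def get_remote_client(self, ip_address, username, private_key, server=None):
    linux_client = remote_client.RemoteClient(ip_address, username,
                                              pkey=private_key)
    try:
        linux_client.validate_authentication()
    except Exception as e:
        # Emit one specific, greppable message per failure so an
        # elastic-recheck query can match it, and dump the guest console
        # to help with debugging.
        LOG.exception('Initializing SSH connection to %s failed. Error: %s',
                      ip_address, e)
        if server is not None:
            self._log_console_output(servers=[server])
        raise
    return linux_client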

Revision history for this message
Matt Riedemann (mriedem) wrote :

Here is a change to tempest to add some more contextual logging around the error:

https://review.openstack.org/#/c/184815/

Revision history for this message
Matt Riedemann (mriedem) wrote :
Sean Dague (sdague)
summary: - grenade volume scenarios periodically fail /w SSHTimeout
+ tempest volume scenarios periodically fail /w SSHTimeout
Revision history for this message
Tony Breeds (o-tony) wrote :

Added to nova as it's not grenade-specific and I saw it in a tempest-dsvm-full job.

tags: added: testing
Changed in nova:
status: New → Confirmed
Sean Dague (sdague)
Changed in nova:
status: Confirmed → Won't Fix
Changed in grenade:
status: New → Invalid
Revision history for this message
Sean Dague (sdague) wrote :

None of the links in this bug are valid any more. If this is still relevant, a new bug can be filed.

Revision history for this message
Ken'ichi Ohmichi (oomichi) wrote :

This report is old, but I am temporarily writing down some debugging info on it here.

The latest issue is http://logs.openstack.org/72/282872/2/check/gate-tempest-dsvm-postgres-full/db55de4/logs/

The following tests failed:
- test_minimum_basic_scenario
- test_shelve_instance
- test_shelve_volume_backed_instance
- test_snapshot_pattern
- test_volume_boot_pattern
- test_volume_boot_pattern

All the tests failed in the same way:

Traceback (most recent call last):
  File "tempest/test.py", line 113, in wrapper
    return f(self, *func_args, **func_kwargs)
  File "tempest/scenario/test_shelve_instance.py", line 98, in test_shelve_instance
    self._create_server_then_shelve_and_unshelve()
  File "tempest/scenario/test_shelve_instance.py", line 82, in _create_server_then_shelve_and_unshelve
    private_key=keypair['private_key'])
  File "tempest/scenario/manager.py", line 622, in create_timestamp
    private_key=private_key)
  File "tempest/scenario/manager.py", line 377, in get_remote_client
    linux_client.validate_authentication()
  File "tempest/common/utils/linux/remote_client.py", line 51, in validate_authentication
    self.ssh_client.test_connection_auth()
  File "/opt/stack/new/tempest/.tox/full/local/lib/python2.7/site-packages/tempest_lib/common/ssh.py", line 173, in test_connection_auth
    connection = self._get_ssh_connection()
  File "/opt/stack/new/tempest/.tox/full/local/lib/python2.7/site-packages/tempest_lib/common/ssh.py", line 88, in _get_ssh_connection
    password=self.password)
tempest_lib.exceptions.SSHTimeout: Connection to the 172.24.5.10 via SSH timed out.
User: cirros, Password: None

Revision history for this message
Ken'ichi Ohmichi (oomichi) wrote :

Picking up tempest.scenario.test_volume_boot_pattern.TestVolumeBootPattern.test_volume_boot_pattern here:

date: 2016-02-21 16:55:35.858
server-id: d8bc1ba6-1b57-41c3-bf8b-8c4981f0c6fd
ip-address: 172.24.5.9

Confirmed the IP address is allocated to the server:
2016-02-21 16:52:10,047 18343 INFO [tempest_lib.common.rest_client] Request (TestVolumeBootPattern:test_volume_boot_pattern): 202 POST http://127.0.0.1:8774/v2.1/b18fde6446224a348fb327cccbc0bc30/servers/d8bc1ba6-1b57-41c3-bf8b-8c4981f0c6fd/action 0.945s
2016-02-21 16:52:10,048 18343 DEBUG [tempest_lib.common.rest_client] Request - Headers: {'Accept': 'application/json', 'X-Auth-Token': '<omitted>', 'Content-Type': 'application/json'}
        Body: {"addFloatingIp": {"address": "172.24.5.9"}}

Before the timeout, the following message was output several times:
2016-02-21 16:55:35.857 18343 ERROR tempest_lib.common.ssh Traceback (most recent call last):
2016-02-21 16:55:35.857 18343 ERROR tempest_lib.common.ssh File "/opt/stack/new/tempest/.tox/full/local/lib/python2.7/site-packages/tempest_lib/common/ssh.py", line 75, in _get_ssh_connection
2016-02-21 16:55:35.857 18343 ERROR tempest_lib.common.ssh timeout=self.channel_timeout, pkey=self.pkey)
2016-02-21 16:55:35.857 18343 ERROR tempest_lib.common.ssh File "/opt/stack/new/tempest/.tox/full/local/lib/python2.7/site-packages/paramiko/client.py", line 325, in connect
2016-02-21 16:55:35.857 18343 ERROR tempest_lib.common.ssh t.start_client()
2016-02-21 16:55:35.857 18343 ERROR tempest_lib.common.ssh File "/opt/stack/new/tempest/.tox/full/local/lib/python2.7/site-packages/paramiko/transport.py", line 492, in start_client
2016-02-21 16:55:35.857 18343 ERROR tempest_lib.common.ssh raise e
2016-02-21 16:55:35.857 18343 ERROR tempest_lib.common.ssh SSHException: Error reading SSH protocol banner
2016-02-21 16:55:35.857 18343 ERROR tempest_lib.common.ssh
2016-02-21 16:55:35,858 18343 ERROR [tempest.scenario.manager] (TestVolumeBootPattern:test_volume_boot_pattern) Initializing SSH connection to 172.24.5.9 failed. Error: Connection to the 172.24.5.9 via SSH timed out.
User: cirros, Password: None

Revision history for this message
Ken'ichi Ohmichi (oomichi) wrote :

2016-02-21 16:54:40,286 18343

ERROR [paramiko.transport] Exception: Error reading SSH protocol banner[Errno 104] Connection reset by peer
ERROR [paramiko.transport] Traceback (most recent call last):
ERROR [paramiko.transport] File "/opt/stack/new/tempest/.tox/full/local/lib/python2.7/site-packages/paramiko/transport.py", line 1710, in run
ERROR [paramiko.transport] self._check_banner()
ERROR [paramiko.transport] File "/opt/stack/new/tempest/.tox/full/local/lib/python2.7/site-packages/paramiko/transport.py", line 1858, in _check_banner
ERROR [paramiko.transport] raise SSHException('Error reading SSH protocol banner' + str(e))
ERROR [paramiko.transport] SSHException: Error reading SSH protocol banner[Errno 104] Connection reset by peer
ERROR [paramiko.transport]
WARNING [tempest_lib.common.ssh] Failed to establish authenticated ssh connection to cirros@172.24.5.9 (Error reading SSH protocol banner[Errno 104] Connection reset by peer). Number attempts: 16. Retry after 17 seconds.
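
The retry behaviour behind that WARNING is roughly the following (a paraphrased sketch, not the verbatim tempest_lib code): keep reconnecting with a slowly growing delay until the overall ssh timeout expires, then give up with SSHTimeout.

import socket
import time

import paramiko


def get_ssh_connection(host, username, pkey, timeout=300, sleep=1.5, backoff=1):
    # Paraphrased sketch of tempest_lib.common.ssh.Client._get_ssh_connection.
    client = paramiko.SSHClient()
    client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    start = time.time()
    attempts = 0
    delay = sleep
    while True:
        try:
            client.connect(host, username=username, pkey=pkey,
                           look_for_keys=False, timeout=10)
            return client
        except (EOFError, socket.error, paramiko.SSHException) as exc:
            attempts += 1
            delay += backoff
            print('Failed to establish authenticated ssh connection to '
                  '%s@%s (%s). Number attempts: %s. Retry after %s seconds.'
                  % (username, host, exc, attempts, delay))
            if time.time() - start > timeout:
                raise  # tempest_lib re-raises this as SSHTimeout
            time.sleep(delay)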

paramiko/transport.py
 1844     def _check_banner(self):
 1845         # this is slow, but we only have to do it once
 1846         for i in range(100):
 1847             # give them 15 seconds for the first line, then just 2 seconds
 1848             # each additional line. (some sites have very high latency.)
 1849             if i == 0:
 1850                 timeout = self.banner_timeout
 1851             else:
 1852                 timeout = 2
 1853             try:
 1854                 buf = self.packetizer.readline(timeout)
 1855             except ProxyCommandFailure:
 1856                 raise
 1857             except Exception as e:
 1858                 raise SSHException('Error reading SSH protocol banner' + str(e))

The original exception is [Errno 104] Connection reset by peer.
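
What paramiko is waiting for in _check_banner can be reproduced at the socket level with a probe like this (illustrative sketch; the address is the floating IP from this run):

import socket


def read_ssh_banner(host, port=22, timeout=15):
    # Open a TCP connection and read the SSH identification string, the
    # same thing paramiko waits for before key exchange.
    sock = socket.create_connection((host, port), timeout=timeout)
    try:
        return sock.recv(256)  # normally something like 'SSH-2.0-dropbear_...'
    finally:
        sock.close()


try:
    print(read_ssh_banner('172.24.5.9'))
except socket.error as exc:
    # In the failing runs this is where "[Errno 104] Connection reset by
    # peer" appears: the TCP connection is answered (possibly by the
    # floating-ip NAT) but reset before any banner arrives.
    print('no banner: %s' % exc)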

Revision history for this message
Ken'ichi Ohmichi (oomichi) wrote :

screen-c-api.txt.gz when booting from volume:

2016-02-21 16:52:01.832 INFO cinder.api.openstack.wsgi [req-1c9f1cd4-022c-4a8d-9b15-2cc092eb9522 tempest-TestVolumeBootPattern-258726986] POST http://127.0.0.1:8776/v2/b18fde6446224a348fb327cccbc0bc30/volumes/9bed7b85-4472-44cd-9a5b-375010a5363a/action
2016-02-21 16:52:01.832 DEBUG cinder.api.openstack.wsgi [req-1c9f1cd4-022c-4a8d-9b15-2cc092eb9522 tempest-TestVolumeBootPattern-258726986] Action body: {"os-attach": {"instance_uuid": "d8bc1ba6-1b57-41c3-bf8b-8c4981f0c6fd", "mountpoint": "/dev/vda", "mode": "rw"}} get_method /opt/stack/new/cinder/cinder/api/openstack/wsgi.py:1093
2016-02-21 16:52:01.958 DEBUG object [req-1c9f1cd4-022c-4a8d-9b15-2cc092eb9522 tempest-TestVolumeBootPattern-258726986] Cinder object Volume has no attribute named: type get /opt/stack/new/cinder/cinder/objects/base.py:327
[..]
2016-02-21 16:52:03.108 INFO eventlet.wsgi.server [req-1c9f1cd4-022c-4a8d-9b15-2cc092eb9522 tempest-TestVolumeBootPattern-258726986] 127.0.0.1 "POST /v2/b18fde6446224a348fb327cccbc0bc30/volumes/9bed7b85-4472-44cd-9a5b-375010a5363a/action HTTP/1.1" status: 202 len: 211 time: 1.2800739

Revision history for this message
Ken'ichi Ohmichi (oomichi) wrote :

The volume: 9bed7b85-4472-44cd-9a5b-375010a5363a
The server: d8bc1ba6-1b57-41c3-bf8b-8c4981f0c6fd
The network: 3acd1838-7435-48bb-bc86-2da830b5a1c4
The target address: 172.24.5.9 (Floating ip)

16:51:57.655: (nova-api) Received the boot-from-volume request.
16:51:58.615: (nova-api) Succeeded in booting from volume.
16:52:03.108: (cinder-api) Succeeded in attaching the volume to the server.
16:52:10.050: (nova-api) Succeeded in adding the target floating IP address to the server.
16:52:13.046: (tempest) First failure to establish an authenticated ssh connection.
16:55:35.858: (tempest) tempest.scenario.test_volume_boot_pattern.TestVolumeBootPattern.test_volume_boot_pattern failed.

Revision history for this message
Ken'ichi Ohmichi (oomichi) wrote :

The volume: 9bed7b85-4472-44cd-9a5b-375010a5363a
The server: d8bc1ba6-1b57-41c3-bf8b-8c4981f0c6fd
The network: 3acd1838-7435-48bb-bc86-2da830b5a1c4
The fixed-ip: 10.1.0.9
The target address: 172.24.5.9 (Floating ip)

16:51:57.655: (nova-api) Received the boot-from-volume request.
16:51:58.615: (nova-api) Succeeded in booting from volume.
16:52:00.618: (nova-net) Allocated a fixed IP to the server.
16:52:03.108: (cinder-api) Succeeded in attaching the volume to the server.
16:52:10.050: (nova-api) Succeeded in adding the target floating IP address to the server.
16:52:13.046: (tempest) First failure to establish an authenticated ssh connection.
16:55:35.858: (tempest) tempest.scenario.test_volume_boot_pattern.TestVolumeBootPattern.test_volume_boot_pattern failed.

The console log contains the following:

cirros-ds 'local' up at 8.49
no results found for mode=local. up 8.95. searched: nocloud configdrive ec2
Starting network...
udhcpc (v1.20.1) started
Sending discover...
Sending select for 10.1.0.9...
Lease of 10.1.0.9 obtained, lease time 86400
cirros-ds 'net' up at 10.49
checking http://169.254.169.254/2009-04-04/instance-id
failed 1/20: up 10.82. request failed
failed 2/20: up 13.55. request failed
failed 3/20: up 15.76. request failed
failed 4/20: up 18.08. request failed
failed 5/20: up 20.29. request failed
failed 6/20: up 22.75. request failed
failed 7/20: up 24.95. request failed
failed 8/20: up 27.13. request failed
failed 9/20: up 29.56. request failed
failed 10/20: up 31.78. request failed
failed 11/20: up 33.95. request failed
failed 12/20: up 36.13. request failed
failed 13/20: up 38.52. request failed
failed 14/20: up 40.93. request failed
failed 15/20: up 43.13. request failed
failed 16/20: up 45.41. request failed
failed 17/20: up 47.79. request failed
failed 18/20: up 49.98. request failed
failed 19/20: up 52.17. request failed
failed 20/20: up 54.41. request failed
failed to read iid from metadata. tried 20
no results found for mode=net. up 56.82. searched: nocloud configdrive ec2
failed to get instance-id of datasource
Starting dropbear sshd: generating rsa key... generating dsa key... OK
=== system information ===
Platform: OpenStack Foundation OpenStack Nova

The difference between a normal run and this failing run is:

---------------------
$ diff -u normal.txt problem.txt
--- normal.txt 2016-02-23 09:22:14.920761998 -0800
+++ problem.txt 2016-02-23 09:22:46.568761998 -0800
@@ -1,11 +1,35 @@
-cirros-ds 'local' up at 9.56
-found datasource (configdrive, local)
+cirros-ds 'local' up at 8.49
+no results found for mode=local. up 8.95. searched: nocloud configdrive ec2
 Starting network...
 udhcpc (v1.20.1) started
 Sending discover...
-Sending select for 10.1.8.4...
-Lease of 10.1.8.4 obtained, lease time 86400
+Sending select for 10.1.0.9...
+Lease of 10.1.0.9 obtained, lease time 86400
+cirros-ds 'net' up at 10.49
+checking http://169.254.169.254/2009-04-04/instance-id
+failed 1/20: up 10.82. request failed
+failed 2/20: up 13.55. request failed
+failed 3/20: up 15.76. request failed
+failed 4/20: up 18.08. request failed
+failed 5/20: up 20.29. request failed
+failed 6/20: up 22.75. request failed
+failed 7/20: up 24.95. request...

