[tempest] migration tests often fail with ssh timeouts

Bug #1810988 reported by Tom Barron on 2019-01-08
This bug affects 1 person
Affects: Manila
Importance: High
Assigned to: Unassigned

Bug Description

Even after we've fixed a number of ssh/paramiko issues when connecting to service VMs from the manila-share service, migration jobs often fail with ssh timeouts like the following:

2019-01-05 22:01:15.188 25836 ERROR tempest.lib.common.ssh [-] Failed to establish authenticated ssh connection to manila@172.24.5.139 after 18 attempts: timeout: timed out
2019-01-05 22:01:15.188 25836 ERROR tempest.lib.common.ssh Traceback (most recent call last):
2019-01-05 22:01:15.188 25836 ERROR tempest.lib.common.ssh File "tempest/lib/common/ssh.py", line 107, in _get_ssh_connection
2019-01-05 22:01:15.188 25836 ERROR tempest.lib.common.ssh sock=proxy_chan)
2019-01-05 22:01:15.188 25836 ERROR tempest.lib.common.ssh File "/usr/local/lib/python2.7/dist-packages/paramiko/client.py", line 343, in connect
2019-01-05 22:01:15.188 25836 ERROR tempest.lib.common.ssh retry_on_signal(lambda: sock.connect(addr))
2019-01-05 22:01:15.188 25836 ERROR tempest.lib.common.ssh File "/usr/local/lib/python2.7/dist-packages/paramiko/util.py", line 280, in retry_on_signal
2019-01-05 22:01:15.188 25836 ERROR tempest.lib.common.ssh return function()
2019-01-05 22:01:15.188 25836 ERROR tempest.lib.common.ssh File "/usr/local/lib/python2.7/dist-packages/paramiko/client.py", line 343, in <lambda>
2019-01-05 22:01:15.188 25836 ERROR tempest.lib.common.ssh retry_on_signal(lambda: sock.connect(addr))
2019-01-05 22:01:15.188 25836 ERROR tempest.lib.common.ssh File "/usr/lib/python2.7/socket.py", line 228, in meth
2019-01-05 22:01:15.188 25836 ERROR tempest.lib.common.ssh return getattr(self._sock,name)(*args)
2019-01-05 22:01:15.188 25836 ERROR tempest.lib.common.ssh timeout: timed out
2019-01-05 22:01:15.188 25836 ERROR tempest.lib.common.ssh

Note that these are ssh connections made by tempest itself, where the channel connection timeout is set to 10 seconds. The log above shows 18 connection attempts, all of which timed out, before a broader overall timer expired and tempest declared failure.
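To make the interaction between the per-attempt timeout and the overall timer concrete, here is a minimal sketch of that kind of retry loop. It is not tempest's actual code: `connect_with_retries`, `connect_once`, and `overall_timeout` are hypothetical names, the one-second pause between attempts is an assumption, and the sketch is written for Python 3 even though the log above comes from Python 2.7.

```python
import time


def connect_with_retries(connect_once, overall_timeout,
                         sleep=time.sleep, clock=time.monotonic):
    """Call connect_once() until it succeeds or the overall timer expires.

    connect_once is a hypothetical callable that makes one connection
    attempt with its own short timeout (10 seconds in the report) and
    raises OSError (socket.timeout is a subclass) when it fails.
    Returns a (connection, attempts) tuple on success.
    """
    deadline = clock() + overall_timeout
    attempts = 0
    while True:
        attempts += 1
        try:
            return connect_once(), attempts
        except OSError as exc:
            # Give up only once the overall budget is spent; until then
            # every failed attempt is simply retried.
            if clock() >= deadline:
                raise TimeoutError(
                    "failed after %d attempts: %s" % (attempts, exc)
                ) from exc
            sleep(1)  # assumed fixed pause between attempts
```

Under these assumptions, each failed attempt uses up its full per-attempt timeout plus the pause, so an unreachable host consumes the entire overall budget before the test is declared failed. That is consistent with the report's concern about job-level timeouts.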

This is likely a significant cause of both failures and overall job timeouts with migration tests.

Tom Barron (tpb) on 2019-01-08
Changed in manila:
importance: Undecided → High
milestone: none → stein-2
Changed in manila:
assignee: nobody → Tom Barron (tpb)
status: New → In Progress
Tom Barron (tpb) on 2019-03-14
tags: added: share-migration
Tom Barron (tpb) on 2019-03-17
Changed in manila:
milestone: stein-2 → stein-rc1
Jason Grosso (jgrosso) on 2019-06-20
Changed in manila:
assignee: Tom Barron (tpb) → nobody
Tom Barron (tpb) on 2019-09-05
Changed in manila:
milestone: stein-rc1 → none