Intermittent ssh tempest timeouts

Bug #1468583 reported by Danny Wilson
This bug affects 5 people
Affects   Status    Importance   Assigned to   Milestone
Cinder    Invalid   Undecided    Unassigned
tempest   Invalid   Undecided    Unassigned

Bug Description

Two tempest tests fail intermittently while running our CI. These tests also fail when using LVM as a backend, so they are not related to one particular driver.

test_volume_boot_pattern
test_minimum_basic_scenario

Scenario
A VM is rebooted and then the test tries to ssh back into the VM. The ssh attempt fails with a specific error (timed out). It retries 4-6 times before giving up on the connection and failing the test case.

SSH errors that occur but do not cause test case failure
There is built-in code that retries the ssh connection to give the VM time to boot up. The errors seen in this case are ([Errno 111] Connection refused), (Authentication failed), and ([Errno 113] No route to host). If these errors are seen, the ssh connection is eventually successful and the test case moves on.
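As a rough illustration of that retry behavior (not the actual tempest_lib code), the loop can be sketched as below. The names `connect_with_retry`, `SSHTimeout`, `connect`, `backoff`, `sleep`, and `clock` are hypothetical and chosen for this sketch only; the 5-second backoff mirrors the "Retry after 5 seconds" text in the log. Only `OSError` subclasses (connection refused, no route to host, timed out) are caught here; authentication failures would be raised by the SSH library and would need their own exception type.

```python
import time


class SSHTimeout(Exception):
    """Raised when no connection succeeds before the overall deadline."""


def connect_with_retry(connect, timeout=60.0, backoff=5.0,
                       sleep=time.sleep, clock=time.monotonic):
    """Call connect() until it succeeds or `timeout` seconds have passed.

    `connect` is a hypothetical zero-argument callable that returns a
    connected client, or raises OSError (refused, no route, timed out).
    """
    deadline = clock() + timeout
    attempts = 0
    while True:
        attempts += 1
        try:
            return connect()
        except OSError as exc:
            # Give up only once the overall deadline has passed.
            if clock() >= deadline:
                raise SSHTimeout(
                    f"gave up after {attempts} attempts: {exc}") from exc
            print(f"Failed to connect ({exc}). "
                  f"Number attempts: {attempts}. "
                  f"Retry after {backoff:g} seconds.")
            sleep(backoff)
```

With a loop of this shape, transient errors while the VM boots are absorbed as long as one attempt succeeds before the deadline, which matches the non-fatal cases described here.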

The error strings in parens () above are what show up in this log line. When the error is (timed out) as below the test always fails.

12:33:13 2015-06-18 12:26:50,762 5533 WARNING [tempest_lib.common.ssh] Failed to establish authenticated ssh connection to cirros@172.24.5.1 (timed out). Number attempts: 4. Retry after 5 seconds.
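For triaging many CI logs, a small parser can pull the error string and attempt count out of warnings like the one above. This is a minimal sketch: the regular expression is modeled only on the single log line quoted in this report, and `parse_ssh_warning` is a hypothetical helper name, so other tempest_lib versions may need a different pattern.

```python
import re

# Pattern modeled on the one warning quoted in this report (an assumption).
_SSH_WARNING = re.compile(
    r"Failed to establish authenticated ssh connection to "
    r"(?P<user>[^@\s]+)@(?P<host>\S+) "
    r"\((?P<error>.*)\)\. "
    r"Number attempts: (?P<attempts>\d+)\."
)


def parse_ssh_warning(line):
    """Return user, host, error and attempts from a warning, or None."""
    match = _SSH_WARNING.search(line)
    if match is None:
        return None
    return {
        "user": match.group("user"),
        "host": match.group("host"),
        "error": match.group("error"),
        "attempts": int(match.group("attempts")),
    }
```

Grouping the parsed results by `error` would make it easy to separate the (timed out) failures from the transient errors that eventually succeed.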

Danny Wilson (daniel-wilson) wrote :

Adding an etherpad that contains some notes.

https://etherpad.openstack.org/p/Tempest_SSH_Timeouts

Jordan Pittier (jordan-pittier) wrote :

I am running a 3rd party CI (Scality) and I also see a lot of these errors (the false negative rate is approximately 1 out of 25, imo). It's always the test test_volume_boot_pattern that fails "randomly" with "tempest_lib.exceptions.SSHTimeout: Connection to the 172.24.5.2 via SSH timed out."

What I don't understand yet is why the gate doesn't see this.

Yaroslav Lobankov (ylobankov) wrote :

Hi guys, I have encountered this issue as well. On my CI it happens seldom, but it happens. At first I thought that it was a network issue on the CI, but it turns out the cause is something else. I wonder why only these tests fail?

Patrick East (patrick-east) wrote :

We are going to disable these tests on the Pure Storage CI while we are trying to figure out why they are failing... as-is it is just causing noise for what would otherwise be successful test passes. We feel confident that the issue is not caused by the volume driver.

Yaroslav Lobankov (ylobankov) wrote :
Danny Wilson (daniel-wilson) wrote :

Thanks Yaroslav,
It looks like that failure may be slightly different.

The error that causes the failure in your case is ([Errno 113] No route to host). It might be the same cause but the behavior is slightly different. Just wanted to make a note of that.

Yaroslav Lobankov (ylobankov) wrote :

It looks like we can see exactly the same error that Danny is talking about here: http://logs.openstack.org/49/175949/4/check/check-tempest-dsvm-full-kilo/357d176/logs/testr_results.html.gz

Changed in tempest:
status: New → Confirmed
Sean McGinnis (sean-mcginnis) wrote :

I believe this has since been fixed and is no longer applicable to cinder. Please reopen if I am mistaken.

Changed in cinder:
status: New → Invalid
chandan kumar (chkumar246) wrote :

@sean, as per your last comment, it seems this issue is fixed. Can we close this?

Martin Kopec (mkopec) wrote :

Given the long inactivity in the comments here, I assume that the issue is either fixed or simply no longer occurring. If you hit the problem again, please feel free to reopen it. For now, it seems the bug report is no longer valid.

Changed in tempest:
status: Confirmed → Invalid