Intermittent ssh tempest timeouts

Bug #1468583 reported by Danny Wilson on 2015-06-25
This bug affects 5 people
Affects    Importance    Assigned to
Cinder     Undecided     Unassigned
tempest    Undecided     Unassigned

Bug Description

Two tempest tests intermittently fail while running our CI. These tests also fail when using LVM as a backend, so they are not related to one particular driver.

test_volume_boot_pattern
test_minimum_basic_scenario

Scenario
A VM is rebooted and then the test tries to ssh back into the VM. The ssh attempt fails with a specific error (timeout). The test retries 4-6 times before giving up on the connection and failing the test case.
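To make the retry behaviour concrete, here is a minimal sketch of a bounded retry loop. It is not the tempest code: `wait_for_port` and `describe` are hypothetical helpers, the attempt count and delay are illustrative defaults, and it only probes the TCP port rather than performing SSH authentication.

```python
import socket
import time


def describe(exc):
    """Render a connection error roughly like the text shown in parens in the log."""
    if isinstance(exc, (socket.timeout, TimeoutError)):
        return "timed out"
    if exc.errno is not None:
        return f"[Errno {exc.errno}] {exc.strerror}"
    return str(exc)


def wait_for_port(host, port=22, attempts=6, delay=5.0, timeout=10.0,
                  connect=socket.create_connection, sleep=time.sleep):
    """Try to open a TCP connection up to `attempts` times.

    Returns (connected, failures), where failures lists one description
    per failed attempt. `connect` and `sleep` are injectable so the loop
    can be exercised without a real VM (an assumption of this sketch).
    """
    failures = []
    for attempt in range(1, attempts + 1):
        try:
            conn = connect((host, port), timeout=timeout)
        except OSError as exc:
            failures.append(describe(exc))
            if attempt < attempts:
                sleep(delay)  # give the VM more time to boot
        else:
            conn.close()
            return True, failures
    return False, failures
```

Calling `wait_for_port("172.24.5.1")` against a rebooting VM would return the list of per-attempt failures, which makes it easy to see whether the attempts ended in refusals or in timeouts.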

SSH errors that occur but do not cause test case failure
There is built-in code that retries the ssh connection to give the VM time to boot up. The errors seen in this case are ([Errno 111] Connection refused), (Authentication failed), and ([Errno 113] No route to host). If these errors are seen, the ssh connection is eventually successful and the test case moves on.

The error strings in parens () above are what show up in this log line. When the error is (timed out) as below the test always fails.

12:33:13 2015-06-18 12:26:50,762 5533 WARNING [tempest_lib.common.ssh] Failed to establish authenticated ssh connection to cirros@172.24.5.1 (timed out). Number attempts: 4. Retry after 5 seconds.
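For anyone triaging their own CI logs, a rough sketch of pulling the error string out of this warning line and flagging the fatal case might look like the following. The regular expression is derived only from the single log line quoted above, and `parse_ssh_warning` is a hypothetical helper; the "timed out always fails" rule is the observation from this report, not something enforced by tempest.

```python
import re

# Pattern built from the one warning line quoted in this report (an assumption:
# other tempest_lib versions may word the message differently).
SSH_WARNING = re.compile(
    r"Failed to establish authenticated ssh connection to "
    r"(?P<user>[^@\s]+)@(?P<host>\S+) "
    r"\((?P<reason>.*?)\)\. "
    r"Number attempts: (?P<attempts>\d+)\. "
    r"Retry after (?P<delay>\d+) seconds\."
)


def parse_ssh_warning(line):
    """Return the parsed fields of an ssh retry warning, or None if no match."""
    match = SSH_WARNING.search(line)
    if match is None:
        return None
    reason = match.group("reason")
    return {
        "user": match.group("user"),
        "host": match.group("host"),
        "reason": reason,
        "attempts": int(match.group("attempts")),
        "delay": int(match.group("delay")),
        # Per this report, "timed out" is the reason that always ends in failure.
        "likely_fatal": reason == "timed out",
    }
```

Running this over a full console log and counting lines where `likely_fatal` is true would show how often a run hits the timeout case versus the errors that recover on retry.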

Danny Wilson (daniel-wilson) wrote :

Adding an etherpad that contains some notes.

https://etherpad.openstack.org/p/Tempest_SSH_Timeouts

Jordan Pittier (jordan-pittier) wrote :

I am running a 3rd party CI (Scality) and I also see a lot of these errors (the false negative rate is approximately 1 out of 25, imo). It's always the test test_volume_boot_pattern that fails "randomly" with "tempest_lib.exceptions.SSHTimeout: Connection to the 172.24.5.2 via SSH timed out."

What I don't understand yet is why the gate doesn't see this.

Yaroslav Lobankov (ylobankov) wrote :

Hi guys, I have encountered this issue as well. On my CI it happens seldom, but it happens. At first I thought that it was a network issue on the CI, but it turns out the issue is something else. I wonder why only these tests fail?

Patrick East (patrick-east) wrote :

We are going to disable these tests on the Pure Storage CI while we are trying to figure out why they are failing... as-is it is just causing noise for what would otherwise be successful test passes. We feel confident that the issue is not caused by the volume driver.

Danny Wilson (daniel-wilson) wrote :

Thanks Yaroslav,
It looks like that failure may be slightly different.

The error that causes the failure in your case is ([Errno 113] No route to host). It might be the same cause but the behavior is slightly different. Just wanted to make a note of that.

Yaroslav Lobankov (ylobankov) wrote :

It looks like we can see exactly the same error that Danny is talking about here: http://logs.openstack.org/49/175949/4/check/check-tempest-dsvm-full-kilo/357d176/logs/testr_results.html.gz

Changed in tempest:
status: New → Confirmed

Sean McGinnis (sean-mcginnis) wrote :

I believe this has since been fixed and is no longer applicable to cinder. Please reopen if I am mistaken.

Changed in cinder:
status: New → Invalid

chandan kumar (chkumar246) wrote :

@sean, as per your last comment, it seems this issue is fixed. Can we close this?
