[queens promotion] Tempest (fs020) is timing out - exit_value=143

Bug #1751180 reported by Ronelle Landy
Affects: tripleo
Status: Triaged
Importance: Critical
Assigned to: Arx Cruz

Bug Description

The Tempest run in Featureset020 has been timing out and failing with:

 21:40:57 TASK [validate-tempest : Execute tempest] **************************************
21:40:57 Thursday 22 February 2018 21:40:57 +0000 (0:00:00.065) 0:00:28.405 *****
00:09:27 +(./toci_quickstart.sh:95): exit_value=143
00:09:27 +(./toci_quickstart.sh:98): [[ 143 == 0 ]]
00:09:27 +(./toci_quickstart.sh:99): [[ 143 != 0 ]]
00:09:27 +(./toci_quickstart.sh:99): echo 'Playbook run of baremetal-full-overcloud-validate.yml failed'
00:09:27 Playbook run of baremetal-full-overcloud-validate.yml failed
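
For context, an exit value of 143 follows the shell convention of 128 plus a signal number, here 128 + 15 (SIGTERM). That is consistent with the playbook run being terminated from outside, as happens when a job timeout kills it, rather than failing on its own. A minimal Python sketch of that decoding follows; the helper name `describe_exit_value` is made up for illustration and is not part of toci_quickstart.sh.

```python
import signal


def describe_exit_value(exit_value):
    """Interpret a shell exit value using the 128 + signal-number convention."""
    if exit_value > 128:
        signum = exit_value - 128
        try:
            return "killed by " + signal.Signals(signum).name
        except ValueError:
            # Not a signal number known on this platform.
            return "killed by unknown signal %d" % signum
    return "exited with status %d" % exit_value


print(describe_exit_value(143))  # prints "killed by SIGTERM"
```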

tempest.log shows a number of errors/warnings similar to:

2018-02-23 00:00:42.085 4888 WARNING tempest.lib.common.ssh [-] Failed to establish authenticated ssh connection to cirros@192.168.24.107 ([Errno None] Unable to connect to port 22 on 192.168.24.107). Number attempts: 1. Retry after 2 seconds.: NoValidConnectionsError: [Errno None] Unable to connect to port 22 on 192.168.24.107

And then finally, this error:

2018-02-23 00:05:28.880 4888 WARNING tempest.lib.common.ssh [-] Failed to establish authenticated ssh connection to cirros@192.168.24.107 (Authentication failed.). Number attempts: 23. Retry after 24 seconds.: AuthenticationException: Authentication failed.
2018-02-23 00:05:53.410 4888 INFO paramiko.transport [-] Connected (version 2.0, client dropbear_2012.55)
2018-02-23 00:05:53.531 4888 INFO paramiko.transport [-] Authentication (publickey) failed.
2018-02-23 00:05:53.653 4888 ERROR tempest.lib.common.ssh [-] Failed to establish authenticated ssh connection to cirros@192.168.24.107 after 23 attempts: AuthenticationException: Authentication failed.
2018-02-23 00:05:53.653 4888 ERROR tempest.lib.common.ssh Traceback (most recent call last):
2018-02-23 00:05:53.653 4888 ERROR tempest.lib.common.ssh File "/usr/lib/python2.7/site-packages/tempest/lib/common/ssh.py", line 107, in _get_ssh_connection
2018-02-23 00:05:53.653 4888 ERROR tempest.lib.common.ssh sock=proxy_chan)
2018-02-23 00:05:53.653 4888 ERROR tempest.lib.common.ssh File "/usr/lib/python2.7/site-packages/paramiko/client.py", line 380, in connect
2018-02-23 00:05:53.653 4888 ERROR tempest.lib.common.ssh look_for_keys, gss_auth, gss_kex, gss_deleg_creds, gss_host)
2018-02-23 00:05:53.653 4888 ERROR tempest.lib.common.ssh File "/usr/lib/python2.7/site-packages/paramiko/client.py", line 621, in _auth
2018-02-23 00:05:53.653 4888 ERROR tempest.lib.common.ssh raise saved_exception
2018-02-23 00:05:53.653 4888 ERROR tempest.lib.common.ssh AuthenticationException: Authentication failed.
2018-02-23 00:05:53.653 4888 ERROR tempest.lib.common.ssh
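
To see how the retries are distributed across targets and failure reasons, the WARNING lines can be grouped with a small script. This is only a sketch: the regular expression is written against the two warning formats quoted above and is an assumption about the rest of the log, and `summarize_ssh_retries` is a hypothetical helper name.

```python
import re
from collections import Counter

# Two WARNING lines copied from the tempest.log excerpt above
# (the trailing exception text is trimmed).
SAMPLE_LOG = """\
2018-02-23 00:00:42.085 4888 WARNING tempest.lib.common.ssh [-] Failed to establish authenticated ssh connection to cirros@192.168.24.107 ([Errno None] Unable to connect to port 22 on 192.168.24.107). Number attempts: 1. Retry after 2 seconds.
2018-02-23 00:05:28.880 4888 WARNING tempest.lib.common.ssh [-] Failed to establish authenticated ssh connection to cirros@192.168.24.107 (Authentication failed.). Number attempts: 23. Retry after 24 seconds.
"""

RETRY_RE = re.compile(
    r"Failed to establish authenticated ssh connection to (?P<target>\S+) "
    r"\((?P<reason>.*?)\)\. Number attempts: (?P<attempt>\d+)\."
)


def summarize_ssh_retries(log_text):
    """Count retry warnings per (target, reason) and track the highest attempt."""
    counts = Counter()
    max_attempt = {}
    for match in RETRY_RE.finditer(log_text):
        key = (match.group("target"), match.group("reason"))
        counts[key] += 1
        attempt = int(match.group("attempt"))
        max_attempt[key] = max(max_attempt.get(key, 0), attempt)
    return counts, max_attempt


counts, max_attempt = summarize_ssh_retries(SAMPLE_LOG)
for key, count in counts.items():
    print(key, "warnings:", count, "highest attempt:", max_attempt[key])
```

In the excerpt above, the reason changes from "Unable to connect to port 22" to "Authentication failed." between the first and the last attempt, which this grouping makes easy to spot in the full log.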

The full tempest log is here:

https://logs.rdoproject.org/openstack-periodic/periodic-tripleo-ci-centos-7-ovb-1ctlr_1comp-featureset020-queens/7cead15/undercloud/home/jenkins/tempest/tempest.log.txt.gz

Ronelle Landy (rlandy) wrote :
Changed in tripleo:
milestone: none → queens-rc1
importance: Undecided → Critical
status: New → Triaged
tags: added: ci promotion-blocker
wes hayutin (weshayutin)
Changed in tripleo:
status: Triaged → Fix Released
yatin (yatinkarel) wrote :

Seeing this again: https://review.rdoproject.org/jenkins/job/periodic-tripleo-ci-centos-7-ovb-1ctlr_1comp-featureset020-queens/104/consoleText

One thing I noticed is that we are still using ostestr [1] to run the tempest tests, even though tempest 18.0.0 has switched to stestr. We should start using stestr to run tempest tests from queens onward (see bug [2]).

I don't know how much this would improve the timeout case, but it would definitely help.

[1] https://github.com/openstack/tripleo-quickstart-extras/blob/master/roles/validate-tempest/templates/run-tempest.sh.j2#L15
[2] https://bugs.launchpad.net/tempest/+bug/1751115

Rafael Folco (rafaelfolco) wrote :

Re-opening this bug. Although the failures are not 100% consistent, an ssh timeout is happening in all the failing tests.

Latest 3 logs:

tempest.scenario.test_security_groups_basic_ops.TestSecurityGroupsBasicOps.test_boot_into_disabled_port_security_network_without_secgroup:
https://logs.rdoproject.org/openstack-periodic/periodic-tripleo-ci-centos-7-ovb-1ctlr_1comp-featureset020-queens/267528d/undercloud/home/jenkins/tempest_output.log.txt.gz#_2018-03-23_00_02_02

tempest.api.compute.floating_ips.test_floating_ips_actions.FloatingIPsAssociationTestJSON:
https://logs.rdoproject.org/openstack-periodic/periodic-tripleo-ci-centos-7-ovb-1ctlr_1comp-featureset020-queens/668ceaa/undercloud/home/jenkins/tempest_output.log.txt.gz#_2018-03-22_12_37_05

tempest.api.compute.floating_ips.test_floating_ips_actions.FloatingIPsAssociationTestJSON:
https://logs.rdoproject.org/openstack-periodic/periodic-tripleo-ci-centos-7-ovb-1ctlr_1comp-featureset020-queens/c81ef68/undercloud/home/jenkins/tempest_output.log.txt.gz#_2018-03-22_04_16_48

Changed in tripleo:
status: Fix Released → Confirmed
Changed in tripleo:
assignee: nobody → Arx Cruz (arxcruz)
Damien Ciabrini (dciabrin) wrote :
Changed in tripleo:
status: Confirmed → Triaged
Arx Cruz (arxcruz) wrote :

I prepared a local environment for featureset020 and found the cause of the timeout:

All the failing tests are related to the ssh connection. On my setup, I saw 9 failures, and those tests take a long time to run (until the ssh timeout is reached).
For example, TestMinimumBasicScenario, which usually takes 180 seconds to run, took 514 seconds in the failed run.

Those 9 failures add up to 4367 seconds, which is over 72 minutes of wasted test time.

I also confirmed that ovs is version 2.8 (this might be related to https://bugs.launchpad.net/tripleo/+bug/1757556).

Below are the failing tests and the time each one took. Also, on my setup I wasn't able to get the run to complete: even with a 600 minutes timeout, it didn't finish.

tempest.scenario.test_minimum_basic.TestMinimumBasicScenario.test_minimum_basic_scenario - [514.348089s]
tempest.scenario.test_network_advanced_server_ops.TestNetworkAdvancedServerOps.test_server_connectivity_pause_unpause - [571.162965s]
tempest.scenario.test_security_groups_basic_ops.TestSecurityGroupsBasicOps.test_cross_tenant_traffic - [370.094413s]
tempest.scenario.test_network_basic_ops.TestNetworkBasicOps.test_router_rescheduling - [603.494442s]
tempest.scenario.test_security_groups_basic_ops.TestSecurityGroupsBasicOps.test_multiple_security_groups - [374.109288s]
tempest.scenario.test_network_basic_ops.TestNetworkBasicOps.test_subnet_details - [546.753750s]
tempest.scenario.test_security_groups_basic_ops.TestSecurityGroupsBasicOps.test_port_update_new_security_group - [386.824888s]
tempest.scenario.test_network_basic_ops.TestNetworkBasicOps.test_update_router_admin_state - [535.216286s]
tempest.scenario.test_network_v6.TestGettingAddress.test_dhcp6_stateless_from_os - [465.421073s]
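
The total can be checked by summing the per-test durations listed above. A short Python sketch, with the values copied from that list:

```python
# Durations (in seconds) of the nine failing tests, copied from the list above.
durations = [
    514.348089,  # test_minimum_basic_scenario
    571.162965,  # test_server_connectivity_pause_unpause
    370.094413,  # test_cross_tenant_traffic
    603.494442,  # test_router_rescheduling
    374.109288,  # test_multiple_security_groups
    546.753750,  # test_subnet_details
    386.824888,  # test_port_update_new_security_group
    535.216286,  # test_update_router_admin_state
    465.421073,  # test_dhcp6_stateless_from_os
]

total_seconds = sum(durations)
print("%d tests, %.0f seconds, %.1f minutes"
      % (len(durations), total_seconds, total_seconds / 60))
# prints "9 tests, 4367 seconds, 72.8 minutes"
```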
