Bug #1678044 “[10.0] [Tempest] Tests are failed with SSH timed o...” : Bugs : Mirantis OpenStack

Yury Tregubov (ytregubov) on 2017-03-31

Changed in fuel:
milestone:	none → 10.1
description:	updated
Changed in fuel:
importance:	Undecided → High

Yury Tregubov (ytregubov) on 2017-03-31

description:	updated
tags:	added: blocker-for-qa
Changed in fuel:
importance:	High → Critical

Yury Tregubov (ytregubov) wrote on 2017-03-31:

#1

The problem brings the Tempest pass rate down to 96%.

Yury Tregubov (ytregubov) wrote on 2017-03-31:

#2

Diagnostic snapshot is:
https://drive.google.com/a/mirantis.com/file/d/0B1Crk-sAvGanbDkyQTY5Sjd3QXc/view?usp=sharing

Stanislaw Bogatkin (sbogatkin) wrote on 2017-03-31:

#3

So, what is the problem? There is not enough entropy while the system is booting, which is a normal situation. After some time enough entropy has been collected and everything works okay. So why should we do something here?

Stanislaw Bogatkin (sbogatkin) wrote on 2017-03-31:

#4

As a result, I am closing this as invalid. The user can wait a bit until enough entropy has been collected on a best-effort basis, or he can always build an image whose userspace app uses the getrandom(2) system call (which automatically blocks for some time) instead of getentropy(2).
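
To make the blocking behaviour mentioned above concrete, here is a minimal sketch, assuming a Linux guest with Python 3.6 or newer. It is not taken from the image or from Tempest; the helper names `pool_is_initialized` and `read_key_material` are made up for illustration. With `GRND_NONBLOCK`, `os.getrandom` raises `BlockingIOError` instead of waiting when the kernel pool is not yet initialized; without the flag it blocks until the pool is ready.

```python
import os


def pool_is_initialized() -> bool:
    """Return True if the kernel random pool is ready (Linux only)."""
    try:
        # GRND_NONBLOCK: fail immediately instead of blocking
        # when the pool has not been initialized yet.
        os.getrandom(1, os.GRND_NONBLOCK)
        return True
    except BlockingIOError:
        return False


def read_key_material(nbytes: int = 32) -> bytes:
    """Block until the pool is initialized, then return nbytes random bytes."""
    return os.getrandom(nbytes)
```

A key generator written this way would simply start later on a freshly booted VM rather than read from an uninitialized pool.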

Changed in fuel:
status:	New → Invalid

Yury Tregubov (ytregubov) wrote on 2017-03-31:

#5

OK, let's assume that the connection problems are not a problem for now.

But what about the errors during image boot:

=== sshd host keys ===
-----BEGIN SSH HOST KEY KEYS-----
[ 4.612379] random: dropbearkey: uninitialized urandom read (32 bytes read, 18 bits of entropy available)
Failed reading '/etc/dropbear/dropbear_rsa_host_key'
[ 4.645629] random: dropbearkey: uninitialized urandom read (32 bytes read, 18 bits of entropy available)
Failed reading '/etc/dropbear/dropbear_dss_host_key'
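
For anyone who wants to check other console logs for the same symptom, the kernel warnings above can be picked out with a small script. This is only a sketch: the regular expression is written against the two lines quoted here, and the names `URANDOM_WARNING`, `SAMPLE_LOG` and `find_urandom_warnings` are hypothetical.

```python
import re

# Written against lines such as:
# [ 4.612379] random: dropbearkey: uninitialized urandom read (32 bytes read, 18 bits of entropy available)
URANDOM_WARNING = re.compile(
    r"\[\s*(?P<uptime>\d+\.\d+)\]\s+random:\s+(?P<process>[^:]+):\s+"
    r"uninitialized urandom read \((?P<bytes>\d+) bytes read, "
    r"(?P<bits>\d+) bits of entropy available\)"
)

SAMPLE_LOG = """\
[ 4.612379] random: dropbearkey: uninitialized urandom read (32 bytes read, 18 bits of entropy available)
Failed reading '/etc/dropbear/dropbear_rsa_host_key'
[ 4.645629] random: dropbearkey: uninitialized urandom read (32 bytes read, 18 bits of entropy available)
Failed reading '/etc/dropbear/dropbear_dss_host_key'
"""


def find_urandom_warnings(console_log: str):
    """Return (uptime_seconds, process, bytes_read, entropy_bits) tuples."""
    return [
        (float(m["uptime"]), m["process"], int(m["bytes"]), int(m["bits"]))
        for m in URANDOM_WARNING.finditer(console_log)
    ]


for warning in find_urandom_warnings(SAMPLE_LOG):
    print(warning)
```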

Yury Tregubov (ytregubov) wrote on 2017-03-31:

#6

Ah, that was about entropy. Sorry.

So the problem is that Tempest runs have a quite low pass rate because of that issue.
Let's investigate it this way then:

On mos10 build #1544, a lot of Tempest tests fail with this error:

Stacktrace

Traceback (most recent call last):
  File "tempest/test.py", line 99, in wrapper
    return f(self, *func_args, **func_kwargs)
  File "tempest/scenario/test_network_advanced_server_ops.py", line 137, in test_server_connectivity_rebuild
    floating_ip = self._setup_network(server, keypair)
  File "tempest/scenario/test_network_advanced_server_ops.py", line 77, in _setup_network
    server, keypair, floating_ip)
  File "tempest/scenario/test_network_advanced_server_ops.py", line 101, in _wait_server_status_and_check_network_connectivity
    self._check_network_connectivity(server, keypair, floating_ip)
  File "tempest/scenario/test_network_advanced_server_ops.py", line 94, in _check_network_connectivity
    servers=[server])
  File "tempest/scenario/manager.py", line 588, in check_public_network_connectivity
    mtu=mtu)
  File "tempest/scenario/manager.py", line 574, in check_vm_connectivity
    self.get_remote_client(ip_address, username, private_key)
  File "tempest/scenario/manager.py", line 331, in get_remote_client
    linux_client.validate_authentication()
  File "tempest/common/utils/linux/remote_client.py", line 54, in wrapper
    six.reraise(*original_exception)
  File "tempest/common/utils/linux/remote_client.py", line 35, in wrapper
    return function(self, *args, **kwargs)
  File "tempest/common/utils/linux/remote_client.py", line 99, in validate_authentication
    self.ssh_client.test_connection_auth()
  File "tempest/lib/common/ssh.py", line 176, in test_connection_auth
    connection = self._get_ssh_connection()
  File "tempest/lib/common/ssh.py", line 90, in _get_ssh_connection
    password=self.password)
tempest.lib.exceptions.SSHTimeout: Connection to the 10.109.4.133 via SSH timed out.
User: cirros, Password: None

Diagnostic snapshot is:
https://drive.google.com/a/mirantis.com/file/d/0B1Crk-sAvGanbDkyQTY5Sjd3QXc/view?usp=sharing

And it started happening today.

summary:	- [10.0] [Tempest] Cirros image is broken
+ [10.0] [Tempest] Tests are failed with SSH timed out
Changed in fuel:
status:	Invalid → New

Vladimir Kuklin (vkuklin) wrote on 2017-03-31:

#7

So, apparently, the issue is that Tempest does not wait for the instance to boot properly and tries to connect over SSH before the SSH server has started.
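
As an illustration of what "waiting for the instance" could look like, here is a minimal sketch that polls a TCP port until it accepts connections. It is not Tempest code and not the fix that was eventually committed; `wait_for_tcp_port` and its parameters are hypothetical. Note that an open port only shows that something is listening; it does not prove that host keys are ready or that authentication will succeed.

```python
import socket
import time


def wait_for_tcp_port(host: str, port: int = 22, timeout: float = 60.0,
                      interval: float = 1.0) -> bool:
    """Poll until host:port accepts a TCP connection or timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # Each attempt is bounded by `interval` seconds.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Refused, unreachable or timed out: the server is not up yet.
            pass
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

A caller would run something like `wait_for_tcp_port("10.109.4.133", 22, timeout=120)` before attempting SSH authentication.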

no longer affects:	fuel
Changed in mos:
importance:	Undecided → Critical
milestone:	none → 10.0
assignee:	nobody → Yury Tregubov (ytregubov)
status:	New → Triaged

Ilya Bumarskov (ibumarskov) on 2017-04-03

tags:	added: area-qa

Oleksiy Butenko (obutenko) wrote on 2017-04-03:

#8

I don't agree.
It's bad practice to edit tests (increase timeouts).
Why do we use a custom cirros image?
I'll upload the upstream image to the env with this error, restart the tests, and then
attach the results.

Ilya Bumarskov (ibumarskov) on 2017-04-03

tags:	removed: area-qa

Alexey Shtokolov (ashtokolov) wrote on 2017-04-03:

#9

https://review.fuel-infra.org/#/c/32721/

Changed in mos:
status:	Triaged → Fix Committed

Yury Tregubov (ytregubov) wrote on 2017-04-04:

#10

Verified on MOS 10.0 build #1561.

Changed in mos:
status:	Fix Committed → Fix Released

Ekaterina Shutova (eshutova) wrote on 2017-04-04:

#11

Verified on MOS 10.0 snap #1561:
The problem is no longer seen: http://cz7776.bud.mirantis.net:8080/jenkins/view/TEMPEST-10.X/job/Tempest_10.x_Ceph_SSL/69
