[10.0] [Tempest] Tests are failed with SSH timed out

Bug #1678044 reported by Yury Tregubov
10
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Mirantis OpenStack
Fix Released
Critical
Yury Tregubov

Bug Description

On mos10 build #1544 cirros image is failed to operate in the following ways.

1. Boot log contains these errors:
=== sshd host keys ===
-----BEGIN SSH HOST KEY KEYS-----
[ 4.612379] random: dropbearkey: uninitialized urandom read (32 bytes read, 18 bits of entropy available)
Failed reading '/etc/dropbear/dropbear_rsa_host_key'
[ 4.645629] random: dropbearkey: uninitialized urandom read (32 bytes read, 18 bits of entropy available)
Failed reading '/etc/dropbear/dropbear_dss_host_key'

2. Connection to the image works quite bad:

root@node-1:~# ping 10.109.23.129
PING 10.109.23.129 (10.109.23.129) 56(84) bytes of data.
From 10.109.22.1 icmp_seq=1 Destination Port Unreachable
From 10.109.22.1 icmp_seq=2 Destination Port Unreachable
From 10.109.22.1 icmp_seq=3 Destination Port Unreachable
--- 10.109.23.129 ping statistics ---
3 packets transmitted, 0 received, +3 errors, 100% packet loss, time 1998ms

But just few seconds later:

root@node-1:~# ping 10.109.22.129
PING 10.109.22.129 (10.109.22.129) 56(84) bytes of data.
64 bytes from 10.109.22.129: icmp_seq=1 ttl=63 time=3.37 ms
64 bytes from 10.109.22.129: icmp_seq=2 ttl=63 time=1.47 ms
64 bytes from 10.109.22.129: icmp_seq=3 ttl=63 time=1.71 ms
--- 10.109.22.129 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2003ms
rtt min/avg/max/mdev = 1.478/2.190/3.377/0.846 ms
root@node-1:~# ssh cirros@10.109.23.129

3. And also ssh refuses all connections:

ssh: connect to host 10.109.23.129 port 22: Connection refused

There are no such problems with cirros 0.3.5: http://download.cirros-cloud.net/0.3.5/cirros-0.3.5-x86_64-disk.img

Looks like the problem was introduced in
https://review.fuel-infra.org/#/c/32611/

Changed in fuel:
milestone: none → 10.1
description: updated
Changed in fuel:
importance: Undecided → High
description: updated
tags: added: blocker-for-qa
Changed in fuel:
importance: High → Critical
Revision history for this message
Yury Tregubov (ytregubov) wrote :

The problem affects Tempest passrate down to 96%

Revision history for this message
Yury Tregubov (ytregubov) wrote :
Revision history for this message
Stanislaw Bogatkin (sbogatkin) wrote :

So, what is the problem? There is not enough entropy when system is booting - it is normal situation. After some time there is enough collected and all works okay. So - why we should do something here?

Revision history for this message
Stanislaw Bogatkin (sbogatkin) wrote :

As a result - I close this as invalid. User can wait a bit when there will be enough entropy collected on best-effort basis or he always can build an image which will use userspace app called getrandom(2) systemcall (which is blocking for some time automatically) instead of getentropy(2).

Changed in fuel:
status: New → Invalid
Revision history for this message
Yury Tregubov (ytregubov) wrote :

Ok, lets assume that problems with connection are no problems so far.

But what about errors during the image boot:

=== sshd host keys ===
-----BEGIN SSH HOST KEY KEYS-----
[ 4.612379] random: dropbearkey: uninitialized urandom read (32 bytes read, 18 bits of entropy available)
Failed reading '/etc/dropbear/dropbear_rsa_host_key'
[ 4.645629] random: dropbearkey: uninitialized urandom read (32 bytes read, 18 bits of entropy available)
Failed reading '/etc/dropbear/dropbear_dss_host_key'

Revision history for this message
Yury Tregubov (ytregubov) wrote :

Ah.. that was about entropy. Sorry.

So the problem is that Tempest runs have quite low pass rate due to that issue.
Lets investigate it in this way than:

On mos10 build #1544 a lot of tempest tests are failed with this error:

Stacktrace

Traceback (most recent call last):
  File "tempest/test.py", line 99, in wrapper
    return f(self, *func_args, **func_kwargs)
  File "tempest/scenario/test_network_advanced_server_ops.py", line 137, in test_server_connectivity_rebuild
    floating_ip = self._setup_network(server, keypair)
  File "tempest/scenario/test_network_advanced_server_ops.py", line 77, in _setup_network
    server, keypair, floating_ip)
  File "tempest/scenario/test_network_advanced_server_ops.py", line 101, in _wait_server_status_and_check_network_connectivity
    self._check_network_connectivity(server, keypair, floating_ip)
  File "tempest/scenario/test_network_advanced_server_ops.py", line 94, in _check_network_connectivity
    servers=[server])
  File "tempest/scenario/manager.py", line 588, in check_public_network_connectivity
    mtu=mtu)
  File "tempest/scenario/manager.py", line 574, in check_vm_connectivity
    self.get_remote_client(ip_address, username, private_key)
  File "tempest/scenario/manager.py", line 331, in get_remote_client
    linux_client.validate_authentication()
  File "tempest/common/utils/linux/remote_client.py", line 54, in wrapper
    six.reraise(*original_exception)
  File "tempest/common/utils/linux/remote_client.py", line 35, in wrapper
    return function(self, *args, **kwargs)
  File "tempest/common/utils/linux/remote_client.py", line 99, in validate_authentication
    self.ssh_client.test_connection_auth()
  File "tempest/lib/common/ssh.py", line 176, in test_connection_auth
    connection = self._get_ssh_connection()
  File "tempest/lib/common/ssh.py", line 90, in _get_ssh_connection
    password=self.password)
tempest.lib.exceptions.SSHTimeout: Connection to the 10.109.4.133 via SSH timed out.
User: cirros, Password: None

Diagnostic snapshot is:
https://drive.google.com/a/mirantis.com/file/d/0B1Crk-sAvGanbDkyQTY5Sjd3QXc/view?usp=sharing

And is starts to happen today.

summary: - [10.0] [Tempest] Cirros image is broken
+ [10.0] [Tempest] Tests are failed with SSH timed out
Changed in fuel:
status: Invalid → New
Revision history for this message
Vladimir Kuklin (vkuklin) wrote :

So, apparently, the issue is that tempest does not wait for the instance to boot properly and tries to ssh when ssh is not started.

no longer affects: fuel
Changed in mos:
importance: Undecided → Critical
milestone: none → 10.0
assignee: nobody → Yury Tregubov (ytregubov)
status: New → Triaged
tags: added: area-qa
Revision history for this message
Oleksiy Butenko (obutenko) wrote :

I don't agree.
It's a bad practice to edit tests (increase timeouts).
Why we use custom cirros image?
I'll upload upstream image to the env with this error and restart the tests and then
attach results.

tags: removed: area-qa
Revision history for this message
Alexey Shtokolov (ashtokolov) wrote :
Changed in mos:
status: Triaged → Fix Committed
Revision history for this message
Yury Tregubov (ytregubov) wrote :

Verified on mos 10.0 #1561 build

Changed in mos:
status: Fix Committed → Fix Released
Revision history for this message
Ekaterina Shutova (eshutova) wrote :
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.