tempest.scenario.test_network_basic_ops.TestNetworkBasicOps Failing on queens

Bug #1836046 reported by Arx Cruz on 2019-07-10
12
This bug affects 2 people
Affects Status Importance Assigned to Milestone
tripleo
Critical
chandan kumar

Bug Description

(note: there is more info on related bugs@
  --> https://bugs.launchpad.net/nova/+bug/1836642
  --> https://bugs.launchpad.net/openstack-gate/+bug/1808010
  --> https://bugs.launchpad.net/devstack/+bug/1837980 )

)

http://logs.rdoproject.org/openstack-periodic-24hr/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-7-ovb-3ctlr_1comp-featureset001-queens/d0ac9d1/logs/tempest.html.gz

Ssh is timeing out.

        Body:
2019-07-10 06:59:41,690 80769 INFO [tempest.lib.common.rest_client] Request (TestNetworkBasicOps:_run_cleanups): 204 DELETE https://[2001:db8:fd00:1000::5]:13696/v2.0/networks/574ddc44-7db2-4880-bb11-cf2a59170135 1.128s
2019-07-10 06:59:41,690 80769 DEBUG [tempest.lib.common.rest_client] Request - Headers: {'Content-Type': 'application/json', 'Accept': 'application/json', 'X-Auth-Token': '<omitted>'}
        Body: None
    Response - Headers: {'status': '204', u'content-length': '0', 'content-location': 'https://[2001:db8:fd00:1000::5]:13696/v2.0/networks/574ddc44-7db2-4880-bb11-cf2a59170135', u'date': 'Wed, 10 Jul 2019 06:59:41 GMT', u'connection': 'close', u'x-openstack-request-id': 'req-e5d655ce-e4bd-4b64-9b07-6b2e08385f46'}
        Body:
}}}

Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/tempest/common/utils/__init__.py", line 88, in wrapper
    return f(*func_args, **func_kwargs)
  File "/usr/lib/python2.7/site-packages/tempest/scenario/test_network_basic_ops.py", line 404, in test_network_basic_ops
    self.check_public_network_connectivity(should_connect=True)
  File "/usr/lib/python2.7/site-packages/tempest/scenario/test_network_basic_ops.py", line 208, in check_public_network_connectivity
    self.servers, mtu=mtu)
  File "/usr/lib/python2.7/site-packages/tempest/scenario/manager.py", line 585, in check_public_network_connectivity
    mtu=mtu)
  File "/usr/lib/python2.7/site-packages/tempest/scenario/manager.py", line 571, in check_vm_connectivity
    self.get_remote_client(ip_address, username, private_key)
  File "/usr/lib/python2.7/site-packages/tempest/scenario/manager.py", line 358, in get_remote_client
    linux_client.validate_authentication()
  File "/usr/lib/python2.7/site-packages/tempest/lib/common/utils/linux/remote_client.py", line 60, in wrapper
    six.reraise(*original_exception)
  File "/usr/lib/python2.7/site-packages/tempest/lib/common/utils/linux/remote_client.py", line 33, in wrapper
    return function(self, *args, **kwargs)
  File "/usr/lib/python2.7/site-packages/tempest/lib/common/utils/linux/remote_client.py", line 116, in validate_authentication
    self.ssh_client.test_connection_auth()
  File "/usr/lib/python2.7/site-packages/tempest/lib/common/ssh.py", line 207, in test_connection_auth
    connection = self._get_ssh_connection()
  File "/usr/lib/python2.7/site-packages/tempest/lib/common/ssh.py", line 121, in _get_ssh_connection
    password=self.password)
tempest.lib.exceptions.SSHTimeout: Connection to the 10.0.0.107 via SSH timed out.
User: cirros, Password: None

Nate Johnston (nate-johnston) wrote :

it looks like ssh is successfully able to connect to the instance but the instance does not accept the test's public ssh key for authentication.

    2019-07-10 07:10:28,483 88330 INFO [paramiko.transport] Connected (version 2.0, client dropbear_2012.55)
    2019-07-10 07:10:28,606 88330 INFO [paramiko.transport] Authentication (publickey) failed.
    2019-07-10 07:10:28,725 88330 WARNING [tempest.lib.common.ssh] Failed to establish authenticated ssh connection to cirros@10.0.0.106 (Authentication failed.). Number attempts: 23. Retry after 24 seconds.

This is caused by the failure of the instance to fetch metadata:

    failed to get http://169.254.169.254/2009-04-04/meta-data/public-keys

This is a known issue.

Nate Johnston (nate-johnston) wrote :

This appears related to https://bugs.launchpad.net/bugs/1836642 although I am not confident enough to call it a duplicate yet.

Marios Andreou (marios-b) wrote :
description: updated
Marios Andreou (marios-b) wrote :

we were lucky to get a queens promotion so we're green again there but just saw another example from this morning
http://logs.rdoproject.org/openstack-periodic-24hr/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-7-ovb-3ctlr_1comp-featureset001-queens/2758486/logs/tempest.html.gz

2019-07-26 08:18:17,276 109279 WARNING [tempest.lib.common.ssh] Failed to establish authenticated ssh connection to cirros@10.0.0.106 ([Errno None] Unable to connect to port 22 on 10.0.0.106). Number attempts: 1. Retry after 2 seconds.

description: updated
Marios Andreou (marios-b) wrote :

related fix from Slavek @ "Use memcached based cache in nova in all devstack-tempest jobs" https://review.opendev.org/#/c/672715/2 I8efedb8c5ae073bd2cb4034369bfcff4b67eb868

Changed in tripleo:
milestone: train-2 → train-3
chandan kumar (chkumar246) wrote :

We are seeing it again https://logs.rdoproject.org/openstack-periodic-24hr/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-7-ovb-3ctlr_1comp-featureset001-queens/28ff548/logs/tempest.html.gz
2019-08-10 07:02:44.940 78835 ERROR tempest.scenario.manager File "/usr/lib/python2.7/site-packages/tempest/lib/common/ssh.py", line 207, in test_connection_auth
2019-08-10 07:02:44.940 78835 ERROR tempest.scenario.manager connection = self._get_ssh_connection()
2019-08-10 07:02:44.940 78835 ERROR tempest.scenario.manager File "/usr/lib/python2.7/site-packages/tempest/lib/common/ssh.py", line 121, in _get_ssh_connection
2019-08-10 07:02:44.940 78835 ERROR tempest.scenario.manager password=self.password)
2019-08-10 07:02:44.940 78835 ERROR tempest.scenario.manager SSHTimeout: Connection to the 10.0.0.106 via SSH timed out.
2019-08-10 07:02:44.940 78835 ERROR tempest.scenario.manager User: cirros, Password: None

Martin Schuppert (mschuppert) wrote :

caching is already enabled in tripleo for nova for [cache] and [keystone_authtoken] [1]:

[cache]
...
backend=oslo_cache.memcache_pool
...
memcache_servers=172.17.0.17:11211,172.17.0.11:11211,172.17.0.15:11211

[1] https://logs.rdoproject.org/openstack-periodic-24hr/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-7-ovb-3ctlr_1comp-featureset001-queens/28ff548/logs/overcloud-controller-0/var/log/config-data/nova/etc/nova/nova.conf.txt.gz

Slawek Kaplonski (slaweq) wrote :

Thx Martin for checking this.
So now the question is why nova-metadata-api processed call http://169.254.169.254/2009-04-04/meta-data/public-keys/0/openssh-key for more than 10 seconds. See here:

https://logs.rdoproject.org/openstack-periodic-24hr/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-7-ovb-3ctlr_1comp-featureset001-queens/28ff548/logs/overcloud-controller-0/var/log/containers/nova/nova-api-metadata.log.txt.gz#_2019-08-10_06_57_49_444

And that was in this case reason why SSH key wasn't configured on guest VM thus SSH to it was not possible.

Before this failed request there was request to http://169.254.169.254/2009-04-04/meta-data/public-keys/ from the same instance and it was processed by other controller node:

https://logs.rdoproject.org/openstack-periodic-24hr/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-7-ovb-3ctlr_1comp-featureset001-queens/28ff548/logs/overcloud-controller-1/var/log/containers/nova/nova-api-metadata.log.txt.gz#_2019-08-10_06_57_38_551

It also took more than 7 seconds (so it's long) but as this was less than 10 seconds, it was enough for ec2-metadata script which is run on Cirros image to get all this data.

Changed in tripleo:
assignee: nobody → chandan kumar (chkumar246)
status: Triaged → In Progress
chandan kumar (chkumar246) wrote :

Copying sean comments from prodchain escalations cards

this can be worked around by either enabling config drive or
setting the correct image_ssh_password in the tempest.conf

https://github.com/openstack/tempest/blob/master/tempest/config.py#L802-L807

the current tempest config only sets the user to cirros

https://logs.rdoproject.org/openstack-periodic-24hr/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-7-ovb-3ctlr_1comp-featureset001-queens/28ff548/logs/undercloud/home/zuul/tempest/etc/tempest.conf.txt.gz

we should also be setting image_ssh_password="gocubsgo"
so that it can fall back to the password if the ssh key is not retrieved.

chandan kumar (chkumar246) wrote :

We have seen the same issue on master and other branches, So enabling config_drive will fix the issue.

Reviewed: https://review.opendev.org/676439
Committed: https://git.openstack.org/cgit/openstack/tripleo-quickstart/commit/?id=810de643b41f4e4db456f62f6409fe93c2dfe89c
Submitter: Zuul
Branch: master

commit 810de643b41f4e4db456f62f6409fe93c2dfe89c
Author: Chandan Kumar (raukadah) <email address hidden>
Date: Wed Aug 14 20:29:32 2019 +0530

    Use config_drive for network basicops tests

    For running network basicops tests, we need to set
    compute_feature_enabled.config_drive to true so that
    user-data should be passed to the cirros vm while booting.

    Related-Bug: #1836046

    Change-Id: Ieac018bf19481666e0db31dfa70a78dd0dc3f046
    Signed-off-by: Chandan Kumar (raukadah) <email address hidden>

Reviewed: https://review.opendev.org/678008
Committed: https://git.openstack.org/cgit/openstack/tripleo-quickstart-extras/commit/?id=5bd20e79470de98e4edceaf110de8462af735458
Submitter: Zuul
Branch: master

commit 5bd20e79470de98e4edceaf110de8462af735458
Author: Ronelle Landy <email address hidden>
Date: Thu Aug 22 08:50:27 2019 -0400

    Add compute_feature_enabled.config_drive to configure-tempest.sh

    https://review.opendev.org/#/c/676439 added this setting to
    fs001 but it's needed generally. Adding it to
    configure-tempest.sh.

    Change-Id: Ic23fb25f1d03d35edd4ae8850003b90db422b438
    Related-Bug: #1836046

chandan kumar (chkumar246) wrote :
Changed in tripleo:
status: In Progress → Fix Released
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers