Timeouts in neutron are causing SSH failures in tempest test instances

Bug #1757556 reported by wes hayutin
This bug affects 5 people
Affects: tripleo
Status: Won't Fix
Importance: Critical
Assigned to: Arx Cruz

Bug Description

Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/tempest/common/utils/__init__.py", line 88, in wrapper
    return f(*func_args, **func_kwargs)
  File "/usr/lib/python2.7/site-packages/tempest/scenario/test_network_basic_ops.py", line 404, in test_network_basic_ops
    self.check_public_network_connectivity(should_connect=True)
  File "/usr/lib/python2.7/site-packages/tempest/scenario/test_network_basic_ops.py", line 208, in check_public_network_connectivity
    self.servers, mtu=mtu)
  File "/usr/lib/python2.7/site-packages/tempest/scenario/manager.py", line 585, in check_public_network_connectivity
    mtu=mtu)
  File "/usr/lib/python2.7/site-packages/tempest/scenario/manager.py", line 571, in check_vm_connectivity
    self.get_remote_client(ip_address, username, private_key)
  File "/usr/lib/python2.7/site-packages/tempest/scenario/manager.py", line 358, in get_remote_client
    linux_client.validate_authentication()
  File "/usr/lib/python2.7/site-packages/tempest/lib/common/utils/linux/remote_client.py", line 60, in wrapper
    six.reraise(*original_exception)
  File "/usr/lib/python2.7/site-packages/tempest/lib/common/utils/linux/remote_client.py", line 33, in wrapper
    return function(self, *args, **kwargs)
  File "/usr/lib/python2.7/site-packages/tempest/lib/common/utils/linux/remote_client.py", line 116, in validate_authentication
    self.ssh_client.test_connection_auth()
  File "/usr/lib/python2.7/site-packages/tempest/lib/common/ssh.py", line 207, in test_connection_auth
    connection = self._get_ssh_connection()
  File "/usr/lib/python2.7/site-packages/tempest/lib/common/ssh.py", line 121, in _get_ssh_connection
    password=self.password)
tempest.lib.exceptions.SSHTimeout

http://logs.openstack.org/22/531322/13/gate/tripleo-ci-centos-7-3nodes-multinode/fb044de/logs/tempest.html.gz

Several timeouts in neutron agents visible here:

http://logs.openstack.org/22/531322/13/gate/tripleo-ci-centos-7-3nodes-multinode/fb044de/logs/subnode-3/var/log/extra/errors.txt.gz

It was pointed out that there are several "no such device" errors in the OVS logs, more than usual:

http://logs.openstack.org/22/531322/13/gate/tripleo-ci-centos-7-3nodes-multinode/fb044de/logs/subnode-3/var/log/openvswitch/ovs-vswitchd.log.txt.gz

poking around for more details

Revision history for this message
Ihar Hrachyshka (ihar-hrachyshka) wrote :

I don't think the 'no such device' errors are relevant; they happen all the time in the OVS log and generally don't indicate an issue.

Note that the test fails on SSH but not on the ping executed just before the SSH attempt, so it's something with security groups. I haven't noticed anything particularly interesting in the OVS agent though. (I haven't checked all nodes; that's something to explore further.)

Revision history for this message
yatin (yatinkarel) wrote :

<<< Isn't this related to https://bugs.launchpad.net/tripleo/+bug/1757134 ?
In that bug ovn-metadata was not enabled, so this is a different issue.

<<< Several timeouts in neutron agents visible here:
<<< http://logs.openstack.org/22/531322/13/gate/tripleo-ci-centos-7-3nodes-multinode/fb044de/logs/subnode-3/var/log/extra/errors.txt.gz

All the timeouts there are before 16:13:01.649, i.e. during the setup of services; the instance booted at 16:18, so I think the messaging timeouts are not related here.

Here it looks like it is taking too long to get a reply from the metadata agent (time > 9 seconds), which is causing the failures.
From the console log: failed to get http://169.254.169.254/2009-04-04/meta-data/public-keys/0/openssh-key

Metadata agent log: http://logs.openstack.org/22/531322/13/gate/tripleo-ci-centos-7-3nodes-multinode/fb044de/logs/subnode-3/var/log/containers/neutron/metadata-agent.log.txt.gz#_2018-03-21_16_18_57_094

Copying all the requests here so someone can check the times:

10.100.0.8,<local> "GET /2009-04-04/meta-data/instance-id HTTP/1.1" status: 200 len: 146 time: 5.5689421
10.100.0.8,<local> "GET /2009-04-04/meta-data/public-keys HTTP/1.1" status: 200 len: 176 time: 9.7856338
10.100.0.8,<local> "GET /2009-04-04/meta-data/public-keys/0/openssh-key HTTP/1.1" status: 200 len: 535 time: 12.6972821
10.100.0.8,<local> "GET /2009-04-04/meta-data/instance-id HTTP/1.1" status: 200 len: 146 time: 9.7274239
10.100.0.8,<local> "GET /2009-04-04/meta-data/ami-launch-index HTTP/1.1" status: 200 len: 136 time: 9.6607111
10.100.0.8,<local> "GET /2009-04-04/meta-data/instance-type HTTP/1.1" status: 200 len: 142 time: 9.7088010
10.100.0.8,<local> "GET /2009-04-04/meta-data/local-ipv4 HTTP/1.1" status: 200 len: 146 time: 9.7054300
10.100.0.8,<local> "GET /2009-04-04/meta-data/public-ipv4 HTTP/1.1" status: 200 len: 150 time: 9.6956398
10.100.0.8,<local> "GET /2009-04-04/meta-data/local-hostname HTTP/1.1" status: 200 len: 180 time: 9.7051752
10.100.0.8,<local> "GET /2009-04-04/user-data HTTP/1.1" status: 404 len: 297 time: 9.9423590
10.100.0.8,<local> "GET /2009-04-04/meta-data/block-device-mapping HTTP/1.1" status: 200 len: 143 time: 9.7030501
10.100.0.8,<local> "GET /2009-04-04/meta-data/block-device-mapping/ami HTTP/1.1" status: 200 len: 138 time: 9.6972198
10.100.0.8,<local> "GET /2009-04-04/meta-data/block-device-mapping/root HTTP/1.1" status: 200 len: 143 time: 9.6946170
10.100.0.8,<local> "GET /2009-04-04/meta-data/public-hostname HTTP/1.1" status: 200 len: 180 time: 9.7149651
10.100.0.8,<local> "GET /2009-04-04/meta-data/placement/availability-zone HTTP/1.1" status: 200 len: 139 time: 9.7201109

I tried to find the timeout and found that it is 10 seconds ([2]) for the various metadata requests ([1]) in cirros, but it took > 12 seconds for public-keys:
10.100.0.8,<local> "GET /2009-04-04/meta-data/public-keys/0/openssh-key HTTP/1.1" status: 200 len: 535 time: 12.6972821
[1] https://github.com/XANi/cirros/blob/42fdb92afa37a78fb2de179aeb0fc6da10430406/src/lib/cirros/ds/ec2#L66-L73
[2] https://github.com/XANi/cirros/blob/42fdb92afa37a78fb2de179aeb0fc6da10430406/src/usr/bin/ec2metadata#L5
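
For reference, a minimal Python sketch of the client-side behaviour described above (the URL and the 10-second timeout come from the cirros sources in [1] and [2]; the Python helper itself is illustrative, since cirros' actual client is a shell script):

import socket
import urllib.error
import urllib.request

METADATA_URL = ("http://169.254.169.254/2009-04-04/"
                "meta-data/public-keys/0/openssh-key")
TIMEOUT = 10  # seconds, matching the cirros ec2metadata default in [2]

def fetch_openssh_key(url=METADATA_URL, timeout=TIMEOUT):
    """Return the injected SSH key, or None if the metadata service is too slow."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.read().decode()
    except (urllib.error.URLError, socket.timeout):
        # This is the "failed to get .../openssh-key" case from the console
        # log above: a 12.7 s response blows through the 10 s client timeout.
        return None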

Is this much response time expected from metadat...


Revision history for this message
Harald Jensås (harald-jensas) wrote :

Could this be slow because of the openvswitch bug eating CPU time?
https://bugs.launchpad.net/neutron/+bug/1750777

The fix [1] is merged, but since there was no promotion since ...

[1] https://review.openstack.org/549331

Revision history for this message
Brian Haley (brian-haley) wrote :

Harald - I don't know, but both changes are in master, just waiting for the second stable/queens backport to land.

Revision history for this message
Brian Haley (brian-haley) wrote :

And regarding the slow metadata responses, I can see them over in the nova-api-metadata log, so I'm not sure it's the metadata agent proxy:

http://logs.openstack.org/22/531322/13/gate/tripleo-ci-centos-7-3nodes-multinode/fb044de/logs/subnode-3/var/log/containers/nova/nova-api-metadata.log.txt.gz

2018-03-21 16:21:17.217 9 INFO nova.metadata.wsgi.server [req-cc5f57a2-668c-4d0c-a330-42ba2a61b3fd - - - - -] 10.100.0.8,192.168.24.3 "GET /2009-04-04/meta-data/placement/availability-zone HTTP/1.1" status: 200 len: 120 time: 9.4926560

There are a number of 9+ second entries in the log.
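
A quick, hypothetical helper to pull such slow entries out of the metadata-agent / nova-api-metadata logs quoted above; it only assumes the 'time: <seconds>' suffix visible in those lines, and the function name and threshold are illustrative:

import re

TIME_RE = re.compile(r'"(GET [^"]+)" status: (\d+) len: \d+ time: ([\d.]+)')

def slow_requests(log_path, threshold=9.0):
    """Yield (request, status, seconds) for metadata requests slower than threshold."""
    with open(log_path) as fh:
        for line in fh:
            m = TIME_RE.search(line)
            if m and float(m.group(3)) > threshold:
                yield m.group(1), int(m.group(2)), float(m.group(3))

# e.g.: for req, status, secs in slow_requests("nova-api-metadata.log"):
#           print("%6.2fs %s %s" % (secs, status, req))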

Revision history for this message
yatin (yatinkarel) wrote :

I found that the l3 agent is taking too much time for router operations like gateway add and floating IP addition; I checked at multiple affected places. I think this is delaying when metadata requests start getting responses inside the VMs. As for the large time taken by the metadata responses themselves (9+ seconds), I need to dig more. Can the l3 agent be the reason there?

Adding one example from this bug's logs:
Request to add interface:- http://logs.openstack.org/22/531322/13/gate/tripleo-ci-centos-7-3nodes-multinode/fb044de/logs/subnode-3/var/log/containers/neutron/server.log.txt.gz#_2018-03-21_16_17_29_961
Started Update:- http://logs.openstack.org/22/531322/13/gate/tripleo-ci-centos-7-3nodes-multinode/fb044de/logs/subnode-3/var/log/containers/neutron/l3-agent.log.txt.gz#_2018-03-21_16_17_29_956
Finished Update:- http://logs.openstack.org/22/531322/13/gate/tripleo-ci-centos-7-3nodes-multinode/fb044de/logs/subnode-3/var/log/containers/neutron/l3-agent.log.txt.gz#_2018-03-21_16_18_09_107

So the request took about 40 seconds to complete. In another place I saw it take more than 50 seconds, and when parallel requests come in it goes to around 2 minutes or more. Isn't that too high? On devstack and non-containerized deployments I found that it took 3-4 seconds.
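
A back-of-the-envelope check of that delta, using the timestamps from the two l3-agent log links above:

from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S.%f"
started = datetime.strptime("2018-03-21 16:17:29.956", FMT)   # "Started Update" link above
finished = datetime.strptime("2018-03-21 16:18:09.107", FMT)  # "Finished Update" link above
print((finished - started).total_seconds())  # ~39.2 seconds for one router update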

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to tripleo-heat-templates (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/557424

Revision history for this message
yatin (yatinkarel) wrote :

Filed a bug in oslo.rootwrap about the performance issue: https://bugs.launchpad.net/oslo.rootwrap/+bug/1760471

Revision history for this message
Emilien Macchi (emilienm) wrote :

Removing the alert on the #tripleo channel as it's a tracked issue in oslo.

tags: removed: alert
Matt Young (halcyondude)
Changed in tripleo:
assignee: nobody → Arx Cruz (arxcruz)
Revision history for this message
Alan Pevec (apevec) wrote :

There is nothing to be fixed in oslo.rootwrap; the workaround in paunch + THT will do for the originally reported Neutron issue.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Related fix proposed to branch: master
Review: https://review.openstack.org/560359

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to tripleo-heat-templates (stable/queens)

Related fix proposed to branch: stable/queens
Review: https://review.openstack.org/560991

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to tripleo-heat-templates (stable/queens)

Reviewed: https://review.openstack.org/560991
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=ca09d08bf782ad2e9c762fb8f945c06119bd9c28
Submitter: Zuul
Branch: stable/queens

commit ca09d08bf782ad2e9c762fb8f945c06119bd9c28
Author: yatin <email address hidden>
Date: Wed Apr 11 15:53:22 2018 +0530

    Set ulimit for nova-compute and cinder-volume

    Nova compute and cinder volume use oslo.concurrency
    processutils.execute to run privileged commands.
    Containers inherit the file descriptor limit from the docker
    daemon (currently 1048576), which is too high and leads to
    performance issues. This patch sets the nofile limit to 1024
    for nova compute and 131072 for cinder volume, which is
    reasonable: before containers, nova compute used the host
    default, i.e. 1024, and cinder volume a systemctl override
    ([1]), i.e. 131072. Also updated the neutron l3, dhcp and ovs
    agents to use Parameters for ulimit configuration.

    [1] https://review.rdoproject.org/r/#/c/1360/.

    Closes-Bug: #1762455
    Related-Bug: #1760471
    Related-Bug: #1757556
    Change-Id: I4d4b36de32f8a8e311efd87ea1c4095c5568dec4
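
As an aside, a rough standalone sketch (not TripleO or oslo code) of why the inherited 1048576 nofile limit hurts: with close_fds=True, Python 2's subprocess module sweeps every possible descriptor up to the nofile limit before exec'ing, which is what each rootwrap/processutils call ends up doing. Timing that sweep directly (the helper name and numbers are illustrative):

import os
import time

def time_fd_sweep(limit, repeats=5):
    """Average time for the close() sweep subprocess does under close_fds=True."""
    # Run in a throwaway interpreter: this closes any fd >= 3 that happens to be open.
    start = time.time()
    for _ in range(repeats):
        os.closerange(3, limit)  # nearly all of these fds are already closed (EBADF)
    return (time.time() - start) / repeats

# print(time_fd_sweep(1024))     # host default before containerisation
# print(time_fd_sweep(1048576))  # limit inherited from the docker daemon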

tags: added: in-stable-queens
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to tripleo-heat-templates (master)

Reviewed: https://review.openstack.org/560359
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=70276931a4664d0bedcd6b0caa5a9cac2b73187b
Submitter: Zuul
Branch: master

commit 70276931a4664d0bedcd6b0caa5a9cac2b73187b
Author: yatin <email address hidden>
Date: Wed Apr 11 15:53:22 2018 +0530

    Set ulimit for nova-compute and cinder-volume

    Nova compute and cinder volume use oslo.concurrency
    processutils.execute to run privileged commands.
    Containers inherit the file descriptor limit from the docker
    daemon (currently 1048576), which is too high and leads to
    performance issues. This patch sets the nofile limit to 1024
    for nova compute and 131072 for cinder volume, which is
    reasonable: before containers, nova compute used the host
    default, i.e. 1024, and cinder volume a systemctl override
    ([1]), i.e. 131072. Also updated the neutron l3, dhcp and ovs
    agents to use Parameters for ulimit configuration.

    [1] https://review.rdoproject.org/r/#/c/1360/.

    Closes-Bug: #1762455
    Related-Bug: #1760471
    Related-Bug: #1757556
    Change-Id: I4d4b36de32f8a8e311efd87ea1c4095c5568dec4

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to tripleo-heat-templates (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/561498

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to tripleo-heat-templates (master)

Reviewed: https://review.openstack.org/561498
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=d3d27d7ea84fe01320dee3e4fa2e2fad11640eb7
Submitter: Zuul
Branch: master

commit d3d27d7ea84fe01320dee3e4fa2e2fad11640eb7
Author: yatin <email address hidden>
Date: Mon Apr 16 08:12:48 2018 +0530

    Use hiera interpolation for memcached_network

    After [1] iptables rules are not set for the memcached service,
    thus services relying on memcached were not functioning well.
    With [2] it's required to use hiera interpolation for service
    configs; this patch fixes it for memcached_network.

    [1] https://review.openstack.org/#/c/551292
    [2] https://review.openstack.org/#/c/526692

    Related-Bug: #1757556
    Closes-Bug: #1763009
    Change-Id: If9b274192ea4738f455a6106ff1a62eb4e7a5c91

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to tripleo-heat-templates (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/561636

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to tripleo-heat-templates (stable/queens)

Related fix proposed to branch: stable/queens
Review: https://review.openstack.org/561637

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on tripleo-heat-templates (stable/queens)

Change abandoned by Emilien Macchi (<email address hidden>) on branch: stable/queens
Review: https://review.openstack.org/561637
Reason: not needed in fact.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on tripleo-heat-templates (master)

Change abandoned by Emilien Macchi (<email address hidden>) on branch: master
Review: https://review.openstack.org/561636

Changed in tripleo:
milestone: rocky-1 → rocky-2
Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Change abandoned by Alex Schultz (<email address hidden>) on branch: master
Review: https://review.openstack.org/557424

Changed in tripleo:
milestone: rocky-2 → rocky-3
Revision history for this message
wes hayutin (weshayutin) wrote :

Closing as out of date; please re-open if needed.

Changed in tripleo:
status: Triaged → Won't Fix
Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Change abandoned by Alex Schultz (<email address hidden>) on branch: master
Review: https://review.openstack.org/557073

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to tripleo-heat-templates (stable/queens)

Reviewed: https://review.opendev.org/561637
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=ada9bc36732649e34721e64a4a92de54fd4d8842
Submitter: Zuul
Branch: stable/queens

commit ada9bc36732649e34721e64a4a92de54fd4d8842
Author: yatin <email address hidden>
Date: Mon Apr 16 08:12:48 2018 +0530

    Use hiera interpolation for memcached_network

    After [1] iptables rules are not set for the memcached service,
    thus services relying on memcached were not functioning well.
    With [2] it's required to use hiera interpolation for service
    configs; this patch fixes it for memcached_network.

    [1] https://review.openstack.org/#/c/551292
    [2] https://review.openstack.org/#/c/526692

    Related-Bug: #1757556
    Closes-Bug: #1763009
    Change-Id: If9b274192ea4738f455a6106ff1a62eb4e7a5c91
    (cherry picked from commit d3d27d7ea84fe01320dee3e4fa2e2fad11640eb7)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to tripleo-quickstart-extras (master)

Reviewed: https://review.opendev.org/558412
Committed: https://git.openstack.org/cgit/openstack/tripleo-quickstart-extras/commit/?id=4c13745561b50191359e663fd8676f62bf5d86df
Submitter: Zuul
Branch: master

commit 4c13745561b50191359e663fd8676f62bf5d86df
Author: yatinkarel <email address hidden>
Date: Tue Jan 21 13:04:43 2020 +0530

    Add option to run performance tasks with security disabled

    It's noted that kernel fixes for security/Meltdown
    have affected performance, at least I/O. As we run deployments
    in different environments, it's good to add an option
    which can adjust the security settings to test performance.

    This patch adds a new variable "kernel_sec" which
    defaults to "true". This can be set to "false" in
    order to run performance tasks with security disabled.

    Also add a molecule scenario to test the security_disable tasks.

    Change-Id: If6f0bc02db6932aa8f367e38cc82df85063eaf87
    Related-Bug: #1757556
