default nova+neutron setup cannot handle spawning 20 images concurrently

Bug #1282842 reported by Robert Collins
38
This bug affects 6 people
Affects Status Importance Assigned to Milestone
OpenStack Compute (nova)
Fix Released
Undecided
Unassigned
tripleo
Fix Released
High
Unassigned

Bug Description

This breaks any @scale use of a cloud.

Symptoms include 500 errors from 'nova list' (which causes a heat stack failure) and errors like 'unknown auth strategy' from neutronclient when its being called from the nova compute.manager.

Sorry for the many-project-tasks here - its not clear where the bug lies, nor whether its bad defaults, or code handling errors, or perf tuning etc.

Revision history for this message
Robert Collins (lifeless) wrote :

e.g.
| fault | {u'message': u'Connection to neutron failed: Maximum attempts reached', u'code': 500, u'created': u'2014-02-21T01:13:58Z'} |

Changed in tripleo:
status: New → Triaged
importance: Undecided → High
description: updated
Revision history for this message
Robert Collins (lifeless) wrote :
Download full text (6.4 KiB)

neutron log has many:
2014-02-21 01:21:40.586 2130 WARNING keystoneclient.middleware.auth_token [-] Authorization failed for token 1768d9a3a640c2a5528b6e1b774685ee
2014-02-21 01:21:40.592 2130 WARNING keystoneclient.middleware.auth_token [-] Authorization failed for token 1768d9a3a640c2a5528b6e1b774685ee
2014-02-21 01:26:23.563 2130 WARNING keystoneclient.middleware.auth_token [-] Authorization failed for token 0e462f4084a0f6f3d7491eb87fbec1e7
2014-02-21 01:30:19.127 2130 WARNING keystoneclient.middleware.auth_token [-] Authorization failed for token fde1a5a7736d4fde58f85f48c39edbbf
2014-02-21 01:31:19.572 2130 WARNING keystoneclient.middleware.auth_token [-] Authorization failed for token d2fdad9855e61d3e92694d310928b2ab

and also many

2014-02-21 01:18:02.888 2130 ERROR neutron.openstack.common.rpc.amqp [req-2941bce3-4a89-4d4c-8c30-2aff95b60fd7 None] Exception during message handling
2014-02-21 01:18:02.888 2130 TRACE neutron.openstack.common.rpc.amqp Traceback (most recent call last):
2014-02-21 01:18:02.888 2130 TRACE neutron.openstack.common.rpc.amqp File "/opt/stack/venvs/neutron/local/lib/python2.7/site-packages/neutron/openstack/common/rpc/amqp.py", line 462, in _process_data
2014-02-21 01:18:02.888 2130 TRACE neutron.openstack.common.rpc.amqp **args)
2014-02-21 01:18:02.888 2130 TRACE neutron.openstack.common.rpc.amqp File "/opt/stack/venvs/neutron/local/lib/python2.7/site-packages/neutron/common/rpc.py", line 45, in dispatch
2014-02-21 01:18:02.888 2130 TRACE neutron.openstack.common.rpc.amqp neutron_ctxt, version, method, namespace, **kwargs)
2014-02-21 01:18:02.888 2130 TRACE neutron.openstack.common.rpc.amqp File "/opt/stack/venvs/neutron/local/lib/python2.7/site-packages/neutron/openstack/common/rpc/dispatcher.py", line 172, in dispatch
2014-02-21 01:18:02.888 2130 TRACE neutron.openstack.common.rpc.amqp result = getattr(proxyobj, method)(ctxt, **kwargs)
2014-02-21 01:18:02.888 2130 TRACE neutron.openstack.common.rpc.amqp File "/opt/stack/venvs/neutron/local/lib/python2.7/site-packages/neutron/db/agents_db.py", line 217, in report_state
2014-02-21 01:18:02.888 2130 TRACE neutron.openstack.common.rpc.amqp self.plugin.create_or_update_agent(context, agent_state)
2014-02-21 01:18:02.888 2130 TRACE neutron.openstack.common.rpc.amqp File "/opt/stack/venvs/neutron/local/lib/python2.7/site-packages/neutron/db/agents_db.py", line 179, in create_or_update_agent
2014-02-21 01:18:02.888 2130 TRACE neutron.openstack.common.rpc.amqp return self._create_or_update_agent(context, agent)
2014-02-21 01:18:02.888 2130 TRACE neutron.openstack.common.rpc.amqp File "/opt/stack/venvs/neutron/local/lib/python2.7/site-packages/neutron/db/agents_db.py", line 158, in _create_or_update_agent
2014-02-21 01:18:02.888 2130 TRACE neutron.openstack.common.rpc.amqp context, agent['agent_type'], agent['host'])
2014-02-21 01:18:02.888 2130 TRACE neutron.openstack.common.rpc.amqp File "/opt/stack/venvs/neutron/local/lib/python2.7/site-packages/neutron/db/agents_db.py", line 135, in _get_agent_by_type_and_host
2014-02-21 01:18:02.888 2130 TRACE neutron.openstack.common.rpc.amqp Agent.host == host).one()
2014-02-21 01:18:02....

Read more...

description: updated
Revision history for this message
Robert Collins (lifeless) wrote :

and one maybe more of
2014-02-21 01:16:24.020 2130 WARNING neutron.plugins.ml2.drivers.mech_agent [req-b4a6dc64-a7b5-4c63-b123-827944e8f591 None] Attempting to bind with dead agent: {'binary': u'neutron-openvswitch-agent', 'description': None, 'admin_state_up': True, 'heartbeat_timestamp': datetime.datetime(2014, 2, 21, 1, 15, 29), 'alive': False, 'topic': u'N/A', 'host': u'ci-overcloud-notcompute0-gxezgcvv4v2q', 'agent_type': u'Open vSwitch agent', 'created_at': datetime.datetime(2014, 2, 7, 12, 15, 31), 'started_at': datetime.datetime(2014, 2, 7, 12, 15, 31), 'id': u'42214336-e444-413e-a400-dc58fca41ca3', 'configurations': {u'tunnel_types': [u'gre'], u'tunneling_ip': u'10.10.16.135', u'bridge_mappings': {u'tripleo-bm-test': u'br-untagged'}, u'l2_population': False, u'devices': 3}}

Revision history for this message
Robert Collins (lifeless) wrote :

keystone log contains nothing but
2014-02-20 06:58:23.188 1116 WARNING keystone.openstack.common.versionutils [-] Deprecated: v2 API is deprecated as of Icehouse in favor of v3 API and may be removed in K.

Revision history for this message
Salvatore Orlando (salvatore-orlando) wrote :

The neutron error message reported at #2 actually also appears in gate logs:

http://logstash.openstack.org/#eyJzZWFyY2giOiJtZXNzYWdlOlwiUXVldWVQb29sIGxpbWl0IG9mIHNpemUgMTAgb3ZlcmZsb3cgMjAgcmVhY2hlZFwiIEFORCBOT1QgbWVzc2FnZTpcIlRyYWNlYmFjayAobW9zdCByZWNlbnRcIiIsImZpZWxkcyI6W10sIm9mZnNldCI6MCwidGltZWZyYW1lIjoiNjA0ODAwIiwiZ3JhcGhtb2RlIjoiY291bnQiLCJ0aW1lIjp7ImZyb20iOiIyMDE0LTAyLTIxVDA5OjI2OjAxKzAwOjAwIiwidG8iOiIyMDE0LTAyLTIxVDExOjI2OjAxKzAwOjAwIiwidXNlcl9pbnRlcnZhbCI6IjAifSwic3RhbXAiOjEzOTMxNjk1OTUzNzYsIm1vZGUiOiIiLCJhbmFseXplX2ZpZWxkIjoiIn0=

When this error appears, the job often (27% of cases) succeeds anyway. Also, the fact that it does not show up on the large ops job indicates the agents are exhausting the connection pool.

I suggest filing a different neutron bug for this specific issue, and add an E-R query as well, but this is not important.
The 500 errors returned from Nova might be because of this error affecting an API request, but also because of bug 1283522, which seems to actually be the most recurrent cause of error.

Revision history for this message
Salvatore Orlando (salvatore-orlando) wrote :

Sorry, I was not precise in my previous comment.
I wanted to write "it does not affect as much the largeops job".
The fact that the agents are interacting with the server makes it occur more frequently, but the problem does not lie with the agents interactions alone.

Tracy Jones (tjones-i)
tags: added: network
Revision history for this message
Dolph Mathews (dolph) wrote :

Removed keystone as it doesn't look like any of the above is unexpected behavior there.

no longer affects: keystone
Revision history for this message
Hua Zhang (zhhuabj) wrote :

I also find the same error in the nova's log with auth_strategy=keystone and neutron_auth_strategy=keystone being configured correctly.

$ sudo grep -r "ERROR" /var/log/nova/
/var/log/nova/conductor.log:2014-03-14 16:35:27.427 22075 ERROR oslo.messaging._executors.base [-] Exception during message handling
/var/log/nova/conductor.log:2014-03-14 16:35:27.443 22075 ERROR oslo.messaging._drivers.common [-] Returning exception Unknown auth strategy to caller
/var/log/nova/conductor.log:2014-03-14 16:35:27.444 22075 ERROR oslo.messaging._drivers.common [-] ['Traceback (most recent call last):\n', ' File "/usr/lib/python2.6/site-packages/oslo/messaging/_executors/base.py", line 36, in _dispatch\n incoming.reply(self.callback(incoming.ctxt, incoming.message))\n', ' File "/usr/lib/python2.6/site-packages/oslo/messaging/rpc/dispatcher.py", line 122, in __call__\n return self._dispatch(endpoint, method, ctxt, args)\n', ' File "/usr/lib/python2.6/site-packages/oslo/messaging/rpc/dispatcher.py", line 92, in _dispatch\n result = getattr(endpoint, method)(ctxt, **new_args)\n', ' File "/usr/lib/python2.6/site-packages/nova/conductor/manager.py", line 516, in network_migrate_instance_finish\n self.network_api.migrate_instance_finish(context, instance, migration)\n', ' File "/usr/lib/python2.6/site-packages/nova/network/neutronv2/api.py", line 973, in migrate_instance_finish\n if not self._has_port_binding_extension(context, refresh_cache=True):\n', ' File "/usr/lib/python2.6/site-packages/nova/network/neutronv2/api.py", line 378, in _has_port_binding_extension\n self._refresh_neutron_extensions_cache(context)\n', ' File "/usr/lib/python2.6/site-packages/nova/network/neutronv2/api.py", line 370, in _refresh_neutron_extensions_cache\n extensions_list = neutron.list_extensions()[\'extensions\']\n', ' File "/usr/lib/python2.6/site-packages/neutronclient/v2_0/client.py", line 111, in with_params\n ret = self.function(instance, *args, **kwargs)\n', ' File "/usr/lib/python2.6/site-packages/neutronclient/v2_0/client.py", line 294, in list_extensions\n return self.get(self.extensions_path, params=_params)\n', ' File "/usr/lib/python2.6/site-packages/neutronclient/v2_0/client.py", line 1236, in get\n headers=headers, params=params)\n', ' File "/usr/lib/python2.6/site-packages/neutronclient/v2_0/client.py", line 1221, in retry_request\n headers=headers, params=params)\n', ' File "/usr/lib/python2.6/site-packages/neutronclient/v2_0/client.py", line 1156, in do_request\n resp, replybody = self.httpclient.do_request(action, method, body=body)\n', ' File "/usr/lib/python2.6/site-packages/neutronclient/client.py", line 195, in do_request\n self.authenticate()\n', ' File "/usr/lib/python2.6/site-packages/neutronclient/client.py", line 219, in authenticate\n raise exceptions.Unauthorized(message=_(\'Unknown auth strategy\'))\n', 'Unauthorized: Unknown auth strategy\n']
/var/log/nova/conductor.log:2014-03-15 01:58:01.870 22073 ERROR oslo.messaging._executors.base [-] Exception during message handling

Revision history for this message
Hua Zhang (zhhuabj) wrote :

I think maybe we need to set the token expiration time be greater than the time the test run after having read the related code, looks all the logic are right. any thoughts ?

1, The non-admin user uses usename and password to invoke REST API to generate a token
1, when non-admin user request a REST API via WSGI, nova.api.auth:NovaKeystoneContext will get this token by name 'X_AUTH_TOKEN' from http header
3, The get_client method of $nova/network/neutronv2/__init__.py will use this token to Instantiate a neutron HTTPClient object.
   def _get_client(token=None):
    ... other code ...
   if token:
        params['token'] = token
        params['auth_strategy'] = None
    else:
        params['username'] = CONF.neutron_admin_username
        if CONF.neutron_admin_tenant_id:
            params['tenant_id'] = CONF.neutron_admin_tenant_id
        else:
            params['tenant_name'] = CONF.neutron_admin_tenant_name
            LOG.warning(_("Using neutron_admin_tenant_name for authentication "
                          "is deprecated and will be removed in the next "
                          "release. Use neutron_admin_tenant_id instead."))
        params['password'] = CONF.neutron_admin_password
        params['auth_url'] = CONF.neutron_admin_auth_url
        params['auth_strategy'] = CONF.neutron_auth_strategy
    return clientv20.Client(**params)

    def get_client(context, admin=False):
      ... other code ...
      if context.auth_token:
          token = context.auth_token
          return _get_client(token=token)
      ... other code ...
4, do_request method of neutronclient will invoke authenticate method if the token is unauthorized we passed.
   def do_request(self, url, method, **kwargs):
        self.authenticate_and_fetch_endpoint_url()
        try:
            kwargs.setdefault('headers', {})
            kwargs['headers']['X-Auth-Token'] = self.auth_token
            resp, body = self._cs_request(self.endpoint_url + url, method,
                                          **kwargs)
            return resp, body
        except exceptions.Unauthorized:
            self.authenticate()
5, authenticate method throws the Unauthorized exception normally because this is a unauthorized token.
def authenticate(self):
       if self.auth_strategy != 'keystone':
            raise exceptions.Unauthorized(message=_('Unknown auth strategy'))

Revision history for this message
Matt Riedemann (mriedem) wrote :

(1:33:54 PM) arosen: mriedem: for the overflow error i think you want to raise this: https://github.com/openstack/neutron/blob/stable/havana/etc/neutron.conf#L349

That's in reference to comment 2 and:

TimeoutError: QueuePool limit of size 10 overflow 20 reached, connection timed out, timeout 10

Revision history for this message
Matt Riedemann (mriedem) wrote :

Bug 1297309 might be related here.

Revision history for this message
Brant Knudson (blk-u) wrote :

So what I think can happen here is that the token expires. There's not much that nova can do about it. The auth_token middleware accepts the token because it's not expired, then it expires and nova tries to use it to do something (a neutron operation for example).

It might be a good idea to have nova reject tokens that are about to expire, rather than continue using them.

Revision history for this message
Hua Zhang (zhhuabj) wrote :

this patch should have fixed this issue since auth_strategy will no longer be set to None when using admin user, thus neutron-client will re-try to re-authenticate from keystone when admin token is expired.

https://review.openstack.org/gitweb?p=openstack%2Fnova.git;a=commitdiff;h=b8371d5241afb7bd02be1d3e610132525cf1633b

Changed in neutron:
status: New → Incomplete
Revision history for this message
Timur Nurlygayanov (tnurlygayanov) wrote :

need to set 'api_workers = 10' in /etc/neutron/neutron.conf, it works on my environment (when I found the same problem in https://bugs.launchpad.net/neutron/+bug/1321626).

Brent Eagles (beagles)
tags: added: neutron
Revision history for this message
Sean Dague (sdague) wrote :

Are there full logs somewhere to see what's going on in this scenario. Like the full contents of doing this in devstack. Symptomatic analysis won't get to the bottom of this one.

Changed in nova:
status: New → Incomplete
Revision history for this message
Joe Gordon (jogo) wrote :

Unable to reproduce this (while using the fake virt driver) looks like https://review.openstack.org/gitweb?p=openstack%2Fnova.git;a=commitdiff;h=b8371d5241afb7bd02be1d3e610132525cf1633b may have fixed this.

Joe Gordon (jogo)
Changed in nova:
status: Incomplete → Fix Committed
Thierry Carrez (ttx)
Changed in nova:
milestone: none → kilo-1
status: Fix Committed → Fix Released
Thierry Carrez (ttx)
Changed in nova:
milestone: kilo-1 → 2015.1.0
Ben Nemec (bnemec)
Changed in tripleo:
status: Triaged → Fix Released
Revision history for this message
Armando Migliaccio (armando-migliaccio) wrote :

This bug is > 180 days without activity. We are unsetting assignee and milestone and setting status to Incomplete in order to allow its expiry in 60 days.

If the bug is still valid, then update the bug status.

no longer affects: neutron
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Duplicates of this bug

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.