Connection to neutron failed: Maximum attempts reached

Bug #1211915 reported by Bob Ball
Affects   Status         Importance   Assigned to       Milestone
neutron   Fix Released   High         Jakub Libosvar
Havana    Fix Released   High         ytycy

Bug Description

http://logs.openstack.org/64/41464/4/check/gate-tempest-devstack-vm-neutron/4288a6b/console.html

Seen testing https://review.openstack.org/#/c/41464/

2013-08-13 17:34:46.774 | Traceback (most recent call last):
2013-08-13 17:34:46.774 | File "tempest/scenario/test_network_basic_ops.py", line 176, in test_003_create_networks
2013-08-13 17:34:46.774 | router = self._get_router(self.tenant_id)
2013-08-13 17:34:46.775 | File "tempest/scenario/test_network_basic_ops.py", line 141, in _get_router
2013-08-13 17:34:46.775 | router.add_gateway(network_id)
2013-08-13 17:34:46.775 | File "tempest/api/network/common.py", line 78, in add_gateway
2013-08-13 17:34:46.776 | self.client.add_gateway_router(self.id, body=body)
2013-08-13 17:34:46.776 | File "/opt/stack/new/python-neutronclient/neutronclient/v2_0/client.py", line 108, in with_params
2013-08-13 17:34:46.776 | ret = self.function(instance, *args, **kwargs)
2013-08-13 17:34:46.776 | File "/opt/stack/new/python-neutronclient/neutronclient/v2_0/client.py", line 396, in add_gateway_router
2013-08-13 17:34:46.777 | body={'router': {'external_gateway_info': body}})
2013-08-13 17:34:46.777 | File "/opt/stack/new/python-neutronclient/neutronclient/v2_0/client.py", line 987, in put
2013-08-13 17:34:46.777 | headers=headers, params=params)
2013-08-13 17:34:46.778 | File "/opt/stack/new/python-neutronclient/neutronclient/v2_0/client.py", line 970, in retry_request
2013-08-13 17:34:46.778 | raise exceptions.ConnectionFailed(reason=_("Maximum attempts reached"))
2013-08-13 17:34:46.778 | ConnectionFailed: Connection to neutron failed: Maximum attempts reached

Tags: l3-ipam-dhcp
Revision history for this message
Attila Fazekas (afazekas) wrote :

http://logs.openstack.org/64/41464/4/check/gate-tempest-devstack-vm-neutron/4288a6b/logs/screen-q-svc.txt.gz

Search for '2013-08-13 17:16:43.835 28151 ERROR neutron.openstack.common.rpc.amqp [-] Exception during message handling'. It very likely happened at the time of the test case.

The exception was printed out later; the scenario test ran at the time noted above.

'OperationalError: (OperationalError) (1205, 'Lock wait timeout exceeded; try restarting transaction') 'UPDATE agents SET heartbeat_timestamp=%s WHERE agents.id = %s' (datetime.datetime(2013, 8, 13, 17, 15, 52, 725707), 'd6b24090-dfa9-44ee-ac1e-f6b5681e846b')'

Revision history for this message
Attila Fazekas (afazekas) wrote :

You can see several 50-51 second queries like the one below in http://logs.openstack.org/64/41464/4/check/gate-tempest-devstack-vm-neutron/4288a6b/logs/mysql/mysql-slow.txt.gz.

50 seconds is an enormous wait time.

# Time: 130813 17:16:43
# User@Host: root[root] @ localhost []
# Query_time: 51.076234 Lock_time: 51.076117 Rows_sent: 0 Rows_examined: 0
use ovs_neutron;
SET timestamp=1376414203;
UPDATE agents SET heartbeat_timestamp='2013-08-13 17:15:52' WHERE agents.id = 'd6b24090-dfa9-44ee-ac1e-f6b5681e846b';

The transactions close to the one above were rolled back.

Another slow query:

# User@Host: root[root] @ localhost []
# Query_time: 50.992815 Lock_time: 0.000033 Rows_sent: 0 Rows_examined: 0
use ovs_neutron;
SET timestamp=1376414866;
INSERT INTO routerl3agentbindings (id, router_id, l3_agent_id) VALUES ('bba9b6b1-0ae8-47bc-81f0-04e9df5571d5', 'c410a9e5-5394-4aa9-9fb7-fee346c9a79a', 'd6b24090-dfa9-44ee-ac1e-f6b5681e846b');
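
A quick way to see which transaction is blocking which, while such a lock wait is in progress, is to query the InnoDB lock tables. A minimal diagnostic sketch, assuming MySQL 5.5+ with InnoDB and the MySQLdb driver (connection parameters are placeholders, not the gate's actual credentials):

import MySQLdb

# Placeholder credentials; point this at the database neutron uses.
conn = MySQLdb.connect(user='root', passwd='secret', db='ovs_neutron')
cur = conn.cursor()
# information_schema exposes the current lock waits: which transaction
# is waiting, and which transaction holds the lock it needs.
cur.execute("""
    SELECT r.trx_mysql_thread_id AS waiting_thread,
           r.trx_query           AS waiting_query,
           b.trx_mysql_thread_id AS blocking_thread,
           b.trx_query           AS blocking_query
      FROM information_schema.innodb_lock_waits w
      JOIN information_schema.innodb_trx b ON b.trx_id = w.blocking_trx_id
      JOIN information_schema.innodb_trx r ON r.trx_id = w.requesting_trx_id
""")
for row in cur.fetchall():
    print(row)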

Revision history for this message
Nachi Ueno (nati-ueno) wrote :

There are 20 rechecks on the gate. I'll mark this as a critical bug.

Changed in neutron:
status: New → Confirmed
importance: Undecided → Critical
Changed in neutron:
importance: Critical → High
Revision history for this message
Nachi Ueno (nati-ueno) wrote :

Now 40 rechecks!

Changed in neutron:
importance: High → Critical
Revision history for this message
Joe Gordon (jogo) wrote :

logstash query (logstash.openstack.org):
@message:"ConnectionFailed: Connection to neutron failed: Maximum attempts reached" AND @fields.filename:"console.html"

Revision history for this message
Joe Gordon (jogo) wrote :

This should be targeted at Havana RC1.

Revision history for this message
Nachi Ueno (nati-ueno) wrote :

Thanks Joe, I've targeted this one.

Changed in neutron:
milestone: none → havana-rc1
Changed in neutron:
assignee: nobody → Maru Newby (maru)
Revision history for this message
Joe Gordon (jogo) wrote :

This is up to 57 rechecks

Revision history for this message
Davanum Srinivas (DIMS) (dims-v) wrote :

@message:"Connection to neutron failed" has 425 hits. First recorded instance has a timestamp 2013-09-13T01:47:45.612Z I have a strong feeling this bug is related to https://bugs.launchpad.net/nova/+bug/1229475 (First hit was on 2013-09-12T09:19:53.000)

Revision history for this message
Joe Gordon (jogo) wrote :

Dims, the first hit is misleading, as we only have two weeks of logs in ElasticSearch.

Revision history for this message
Thierry Carrez (ttx) wrote :

RC1 was signed off; moving to havana-rc-potential.

Changed in neutron:
milestone: havana-rc1 → none
tags: added: havana-rc-potential
Changed in neutron:
assignee: Maru Newby (maru) → Jakub Libosvar (libosvar)
status: Confirmed → In Progress
Revision history for this message
Davanum Srinivas (DIMS) (dims-v) wrote :

As mentioned in comment #9, this was fixed by the patch to the neutron client for bogus fds.

https://bugs.launchpad.net/nova/+bug/1229475/comments/14

Changed in neutron:
status: In Progress → Fix Committed
Revision history for this message
Jakub Libosvar (libosvar) wrote :

The root cause was a deadlock on the server side. The proposed fix is https://review.openstack.org/#/c/49596/

Thierry Carrez (ttx)
tags: added: havana-backport-potential
removed: havana-rc-potential
tags: added: l3-ipam-dhcp
Changed in neutron:
milestone: none → icehouse-1
importance: Critical → High
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.openstack.org/49596
Committed: http://github.com/openstack/neutron/commit/5becbbec6a686667925f399904cebd97b583c35a
Submitter: Jenkins
Branch: master

commit 5becbbec6a686667925f399904cebd97b583c35a
Author: Jakub Libosvar <email address hidden>
Date: Thu Oct 3 19:45:09 2013 +0200

    Removing rpc communication from db transaction

    In a transaction context of ExtraRoute_db_mixin.update_router() was
    called super method that uses subtransaction and communication with l3
    agent. In case of agent's heartbeat happens while there is running
    transaction, update heartbeat in agents table gets stuck in a deadlock.

    Closes-Bug: #1211915
    Change-Id: I96e6a9d7172d5a0e3e720a81fcd10f04c40aef07
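
In other words, the fix moves the potentially slow RPC notification out of the scope in which row locks are held. A minimal sketch of the before/after shape, with hypothetical names (_update_router_db and l3_rpc are illustrative stand-ins, not the actual neutron code):

# BEFORE: the RPC notification runs inside the transaction, so any row
# locks taken by the router update are held for the whole RPC round-trip.
# A concurrent agent-heartbeat UPDATE on the agents table then queues up
# behind those locks and can hit 'Lock wait timeout exceeded'.
def update_router_deadlock_prone(session, l3_rpc, router_id, body):
    with session.begin(subtransactions=True):
        router = _update_router_db(session, router_id, body)  # hypothetical helper
        l3_rpc.routers_updated([router_id])                   # blocking RPC inside txn
    return router

# AFTER: commit first, notify afterwards. The locks are released before
# the RPC call, so the heartbeat UPDATE never waits on it.
def update_router_fixed(session, l3_rpc, router_id, body):
    with session.begin(subtransactions=True):
        router = _update_router_db(session, router_id, body)
    l3_rpc.routers_updated([router_id])
    return router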

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/havana)

Fix proposed to branch: stable/havana
Review: https://review.openstack.org/52483

Revision history for this message
leonce (leon-li2) wrote :

I met this problem also:
root@xxxxxx:~# neutron agent-list
Connection to neutron failed: Maximum attempts reached
Actually, any neutron command fails the same way.

The neutron-server service is running; however, it is not listening on port 9696.
/var/log/neutron/server.log is empty.
Any idea?

root@xxxxxx:~# service neutron-server status
neutron-server start/running, process 6926
root@xxxxxx:~# netstat -l |grep 9696
root@xxxxxx:~# telnet 127.0.0.1 9696
Trying 127.0.0.1...
telnet: Unable to connect to remote host: Connection refused
root@xxxxxx:~# tail /var/log/neutron/
dhcp-agent.log l3-agent.log metadata-agent.log
dhcp-agent.log.1 l3-agent.log.1 server.log
dhcp-agent.log.2.gz l3-agent.log.2.gz server.log.1
dhcp-agent.log.3.gz l3-agent.log.3.gz
dhcp-agent.log.4.gz l3-agent.log.4.gz
root@xxxxxx:~# tail /var/log/neutron/server.log
root@xxxxxx:~#
root@xxxxxx:~# service rabbitmq-server status
Status of node rabbit@i2052247 ...
[{pid,1779},
{running_applications,[{rabbit,"RabbitMQ","2.7.1"},
                        {os_mon,"CPO CXC 138 46","2.2.7"},
                        {sasl,"SASL CXC 138 11","2.1.10"},
                        {mnesia,"MNESIA CXC 138 12","4.5"},
                        {stdlib,"ERTS CXC 138 10","1.17.5"},
                        {kernel,"ERTS CXC 138 10","2.14.5"}]},
{os,{unix,linux}},
{erlang_version,"Erlang R14B04 (erts-5.8.5) [source] [64-bit] [rq:1] [async-threads:30] [kernel-poll:true]\n"},
{memory,[{total,26175728},
          {processes,11653152},
          {processes_used,11644440},
          {system,14522576},
          {atom,1124433},
          {atom_used,1120215},
          {binary,126024},
          {code,11134393},
          {ets,819856}]},
{vm_memory_high_watermark,0.3999999994248726},
{vm_memory_limit,417298841}]
...done.
root@xxxxxx:~# neutron -v agent-list
DEBUG: neutronclient.neutron.v2_0.agent.ListAgent get_data(Namespace(columns=[], fields=[], formatter='table', quote_mode='nonnumeric', request_format='json', show_details=False))
DEBUG: neutronclient.client
REQ: curl -i http://10.246.52.247:5000/v2.0/tokens -X POST -H "Content-Type: application/json" -H "Accept: application/json" -H "User-Agent: python-neutronclient" -d '{"auth": {"tenantName": "admin", "passwordCredentials": {"username": "admin", "password": "admin_pass"}}}'

DEBUG: neutronclient.client RESP:{'date': 'Fri, 18 Oct 2013 09:03:11 GMT', 'vary': 'X-Auth-Token', 'content-length': '6421', 'status': '200', 'content-type': 'application/json'} {"access": {"token": {"issued_at": "2013-10-18T09:03:11.702810", "expires": "2013-10-19T09:03:11Z", "id": "MIILgQYJKoZIhvcNAQcCoIILcjCCC24CAQExCTAHBgUrDgMCGjCCCdcGCSqGSIb3DQEHAaCCCcgEggnEeyJhY2Nlc3MiOiB7InRva2VuIjogeyJpc3N1ZWRfYXQiOiAiMjAxMy0xMC0xOFQwOTowMzoxMS43MDI4MTAiLCAiZXhwaXJlcyI6ICIyMDEzLTEwLTE5VDA5OjAzOjExWiIsICJpZCI6ICJwbGFjZWhvbGRlciIsICJ0ZW5hbnQiOiB7ImRlc2NyaXB0aW9uIjogbnVsbCwgImVuYWJsZWQiOiB0cnVlLCAiaWQiOiAiOWUwMTYyMzVjZTYyNDRhYWExOGJhOGQxN2MzMDM1MmUiLCAibmFtZSI6ICJhZG1pbiJ9fSwgInNlcnZpY2VDYXRhbG9nIjogW3siZW5kcG9pbnRzIjogW3siYWRtaW5VUkwiOiAiaHR0cDovLzE5Mi4xNjguMS4yNDc6ODc3NC92Mi85ZTAxNjIzNWNlNjI0NGFhYTE4YmE4ZDE3YzMwMzUyZSIsICJyZWdpb24iOiAiUmVnaW9uT25lI...

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/havana)

Reviewed: https://review.openstack.org/52483
Committed: http://github.com/openstack/neutron/commit/02cd640e4a7ef9b199fc36f38dd546763e4b1b25
Submitter: Jenkins
Branch: stable/havana

commit 02cd640e4a7ef9b199fc36f38dd546763e4b1b25
Author: Jakub Libosvar <email address hidden>
Date: Thu Oct 3 19:45:09 2013 +0200

    Removing rpc communication from db transaction

    In a transaction context of ExtraRoute_db_mixin.update_router() was
    called super method that uses subtransaction and communication with l3
    agent. In case of agent's heartbeat happens while there is running
    transaction, update heartbeat in agents table gets stuck in a deadlock.

    Closes-Bug: #1211915
    Change-Id: I96e6a9d7172d5a0e3e720a81fcd10f04c40aef07
    (cherry picked from commit 5becbbec6a686667925f399904cebd97b583c35a)

tags: added: in-stable-havana
Revision history for this message
Dazhao Yu (dzyu) wrote :

I still encountered this issue after using the merged code.
The error happened after I restarted the neutron-server service several times.

You can try restarting neutron-server again and again (about 15 times), then run "neutron net-list"; it will throw this error. The neutron service status then shows "neutron dead but pid file exists".

Changed in neutron:
status: Fix Committed → Incomplete
Revision history for this message
Dazhao Yu (dzyu) wrote :

Sorry, it was caused by my DB environment failing.

Changed in neutron:
status: Incomplete → Fix Committed
Alan Pevec (apevec)
tags: removed: havana-backport-potential in-stable-havana
Revision history for this message
JoeJiang (shunqiaoj) wrote :

I met this problem when I ran "neutron agent-list": Connection to neutron failed: Maximum attempts reached!

In ovs_neutron_plugin.ini, with the section header commented out:
#[database]
connection = ************
it worked.

I don't know why...

With:
[database]
connection = ********
it still reported the error.

Revision history for this message
Hua Zhang (zhhuabj) wrote :

I don't think we have found the root cause of this issue, since I still hit it even after confirming that I have applied the patch https://review.openstack.org/#/c/49210/
# grep ".clear" -B 5 /usr/lib/python2.6/site-packages/neutronclient/client.py
            raise exceptions.ConnectionFailed(reason=e)
        finally:
            # Temporary Fix for gate failures. RPC calls and HTTP requests
            # seem to be stepping on each other resulting in bogus fd's being
            # picked up for making http requests
            self.connections.clear()

The following error log, which I collected by applying my patch below, shows that we are still hitting the "Second simultaneous read on fileno" error described at http://blog.eventlet.net/2010/03/18/safety/

diff --git a/neutronclient/v2_0/client.py b/neutronclient/v2_0/client.py
index 733f339..4d95a84 100644
--- a/neutronclient/v2_0/client.py
+++ b/neutronclient/v2_0/client.py
@@ -1219,10 +1219,14 @@ class Client(object):
             try:
                 return self.do_request(method, action, body=body,
                                        headers=headers, params=params)
-            except exceptions.ConnectionFailed:
-                # Exception has already been logged by do_request()
+            except exceptions.ConnectionFailed as e:
+                # Actually exception has not been logged by do_request(),
+                # we do it here temporarily to locate the problem
+                _logger.error("************ zhhuabj begin **************")
+                _logger.error(e)
+                _logger.error("************ zhhuabj end **************")
                 if i < self.retries:
-                    _logger.debug(_('Retrying connection to Neutron service'))
+                    _logger.error(_('Retrying connection to Neutron service'))
                     time.sleep(self.retry_interval)
 
         raise exceptions.ConnectionFailed(reason=_("Maximum attempts reached"))

2013-11-17 03:42:01.436 11335 ERROR neutronclient.v2_0.client [-] ************ zhhuabj begin **************
2013-11-17 03:42:01.436 11335 ERROR neutronclient.v2_0.client [-] Connection to neutron failed: Second simultaneous read on fileno 42 detected. Unless you really know what you're doing, make sure that only one greenthread can read any particular socket. Consider using a pools.Pool. If you do know what you're doing and want to disable this error, call eventlet.debug.hub_prevent_multiple_readers(False)
2013-11-17 03:42:01.436 11335 ERROR neutronclient.v2_0.client [-] ************ zhhuabj end **************
2013-11-17 03:42:01.437 11335 ERROR nova.api.openstack [req-fbe8ae49-9dc0-4c62-aea0-164e8a11160c 328702c5a15a4d14a49959f6ed5ec4b0 68ad27cec264496fbc0ddd2bde659216] NV-A68A08C Caught error: Connection to neutron failed: Maximum attempts reached
2013-11-17 03:42:01.437 11335 TRACE nova.api.openstack Traceback (most recent call last):
2013-11-17 03:42:01.437 11335 TRACE nova.api.openstack File "/usr/lib/python2.6/site-packages/nova/api/openstack/__init__.py", line 119, in __call__
2013-11-17 03:42:01.437 11335 TRACE nova.api.openstack return req.get_response(self.application)
2013-11-17 03:42:01...


Revision history for this message
Hua Zhang (zhhuabj) wrote :

1. I confirmed I have applied the patch https://review.openstack.org/#/c/49210/
# grep ".clear" -B 5 /usr/lib/python2.6/site-packages/neutronclient/client.py
            raise exceptions.ConnectionFailed(reason=e)
        finally:
            # Temporary Fix for gate failures. RPC calls and HTTP requests
            # seem to be stepping on each other resulting in bogus fd's being
            # picked up for making http requests
            self.connections.clear()

2. I confirmed I have also applied the patch https://review.openstack.org/#/c/48552/
# grep "eventlet.import_patched('httplib2')" -B 3 /usr/lib/python2.6/site-packages/neutronclient/client.py
try:
    import eventlet
    if eventlet.patcher.is_monkey_patched('socket'):
        httplib2 = eventlet.import_patched('httplib2')

3. I confirmed that the code in $nova/nova/network/neutronv2/__init__.py does not appear to share a socket among different greenthreads:
# grep "_get_client" -A 100 /usr/lib/python2.6/site-packages/nova/network/neutronv2/__init__.py
def _get_client(token=None):
    params = {
        'endpoint_url': CONF.neutron_url,
        'timeout': CONF.neutron_url_timeout,
        'insecure': CONF.neutron_api_insecure,
        'ca_cert': CONF.neutron_ca_certificates_file,
    }

    if token:
        params['token'] = token
        params['auth_strategy'] = None
    else:
        params['username'] = CONF.neutron_admin_username
        params['tenant_name'] = CONF.neutron_admin_tenant_name
        params['password'] = CONF.neutron_admin_password
        params['auth_url'] = CONF.neutron_admin_auth_url
        params['auth_strategy'] = CONF.neutron_auth_strategy
    return clientv20.Client(**params)

def get_client(context, admin=False):
    if admin or context.is_admin:
        if not hasattr(local.strong_store, 'neutron_client'):
            local.strong_store.neutron_client = _get_client(token=None)
        return local.strong_store.neutron_client
    if context.auth_token:
        token = context.auth_token
        return _get_client(token=token)

    raise exceptions.Unauthorized()

So I don't know why the "Second simultaneous read on fileno" error described at http://blog.eventlet.net/2010/03/18/safety/ still exists. Any ideas greatly appreciated :)
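
For what it's worth, the eventlet guard itself is easy to trip in isolation. A minimal repro sketch (assuming eventlet with its default multiple-readers protection; this is illustrative, not neutron code) showing that two greenthreads reading the same fd raise exactly this RuntimeError:

import eventlet
eventlet.monkey_patch()

import socket

a, b = socket.socketpair()   # a pair of connected (green) sockets

def reader():
    return b.recv(1)         # blocks until a byte arrives

t1 = eventlet.spawn(reader)
eventlet.sleep(0)            # let t1 block inside recv() first
t2 = eventlet.spawn(reader)  # second reader on the same fd

try:
    t2.wait()
except RuntimeError as e:
    # "Second simultaneous read on fileno N detected. ..."
    print(e)

a.send(b'x')                 # unblock the first reader
t1.wait()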

Thierry Carrez (ttx)
Changed in neutron:
status: Fix Committed → Fix Released
Thierry Carrez (ttx)
Changed in neutron:
milestone: icehouse-1 → 2014.1