nova+neutron scheduling error: Connection to neutron failed: Maximum attempts reached

Bug #1251784 reported by Joe Gordon
This bug affects 4 people
Affects                    Status         Importance   Assigned to       Milestone
OpenStack Compute (nova)   Fix Released   Critical     Robert Collins
tripleo                    Fix Released   Critical     Unassigned

Bug Description

VMs are failing to schedule with the following error:

2013-11-15 20:50:21.405 ERROR nova.scheduler.filter_scheduler [req-d2c26348-53e6-448a-8975-4f22f4e89782 demo demo] [instance: c8069c13-593f-48fb-aae9-198961097eb2] Error from last host: devstack-precise-hpcloud-az3-662002 (node devstack-precise-hpcloud-az3-662002): [u'Traceback (most recent call last):\n', u' File "/opt/stack/new/nova/nova/compute/manager.py", line 1030, in _build_instance\n set_access_ip=set_access_ip)\n', u' File "/opt/stack/new/nova/nova/compute/manager.py", line 1439, in _spawn\n LOG.exception(_(\'Instance failed to spawn\'), instance=instance)\n', u' File "/opt/stack/new/nova/nova/compute/manager.py", line 1436, in _spawn\n block_device_info)\n', u' File "/opt/stack/new/nova/nova/virt/libvirt/driver.py", line 2100, in spawn\n admin_pass=admin_password)\n', u' File "/opt/stack/new/nova/nova/virt/libvirt/driver.py", line 2451, in _create_image\n content=files, extra_md=extra_md, network_info=network_info)\n', u' File "/opt/stack/new/nova/nova/api/metadata/base.py", line 165, in __init__\n ec2utils.get_ip_info_for_instance_from_nw_info(network_info)\n', u' File "/opt/stack/new/nova/nova/api/ec2/ec2utils.py", line 149, in get_ip_info_for_instance_from_nw_info\n fixed_ips = nw_info.fixed_ips()\n', u' File "/opt/stack/new/nova/nova/network/model.py", line 368, in _sync_wrapper\n self.wait()\n', u' File "/opt/stack/new/nova/nova/network/model.py", line 400, in wait\n self[:] = self._gt.wait()\n', u' File "/usr/local/lib/python2.7/dist-packages/eventlet/greenthread.py", line 168, in wait\n return self._exit_event.wait()\n', u' File "/usr/local/lib/python2.7/dist-packages/eventlet/event.py", line 120, in wait\n current.throw(*self._exc)\n', u' File "/usr/local/lib/python2.7/dist-packages/eventlet/greenthread.py", line 194, in main\n result = function(*args, **kwargs)\n', u' File "/opt/stack/new/nova/nova/compute/manager.py", line 1220, in _allocate_network_async\n dhcp_options=dhcp_options)\n', u' File "/opt/stack/new/nova/nova/network/neutronv2/api.py", line 359, in allocate_for_instance\n nw_info = self._get_instance_nw_info(context, instance, networks=nets)\n', u' File "/opt/stack/new/nova/nova/network/api.py", line 49, in wrapper\n res = f(self, context, *args, **kwargs)\n', u' File "/opt/stack/new/nova/nova/network/neutronv2/api.py", line 458, in _get_instance_nw_info\n nw_info = self._build_network_info_model(context, instance, networks)\n', u' File "/opt/stack/new/nova/nova/network/neutronv2/api.py", line 1022, in _build_network_info_model\n subnets = self._nw_info_get_subnets(context, port, network_IPs)\n', u' File "/opt/stack/new/nova/nova/network/neutronv2/api.py", line 924, in _nw_info_get_subnets\n subnets = self._get_subnets_from_port(context, port)\n', u' File "/opt/stack/new/nova/nova/network/neutronv2/api.py", line 1066, in _get_subnets_from_port\n data = neutronv2.get_client(context).list_ports(**search_opts)\n', u' File "/opt/stack/new/python-neutronclient/neutronclient/v2_0/client.py", line 111, in with_params\n ret = self.function(instance, *args, **kwargs)\n', u' File "/opt/stack/new/python-neutronclient/neutronclient/v2_0/client.py", line 306, in list_ports\n **_params)\n', u' File "/opt/stack/new/python-neutronclient/neutronclient/v2_0/client.py", line 1250, in list\n for r in self._pagination(collection, path, **params):\n', u' File "/opt/stack/new/python-neutronclient/neutronclient/v2_0/client.py", line 1263, in _pagination\n res = self.get(path, params=params)\n', u' File "/opt/stack/new/python-neutronclient/neutronclient/v2_0/client.py", line 1236, 
in get\n headers=headers, params=params)\n', u' File "/opt/stack/new/python-neutronclient/neutronclient/v2_0/client.py", line 1228, in retry_request\n raise exceptions.ConnectionFailed(reason=_("Maximum attempts reached"))\n', u'ConnectionFailed: Connection to neutron failed: Maximum attempts reached\n']

Connection to neutron failed: Maximum attempts reached

http://logs.openstack.org/96/56496/1/gate/gate-tempest-devstack-vm-neutron-isolated/8df6c6c/logs/screen-n-sch.txt.gz#_2013-11-15_20_50_21_405

logstash query: "Connection to neutron failed: Maximum attempts reached" AND filename:"logs/screen-n-sch.txt"

Revision history for this message
Joe Gordon (jogo) wrote :
Matt Riedemann (mriedem)
tags: added: network scheduler
Revision history for this message
yong sheng gong (gongysh) wrote :

How do I run the gate-tempest-devstack-vm-neutron-isolated job under a devstack env?

Revision history for this message
Thierry Carrez (ttx) wrote :

328 Hits

Changed in nova:
importance: Undecided → Critical
Changed in neutron:
importance: Undecided → Critical
Changed in nova:
milestone: none → icehouse-1
Changed in neutron:
milestone: none → icehouse-1
Revision history for this message
Derek Higgins (derekh) wrote :

I'm seeing something similar on TripleO/Fedora:

 ERROR nova.api.openstack [req-b249d867-efa9-4b69-bb00-bcd188d35d95 f6aafe318df1469498437c69e9f20206 677e5e66567a4d1ebe0b1946452a03ef] Ca
 TRACE nova.api.openstack Traceback (most recent call last):
 TRACE nova.api.openstack File "/opt/stack/venvs/nova/lib/python2.7/site-packages/nova/api/openstack/__init__.py", line 120, in __call_
 TRACE nova.api.openstack return req.get_response(self.application)
 TRACE nova.api.openstack File "/usr/lib/python2.7/site-packages/webob/request.py", line 1296, in send
 TRACE nova.api.openstack application, catch_exc_info=False)
 TRACE nova.api.openstack File "/usr/lib/python2.7/site-packages/webob/request.py", line 1260, in call_application
 TRACE nova.api.openstack app_iter = application(self.environ, start_response)
 TRACE nova.api.openstack File "/usr/lib/python2.7/site-packages/webob/dec.py", line 144, in __call__
 TRACE nova.api.openstack return resp(environ, start_response)
 TRACE nova.api.openstack File "/usr/lib/python2.7/site-packages/keystoneclient/middleware/auth_token.py", line 571, in __call__
 TRACE nova.api.openstack return self.app(env, start_response)
 TRACE nova.api.openstack File "/usr/lib/python2.7/site-packages/webob/dec.py", line 144, in __call__
 TRACE nova.api.openstack return resp(environ, start_response)
 TRACE nova.api.openstack File "/usr/lib/python2.7/site-packages/webob/dec.py", line 144, in __call__
 TRACE nova.api.openstack return resp(environ, start_response)
 TRACE nova.api.openstack File "/usr/lib/python2.7/site-packages/routes/middleware.py", line 131, in __call__
 TRACE nova.api.openstack response = self.app(environ, start_response)
 TRACE nova.api.openstack File "/usr/lib/python2.7/site-packages/webob/dec.py", line 144, in __call__
 TRACE nova.api.openstack return resp(environ, start_response)
 TRACE nova.api.openstack File "/usr/lib/python2.7/site-packages/webob/dec.py", line 130, in __call__
 TRACE nova.api.openstack resp = self.call_func(req, *args, **self.kwargs)
 TRACE nova.api.openstack File "/usr/lib/python2.7/site-packages/webob/dec.py", line 195, in call_func
 TRACE nova.api.openstack return self.func(req, *args, **kwargs)
 TRACE nova.api.openstack File "/opt/stack/venvs/nova/lib/python2.7/site-packages/nova/api/openstack/wsgi.py", line 940, in __call__
 TRACE nova.api.openstack content_type, body, accept)
 TRACE nova.api.openstack File "/opt/stack/venvs/nova/lib/python2.7/site-packages/nova/api/openstack/wsgi.py", line 1025, in _process_s
 TRACE nova.api.openstack request, action_args)
 TRACE nova.api.openstack File "/opt/stack/venvs/nova/lib/python2.7/site-packages/nova/api/openstack/wsgi.py", line 913, in post_proces
 TRACE nova.api.openstack **action_args)
 TRACE nova.api.openstack File "/opt/stack/venvs/nova/lib/python2.7/site-packages/nova/api/openstack/compute/contrib/security_groups.py
 TRAC...


Revision history for this message
Derek Higgins (derekh) wrote :
Derek Higgins (derekh)
Changed in tripleo:
importance: Undecided → Critical
Revision history for this message
Derek Higgins (derekh) wrote :

These seem to have started around the same time as this commit:
https://review.openstack.org/#/c/56075/

Caching the Neutron client could explain how it's possible for HTTP requests/responses to start happening out of order (see the tcpdump above).
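
As a rough, self-contained illustration of that suspicion (hypothetical code, not nova or neutronclient source; the class and names are made up): when a single client object is cached and shared between greenthreads, one greenthread can end up reading the response that belongs to the other greenthread's request.

    # Hypothetical sketch: one cached "client" shared by two greenthreads.
    import eventlet
    from eventlet import queue


    class SharedFakeClient(object):
        """Stands in for a cached HTTP client with a single response stream."""

        def __init__(self):
            self._responses = queue.Queue()

        def send_request(self, name, server_delay):
            # Pretend the server answers this request after server_delay seconds.
            eventlet.spawn_after(server_delay, self._responses.put,
                                 'response-for-%s' % name)

        def read_response(self):
            # Returns whichever response arrives first, regardless of who asked.
            return self._responses.get()


    client = SharedFakeClient()  # one client, cached and shared: the suspected problem


    def call(name, server_delay):
        client.send_request(name, server_delay)
        print('%s received %s' % (name, client.read_response()))


    pool = eventlet.GreenPool()
    pool.spawn(call, 'A', 0.2)  # A's reply is slower...
    pool.spawn(call, 'B', 0.1)  # ...so A can end up reading B's response
    pool.waitall()

With a per-request (or correctly greenthread-local) client, each caller would have its own response stream and the mismatch could not happen.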

Revision history for this message
Salvatore Orlando (salvatore-orlando) wrote :

Derek - that was actually why I asked if the bug was somehow a resurfacing of bug 1211915.

The change you linked was introduced to remove an awful lot of round trips to keystone, which made the large-ops jobs fail 100% of the time; on the other hand, I think admin client caching was removed during the Havana cycle because it was related to bug 1211915.

Revision history for this message
Davanum Srinivas (DIMS) (dims-v) wrote :

Salvatore, Derek, I think there is an underlying error we are not logging: https://review.openstack.org/#/c/56316/. Can we push this in and see what we log?

-- dims

Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

Copy pasting this from my email:

I think I found a serious bug in our usage of eventlet thread local
storage. Please check out this snippet [1].

This is how we use eventlet TLS in Nova and the common Oslo code [2]. This could explain how [3] actually breaks the TripleO devtest story and our gates.

Am I right? Or am I missing something and should get some sleep? :)

[1] http://paste.openstack.org/show/53686/
[2] https://github.com/openstack/nova/blob/master/nova/openstack/common/local.py#L48
[3] https://github.com/openstack/nova/commit/85332012dede96fa6729026c2a90594ea0502ac5
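
A minimal sketch of the distinction being discussed, assuming eventlet's corolocal.local (illustrative code, not the oslo local.py module itself): attributes on a corolocal.local object are private to each greenthread, while anything cached at module level is a single object shared by every greenthread, so a client stored there gets reused concurrently.

    # Illustrative only: per-greenthread storage vs. a shared module-level cache.
    import eventlet
    from eventlet import corolocal

    tls = corolocal.local()  # per-greenthread attribute namespace
    shared_cache = {}        # module-level cache: one object for everyone


    def worker(name):
        tls.client = 'client-for-%s' % name                        # private
        shared_cache.setdefault('client', 'client-for-%s' % name)  # first one wins
        eventlet.sleep(0)                                          # let the other worker run
        print('%s: tls=%s shared=%s' % (name, tls.client, shared_cache['client']))


    pool = eventlet.GreenPool()
    pool.spawn(worker, 'A')
    pool.spawn(worker, 'B')
    pool.waitall()
    # Each worker sees its own tls.client, but both see the same
    # shared_cache['client'], i.e. that cached object is shared across greenthreads.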

Revision history for this message
Peter Feiner (pete5) wrote :

I'm pretty sure that I reported another instance of the same bug: https://bugs.launchpad.net/nova/+bug/1252846.

I modified python-neutronclient to raise the underlying exception instead of the generic "Maximum attempts reached" and observed that it's always something related to racy uses of python-neutronclient file descriptors (e.g., a double close, or two greenthreads reading from the same file descriptor). Like you guys are saying, these errors are strongly indicative of clients being shared when they aren't supposed to be (i.e., incorrect use of TLS).

Nova-compute's use of TLS with neutron clients seemed pretty straightforward, so I dismissed it as the problem. However, given your references, I seem to have ruled it out incorrectly. I'll plug away at it locally and report back.
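
The kind of local debugging change described above, written here as a standalone retry helper rather than a patch to the real retry_request (the name and signature are illustrative): keep the last underlying exception and re-raise it instead of masking it behind a generic "maximum attempts reached" error.

    # Illustrative retry helper, not python-neutronclient code.
    import logging

    LOG = logging.getLogger(__name__)


    def retry_call(func, attempts=3):
        """Call func(); on failure retry, then re-raise the real exception."""
        last_exc = None
        for attempt in range(attempts):
            try:
                return func()
            except Exception as exc:  # real code would catch only transport errors
                last_exc = exc
                LOG.warning('attempt %d/%d failed: %s', attempt + 1, attempts, exc)
        # Surface the underlying failure (e.g. a socket error caused by a shared
        # file descriptor) instead of a generic ConnectionFailed.
        raise last_exc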

Revision history for this message
Peter Feiner (pete5) wrote :

According to my local testing, this bug seems to be fixed by https://review.openstack.org/#/c/57509/2.

Revision history for this message
Robert Collins (lifeless) wrote :

This is likely a dupe of bug 1251920, then.

Revision history for this message
Matt Riedemann (mriedem) wrote :

Yes, given that https://review.openstack.org/56075 merged on 11/14 and the gate issues started happening the same day, according to:

http://bit.ly/1c2lEIr

We actually have a team internally doing scale testing and they were hitting this same problem, and came to the same conclusions.

Revision history for this message
Davanum Srinivas (DIMS) (dims-v) wrote :

Per the notes on email, https://review.openstack.org/#/c/57509/ should have fixed this one as well.

Revision history for this message
Joe Gordon (jogo) wrote :

This is not a duplicate of bug 1251920; the fix for this bug did not fix 1251920.

Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

OK, so this was a bug in Nova and it is fixed now. What about Neutron and TripleO? Should we remove them from the list of affected projects? I just don't want to see this one at the top of the TripleO bug list, given that it's not a TripleO bug and it's fixed ;)

Revision history for this message
Matt Riedemann (mriedem) wrote :

(8:28:13 AM) anteaya: mriedem: I checked all merged nova patches back to the 20th - I don't see a merged patch that references bug 1251784
(8:29:07 AM) mriedem: anteaya: https://review.openstack.org/#/c/57509/ fixes 1251784 but was incorrectly marked as fixing 1251920
(8:29:26 AM) mriedem: anteaya: the workaround for 1251920 was by skipping a test in tempest, looking that commit up now
(8:30:11 AM) mriedem: anteaya: this is the stopgap workaround for 1251920: https://review.openstack.org/#/c/57193/
(8:31:07 AM) mriedem: anteaya: the tempest skip is explained here: http://lists.openstack.org/pipermail/openstack-dev/2013-November/020095.html
(8:32:25 AM) anteaya: mriedem: okay yes, am just trying to ascertain if 1251784 needs more effort from -neutron, they are asking (yay) and I want to give them good information

Changed in nova:
status: New → Fix Committed
assignee: nobody → Robert Collins (lifeless)
Revision history for this message
Davanum Srinivas (DIMS) (dims-v) wrote :
Changed in tripleo:
status: New → Triaged
status: Triaged → Fix Released
Revision history for this message
Anita Kuno (anteaya) wrote :
Revision history for this message
Anita Kuno (anteaya) wrote :
Thierry Carrez (ttx)
Changed in neutron:
milestone: icehouse-1 → icehouse-2
Thierry Carrez (ttx)
Changed in nova:
status: Fix Committed → Fix Released
Revision history for this message
Joe Gordon (jogo) wrote :
Revision history for this message
Mark McClain (markmcclain) wrote :
no longer affects: neutron
Thierry Carrez (ttx)
Changed in nova:
milestone: icehouse-1 → 2014.1
Revision history for this message
Matt Riedemann (mriedem) wrote :
Revision history for this message
Matt Riedemann (mriedem) wrote :

Really spiked back up around 4/17 and was back down by 4/19.

Revision history for this message
Matt Riedemann (mriedem) wrote :

I wonder if this could be related? https://review.openstack.org/88484

Revision history for this message
Matt Riedemann (mriedem) wrote :

This is also showing up with about a 50% failure rate:

http://logstash.openstack.org/#eyJmaWVsZHMiOltdLCJzZWFyY2giOiJtZXNzYWdlOlwiTG9jayB3YWl0IHRpbWVvdXQgZXhjZWVkZWQ7IHRyeSByZXN0YXJ0aW5nIHRyYW5zYWN0aW9uXCIgQU5EIHRhZ3M6c2NyZWVuLXEtc3ZjLnR4dCIsInRpbWVmcmFtZSI6IjYwNDgwMCIsImdyYXBobW9kZSI6ImNvdW50Iiwib2Zmc2V0IjowLCJ0aW1lIjp7InVzZXJfaW50ZXJ2YWwiOjB9LCJzdGFtcCI6MTM5ODM3MDY0ODU0N30=

Sounds like there are lock timeout issues with the database in neutron, maybe due to chatty agents, and I'm thinking neutron needs to wrap whatever is blowing up with a retry_on_deadlock decorator like nova does in its DB API layer.
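
For reference, the general shape of such a decorator, as a hedged sketch rather than nova's actual implementation (the DBDeadlock import assumes oslo.db; the retry count and delay are arbitrary):

    # Sketch of a retry-on-deadlock decorator; not nova's db/sqlalchemy/api.py code.
    import functools
    import time

    from oslo_db import exception as db_exc  # assumption: oslo.db is available


    def retry_on_deadlock(func, retries=5, delay=0.5):
        """Retry the decorated DB call when the database reports a deadlock."""
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(retries):
                try:
                    return func(*args, **kwargs)
                except db_exc.DBDeadlock:
                    if attempt == retries - 1:
                        raise
                    time.sleep(delay)  # brief back-off before retrying
        return wrapper

    # Usage (hypothetical): put @retry_on_deadlock above a DB API function that
    # hits "Lock wait timeout exceeded; try restarting transaction".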

Revision history for this message
Salvatore Orlando (salvatore-orlando) wrote :

This is definitely a neutron error.
I will take the logstash query and open new bugs if that's the case.
I don't think it makes sense to reopen this bug, because it's already marked as released and we'd rather have something more specific as the bug description.

Logstash also reveals no failures in the gate queue - this is because almost all failures are in the neutron full job.

Revision history for this message
Salvatore Orlando (salvatore-orlando) wrote :

I've looked at the failures in the past week and they're all "Lock wait timeout" in Neutron.
However, there are several different occurrences of the timeout error, and I will therefore file separate bugs.
I am a bit gutted that the one we fixed - on updating port status - appears to be back.

I think from a gate stability perspective there is no risk at the moment, as 99% of the failures are in the full jobs (which we'll enable soon, however, for neutron only).
