500 error when trying to list instances and neutron-server is down

Bug #1567655 reported by Matt Riedemann
16
This bug affects 3 people
Affects Status Importance Assigned to Milestone
OpenStack Compute (nova)
Confirmed
Medium
Unassigned

Bug Description

This is a newton devstack created today running neutron + ovs:

1. create an instance, wait for it to go active
2. list instances, it's fine
3. stop all neutron-server processes
4. list instances - it fails with a 500 trying to process security groups because it can't connect to neutron:

http://paste.openstack.org/show/493416/

2016-04-07 20:30:54.272 ERROR nova.api.openstack [req-5d867a48-2097-456f-a2ae-f93c982ac5d0 admin admin] Caught error: Unable to establish connection to http://9.5.124.163:9696/v2.0/ports.json?device_id=cda8a8ac-6eac-434d-bded-a0b34d285f41
2016-04-07 20:30:54.272 TRACE nova.api.openstack Traceback (most recent call last):
2016-04-07 20:30:54.272 TRACE nova.api.openstack File "/opt/stack/nova/nova/api/openstack/__init__.py", line 134, in __call__
2016-04-07 20:30:54.272 TRACE nova.api.openstack return req.get_response(self.application)
2016-04-07 20:30:54.272 TRACE nova.api.openstack File "/usr/local/lib/python2.7/dist-packages/webob/request.py", line 1317, in send
2016-04-07 20:30:54.272 TRACE nova.api.openstack application, catch_exc_info=False)
2016-04-07 20:30:54.272 TRACE nova.api.openstack File "/usr/local/lib/python2.7/dist-packages/webob/request.py", line 1281, in call_application
2016-04-07 20:30:54.272 TRACE nova.api.openstack app_iter = application(self.environ, start_response)
2016-04-07 20:30:54.272 TRACE nova.api.openstack File "/usr/local/lib/python2.7/dist-packages/webob/dec.py", line 144, in __call__
2016-04-07 20:30:54.272 TRACE nova.api.openstack return resp(environ, start_response)
2016-04-07 20:30:54.272 TRACE nova.api.openstack File "/usr/local/lib/python2.7/dist-packages/webob/dec.py", line 130, in __call__
2016-04-07 20:30:54.272 TRACE nova.api.openstack resp = self.call_func(req, *args, **self.kwargs)
2016-04-07 20:30:54.272 TRACE nova.api.openstack File "/usr/local/lib/python2.7/dist-packages/webob/dec.py", line 195, in call_func
2016-04-07 20:30:54.272 TRACE nova.api.openstack return self.func(req, *args, **kwargs)
2016-04-07 20:30:54.272 TRACE nova.api.openstack File "/usr/local/lib/python2.7/dist-packages/keystonemiddleware/auth_token/__init__.py", line 467, in __call__
2016-04-07 20:30:54.272 TRACE nova.api.openstack response = req.get_response(self._app)
2016-04-07 20:30:54.272 TRACE nova.api.openstack File "/usr/local/lib/python2.7/dist-packages/webob/request.py", line 1317, in send
2016-04-07 20:30:54.272 TRACE nova.api.openstack application, catch_exc_info=False)
2016-04-07 20:30:54.272 TRACE nova.api.openstack File "/usr/local/lib/python2.7/dist-packages/webob/request.py", line 1281, in call_application
2016-04-07 20:30:54.272 TRACE nova.api.openstack app_iter = application(self.environ, start_response)
2016-04-07 20:30:54.272 TRACE nova.api.openstack File "/usr/local/lib/python2.7/dist-packages/webob/dec.py", line 144, in __call__
2016-04-07 20:30:54.272 TRACE nova.api.openstack return resp(environ, start_response)
2016-04-07 20:30:54.272 TRACE nova.api.openstack File "/usr/local/lib/python2.7/dist-packages/webob/dec.py", line 144, in __call__
2016-04-07 20:30:54.272 TRACE nova.api.openstack return resp(environ, start_response)
2016-04-07 20:30:54.272 TRACE nova.api.openstack File "/usr/local/lib/python2.7/dist-packages/routes/middleware.py", line 141, in __call__
2016-04-07 20:30:54.272 TRACE nova.api.openstack response = self.app(environ, start_response)
2016-04-07 20:30:54.272 TRACE nova.api.openstack File "/usr/local/lib/python2.7/dist-packages/webob/dec.py", line 144, in __call__
2016-04-07 20:30:54.272 TRACE nova.api.openstack return resp(environ, start_response)
2016-04-07 20:30:54.272 TRACE nova.api.openstack File "/usr/local/lib/python2.7/dist-packages/webob/dec.py", line 130, in __call__
2016-04-07 20:30:54.272 TRACE nova.api.openstack resp = self.call_func(req, *args, **self.kwargs)
2016-04-07 20:30:54.272 TRACE nova.api.openstack File "/usr/local/lib/python2.7/dist-packages/webob/dec.py", line 195, in call_func
2016-04-07 20:30:54.272 TRACE nova.api.openstack return self.func(req, *args, **kwargs)
2016-04-07 20:30:54.272 TRACE nova.api.openstack File "/opt/stack/nova/nova/api/openstack/wsgi.py", line 672, in __call__
2016-04-07 20:30:54.272 TRACE nova.api.openstack content_type, body, accept)
2016-04-07 20:30:54.272 TRACE nova.api.openstack File "/opt/stack/nova/nova/api/openstack/wsgi.py", line 756, in _process_stack
2016-04-07 20:30:54.272 TRACE nova.api.openstack request, action_args)
2016-04-07 20:30:54.272 TRACE nova.api.openstack File "/opt/stack/nova/nova/api/openstack/wsgi.py", line 619, in post_process_extensions
2016-04-07 20:30:54.272 TRACE nova.api.openstack **action_args)
2016-04-07 20:30:54.272 TRACE nova.api.openstack File "/opt/stack/nova/nova/api/openstack/compute/security_groups.py", line 491, in detail
2016-04-07 20:30:54.272 TRACE nova.api.openstack self._extend_servers(req, list(resp_obj.obj['servers']))
2016-04-07 20:30:54.272 TRACE nova.api.openstack File "/opt/stack/nova/nova/api/openstack/compute/security_groups.py", line 461, in _extend_servers
2016-04-07 20:30:54.272 TRACE nova.api.openstack servers))
2016-04-07 20:30:54.272 TRACE nova.api.openstack File "/opt/stack/nova/nova/network/security_group/neutron_driver.py", line 381, in get_instances_security_groups_bindings
2016-04-07 20:30:54.272 TRACE nova.api.openstack ports = self._get_ports_from_server_list(servers, neutron)
2016-04-07 20:30:54.272 TRACE nova.api.openstack File "/opt/stack/nova/nova/network/security_group/neutron_driver.py", line 340, in _get_ports_from_server_list
2016-04-07 20:30:54.272 TRACE nova.api.openstack ports.extend(neutron.list_ports(**search_opts).get('ports'))
2016-04-07 20:30:54.272 TRACE nova.api.openstack File "/opt/stack/python-neutronclient/neutronclient/v2_0/client.py", line 618, in list_ports
2016-04-07 20:30:54.272 TRACE nova.api.openstack **_params)
2016-04-07 20:30:54.272 TRACE nova.api.openstack File "/opt/stack/python-neutronclient/neutronclient/v2_0/client.py", line 357, in list
2016-04-07 20:30:54.272 TRACE nova.api.openstack for r in self._pagination(collection, path, **params):
2016-04-07 20:30:54.272 TRACE nova.api.openstack File "/opt/stack/python-neutronclient/neutronclient/v2_0/client.py", line 372, in _pagination
2016-04-07 20:30:54.272 TRACE nova.api.openstack res = self.get(path, params=params)
2016-04-07 20:30:54.272 TRACE nova.api.openstack File "/opt/stack/python-neutronclient/neutronclient/v2_0/client.py", line 342, in get
2016-04-07 20:30:54.272 TRACE nova.api.openstack headers=headers, params=params)
2016-04-07 20:30:54.272 TRACE nova.api.openstack File "/opt/stack/python-neutronclient/neutronclient/v2_0/client.py", line 319, in retry_request
2016-04-07 20:30:54.272 TRACE nova.api.openstack headers=headers, params=params)
2016-04-07 20:30:54.272 TRACE nova.api.openstack File "/opt/stack/python-neutronclient/neutronclient/v2_0/client.py", line 270, in do_request
2016-04-07 20:30:54.272 TRACE nova.api.openstack resp, replybody = self.httpclient.do_request(action, method, body=body)
2016-04-07 20:30:54.272 TRACE nova.api.openstack File "/opt/stack/python-neutronclient/neutronclient/client.py", line 306, in do_request
2016-04-07 20:30:54.272 TRACE nova.api.openstack return self.request(url, method, **kwargs)
2016-04-07 20:30:54.272 TRACE nova.api.openstack File "/opt/stack/python-neutronclient/neutronclient/client.py", line 294, in request
2016-04-07 20:30:54.272 TRACE nova.api.openstack resp = super(SessionClient, self).request(*args, **kwargs)
2016-04-07 20:30:54.272 TRACE nova.api.openstack File "/usr/local/lib/python2.7/dist-packages/keystoneauth1/adapter.py", line 98, in request
2016-04-07 20:30:54.272 TRACE nova.api.openstack return self.session.request(url, method, **kwargs)
2016-04-07 20:30:54.272 TRACE nova.api.openstack File "/usr/local/lib/python2.7/dist-packages/positional/__init__.py", line 94, in inner
2016-04-07 20:30:54.272 TRACE nova.api.openstack return func(*args, **kwargs)
2016-04-07 20:30:54.272 TRACE nova.api.openstack File "/usr/local/lib/python2.7/dist-packages/keystoneauth1/session.py", line 452, in request
2016-04-07 20:30:54.272 TRACE nova.api.openstack resp = send(**kwargs)
2016-04-07 20:30:54.272 TRACE nova.api.openstack File "/usr/local/lib/python2.7/dist-packages/keystoneauth1/session.py", line 496, in _send_request
2016-04-07 20:30:54.272 TRACE nova.api.openstack raise exceptions.ConnectFailure(msg)
2016-04-07 20:30:54.272 TRACE nova.api.openstack ConnectFailure: Unable to establish connection to http://9.5.124.163:9696/v2.0/ports.json?device_id=cda8a8ac-6eac-434d-bded-a0b34d285f41
2016-04-07 20:30:54.272 TRACE nova.api.openstack

Matt Riedemann (mriedem)
Changed in nova:
status: New → Confirmed
importance: Undecided → Medium
tags: added: network security-groups
Erik Neeley (erikneeley)
Changed in nova:
status: Confirmed → Incomplete
status: Incomplete → Confirmed
Changed in nova:
assignee: nobody → Sreenivas (sreenivas-pothukanoori)
Revision history for this message
Markus Zoeller (markus_z) (mzoeller) wrote :

Solving an inconsistency: This bug report has an assignee and it looks
like this will result in a patch. Therefore I switch the status to
"In Progress".
Dear assignee, please provide a patch in the next 2 weeks. If you stop
working on this report, please remove yourself as assignee and switch
the status back. If you need assistance, reach out on the IRC channel
#openstack-nova or use the mailing list.

Changed in nova:
status: Confirmed → In Progress
Revision history for this message
Sreenivas (sreenivas-pothukanoori) wrote :

Hi,

Please let me know on which version of openstack(Kilo/Liberty) this issue has been observed.

Also, please let me know what is the expected behavior of the 4th step(step4, mentioned in the above bug description).

Revision history for this message
Matt Riedemann (mriedem) wrote :

Per comment 2, this is newton, as mentioned in the description. Newton devstack created on 2016-04-07, so trunk master code as of that day.

The expected behavior is that we don't get a 500. Listing instances should more or less be a database operation, the issue is in the case of using neutron as the network service, we call out to neutron to get security group details per instance. Since neutron-server is down, that API call fails, Nova doesn't handle it and we return a 500 response.

Dan Smith (dansmith in #openstack-nova on freenode IRC) suggested caching the security groups for neutron, in a new column in the instance_info_caches table in the nova database. That is going to be a more complicated change though (requires a database migration, an rpc version change to the InstanceInfoCache object, etc).

As a partial fix we can backport, we should gracefully handle the connection error when trying to talk to neutron API and return a more appropriate response code (503 maybe?).

Revision history for this message
Matt Riedemann (mriedem) wrote :

The error handling portion of this is related to bug 1571722.

Revision history for this message
Matt Riedemann (mriedem) wrote :

Per comment 4, actually, python-neutronclient doesn't handle the ConnectFailure exception that is coming from keystoneauth, so nova would have to handle that, if we wanted to handle it, which gets kind of ugly.

Revision history for this message
Sreenivas (sreenivas-pothukanoori) wrote :

As currently I am working on one Liberty bug, I dont want to upgrade my openstack setup from Liberty to Newton. Once I upgrade my setup to Newton and if the bug is not assigned to anyone, ill pick and work on this bug.

releasing this bug from my name.

Changed in nova:
status: In Progress → Confirmed
Changed in nova:
assignee: Sreenivas (sreenivas-pothukanoori) → nobody
Prateek Arora (parora)
Changed in nova:
assignee: nobody → Prateek Arora (parora)
Revision history for this message
Prateek Arora (parora) wrote :

The way I am approaching this is I am using a wrapper in the neutron API to catch the exception coming from keystoneauth and then raising it as a separate exception.

As Matt said the patch would be similar to https://bugs.launchpad.net/bugs/1571722 .

Changed in nova:
status: Confirmed → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/327706

Revision history for this message
Sean Dague (sdague) wrote :

The current patch does a lot of hand holding around doing the error translation differently. That doesn't really feel like it addresses the real issue, and I'm not sure is worth the complexity to have to explain this to people. Stuff still doesn't work. It just now doesn't work in a different manner that needs to be documented.

-1 on returning a different error. Please lets just fix the crux of the issue and keep security groups cached in some sane way.

Revision history for this message
Matt Riedemann (mriedem) wrote :

From IRC conversation this morning:

(7:55:59 AM) mriedem: prateek: there is a comment in the bug report about adding a new column to the instance_info_caches table for serialized security groups
(7:56:34 AM) mriedem: we'd basically be serializing and deserializing SecurityGroupList for a given instance.security_groups into that table
(7:56:54 AM) mriedem: on a GET if it's not there we'd have to pull the security groups from neutron, and then store it in the cache,
(7:57:02 AM) mriedem: and there is code to 'heal' that cache,
(7:57:16 AM) mriedem: and there is a 'refresh_cache' decorator in the network API code for updating that whenever we change network resources tied to an instance
(7:57:22 AM) mriedem: it's definitely not trivial
(7:57:43 AM) mriedem: especially because the network_api and security_group_api are in different interfaces
(7:58:15 AM) mriedem: the network api would just have a handle to the secgroup api in order to refresh that cache i suppose
(7:58:35 AM) mriedem: anyway, it's probably more than a bug, it's probably a bp/spec given the various considerations for getting this right

--

Net is, we probably need a blueprint and a spec for this given the complexity in the change to start caching the security groups for neutron in the instance_info_cache table.

Revision history for this message
Sean Dague (sdague) wrote :

To follow on, I was thinking that there might be a reasonable intermediary solution. I think changing a 500 -> 504 is kind of pointless for the user. They still don't get any useful data.

That being said... we used to have a similar sub select issue on some part of the detailed server list. And what we did was skip over the rows that were deleted. It's a little more complicated because this is another rest service, however I think that building a standard of returning something like a 206 and just marking out the security_groups as 'UNKNOWN' would be ok here. Because 95% of the information will be accessible, but we can indicate this is not perfect information, but it might be good enough for your use case.

Revision history for this message
Armando Migliaccio (armando-migliaccio) wrote :

What's puzzling is that this is showing in the gate every now and then. I don't believe there's someone stopping neutron server during the tests.

http://logs.openstack.org/77/401377/2/gate/gate-tempest-dsvm-neutron-dvr-ubuntu-xenial/078fa95/logs/screen-n-api.txt.gz?level=TRACE#_2016-11-24_02_21_30_203

Revision history for this message
Prateek Arora (parora) wrote :

I get a file not found when i try to open the above link

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (master)

Change abandoned by Michael Still (<email address hidden>) on branch: master
Review: https://review.openstack.org/327706
Reason: This patch has been sitting unchanged for more than 12 weeks. I am therefore going to abandon it to keep the nova review queue sane. Please feel free to restore the change if you're still working on it.

Revision history for this message
Sean Dague (sdague) wrote :

There are no currently open reviews on this bug, changing
the status back to the previous state and unassigning. If
there are active reviews related to this bug, please include
links in comments.

Changed in nova:
status: In Progress → Confirmed
assignee: Prateek Arora (parora) → nobody
Sean Dague (sdague)
tags: added: 500-error
Revision history for this message
Matt Riedemann (mriedem) wrote :

Added the performance tag to this because even when neutron is up, when listing a large number of servers, we're getting ports for each server to get the security group information for each server, which is a lot of traffic to neutron. We should be smarter about this and cache the security group information per server like we do in the instance network info_cache for ports. We could heal the cache during the normal _heal_instance_info_cache periodic task.

tags: added: performance
Revision history for this message
Matt Riedemann (mriedem) wrote :

Discussed at the Stein PTG:

https://etherpad.openstack.org/p/nova-ptg-stein

And we need to figure out if neutron sends a network-changed event to nova in the case of security group changes on the ports attached to a server. If not, then we probably need to add that to neutron.

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.