giant race in quantum - quantumclient key errors break creation of guests in nova

Bug #1192131 reported by Sean Dague
This bug affects 18 people
Affects: OpenStack Compute (nova) | Status: Fix Released | Importance: Critical | Assigned to: Russell Bryant
Affects: neutron | Status: Invalid | Importance: Critical | Assigned to: Mark McClain

Bug Description

The current #1 issue in the gate is a quantum KeyError, which looks to be a race in how quantum responds to requests from nova. A representative error can be seen here:

2013-06-17 15:32:13.856 22755 DEBUG quantumclient.client [-] RESP:{'date': 'Mon, 17 Jun 2013 15:32:13 GMT', 'status': '200', 'content-length': '16', 'content-type': 'application/json; charset=UTF-8', 'content-location': 'http://127.0.0.1:9696//v2.0/floatingips.json?fixed_ip_address=10.100.0.2&port_id=fb9461c6-4e82-4fef-a06b-77242289f23d'} {"networks": []}
 http_log_resp /opt/stack/new/python-quantumclient/quantumclient/common/utils.py:179
2013-06-17 15:32:13.856 22755 ERROR nova.compute.manager [-] [instance: f9327102-6a95-4d14-b262-2992d4ad66f3] Instance failed network setup
2013-06-17 15:32:13.856 22755 TRACE nova.compute.manager [instance: f9327102-6a95-4d14-b262-2992d4ad66f3] Traceback (most recent call last):
2013-06-17 15:32:13.856 22755 TRACE nova.compute.manager [instance: f9327102-6a95-4d14-b262-2992d4ad66f3] File "/opt/stack/new/nova/nova/compute/manager.py", line 1172, in async_alloc
2013-06-17 15:32:13.856 22755 TRACE nova.compute.manager [instance: f9327102-6a95-4d14-b262-2992d4ad66f3] security_groups=security_groups)
2013-06-17 15:32:13.856 22755 TRACE nova.compute.manager [instance: f9327102-6a95-4d14-b262-2992d4ad66f3] File "/opt/stack/new/nova/nova/network/api.py", line 48, in wrapper
2013-06-17 15:32:13.856 22755 TRACE nova.compute.manager [instance: f9327102-6a95-4d14-b262-2992d4ad66f3] res = f(self, context, *args, **kwargs)
2013-06-17 15:32:13.856 22755 TRACE nova.compute.manager [instance: f9327102-6a95-4d14-b262-2992d4ad66f3] File "/opt/stack/new/nova/nova/network/quantumv2/api.py", line 285, in allocate_for_instance
2013-06-17 15:32:13.856 22755 TRACE nova.compute.manager [instance: f9327102-6a95-4d14-b262-2992d4ad66f3] nw_info = self._get_instance_nw_info(context, instance, networks=nets)
2013-06-17 15:32:13.856 22755 TRACE nova.compute.manager [instance: f9327102-6a95-4d14-b262-2992d4ad66f3] File "/opt/stack/new/nova/nova/network/quantumv2/api.py", line 377, in _get_instance_nw_info
2013-06-17 15:32:13.856 22755 TRACE nova.compute.manager [instance: f9327102-6a95-4d14-b262-2992d4ad66f3] nw_info = self._build_network_info_model(context, instance, networks)
2013-06-17 15:32:13.856 22755 TRACE nova.compute.manager [instance: f9327102-6a95-4d14-b262-2992d4ad66f3] File "/opt/stack/new/nova/nova/network/quantumv2/api.py", line 856, in _build_network_info_model
2013-06-17 15:32:13.856 22755 TRACE nova.compute.manager [instance: f9327102-6a95-4d14-b262-2992d4ad66f3] network_IPs = self._nw_info_get_ips(client, port)
2013-06-17 15:32:13.856 22755 TRACE nova.compute.manager [instance: f9327102-6a95-4d14-b262-2992d4ad66f3] File "/opt/stack/new/nova/nova/network/quantumv2/api.py", line 785, in _nw_info_get_ips
2013-06-17 15:32:13.856 22755 TRACE nova.compute.manager [instance: f9327102-6a95-4d14-b262-2992d4ad66f3] client, fixed_ip['ip_address'], port['id'])
2013-06-17 15:32:13.856 22755 TRACE nova.compute.manager [instance: f9327102-6a95-4d14-b262-2992d4ad66f3] File "/opt/stack/new/nova/nova/network/quantumv2/api.py", line 723, in _get_floating_ips_by_fixed_and_port
2013-06-17 15:32:13.856 22755 TRACE nova.compute.manager [instance: f9327102-6a95-4d14-b262-2992d4ad66f3] port_id=port)
2013-06-17 15:32:13.856 22755 TRACE nova.compute.manager [instance: f9327102-6a95-4d14-b262-2992d4ad66f3] File "/opt/stack/new/python-quantumclient/quantumclient/v2_0/client.py", line 108, in with_params
2013-06-17 15:32:13.856 22755 TRACE nova.compute.manager [instance: f9327102-6a95-4d14-b262-2992d4ad66f3] ret = self.function(instance, *args, **kwargs)
2013-06-17 15:32:13.856 22755 TRACE nova.compute.manager [instance: f9327102-6a95-4d14-b262-2992d4ad66f3] File "/opt/stack/new/python-quantumclient/quantumclient/v2_0/client.py", line 441, in list_floatingips
2013-06-17 15:32:13.856 22755 TRACE nova.compute.manager [instance: f9327102-6a95-4d14-b262-2992d4ad66f3] **_params)
2013-06-17 15:32:13.856 22755 TRACE nova.compute.manager [instance: f9327102-6a95-4d14-b262-2992d4ad66f3] File "/opt/stack/new/python-quantumclient/quantumclient/v2_0/client.py", line 1022, in list
2013-06-17 15:32:13.856 22755 TRACE nova.compute.manager [instance: f9327102-6a95-4d14-b262-2992d4ad66f3] res.extend(r[collection])
2013-06-17 15:32:13.856 22755 TRACE nova.compute.manager [instance: f9327102-6a95-4d14-b262-2992d4ad66f3] KeyError: 'floatingips'
2013-06-17 15:32:13.856 22755 TRACE nova.compute.manager [instance: f9327102-6a95-4d14-b262-2992d4ad66f3]

It's very unclear what's going on in quantum at this point, as apparently only DEBUG-level logs are being captured rather than the normal log levels. The fact that a recheck on unrelated code makes the failure go away means there is some race in the quantum interaction that nova does not account for, and this is a huge issue in the gate right now.
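
The traceback bottoms out in quantumclient's list() helper, which indexes each decoded response body by the expected collection name. Note that the RESP line above is for a floatingips query, yet its body is {"networks": []}. A minimal, hypothetical sketch (not the actual client code) of how such a mismatched body produces the KeyError:

def list_collection(collection, responses):
    """Accumulate items for `collection` from a sequence of decoded JSON bodies."""
    res = []
    for r in responses:
        res.extend(r[collection])  # raises KeyError if the body is for another resource
    return res

# The failing case from the log, reproduced with the literal body shown above:
try:
    list_collection('floatingips', [{"networks": []}])
except KeyError as err:
    print("KeyError:", err)  # KeyError: 'floatingips'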

Sean Dague (sdague)
Changed in nova:
importance: Undecided → High
Revision history for this message
Sean Dague (sdague) wrote :
Revision history for this message
Jay Pipes (jaypipes) wrote :

I would say this should be Critical priority...

Changed in nova:
importance: High → Critical
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/33499

Changed in nova:
assignee: nobody → Russell Bryant (russellb)
status: New → In Progress
Changed in quantum:
importance: Undecided → Critical
assignee: nobody → Mark McClain (markmcclain)
Revision history for this message
Aaron Rosen (arosen) wrote :

This is a duplicate of another bug (can't find the report right now), but Armando commented on it about the following message being printed. It also appears in the log Sean posted:

2013-06-17 15:32:13.837 22755 DEBUG quantumclient.v2_0.client [-] Error message: Second simultaneous read on fileno 25 detected. Unless you really know what you're doing, make sure that only one greenthread can read any particular socket. Consider using a pools.Pool. If you do know what you're doing and want to disable this error, call eventlet.debug.hub_multiple_reader_prevention(False) _handle_fault_response /opt/stack/new/python-quantumclient/quantumclient/v2_0/client.py:899
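
The eventlet warning above suggests serializing access through a pool. A hedged sketch of what that could look like if the shared quantum client were handed out via eventlet.pools.Pool; client_factory is a hypothetical placeholder, and this is not the fix that was actually merged:

from eventlet import pools

def client_factory():
    # Placeholder: construct the quantum client however your deployment does,
    # e.g. quantumclient.v2_0.client.Client(...).
    raise NotImplementedError("build the quantum client here")

class ClientPool(pools.Pool):
    """Hand out client objects so no two greenthreads read one socket at once."""
    def create(self):
        return client_factory()

client_pool = ClientPool(max_size=4)

def list_floatingips_serialized(**params):
    client = client_pool.get()        # blocks until a client is free
    try:
        return client.list_floatingips(**params)
    finally:
        client_pool.put(client)       # return it for other greenthreads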

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/33499
Committed: http://github.com/openstack/nova/commit/ee5d9ae8d376e41e852b06488e922400cf69b4ac
Submitter: Jenkins
Branch: master

commit ee5d9ae8d376e41e852b06488e922400cf69b4ac
Author: Russell Bryant <email address hidden>
Date: Tue Jun 18 14:33:29 2013 -0400

    Revert "Delegate authentication to quantumclient"

    This reverts commit dd9c27f999221001bae9faa03571645824d2a681.

    When this patch was merged, we were suspicious about whether it was safe
    to cache the client object and have it used by multiple greenthreads.
    We decided it was safe and merged it. After thinking about it and
    discussing it further, it is really a bad idea. Sharing httplib2
    connections is not considered thread-safe.

    quantumclient.client.HTTPClient inherits from httplib2.Http. The
    following page says sharing instances of this object between threads is
    not safe:

      https://developers.google.com/api-client-library/python/guide/thread_safety

      "The google-api-python-client library is built on top of the httplib2
      library, which is not thread-safe. Therefore, if you are running as a
      multi-threaded application, each thread that you are making requests
      from must have its own instance of httplib2.Http()."

    Potentially fix bug 1192131. Even if it doesn't fix that bug, this
    needs to be reverted anyway.

    Change-Id: I2e4bf5e7b6458cd7b76e30847fe07f06b25c34f7

Changed in nova:
status: In Progress → Fix Committed
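
The revert goes back to building a client object per request instead of reusing a cached one, in line with the httplib2 guidance quoted in the commit message. A minimal sketch of that pattern, with endpoint and auth handling heavily simplified (the helper name and literal URL are illustrative, not nova's actual code):

from quantumclient.v2_0 import client as q_client

def get_quantum_client(context):
    # Build a fresh client per call rather than sharing a cached instance;
    # the underlying httplib2.Http object is not safe to share across
    # threads or greenthreads.
    return q_client.Client(endpoint_url='http://127.0.0.1:9696',
                           token=context.auth_token)
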
Revision history for this message
Russell Bryant (russellb) wrote :

That patch didn't fix it ...

Here's the next candidate fix: https://review.openstack.org/#/c/33555/

Changed in nova:
status: Fix Committed → Confirmed
Revision history for this message
Chris Behrens (cbehrens) wrote :

This bug is a duplicate of 1185834 which was originally filed on 5/30... which is before the merging of my review for allocating networks in the background (which happened on 6/7).

Revision history for this message
Chris Behrens (cbehrens) wrote :

Strange coincidence on the date here:

----
commit 00f8eb93654681d393117e86fdee43cb9ee989b9
Merge: 013bff3 dd9c27f
Author: Jenkins <email address hidden>
Date: Wed May 29 23:59:38 2013 +0000

    Merge "Delegate authentication to quantumclient"
----

but if it really didn't fix it... Hm. :)

I can tell you that my patch should pretty much be a no-change for libvirt... assuming libvirt still requires the legacy network model. There's synchronization on the call to network_info.legacy() before the call to driver.spawn() for drivers that require the legacy format.

Still searching for more possibilities before 5/30...

Revision history for this message
Hans Lindgren (hanlind) wrote :

FWIW, I've seen some comments about the KeyError bug being related to bug 1189656 where you get something like:

2013-06-10 20:59:40.670 21763 DEBUG quantumclient.client [-] RESP:{'status': '400', 'content-length': 311, 'content-type': 'text/plain'} Second simultaneous read on fileno 20 detected. Unless you really know what you're doing, make sure that only one greenthread can read any particular socket. Consider using a pools.Pool. If you do know what you're doing and want to disable this error, call eventlet.debug.hub_multiple_reader_prevention(False)
 http_log_resp /opt/stack/new/python-quantumclient/quantumclient/common/utils.py:179

I don't think they are the same. Since I originally reported bug 1185834 about the KeyError, I've been monitoring it closely and have reported a lot of duplicates, as it shows up differently depending on the failed operation (create, rescue, delete). To my knowledge, the threading issue appeared only recently, and I have previously seen many n-cpu logs with KeyErrors that aren't affected by it.

From Chris's assessment of bug 1189656 (comment #1), I would have thought that the threading issue was resolved by the recent revert of "Delegate authentication to quantumclient" and that we still face the KeyError bug. Can this be the case, or is the threading issue still causing problems?

Sean Dague (sdague)
no longer affects: tempest
Revision history for this message
Russell Bryant (russellb) wrote :

We haven't seen this failure in the last 7 days, so I think we can call it resolved.

Changed in nova:
status: Confirmed → Fix Committed
Changed in quantum:
status: New → Invalid
Thierry Carrez (ttx)
Changed in nova:
milestone: none → havana-2
status: Fix Committed → Fix Released
Thierry Carrez (ttx)
Changed in nova:
milestone: havana-2 → 2013.2
Revision history for this message
harrychillboy (harrychillboy) wrote :

This bug is due to a race condition in the neutron client caused by concurrent green threads trying to use the same connection. This leads to random KeyErrors, since green threads end up reading the wrong responses.

The attached patch is an actual fix for this issue: it does not allow green threads to use the same connection object.

Revision history for this message
Matt Riedemann (mriedem) wrote :

@harrychillboy, I think that's already been covered... I'm not sure if you're saying this is still a bug or what? Note that I had a set of 4 patches related to this fix backported to stable/havana (one of those was abandoned since it wasn't passing the check queue):

https://review.openstack.org/#/q/status:open+project:openstack/nova+branch:stable/havana+topic:backport-neutron-auth-fixes,n,z

Revision history for this message
harrychillboy (harrychillboy) wrote :

@Matt Riedemann: We tried most of the fixes mentioned here and were still seeing random issues in the neutron client. I had a look at the code changes here:
https://review.openstack.org/#/q/status:open+project:openstack/nova+branch:stable/havana+topic:backport-neutron-auth-fixes,n,z

I did not see anything that addresses the race condition that can happen when two eventlet greenthreads are given access to the same neutron client at the same time. The code protects only against OS threads and has no idea of green threads. This causes nova to fail on various neutron operations when loaded with a lot of concurrent requests.

Here is the current code:
if admin:
    if not hasattr(local.strong_store, 'neutron_client'):
        local.strong_store.neutron_client = _get_client(token=None)
    return local.strong_store.neutron_client

Here, local is thread-local storage; it has no idea that two green threads can be living inside the same OS thread. That is what causes the race condition where the same socket connection is used for two simultaneous operations.

The patch fixes this by storing the admin client per green thread ID so that clients never get shared among green threads. This avoids the situation where green thread A is using a client for operation X while green thread B is using the same client for operation Y.
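
A rough sketch of the approach described above: key the cached admin client by green thread identity rather than OS thread identity. This is an illustration, not the attached patch; _get_client stands in for nova's client constructor, and eventlet.corolocal provides greenthread-local storage:

from eventlet import corolocal

# corolocal.local gives each greenthread its own attribute namespace, unlike
# threading.local (or local.strong_store above), which is shared by every
# greenthread running inside one OS thread.
_greenthread_store = corolocal.local()

def get_admin_client():
    client = getattr(_greenthread_store, 'neutron_client', None)
    if client is None:
        client = _get_client(token=None)  # stand-in for the admin client constructor
        _greenthread_store.neutron_client = client
    return client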

Revision history for this message
Matt Riedemann (mriedem) wrote :

@harrychillboy - you wanna push a patch up to review.openstack.org for this?
