Timeout from API with 50 Simultaneous Builds

Bug #771512 reported by Antony Messerli
16
This bug affects 2 people
Affects Status Importance Assigned to Milestone
OpenStack Compute (nova)
Invalid
High
Chris Behrens
eventlet
In Progress
Undecided
Unassigned
python-eventlet (Ubuntu)
Fix Released
Undecided
Unassigned

Bug Description

Running cactus, I attempted a build of 50 VMs at once via the API and ran into a timeout where 12 of the 50 VMs timed out and are stuck in build. The remaining went active.

for i in `seq 1 50`; do nova boot test --flavor 1 --image 10 & > /dev/null ; echo $i ; done

API Limits were raised well past this in limits.py for stress testing.

2011-04-26 17:36:51,852 ERROR nova.api.openstack [-] Caught error: timed out
(nova.api.openstack): TRACE: Traceback (most recent call last):
(nova.api.openstack): TRACE: File "/usr/lib/pymodules/python2.6/nova/api/openstack/__init__.py", line 59, in __call__
(nova.api.openstack): TRACE: return req.get_response(self.application)
(nova.api.openstack): TRACE: File "/usr/lib/pymodules/python2.6/webob/request.py", line 919, in get_response
(nova.api.openstack): TRACE: application, catch_exc_info=False)
(nova.api.openstack): TRACE: File "/usr/lib/pymodules/python2.6/webob/request.py", line 887, in call_application
(nova.api.openstack): TRACE: app_iter = application(self.environ, start_response)
(nova.api.openstack): TRACE: File "/usr/lib/pymodules/python2.6/webob/dec.py", line 159, in __call__
(nova.api.openstack): TRACE: return resp(environ, start_response)
(nova.api.openstack): TRACE: File "/usr/lib/pymodules/python2.6/webob/dec.py", line 159, in __call__
(nova.api.openstack): TRACE: return resp(environ, start_response)
(nova.api.openstack): TRACE: File "/usr/lib/pymodules/python2.6/webob/dec.py", line 159, in __call__
(nova.api.openstack): TRACE: return resp(environ, start_response)
(nova.api.openstack): TRACE: File "/usr/lib/pymodules/python2.6/routes/middleware.py", line 131, in __call__
(nova.api.openstack): TRACE: response = self.app(environ, start_response)
(nova.api.openstack): TRACE: File "/usr/lib/pymodules/python2.6/webob/dec.py", line 159, in __call__
(nova.api.openstack): TRACE: return resp(environ, start_response)
(nova.api.openstack): TRACE: File "/usr/lib/pymodules/python2.6/webob/dec.py", line 147, in __call__
(nova.api.openstack): TRACE: resp = self.call_func(req, *args, **self.kwargs)
(nova.api.openstack): TRACE: File "/usr/lib/pymodules/python2.6/webob/dec.py", line 208, in call_func
(nova.api.openstack): TRACE: return self.func(req, *args, **kwargs)
(nova.api.openstack): TRACE: File "/usr/lib/pymodules/python2.6/nova/wsgi.py", line 356, in __call__
(nova.api.openstack): TRACE: result = method(**arg_dict)
(nova.api.openstack): TRACE: File "/usr/lib/pymodules/python2.6/nova/api/openstack/servers.py", line 171, in create
(nova.api.openstack): TRACE: injected_files=injected_files)
(nova.api.openstack): TRACE: File "/usr/lib/pymodules/python2.6/nova/compute/api.py", line 256, in create
(nova.api.openstack): TRACE: "injected_files": injected_files}})
(nova.api.openstack): TRACE: File "/usr/lib/pymodules/python2.6/nova/rpc.py", line 394, in cast
(nova.api.openstack): TRACE: publisher = TopicPublisher(connection=conn, topic=topic)
(nova.api.openstack): TRACE: File "/usr/lib/pymodules/python2.6/nova/rpc.py", line 241, in __init__
(nova.api.openstack): TRACE: super(TopicPublisher, self).__init__(connection=connection)
(nova.api.openstack): TRACE: File "/usr/lib/pymodules/python2.6/carrot/messaging.py", line 653, in __init__
(nova.api.openstack): TRACE: self.declare()
(nova.api.openstack): TRACE: File "/usr/lib/pymodules/python2.6/carrot/messaging.py", line 664, in declare
(nova.api.openstack): TRACE: auto_delete=self.auto_delete)
(nova.api.openstack): TRACE: File "/usr/lib/pymodules/python2.6/carrot/backends/pyamqplib.py", line 240, in exchange_declare
(nova.api.openstack): TRACE: return self.channel.exchange_declare(exchange=exchange,
(nova.api.openstack): TRACE: File "/usr/lib/pymodules/python2.6/carrot/backends/pyamqplib.py", line 179, in channel
(nova.api.openstack): TRACE: self._channel_ref = weakref.ref(self.connection.get_channel())
(nova.api.openstack): TRACE: File "/usr/lib/pymodules/python2.6/carrot/connection.py", line 150, in get_channel
(nova.api.openstack): TRACE: return self.connection.channel()
(nova.api.openstack): TRACE: File "/usr/lib/pymodules/python2.6/carrot/connection.py", line 120, in connection
(nova.api.openstack): TRACE: self._connection = self._establish_connection()
(nova.api.openstack): TRACE: File "/usr/lib/pymodules/python2.6/carrot/connection.py", line 133, in _establish_connection
(nova.api.openstack): TRACE: return self.create_backend().establish_connection()
(nova.api.openstack): TRACE: File "/usr/lib/pymodules/python2.6/carrot/backends/pyamqplib.py", line 195, in establish_connection
(nova.api.openstack): TRACE: connect_timeout=conninfo.connect_timeout)
(nova.api.openstack): TRACE: File "/usr/lib/pymodules/python2.6/amqplib/client_0_8/connection.py", line 125, in __init__
(nova.api.openstack): TRACE: self.transport = create_transport(host, connect_timeout, ssl)
(nova.api.openstack): TRACE: File "/usr/lib/pymodules/python2.6/amqplib/client_0_8/transport.py", line 220, in create_transport
(nova.api.openstack): TRACE: return TCPTransport(host, connect_timeout)
(nova.api.openstack): TRACE: File "/usr/lib/pymodules/python2.6/amqplib/client_0_8/transport.py", line 58, in __init__
(nova.api.openstack): TRACE: self.sock.connect((host, port))
(nova.api.openstack): TRACE: File "/usr/lib/python2.6/dist-packages/eventlet/greenio.py", line 178, in connect
(nova.api.openstack): TRACE: timeout_exc=socket.timeout("timed out"))
(nova.api.openstack): TRACE: File "/usr/lib/python2.6/dist-packages/eventlet/hubs/__init__.py", line 121, in trampoline
(nova.api.openstack): TRACE: return hub.switch()
(nova.api.openstack): TRACE: File "/usr/lib/python2.6/dist-packages/eventlet/hubs/hub.py", line 177, in switch
(nova.api.openstack): TRACE: return self.greenlet.switch()
(nova.api.openstack): TRACE: timeout: timed out
(nova.api.openstack): TRACE:

Revision history for this message
Soren Hansen (soren) wrote :

I've managed to reproduce this locally by stubbing out quotas, ratelimiting, and the virt driver.

Changed in nova:
status: New → Confirmed
Revision history for this message
Soren Hansen (soren) wrote :

My current working hypothesis is that something in Nova is blocking. This causes the connect_timeout to trigger (because once eventlet context switches back into the main greenthread (or whatever it is that it does)), the timeout has expired without being cancelled. This would explain the log message in rabbit saying that the connection was abruptly terminated (meaning the connection was actually established, in spite of Nova saying it timed out).

If that's indeed the case, yes, we shouldn't be blocking, but eventlet also should check if the connection has been established before it kills the connection attempt.

Revision history for this message
Antony Messerli (antonym) wrote :

Full logs from api and scheduler from 50 instance build:

http://paste.openstack.org/show/1271/ - nova-api
http://paste.openstack.org/show/1272/ - nova-scheduler

Thierry Carrez (ttx)
Changed in nova:
importance: Undecided → High
Revision history for this message
Soren Hansen (soren) wrote :

I'm still blaming eventlet for this. Looking at the logs, it looks like the sheer volume of requests simply makes it take more than 5 seconds before the event loop in eventlet gets back to the attempt to establish a connection. We can optimise the time-per-request, but that'll won't solve anything, it'll just hide the problem until we try 100 or 200 requests at a time. Same if we increase the timeout.

Revision history for this message
Vish Ishaya (vishvananda) wrote : Re: [Bug 771512] Re: Timeout from API with 50 Simultaneous Builds
Download full text (6.9 KiB)

Wasn't there some idea that this was because nova-api doesn't monkeypatch all of the libraries?

Vish

On May 4, 2011, at 1:58 PM, Soren Hansen wrote:

> I'm still blaming eventlet for this. Looking at the logs, it looks like
> the sheer volume of requests simply makes it take more than 5 seconds
> before the event loop in eventlet gets back to the attempt to establish
> a connection. We can optimise the time-per-request, but that'll won't
> solve anything, it'll just hide the problem until we try 100 or 200
> requests at a time. Same if we increase the timeout.
>
> --
> You received this bug notification because you are a member of Nova Bug
> Team, which is subscribed to OpenStack Compute (nova).
> https://bugs.launchpad.net/bugs/771512
>
> Title:
> Timeout from API with 50 Simultaneous Builds
>
> Status in OpenStack Compute (Nova):
> Confirmed
>
> Bug description:
> Running cactus, I attempted a build of 50 VMs at once via the API and
> ran into a timeout where 12 of the 50 VMs timed out and are stuck in
> build. The remaining went active.
>
> for i in `seq 1 50`; do nova boot test --flavor 1 --image 10 & >
> /dev/null ; echo $i ; done
>
> API Limits were raised well past this in limits.py for stress testing.
>
> 2011-04-26 17:36:51,852 ERROR nova.api.openstack [-] Caught error: timed out
> (nova.api.openstack): TRACE: Traceback (most recent call last):
> (nova.api.openstack): TRACE: File "/usr/lib/pymodules/python2.6/nova/api/openstack/__init__.py", line 59, in __call__
> (nova.api.openstack): TRACE: return req.get_response(self.application)
> (nova.api.openstack): TRACE: File "/usr/lib/pymodules/python2.6/webob/request.py", line 919, in get_response
> (nova.api.openstack): TRACE: application, catch_exc_info=False)
> (nova.api.openstack): TRACE: File "/usr/lib/pymodules/python2.6/webob/request.py", line 887, in call_application
> (nova.api.openstack): TRACE: app_iter = application(self.environ, start_response)
> (nova.api.openstack): TRACE: File "/usr/lib/pymodules/python2.6/webob/dec.py", line 159, in __call__
> (nova.api.openstack): TRACE: return resp(environ, start_response)
> (nova.api.openstack): TRACE: File "/usr/lib/pymodules/python2.6/webob/dec.py", line 159, in __call__
> (nova.api.openstack): TRACE: return resp(environ, start_response)
> (nova.api.openstack): TRACE: File "/usr/lib/pymodules/python2.6/webob/dec.py", line 159, in __call__
> (nova.api.openstack): TRACE: return resp(environ, start_response)
> (nova.api.openstack): TRACE: File "/usr/lib/pymodules/python2.6/routes/middleware.py", line 131, in __call__
> (nova.api.openstack): TRACE: response = self.app(environ, start_response)
> (nova.api.openstack): TRACE: File "/usr/lib/pymodules/python2.6/webob/dec.py", line 159, in __call__
> (nova.api.openstack): TRACE: return resp(environ, start_response)
> (nova.api.openstack): TRACE: File "/usr/lib/pymodules/python2.6/webob/dec.py", line 147, in __call__
> (nova.api.openstack): TRACE: resp = self.call_func(req, *args, **self.kwargs)
> (nova.api.openstack): TRACE: File "/usr/lib/pymodules/python2.6/webob/dec.py", line 208, in call_func
...

Read more...

Revision history for this message
Soren Hansen (soren) wrote :

2011/5/4 Vish Ishaya <email address hidden>:
> Wasn't there some idea that this was because nova-api doesn't
> monkeypatch all of the libraries?

There was. It wasn't.

Revision history for this message
Soren Hansen (soren) wrote :

Reproduced outside of Nova (in accordance with my previous analysis). Reported it upstream:

    https://bitbucket.org/which_linden/eventlet/issue/87/socket-connects-are-incorrectly-reported

I think we can say at this point that the problem is well understood and just needs someone to fix it. Marking Triaged accordingly.

Changed in nova:
status: Confirmed → Triaged
Revision history for this message
Jay Pipes (jaypipes) wrote :

In my opinion, this should be a Critical priority bug...

Revision history for this message
Sandy Walsh (sandy-walsh) wrote :

Additionally (another bug?), OS API really needs to take a num_instances parameter like EC2 so we can get the benefits of the load balancer/scheduler.

Chris Behrens (cbehrens)
Changed in nova:
assignee: nobody → Chris Behrens (cbehrens)
Revision history for this message
Chris Behrens (cbehrens) wrote :

I added a comment to the above bitbucket thread with a potential patch. I'll attach it here as well. Comments are in the patch along with more on the bitbucket thread.

Revision history for this message
Chris Behrens (cbehrens) wrote :

My above patch makes an improvement, but there's still timeouts. Unclear whether the timeouts are real or not yet. However, after talking with soren again, I've decided to try to fix eventlet more correctly by checking if a timeout exception was returned... and then re-trying a non-blocking event once more. If this 2nd non-blocking event fails, it returns original timeout exception. Attached is a start to the patch. My test case passes, but I need to fix the rest of the socket wrapper calls, yet. I've done connect(), recv(), recvfrom() so far. (See attached patch).

Revision history for this message
Chris Behrens (cbehrens) wrote :

Sorry, the connect_ex() modification was wrong in the above. Here's an updated version of patch.

Revision history for this message
Chris Behrens (cbehrens) wrote :

Maybe finished version, though it appears some tests need fixed. Either way, tests are failing in eventlet trunk without this patch.

Revision history for this message
Chris Behrens (cbehrens) wrote :

I have tested the above patch with 50 simultaneous builds and the timeouts are gone.

(Additionally, after testing this and applying the merge prop diff for: https://bugs.launchpad.net/nova/+bug/766404
and a proposed fix for: https://bugs.launchpad.net/nova/+bug/781435

we can now get 50 instances all in the ACTIVE state when building them in parallel. Woo!)

There's another issue when we go to 100 instances, now. Getting lots of 'IOError: Socket closed' when carrot tries to connect to rabbit. Ultimately, I think we need to make some changes to nova-api to make it handle this level of concurrent sessions. I have some ideas, but I want to look at the api code, which I'm not that familiar with, before I make some statements that might be idiotic. :)

Revision history for this message
Thierry Carrez (ttx) wrote :

About Importance: I think "High" is about right, the impact is very serious but does not disable completely the use of the system, corrupts or exposes your data. See bottom of http://wiki.openstack.org/BugsLifecycle for guidelines.

That said, it's very much something we need to fix ASAP, so it should be targeted to a milestone :) Importance/Impact and urgency being two different (though very inter-related) things.

Thierry Carrez (ttx)
Changed in nova:
status: Triaged → In Progress
Revision history for this message
Chris Behrens (cbehrens) wrote :

Linked lp:~cbehrens/nova/rpc-improvements here also. This implements a connection pool for connections to rabbit, as well as changes the 3 Service topic consumers to use a single connection that has a single wait() for events. That eliminates 3 threads that are reach repeatedly polling rabbit every 0.1 seconds.

All of that should:

1) Reduce bombing rabbit with new connections
2) Reduce overall number of connections to rabbit
3) Reduce a lot of context switches
4) Remove possible unnecessary 0.1 second delay in services processing requests.

etc.

Revision history for this message
Andrey Brindeyev (abrindeyev) wrote :

Thanks, Chris!

I've applied your changes and it allowed me run 400 instances on 3 nodes without errors for the first time!

(RHEL6, KVM)

Soren Hansen (soren)
Changed in nova:
status: In Progress → Invalid
Changed in python-eventlet (Ubuntu):
status: New → In Progress
Soren Hansen (soren)
Changed in eventlet:
status: New → In Progress
Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package python-eventlet - 0.9.15-0ubuntu2

---------------
python-eventlet (0.9.15-0ubuntu2) oneiric; urgency=low

  * retry-on-timeout.patch: This addresses a problem where a timeout fires
    even though the connection was actually correctly established. (LP:
    #771512)
  * reuseaddr.patch: Make sure SO_REUSEADDR is set.
  * mysql-import-traceback.patch: Add missing import in the mysql tests.
  * Add python-greenlet dependency.
 -- Soren Hansen <email address hidden> Wed, 25 May 2011 22:02:42 +0200

Changed in python-eventlet (Ubuntu):
status: In Progress → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers