'Socket closed' during API stress test

Bug #797770 reported by Brian Lamar
This bug affects 6 people
Affects: OpenStack Compute (nova)
Status: Fix Released
Importance: High
Assigned to: Chris Behrens

Bug Description

It looks like, when making a large number of simultaneous requests, the API eventually fails with the following trace:

http://paste.openstack.org/show/1644/

It looks like there are some performance/timing issues with the Nova RPC ConnectionPool when more than FLAGS.rpc_conn_pool_size connections are being attempted. We need to find a way to block rather than error. This is a blind guess as to the real problem.
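
For context, here is a minimal sketch (not nova's actual ConnectionPool) of the "block instead of error" behaviour being asked for, using eventlet.pools.Pool, whose get() waits for a free slot once max_size connections are checked out. create_rabbit_connection() is a hypothetical stand-in for the real connection setup.

import eventlet.pools


def create_rabbit_connection():
    # Hypothetical stand-in for the real AMQP connection setup.
    return object()


class ConnectionPool(eventlet.pools.Pool):
    def create(self):
        # Only called while fewer than max_size connections exist;
        # beyond that, get() blocks until a connection is put() back.
        return create_rabbit_connection()


pool = ConnectionPool(max_size=30)  # analogous to FLAGS.rpc_conn_pool_size


def do_rpc_call():
    conn = pool.get()   # yields to other greenthreads while waiting
    try:
        pass            # ... use the connection ...
    finally:
        pool.put(conn)  # free the slot so blocked callers can proceed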

Alex Meade (alex-meade)
Changed in nova:
assignee: nobody → Titan (rackspace-titan)
status: New → In Progress
Thierry Carrez (ttx)
Changed in nova:
importance: Undecided → High
Revision history for this message
Thierry Carrez (ttx) wrote :

@titan team: are you actually working on this? If not, maybe unassign yourselves so that someone else can have a shot at it?

Gabe Westmaas (westmaas)
Changed in nova:
assignee: Titan (rackspace-titan) → nobody
Thierry Carrez (ttx)
Changed in nova:
status: In Progress → Confirmed
Revision history for this message
Chris Behrens (cbehrens) wrote :

I've done a lot of thinking about this one. The only cause can be that the greenthread making the socket calls to rabbit is not being scheduled in time to complete the AMQP protocol negotiation. The only things that could cause *that* are a bug in eventlet or something blocking, and it really must be the latter. The only things that could cause blocking are too much CPU spinning in Python or some other blocking call. We're seeing this in stress testing and there appears to be plenty of CPU available, so something is probably blocking.

Looking at what might be blocking, and with help from a discussion with eday, we've determined that sqlalchemy is using a MySQL engine that makes its socket calls through a C library. eventlet cannot wrap (monkey-patch) those calls, which means every MySQL query blocks the process until it finishes. If queries take some time for any reason, this delays greenthread scheduling.
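
To illustrate the mechanism with a standalone sketch (not nova code): any call that eventlet cannot green stalls the hub, so no other greenthread runs until it returns. An unpatched time.sleep() stands in here for the C MySQL driver's socket I/O.

import time

import eventlet


def heartbeat():
    # Stands in for the greenthread doing AMQP protocol negotiation.
    for _ in range(5):
        print("heartbeat at %.2f" % time.time())
        eventlet.sleep(0.1)


def blocking_db_call():
    # Stands in for a query made through the C MySQL library: eventlet
    # cannot intercept it, so the whole process stalls for 2 seconds
    # and the heartbeat above stops printing.
    time.sleep(2)


hb = eventlet.spawn(heartbeat)
eventlet.sleep(0)     # let the heartbeat greenthread start
blocking_db_call()    # heartbeat output pauses for the full 2 seconds
hb.wait()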

Potential solutions are:

1) Find and use a MySQL engine for sqlalchemy that uses Python sockets. (Does one exist?)
2) Re-architect the DB layer to use a pool of native threads for DB calls, assuming the C code will release the GIL during socket operations (see the sketch below).
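
For reference, a minimal sketch of what option 2 could look like using eventlet.tpool, which hands a blocking call to a native worker thread so the eventlet hub keeps scheduling other greenthreads; it only helps if the C driver releases the GIL while it waits on the socket. The session and query arguments are hypothetical stand-ins for the real sqlalchemy objects.

from eventlet import tpool


def _blocking_query(session, query):
    # Hypothetical stand-in for a sqlalchemy call that goes through the
    # C MySQL driver and blocks the calling OS thread until it returns.
    return session.execute(query)


def db_query(session, query):
    # tpool.execute() runs the blocking call in a native thread pool and
    # suspends only the calling greenthread, so the rest of the service
    # keeps running while the query is in flight.
    return tpool.execute(_blocking_query, session, query)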

Chris Behrens (cbehrens)
Changed in nova:
assignee: nobody → Chris Behrens (cbehrens)
Revision history for this message
Soren Hansen (soren) wrote :

Is this still a problem?

Revision history for this message
Yuriy Taraday (yorik-sar) wrote :

I've managed to reproduce it on current trunk. Affects only impl_carrot.

The easiest setup I found to reproduce it:
1. Set --volume_driver=nova.volume.driver.FakeISCSIDriver and --rpc_backend=nova.rpc.impl_carrot
2. Create project "proj1" with admin "admin"
3. Run nova-api, nova-scheduler and nova-volume
4. Burn it with:
echo -n '{"volume": {"snapshot_id": null, "display_name": null, "display_description": null, "size": 1}}' > create-vol.json
ab -H "x-auth-project-id: proj1" -H "x-auth-token: admin:proj1" -p create-vol.json -T 'application/json' -c 10 -n 1000 -v 4 http://127.0.0.1:8774/v1.1/proj1/os-volumes

nova-scheduler fails with exactly the same error in no time. Maybe we should remove carrot support once and for all?

Revision history for this message
Andy Chong (andycjw) wrote :

@Yuriy Taraday: I can't reproduce it on the latest trunk. Does this mean it's already fixed?

Revision history for this message
David Kranz (david-kranz) wrote :

I saw this on Friday running the Tempest/stress test create_kill.py. This was using the Ubuntu packages from last week which use kombu, not carrot. Here is the stacktrace: http://paste.openstack.org/show/12146/

Revision history for this message
David Kranz (david-kranz) wrote :

It is very important to fix this bug. I am sure there are many more bugs to be found through stress testing nova, but this issue really makes it difficult to do that properly. It forces stress tests to throttle themselves, which makes it harder for the stress to reach all the parts of the system it needs to in order to find race conditions.

Revision history for this message
David Kranz (david-kranz) wrote :

I also saw that, since this bug was reported, a hack was put into this code to handle some similar issues. Perhaps there could be a similar hack to deal with this issue. Obviously neither of the ideas in comment #2 would be acceptable for Essex.

Changed in nova:
milestone: none → essex-rc2
status: Confirmed → Triaged
Revision history for this message
Chris Behrens (cbehrens) wrote :

It's unfortunate that 'IOError' is raised instead of a socket.* exception. I guess what I would try first is to add 'IOError' to the other exceptions we catch for retries.

I'll get a review up in a few that we can try.

Revision history for this message
Chris Behrens (cbehrens) wrote :

Also: I marked carrot as deprecated recently so that we can remove it post-essex. That said, I'd be very happy to remove it today if I could. :)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/5967

Changed in nova:
status: Triaged → In Progress
Revision history for this message
Chris Behrens (cbehrens) wrote :

David: If it's easy for you to reproduce, are you able to try the patch in the above review to see what kind of effect it has?

Revision history for this message
David Kranz (david-kranz) wrote :

I will do it now.

Revision history for this message
David Kranz (david-kranz) wrote :

Unfortunately this did not seem to fix the problem. I did a clean bare-metal install without this fix and saw the bug. I then did it again, this time applying the patch to impl_kombu.py, deleting the .pyc file, and restarting nova-network and nova-compute. I still saw the bug.

Revision history for this message
Chris Behrens (cbehrens) wrote :

Hmmm. Same exact traceback?

Revision history for this message
Chris Behrens (cbehrens) wrote :

David: From what I can tell, the code I changed should definitely fix the issue in your paste above. I added catching of IOError in two places, one of them around the call to self._connect(), which is what ends up getting the IOError exception in your paste here:

http://paste.openstack.org/show/12146/

If you're still seeing something similar, I'd love to see a new trace! thnx!

Revision history for this message
Thierry Carrez (ttx) wrote :

This bug has been around forever, and the week before release is clearly not the best moment to fix it. I think it should be documented as a known (Carrot-specific?) issue.

Revision history for this message
David Kranz (david-kranz) wrote :

Ugh. Chris, I can't disagree with you, but I tried it again and got the same backtrace. I don't see how this could be happening with your fix. The only explanations I can think of are:

1. I failed to apply the patch and run the new code (twice)
2. Something else strange is going on

Unfortunately the patch was after the throw point, so the line numbers would not have changed. I am going to try this one more time, adding more DEBUG output to find out for sure.

Revision history for this message
David Kranz (david-kranz) wrote :

Well, I think it is (2). I added a few extra lines before the changed code and put a 'caught' message in the except clause that is now supposed to be catching IOError. And it is catching it. But the same traceback still shows up (except for slightly different line numbers from this file). Beats me what is going on.

32923eefca3c425f32] Caught Socket closed from (pid=2700) reconnect /usr/lib/python2.7/dist-packages/nova/rpc/impl_kombu.py:452
2012-03-30 15:32:46 ERROR nova.rpc.common [req-69a7c6c3-8d7e-4077-8ab5-fa3d8fbdd538 123634b20bc14ddf99e98bec730065e7 d1e6bada36404f\
32923eefca3c425f32] AMQP server on 172.18.0.146:5672 is unreachable: Socket closed. Trying again in 1 seconds.
(nova.rpc.common): TRACE: Traceback (most recent call last):
(nova.rpc.common): TRACE: File "/usr/lib/python2.7/dist-packages/nova/rpc/impl_kombu.py", line 449, in reconnect
(nova.rpc.common): TRACE: self._connect()
(nova.rpc.common): TRACE: File "/usr/lib/python2.7/dist-packages/nova/rpc/impl_kombu.py", line 423, in _connect
(nova.rpc.common): TRACE: self.connection.connect()
(nova.rpc.common): TRACE: File "/usr/lib/python2.7/dist-packages/kombu/connection.py", line 154, in connect
(nova.rpc.common): TRACE: return self.connection
(nova.rpc.common): TRACE: File "/usr/lib/python2.7/dist-packages/kombu/connection.py", line 560, in connection
(nova.rpc.common): TRACE: self._connection = self._establish_connection()
(nova.rpc.common): TRACE: File "/usr/lib/python2.7/dist-packages/kombu/connection.py", line 521, in _establish_connection
(nova.rpc.common): TRACE: conn = self.transport.establish_connection()
(nova.rpc.common): TRACE: File "/usr/lib/python2.7/dist-packages/kombu/transport/pyamqplib.py", line 255, in establish_connection
(nova.rpc.common): TRACE: connect_timeout=conninfo.connect_timeout)
(nova.rpc.common): TRACE: File "/usr/lib/python2.7/dist-packages/kombu/transport/pyamqplib.py", line 52, in __init__
(nova.rpc.common): TRACE: super(Connection, self).__init__(*args, **kwargs)
(nova.rpc.common): TRACE: File "/usr/lib/python2.7/dist-packages/amqplib/client_0_8/connection.py", line 135, in __init__
(nova.rpc.common): TRACE: (10, 10), # start
(nova.rpc.common): TRACE: File "/usr/lib/python2.7/dist-packages/amqplib/client_0_8/abstract_channel.py", line 95, in wait
(nova.rpc.common): TRACE: self.channel_id, allowed_methods)
(nova.rpc.common): TRACE: File "/usr/lib/python2.7/dist-packages/amqplib/client_0_8/connection.py", line 202, in _wait_method
(nova.rpc.common): TRACE: self.method_reader.read_method()
(nova.rpc.common): TRACE: File "/usr/lib/python2.7/dist-packages/amqplib/client_0_8/method_framing.py", line 221, in read_method
(nova.rpc.common): TRACE: raise m
(nova.rpc.common): TRACE: IOError: Socket closed
(nova.rpc.common): TRACE:
2012-03-30 15:32:46 DEBUG nova.rpc.common [req-842419d5-6fe6-4e99-b830-ec6112265362 123634b20bc14ddf99e98bec730065e7 d1e6bada36404f\
32923eefca3c425f32] Caught Socket closed from (pid=2700) reconnect /usr/lib/python2.7/dist-packages/nova/rpc/impl_kombu.py:452
2012-03-30 15:32:46 ERROR nova.rpc.common [req-842419d5-6fe6-4e99-b830-ec6112265362 123634b20bc14ddf99e98bec730065...


Revision history for this message
David Kranz (david-kranz) wrote :

Attaching the whole log in case anyone cares.

Revision history for this message
David Kranz (david-kranz) wrote :

Here is the actual diff:

--- /usr/share/pyshared/nova/rpc/impl_kombu.py 2012-03-20 07:05:43.000000000 -0400
+++ impl_kombu.py 2012-03-30 15:43:04.821590906 -0400
@@ -440,12 +440,16 @@
         """

         attempt = 0
+
+
+
         while True:
             attempt += 1
             try:
                 self._connect()
                 return
-            except self.connection_errors, e:
+            except (self.connection_errors, IOError), e:
+                LOG.debug("Caught %s" % str(e))
                 pass
             except Exception, e:
                 # NOTE(comstud): Unfortunately it's possible for amqplib
@@ -488,7 +492,7 @@
         while True:
             try:
                 return method(*args, **kwargs)
-            except (self.connection_errors, socket.timeout), e:
+            except (self.connection_errors, socket.timeout, IOError), e:
                 pass
             except Exception, e:
                 # NOTE(comstud): Unfortunately it's possible for amqplib

Revision history for this message
Chris Behrens (cbehrens) wrote :

Wow, that's weird. OK, I'll study this further.

Revision history for this message
Chris Behrens (cbehrens) wrote :

Oh. These exceptions will still get logged. IOError is definitely being caught. And it does look like it's attempting to reconnect due to this having been logged:

"32923eefca3c425f32] AMQP server on 172.18.0.146:5672 is unreachable: Socket closed. Trying again in 1 seconds."

I'd like to see the very end of the log... if there's anything more. Is it possible it finally retried and succeeded despite the logging of exceptions?

Revision history for this message
David Kranz (david-kranz) wrote :

I attached the whole log. It is of course possible that it eventually succeeded. I won't be able to check until Monday.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (milestone-proposed)

Fix proposed to branch: milestone-proposed
Review: https://review.openstack.org/6020

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/5967
Committed: http://github.com/openstack/nova/commit/b893608b77dac11fceb0b0392f8f908de586970b
Submitter: Jenkins
Branch: master

commit b893608b77dac11fceb0b0392f8f908de586970b
Author: Chris Behrens <email address hidden>
Date: Thu Mar 29 18:44:16 2012 +0000

    Make kombu failures retry on IOError

    Fixes bug 797770

    Unfortunately if rabbit decides protocol negotiation is taking too long,
    it will close the socket on us. This ends up raising IOError with a
    'socket closed' message. This patch will catch IOError and re-try.

    Change-Id: I9110c845b71118c0fad760d90e91c585e6db46ed

Changed in nova:
status: In Progress → Fix Committed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (milestone-proposed)

Reviewed: https://review.openstack.org/6020
Committed: http://github.com/openstack/nova/commit/0d669f107f827e6a227a3ed5d4f3ba338a661a45
Submitter: Jenkins
Branch: milestone-proposed

commit 0d669f107f827e6a227a3ed5d4f3ba338a661a45
Author: Chris Behrens <email address hidden>
Date: Thu Mar 29 18:44:16 2012 +0000

    Make kombu failures retry on IOError

    Fixes bug 797770

    Unfortunately if rabbit decides protocol negotiation is taking too long,
    it will close the socket on us. This ends up raising IOError with a
    'socket closed' message. This patch will catch IOError and re-try.

    Change-Id: I9110c845b71118c0fad760d90e91c585e6db46ed

Changed in nova:
status: Fix Committed → Fix Released
Revision history for this message
David Kranz (david-kranz) wrote :

I am a little confused here. I think the run with this fix did eventually succeed in spite of having logged this error. But I would expect it to be considered a bug for an ERROR and traceback to show up in a log file when the behavior is expected. Do the nova "powers that be" disagree with that?

Revision history for this message
Vish Ishaya (vishvananda) wrote : Re: [Bug 797770] Re: 'Socket closed' during API stress test

David: I agree. I wanted to get the retry merged so it could make it into essex, but if you can verify that it does indeed solve the issue, then we should remove the exception traceback and just log a debug or an info saying that rabbit dropped the connection and we are retrying.

Revision history for this message
David Kranz (david-kranz) wrote :

OK. To really test this I need to remove the error and traceback. How do I do that? I was actually surprised this still showed up even when the exception was being caught.

Revision history for this message
David Kranz (david-kranz) wrote :

I have done a bunch more testing and this issue seems to be resolved. It would be helpful to change the exception to a debug log so people don't report this as a bug in the future. I actually tried to change LOG.exception to LOG.debug and it didn't change anything, which confirms that I don't understand where the exception message is coming from. I was also confused because the documentation for logging.exception says it should only be called from an exception handler, whereas the call in this code is not.

Revision history for this message
Chris Behrens (cbehrens) wrote :

The calls are after an 'except:', but they happen to be outside of the except: block, which I suppose can be confusing. Note that in the case where an exception is not raised, a 'return' is done (line 447) before the logging points. So, you can only possibly reach the block after the try:/except: if an exception was raised.
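
For readers following along, here is a condensed standalone sketch of that control flow (simplified from the shape of reconnect(); the function signature and names are hypothetical, not the verbatim nova code):

import logging
import time

LOG = logging.getLogger(__name__)


def reconnect(connect, connection_errors, interval=1):
    # 'connect' stands in for self._connect() and 'connection_errors' for
    # self.connection_errors in the real method.
    caught = tuple(connection_errors) + (IOError,)
    while True:
        try:
            connect()
            return          # success exits here, so the code below never runs
        except caught:
            pass            # fall through to the logging below
        # Reached only when connect() raised.  Under Python 2, sys.exc_info()
        # persists after the except block ends, which is why LOG.exception()
        # still emits the caught traceback even though it sits outside the
        # handler; switching it to LOG.debug()/LOG.info() quiets the log.
        LOG.exception('AMQP server is unreachable, retrying in %s seconds',
                      interval)
        time.sleep(interval)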

You can change all occurrences of LOG.exception() to LOG.debug() or LOG.info() in reconnect() (there are two: one around line 466, the other around line 482).

Revision history for this message
Chris Behrens (cbehrens) wrote :

Just a side note: I think I left the exceptions logged so we could see exactly what conditions would cause a reconnect, but it is definitely more appropriate to log them with DEBUG if we want to see them. I can work on a patch for that soon.

Thierry Carrez (ttx)
Changed in nova:
milestone: essex-rc2 → 2012.1