Handle database disconnections gracefully

Bug #94986 reported by James Henstridge
10
Affects Status Importance Assigned to Milestone
Storm
Fix Released
High
James Henstridge

Bug Description

For any application that keeps database connections open for extended periods of time, there is a risk that the connection will be broken (e.g. database server reboots, network issues, etc). It would be nice if Storm could help applications respond appropriately to such situations.

In the case of a database connection failure, I'd expect storm to act the same as if the transaction has been doomed. That is, all requests fail until we abort() the transaction. On the start of the next transaction, there are two possibilities:

1. retry connection to the db until we can connect (possibly with exponential back off), and then proceed with the new transaction.
2. try and connect to the db again. If we fail, act as if this transaction has also been doomed.

I think there is value in providing both these modes: (1) is good for scripts where we want to get the job done eventually, while (2) is better for interactive uses like web applications where leaving the user hanging for minutes is as bad or worse than erroring out early.

Revision history for this message
Steve Alexander (stevea) wrote :

For web application requests, we already raise a Retry exception in the case of serialization or conflict errors. This causes the web application publisher to retry the entire request up to five times. On the sixth time, an error is presented to the user. We should do the same for database disconnections: doom the transaction, and make it raise a Retry to the publisher.

Revision history for this message
James Henstridge (jamesh) wrote :

The attached branch is a work in progress port of the database reconnection code we have in Launchpad. It is currently lacking test coverage (I guess I'll port over the twistd portforward based disconnection tests from Launchpad).

It does not behave quite correctly with psycopg (released versions, or subversion head) as the code to catch this error is commented out (see http://www.initd.org/tracker/psycopg/ticket/186). It shouldn't be too much trouble getting it to work for mysql if someone wants to have a go.

Changed in storm:
assignee: nobody → jamesh
importance: Undecided → High
status: New → Confirmed
Revision history for this message
James Henstridge (jamesh) wrote :

I've updated the branch to a point where it could be considered for merge.

A few notes on the implementation:

1. The branch does not include any logic to retry connections, instead opting to make it easy to detect a disconnection through the DisconnectionError exception. It is expected that an app would retry DisconnectionErrors in the same way they'd retry IntegrityErrors.

2. It only includes a disconnection handling for Postgres, but it should be trivial to support other backends. The tests seem to pass with the psycopg2 in Gutsy, but I still run foul of the bug mentioned in the previous comment when testing interactively.

3. It could probably do with some mock object tests for the db independent portion of the code -- currently it only has highlevel disconnection tests.

Changed in storm:
status: Confirmed → In Progress
Revision history for this message
Jamu Kakar (jkakar) wrote :
Download full text (12.6 KiB)

[1]

There are some trivial conflicts when merging this into trunk:

conflicts:
  Text conflict in storm/databases/postgres.py
  Text conflict in tests/databases/postgres.py

[2]

tests/databases/proxy.py is missing a copyright header.

[3]

+import threading
+
+TIMEOUT = 0.1

Please add an extra blank line before TIMEOUT to follow PEP-8
conventions.

[4]

+ SocketServer.BaseRequestHandler.__init__(
+ self, request, client_address, server)

Is there a reason not to use super(...) here?

[5]

+ self.request.shutdown(socket.SHUT_WR)
+
+class ProxyTCPServer(SocketServer.ThreadingTCPServer):

Please use two blank lines between class definitions to follow PEP-8
conventions.

[6]

Please replace single-quotes with double-quotes to follow existing
conventions in the code.

[7]

+ """Check whether the given exception value represents a
+ database disconnection.

Please condense this comments to a single line or start it on a new
line below the opening """ to match PEP-8 and existing conventions.

[8]

from storm.uri import URI

is missing in tests/database/postgres.py. This causes
PostgresDisconnectionTest.create_database_and_proxy to break.

[9]

I get the following errors when running the tests. The tests in trunk
pass with the same PostgreSQL database, so I'm assuming the problem is
related to changes in this branch. I haven't looked too closely, but
it looks like this is related to the changes you made to the way
Connection is created.

$ trial tests.databases.postgres.PostgresDisconnectionTest
/home/jkakar/src/universe/zope3/3.4.0a1/src/zope/interface/interface.py:206: RuntimeWarning: Python C API version mismatch for module _zope_interface_coptimizations: This Python has API version 1013, module _zope_interface_coptimizations has version 1012.
  import _zope_interface_coptimizations
Running 6 tests.
tests.databases.postgres
  PostgresDisconnectionTest
    test_catch_disconnect_on_commit ... [ERROR]
    test_catch_disconnect_on_execute ... ----------------------------------------
Exception happened during processing of request from ('127.0.0.1', 36799)
Traceback (most recent call last):
  File "SocketServer.py", line 464, in process_request_thread
    self.finish_request(request, client_address)
  File "SocketServer.py", line 254, in finish_request
    self.RequestHandlerClass(request, client_address, self)
  File "/home/jkakar/src/canonical/storm/review/tests/databases/proxy.py", line 17, in __init__
    self, request, client_address, server)
  File "SocketServer.py", line 522, in __init__
    self.handle()
  File "/home/jkakar/src/canonical/storm/review/tests/databases/proxy.py", line 41, in handle
    self.request.send(chunk)
error: (32, 'Broken pipe')
----------------------------------------
                               [ERROR]
    test_catch_disconnect_on_reconnect ... ----------------------------------------
Exception happened during processing of request from ('127.0.0.1', 33746)
Traceback (most recent call last):
  File "SocketServer.py", line 464, in process_request_thread
    self.finish_request(request, client_address)
  File "SocketServer....

Revision history for this message
Jamu Kakar (jkakar) wrote :

Oops. The comment I made in point #9:

I haven't looked too closely, but it looks like this is related to the
changes you made to the way Connection is created.

should have been part of point #10.

Revision history for this message
Jamu Kakar (jkakar) wrote :

Also, in case it's relevant I've tried to run these tests on an
up-to-date gutsy system.

Revision history for this message
James Henstridge (jamesh) wrote :

[1]: I've merged trunk and fixed the conflicts. All the tests run for me locally, both with stock psycopg2 from gutsy and the version patched to correctly report disconnections.

(I still haven't worked out why this one doesn't affect the tests but does affect manual testing).

[2], [3]: fixed

[4]: ProxyRequestHandler is an old style class, so super would give a "TypeError: super() argument 1 must be type, not classobj" error.

[5], [6], [7]: fixed

[8]: looks like this import was removed on trunk. I've added it back.

[9]: the reconnection tests try to connect to the database via a localhost TCP socket rather than the unix domain socket. You might need to adjust pg_hba.conf to let this through.

[10]: I haven't seen this one.

[11]: All the Postgres tests depend on the database being created before hand, and not dropping it. PostgresTest doesn't override drop_database() either.

[12], [13]: fixed.

The fixed branch is now pushed.

Revision history for this message
James Henstridge (jamesh) wrote :

Fix committed in r186.

Changed in storm:
status: In Progress → Fix Committed
Revision history for this message
Gustavo Niemeyer (niemeyer) wrote :

The attached mysql-reconnection branch adds reconnection support on MySQL.

Changed in storm:
status: Fix Committed → In Progress
Revision history for this message
Christopher Armstrong (radix) wrote :

= GENERAL =

[1] It might be nice to wrap Cursor and Connection objects in something that automatically uses _check_disconnect when being called externally. This would not only allow us to avoid thinking about calling that everywhere we use one of these, but also allow us to avoid accessing a "_" method semi-externally.

[2] It looks like some ' strings got into trunk with the first branch.

[3] The message checking in the postgres code is nasty. As we discovered, psycopg2 has some dumb bugs where it truncates initial text.

     /* try to remove the initial "ERROR: " part from the postgresql error */
     if (err && strlen(err) > 8) err2 = &(err[8]);
     else err2 = err;

This causes all kinds of problems. There's actually another case that has been spotted in the wild but of which I'm not sure of the cause.

<glyph> psycopg2.OperationalError: nection to the server

The truncation that's happening here doesn't match up with the length of "WARNING: " or "ERROR: ".

[5] Can you think of a good place to document DisconnectionError? Maybe we should add it to the docstring of the Connection class, saying that basically any method that talks to the DB might raise it.

= MYSQL =

hmm.. nothing. Looks good! +1!

Revision history for this message
Gustavo Niemeyer (niemeyer) wrote :

[1]

It'd be cool to prevent spreading the checking around indeed. I don't know how
we could do that cleanly, though. In a certain way, Connection/Result are already
a wrapper to the real Connection/Cursor objects. Introducing another wrapper
might only postpone the same kind of checks to another layer. In any case, if
you have suggestions on how to do it cleanly, please let me know.

[2]

Oops. I think I've fixed them all now.

[3]

Yeah. :-(

[5]

I've now added a "@raise DisconnectionError: ..." to all affected methods in database.py.

Does that sound ok?

Thanks for the review!

Revision history for this message
Christopher Armstrong (radix) wrote :

[1] Hm. I think I misread the code originally. For some reason I thought it was being spread between database backends. It looks fine on the second look.

Ok, great! +1!

Revision history for this message
Jamu Kakar (jkakar) wrote :

These comments are for ~niemeyer/storm/mysql-reconnection

[1]

+ def is_disconnection_error(self, exc):
+ # http://dev.mysql.com/doc/refman/5.0/en/gone-away.html
+ return (isinstance(exc, OperationalError) and
+ exc.args[0] in (2006, 2013)) # (SERVER_GONE_ERROR, SERVER_LOST)

These error codes, 2006 and 2013, are defined in MySQLdb.constants.CR
with the SERVER_GONE_ERROR and SERVER_LOST constants. They should be
used instead of hard-coded integers.

[2]

+ msg = str(exc)
         return (msg.startswith('server closed the connection unexpectedly') or
                 msg.startswith('could not connect to server') or
                 msg.startswith('no connection to the server') or

This wasn't added in this branch, but the single-quotes here should be
double-quotes.

+1 considering #1. This is a very nice and straight-forward branch.

Revision history for this message
Jamu Kakar (jkakar) wrote :

I just realized that the reason for hard-coded integers (see my #1
comment above) is probably to avoid needing to import from MySQLdb,
which may or may not be present. If that's the case it's probably
easier/cleaner to keep the code the way it is rather than jump through
conditional import hoops. Those values are unlikely to ever change
anyway.

Revision history for this message
James Henstridge (jamesh) wrote :

For reference, new versions of psycopg2 won't truncate the exception messages when they don't have a severity at the beginning.

Revision history for this message
James Henstridge (jamesh) wrote :

Marking as fix released. The infrastructure is in place, and support code for postgresql and mysql is merged.

Changed in storm:
status: In Progress → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Duplicates of this bug

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.