DisconnectionErrors (already disconnected) happening again

Bug #504291 reported by Ursula Junque
This bug affects 1 person
Affects            Status         Importance   Assigned to      Milestone
Launchpad itself   Fix Released   High         William Grant
Storm              Invalid        Undecided    Unassigned

Bug Description

We had some on the 3rd, then a bunch of DisconnectionErrors on the 4th, only on staging, as we can see in the summaries: https://devpad.canonical.com/~lpqateam/oops-summaries/staging-2010-01-04.html.

elmo reported having upgraded postgresql on sourcherry (the staging DB server) at 2010-01-03 23:35, but we don't know if that's the cause of all of them.

Now we can see them in the lpnet and edge summaries, but not staging:
https://devpad.canonical.com/~lpqateam/oops-summaries/lpnet-2010-01-06.html#exceptions
https://devpad.canonical.com/~lpqateam/oops-summaries/edge-2010-01-06.html#exceptions

Ursula Junque (ursinha)
affects: blueprint → launchpad-foundations
Revision history for this message
Jeroen T. Vermeulen (jtv) wrote :

Looks as if a storm.Connection gets stuck in STATE_DISCONNECTED ("I'm broken and will stay that way until you roll me back") instead of STATE_RECONNECT ("I'll re-open when you need me again"). We need the latter at the end of a request so that the next request can re-use the connection.

The cause might be this:

canonical.launchpad.webapp.publication.LaunchpadBrowserPublication.afterCall does...

        if request.method in ['GET', 'HEAD']:
            self.finishReadOnlyRequest(txn)
        elif txn.isDoomed():
            # Sends an abort to the database, even though the
            # transaction is still doomed.
            txn.abort()
        else:
            txn.commit()

This amounts to:

    if request.method == "POST" and not txn.isDoomed():
        txn.commit()
    else:
        txn.rollback()

Makes sense to me, but unfortunately storm.database.Connection.commit goes into STATE_DISCONNECTED on failure. Only a rollback can get it back to STATE_RECONNECT. So we'd need to catch disconnection errors coming out of txn.commit() and handle them by calling txn.abort() after all.
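
A minimal sketch of that suggestion (assuming Storm's DisconnectionError and the Zope transaction manager; commit_or_abort is a hypothetical helper, not code from the tree):

    import transaction
    from storm.exceptions import DisconnectionError

    def commit_or_abort(txn=transaction):
        # Hypothetical helper: if the connection dies during the commit
        # itself, abort so the connection ends up in STATE_RECONNECT
        # rather than being left stuck in STATE_DISCONNECTED.
        try:
            txn.commit()
        except DisconnectionError:
            txn.abort()
            raise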

Revision history for this message
Jeroen T. Vermeulen (jtv) wrote :

I missed a vital point in my explanation above: what this all boils down to is that AFAICS a connection gets stuck in this bad state if it's broken (e.g. by a database restart) just when it's in the middle of a commit.

The bad connection will stay in the store and get reused by requests, all of which fail with this error. And because they fail quickly, the bad connection becomes available again quickly to carry another request to its doom.

Revision history for this message
Guilherme Salgado (salgado) wrote :

Looks just like bug 427990.

Revision history for this message
Guilherme Salgado (salgado) wrote :

These DisconnectionErrors cause the transaction to be retried, but as Stuart explained in <https://bugs.edge.launchpad.net/launchpad-foundations/+bug/427990/comments/4>, we now log a soft (Informational) OOPS (meaning we don't fail the request) for DisconnectionErrors, so it's kind of expected to see them if the database was in recovery mode. The only problem is that none of the OOPSes there seem to be soft.

Revision history for this message
Jeroen T. Vermeulen (jtv) wrote :

As for the "broken connection state," recommended practice from the Storm team seems to be to do an abort() at the very beginning of a request, in order to reset the connection.

Revision history for this message
Gary Poster (gary) wrote :

lib/canonical/launchpad/webapp/publication.py, line 167, in beforeTraversal, does a transaction.begin(), as per standard Zope practice. This is within the following traceback:

  File "/home/gary/launchpad/lp-sourcedeps/eggs/zope.server-3.6.1-py2.5.egg/zope/server/taskthreads.py", line 48, in handlerThread
    task.service()
  File "/home/gary/launchpad/lp-sourcedeps/eggs/zope.server-3.6.1-py2.5.egg/zope/server/serverchannelbase.py", line 207, in service
    task.service()
  File "/home/gary/launchpad/lp-sourcedeps/eggs/zope.server-3.6.1-py2.5.egg/zope/server/http/httptask.py", line 71, in service
    self.channel.server.executeRequest(self)
  File "/home/gary/launchpad/lp-sourcedeps/eggs/zc.zservertracelog-1.1.5-py2.5.egg/zc/zservertracelog/tracelog.py", line 114, in executeRequest
    response = self.application(env, start_response)
  File "/home/gary/launchpad/lp-sourcedeps/eggs/zope.app.wsgi-3.6.0-py2.5.egg/zope/app/wsgi/__init__.py", line 58, in __call__
    request = publish(request, handle_errors=handle_errors)
  File "/home/gary/launchpad/lp-sourcedeps/eggs/zope.publisher-3.10.0-py2.5.egg/zope/publisher/publish.py", line 142, in publish
    publication.beforeTraversal(request)
  File "/home/gary/launchpad/lp-branches/devel/lib/canonical/launchpad/webapp/publication.py", line 167, in beforeTraversal
    transaction.begin()

The transaction.begin method performs an abort if a previous transaction exists, as you can see in transaction/_manager.py.
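
For reference, the begin logic in transaction/_manager.py looks roughly like this (a paraphrase, not a verbatim copy; details vary between versions of the transaction package):

    # Paraphrased from transaction/_manager.py (version-dependent).
    def begin(self):
        if self._txn is not None:
            # Abort any transaction left over from a previous request
            # before starting a fresh one.
            self._txn.abort()
        txn = self._txn = Transaction(self._synchs, self)
        return txn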

If I understand correctly, then, we are already doing what you are suggesting, Jeroen. Could you confirm or deny?

On the bright side, if someone can identify what *does* need to happen, this point in beforeTraversal would probably be a reasonable place to do it.

Revision history for this message
Gary Poster (gary) wrote :

Next steps:

- Stuart Bishop will be trying to come up with a test to reproduce the error condition Jeroen described in his first comment.
- I will verify with James Henstridge and/or Gustavo Niemeyer that transaction.begin's behavior should be sufficient from Storm/ZStorm's perspective.

Changed in launchpad-foundations:
status: New → Triaged
importance: Undecided → High
Stuart Bishop (stub)
Changed in storm:
status: New → Invalid
Revision history for this message
Stuart Bishop (stub) wrote :

I haven't produced a test case, but I have traced through the logic in ZStorm and the transaction module. If an exception is raised by zstorm.StoreDataManager.commit() during txn.commit(), the transaction machinery will invoke zstorm.StoreDataManager.abort(), which calls rollback() on the store, setting the state to RECONNECT if the store is not connected.
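
Paraphrasing the ZStorm hooks involved (a simplified sketch following the description above; details vary by Storm version):

    # Simplified paraphrase of storm.zope.zstorm.StoreDataManager;
    # not a verbatim copy.
    class StoreDataManager(object):

        def commit(self, txn):
            # May raise DisconnectionError if the connection dies
            # mid-commit, leaving it in STATE_DISCONNECTED.
            self._store.commit()

        def abort(self, txn):
            # Invoked by the transaction machinery after a failure;
            # rollback() resets a dead connection to STATE_RECONNECT.
            self._store.rollback()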

I cannot locate a code path where afterCall() will leave the stores in a disconnected state.

Revision history for this message
Stuart Bishop (stub) wrote :

afterCall() only gets called when things are succeeding. If an exception is raised, we go to handleException which is much more complex and relies on a fair bit of code not raising exceptions before the transaction.abort() is finally called.

I think the best way to proceed is to add code *at the start* of endRequest in canonical.launchpad.webapp.publication to detect stores in a disconnected state and call rollback() on them, raising an exception or logging a soft OOPS. This should let us track down the code path that leaves our stores in a disconnected state.
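
A sketch of that check (assuming ZStorm's iterstores() and Storm's connection-state constants; the code that actually landed is quoted in a later comment):

    from storm.database import STATE_DISCONNECTED
    from storm.zope.interfaces import IZStorm
    from zope.component import getUtility

    def check_stores():
        # Sketch: at the start of endRequest, walk every active store
        # and repair any that were left disconnected.
        for name, store in getUtility(IZStorm).iterstores():
            if store._connection._state == STATE_DISCONNECTED:
                # Log a soft OOPS here to track down the guilty code
                # path, then roll back so the store is usable again.
                store.rollback()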

Stuart Bishop (stub)
Changed in launchpad-foundations:
status: Triaged → In Progress
Changed in launchpad-foundations:
assignee: nobody → Stuart Bishop (stub)
Revision history for this message
Diogo Matsubara (matsubara) wrote : Bug fixed by a commit
Changed in launchpad-foundations:
status: In Progress → Fix Committed
Ursula Junque (ursinha)
Changed in launchpad-foundations:
milestone: none → 10.01
Curtis Hovey (sinzui)
Changed in launchpad-foundations:
status: Fix Committed → Fix Released
Revision history for this message
Robert Collins (lifeless) wrote :

We currently have this in the code base:
            try:
                assert store._connection._state != STATE_DISCONNECTED, (
                    "Bug #504291: Store left in a disconnected state.")
            except AssertionError:
                # The Store is in a disconnected state. This should
                # not happen, as store.rollback() should have been called
                # by now. Log an OOPS so we know about this. This
                # is Bug #504291 happening.
                getUtility(IErrorReportingUtility).raising(
                    sys.exc_info(), request)
                # Repair things so the server can remain operational.
                store.rollback()

in endRequest: the code path that leads to disconnected stores is still live. This bug is merely worked around. I'm going to reopen it, as we're trying to get a real fix in place as part of the fastdowntime work.

Changed in launchpad:
status: Fix Released → Triaged
tags: added: fastdowntime
Revision history for this message
Robert Collins (lifeless) wrote :

One easy fix would be to change the code to this:
            if store._connection._state == STATE_DISCONNECTED:
                # Repair things so the server can remain operational.
                store.rollback()

which would be a statement that we expect this to happen from time to time, and that when it happens we can deal with it.

Revision history for this message
Robert Collins (lifeless) wrote :

Recent occurrence:
OOPS-2077A100

Revision history for this message
Stuart Bishop (stub) wrote :

The code is already there to reset the store to a working state, done straight after logging the OOPS:

                # The Store is in a disconnected state. This should
                # not happen, as store.rollback() should have been called
                # by now. Log an OOPS so we know about this. This
                # is Bug #504291 happening.
                getUtility(IErrorReportingUtility).raising(
                    sys.exc_info(), request)
                # Repair things so the server can remain operational.
                store.rollback()

I'm not sure how to proceed on this. At the end of the request, we are detecting the invalid state and logging an OOPS. This OOPS (such as OOPS-2077A100) is empty, containing no call traces or information on how this request managed to screw up the database connection.

Revision history for this message
Gavin Panella (allenap) wrote :

I landed some disconnection related changes to Storm recently, which
are now also running in Launchpad. I can't find any recent occurrences
of this bug. Perhaps it's gone away?

Revision history for this message
Robert Collins (lifeless) wrote : Re: [Bug 504291] Re: DisconnectionErrors (already disconnected) happening again

On Wed, Nov 9, 2011 at 10:57 PM, Stuart Bishop
<email address hidden> wrote:
> I'm not sure how to proceed on this. At the end of the request, we are
> detecting the invalid state and logging an OOPS. This OOPS (such as
> OOPS-2077A100) is empty, containing no call traces or information on how
> this request managed to screw up the database connection.

So, I guess we need to fix the oops. Is this perhaps logged after the
request is 'finished' (which wipes the OOPS metadata)?

Revision history for this message
Stuart Bishop (stub) wrote :

The OOPS is certainly logged after the request is finished. This is in endRequest(), after the request and any error handling have been completed.

This is all tied up with error handling I think.

                    try:
                        try:
                            request.processInputs()
                            publication.beforeTraversal(request)
                            [...]
                            publication.afterCall(request, obj)
                        except:
                            exc_info = sys.exc_info()
                            publication.handleException(
                                obj, request, exc_info, True)
                    finally:
                        publication.endRequest(request, obj)

transaction.begin() is called in beforeTraversal(). transaction.commit() or transaction.abort() in afterCall(). transaction.abort() is called in handleException().

But we find in endRequest that a Store is in an invalid state, which indicates that neither transaction.commit() nor transaction.abort() was called.

Revision history for this message
Robert Collins (lifeless) wrote :

Ok, so we won't see the main request - however we can add as much data
to the oops as we want, as long as it's available *somewhere*: we may
want to drop down to the lower-level API for that, or alternatively
stash the previously collected metadata somewhere. I don't have an
opinion about what's better [yet].

Revision history for this message
Stuart Bishop (stub) wrote :

In theory, if something previously screwed up in the request this badly, there should already be an OOPS, and this will be the second (or third) OOPS generated for that request. An OOPS could have a pointer to the previous OOPS generated by that thread. We might not even want to reset this between requests, for cases like this one where the bad state might have been caused by a previous request.

Revision history for this message
Robert Collins (lifeless) wrote :

I like the idea of a 'last-oops-in-thread', reset when a request
succeeds - it's nice and isolated and won't keep large object
references around or anything.
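
A minimal sketch of that bookkeeping (all names hypothetical; only the OOPS id is stored, so no large object references are kept alive):

    import threading

    _thread_state = threading.local()

    def note_oops(oops_id):
        # Remember only the id of the last OOPS this thread reported.
        _thread_state.last_oops_id = oops_id

    def clear_last_oops():
        # Called when a request completes successfully.
        _thread_state.last_oops_id = None

    def last_oops_id():
        # A later OOPS can include this as a pointer to its predecessor.
        return getattr(_thread_state, 'last_oops_id', None)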

Revision history for this message
Stuart Bishop (stub) wrote :

We need a fresh OOPS after r14299 has been rolled out. This gives us the 'previous oops in request' link that I think we need to track this down further.

Changed in launchpad:
status: Triaged → Incomplete
William Grant (wgrant)
Changed in launchpad:
status: Incomplete → Invalid
Revision history for this message
William Grant (wgrant) wrote :

This often happens during DB outages, and indeed is produced by test_disconnectionerror_view_integration.

Changed in launchpad:
status: Invalid → Triaged
milestone: 10.01 → none
assignee: Stuart Bishop (stub) → nobody
Revision history for this message
William Grant (wgrant) wrote :

The bug is in LaunchpadBrowserPublication.handleException: we implement a special Retry handler that bypasses the superclass' handleException, so the transaction is never aborted before a retry is executed. Manually aborting before raising the retry fixes it.
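
A sketch of that fix (simplified, not the exact Launchpad diff), assuming zope.publisher's Retry exception:

    import transaction
    from zope.publisher.interfaces import Retry

    def retry_request(exc_info):
        # Abort first so the Store is rolled back to STATE_RECONNECT;
        # raising Retry bypasses the superclass' handleException, which
        # would otherwise have been where the abort happened.
        transaction.abort()
        raise Retry(exc_info)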

William Grant (wgrant)
Changed in launchpad:
assignee: nobody → William Grant (wgrant)
status: Triaged → In Progress
Revision history for this message
Launchpad QA Bot (lpqabot) wrote :
tags: added: qa-needstesting
Changed in launchpad:
status: In Progress → Fix Committed
William Grant (wgrant)
tags: added: qa-ok
removed: qa-needstesting
William Grant (wgrant)
Changed in launchpad:
status: Fix Committed → Fix Released