DisconnectionErrors (already disconnected) happening again

Bug #504291 reported by Ursula Junque
This bug affects 1 person
Affects            Status         Importance   Assigned to      Milestone
Launchpad itself   Fix Released   High         William Grant
Storm              Invalid        Undecided    Unassigned

Bug Description

We had some on the 3rd, then a bunch of DisconnectionErrors on the 4th, only on staging, as we can see in the summaries: https://devpad.canonical.com/~lpqateam/oops-summaries/staging-2010-01-04.html.

elmo reported having upgraded postgresql on sourcherry (the staging DB server) at 2010-01-03 23:35, but we don't know if that's the cause of all of them.

Now we can see them in the lpnet and edge summaries, but not staging:
https://devpad.canonical.com/~lpqateam/oops-summaries/lpnet-2010-01-06.html#exceptions
https://devpad.canonical.com/~lpqateam/oops-summaries/edge-2010-01-06.html#exceptions

Ursula Junque (ursinha)
affects: blueprint → launchpad-foundations
Revision history for this message
Jeroen T. Vermeulen (jtv) wrote :

Looks as if a storm.Connection gets stuck in STATE_DISCONNECTED ("I'm broken and will stay that way until you roll me back") instead of STATE_RECONNECT ("I'll re-open when you need me again"). We need the latter at the end of a request so that the next request can re-use the connection.

The cause might be this:

canonical.launchpad.webapp.publication.LaunchpadBrowserPublication.afterCall does...

        if request.method in ['GET', 'HEAD']:
            self.finishReadOnlyRequest(txn)
        elif txn.isDoomed():
            # Sends an abort to the database, even though the
            # transaction is still doomed.
            txn.abort()
        else:
            txn.commit()

This amounts to:

    if request.method == "POST" and not txn.isDoomed():
        txn.commit()
    else:
        txn.rollback()

Makes sense to me, but unfortunately storm.database.Connection.commit goes into STATE_DISCONNECTED on failure. Only a rollback can get it back to STATE_RECONNECT. So we'd need to catch disconnection errors coming out of txn.commit() and handle them by calling txn.abort() after all.
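
A minimal sketch of that suggestion (assuming Storm's DisconnectionError and the Zope transaction manager; commit_or_abort is a hypothetical helper, not code from the tree):

    import transaction
    from storm.exceptions import DisconnectionError

    def commit_or_abort(txn=transaction):
        # Hypothetical helper: if the connection dies during the commit
        # itself, abort so the connection ends up in STATE_RECONNECT
        # rather than being left stuck in STATE_DISCONNECTED.
        try:
            txn.commit()
        except DisconnectionError:
            txn.abort()
            raise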

Revision history for this message
Jeroen T. Vermeulen (jtv) wrote :

I missed a vital point in my explanation above: what this all boils down to is that AFAICS a connection gets stuck in this bad state if it's broken (e.g. by a database restart) just when it's in the middle of a commit.

The bad connection will stay in the store and get reused by requests, all of which fail with this error. And because they fail quickly, the bad connection becomes available again quickly to carry another request to its doom.

Revision history for this message
Guilherme Salgado (salgado) wrote :

Looks just like bug 427990.

Revision history for this message
Guilherme Salgado (salgado) wrote :

These DisconnectionErrors cause the transaction to be retried, but as Stuart explained in <https://bugs.edge.launchpad.net/launchpad-foundations/+bug/427990/comments/4>, we now log a soft (Informational) OOPS (meaning we don't fail the request) for DisconnectionErrors, so it's kind of expected to see them if the database was in recovery mode. The only problem is that none of the OOPSes there seem to be soft.

Revision history for this message
Jeroen T. Vermeulen (jtv) wrote :

As for the "broken connection state," recommended practice from the Storm team seems to be to do an abort() at the very beginning of a request, in order to reset the connection.

Revision history for this message
Gary Poster (gary) wrote :

lib/canonical/launchpad/webapp/publication.py, line 167, in beforeTraversal, does a transaction.begin(), as per standard Zope practice. This is within the following traceback:

  File "/home/gary/launchpad/lp-sourcedeps/eggs/zope.server-3.6.1-py2.5.egg/zope/server/taskthreads.py", line 48, in handlerThread
    task.service()
  File "/home/gary/launchpad/lp-sourcedeps/eggs/zope.server-3.6.1-py2.5.egg/zope/server/serverchannelbase.py", line 207, in service
    task.service()
  File "/home/gary/launchpad/lp-sourcedeps/eggs/zope.server-3.6.1-py2.5.egg/zope/server/http/httptask.py", line 71, in service
    self.channel.server.executeRequest(self)
  File "/home/gary/launchpad/lp-sourcedeps/eggs/zc.zservertracelog-1.1.5-py2.5.egg/zc/zservertracelog/tracelog.py", line 114, in executeRequest
    response = self.application(env, start_response)
  File "/home/gary/launchpad/lp-sourcedeps/eggs/zope.app.wsgi-3.6.0-py2.5.egg/zope/app/wsgi/__init__.py", line 58, in __call__
    request = publish(request, handle_errors=handle_errors)
  File "/home/gary/launchpad/lp-sourcedeps/eggs/zope.publisher-3.10.0-py2.5.egg/zope/publisher/publish.py", line 142, in publish
    publication.beforeTraversal(request)
  File "/home/gary/launchpad/lp-branches/devel/lib/canonical/launchpad/webapp/publication.py", line 167, in beforeTraversal
    transaction.begin()

The transaction.begin method performs an abort if a previous transaction exists, as you can see in transaction/_manager.py.
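
For reference, the begin logic in transaction/_manager.py looks roughly like this (a paraphrase, not a verbatim copy; details vary between versions of the transaction package):

    # Paraphrased from transaction/_manager.py (version-dependent).
    def begin(self):
        if self._txn is not None:
            # Abort any transaction left over from a previous request
            # before starting a fresh one.
            self._txn.abort()
        txn = self._txn = Transaction(self._synchs, self)
        return txn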

If I understand correctly, then, we are already doing what you are suggesting, Jeroen. Could you confirm or deny?

On the bright side, if someone can identify what *does* need to happen, this point in beforeTraversal would probably be a reasonable place to do it.

Revision history for this message
Gary Poster (gary) wrote :

Next steps:

- Stuart Bishop will be trying to come up with a test to reproduce the error condition Jeroen described in his first comment.
- I will verify with James Henstridge and/or Gustavo Niemeyer that transaction.begin's behavior should be sufficient from Storm/ZStorm's perspective.

Changed in launchpad-foundations:
status: New → Triaged
importance: Undecided → High
Stuart Bishop (stub)
Changed in storm:
status: New → Invalid
Revision history for this message
Stuart Bishop (stub) wrote :

I haven't produced a test case, but I have traced through the logic in ZStorm and the transaction module. If an exception is raised by zstorm.StoreDataManager.commit() during txn.commit(), the transaction machinery will invoke zstorm.StoreDataManager.abort(), which calls rollback() on the store, setting the state to RECONNECT if the store is not connected.
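
Paraphrasing the ZStorm hooks involved (a simplified sketch following the description above; details vary by Storm version):

    # Simplified paraphrase of storm.zope.zstorm.StoreDataManager;
    # not a verbatim copy.
    class StoreDataManager(object):

        def commit(self, txn):
            # May raise DisconnectionError if the connection dies
            # mid-commit, leaving it in STATE_DISCONNECTED.
            self._store.commit()

        def abort(self, txn):
            # Invoked by the transaction machinery after a failure;
            # rollback() resets a dead connection to STATE_RECONNECT.
            self._store.rollback()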

I cannot locate a code path where afterCall() will leave the stores in a disconnected state.

Revision history for this message
Stuart Bishop (stub) wrote :

afterCall() only gets called when things are succeeding. If an exception is raised, we go to handleException which is much more complex and relies on a fair bit of code not raising exceptions before the transaction.abort() is finally called.

I think the best way to proceed is to add code *at the start* of endRequest in canonical.launchpad.webapp.publication to detect stores in a disconnected state and call rollback() on them, raising an exception or logging a soft OOPS. This should let us track down the code path that leaves our stores in a disconnected state.
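
A sketch of that check (assuming ZStorm's iterstores() and Storm's connection-state constants; the code that actually landed is quoted in a later comment):

    from storm.database import STATE_DISCONNECTED
    from storm.zope.interfaces import IZStorm
    from zope.component import getUtility

    def check_stores():
        # Sketch: at the start of endRequest, walk every active store
        # and repair any that were left disconnected.
        for name, store in getUtility(IZStorm).iterstores():
            if store._connection._state == STATE_DISCONNECTED:
                # Log a soft OOPS here to track down the guilty code
                # path, then roll back so the store is usable again.
                store.rollback()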

Stuart Bishop (stub)
Changed in launchpad-foundations:
status: Triaged → In Progress
Changed in launchpad-foundations:
assignee: nobody → Stuart Bishop (stub)
Revision history for this message
Diogo Matsubara (matsubara) wrote : Bug fixed by a commit
Changed in launchpad-foundations:
status: In Progress → Fix Committed
Ursula Junque (ursinha)
Changed in launchpad-foundations:
milestone: none → 10.01
Curtis Hovey (sinzui)
Changed in launchpad-foundations:
status: Fix Committed → Fix Released
Revision history for this message
Robert Collins (lifeless) wrote :

We currently have this in the code base:
            try:
                assert store._connection._state != STATE_DISCONNECTED, (
                    "Bug #504291: Store left in a disconnected state.")
            except AssertionError:
                # The Store is in a disconnected state. This should
                # not happen, as store.rollback() should have been called
                # by now. Log an OOPS so we know about this. This
                # is Bug #504291 happening.
                getUtility(IErrorReportingUtility).raising(
                    sys.exc_info(), request)
                # Repair things so the server can remain operational.
                store.rollback()

in endRequest: the code path that leads to disconnected stores is still live. This bug is merely worked around. I'm going to reopen it, as we're trying to get a real fix in place as part of the fastdowntime work.

Changed in launchpad:
status: Fix Released → Triaged
tags: added: fastdowntime
Revision history for this message
Robert Collins (lifeless) wrote :

One easy fix would be to change the code to this:
            if store._connection._state == STATE_DISCONNECTED:
                # Repair things so the server can remain operational.
                store.rollback()

which would be a statement that we expect this to happen from time to time, and that when it happens we can deal with it.

Revision history for this message
Robert Collins (lifeless) wrote :

Recent occurrence:
OOPS-2077A100

Revision history for this message
Stuart Bishop (stub) wrote :

The code is already there to reset the store to a working state, done straight after logging the OOPS:

                # The Store is in a disconnected state. This should
                # not happen, as store.rollback() should have been called
                # by now. Log an OOPS so we know about this. This
                # is Bug #504291 happening.
                getUtility(IErrorReportingUtility).raising(
                    sys.exc_info(), request)
                # Repair things so the server can remain operational.
                store.rollback()

I'm not sure how to proceed on this. At the end of the request, we are detecting the invalid state and logging an OOPS. This OOPS (such as OOPS-2077A100) is empty, containing no call traces or information on how this request managed to screw up the database connection.

Revision history for this message
Gavin Panella (allenap) wrote :

I landed some disconnection related changes to Storm recently, which
are now also running in Launchpad. I can't find any recent occurrences
of this bug. Perhaps it's gone away?

Revision history for this message
Robert Collins (lifeless) wrote : Re: [Bug 504291] Re: DisconnectionErrors (already disconnected) happening again

On Wed, Nov 9, 2011 at 10:57 PM, Stuart Bishop
<email address hidden> wrote:
> I'm not sure how to proceed on this. At the end of the request, we are
> detecting the invalid state and logging an OOPS. This OOPS (such as
> OOPS-2077A100) is empty, containing no call traces or information on how
> this request managed to screw up the database connection.

So, I guess we need to fix the oops. Is this perhaps logged after the
request is 'finished' (which wipes the OOPS metadata)?

Revision history for this message
Stuart Bishop (stub) wrote :

The OOPS is certainly logged after the request is finished. This is in endRequest(), after the request and any error handling have been completed.

This is all tied up with error handling I think.

                    try:
                        try:
                            request.processInputs()
                            publication.beforeTraversal(request)
                            [...]
                            publication.afterCall(request, obj)
                        except:
                            exc_info = sys.exc_info()
                            publication.handleException(
                                obj, request, exc_info, True)
                    finally:
                        publication.endRequest(request, obj)

transaction.begin() is called in beforeTraversal(). transaction.commit() or transaction.abort() in afterCall(). transaction.abort() is called in handleException().

But we find in endRequest that a Store is in an invalid state, which indicates that neither transaction.commit() nor transaction.abort() was called.

Revision history for this message
Robert Collins (lifeless) wrote :

Ok, so we won't see the main request - however we can add as much data
to the oops as we want, as long as it's available *somewhere*: we may
want to drop down to the lower-level API for that, or alternatively
stash the previously collected metadata somewhere. I don't have an
opinion about what's better [yet].

Revision history for this message
Stuart Bishop (stub) wrote :

In theory, if something previously screwed up in the request this badly, there should already be an OOPS, and this will be the second (or third) OOPS generated for that request. An OOPS could have a pointer to the previous OOPS generated by that thread. We might not even want to reset this between requests, for cases like this one where the bad state might have been caused by a previous request.

Revision history for this message
Robert Collins (lifeless) wrote :

I like the idea of a 'last-oops-in-thread', reset when a request
succeeds - it's nice and isolated and won't keep large object
references around or anything.
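
A minimal sketch of that bookkeeping (all names hypothetical; only the OOPS id is stored, so no large object references are kept alive):

    import threading

    _thread_state = threading.local()

    def note_oops(oops_id):
        # Remember only the id of the last OOPS this thread reported.
        _thread_state.last_oops_id = oops_id

    def clear_last_oops():
        # Called when a request completes successfully.
        _thread_state.last_oops_id = None

    def last_oops_id():
        # A later OOPS can include this as a pointer to its predecessor.
        return getattr(_thread_state, 'last_oops_id', None)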

Revision history for this message
Stuart Bishop (stub) wrote :

We need a fresh OOPS after r14299 has been rolled out. This gives us the 'previous oops in request' link that I think we need to track this down further.

Changed in launchpad:
status: Triaged → Incomplete
William Grant (wgrant)
Changed in launchpad:
status: Incomplete → Invalid
Revision history for this message
William Grant (wgrant) wrote :

This often happens during DB outages, and indeed is produced by test_disconnectionerror_view_integration.

Changed in launchpad:
status: Invalid → Triaged
milestone: 10.01 → none
assignee: Stuart Bishop (stub) → nobody
Revision history for this message
William Grant (wgrant) wrote :

The bug is in LaunchpadBrowserPublication.handleException: we implement a special Retry handler that bypasses the superclass' handleException, so the transaction is never aborted before a retry is executed. Manually aborting before raising the retry fixes it.
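
A sketch of that fix (simplified, not the exact Launchpad diff), assuming zope.publisher's Retry exception:

    import transaction
    from zope.publisher.interfaces import Retry

    def retry_request(exc_info):
        # Abort first so the Store is rolled back to STATE_RECONNECT;
        # raising Retry bypasses the superclass' handleException, which
        # would otherwise have been where the abort happened.
        transaction.abort()
        raise Retry(exc_info)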

William Grant (wgrant)
Changed in launchpad:
assignee: nobody → William Grant (wgrant)
status: Triaged → In Progress
Revision history for this message
Launchpad QA Bot (lpqabot) wrote :
tags: added: qa-needstesting
Changed in launchpad:
status: In Progress → Fix Committed
William Grant (wgrant)
tags: added: qa-ok
removed: qa-needstesting
William Grant (wgrant)
Changed in launchpad:
status: Fix Committed → Fix Released