When switching to read-only mode, we're left with lots of "idle/select waiting" connections to the DBs which may be blocking the schema upgrade process

Bug #531834 reported by Tom Haddon on 2010-03-04
This bug affects 3 people
Affects: Launchpad itself
Importance: High
Assigned to: Stuart Bishop

Bug Description

During the 10.02 rollout, we dropped the read-only.txt file into the coderoot to switch to read-only and noticed that Launchpad had then switched into read-only mode per the message on every page. However, we were still seeing lots of "idle/select waiting" processes from the appservers against the launchpad_prod_N DBs even though the configs state:

ro_main_master: dbname=launchpad_standalone_1 host=chokecherry.canonical.com
ro_main_slave: dbname=launchpad_standalone_1 host=chokecherry.canonical.com

We're not sure if this blocks the schema upgrade in some way (per bug 531833 it's hard to tell whether the schema upgrade is being blocked or not), but after restarting all the app servers the schema upgrade seemed to progress a little better, so it suggests this was an issue.

Tom Haddon (mthaddon) wrote :

These connections were seen at least 30 minutes after the read-only switch, fwiw (in case it wasn't clear whether they were only still there because there were still queries waiting to finish).

Changed in launchpad:
importance: Undecided → High
Brad Crittenden (bac) on 2010-03-04
affects: launchpad → launchpad-foundations
Changed in launchpad-foundations:
status: New → Triaged
Gary Poster (gary) on 2010-03-04
Changed in launchpad-foundations:
milestone: none → 10.03
Guilherme Salgado (salgado) wrote :

It looks like these leftover connections are the ones used by the AUTH SLAVE stores.

They're left open because the read-only.txt file will cause only the connections used by the MAIN stores to be dropped/recreated.

I think we might not need to fix this as we should've stopped using the auth DB by the time of the next roll out.

Guilherme Salgado (salgado) wrote :

There seem to be some connections coming from login.u.c as well: https://devpad.canonical.com/~salgado/connections-prod_3.txt

Tom Haddon (mthaddon) wrote :

Which ones are those? If you mean the u1login ones, those are from the launchpad appservers that were dedicated to serving login traffic for Ubuntu One (but still under the login.launchpad.net domain, and still running the same code as other login.launchpad.net properties).

Tom Haddon (mthaddon) wrote :

These can be decommissioned now that Ubuntu One is using login.ubuntu.com, but this work hasn't been completed yet.

Guilherme Salgado (salgado) wrote :

Yeah, I thought the u1* connections came from login.u.c, so indeed all the connections we see there should be the ones used by AUTH SLAVE stores.

Contrary to what I said earlier, though, to have this fixed we need to split the auth tables out of the replication *and* get rid of the auth stores in the Launchpad code, so that all connections go to the read-only DB when the read-only.txt file exists.

Gary Poster (gary) on 2010-03-30
Changed in launchpad-foundations:
milestone: 10.03 → 10.04
Changed in launchpad-foundations:
status: Triaged → In Progress
assignee: nobody → Guilherme Salgado (salgado)
Changed in launchpad-foundations:
status: In Progress → Fix Committed
Guilherme Salgado (salgado) wrote :

This happened again during yesterday's roll out. According to Tom Haddon, the following DBs had open connections after switching to read only:

hackberry/launchpad_prod_1 and chokecherry/launchpad_prod_2 - we didn't have any to wildcherry/launchpad_prod_3 because we run a script to specifically kill connections we don't want on there (we're considering running it on the other two as well to be sure from now on)

Changed in launchpad-foundations:
status: Fix Committed → Triaged
Gary Poster (gary) on 2010-05-21
Changed in launchpad-foundations:
assignee: Guilherme Salgado (salgado) → nobody
Gary Poster (gary) wrote :

The following is the end of a mail thread. Last response is Salgado's.

> On Fri, 2010-05-21 at 16:49 -0400, Gary Poster wrote:
> > On May 8, 2010, at 8:34 AM, Stuart Bishop wrote:
> >
> > > On Fri, May 7, 2010 at 8:28 PM, Guilherme Salgado <email address hidden> wrote:
> > >
> > > The latter should work as long as the ROMDP returns the romode store
> > > (pointing to the standalone DB) rather than the regular slave one as it
> > > currently does. The ROMDP is already installed by beforeTraversal() (so
> > > we'll still be able to switch the web app to read-only without
> > > restarting) but as you said we'll need to install that policy on
> > > startup, for things like the librarian.
> > >
> > > We also need to make sure that all DatabasePolicies can detect a mode
> > > change and close the stores used by the previous mode, so that we don't
> > > leave open connections behind.
> >
> > Hmm... yeah. All the DatabasePolicies would need to know about RO
> > mode, which sucks. Instead, we could just do this in the
> > IStoreSelector - if read-only mode is detected, ignore any installed
> > DatabasePolicy and return the read-only Stores.
>
> Salgado, the thread stopped pretty soon after this suggestion. Does this
> sound like a reasonable plan of attack to you? If so, maybe I could have
> a quick call with you next week to make sure I understand what
> Foundations needs to do.

It sounds sane, although I don't quite remember how the StoreSelector
works (and more importantly, if everything uses it). I think we just
need to make sure there's no code used by the app servers that might
bypass the StoreSelector, also teaching the store selector to detect a
mode change and close the stores used in the previous mode.
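
The approach discussed in this thread (the selector ignores whatever policy is installed while in read-only mode, and closes the stores left over from the previous mode) could be sketched roughly as follows. This is only an illustration under stated assumptions: `FakeStore` and `ReadOnlyAwareSelector` are hypothetical names, not Launchpad's IStoreSelector or Storm's API, and the DSN strings are placeholders.

```python
import os


class FakeStore:
    """Hypothetical stand-in for a Storm store; only tracks open/closed."""

    def __init__(self, dsn):
        self.dsn = dsn
        self.closed = False

    def close(self):
        self.closed = True


class ReadOnlyAwareSelector:
    """Hypothetical selector that overrides the policy in read-only mode."""

    def __init__(self, marker_path, policy_dsn, read_only_dsn):
        self.marker_path = marker_path      # path to read-only.txt
        self.policy_dsn = policy_dsn        # what the installed policy would pick
        self.read_only_dsn = read_only_dsn  # the standalone read-only DB
        self._store = None
        self._read_only = None

    def get(self):
        read_only = os.path.exists(self.marker_path)
        # On a mode change, close the store from the previous mode so no
        # connection to the old database is left behind.
        if self._store is not None and read_only != self._read_only:
            self._store.close()
            self._store = None
        if self._store is None:
            # In read-only mode the policy's choice is ignored entirely.
            dsn = self.read_only_dsn if read_only else self.policy_dsn
            self._store = FakeStore(dsn)
            self._read_only = read_only
        return self._store
```

The sketch keeps a single store for simplicity; real code would hold one per thread and per store flavour.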

Gary Poster (gary) wrote :

Stuart says that everything does use the StoreSelector. He's going to investigate and see if he can get his proposed approach in for this cycle. If not, we'll push it to 10.06.

Changed in launchpad-foundations:
milestone: 10.04 → 10.06
assignee: nobody → Stuart Bishop (stub)
milestone: 10.06 → 10.05
Stuart Bishop (stub) wrote :

Currently, the code that disconnects existing stores only disconnects the stores for the first thread that handled a request after the read-only mode switch over.

Fixing this means further refactoring is not necessary.

On Wed, 2010-05-26 at 10:43 +0000, Stuart Bishop wrote:
> Currently, the code that disconnects existing stores only disconnects
> the stores for the first thread that handled a request after the read-
> only mode switch over.

What makes you think that's what happens?

I don't think that's the case, and when experimenting locally I was able
to (on a few different occasions) get all stores disconnected after
switching to read only.

Stuart Bishop (stub) wrote :

A bad assumption made me think that was what was happening.

I can't reproduce the hanging connections locally with launchpad trunk. Looking back at the mailing list thread, it looks like it would be difficult to reproduce except under load with a properly sized database:

""" This works by having the DB URI as a config variable that returns the
    appropriate value according to the presence of the read-only.txt file,
    together with a publication hook that forces the storm stores to
    reconnect whenever there's a mode switch.

    However, storm (through ZStorm) shares storm.database.Database instances
    across threads (without saying so in the docs) and our custom version of
    that class (LaunchpadDatabase) may change its DB URI (when there's a
    config change) after instantiation[2]. This, in itself, is a problem,
    but since we don't do runtime config changes except when running the
    test suite (which is single-threaded), it's never affected us.

    My changes for the read-only switch, though, introduced a way of
    changing config values at runtime, thus exposing the problem.

    This means there's a race condition when switching to read-only mode,
    when a thread kicks in after another thread has noticed the config
    change and reconnected its stores, thus causing the shared instance of
    LaunchpadDatabase to change its DB URI. When other threads kick in,
    they'll think they're connected to the correct DB (remember we use the
    DB URI stored in LaunchpadDatabase for that) and won't tell their stores
    to reconnect, leaving open connections to DBs that should have no
    connections.
"""
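
The race described in the quoted mail can be replayed deterministically with a toy model. This is a simplified sketch, not Launchpad or Storm code: `SharedDatabase` and `Worker` are hypothetical stand-ins for the shared LaunchpadDatabase instance and for an app server thread with its own connection, and the URI strings are placeholders.

```python
class SharedDatabase:
    """Stand-in for the Database object shared by every thread."""

    def __init__(self, uri):
        self.uri = uri


class Worker:
    """Stand-in for one thread holding its own connection."""

    def __init__(self, database):
        self.database = database
        self.connected_to = database.uri

    def handle_request(self, configured_uri):
        # Flawed check: compares the config against the *shared* URI,
        # not against the connection this worker actually holds.
        if self.database.uri != configured_uri:
            self.database.uri = configured_uri
            self.connected_to = configured_uri  # reconnect


NORMAL = "dbname=normal_db"        # placeholder for the production DB
READ_ONLY = "dbname=read_only_db"  # placeholder for the standalone DB

db = SharedDatabase(NORMAL)
first, second = Worker(db), Worker(db)

# The switch to read-only happens; the first worker notices and reconnects,
# updating the shared URI as a side effect.
first.handle_request(READ_ONLY)
# The second worker now sees a matching shared URI and skips the reconnect.
second.handle_request(READ_ONLY)

print(first.connected_to)   # dbname=read_only_db
print(second.connected_to)  # dbname=normal_db (stale connection left open)
```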

The good news is that I think my fix sorts this out too, as we no longer use the DB URI to detect whether we have switched to or from read-only mode (and as an additional benefit, we no longer need to poke at Storm internals per the existing comment in the code).
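
One way to detect a mode switch without consulting the shared DB URI is to have each thread remember the mode it last observed. The following is a minimal sketch of that idea, assuming a hypothetical `ModeTracker` helper; it is not the code that landed in Launchpad.

```python
import threading


class ModeTracker:
    """Hypothetical helper: per-thread memory of the last observed mode."""

    def __init__(self):
        self._local = threading.local()

    def mode_changed(self, read_only_now):
        """Return True if this thread last saw a different mode."""
        previous = getattr(self._local, "read_only", None)
        self._local.read_only = read_only_now
        # The first observation in a thread is not counted as a change.
        return previous is not None and previous != read_only_now
```

Because the state lives in `threading.local()`, one thread noticing the switch cannot hide it from the others, so every thread gets the chance to reconnect its own stores.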

Stuart Bishop (stub) wrote :

So the issue was still the Database being a singleton. I just assumed a simpler reason for the disconnect failing than the trickier race condition that salgado had tracked down. I knew it had something to do with _database being shared :)

Guilherme Salgado (salgado) wrote :

I was able to reproduce it locally a bunch of times -- maybe thanks to
the not-so-fast laptop I'm using now.

I can help test your fix if you'd like.

Tom Haddon (mthaddon) on 2010-05-28
tags: added: canonical-losa-lp
Changed in launchpad-foundations:
status: Triaged → Fix Committed
tags: added: qa-needstesting
Curtis Hovey (sinzui) on 2010-06-02
Changed in launchpad-foundations:
status: Fix Committed → Fix Released