When switching to read-only mode, we're left with lots of "idle/select waiting" connections to the DBs which may be blocking the schema upgrade process
Affects | Status | Importance | Assigned to | Milestone |
---|---|---|---|---|
Launchpad itself | | High | Stuart Bishop | |
Bug Description
During the 10.02 rollout, we dropped the read-only.txt file into the coderoot to switch to read-only and noticed that Launchpad had then switched into read-only mode per the message on every page. However, we were still seeing lots of "idle/select waiting" processes from the appservers against the launchpad_prod_N DBs even though the configs state:
ro_main_master: dbname=
ro_main_slave: dbname=
We're not sure if this blocks the schema upgrade in some way (per bug 531833 it's hard to tell whether the schema upgrade is being blocked or not), but after restarting all the app servers the schema upgrade seemed to progress a little better, so it suggests this was an issue.
affects: | launchpad → launchpad-foundations |
Changed in launchpad-foundations: | |
status: | New → Triaged |
Changed in launchpad-foundations: | |
milestone: | none → 10.03 |
Guilherme Salgado (salgado) wrote : | #2 |
It looks like these leftover connections are the ones used by the AUTH SLAVE stores.
They're left open because the read-only.txt file will cause only the connections used by the MAIN stores to be dropped/recreated.
I think we might not need to fix this as we should've stopped using the auth DB by the time of the next roll out.
Guilherme Salgado (salgado) wrote : | #3 |
There seem to be some connections coming from login.u.c as well: https:/
Tom Haddon (mthaddon) wrote : | #4 |
Which ones are those? If you mean the u1login ones, those are from the launchpad appservers that were dedicated to serving login traffic for Ubuntu One (but still under the login.launchpad.net domain, and still running the same code as other login.launchpad.net properties).
Tom Haddon (mthaddon) wrote : | #5 |
These can be decommissioned now that Ubuntu One is using login.ubuntu.com, but this work hasn't been completed yet.
Guilherme Salgado (salgado) wrote : | #6 |
Yeah, I thought the u1* connections came from login.u.c, so indeed all the connections we see there should be the ones used by AUTH SLAVE stores.
Contrary to what I said earlier, though, to have this fixed we need to split the auth tables out of the replication *and* get rid of auth stores from the Launchpad code so that all connections go to the read-only DB when the read-only.txt file exists.
Changed in launchpad-foundations: | |
milestone: | 10.03 → 10.04 |
Changed in launchpad-foundations: | |
status: | Triaged → In Progress |
assignee: | nobody → Guilherme Salgado (salgado) |
Changed in launchpad-foundations: | |
status: | In Progress → Fix Committed |
Guilherme Salgado (salgado) wrote : | #7 |
This happened again during yesterday's roll out. According to Tom Haddon, the following DBs had open connections after switching to read only:
hackberry/
Changed in launchpad-foundations: | |
status: | Fix Committed → Triaged |
Changed in launchpad-foundations: | |
assignee: | Guilherme Salgado (salgado) → nobody |
Gary Poster (gary) wrote : | #8 |
The following is the end of a mail thread. Last response is Salgado's.
> On Fri, 2010-05-21 at 16:49 -0400, Gary Poster wrote:
> > On May 8, 2010, at 8:34 AM, Stuart Bishop wrote:
> >
> > > On Fri, May 7, 2010 at 8:28 PM, Guilherme Salgado <email address hidden> wrote:
> > >
> > > The latter should work as long as the ROMDP returns the romode store
> > > (pointing to the standalone DB) rather than the regular slave one as it
> > > currently does. The ROMDP is already installed by beforeTraversal() (so
> > > we'll still be able to switch the web app to read-only without
> > > restarting) but as you said we'll need to install that policy on
> > > startup, for things like the librarian.
> > >
> > > We also need to make sure that all DatabasePolicies can detect a mode
> > > change and close the stores used by the previous mode, so that we don't
> > > leave open connections behind.
> >
> > Hmm... yeah. All the DatabasePolicies would need to know about RO
> > mode, which sucks. Instead, we could just do this in the
> > IStoreSelector - if read-only mode is detected, ignore any installed
> > DatabasePolicy and return the read-only Stores.
>
> Salgado, the thread stopped pretty soon after this suggestion. Does this
> sound like a reasonable plan of attack to you? If so, maybe I could have
> a quick call with you next week to make sure I understand what
> Foundations needs to do.
It sounds sane, although I don't quite remember how the StoreSelector
works (and more importantly, if everything uses it). I think we just
need to make sure there's no code used by the app servers that might
bypass the StoreSelector, also teaching the store selector to detect a
mode change and close the stores used in the previous mode.
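A minimal sketch of the idea discussed in this thread, not Launchpad's actual code: a selector that ignores the requested flavor when read-only mode is detected, and closes the stores opened under the previous mode. `Store`, `ModeAwareStoreSelector`, `marker_path` and the flavor strings are hypothetical names made up for illustration; only the read-only.txt marker file comes from the bug report.

```python
import os


class Store:
    """Stand-in for a database store; only tracks whether it is open."""

    def __init__(self, name):
        self.name = name
        self.closed = False

    def close(self):
        self.closed = True


class ModeAwareStoreSelector:
    """Hypothetical selector that overrides any policy in read-only mode."""

    def __init__(self, marker_path):
        self.marker_path = marker_path
        self._read_only = None  # mode seen on the previous call
        self._stores = {}       # stores opened under the current mode

    def is_read_only(self):
        # Read-only mode is signalled by the presence of the marker file.
        return os.path.exists(self.marker_path)

    def get(self, name, flavor):
        read_only = self.is_read_only()
        if read_only != self._read_only:
            # Mode changed: close everything opened under the old mode
            # so no connections are left behind.
            for store in self._stores.values():
                store.close()
            self._stores = {}
            self._read_only = read_only
        if read_only:
            flavor = "read-only"  # ignore whatever the caller asked for
        key = (name, flavor)
        if key not in self._stores:
            self._stores[key] = Store("%s-%s" % (name, flavor))
        return self._stores[key]
```

In a threaded app server each thread would need its own selector state (or a lock), otherwise this sketch has the same sharing problem discussed later in the thread.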
Gary Poster (gary) wrote : | #9 |
Stuart says that everything does use the StoreSelector. He's going to investigate and see if he can get his proposed approach in for this cycle. If not, we'll push it to 10.06.
Changed in launchpad-foundations: | |
milestone: | 10.04 → 10.06 |
assignee: | nobody → Stuart Bishop (stub) |
milestone: | 10.06 → 10.05 |
Stuart Bishop (stub) wrote : | #10 |
Currently, the code that disconnects existing stores only disconnects the stores for the first thread that handled a request after the read-only mode switch over.
Fixing this means further refactoring is not necessary.
On Wed, 2010-05-26 at 10:43 +0000, Stuart Bishop wrote:
> Currently, the code that disconnects existing stores only disconnects
> the stores for the first thread that handled a request after the read-
> only mode switch over.
What makes you think that's what happens?
I don't think that's the case, and when experimenting locally I was able
to (on a few different occasions) get all stores disconnected after
switching to read only.
Stuart Bishop (stub) wrote : | #12 |
A bad assumption made me think that was what was happening.
I can't reproduce the hanging connections locally with launchpad trunk. Looking back at the mailing list thread, it looks like it would be difficult to reproduce except under load with a properly sized database:
""" This works by having the DB URI as a config variable that returns the
appropriate value according to the presence of the read-only.txt file,
together with a publication hook that forces the storm stores to
reconnect whenever there's a mode switch.
However, storm (through ZStorm) shares storm.database.
across threads (without saying so in the docs) and our custom version of
that class (LaunchpadDatabase) may change its DB URI (when there's a
config change) after instantiation[2]. This, in itself, is a problem,
but since we don't do runtime config changes except when running the
test suite (which is single-threaded), it's never affected us.
My changes for the read-only switch, though, introduced a way of
changing config values at runtime, thus exposing the problem.
This means there's a race condition when switching to read-only mode,
when a thread kicks in after another thread has noticed the config
change and reconnected its stores, thus causing the shared instance of
LaunchpadDa
they'll think they're connected to the correct DB (remember we use the
DB URI stored in LaunchpadDatabase for that) and won't tell their stores
to reconnect, leaving open connections to DBs that should have no
connections.
"""
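The race in the quoted mail can be sketched in a few lines. This is a simplified simulation, not the Launchpad or Storm code: `SharedDatabase`, `Worker` and the two URI strings are hypothetical stand-ins, and the two "threads" are run one after the other so the outcome is deterministic.

```python
class SharedDatabase:
    """Stand-in for the database object shared by every thread."""

    def __init__(self, uri):
        self.uri = uri


class Worker:
    """Stand-in for one app server thread with its own connection."""

    def __init__(self, database):
        self.database = database
        self.connected_to = database.uri

    def handle_request(self, configured_uri):
        # The mode change is inferred by comparing the configured URI
        # against the URI stored on the *shared* database object.
        if self.database.uri != configured_uri:
            self.database.uri = configured_uri
            self.connected_to = configured_uri  # reconnects this worker only


RW, RO = "dbname=read_write", "dbname=read_only"  # hypothetical URIs
shared = SharedDatabase(RW)
first, second = Worker(shared), Worker(shared)

# read-only.txt appears; both workers then serve a request.
first.handle_request(RO)
second.handle_request(RO)
```

After both requests, `first.connected_to` is the read-only URI, but `second.connected_to` is still the read-write one: the first worker already updated the shared URI, so the second sees nothing to do and keeps its old connection open.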
The good news is that I think my fix sorts this out too, as we no longer use the DB URI to detect whether we have switched to or from read-only mode (and as an additional benefit, we no longer need to poke at Storm internals per the existing comment in the code).
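The thread does not say what the fix uses instead of the DB URI. One possible approach, shown here purely as an assumption, is to remember the last mode seen by each thread in thread-local state, so that every thread notices the switch independently. `ModeTracker` and `mode_changed` are hypothetical names.

```python
import threading


class ModeTracker:
    """Remembers, per thread, the mode seen on that thread's last request."""

    def __init__(self):
        self._local = threading.local()

    def mode_changed(self, read_only_now):
        # Each thread compares against its own previous value, so one
        # thread noticing the switch cannot hide it from the others.
        previous = getattr(self._local, "read_only", None)
        self._local.read_only = read_only_now
        return previous is not None and previous != read_only_now
```

A publication hook could call `mode_changed()` at the start of each request and reconnect that thread's stores whenever it returns `True`.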
Stuart Bishop (stub) wrote : | #13 |
So the Database being a singleton was still the issue - I just assumed a simpler reason for the disconnect to fail than the trickier race condition that salgado had tracked down. I knew it had something to do with _database being shared :)
Guilherme Salgado (salgado) wrote : | #14 |
I was able to reproduce it locally a bunch of times -- maybe thanks to
the not-so-fast laptop I'm using now.
I can help testing your fix if you'd like.
tags: | added: canonical-losa-lp |
Fixed in stable r10928 <http://
Changed in launchpad-foundations: | |
status: | Triaged → Fix Committed |
tags: | added: qa-needstesting |
Changed in launchpad-foundations: | |
status: | Fix Committed → Fix Released |
These connections were seen at least 30 minutes after the read-only switch, fwiw (just in case it wasn't clear whether they were still there only because there were still queries waiting to finish).