appserver deployment must not interrupt live requests

Bug #640065 reported by Robert Collins
This bug affects 2 people
Affects: Launchpad itself
Status: Triaged
Importance: High
Assigned to: Unassigned

Bug Description

See bug 380504 for historical context.

In short:
 - when we start to upgrade an appserver, it may have live requests in flight
 - haproxy will keep sending requests until the appserver falls over and dies
 - RT 41503 will work around this in a kludgy fashion

We need to change the appservers to have a graceful shutdown:
 - close the listening socket
 - report that it's closed
 - let in-flight requests complete
 - after a configured interval (200 seconds, probably) forcibly take down remaining threads
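The steps above could be sketched roughly as follows. This is a minimal illustration, not Launchpad's actual code: `graceful_shutdown`, the worker-thread list, and the 200-second default are all assumptions for the sketch.

```python
import socket
import threading
import time

GRACE_PERIOD = 200  # seconds to wait before forcing remaining threads down

def graceful_shutdown(listener, workers, grace=GRACE_PERIOD):
    """Stop accepting, report it, drain workers, then give up."""
    # 1) Close the listening socket so no new requests arrive.
    listener.close()
    # 2) Report the closure so the deploy tooling can watch for it.
    print("listening socket closed; draining %d worker(s)" % len(workers))
    # 3) Let in-flight requests complete, within the grace period.
    deadline = time.time() + grace
    for worker in workers:
        remaining = deadline - time.time()
        if remaining <= 0:
            break
        worker.join(remaining)
    # 4) Whatever is still alive must be forcibly taken down by the caller.
    return [w for w in workers if w.is_alive()]
```

The caller gets back the list of threads that overstayed the grace period, which is exactly the set a supervisor would then take down forcibly.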

This will let the overall process become:
 - take the server out of rotation (sysadmin change, rt 41503)
 - start the upgrade by telling the appserver to gracefully stop
 - wait for the 'socket xxxx closed' message
 - start the new instance (sysadmin change, future rt)
 - if the old appserver hasn't shut down after (long time) kill -9 it (belt and braces)
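The "gracefully stop, then kill -9 it" part of that process could look something like this sketch, assuming the appserver treats SIGTERM as a graceful-stop request; `stop_appserver` and the 600-second default are illustrative names, not existing tooling.

```python
import signal
import subprocess
import sys
import time

def stop_appserver(proc, long_timeout=600):
    """Ask for a graceful stop; kill -9 as belt and braces."""
    # Assumes the appserver treats SIGTERM as 'gracefully stop'.
    proc.send_signal(signal.SIGTERM)
    deadline = time.time() + long_timeout
    while time.time() < deadline:
        if proc.poll() is not None:
            return True          # shut down cleanly within the window
        time.sleep(0.2)
    proc.kill()                  # still running after (long time): SIGKILL
    proc.wait()
    return False
```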

haproxy may stop forwarding requests cleanly once the listening socket is gone, but we want to be completely robust against bugs there.

Revision history for this message
Robert Collins (lifeless) wrote :

This is why I say 200 seconds:

  140.59s  OOPS-1719D2284  https://api.launchpad.net/beta
  102.74s  OOPS-1719E2173  Person:+rdf
   94.33s  OOPS-1719F2300  https://api.launchpad.net/devel
   77.05s  OOPS-1719H2208  https://api.launchpad.net/1.0

description: updated
description: updated
Revision history for this message
Tom Haddon (mthaddon) wrote :

I'm a little confused here - how on earth can we be having connections that take so long? It seems like this is largely recorded as non-SQL time, so what is it?

Revision history for this message
Robert Collins (lifeless) wrote : Re: [Bug 640065] Re: appserver deployment must not interrupt live requests

On Mon, Sep 27, 2010 at 9:34 PM, Tom Haddon <email address hidden> wrote:
> I'm a little confused here - how on earth can we be having connections
> that take so long? It seems like this is largely recorded as non-SQL
> time, so what is it?

Two places:

Firstly, the 'legitimate' ones: proxied librarian files are handed off
to the appserver event loop and served (fairly) efficiently, if you
ignore the whole copy-to-a-tempfile thing. They are dependent on
client performance, and can be huge (openoffice debs, for instance).
Deploying the restricted-librarian-to-public change will fix that. It's
pending QA as a high-sev RT.

Secondly, the timeout code for requests works by raising an error when
something checks 'is there time remaining'. Anything that does not
check will not time out. We don't [yet] try to inject errors into the
thread mid-request; we let the threads cooperate.

Currently the following things check for timeouts:
 - google webservice lookups (at the start)
 - DB requests
 - possibly a couple of other things
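The cooperative mechanism described above amounts to something like this toy sketch (`RequestTimer` and `check_time_remaining` are hypothetical names, not Launchpad's actual timeout machinery):

```python
import time

class RequestTimedOut(Exception):
    """Raised when a cooperative check finds the budget exhausted."""

class RequestTimer:
    def __init__(self, budget):
        # Deadline is fixed at the start of the request.
        self.deadline = time.time() + budget

    def check_time_remaining(self):
        # Only call sites that run this check can time out; a request
        # stuck in a plain Python loop never reaches it and never fails.
        if time.time() > self.deadline:
            raise RequestTimedOut()
```

This makes the limitation concrete: code between checks runs unbounded, which is why only DB requests and external lookups currently time out.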

I plan to put many more things into that set, but doing so will
instantly transform those pages into hard errors, so am biding my
time.

Note too that a page which does a 2 second query and then spends 60
seconds in a bad Python loop simply won't time out, so we're not going
to fix this permanently until we can inject an exception into the
thread (and have moved the restricted librarian stuff off).

Oh, and then there is AJAX long-poll in the future ;) - so we're
looking at needing long migration times irrespective of bugs.

-Rob

description: updated
Revision history for this message
Gary Poster (gary) wrote :

RT 41503 is a pretty common approach for doing this in app servers. Is the benefit of a more graceful app story significant? If so, still, given that 41503 will be addressed, and given work on bug 636695 to make killing the appservers conclusive (again), is this really of a high importance?

Revision history for this message
Robert Collins (lifeless) wrote :

I think it's important because without a proper graceful story we'll
only be guessing whether requests are finished or not - the rate at
which servers come down (cleanly) and come up (cleanly) is a strong
determinant of how fast deployments can happen.

tags: added: rfwtad
Revision history for this message
Stuart Bishop (stub) wrote :

On Wed, Oct 6, 2010 at 4:08 AM, Robert Collins
<email address hidden> wrote:
> I think it's important because without a proper graceful story we'll
> only be guessing whether requests are finished or not - the rate at
> which servers come down (cleanly) and come up (cleanly) is a strong
> determinant of how fast deployments can happen.

haproxy tells us when requests are finished.

1) Tell load balancer to take a server out of rotation
2) Wait until load balancer reports all connections are finished, or
timeout hit.
3) Bounce appserver
4) Tell load balancer to put server back in rotation

Note that this process works even when an appserver is totally messed
up, and for every other service we can load balance behind haproxy.
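Stuart's drain-then-bounce loop could be scripted against haproxy's admin socket roughly like this. A hedged sketch only: the socket path, backend, and server names are assumptions, and it relies on haproxy's `disable server` and `show stat` commands, with `scur` being the current-sessions column in the stats CSV.

```python
import csv
import io
import socket
import time

def haproxy_cmd(sock_path, cmd):
    """Send one command to the haproxy admin socket, return the reply."""
    s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    s.connect(sock_path)
    s.sendall(cmd.encode() + b"\n")
    chunks = []
    while True:  # haproxy closes the connection after replying
        chunk = s.recv(4096)
        if not chunk:
            break
        chunks.append(chunk)
    s.close()
    return b"".join(chunks).decode()

def drain_server(sock_path, backend, server, timeout=200):
    """Take a server out of rotation, wait for its sessions to drain."""
    # 1) Tell the load balancer to stop sending new requests.
    haproxy_cmd(sock_path, "disable server %s/%s" % (backend, server))
    # 2) Poll 'show stat' until current sessions (scur) hit zero.
    deadline = time.time() + timeout
    while time.time() < deadline:
        stat = haproxy_cmd(sock_path, "show stat")
        for row in csv.DictReader(io.StringIO(stat.lstrip("# "))):
            if row.get("pxname") == backend and row.get("svname") == server:
                if int(row["scur"]) == 0:
                    return True   # all connections finished; safe to bounce
        time.sleep(1)
    return False                  # timeout hit; bounce the appserver anyway
```

A `True` return corresponds to step 2 of Stuart's list completing cleanly; `False` is the timeout branch, after which the appserver gets bounced regardless.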

--
Stuart Bishop <email address hidden>
http://www.stuartbishop.net/

Revision history for this message
Gary Poster (gary) wrote :

Stuart's argument makes sense to me.

Revision history for this message
Martin Pool (mbp) wrote :