codehosting1 still had connections 37 minutes after shutdown signal was given to it

Bug #819884 reported by Gary Poster
Affects: Launchpad itself
Status: Triaged
Importance: High
Assigned to: Unassigned

Bug Description

LOSAs quiesce LP before a no-downtime deploy using https://pastebin.canonical.com/50597/ . Timeout is currently 30 minutes. Tom would like to decrease the timeout to 10. If timing out is OK (presumably this means that large in-progress code imports will be killed) then that's maybe all we need to do, though it might be nice if LP handles the hard-kill itself. If it is not OK, we need to figure out what to do about it.
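The quiesce step described above amounts to "ask the service to stop, wait for connections to drain, and hard-kill on timeout". A minimal sketch of that loop, assuming hypothetical process names (`pgrep`/`pkill` matching on the service name stands in for whatever the real quiesce script does):

```python
import subprocess
import time

QUIESCE_TIMEOUT = 10 * 60  # proposed 10-minute window (currently 30 minutes)
POLL_INTERVAL = 30

def connection_count(service):
    # Count live processes matching the service name as a proxy for
    # open client connections (hypothetical; the real script may count
    # sockets or child processes instead).
    out = subprocess.run(
        ["pgrep", "-c", "-f", service],
        capture_output=True, text=True,
    )
    return int(out.stdout.strip() or 0)

def quiesce(service):
    # Ask for a graceful stop, then poll until drained or out of time.
    subprocess.run(["pkill", "-TERM", "-f", service])
    deadline = time.time() + QUIESCE_TIMEOUT
    while time.time() < deadline:
        if connection_count(service) == 0:
            return True  # drained cleanly within the window
        time.sleep(POLL_INTERVAL)
    # Timed out: hard-kill, accepting that in-progress operations
    # (e.g. large code imports) will be killed.
    subprocess.run(["pkill", "-KILL", "-f", service])
    return False
```

With this shape, "LP handles the hard-kill itself" just means moving the `pkill -KILL` fallback out of the deploy script and into the service's own shutdown path.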

Marking critical because I think no-downtime deploy problems are generally regarded as such.

Revision history for this message
Tom Haddon (mthaddon) wrote : Re: codehosting "quiesce" hangs during no downtime deploy

Also, it took longer than 30 minutes when run against codehost1 today. I manually killed it after 37 minutes. I manually killed codehost2 after 15 minutes, as agreed with Gary.

summary: - codeimport "quiesce" hangs during no downtime deploy
+ codehosting "quiesce" hangs during no downtime deploy
tags: added: canonical-losa-lp
Revision history for this message
Robert Collins (lifeless) wrote :

We expect long bzr operations - some clones are more than 30 minutes.

So the 30-minute window is deliberate, and -perhaps- will need to be larger.

Once bug 819604 is fixed we can lower the timeout back to a small value (e.g. 60 seconds): long enough for the server to respond, but not so long that idle connections hang around.

We need the deploy scripts to deal with this, as we will for all services that go through a quiescing mode and have long uninterruptible operations happening on them.

Changed in launchpad:
status: Triaged → Invalid
status: Invalid → Triaged
importance: Critical → High
summary: - codehosting "quiesce" hangs during no downtime deploy
+ codehosting1 still had connections 37 minutes after shutdown signal
+ was given to it
Revision history for this message
Robert Collins (lifeless) wrote :

So, next time this happens - next time it goes beyond 30 minutes - we need to get the connection information (using ps) for the connections on the instance that hasn't shut down, and identify what operations are being performed over them, so we can determine whether it's a slow operation or someone doing extensive work over bzrlib.
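The ps-based inspection suggested above can be scripted so a LOSA can capture it quickly during the next timeout. A sketch, assuming the codehosting children are identifiable by a command-line pattern (the `bzr serve` pattern below is an assumption; substitute the actual process name):

```python
import subprocess

def list_service_processes(pattern="bzr serve"):
    """Return the PID, elapsed time, and command line of every process
    whose command line contains `pattern` (hypothetical; adjust to the
    real codehosting child-process name). The elapsed-time column is
    what distinguishes one slow operation from a long bzrlib session."""
    out = subprocess.run(
        ["ps", "-eo", "pid,etime,args"],
        capture_output=True, text=True, check=True,
    ).stdout.splitlines()
    rows = out[1:]  # drop the header line
    return [line for line in rows if pattern in line]
```

Capturing this output at the moment the quiesce times out gives us the evidence needed to choose between a longer wait period and the idle-point disconnect described below.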

If it's the former, we need a longer wait period; if it's the latter, we need to drop the connection at an idle point between two commands - and we may need a little bzrlib cooperation to make this happen (e.g. signal the bzrlib services once the master is told to shut down, so they can shut down at the first opportunity they get).

