codehosting1 still had connections 37 minutes after shutdown signal was given to it

Bug #819884 reported by Gary Poster
Affects: Launchpad itself
Status: Triaged
Importance: High
Assigned to: Unassigned

Bug Description

LOSAs quiesce LP before a no-downtime deploy using https://pastebin.canonical.com/50597/ . Timeout is currently 30 minutes. Tom would like to decrease the timeout to 10. If timing out is OK (presumably this means that large in-progress code imports will be killed) then that's maybe all we need to do, though it might be nice if LP handles the hard-kill itself. If it is not OK, we need to figure out what to do about it.
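The quiesce step described above amounts to "ask the service to stop, wait for connections to drain, and hard-kill on timeout". A minimal sketch of that loop, assuming hypothetical process names (`pgrep`/`pkill` matching on the service name stands in for whatever the real quiesce script does):

```python
import subprocess
import time

QUIESCE_TIMEOUT = 10 * 60  # proposed 10-minute window (currently 30 minutes)
POLL_INTERVAL = 30

def connection_count(service):
    # Count live processes matching the service name as a proxy for
    # open client connections (hypothetical; the real script may count
    # sockets or child processes instead).
    out = subprocess.run(
        ["pgrep", "-c", "-f", service],
        capture_output=True, text=True,
    )
    return int(out.stdout.strip() or 0)

def quiesce(service):
    # Ask for a graceful stop, then poll until drained or out of time.
    subprocess.run(["pkill", "-TERM", "-f", service])
    deadline = time.time() + QUIESCE_TIMEOUT
    while time.time() < deadline:
        if connection_count(service) == 0:
            return True  # drained cleanly within the window
        time.sleep(POLL_INTERVAL)
    # Timed out: hard-kill, accepting that in-progress operations
    # (e.g. large code imports) will be killed.
    subprocess.run(["pkill", "-KILL", "-f", service])
    return False
```

With this shape, "LP handles the hard-kill itself" just means moving the `pkill -KILL` fallback out of the deploy script and into the service's own shutdown path.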

Marking critical because I think no-downtime deploy problems are generally regarded as such.

Revision history for this message
Tom Haddon (mthaddon) wrote : Re: codehosting "quiesce" hangs during no downtime deploy

Also, it took longer than 30 minutes when run against codehost1 today. I manually killed it after 37 minutes. I manually killed codehost2 after 15 minutes, as agreed with Gary.

summary: - codeimport "quiesce" hangs during no downtime deploy
+ codehosting "quiesce" hangs during no downtime deploy
tags: added: canonical-losa-lp
Revision history for this message
Robert Collins (lifeless) wrote :

We expect long bzr operations - some clones are more than 30 minutes.

So the 30-minute window is deliberate, and -perhaps- will need to be larger.

Once bug 819604 is fixed we can lower the timeout back to a small value (e.g. 60 seconds): long enough for the server to respond, but not so long that idle connections hang around.

We need the deploy scripts to deal with this, as we will for all services that go through a quiescing mode and have long uninterruptible operations happening on them.

Changed in launchpad:
status: Triaged → Invalid
status: Invalid → Triaged
importance: Critical → High
summary: - codehosting "quiesce" hangs during no downtime deploy
+ codehosting1 still had connections 37 minutes after shutdown signal
+ was given to it
Revision history for this message
Robert Collins (lifeless) wrote :

So, next time this happens - next time it goes beyond 30 minutes - we need to get the connection information (using ps) for the connections on the instance that hasn't shut down, and identify what operations are being performed over them, so we can determine whether it's a slow operation or someone doing extensive work over bzrlib.
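The ps-based inspection suggested above can be scripted so a LOSA can capture it quickly during the next timeout. A sketch, assuming the codehosting children are identifiable by a command-line pattern (the `bzr serve` pattern below is an assumption; substitute the actual process name):

```python
import subprocess

def list_service_processes(pattern="bzr serve"):
    """Return the PID, elapsed time, and command line of every process
    whose command line contains `pattern` (hypothetical; adjust to the
    real codehosting child-process name). The elapsed-time column is
    what distinguishes one slow operation from a long bzrlib session."""
    out = subprocess.run(
        ["ps", "-eo", "pid,etime,args"],
        capture_output=True, text=True, check=True,
    ).stdout.splitlines()
    rows = out[1:]  # drop the header line
    return [line for line in rows if pattern in line]
```

Capturing this output at the moment the quiesce times out gives us the evidence needed to choose between a longer wait period and the idle-point disconnect described below.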

If it's the former, we need a longer wait period; if it's the latter, we need to drop the connection at an idle point between two commands - and we may need a little bzrlib cooperation to make this happen (e.g. signal the bzrlib services once the master is told to shut down, so they can shut down at the first opportunity they get).

