Updates to SFTP server leak file handles in use_forking_server mode
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Launchpad itself | Fix Released | High | John A Meinel | 11.03
Bug Description
Bug Description
We tried to enable "use_forking_server = True" on production, and within about an hour it started failing due to "No available file descriptors".
It is unclear where the leak is, since other load testing on qastaging has not produced similar failures.
However, it does prevent enabling it on production.
Related branches
- Brad Crittenden (community): Approve (code)
- Benji York (community): Approve (code*)
Diff: 137 lines (+62/-31), 1 file modified: lib/lp/codehosting/sshserver/session.py (+62/-31)
John A Meinel (jameinel) wrote : | #1 |
John A Meinel (jameinel) wrote : | #2 |
I ran hammer_ssh against qastaging from devpad. Using these options:
$ python ~/hammer_ssh.py --load=200 --delay-
That attempts to spawn up to 200 concurrent processes all trying to run "echo hello | ssh bazaar.
Spawned 1791 connections in 66.106s
average time: 6.893s
1791 connections in 66.1s means we started 27 connections per second. Since the average run time was 6.9s, that means we averaged 27*6.9=187 concurrent connections.
Not all of these would have had active bzr+ssh processes at the same time. However, it does show that we can spawn thousands of connections over time, and it doesn't appear to be leaking any file handles.
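Those figures can be sanity-checked with Little's law (mean concurrency = arrival rate × mean time per request). A quick check using the numbers quoted above:

```python
# Back-of-the-envelope check of the load-test numbers quoted above,
# using Little's law: mean concurrency = arrival rate * mean service time.
connections = 1791
wall_clock_s = 66.1
avg_request_s = 6.9

arrival_rate = connections / wall_clock_s        # ~27 connections/second
mean_concurrent = arrival_rate * avg_request_s   # ~187 concurrent connections

print(f"{arrival_rate:.0f} conn/s, ~{mean_concurrent:.0f} concurrent")
```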
Robert Collins (lifeless) wrote : | #3 |
Can you try running up 400 concurrent processes? Expected behaviour is graceful queuing I guess.
John A Meinel (jameinel) wrote : | #4 |
I did try 300, which gave me 1950 connections in 66s, and slightly slower average time.
I think the issue there is probably the connection handshake overhead. With 1 concurrent connection, the connection time is about 200ms. So we probably can't do much more than 20-40 connections per second.
Robert Collins (lifeless) wrote : | #5 |
So, I'd ramp up with (say) 4 new connections in parallel, and wait for the hello rpc to work before starting another new connection.
We got some new evidence too:
12:52 < spm> erm. is there a reason we'd have 49 x /usr/bin/python2.6 /srv/bazaar.
Suggesting we're failing to clean up, or something.
Francis J. Lacoste (flacoste) wrote : | #6 |
Michael, do you have any insights into how to diagnose this problem?
John A Meinel (jameinel) wrote : | #7 |
As Robert mentions, the problem could well be not that we are leaking file handles, but that we are leaking processes that aren't shutting down when they should.
I saw that a little bit in the bzr-lp-
Though if a sysadmin force-killed the master process, then it wouldn't kill the child processes, etc.
If we can get a soft kill on some of those processes, to see what traceback they put into the log file, that could be pretty helpful. I would try SIGTERM, SIGINT, SIGABRT and maybe SIGQUIT before SIGKILL. SIGQUIT could have problems because it would want to drop into the debugger, but maybe that in itself would give us some info.
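The escalation described above could be scripted roughly like this (a generic sketch, not Launchpad code; the signal order follows the suggestion above, and `wait_s` is an arbitrary grace period):

```python
# Sketch: escalate through 'soft' signals, giving the process a chance to
# write a traceback to its log, before resorting to SIGKILL.
import signal
import subprocess

def soft_kill(proc, wait_s=5.0):
    """Try increasingly forceful signals on a subprocess.Popen.

    Returns True if a soft signal killed it, False if SIGKILL was needed.
    """
    # SIGQUIT may try to drop into a debugger, but even that can be informative.
    for sig in (signal.SIGTERM, signal.SIGINT, signal.SIGABRT, signal.SIGQUIT):
        proc.send_signal(sig)
        try:
            proc.wait(timeout=wait_s)
            return True          # died from a soft signal
        except subprocess.TimeoutExpired:
            continue             # ignored it; try the next signal
    proc.kill()                  # SIGKILL: last resort, no traceback from this
    proc.wait()
    return False
```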
Michael Hudson-Doyle (mwhudson) wrote : | #8 |
I guess it would have been good to have run lsof on the process when it was running out of file handles :/
I also suppose it might be good to call setproctitle in the forking service's children to record some information that way.
Apart from load, which jam seems to have covered, I can't think of any meaningful difference between staging and production here. Maybe we should hold some connections open for longer in the load testing?
John A Meinel (jameinel) wrote : Re: [Bug 717345] Re: Updates to SFTP server leak file handles in use_forking_server mode | #9 |
On 2/15/2011 3:41 PM, Michael Hudson-Doyle wrote:
> I guess it would have been good to have run lsof on the process when it
> was running out of file handles :/
>
> I also suppose it might be good to call setproctitle in the forking
> service's children to record some information that way.
I'm pretty sure it does, since that is done already in the lp-serve
context. I'm not sure why "ps" doesn't see it, but it appears that top
does...
>
> Apart from load, which jam seems to have covered, I can't think of any
> meaningful difference between staging and production here. Maybe we
> should hold some connections open for longer in the load testing?
>
Yeah, I'm working on that in my script now. "--request-count" will set
how many times a given connection says "hello", waits for a response,
and says it again.
I've seen the script start crashing for various reasons (which looks
like bad messages, rejections, etc.), but the script doesn't handle
failure very well (tends to spew 100s of lines to the terminal). So once
I've sorted that out, I'll see what's up.
SPM noticed that there were about 50 'zombie' launchpad-
processes still running on Crowberry. So I'm guessing they forcibly
killed the master process before the 300s timeout, which then prevented
it from forcibly killing all of its children.
I would really like to have known what state those 50 processes were in.
If they hit a failure that wasn't being handled and were somehow left in an
"I'm still running" state as far as the Conch server was concerned, that
would have held those handles open indefinitely.
John
=:->
William Grant (wgrant) wrote : | #10 |
Those launchpad-
A few other processes were straced, with every one hung opening one of the fifos.
William Grant (wgrant) wrote : | #11 |
Digging more into the original production incident, I have further insight and also compelling evidence that all 47 forking service processes killed well after the incident were hung the same way as tellurium's.
At 11:01, right before the file handle exhaustion began, 117 forking services were running. Assuming that bzr-sftp needs 3 handles to talk to each child plus 1 for the SSH connection, that should only be around 500 handles. Well below the limit.
At 11:23 there were 140 forking services. 64 children remained from before the file handle exhaustion, the oldest from a minute after codehosting was started post-rollout. 48 children were uninitialised -- they all had the original process title, not yet having had the branch path appended.
The master forking service was stopped a few seconds after those counts were taken, at which point it began a 300s timeout waiting for all its children to die. At 11:24 136 remained, but by 11:25 the master had been kill -9'd and this was down to 47. Those same 47 remained for days until a LOSA killed them all, and all were uninitialised, so it looks like every initialised process died in a timely manner.
Even though some were more than an hour old, the 117 children alive at the time of the major incident should not have been enough to do any damage. The forked children don't seem to be *too* badly behaved, as we often have multi-hour processes with the old service too.
William Grant (wgrant) wrote : | #12 |
There are actually 5 handles per connection: the extra is a unix socket to the forking master.
Locally I cannot reproduce the suspected initial leak. However, once the FD limit is reached it starts leaking huge numbers of master sockets and a few child fifos. So an lsof afterwards would probably have had limited utility.
John A Meinel (jameinel) wrote : | #13 |
On 2/16/2011 4:18 AM, William Grant wrote:
> There are actually 5 handles per connection: the extra is a unix socket
> to the forking master.
>
> Locally I cannot reproduce the suspected initial leak. However, once the
> FD limit is reached it starts leaking huge numbers of master sockets and
> a few child fifos. So an lsof afterwards would probably have had limited
> utility.
>
So is this the FD limit in the Conch process? Causing children to be
spawned but not connected to, and then just go dormant. Do we then run
out of system handles, causing the Conch process to continue this cycle?
John
=:->
John A Meinel (jameinel) wrote : | #14 |
- Attachment: hammer_ssh.py (5.4 KiB, text/x-python)
I'm attaching the script I was using to load test qastaging. It should be usable to load test a local instance.
The way I hammered qastaging to cause failure was with:
$ python ~/hammer_ssh.py --load=200 --delay-
Which is:
1) Try to keep 200 connections active concurrently.
2) Wait 1s between spawning each request, hopefully so that you don't spawn 200, they all finish, you spawn another 200, etc. This could certainly be tweaked to 0.1s, or whatever makes sense.
3) Run the overall script for 1 hour
4) For each process, have it say "hello" 1000 times, waiting for a response after each.
We *should* be able to satisfy these 200 connections without running out of file handles. We can always increase that to 300, etc. Of course the *local* process may have issues with too many file handles as well, though I haven't seen anything that specifically points to that. It should only be 2 new handles per child (stdin and stdout; the child uses stderr from the master).
It sounds like the Conch server isn't killing off children that get only partially initialized. So say we get a new connection request, and have 2 file handles available.
We will connect to stdin and stdout, but fail to connect to stderr. At that point the child process hangs waiting for us to connect, while the master process goes on its merry way. It still *knows* about the child, so it doesn't try to kill it, and it holds stdin and stdout open in the fruitless hope that it will be able to talk to the child some day.
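A minimal sketch of that failure mode and the cleanup that would avoid it (hypothetical code, not the actual Conch/session implementation; the fifo names are placeholders):

```python
# Sketch (not Launchpad code) of the failure mode described above: the
# parent connects to the child's stdin and stdout fifos, then runs out of
# descriptors before connecting stderr. Without the except-clause cleanup,
# the half-opened handles stay open forever while the child waits.
import os

def connect_to_child(fifo_dir):
    """Open the three per-child fifos; on failure, close what we opened."""
    opened = []
    try:
        for name in ("stdin", "stdout", "stderr"):
            path = os.path.join(fifo_dir, name)
            # O_RDWR avoids blocking on a fifo with no peer yet.
            opened.append(os.open(path, os.O_RDWR))
        return opened
    except OSError:
        # The crucial cleanup the buggy path skipped: release the handles
        # we already grabbed before re-raising.
        for fd in opened:
            os.close(fd)
        raise
```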
John A Meinel (jameinel) wrote : | #15 |
I can now reproduce this locally, without needing tons of free memory, etc, thanks to the glories of 'ulimit'.
ulimit -n 100
make run_codehosting
hammer_ssh.py
python hammer_ssh.py --load=20 --request-count=200 --server bazaar.
Starts crashing pretty quickly. I'll attach the lsof output. But basically, it still has a bunch of the handles open. Not only that, but my machine still has a bunch of 'ssh' processes that are still connected, and a fair number of launchpad-
So looking pretty good from the point of being able to reproduce it.
John A Meinel (jameinel) wrote : | #16 |
Note that when terminating the codehosting service, a fair number of the spawned clients decided to wait for child processes to exit. So one of the failures is that we aren't dying cleanly from the spawned clients.
Michael Hudson-Doyle (mwhudson) wrote : | #17 |
So, to possibly state the obvious, the problem here was that the conch process ran out of file handles, followed by poor error recovery?
In which case there are two more or less questions: 1) why did conch run out of file handles? 2) can we handle this situation better?
For 1), is it just that there is an extra fd open per connection, or is there a leak? If the former, then we're presumably pretty close to hitting the fd limit in production from time to time!
John A Meinel (jameinel) wrote : | #18 |
On 2/16/2011 2:35 PM, Michael Hudson-Doyle wrote:
> So, to possibly state the obvious, the problem here was that the conch
> process ran out of file handles, followed by poor error recovery?
Basically.
>
> In which case there are two more or less questions: 1) why did conch run
> out of file handles? 2) can we handle this situation better?
1) because we currently get close to the limit of connections,
especially right after a major rollout
2) yes
>
> For 1), is it just that there is an extra fd open per connection, or is
> there a leak? If the former, then we're presumably pretty close to
> hitting the fd limit in production from time to time!
>
I'm pretty sure we already run out from time to time, but handle
recovery better.
Here is what I can sort out:
1) We might be using 1 more handle for requesting the fork, but nothing
I can find is 'leaked'.
2) We are close to running out of handles in production right now.
3) The spawnProcess() code recovers cleanly from running out of handles,
   causing *that* connection to fail while leaving future connections
   unaffected.
4) The LPForkingService code leaks handles when it fails to spawn. So
while we have 1k handles, once we've started running out of them, we
have fewer and fewer to work with.
Which means that we run into the limit more and more until the
service is completely unusable.
5) The children spawned could clean up slightly better than they do
today.
(3) is true because we create the pipes *before* we fork+exec, and
if we were to run out of handles, we would just never exec in the
first place. Since we now have to spawn a process before we create
the handles to connect to it, this layering is broken. One crummy
option is to hold open a bunch of file handles, and close them when
we want to fork a new process, etc.
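The ordering argument can be illustrated with plain os.pipe/os.fork (a generic sketch, not the actual spawnProcess implementation):

```python
# Sketch of the ordering point above: if all pipes are created *before*
# fork+exec, FD exhaustion raises in the parent and no orphaned child is
# ever created.
import os

def spawn_with_pipes(argv):
    # Step 1: allocate every handle up front. An EMFILE OSError here aborts
    # the whole spawn cleanly -- nothing spawned, no child to orphan.
    stdin_r, stdin_w = os.pipe()
    stdout_r, stdout_w = os.pipe()
    stderr_r, stderr_w = os.pipe()

    # Step 2: only now fork; the child wires up its ends and execs.
    pid = os.fork()
    if pid == 0:  # child
        os.dup2(stdin_r, 0)
        os.dup2(stdout_w, 1)
        os.dup2(stderr_w, 2)
        for fd in (stdin_r, stdin_w, stdout_r, stdout_w, stderr_r, stderr_w):
            os.close(fd)
        os.execvp(argv[0], argv)  # replaces the child process image

    # Parent: close the child's ends, keep ours.
    for fd in (stdin_r, stdout_w, stderr_w):
        os.close(fd)
    return pid, stdin_w, stdout_r, stderr_r
```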
I'm working on some code to clean up the children forked, and then I
want to clean up the twisted code to make sure we don't leak during
failure periods.
John
=:->
John A Meinel (jameinel) wrote : | #19 |
On 2/16/2011 2:35 PM, Michael Hudson-Doyle wrote:
> So, to possibly state the obvious, the problem here was that the conch
> process ran out of file handles, followed by poor error recovery?
>
> In which case there are two more or less questions: 1) why did conch run
> out of file handles? 2) can we handle this situation better?
>
> For 1), is it just that there is an extra fd open per connection, or is
> there a leak? If the former, then we're presumably pretty close to
> hitting the fd limit in production from time to time!
>
Looking again, there is one more handle per connection. It is the one we
use to detect the process exiting.
So we now have:
1) network socket from client
2) stdin
3) stderr
4) stdout
5) socket from forking server which will write, eg 'exited 10\n', when
the child process finally exits.
So we've effectively reduced our peak concurrent requests from about 250
down to about 200.
Coupled with the poor cleanup once we've gotten into that situation, it
gets bad fast.
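A quick check of those capacity figures against a 1024-descriptor limit (the overhead allowance for listening sockets, log files, etc. is an assumption, not a measured number):

```python
# Checking the capacity figures above against a 1024-descriptor limit.
FD_LIMIT = 1024
OVERHEAD = 24  # rough allowance for listener sockets, logs, etc. (assumed)

def peak_connections(handles_per_conn):
    return (FD_LIMIT - OVERHEAD) // handles_per_conn

print(peak_connections(4))  # ~250 before the extra exit-notification socket
print(peak_connections(5))  # ~200 with it
```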
Note that Andrew has already started work on allowing us to run multiple
Conch processes (and associated forking processes), so that we can
handle no-downtime deployments anyway. Which also helps us with high
availability, etc.
John
=:->
Francis J. Lacoste (flacoste) wrote : | #20 |
We could also increase the limit.
Robert Collins (lifeless) wrote : | #21 |
"
Note that Andrew has already started work on allowing us to run multiple
Conch processes (and associated forking processes), so that we can
handle no-downtime deployments anyway. Which also helps us with high
availability, etc."
what does this mean? I wouldn't want to run > 1 conch per forking
service, or vice versa. If he's simply talking about having multiple
conch processes with an haproxy front end, we're already ready to do
that; the thing that Andrew was working on for a status page is about
graceful handoffs when doing no-downtime deploys (and that's pretty
important in the long-running connection model of bzr).
-Rob
Martin Pool (mbp) wrote : | #22 |
On 17 February 2011 09:27, Robert Collins <email address hidden> wrote:
> "
> Note that Andrew has already started work on allowing us to run multiple
> Conch processes (and associated forking processes), so that we can
> handle no-downtime deployments anyway. Which also helps us with high
> availability, etc."
>
> what does this mean? I wouldn't want to run > 1 conch per forking
> service, or vice versa. If hes simply talking about having multiple
> conch processes with an haproxy front end, we're already ready to do
> that; the thing that andrew was working on for a status page is about
> graceful handoffs when doing nodowntime deploys (and thats pretty
> important in the long-running connection model of bzr).
I think it's just that John sees this feature as "being worked on"
whereas Robert sees it as "ready to deploy and to do follow-on
work" (bug 702024, or RT 40480). It would let us, if we wanted,
limit each conch to just 256 connections. I don't know of any plans
to have anything other than one conch per forking server.
John A Meinel (jameinel) wrote : | #23 |
...
>> what does this mean? I wouldn't want to run > 1 conch per forking
>> service, or vice versa. If hes simply talking about having multiple
>> conch processes with an haproxy front end, we're already ready to do
>> that; the thing that andrew was working on for a status page is about
>> graceful handoffs when doing nodowntime deploys (and thats pretty
>> important in the long-running connection model of bzr).
>
> I think it's just that John sees this feature as "being worked on"
> whereas Robert sees it as "ready to deploy it and to do follow on
> work." (bug 702024, or RT 40480). It would let us if we wanted
> limit each conch to just 256 connections. I don't know of any plans
> to have other than one conch per forking server.
>
Right. It isn't in production, thus it is "being worked on".
As an aside, I would probably limit it a bit further than that for a
safety margin. But yes, the idea is to limit within the system to
prevent blowout.
We talked about increasing the number of file handles, but when wgrant
tested that on his machine, it caused problems because 'select' doesn't
like more than 1024 handles, and python has some internal MAX_FD set, etc.
John
=:->
Michael Hudson-Doyle (mwhudson) wrote : | #24 |
On Thu, 17 Feb 2011 17:37:08 -0000, John A Meinel <email address hidden> wrote:
> We talked about increasing the number of file handles, but when wgrant
> tested that on his machine, it caused problems because 'select' doesn't
> like more than 1024 handles, and python has some internal MAX_FD set, etc.
If we're close to the select() limit in production, we should probably
be using a different reactor *anyway*.
This translates to passing -r poll or -r epoll to twistd, so I think it
would be something the losas need to change in an init script. And test
on staging first, etc etc.
Cheers,
mwh
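For context, select() can only watch descriptors numbered below FD_SETSIZE (typically 1024), while poll() has no such cap; that is the practical difference behind `-r poll`. A minimal stdlib illustration of the poll-style API:

```python
# select() can only watch descriptors numbered below FD_SETSIZE (typically
# 1024); poll() has no such cap, which is why -r poll / -r epoll helps here.
import select
import socket

a, b = socket.socketpair()
b.send(b"ping")

# poll-based: register any fd, with no upper bound on the fd number.
p = select.poll()
p.register(a.fileno(), select.POLLIN)
events = p.poll(1000)  # list of (fd, eventmask); a is readable

print(events[0][0] == a.fileno())  # prints True

a.close()
b.close()
```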
John A Meinel (jameinel) wrote : | #25 |
On 2/17/2011 2:02 PM, Michael Hudson-Doyle wrote:
> On Thu, 17 Feb 2011 17:37:08 -0000, John A Meinel <email address hidden> wrote:
>> We talked about increasing the number of file handles, but when wgrant
>> tested that on his machine, it caused problems because 'select' doesn't
>> like more than 1024 handles, and python has some internal MAX_FD set, etc.
>
> If we're close to the select() limit in production, we should probably
> be using a different reactor *anyway*.
>
> This translates to passing -r poll or -r epoll to twistd, so I think it
> would be something the losas need to change in an init script. And test
> on staging first, etc etc.
>
> Cheers,
> mwh
>
Or end-run around it and go with haproxy load between multiple instances
and cap their concurrent requests to about 100 or so. Which has a lot of
other benefits.
John
=:->
William Grant (wgrant) wrote : | #26 |
I tested --reactor poll yesterday and got up to 500 concurrent connections, at which point hammer_ssh ran out of FDs. So it seems to work.
John A Meinel (jameinel) wrote : | #27 |
On 2/17/2011 3:53 PM, William Grant wrote:
> I tested --reactor poll yesterday and got up to 500 concurrent
> connections, at which point hammer_ssh ran out of FDs. So it seems to
> work.
>
Did you also use ulimit to change the number of available fds?
John
=:->
William Grant (wgrant) wrote : | #28 |
On 19/02/11 03:03, John A Meinel wrote:
> On 2/17/2011 3:53 PM, William Grant wrote:
>> I tested --reactor poll yesterday and got up to 500 concurrent
>> connections, at which point hammer_ssh ran out of FDs. So it seems to
>> work.
>
>
> Did you also use ulimit to change the number of available fds?
Ah, yes, I increased it to 10000 in limits.conf.
Launchpad QA Bot (lpqabot) wrote : | #29 |
Fixed in stable r12415 (http://
Changed in launchpad: milestone: none → 11.03; tags: added: qa-untestable
Changed in launchpad: status: In Progress → Fix Committed
Changed in launchpad: status: Fix Committed → Fix Released
Note that I have been unable to reproduce file handle leaks in dev or on qastaging. I added code to log what file handles are used for each spawn, and after doing lots of concurrent requests, the handles reset back to the low numbered handles.
I don't know what could have been going on in production that would have been different from what happens locally. Still investigating.
We might need some LOSA time to test code on qastaging, in case it is a site-specific issue. I don't see how it should be, but if I can't reproduce it, I really don't know how to fix it.