Updates to SFTP server leak file handles in use_forking_server mode

Bug #717345 reported by John A Meinel on 2011-02-11
8
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Launchpad itself
High
John A Meinel

Bug Description

We tried to enable "use_forking_server = True" on production, and in about 1 hour it seemed to start failing do to "No available file descriptors".

It is unclear where the leak is, since other load testing on QAStaging has failed to see similar failures.

However, it does prevent enabling it on production.

Related branches

Revision history for this message
John A Meinel (jameinel) wrote :

Note that I have been unable to reproduce file handle leaks in dev or on qastaging. I added code to log what file handles are used for each spawn, and after doing lots of concurrent requests, the handles reset back to the low numbered handles.

I don't know what could have been going on in production that would have been different than locally. Still investigating.

We might need some LOSA time to test code on qastaging, in case it is a site-specific issue. I don't see how it should be, but if I can't reproduce it, I really don't know how to fix it.

Revision history for this message
John A Meinel (jameinel) wrote :

I ran hammer_ssh against qastaging from devpad. Using these options:
$ python ~/hammer_ssh.py --load=200 --delay-after-spawn=0 --run-time=60 --server=bazaar.qastaging.launchpad.net

That attempts to spawn up to 200 concurrent processes all trying to run "echo hello | ssh bazaar.qastaging.launchpad.net" and when ones disconnect, it spawns another one for 60 seconds. The result was:

Spawned 1791 connections in 66.106s
average time: 6.893s

1791 connections in 66.1s means we started 27 connections per second. Since the average run time was 6.9s, that means we averaged 27*6.9=187 concurrent connections.

Not all of these would have had active bzr+ssh processes at the same time. However, it does show that we can spawn thousands of connections over time, and it doesn't appear to be leaking any file handles.

Revision history for this message
Robert Collins (lifeless) wrote :

Can you try running up 400 concurrent processes? Expected behaviour is graceful queuing I guess.

Revision history for this message
John A Meinel (jameinel) wrote :

I did try 300, which gave me 1950 connections in 66s, and slightly slower average time.

I think the issue there is probably the connection handshake overhead. With 1 concurrent connection, the connection time is about 200ms. So we probably can't do much more than 20-40 connections per second.

Revision history for this message
Robert Collins (lifeless) wrote :

So, I'd ramp up with (say) 4 new connections in parallel, and wait for the hello rpc to work before starting another new connection.

We got some new evidence too:

12:52 < spm> erm. is tehre a reason we'd have 49 x /usr/bin/python2.6 /srv/bazaar.launchpad.net/production/launchpad-rev-12351/eggs/bzr-2.2.2_lp_2-py2.6-linux-x86_64.egg/EGG-INFO/scripts/bzr launchpad-forking-service --path
             /srv/bazaar.launchpad.net/var/launchpad_forking_service.socket --pid-file /srv/bazaar.launchpad.net/var/launchpad_forking_service.pid

Suggesting we're failing to clean up, or something.

Revision history for this message
Francis J. Lacoste (flacoste) wrote :

Michael, do you have any insights into how to diagnose this problem?

Revision history for this message
John A Meinel (jameinel) wrote :

I think Robert mentions it, but I think the problem could certainly be *not* that we are leaking handles, but leaking processes that aren't shutting down when they should.

I saw that a little bit in the bzr-lp-forking-service.log. When the shutdown request came, there were 138 "active" children, but not all of them shut down. I think some of them didn't even shut down when the master process signaled them (SIGINT? SIGTERM? Not sure which). I'm pretty sure the master process will even send SIGKILL if things haven't shut down cleanly by the time it decides to exit.

Though if a sysadmin force-killed the master process, then it wouldn't kill the child processes, etc.

If we can get a soft kill on some of those processes, to see what traceback they put into the log file, that could be pretty helpful. I would try SIGTERM, SIGINT, SIGABRT and maybe SIGQUIT before SIGKILL. SIGQUIT could have problems because it would want to drop into the debugger, but maybe that in itself would give us some info.

Revision history for this message
Michael Hudson-Doyle (mwhudson) wrote :

I guess it would have been good to have run lsof on the process when it was running out of file handles :/

I also suppose it might be good to call setproctitle in the forking service's children to record some information that way.

Apart from load, which jam seems to have covered, I can't think of any meaningful difference between staging and production here. Maybe we should hold some connections open for longer in the load testing?

Revision history for this message
John A Meinel (jameinel) wrote : Re: [Bug 717345] Re: Updates to SFTP server leak file handles in use_forking_server mode

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 2/15/2011 3:41 PM, Michael Hudson-Doyle wrote:
> I guess it would have been good to have run lsof on the process when it
> was running out of file handles :/
>
> I also suppose it might be good to call setproctitle in the forking
> service's children to record some information that way.

I'm pretty sure it does, since that is done already in the lp-serve
context. I'm not sure why "ps" doesn't see it, but it appears that top
does...

>
> Apart from load, which jam seems to have covered, I can't think of any
> meaningful difference between staging and production here. Maybe we
> should hold some connections open for longer in the load testing?
>

Yeah, I'm working on that in my script now. "--request-count" will set
how many times a given connection says "hello", waits for a response,
and says it again.

I've seen the script start crashing for various reasons (which looks
like bad messages, rejections, etc.), but the script doesn't handle
failure very well (tends to spew 100s of lines to the terminal). So once
I've sorted that out, I'll see what's up.

SPM noticed that there were about 50 'zombie' launchpad-forking-service
processes still running on Crowberry. So I'm guessing they forcibly
killed the master process before the 300s timeout, which then prevented
it from forcibly killing all of its children.

I would really like to have known what state those 50 processes were in.
If they got a failure that wasn't being handled, and somehow left them
in "I'm still running" vs the Conch server, so that it held those
handles open indefinitely.

John
=:->
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Cygwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAk1a9uUACgkQJdeBCYSNAAOhywCeOFiy1weAEmuVTexZ7YdMPAmm
X7gAni0IP33pxgT3WLXs8+t34T/dq7ZZ
=QcJi
-----END PGP SIGNATURE-----

Revision history for this message
William Grant (wgrant) wrote :

Those launchpad-forking-service processes were probably similar to the ones we just found on tellurium from your load tests: all hung trying to open the fifos, because bzr-sftp didn't have enough file handles to connect. The logs show 30810, 30811 start normally, then the next one run out of handles. Given that the oldest live process is 30812, it is probably the first victim. Nothing from before the file handle exhaustion is still alive.

A few other processes were straced, with every one hung opening one of the fifos.

Revision history for this message
William Grant (wgrant) wrote :

Digging more into the original production incident, I have further insight and also compelling evidence that all 47 forking service processes killed well after the incident were hung the same way as tellurium's.

At 11:01, right before the file handle exhaustion began, 117 forking services were running. Assuming that bzr-sftp needs 3 handles to talk to each child plus 1 for the SSH connection, that should only be around 500 handles. Well below the limit.

At 11:23 there were 140 forking services. 64 children remained from before file handle exhaustion, the oldest from a minute after codehosting was started post-rollout. 48 children were uninitialized -- they all had the original ps, having not yet had the branch path appended.

The master forking service was stopped a few seconds after those counts were taken, at which point it began a 300s timeout waiting for all its children to die. At 11:24 136 remained, but by 11:25 the master had been kill -9'd and this was down to 47. Those same 47 remained for days until a LOSA killed them all, and all were uninitialised, so it looks like every initialised processes died in a timely manner.

Even though some were more than an hour old, the 117 children alive at the time of the major incident should not have been enough to do any damage. The forked children doesn't seem to be *too* badly behaved, as we often have multi-hour processes with the old service too.

Revision history for this message
William Grant (wgrant) wrote :

There are actually 5 handles per connection: the extra is a unix socket to the forking master.

Locally I cannot reproduce the suspected initial leak. However, once the FD limit is reached it starts leaking huge numbers of master sockets and a few child fifos. So an lsof afterwards would probably have had limited utility.

Revision history for this message
John A Meinel (jameinel) wrote :

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 2/16/2011 4:18 AM, William Grant wrote:
> There are actually 5 handles per connection: the extra is a unix socket
> to the forking master.
>
> Locally I cannot reproduce the suspected initial leak. However, once the
> FD limit is reached it starts leaking huge numbers of master sockets and
> a few child fifos. So an lsof afterwards would probably have had limited
> utility.
>

So is this the FD limit in the Conch process? Causing children to be
spawned but not connected to, and then just go dormant. Do we then run
out of system handles, causing the Conch process to continue this cycle?

John
=:->

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Cygwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAk1b8U4ACgkQJdeBCYSNAAMB0QCgmsOzKubvn1beGHofKQbCB+jt
6SsAni0abIo+9sGJtyMPNwNO+cS8Mk13
=dKwp
-----END PGP SIGNATURE-----

Revision history for this message
John A Meinel (jameinel) wrote :

I'm attaching the script I was using to load test qastaging. It should be usable to load test a local instance.

The way I hammered qastaging to cause failure was with:
$ python ~/hammer_ssh.py --load=200 --delay-after-spawn=1 --run-time=3600 --server=bazaar.qastaging.launchpad.net --request-count=1000

Which is:

1) Try to keep 200 connections active concurrently.
2) Wait 1s between spawning each request, hopefully so that you don't spawn 200, they all finish, you spawn another 200, etc. This could certainly be tweaked to 0.1s, or whatever makes sense.
3) Run the overall script for 1 hour
4) For each process, have it say "hello" 1000 times, waiting for a response after each.

We *should* be able to satisfy these 200 connections without running out of file handles. We can always increase that to 300, etc. Of course the *local* process may have issues with too many file handles as well. I haven't seen anything that I would specifically point at that. It should only be 2 new handles per child (stdin, stdout, child uses stderr from the master.)

It sounds like the Conch server isn't killing off children that get only partially initialized. So say we get a new connection request, and have 2 file handles available.
We will connect to stdin, and stdout, but fail to connect to stderr. At which point, the child process gets hung waiting for us to connect, and the master process decides to go on its merry way. It still *knows* about the child, so it doesn't try to kill it, and it holds stdin and stdout open, in the fruitless hopes that it will be able to talk to the child some day later.

Revision history for this message
John A Meinel (jameinel) wrote :

I can now reproduce this locally, without needing tons of free memory, etc, thanks to the glories of 'ulimit'.

ulimit -n 100
make run_codehosting

hammer_ssh.py
python hammer_ssh.py --load=20 --request-count=200 --server bazaar.launchpad.dev --run-time=20

Starts crashing pretty quickly. I'll attach the lsof output. But basically, it still has a bunch of the handles open. Not only that, but my machine still has a bunch of 'ssh' processes that are still connected, and a fair number of launchpad-forking-service children.

So looking pretty good from the point of being able to reproduce it.

Revision history for this message
John A Meinel (jameinel) wrote :

Note that when terminating the codehosting service, a fair number of the spawned clients decided to wait for child processes to exit. So one of the failures is that we aren't dying cleanly from the spawned clients.

Revision history for this message
Michael Hudson-Doyle (mwhudson) wrote :

So, to possibly state the obvious, the problem here was that the conch process ran out of file handles, followed by poor error recovery?

In which case there are two more or less questions: 1) why did conch run out of file handles? 2) can we handle this situation better?

For 1), is it just that there is an extra fd open per connection, or is there a leak? If the former, then we're presumably pretty close to hitting the fd limit in production from time to time!

Revision history for this message
John A Meinel (jameinel) wrote :

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 2/16/2011 2:35 PM, Michael Hudson-Doyle wrote:
> So, to possibly state the obvious, the problem here was that the conch
> process ran out of file handles, followed by poor error recovery?

Basically.

>
> In which case there are two more or less questions: 1) why did conch run
> out of file handles? 2) can we handle this situation better?

1) because we currently get close to the limit of connections,
   especially right after a major rollout
2) yes

>
> For 1), is it just that there is an extra fd open per connection, or is
> there a leak? If the former, then we're presumably pretty close to
> hitting the fd limit in production from time to time!
>

I'm pretty sure we already run out from time to time, but handle
recovery better.

Here is what I can sort out:

1) We might be using 1 more handle for requesting the fork, but nothing
   I can find is 'leaked'.
2) We are close to running out of handles in production right now.
3) The spawnProcess() code recovers cleanly from running out of
   handles. Causing *that* connection to fail, but future connections
   are unaffected.
4) The LPForkingService code leaks handles when it fails to spawn. So
   while we have 1k handles, once we've started running out of them, we
   have fewer and fewer to work with.
   Which means that we run into the limit more and more until the
   service is completely unusable.
5) The children spawned could clean up slightly better than they do
   today.
   (3) is true because we create the pipes *before* we fork+exec, and
   if we were to run out of handles, we would just never exec in the
   first place. Since we now have to spawn a process before we create
   the handles to connect to it, this layering is broken. One crummy
   option is to hold open a bunch of file handles, and close them when
   we want to fork a new process, etc.

I'm working on some code to clean up the children forked, and then I
want to clean up the twisted code to make sure we don't leak during
failure periods.

John
=:->
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Cygwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAk1cQAsACgkQJdeBCYSNAAOpmgCeJGFAbKYhsCLuc+woQQ5/TqxX
pQ8AoIu2f+voehoIdcxpxeB11Lfhlinj
=oy5y
-----END PGP SIGNATURE-----

Revision history for this message
John A Meinel (jameinel) wrote :

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 2/16/2011 2:35 PM, Michael Hudson-Doyle wrote:
> So, to possibly state the obvious, the problem here was that the conch
> process ran out of file handles, followed by poor error recovery?
>
> In which case there are two more or less questions: 1) why did conch run
> out of file handles? 2) can we handle this situation better?
>
> For 1), is it just that there is an extra fd open per connection, or is
> there a leak? If the former, then we're presumably pretty close to
> hitting the fd limit in production from time to time!
>

Looking again, there is one more handle per connection. It is the one we
use to detect the process exiting.

So we now have:

1) network socket from client
2) stdin
3) stderr
4) stdout
5) socket from forking server which will write, eg 'exited 10\n', when
   the child process finally exits.

So we've effectively reduced our peak concurrent requests from about 250
down to about 200.

Coupled with the poor cleanup once we've gotten into that situation, it
gets bad fast.

Note that Andrew has already started work on allowing us to run multiple
Conch processes (and associated forking processes), so that we can
handle no-downtime deployments anyway. Which also helps us with high
availability, etc.

John
=:->
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Cygwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAk1cRRcACgkQJdeBCYSNAAPQ6gCeM3xSmiLLMcd0/XCPwpM1kJkw
86sAoM3lbLg43aNRMEzf03ZVztKDYS9s
=eUmX
-----END PGP SIGNATURE-----

Revision history for this message
Francis J. Lacoste (flacoste) wrote :

We could also increase the limit.

Revision history for this message
Robert Collins (lifeless) wrote :

"
Note that Andrew has already started work on allowing us to run multiple
Conch processes (and associated forking processes), so that we can
handle no-downtime deployments anyway. Which also helps us with high
availability, etc."

what does this mean? I wouldn't want to run > 1 conch per forking
service, or vice versa. If hes simply talking about having multiple
conch processes with an haproxy front end, we're already ready to do
that; the thing that andrew was working on for a status page is about
graceful handoffs when doing nodowntime deploys (and thats pretty
important in the long-running connection model of bzr).

-Rob

Revision history for this message
Martin Pool (mbp) wrote :

On 17 February 2011 09:27, Robert Collins <email address hidden> wrote:
> "
> Note that Andrew has already started work on allowing us to run multiple
> Conch processes (and associated forking processes), so that we can
> handle no-downtime deployments anyway. Which also helps us with high
> availability, etc."
>
> what does this mean? I wouldn't want to run > 1 conch per forking
> service, or vice versa. If hes simply talking about having multiple
> conch processes with an haproxy front end, we're already ready to do
> that; the thing that andrew was working on for a status page is about
> graceful handoffs when doing nodowntime deploys (and thats pretty
> important in the long-running connection model of bzr).

I think it's just that John sees this feature as "being worked on"
whereas Robert sees it as "ready to deploy it and to do follow on
work." (bug 702024, or RT 40480). It would let us if we wanted
limit each conch to just 256 connections. I don't know of any plans
to have other than one conch per forking server.

Revision history for this message
John A Meinel (jameinel) wrote :

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

...
>> what does this mean? I wouldn't want to run > 1 conch per forking
>> service, or vice versa. If hes simply talking about having multiple
>> conch processes with an haproxy front end, we're already ready to do
>> that; the thing that andrew was working on for a status page is about
>> graceful handoffs when doing nodowntime deploys (and thats pretty
>> important in the long-running connection model of bzr).
>
> I think it's just that John sees this feature as "being worked on"
> whereas Robert sees it as "ready to deploy it and to do follow on
> work." (bug 702024, or RT 40480). It would let us if we wanted
> limit each conch to just 256 connections. I don't know of any plans
> to have other than one conch per forking server.
>

Right. It isn't in production, thus it is "being worked on".

As an aside, I would probably limit it a bit further than that for
safety margin, etc. But yes, theoretically limiting in the system to
prevent blowout.

We talked about increasing the number of file handles, but when wgrant
tested that on his machine, it caused problems because 'select' doesn't
like more than 1024 handles, and python has some internal MAX_FD set, etc.

John
=:->
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Cygwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAk1dXMQACgkQJdeBCYSNAANqKwCdFGKBfdwxmAoDTC3QxMwGWWjB
kdoAn2wCobBycdC35cmeTzF0RfU1Nhw7
=/Dni
-----END PGP SIGNATURE-----

Revision history for this message
Michael Hudson-Doyle (mwhudson) wrote :

On Thu, 17 Feb 2011 17:37:08 -0000, John A Meinel <email address hidden> wrote:
> We talked about increasing the number of file handles, but when wgrant
> tested that on his machine, it caused problems because 'select' doesn't
> like more than 1024 handles, and python has some internal MAX_FD set, etc.

If we're close to the select() limit in production, we should probably
be using a different reactor *anyway*.

This translates to passing -r poll or -r epoll to twistd, so I think it
would be something the losas need to change in an init script. And test
on staging first, etc etc.

Cheers,
mwh

Revision history for this message
John A Meinel (jameinel) wrote :

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 2/17/2011 2:02 PM, Michael Hudson-Doyle wrote:
> On Thu, 17 Feb 2011 17:37:08 -0000, John A Meinel <email address hidden> wrote:
>> We talked about increasing the number of file handles, but when wgrant
>> tested that on his machine, it caused problems because 'select' doesn't
>> like more than 1024 handles, and python has some internal MAX_FD set, etc.
>
> If we're close to the select() limit in production, we should probably
> be using a different reactor *anyway*.
>
> This translates to passing -r poll or -r epoll to twistd, so I think it
> would be something the losas need to change in an init script. And test
> on staging first, etc etc.
>
> Cheers,
> mwh
>

Or end-run around it and go with haproxy load between multiple instances
and cap their concurrent requests to about 100 or so. Which has a lot of
other benefits.

John
=:->
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Cygwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAk1dlV8ACgkQJdeBCYSNAANh3QCghFxN6Y45YfR6u+Y2BuHeLzGY
N0YAoNB/WuE9RXBas+d6vJgOCHIcuoOF
=l/Dl
-----END PGP SIGNATURE-----

Revision history for this message
William Grant (wgrant) wrote :

I tested --reactor poll yesterday and got up to 500 concurrent connections, at which point hammer_ssh ran out of FDs. So it seems to work.

Revision history for this message
John A Meinel (jameinel) wrote :

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 2/17/2011 3:53 PM, William Grant wrote:
> I tested --reactor poll yesterday and got up to 500 concurrent
> connections, at which point hammer_ssh ran out of FDs. So it seems to
> work.
>

Did you also use ulimit to change the number of available fds?

John
=:->

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Cygwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAk1emEkACgkQJdeBCYSNAAP5rwCcDKin/gf0bJLo4tCAXUKEV+Am
0iMAn1AW3OB3JfXZ0nWqZOBTaBk4Qruk
=ifiw
-----END PGP SIGNATURE-----

Revision history for this message
William Grant (wgrant) wrote :

On 19/02/11 03:03, John A Meinel wrote:
> On 2/17/2011 3:53 PM, William Grant wrote:
>> I tested --reactor poll yesterday and got up to 500 concurrent
>> connections, at which point hammer_ssh ran out of FDs. So it seems to
>> work.
>
>
> Did you also use ulimit to change the number of available fds?

Ah, yes, I increased it to 10000 in limits.conf.

Revision history for this message
Launchpad QA Bot (lpqabot) wrote :

Fixed in stable r12415 (http://bazaar.launchpad.net/~launchpad-pqm/launchpad/stable/revision/12415) by a commit, but not testable.

Changed in launchpad:
milestone: none → 11.03
tags: added: qa-untestable
Changed in launchpad:
status: In Progress → Fix Committed
Curtis Hovey (sinzui) on 2011-03-10
Changed in launchpad:
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers

Bug attachments