Bazaar

bzr selftest is leaking threads, eventually bombs with "can't start new thread"

Bug #417053 reported by Denys Duchier on 2009-08-21

This bug report is a duplicate of: Bug #392127: selftest fails with "can't start new thread".

This bug affects 1 person

Affects   Status      Importance   Assigned to   Milestone
Bazaar    Confirmed   Low          Unassigned

Bug Description

I cannot run the full selftest. It eventually bombs with "can't start new thread".
I am running bzr.dev, with python 2.6.2, on Gentoo/Linux, on an XPS M1330.
I tried to track the problem down, but I seem to have become lost in spaghetti space :-(
I can guarantee that threads are leaking. I will attach a diff that monitors and instruments
the leaks in the hope that you can reproduce and understand what I am seeing.
Perhaps you can suggest a better way for me to track down the problem.
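
The attached diff is not reproduced in this report. As a rough illustration of the idea only, a leak monitor could count live threads after each test and complain once a limit is exceeded. The names `check_thread_leak`, `ThreadLeakError`, and the `limit` parameter are assumptions made for this sketch, not taken from the patch.

```python
import threading


class ThreadLeakError(Exception):
    """Raised when more threads are alive than the allowed limit."""


def check_thread_leak(test_index, limit):
    """Raise ThreadLeakError if more than `limit` threads are alive.

    Hypothetical helper: call it after each test with the test's index.
    Returns the number of live threads when the limit is respected.
    """
    alive = threading.enumerate()
    if len(alive) > limit:
        names = ", ".join(t.name for t in alive)
        raise ThreadLeakError(
            "thread leak: %d threads at test %d (%s)"
            % (len(alive), test_index, names))
    return len(alive)
```

Listing the thread names in the message makes it easier to see which kind of thread is accumulating.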

Denys Duchier (denys.duchier) wrote on 2009-08-21:

Attachment: monitor and instrument thread leaks (1.3 KiB, text/plain)

Denys Duchier (denys.duchier) wrote on 2009-08-21:

I should have added that, with the attached patch, I see the problem first occurring here:

blackbox.test_branch.TestBranchStacked.test_branch_stacked OK 124ms
blackbox.test_branch.TestBranchStacked.test_branch_stacked_branch_not_stacked OK 341ms
blackbox.test_branch.TestBranchStacked.test_branch_stacked_branch_stacked OK 341ms
blackbox.test_branch.TestBranchStacked.test_branch_stacked_from_non_stacked_format OK 73ms
blackbox.test_branch.TestBranchStacked.test_branch_stacked_from_rich_root_non_stackable OK 74ms
blackbox.test_branch.TestBranchStacked.test_branch_stacked_from_smart_server OK 146ms
bzr: ERROR: thread leak: 7 threads at test 82

Denys Duchier (denys.duchier) wrote on 2009-08-22:

Adding a print loop before raising the complaining exception I see the following threads
that won't die:

...nch.TestBranchStacked.test_branch_stacked_from_smart_server OK 189ms
<_MainThread(MainThread, started -1210341696)>
<Thread(smart-server-child, started daemon -1244996752)>
<Thread(smart-server-child, started daemon -1228211344)>
<Thread(smart-server-child, started daemon -1270174864)>
<Thread(smart-server-child, started daemon -1261782160)>
<Thread(smart-server-child, started daemon -1236604048)>
<Thread(smart-server-child, started daemon -1253389456)>
bzr: ERROR: thread leak: 7 threads at test 18

Denys Duchier (denys.duchier) wrote on 2009-08-22:

PATCH (1.5 KiB, text/plain)

Some apparent leaks are due to threads not being gc'ed fast enough. I have been able
to get a bit further using the following patch.

Denys Duchier (denys.duchier) wrote on 2009-08-22:

However the following threads are not going away even after repeated gc.collect():

(Pydbgr) threading.enumerate()
[<_MainThread(MainThread, started -1212365120)>, <Thread(Thread-119, started daemon -1223730320)>, <paramiko.Transport at 0xe9f88ecL (cipher aes128-cbc, 128 bits) (active; 0 open channel(s))>, <paramiko.Transport at 0xa746d0cL (cipher aes128-cbc, 128 bits) (active; 0 open channel(s))>, <Thread(Thread-177, started daemon -1248908432)>, <Thread(Thread-116, started daemon -1257301136)>]
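
A check of this kind can be sketched as follows: run several collection passes, give exiting threads a short grace period, then list whatever is still alive. This is an illustration, not the attached patch; `surviving_threads` and its `grace` and `passes` parameters are hypothetical names.

```python
import gc
import threading
import time


def surviving_threads(grace=0.5, passes=3):
    """Return threads (other than the caller) alive after repeated gc passes.

    Each pass runs gc.collect() and then sleeps briefly so that threads
    which are on their way out have a chance to finish.
    """
    for _ in range(passes):
        gc.collect()
        time.sleep(grace / passes)
    me = threading.current_thread()
    return [t for t in threading.enumerate() if t is not me]
```

Threads that still show up after this are blocked on something other than pending garbage collection, for example a socket read.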

Vincent Ladeuil (vila) wrote on 2009-08-22:

Thanks for investigating the problem.

Leaked threads are a known problem; selftest even reports them (when it can finish, that is).
The crash with "can't start new thread" is fairly new, though, and it also blocks the test suite from running under cygwin.

As you discovered, some threads can be garbage collected, some others can't.
Most of them though are caused by test server threads and we can't easily terminate them due to our
transport objects design. This is known and there may even be a bug filed about adding a close() method for them.

In the meantime, try running fewer tests with the --starting-with option or by specifying some regexp.
Another workaround may be to try https://code.launchpad.net/~vila/bzr/selftest-fixes/, which changes the way --parallel=fork works and spawns more subprocesses with fewer tests each.

You can even tweak the way the slices are calculated until you reach a point where selftest doesn't bomb anymore.
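
How the selftest-fixes branch computes its slices is not shown in this thread. Purely as an illustration of "more subprocesses with fewer tests each", a fixed-size partition could look like the sketch below; `slice_tests` and `max_per_slice` are hypothetical names, and the real code may partition differently.

```python
def slice_tests(test_ids, max_per_slice):
    """Split test_ids into consecutive slices of at most max_per_slice items."""
    if max_per_slice < 1:
        raise ValueError("max_per_slice must be at least 1")
    return [test_ids[i:i + max_per_slice]
            for i in range(0, len(test_ids), max_per_slice)]
```

Lowering `max_per_slice` means each subprocess runs fewer tests, so it has less time to accumulate leaked threads before it exits.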

Changed in bzr:
importance:	Undecided → High
status:	New → Confirmed

Martin Pool (mbp) wrote on 2009-08-23: Re: [Bug 417053] Re: bzr selftest is leaking threads, eventually bombs with "can't start new thread"

I wonder what's different on Denys's setup that's making the problem
much worse for him?

--
Martin <http://launchpad.net/~mbp/>

Denys Duchier (denys.duchier) wrote on 2009-08-23:

So far, it seems that on my machine (1) threads (of the selftest) are created faster than they can be gc'ed,
(2) some threads just won't die. I have been able to mitigate (sometimes eradicate) the problems associated
with (1). However (2) is a tough nut to crack, e.g. I have paramiko threads that don't go away. I thought I had
gotten rid of smart-server-child threads, but no... there is one that won't go away.

My machine is an XPS M1330 running Gentoo (~x86, 32bits kernel and userspace - I never made the transition
to 64bits). There seem to be NO limits that could be causing the problem.

Andrew Bennetts (spiv) wrote on 2009-08-24: Re: [Bug 417053] Re: bzr selftest is leaking threads, eventually bombs with "can't start new thread"

Vincent Ladeuil wrote:
> Thanks for investigating the problem.
>
> Leaked threads are a known problem; selftest even reports them (when it can
> finish, that is).
> The crash with "can't start new thread" is fairly new, though, and it also
> blocks the test suite from running under cygwin.
>
> As you discovered, some threads can be garbage collected, some others can't.
> Most of them though are caused by test server threads and we can't easily
> terminate them due to our transport objects design. This is known and there
> may even be a bug filed about adding a close() method for them.

Probably we should change SmartTCPServer to track the smart-server-child threads
it starts and .join them all in stop_background_thread, at least for
SmartTCPServer_for_testing. The tradeoff here is that it might hang in some
circumstances, but I suppose we can use the timeout param of Thread.join to
mitigate that.
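
A minimal sketch of that proposal, assuming a simplified stand-in class rather than the real SmartTCPServer: `TrackingServer`, `start_child`, and the `timeout` default are hypothetical, and only the `stop_background_thread` name and the "smart-server-child" thread name come from this thread.

```python
import threading


class TrackingServer(object):
    """Hypothetical stand-in for a test server that tracks its child threads."""

    def __init__(self):
        self._children = []
        self._lock = threading.Lock()

    def start_child(self, target, *args):
        """Start a daemon child thread and remember it for later joining."""
        thread = threading.Thread(
            target=target, args=args, name="smart-server-child")
        thread.daemon = True
        with self._lock:
            self._children.append(thread)
        thread.start()
        return thread

    def stop_background_thread(self, timeout=1.0):
        """Join every tracked child; return those still alive after `timeout`.

        The timeout applies per thread, so a stuck child delays shutdown
        by at most `timeout` seconds instead of hanging it forever.
        """
        with self._lock:
            children, self._children = self._children, []
        stuck = []
        for thread in children:
            thread.join(timeout)
            if thread.is_alive():
                stuck.append(thread)
        return stuck
```

Returning the stuck threads rather than raising leaves it to the test harness to decide whether a child that outlives the timeout is an error or only worth a warning.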

Denys Duchier (denys.duchier) wrote on 2009-08-24:

PATCH (5.3 KiB, text/plain)

I tried something like that: I added a _close() method to SmartServerSocketStreamMedium

+    def _close(self):
+        self.socket.close()
+        self.finished = True

and kept track of them in the smart server using a weak dict and closed them in
stop_background_thread. Still, I discovered one smart-server-child thread that
would not go away. It appears to be stuck trying to recv from a closed socket:

(Pydbgr) info threads smart-server-child
until_no_eintr(f=<built-in method recv of _socket.socket object at 0xb53ff70>, *a=(65536,), **kw={})

I have verified that the socket is closed. If I attempt a recv in the debugger, I get a bad file descriptor error.
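
One possible explanation, offered as an assumption rather than a confirmed diagnosis: on Linux, closing a socket from one thread commonly does not interrupt a recv() that another thread has already entered, whereas socket.shutdown() does wake the blocked reader. The sketch below shows the shutdown-then-close pattern on a local socket pair; `close_and_wake` and `demo` are hypothetical names and not part of bzrlib.

```python
import socket
import threading
import time


def close_and_wake(sock):
    """Shut down both directions before closing so blocked readers return."""
    try:
        sock.shutdown(socket.SHUT_RDWR)
    except socket.error:
        pass  # the socket was already disconnected
    sock.close()


def demo():
    """Block a thread in recv(), then wake it via close_and_wake()."""
    a, b = socket.socketpair()
    received = []
    reader = threading.Thread(target=lambda: received.append(a.recv(65536)))
    reader.daemon = True
    reader.start()
    time.sleep(0.2)  # give the reader time to block inside recv()
    close_and_wake(a)
    reader.join(2.0)
    b.close()
    return reader.is_alive(), received
```

If the reader is woken, recv() returns an empty byte string, which a serving loop can treat as end of stream and exit.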

Martin Pool (mbp) on 2010-02-11