bzr selftest is leaking threads, eventually bombs with "can't start new thread"

Bug #417053 reported by Denys Duchier
This bug affects 1 person
Affects   Status      Importance   Assigned to   Milestone
Bazaar    Confirmed   Low          Unassigned

Bug Description

I cannot run the full selftest. It eventually bombs with "can't start new thread".
I am running bzr.dev, with python 2.6.2, on Gentoo/Linux, on an XPS M1330.
I tried to track the problem down, but I seem to have become lost in spaghetti space :-(
I can guarantee that threads are leaking. I will attach a diff that monitors and instruments
the leaks in the hope that you can reproduce and understand what I am seeing.
Perhaps you can suggest a better way for me to track down the problem.

Tags: selftest
Denys Duchier (denys.duchier) wrote :

I should have added that, with the attached patch, I see the problem first occurring here:

blackbox.test_branch.TestBranchStacked.test_branch_stacked OK 124ms
blackbox.test_branch.TestBranchStacked.test_branch_stacked_branch_not_stacked OK 341ms
blackbox.test_branch.TestBranchStacked.test_branch_stacked_branch_stacked OK 341ms
blackbox.test_branch.TestBranchStacked.test_branch_stacked_from_non_stacked_format OK 73ms
blackbox.test_branch.TestBranchStacked.test_branch_stacked_from_rich_root_non_stackable OK 74ms
blackbox.test_branch.TestBranchStacked.test_branch_stacked_from_smart_server OK 146ms
bzr: ERROR: thread leak: 7 threads at test 82

Denys Duchier (denys.duchier) wrote :

Adding a print loop before raising the complaining exception, I see the following threads
that won't die:

...nch.TestBranchStacked.test_branch_stacked_from_smart_server OK 189ms
<_MainThread(MainThread, started -1210341696)>
<Thread(smart-server-child, started daemon -1244996752)>
<Thread(smart-server-child, started daemon -1228211344)>
<Thread(smart-server-child, started daemon -1270174864)>
<Thread(smart-server-child, started daemon -1261782160)>
<Thread(smart-server-child, started daemon -1236604048)>
<Thread(smart-server-child, started daemon -1253389456)>
bzr: ERROR: thread leak: 7 threads at test 18
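
For readers without the attached diff, the kind of check described here can be sketched roughly as follows. This is not the attached patch: `thread_leak_check` and `ThreadLeak` are hypothetical names, nothing here is part of bzrlib, and the code is written for Python 3 rather than the Python 2.6 used in this report. It snapshots the live threads, runs a block, and prints any thread that is new and still alive before raising.

```python
import threading
from contextlib import contextmanager


class ThreadLeak(Exception):
    """Raised when threads started inside the checked block are still alive."""


@contextmanager
def thread_leak_check(label):
    # Hypothetical helper (an assumption for illustration, not bzrlib code).
    # Snapshot the threads that are alive before the block runs.
    before = set(threading.enumerate())
    yield
    # threading.enumerate() only lists threads that are still alive, so
    # anything new in it was started by the block and has not finished.
    leaked = [t for t in threading.enumerate() if t not in before]
    if leaked:
        for t in leaked:
            print(repr(t))
        raise ThreadLeak("thread leak: %d threads at %s" % (len(leaked), label))


# Small demonstration: a daemon thread that waits on an event never set
# inside the block is reported as a leak.
stop = threading.Event()
try:
    with thread_leak_check("demo"):
        threading.Thread(target=stop.wait, name="smart-server-child",
                         daemon=True).start()
except ThreadLeak as exc:
    print(exc)
finally:
    stop.set()  # let the demo thread exit
```

Wrapping each test in such a check points at the first test that leaves threads behind, which is how the listing above was narrowed down to the smart-server-child threads.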

Denys Duchier (denys.duchier) wrote :

Some apparent leaks are due to threads not being gc'ed fast enough. I have been able
to get a bit further using the following patch.

Denys Duchier (denys.duchier) wrote :

However the following threads are not going away even after repeated gc.collect():

(Pydbgr) threading.enumerate()
[<_MainThread(MainThread, started -1212365120)>, <Thread(Thread-119, started daemon -1223730320)>, <paramiko.Transport at 0xe9f88ecL (cipher aes128-cbc, 128 bits) (active; 0 open channel(s))>, <paramiko.Transport at 0xa746d0cL (cipher aes128-cbc, 128 bits) (active; 0 open channel(s))>, <Thread(Thread-177, started daemon -1248908432)>, <Thread(Thread-116, started daemon -1257301136)>]
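
A minimal sketch of how "threads that are merely slow to go away" can be told apart from "threads that never go away", assuming Python 3. The function name `surviving_threads` and its parameters are hypothetical. The idea is to run `gc.collect()` several times with a short pause in between, and only treat the threads that remain after the last pass as real leaks.

```python
import gc
import threading
import time


def surviving_threads(baseline, attempts=5, grace=0.05):
    """Return threads not in `baseline` that are still alive after
    repeated garbage collection passes (hypothetical helper)."""
    extra = []
    for _ in range(attempts):
        # Collecting garbage may release objects (servers, sockets) whose
        # cleanup lets a thread finish; that is an assumption about why
        # gc helps here, not something this report confirms.
        gc.collect()
        extra = [t for t in threading.enumerate() if t not in baseline]
        if not extra:
            break
        time.sleep(grace)  # give slow threads a moment to exit
    return extra
```

Calling it with `set(threading.enumerate())` taken before a test gives the list of threads that test really left behind. Anything it returns, like the paramiko transports above, needs an explicit shutdown path rather than more patience.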

Vincent Ladeuil (vila) wrote :

Thanks for investigating the problem.

Leaked threads are a known problem; selftest even reports them (when it can finish, that is).
The crash with "can't start new thread" is fairly new, though, and it is also preventing the test suite from running under cygwin.

As you discovered, some threads can be garbage collected and others can't.
Most of them, though, are caused by test server threads, and we can't easily terminate them because of the design of our
transport objects. This is known, and there may even be a bug filed about adding a close() method for them.

In the meantime, try running fewer tests with the --starting-with option or by specifying a regexp.
Another workaround may be to try https://code.launchpad.net/~vila/bzr/selftest-fixes/, which changes the way --parallel=fork works and spawns more subprocesses with fewer tests each.

You can even tweak the way the slices are calculated until you reach a point where selftest doesn't bomb anymore.

Changed in bzr:
importance: Undecided → High
status: New → Confirmed
Martin Pool (mbp) wrote : Re: [Bug 417053] Re: bzr selftest is leaking threads, eventually bombs with "can't start new thread"

I wonder what's different on Denys's setup that's making the problem
much worse for him?

--
Martin <http://launchpad.net/~mbp/>

Denys Duchier (denys.duchier) wrote :

So far, it seems that on my machine (1) threads (of the selftest) are created faster than they can be gc'ed, and
(2) some threads just won't die. I have been able to mitigate (sometimes eradicate) the problems associated
with (1). However, (2) is a tough nut to crack; e.g. I have paramiko threads that don't go away. I thought I had
gotten rid of the smart-server-child threads, but no... there is one that won't go away.

My machine is an XPS M1330 running Gentoo (~x86, 32-bit kernel and userspace - I never made the transition
to 64-bit). There seem to be NO limits that could be causing the problem.

Andrew Bennetts (spiv) wrote : Re: [Bug 417053] Re: bzr selftest is leaking threads, eventually bombs with "can't start new thread"

Vincent Ladeuil wrote:
> Thanks for investigating the problem.
>
> Leaked threads are a known problem; selftest even reports them (when it
> can finish, that is).
> The crash with "can't start new thread" is fairly new, though, and it is also
> preventing the test suite from running under cygwin.
>
> As you discovered, some threads can be garbage collected and others can't.
> Most of them, though, are caused by test server threads, and we can't easily
> terminate them because of the design of our transport objects. This is known,
> and there may even be a bug filed about adding a close() method for them.

Probably we should change SmartTCPServer to track the smart-server-child threads
it starts and .join them all in stop_background_thread, at least for
SmartTCPServer_for_testing. The tradeoff here is that it might hang in some
circumstances, but I suppose we can use the timeout param of Thread.join to
mitigate that.
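
The track-and-join idea can be sketched on its own, outside any server class. This is an illustration under assumptions, not bzrlib code: `ChildThreadTracker`, `start_child`, and `join_all` are hypothetical names, and the code targets Python 3. It records every child thread it starts and joins them all against one shared deadline, so a stuck child delays shutdown by at most `timeout` seconds in total rather than `timeout` per thread.

```python
import threading
import time


class ChildThreadTracker:
    """Hypothetical helper: remembers the threads it starts so that they
    can be joined later. A real server would call join_all() from its
    stop method."""

    def __init__(self):
        self._children = []
        self._lock = threading.Lock()

    def start_child(self, target, *args):
        t = threading.Thread(target=target, args=args,
                             name="smart-server-child", daemon=True)
        with self._lock:
            self._children.append(t)
        t.start()
        return t

    def join_all(self, timeout=1.0):
        """Join every tracked child, waiting at most `timeout` seconds
        overall. Returns the threads that are still alive afterwards."""
        deadline = time.monotonic() + timeout
        with self._lock:
            children, self._children = self._children, []
        for t in children:
            # Thread.join(timeout) returns once the thread ends or the
            # timeout expires; it never raises on timeout.
            t.join(max(0.0, deadline - time.monotonic()))
        return [t for t in children if t.is_alive()]
```

Returning the stragglers instead of blocking forever keeps the "might hang" tradeoff bounded, and the caller can report the stuck threads by name.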

Denys Duchier (denys.duchier) wrote :

I tried something like that: I added a _close() method to SmartServerSocketStreamMedium

+    def _close(self):
+        self.socket.close()
+        self.finished = True

and kept track of them in the smart server using a weak dict and closed them in
stop_background_thread. Still, I discovered one smart-server-child thread that
would not go away. It appears to be stuck trying to recv from a closed socket:

(Pydbgr) info threads smart-server-child
    until_no_eintr(f=<built-in method recv of _socket.socket object at 0xb53ff70>, *a=(65536,), **kw={})

I have verified that the socket is closed. If I attempt a recv in the debugger, I get a bad file descriptor error.
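
One possible explanation, not confirmed in this report: on Linux, closing a socket from one thread does not necessarily wake another thread that is already blocked in recv() on it, whereas socket.shutdown() does. The sketch below, for Python 3, only demonstrates the shutdown() half on a local socket pair; `blocked_reader` and `wake_reader_with_shutdown` are hypothetical names, and behaviour on other platforms may differ.

```python
import socket
import threading


def blocked_reader(sock, results):
    # Blocks until data arrives or the connection is shut down.
    results.append(sock.recv(65536))


def wake_reader_with_shutdown():
    """Start a reader blocked in recv(), then shut the socket down from
    the calling thread and report whether the reader woke up."""
    a, b = socket.socketpair()
    results = []
    reader = threading.Thread(target=blocked_reader, args=(a, results),
                              daemon=True)
    reader.start()
    # shutdown() ends the connection in both directions; a recv() that is
    # blocked on it (or called afterwards) returns b"" for end of stream.
    a.shutdown(socket.SHUT_RDWR)
    reader.join(5.0)
    woke = not reader.is_alive()
    a.close()
    b.close()
    return woke, results
```

If that explanation applies here, calling `self.socket.shutdown(socket.SHUT_RDWR)` before `self.socket.close()` in `_close()` would be worth trying; that suggestion is an assumption to test, not a verified fix.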

Martin Pool (mbp)
tags: added: selftest
Changed in bzr:
importance: High → Low