blackbox.test_check.ChrootedCheckTests.test_check_missing_branch hangs on AIX

Bug #405745 reported by Cris Boylan on 2009-07-28
This bug affects 1 person
Affects: Bazaar
Importance: Medium
Assigned to: Vincent Ladeuil

Bug Description

python version: 2.6.2
bzr version: 1.17
OS: AIX 5.3

When running bzr selftest, it gets to test 112 (bzrlib.tests.blackbox.test_check.ChrootedCheckTests.test_check_missing_branch) and then hangs.

From bzr.log I can see it's running bzr check --branch http://localhost:47422 and returning "ERROR no branch found at specified location", which seems to be the point of the test.

I can then see from netstat that it still seems to be listening on that port, but when I try bzr check myself I get "Unable to handle http code 504: Gateway Time-out". Should it still be listening at this point?

It seems to me that it's not cleaning up the connection after the test succeeds. I am willing to help figure this one out, but I need a few pointers on where to look.


Andrew Bennetts (spiv) wrote :

Strange. Perhaps try running just the http tests, to see if they show the problem? "bzr selftest -s bt.test_http".

Also, try sending SIGQUIT (hit Ctrl-\, usually) and then type "bt" at the Pdb prompt and add the traceback to this bug, so we can see where the main thread is stuck.

Cris Boylan (crispin-boylan) wrote :

The http tests hang as well; attaching that backtrace too.

Vincent Ladeuil (vila) wrote :

Weird.

Did you notice a change in the behavior recently, or is this the first time you have tried running selftest?

The traceback points at a socket.close() which runs after a socket.shutdown(socket.SHUT_RDWR).

So I can hardly imagine another way to shut down the server and I can't imagine why that fails on AIX...
but if that's the case, we'll have to turn that bug into a duplicate of #417053...

Cris Boylan (crispin-boylan) wrote :

It hangs after only 2 tests.

shaun:/home/crisb->bzr selftest -s bt.test_http -v
running 406 tests...
testing: /opt/freeware/bin/bzr
   /opt/freeware/lib/python2.6/site-packages/bzrlib (1.17 python2.6.2)

test_http.TestHttpTransportRegistration.test_http_registered(urllib) OK 77ms
test_http.TestHttpTransportUrls.test_abs_url(urllib) OK 23ms
test_http.TestHttpTransportUrls.test_http_impl_urls(urllib)

Vincent Ladeuil (vila) wrote :

More info from IRC discussions: Cris did some tests and AIX doesn't blatantly refuse to close() after a shutdown().
The situation is not recent; presumably the hang has been happening for a long time.
Cris built Python himself and didn't think he used controversial settings:
'few patches for aix weirdness of building shared libraries, but nothing really naughty'

Changed in bzr:
status: New → Confirmed
importance: Undecided → Medium

Vincent Ladeuil (vila) wrote :

By the way:
test_http.TestHttpTransportUrls.test_http_impl_urls(urllib)

does almost nothing: it starts the server and stops the server, and nobody connects in between...

2009/8/31 Vincent Ladeuil <email address hidden>:
> More info from IRC discussions: Cris did some tests and AIX doesn't blatantly refuse to close() after a shutdown().

I don't understand "doesn't blatantly refuse to", is that a double negative?

iirc the behaviour of shutdown() actually varies quite a lot across
different unixes. I'd need to check a book to be sure. Perhaps we
should just close().

What more action can we take on this bug?

--
Martin <http://launchpad.net/~mbp/>

Vincent Ladeuil (vila) wrote :

> I don't understand "doesn't blatantly refuse to",
Cris said close() worked after shutdown() in some basic python tests he did.
> What more action can we take on this bug?
Cris said he will try to isolate which tests are hanging.

Vincent Ladeuil (vila) wrote :

> the behaviour of shutdown() actually varies quite a lot across different unixes

:-/ At least on Linux and OSX, the shutdown() in the test http servers was mandatory...

I'm all ears about an alternate way to put a server socket in a state where it can be closed (really closed)
even if a client is connected.

Cris Boylan (crispin-boylan) wrote :

A few more tests done:

a) Almost every test in bt.test_http causes the hang in close().

b) I ran the Python tests on my Python build, and both the socket and the threading tests passed.


Vincent Ladeuil wrote:
> Weird.
>
> Did you notice a change in the behavior recently, or is this the first time
> you have tried running selftest?
>
> The traceback points at a socket.close() which runs after a
> socket.shutdown(socket.SHUT_RDWR).
>
> So I can hardly imagine another way to shut down the server and I can't imagine why that fails on AIX...
> but if that's the case, we'll have to turn that bug into a duplicate of #417053...

One possibility is that the client is still sending data -
http://httpd.apache.org/docs/1.3/misc/fin_wait_2.html#appendix

changing our server code to:
shutdown(WR)
while (content = read_bytes()):
    pass
shutdown(RD)
close()

may help.

-Rob
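For reference, Rob's pseudocode above can be sketched as runnable Python. This is only an illustration of the "lingering close" idea from the Apache note, not bzrlib's server code; the function name lingering_close and the timeout and bufsize parameters are assumptions made for the example.

```python
import socket


def lingering_close(sock, timeout=2.0, bufsize=4096):
    """Half-close the socket, drain what the peer still sends, then close.

    Hypothetical helper for illustration; not part of bzrlib.
    """
    sock.settimeout(timeout)
    try:
        # Tell the peer we are done writing.
        sock.shutdown(socket.SHUT_WR)
        # Discard incoming data until the peer signals end-of-file.
        while sock.recv(bufsize):
            pass
        sock.shutdown(socket.SHUT_RD)
    except (socket.timeout, socket.error):
        # Peer is gone or too slow: fall through and just close.
        pass
    finally:
        sock.close()
```

Note that this applies to a connected socket. It does nothing for a listening socket that no client ever connected to, which is the case in test_http_impl_urls.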

I checked 'Effective TCP/IP Programming.' It says that the behaviour
of shutdown(2) varies not between BSD and SysV unices (as I
misremembered) but between Unix and Windows. To make a long story
short, on Unix shutdown normally drains the buffer whereas on Windows
it causes the other party to get connection reset if more data
arrives.

So it's probably not the cause of aix-specific differences.

--
Martin <http://launchpad.net/~mbp/>

Cris Boylan (crispin-boylan) wrote :

I have tried Robert Collins's suggestion, but no luck: the behaviour is the same.

I have also tried removing the shutdown entirely; no dice.

The only thing that makes it work is removing the close().

Strangely, when I attach in dbx, one thread is stuck in select() and one in the socket deallocation. When I run cont, the select() gets an "interrupted system call" error and returns, and then the test completes successfully. Could this be relevant?

Vincent Ladeuil (vila) wrote :

I'd like to see the select() backtrace...

And try to use test_http.TestHttpTransportUrls.test_http_impl_urls(urllib) since we know it doesn't connect to the server.

Otherwise, when you attach with dbx, you probably interrupt some system call (listen() maybe),
and that may be enough to unblock the hanging thread.

Cris Boylan (crispin-boylan) wrote :

the dbx output is:

(dbx) where
__fd_select(??, ??, ??, ??, ??) at 0xd042ebe0
selectmodule.select() at 0x20aa7b6c
select_select() at 0x20aa7260
PyCFunction_Call() at 0x200cb630
call_function() at 0x2008c53c
PyEval_EvalFrameEx() at 0x200917ec
fast_function() at 0x2008c058
call_function() at 0x2008c608
PyEval_EvalFrameEx() at 0x200917ec
PyEval_EvalCodeEx() at 0x2008d4f0
function_call() at 0x200aabfc
PyObject_Call() at 0x20076308
ext_do_call() at 0x2008baa0
PyEval_EvalFrameEx() at 0x20091ed0
fast_function() at 0x2008c058
call_function() at 0x2008c608
PyEval_EvalFrameEx() at 0x200917ec
fast_function() at 0x2008c058
call_function() at 0x2008c608
PyEval_EvalFrameEx() at 0x200917ec
PyEval_EvalCodeEx() at 0x2008d4f0
function_call() at 0x200aabfc
PyObject_Call() at 0x20076308
instancemethod_call() at 0x200769b4
PyObject_Call() at 0x20076308
PyEval_CallObjectWithKeywords() at 0x20092b8c
t_bootstrap() at 0x200c9bf0

Cris Boylan (crispin-boylan) wrote :

That's from test_http.TestHttpTransportUrls.test_http_impl_urls(urllib)

Vincent Ladeuil (vila) wrote :

Ha, hmm, sorry, I was thinking of a Python backtrace :-/
I don't quite get where the http client can be blocked on a select...

Cris Boylan (crispin-boylan) wrote :

After some chats on IRC we realised the 2 threads are the main thread and the server thread, as there is no client.

AIX does not seem to like 2 threads trying operations on the socket at the same time (select and close). This causes the hang. The socket is still there because shutdown() returns ENOTCONN without actually shutting anything down.
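A small probe along these lines can show how a given platform treats shutdown() on a listening socket that never had a client. It is a sketch for checking the ENOTCONN observation, not code from bzrlib; the function name is made up for the example, and the outcome depends on the operating system, so no particular result is assumed.

```python
import errno
import socket


def shutdown_listening_socket():
    """Return None if shutdown() succeeds on a listening socket, else the errno.

    Hypothetical probe for illustration; the result is platform dependent.
    """
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        listener.bind(("127.0.0.1", 0))   # any free port on localhost
        listener.listen(1)
        try:
            listener.shutdown(socket.SHUT_RDWR)
        except socket.error as exc:
            return exc.errno
        return None
    finally:
        listener.close()


result = shutdown_listening_socket()
if result is None:
    print("shutdown() succeeded on the listening socket")
else:
    print("shutdown() failed: %s" % errno.errorcode.get(result, result))
```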

I replaced the code in TestingHTTPServerMixin.tearDown with the following (suggested by your good self on IRC):

         self._http_running = False

         temp_socket = socket.create_connection(self.socket.getsockname())
         temp_socket.close()

         # Let the server properly close the socket
         self.server_close()

Now the results of bzr selftest -s bt.test_http -v are:

----------------------------------------------------------------------
Ran 406 tests in 98.015s

OK
14 tests skipped
tests passed
bzrlib.tests.test_http.TestHttpTransportUrls.test_http_impl_urls(urllib) is leaking threads among 183 leaking tests.
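The idea behind the tearDown change above (clear the running flag, then make a throwaway connection so the server thread wakes up and notices) can be sketched in isolation. TinyServer is a hypothetical stand-in written for this example and is not the bzrlib test server; it uses accept() directly instead of select(), and closes the listening socket only after the server thread has finished, so the two threads never operate on the socket at the same time.

```python
import socket
import threading


class TinyServer(object):
    """Hypothetical stand-in for a test server; not bzrlib code."""

    def __init__(self):
        self.sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        self.sock.bind(("127.0.0.1", 0))   # any free port on localhost
        self.sock.listen(1)
        self.running = True
        self.thread = threading.Thread(target=self._serve)
        self.thread.daemon = True
        self.thread.start()

    def _serve(self):
        while self.running:
            conn, _ = self.sock.accept()   # blocks until someone connects
            conn.close()

    def stop(self):
        self.running = False
        # Throwaway connection: wakes the blocked accept() so the loop
        # re-checks the flag and exits.
        wake = socket.create_connection(self.sock.getsockname())
        wake.close()
        self.thread.join(5.0)
        # The server thread is done, so closing here cannot race with it.
        self.sock.close()


server = TinyServer()
server.stop()
print("server thread alive: %s" % server.thread.is_alive())
```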

Cris Boylan (crispin-boylan) wrote :

Added a branch for this fix

Vincent Ladeuil (vila) on 2009-10-07
Changed in bzr:
assignee: nobody → Vincent Ladeuil (vila)

Vincent Ladeuil (vila) on 2009-10-07
Changed in bzr:
status: Confirmed → Fix Committed

Vincent Ladeuil (vila) wrote :

From pqm failure:
- the fix requires python-2.6,
- still address the issue for 2.5,
- hangs again for 2.4
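
The thread does not say why the fix requires Python 2.6. One likely reason is that socket.create_connection(), used in the tearDown change, was added in Python 2.6. A fallback for older interpreters could look like the sketch below; connect_to is a hypothetical helper, and the fallback branch assumes an IPv4 address.

```python
import socket


def connect_to(address, timeout=5.0):
    """Connect to a (host, port) pair, with or without create_connection().

    Hypothetical helper for illustration; not part of bzrlib.
    """
    if hasattr(socket, "create_connection"):
        # Available from Python 2.6 onwards.
        return socket.create_connection(address, timeout)
    # Older interpreters: plain IPv4 TCP socket.
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.settimeout(timeout)
    sock.connect(address)
    return sock
```

This only covers the missing function. It says nothing about the separate hang reported for 2.4.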

Changed in bzr:
status: Fix Committed → In Progress

Vincent Ladeuil (vila) wrote :

See also bug #392127 for more details on the work in progress.

Vincent Ladeuil (vila) on 2010-08-30
Changed in bzr:
milestone: none → 2.3b1
status: In Progress → Fix Released