bzr hangs (deadlock) on push/pull/merge

Bug #411349 reported by Frits Jalvingh
8
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Bazaar
Confirmed
Low
Unassigned

Bug Description

(See also bug 405251 comment 34 (https://bugs.launchpad.net/bzr/+bug/405251/comments/34))

Interaction between bzr 1.16 and 1.17 causes deadlocked hangs often. In the call above it happened when I pushed from a 1.17 bzr to a 1.16.1 server; I now have the same problem between a 1.16 client and an 1.17 server doing a merge.

When hanging the client hangs forever in recv when attaching an strace to it. The server hangs forever just doing polls and nothing else. I tried to use different protocols to let the push complete in the earlier call and discovered that bzr+ssh just starts another server remotely but which does not listen to a socket. And in that case I see the client hanging on recv() and the server hanging on recv also.

I did a kill -QUIT on the hanging client trying to merge and issued a 'bt':
jo@seahorse:~/bzr/vp-trunk$ bzr merge bzr://bzr.hosts.itris.nl/vp-trunk
** SIGQUIT received, entering debugger7KB 1KB/s | Fetching
revisions:Inserting stream
** Type 'c' to continue or 'q' to stop the process
** Or SIGQUIT again to quit (and possibly dump core)
> /usr/lib/python2.5/site-packages/bzrlib/breakin.py(33)_debug()
-> signal.signal(signal.SIGQUIT, _debug)
(Pdb) bt
  /usr/bin/bzr(130)<module>()
-> exit_val = bzrlib.commands.main()
  /usr/lib/python2.5/site-packages/bzrlib/commands.py(987)main()
-> ret = run_bzr_catch_errors(argv)
  /usr/lib/python2.5/site-packages/bzrlib/commands.py(998)run_bzr_catch_errors()
-> return exception_to_return_code(run_bzr, argv)
  /usr/lib/python2.5/site-packages/bzrlib/commands.py(729)exception_to_return_code()
-> return the_callable(*args, **kwargs)
  /usr/lib/python2.5/site-packages/bzrlib/commands.py(924)run_bzr()
-> ret = run(*run_argv)
  /usr/lib/python2.5/site-packages/bzrlib/commands.py(560)run_argv_aliases()
-> return self.run(**all_cmd_args)
  /usr/lib/python2.5/site-packages/bzrlib/builtins.py(3573)run()
-> location, revision, remember, possible_transports, pb)
  /usr/lib/python2.5/site-packages/bzrlib/builtins.py(3680)_get_merger_from_branch()
-> other_revision_id, base_revision_id, other_branch, base_branch)
  /usr/lib/python2.5/site-packages/bzrlib/merge.py(204)from_revision_ids()
-> merger.set_other_revision(other, other_branch)
  /usr/lib/python2.5/site-packages/bzrlib/merge.py(345)set_other_revision()
-> self._maybe_fetch(other_branch, self.this_branch, self.other_rev_id)
  /usr/lib/python2.5/site-packages/bzrlib/merge.py(362)_maybe_fetch()
-> target.fetch(source, revision_id)
  /usr/lib/python2.5/site-packages/bzrlib/decorators.py(192)write_locked()
-> result = unbound(self, *args, **kwargs)
  /usr/lib/python2.5/site-packages/bzrlib/branch.py(532)fetch()
-> pb=pb)
  /usr/lib/python2.5/site-packages/bzrlib/repository.py(1552)fetch()
-> find_ghosts=find_ghosts, fetch_spec=fetch_spec)
  /usr/lib/python2.5/site-packages/bzrlib/decorators.py(192)write_locked()
-> result = unbound(self, *args, **kwargs)
  /usr/lib/python2.5/site-packages/bzrlib/repository.py(3126)fetch()
-> pb=pb, find_ghosts=find_ghosts)
  /usr/lib/python2.5/site-packages/bzrlib/fetch.py(82)__init__()
-> self.__fetch()
  /usr/lib/python2.5/site-packages/bzrlib/fetch.py(108)__fetch()
-> self._fetch_everything_for_search(search)
  /usr/lib/python2.5/site-packages/bzrlib/fetch.py(136)_fetch_everything_for_search()
-> stream, from_format, [])
  /usr/lib/python2.5/site-packages/bzrlib/repository.py(4029)insert_stream()
-> return self._locked_insert_stream(stream, src_format, is_resume)
  /usr/lib/python2.5/site-packages/bzrlib/repository.py(4058)_locked_insert_stream()
-> for substream_type, substream in stream:
  /usr/lib/python2.5/site-packages/bzrlib/remote.py(1763)missing_parents_chain()
-> for kind, stream in self._get_stream(sources[0], search):
  /usr/lib/python2.5/site-packages/bzrlib/smart/repository.py(467)record_stream()
-> for bytes in byte_stream:
  /usr/lib/python2.5/site-packages/bzrlib/smart/message.py(334)read_streamed_body()
-> self._read_more()
  /usr/lib/python2.5/site-packages/bzrlib/smart/message.py(277)_read_more()
-> bytes = self._medium_request.read_bytes(next_read_size)
  /usr/lib/python2.5/site-packages/bzrlib/smart/medium.py(456)read_bytes()
-> return self._read_bytes(count)
  /usr/lib/python2.5/site-packages/bzrlib/smart/medium.py(468)_read_bytes()
-> return self._medium.read_bytes(count)
  /usr/lib/python2.5/site-packages/bzrlib/smart/medium.py(149)read_bytes()
-> return self._read_bytes(bytes_to_read)
  /usr/lib/python2.5/site-packages/bzrlib/smart/medium.py(885)_read_bytes()
-> self._socket.recv, count, self._report_activity)
  /usr/lib/python2.5/site-packages/bzrlib/smart/medium.py(932)_read_bytes_from_socket()
-> bytes = osutils.until_no_eintr(sock, _MAX_READ_SIZE)
  /usr/lib/python2.5/site-packages/bzrlib/osutils.py(1785)until_no_eintr()
-> return f(*a, **kw)
> /usr/lib/python2.5/site-packages/bzrlib/breakin.py(33)_debug()
-> signal.signal(signal.SIGQUIT, _debug)
(Pdb)

the same on the server killed the server.

Revision history for this message
Frits Jalvingh (fjalvingh) wrote :
Download full text (5.6 KiB)

!!!!!!!!!!!!!
It now hangs between 2 1.17 versions too!
!!!!!!!!!!!!!

The server bt after a kill -QUIT:
-bash-3.00$ bzr serve --directory /media/d02/bzr --allow-writes
listening on port: 4155
** SIGQUIT received, entering debugger
** Type 'c' to continue or 'q' to stop the process
** Or SIGQUIT again to quit (and possibly dump core)
> /usr/local/lib/python2.4/site-packages/bzrlib/breakin.py(33)_debug()
-> signal.signal(signal.SIGQUIT, _debug)
(Pdb) bt
  /usr/local/bin/bzr(142)?()
-> exit_val = bzrlib.commands.main()
  /usr/local/lib/python2.4/site-packages/bzrlib/commands.py(1108)main()
-> ret = run_bzr_catch_errors(argv)
  /usr/local/lib/python2.4/site-packages/bzrlib/commands.py(1120)run_bzr_catch_errors()
-> return exception_to_return_code(run_bzr, argv)
  /usr/local/lib/python2.4/site-packages/bzrlib/commands.py(835)exception_to_return_code()
-> return the_callable(*args, **kwargs)
  /usr/local/lib/python2.4/site-packages/bzrlib/commands.py(1030)run_bzr()
-> ret = run(*run_argv)
  /usr/local/lib/python2.4/site-packages/bzrlib/commands.py(647)run_argv_aliases()
-> return self.run(**all_cmd_args)
  /usr/local/lib/python2.4/site-packages/bzrlib/builtins.py(4725)run()
-> protocol(transport, host, port, inet)
  /usr/local/lib/python2.4/site-packages/bzrlib/smart/server.py(342)serve_bzr()
-> smart_server.serve()
  /usr/local/lib/python2.4/site-packages/bzrlib/smart/server.py(121)serve()
-> conn, client_addr = self._server_socket.accept()
> /usr/local/lib/python2.4/site-packages/bzrlib/breakin.py(33)_debug()
-> signal.signal(signal.SIGQUIT, _debug)
(Pdb)

The kill -QUIT with bt on the client at this point:
jo@seahorse:~/bzr/vp-trunk$ bzr merge bzr://bzr.hosts.itris.nl/vp-trunk
** SIGQUIT received, entering debuggerB/s | Fetching revisions:Inserting
stream
** Type 'c' to continue or 'q' to stop the process
** Or SIGQUIT again to quit (and possibly dump core)
> /usr/lib/python2.5/site-packages/bzrlib/breakin.py(33)_debug()
-> signal.signal(signal.SIGQUIT, _debug)
(Pdb) bt
  /usr/bin/bzr(142)<module>()
-> exit_val = bzrlib.commands.main()
  /usr/lib/python2.5/site-packages/bzrlib/commands.py(1108)main()
-> ret = run_bzr_catch_errors(argv)
  /usr/lib/python2.5/site-packages/bzrlib/commands.py(1120)run_bzr_catch_errors()
-> return exception_to_return_code(run_bzr, argv)
  /usr/lib/python2.5/site-packages/bzrlib/commands.py(835)exception_to_return_code()
-> return the_callable(*args, **kwargs)
  /usr/lib/python2.5/site-packages/bzrlib/commands.py(1030)run_bzr()
-> ret = run(*run_argv)
  /usr/lib/python2.5/site-packages/bzrlib/commands.py(647)run_argv_aliases()
-> return self.run(**all_cmd_args)
  /usr/lib/python2.5/site-packages/bzrlib/builtins.py(3683)run()
-> location, revision, remember, possible_transports, pb)
  /usr/lib/python2.5/site-packages/bzrlib/builtins.py(3790)_get_merger_from_branch()
-> other_revision_id, base_revision_id, other_branch, base_branch)
  /usr/lib/python2.5/site-packages/bzrlib/merge.py(204)from_revision_ids()
-> merger.set_other_revision(other, other_branch)
  /usr/lib/python2.5/site-packages/bzrlib/merge.py(343)set_other_revision()
-> self._maybe_fetch(other_branch, self.this_branch, self.other_rev_id)
  /usr/lib/python2...

Read more...

summary: - bzr hangs on push/pull/merge between 1.16 and 1.17
+ bzr hangs (deadlock) on push/pull/merge
Revision history for this message
Andrew Bennetts (spiv) wrote :

Unfortunately SIGQUIT won't be readily useful on TCP "bzr serve" servers, because the interesting thread is not the main thread, and I don't think pdb provides any way to get a backtrace for a non-main thread. The server is definitely the interesting side of this; the client is simply waiting for the server to send it more data.

An strace of the server process may be more revealing. Or, if you have gdb installed on the server (and maybe the python debugging symbols too?) you can try using the gdbinit file linked from http://wiki.python.org/moin/DebuggingWithGdb and use the "pystack" macro on the appropriate thread. Actually, even a regular "thread apply all bt" in gdb might help.

Depending on what has gone wrong, there might be a backtrace lurking in ~/.bzr.log on the server too, although in this case I'm guessing there won't be. (If the server is behaving well enough to log errors, it should be behaving well enough to return an error to the client too.)

Revision history for this message
Frits Jalvingh (fjalvingh) wrote :

I have an strace of the server *after* it hangs, not before it:
futex(0x937a5a8, FUTEX_WAIT, 0, NULL) = 0
futex(0x937a5a8, FUTEX_WAIT, 0, NULL) = 0
futex(0x937a5a8, FUTEX_WAKE, 1) = 1
poll([{fd=4, events=POLLIN}], 1, 1000) = 0
futex(0x937a5a8, FUTEX_WAIT, 0, NULL) = 0
futex(0x937a5a8, FUTEX_WAIT, 0, NULL) = 0
futex(0x937a5a8, FUTEX_WAKE, 1) = 1
poll([{fd=4, events=POLLIN}], 1, 1000) = 0
futex(0x937a5a8, FUTEX_WAIT, 0, NULL) = 0
futex(0x937a5a8, FUTEX_WAKE, 1) = 1
poll([{fd=4, events=POLLIN}], 1, 1000) = 0
futex(0x937a5a8, FUTEX_WAIT, 0, NULL) = 0
futex(0x937a5a8, FUTEX_WAIT, 0, NULL) = 0
futex(0x937a5a8, FUTEX_WAIT, 0, NULL) = 0
futex(0x937a5a8, FUTEX_WAIT, 0, NULL) = -1 EAGAIN (Resource temporarily unavailable)
futex(0x937a5a8, FUTEX_WAIT, 0, NULL) = 0
futex(0x937a5a8, FUTEX_WAKE, 1) = 1
poll([{fd=4, events=POLLIN}], 1, 1000) = 0

In effect it is not doing anything. I'll investigate if it is possible to do a gdb; this server runs Red Head 4 (I always get a red head because it is such an unfriendly horrible Linux distro) and I had to compile Python by hand there - RHEL 3 has python 2.3 as base.

Revision history for this message
Andrew Bennetts (spiv) wrote : Re: [Bug 411349] Re: bzr hangs (deadlock) on push/pull/merge

Frits Jalvingh wrote:
> I have an strace of the server *after* it hangs, not before it:
> futex(0x937a5a8, FUTEX_WAIT, 0, NULL) = 0
> futex(0x937a5a8, FUTEX_WAIT, 0, NULL) = 0
> futex(0x937a5a8, FUTEX_WAKE, 1) = 1
> poll([{fd=4, events=POLLIN}], 1, 1000) = 0
[etc]
> In effect it is not doing anything.

I think that's probably just the main thread, again. fd 4 seems likely to be
the listening socket (I'd expect 0, 1, 2 are stdio, and 3 to be the ~/.bzr.log).
So this thread is just polling the listening socket for new incoming
connections, I think, which is normal and correct.

Try looking in /proc/<pid>/task for PIDs of the thread(s), and stracing those.

Thanks for your persistence!

Revision history for this message
Frits Jalvingh (fjalvingh) wrote :

> I think that's probably just the main thread....
Yes, I know; I just gave it as an indication that it seems to do nothing at all.

> Try looking in /proc/<pid>/task for PIDs of the thread(s), and stracing those.
Good tip, I'll try to do that the next time it hangs.

> Thanks for your persistence!
Thanks *you* for helping!

Revision history for this message
Frits Jalvingh (fjalvingh) wrote :

I now have a server process that seems to idle, and /proc/[pid]/task/ only contains a single pid (the main thread). The log states:
[24839] 2009-08-11 13:59:05.109 WARNING: listening socket error: (11, 'Resource temporarily unavailable')
[24839] 2009-08-11 13:59:30.820 WARNING: listening socket error: (11, 'Resource temporarily unavailable')

In the meantime the client stays blocked on a recvfrom(4,...)

I logged open connections and found something very odd: the client is waiting on a connection it thinks is open:
bzr 21626 jal 4u IPv4 1165611 TCP 172.22.128.28:44635->192.168.0.110:4155 (ESTABLISHED)

The server has no task running, and actually does not have this connection anymore:
[root@mouse ~]# netstat -n | grep \.28\:
tcp 0 0 192.168.0.110:1521 172.22.126.28:1702 ESTABLISHED
tcp 0 0 192.168.0.110:1521 172.22.126.28:2176 ESTABLISHED
tcp 0 0 192.168.0.110:1521 172.22.126.28:1516 ESTABLISHED
tcp 0 0 192.168.0.110:1521 172.22.126.28:2174 ESTABLISHED
tcp 0 0 192.168.0.110:1521 172.22.126.28:2171 ESTABLISHED
tcp 0 0 192.168.0.110:1521 172.22.126.28:1127 ESTABLISHED
tcp 0 0 192.168.0.110:445 172.22.126.28:1094 ESTABLISHED
tcp 0 0 192.168.0.110:445 172.22.126.28:1098 ESTABLISHED
tcp 0 512 ::ffff:192.168.0.110:22 ::ffff:172.22.128.28:34085 ESTABLISHED

Either the connection has been idle for so long that the firewall caused a close or!?

Revision history for this message
Frits Jalvingh (fjalvingh) wrote :
Download full text (7.7 KiB)

I did a full strace on the server thread for such a run. It seems to do a lot of reading for a long, long time; then it looks like it is dying:
futex(0x8c96cf8, FUTEX_WAKE, 1) = 0
futex(0x8c96cf8, FUTEX_WAKE, 1) = 0
futex(0x8bf03a8, FUTEX_WAKE, 1) = 0
futex(0x8c96cf8, FUTEX_WAKE, 1) = 0

lots of repeats of this, with intermittent something like:
fstat64(5, {st_mode=S_IFREG|0664, st_size=27403786, ...}) = 0
mmap2(NULL, 32768, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x931a4000
_llseek(5, 0, [0], SEEK_SET) = 0
futex(0x8c96cf8, FUTEX_WAKE, 1) = 0
read(5, "Bazaar pack format 1 (introduced"..., 32768) = 32768
futex(0x8c96cf8, FUTEX_WAKE, 1) = 0
futex(0x8c96cf8, FUTEX_WAKE, 1) = 0
_llseek(5, 9961472, [9961472], SEEK_SET) = 0
read(5, "\331b\261\241\0\0\0B122\n\n\37\213\10\0\0\0\0\0\2\377\235"..., 32768) = 32768

Then near the end it goes:
futex(0x8bf03a8, FUTEX_WAKE, 1) = 0
futex(0x8c96cf8, FUTEX_WAKE, 1) = 0
futex(0x8bf03a8, FUTEX_WAKE, 1) = 0
futex(0x8c96cf8, FUTEX_WAKE, 1) = 0
futex(0x8c96cf8, FUTEX_WAKE, 1) = 0
mmap2(NULL, 5746688, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x8bea9000
futex(0x8c96cf8, FUTEX_WAKE, 1) = 0
futex(0x8c96cf8, FUTEX_WAKE, 1) = 0
send(6, "b\0\0\3|B874\ninventories\n\nknit-delt"..., 65536, 0) = 65536
futex(0x8c96cf8, FUTEX_WAKE, 1) = 0
send(6, "\313z\263|\304\325\372\31\363\242.r\17B{\2\376\355\214"..., 65536, 0) = 29648
send(6, "\242&\303}\335MV6\311&\207GJ\257\373\362\357\207\7\227"..., 35888, 0) = -1 ECONNRESET (Connection reset by peer)
futex(0x8ac2260, FUTEX_WAKE, 1) = 0
futex(0x8c96cf8, FUTEX_WAKE, 1) = 0
stat64("/usr/local/lib/python2.4/site-packages/bzrlib/smart/protocol.py", {st_mode=S_IFREG|0644, st_size=50748, ...}) = 0
futex(0x8c96cf8, FUTEX_WAKE, 1) = 0
futex(0x8c96cf8, FUTEX_WAKE, 1) = 0
stat64("/usr/local/lib/python2.4/site-packages/bzrlib/smart/message.py", {st_mode=S_IFREG|0644, st_size=13718, ...}) = 0
futex(0x8c96cf8, FUTEX_WAKE, 1) = 0
stat64("/usr/local/lib/python2.4/site-packages/bzrlib/smart/protocol.py", {st_mode=S_IFREG|0644, st_size=50748, ...}) = 0
futex(0x8c96cf8, FUTEX_WAKE, 1) = 0
futex(0x8c96cf8, FUTEX_WAKE, 1) = 0
stat64("/usr/local/lib/python2.4/site-packages/bzrlib/smart/protocol.py", {st_mode=S_IFREG|0644, st_size=50748, ...}) = 0
futex(0x8c96cf8, FUTEX_WAKE, 1) = 0
stat64("/usr/local/lib/python2.4/site-packages/bzrlib/smart/protocol.py", {st_mode=S_IFREG|0644, st_size=50748, ...}) = 0
futex(0x8c96cf8, FUTEX_WAKE, 1) = 0
futex(0x8c96cf8, FUTEX_WAKE, 1) = 0
stat64("/usr/local/lib/python2.4/site-packages/bzrlib/smart/protocol.py", {st_mode=S_IFREG|0644, st_size=50748, ...}) = 0
futex(0x8c96cf8, FUTEX_WAKE, 1) = 0
stat64("/usr/local/lib/python2.4/site-packages/bzrlib/smart/medium.py", {st_mode=S_IFREG|0644, st_size=35183, ...}) = 0
futex(0x8c96cf8, FUTEX_WAKE, 1) = 0
futex(0x8c96cf8, FUTEX_WAKE, 1) = 0
stat64("/usr/local/lib/python2.4/site-packages/bzrlib/osutils.py", {st_mode=S_IFREG|0644, st_size=61815, ...}) = 0
futex(0x8c96...

Read more...

Vincent Ladeuil (vila)
Changed in bzr:
status: New → Confirmed
Martin Pool (mbp)
Changed in bzr:
importance: Undecided → Low
tags: added: hpss
Jelmer Vernooij (jelmer)
tags: added: check-for-breezy
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.