Comment 11 for bug 819604

Revision history for this message
John A Meinel (jameinel) wrote :

In my simple testing, I actually found that we don't discover we are disconnected until we *read* from the closed socket. I'm guessing the issue is that the write buffer is large enough vs the amount of content we are writing ('get', 'filename') that it just gets buffered until we wait to read back the response, and then the local socket notices that the remote one closed its end. I even tried an early socket.shutdown(SHUT_RD) on the server side, but still the only failure I got was during "_read_more."

This certainly makes it a bit more clumsy. Also, we default to buffering ~1MB (until the end of 'this write') before we actually write anything onto the socket.

All the layers and indirection, not to mention that sometimes we are writing to SSH, sometimes to 'simple pipes', sometimes to a socket, sometimes over HTTP, etc, etc.

I'm focusing on "(get)" as a first draft, because after doing "get_record_stream()" the first thing (at least older clients) do is probe the remote object for the real format, which involves reading .bzr/branch-format directly.

And I'm focusing on 'first thing after stream', because that is likely to be the bulk of the time spent, so it is the most likely time that we would end up interrupted. (And the most frustrating, since you'll finish 10min of data transfer, only to be interrupted checking if there are tags you might need.) An example traceback (bzr-2.1):

...
  File "C:\dev\bzr\bzr.dev\bzrlib\fetch.py", line 103, in __fetch
    self._fetch_everything_for_search(search_result)
  File "C:\dev\bzr\bzr.dev\bzrlib\fetch.py", line 134, in _fetch_everything_for_search
    stream = source.get_stream_for_missing_keys(missing_keys)
  File "C:\dev\bzr\bzr.dev\bzrlib\remote.py", line 2197, in get_stream_for_missing_keys
    self.from_repository._ensure_real()
  File "C:\dev\bzr\bzr.dev\bzrlib\remote.py", line 1089, in _ensure_real
    self.bzrdir._ensure_real()
  File "C:\dev\bzr\bzr.dev\bzrlib\remote.py", line 410, in _ensure_real
    self.root_transport, _server_formats=False)
  File "C:\dev\bzr\bzr.dev\bzrlib\bzrdir.py", line 858, in open_from_transport
    redirected)
  File "C:\dev\bzr\bzr.dev\bzrlib\transport\__init__.py", line 1683, in do_catching_redirections
    return action(transport)
  File "C:\dev\bzr\bzr.dev\bzrlib\bzrdir.py", line 845, in find_format
    transport, _server_formats=_server_formats)
  File "C:\dev\bzr\bzr.dev\bzrlib\controldir.py", line 769, in find_format
    return prober.probe_transport(transport)
  File "C:\dev\bzr\bzr.dev\bzrlib\bzrdir.py", line 1346, in probe_transport
    format_string = transport.get_bytes(".bzr/branch-format")
  File "C:\dev\bzr\bzr.dev\bzrlib\transport\remote.py", line 226, in get_bytes
    resp, response_handler = self._client.call_expecting_body('get', remote)
  File "C:\dev\bzr\bzr.dev\bzrlib\smart\client.py", line 145, in call_expecting_body
    method, args, expect_response_body=True)
  File "C:\dev\bzr\bzr.dev\bzrlib\smart\client.py", line 81, in _call_and_read_response
    expect_body=expect_response_body),
  File "C:\dev\bzr\bzr.dev\bzrlib\smart\message.py", line 299, in read_response_tuple
    self._wait_for_response_args()
  File "C:\dev\bzr\bzr.dev\bzrlib\smart\message.py", line 264, in _wait_for_response_args
    self._read_more()
  File "C:\dev\bzr\bzr.dev\bzrlib\smart\message.py", line 286, in _read_more
    "Unexpected end of message. "
ConnectionReset: Connection closed: Unexpected end of message. Please check connectivity and permissions, and report a bug if problems persist.

I might focus on newer clients, but Lucid is still in support, and has bzr 2.1. Note that in the above case, I manually did a "SIGHUP" to the server as soon as get_stream started, about 150s before the stream stopped. Now, that particular code was before I did socket.shutdown(SHUT_RD). Which may act differently. However, in a test case, after sending a request, sending the shutdown, reading the request finish, seeing the server stopped serving the connection, and then trying another request, I *still* saw the failure during the read:
  File "C:\dev\bzr\work\bzrlib\tests\test_smart_transport.py", line 1279, in test_graceful_shutdown_waits_for_clients_to
_stop
    'bigfile')
  File "C:\dev\bzr\work\bzrlib\smart\client.py", line 145, in call_expecting_body
    method, args, expect_response_body=True)
  File "C:\dev\bzr\work\bzrlib\smart\client.py", line 81, in _call_and_read_response
    expect_body=expect_response_body),
  File "C:\dev\bzr\work\bzrlib\smart\message.py", line 299, in read_response_tuple
    self._wait_for_response_args()
  File "C:\dev\bzr\work\bzrlib\smart\message.py", line 264, in _wait_for_response_args
    self._read_more()
  File "C:\dev\bzr\work\bzrlib\smart\message.py", line 286, in _read_more
    "Unexpected end of message. "
ConnectionReset: Connection closed: Unexpected end of message. Please check connectivity and permissions, and report a bug if problems persist.

Maybe we are accidentally suppressing a write failure from:
class SmartTCPClientMedium(SmartClientStreamMedium):
...
    def _accept_bytes(self, bytes):
        """See SmartClientMedium.accept_bytes."""
        self._ensure_connection()
        osutils.send_all(self._socket, bytes, self._report_activity)

Anyway, this certainly complicates things (IMO). If we can't detect at write time that the connection is closed, then we don't actually know whether the server got the request, and we just failed to read the response. Which means that we can only retry requests which are idempotent. If we saw a write failure, then we could say "oh, we didn't send a complete message, so clearly we should retry".

I would be ok if we just couldn't retry "call_with_body_stream()". Though it is a shame that if the connection is closed just before we start streaming, then we start streaming only to find the connection closed, we can't just try again.

Suggestions welcome. Especially if I'm just measuring something wrong. Though especially for stuff like "bzr+ssh" there are too many layers of buffering for me to feel like we can reliably detect a write failure. (local buffering, pipe buffering, SSH buffering, network buffering, etc)