There is an initial implementation in the linked branch: lp:~jameinel/bzr/2.1-client-reconnect-819604

I'm starting the work based on 2.1, since that is what is present in LTS Lucid. (Hardy only has 1.3, and I don't think we are going to bother backporting that far.)

ATM, it works only during the write phase. So if we get a ConnectionReset while writing out the request, we assume:

a) If we didn't finish writing before getting ConnectionReset, the server will throw away the request as being incomplete. Thus
b) It is always safe to just retry the request.
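
A minimal sketch of that write-phase retry, to make the assumption concrete. This is not the code in the branch: ConnectionReset is redefined locally as a stand-in, and FlakyMedium, reconnect() and send_request() are hypothetical names used only for illustration.

```python
class ConnectionReset(Exception):
    """Local stand-in for the error raised when the connection drops."""


class FlakyMedium:
    """Hypothetical medium whose first `failures` writes are reset."""

    def __init__(self, failures=1):
        self.failures = failures
        self.connects = 0
        self.sent = []

    def reconnect(self):
        # A real medium would open a new connection here.
        self.connects += 1

    def write(self, data):
        if self.failures > 0:
            self.failures -= 1
            raise ConnectionReset("connection dropped mid-write")
        self.sent.append(data)


def send_request(medium, request_bytes, max_retries=1):
    """Write request_bytes, reconnecting and retrying if the write is reset.

    Relies on assumption (a): a request that was not fully written is
    discarded by the server, so sending it again is safe.
    Returns the number of retries that were needed.
    """
    attempts = 0
    while True:
        try:
            medium.write(request_bytes)
            return attempts
        except ConnectionReset:
            if attempts >= max_retries:
                raise
            attempts += 1
            medium.reconnect()
```

Capping the retries keeps a server that is really gone from turning into an endless reconnect loop.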

I tested it successfully by doing a branch from devpad (using bzr.dev on the server), and sending SIGHUP to the 'bzr serve --inet' process. It shut down gracefully on devpad, the client logged "hey I got disconnected, I'll retry", and the 'bzr branch' finished.

I did observe that if I did the SIGHUP during the 'get_parent_map' discovery phase, it actually failed during 'read_response' rather than 'write_end'. I'm assuming that is because get_parent_map is happening fast enough that the time between reading the previous response and writing the next one was insufficient for the local ssh client to notice that the connection had been closed.

Compare that with "get_stream()" which has to spend some local time chewing on the final request stream (such as validating all the indexes refer to proper data, etc).

Now, I think the patch as it stands is already helpful, since most times if you are getting SIGHUP, it is going to occur during long-running requests.

The remaining concerns as I see it:

1) I can't re-try 'body_stream' requests at this level. I suppose at the lower level we could see if we got 'ConnectionReset' before we started consuming any of the body_stream. However, the way caching is implemented, we default to buffering 1MB of local content before we send any of it. So it seems pretty likely that we will already have consumed at least some of body_stream (possibly all of it) by the time we see the error.
  a) If we know that we got all of the request (we got all the way to _write_end), it seems like we could just re-send our local buffer. (If we also check that there is only one buffer that needs to be sent, etc.)
  b) We change the logic so that ConnectionReset is handled much higher, at least for cases where body_stream is involved. I like handling it at this layer if we can, because it means I don't have to update tons of call sites.

2) Some requests might not be obviously incomplete if you only write some of the request header. However, I'm pretty sure that ProtocolThree prefixes all requests with a number-of-bytes header. I think it can send 'chunked' body content, but if it does so, I'm hoping it clearly identifies "end-of-chunks". I'll try to validate that.
That would validate my assumption that "if we fail during write, we are guaranteed the server will ignore the request".

3) I think some (many, in fact) of our requests are perfectly idempotent. Such as 'get' or 'get_parent_map', etc. I'm not sure yet how to tell whether a given request is ok to retry transparently. For example 'insert_stream' doesn't seem like it should be retried if we fail during reading the response. (Though here it is certainly much more likely to fail during writing, etc.)
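
To make concerns 1a and 3 a bit more concrete, here is a rough sketch of how the "can we retry?" decision might look. Everything here is an assumption for illustration: fill_first_buffer(), may_retry() and IDEMPOTENT_REQUESTS are hypothetical names, the request names are just the ones used in this comment rather than a vetted list, and the write-phase rule only holds if concern 2 checks out.

```python
BUFFER_LIMIT = 1024 * 1024  # the 1MB of local buffering from concern 1

# Assumption: request names as used in this comment, not a vetted list.
IDEMPOTENT_REQUESTS = frozenset(["get", "get_parent_map"])


def fill_first_buffer(body_stream, limit=BUFFER_LIMIT):
    """Consume body_stream until `limit` bytes have been buffered.

    Returns (buffered_bytes, remaining_iterator, replayable). replayable is
    True only if the stream ended before the limit was reached, so the
    single local buffer holds the whole body and could be sent again.
    """
    chunks = []
    size = 0
    iterator = iter(body_stream)
    for chunk in iterator:
        chunks.append(chunk)
        size += len(chunk)
        if size >= limit:
            # Conservative: there may be more data, so do not replay.
            return b"".join(chunks), iterator, False
    return b"".join(chunks), iterator, True


def may_retry(request_name, phase, has_body_stream=False,
              body_replayable=False):
    """Decide whether a request that hit a connection reset can be re-sent.

    phase is "write" or "read", naming where the reset was seen.
    """
    if has_body_stream and not body_replayable:
        # Part of the body is gone and cannot be regenerated at this level.
        return False
    if phase == "write":
        # Assumes the server throws away a partially written request.
        return True
    # The server may already have acted on the request; only repeat it
    # if doing so is harmless.
    return request_name in IDEMPOTENT_REQUESTS
```

With this, a 'get_parent_map' that fails in read_response would be retried, while an 'insert_stream' that fails while reading the response would not, and a body_stream request would only be retried when the whole body fit in the first buffer.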