Comment 6 for bug 819604

Revision history for this message
Andrew Bennetts (spiv) wrote : Re: [Bug 819604] Re: when an idle ssh transport is interrupted, bzrlib errors

Martin Pool wrote:
> We can't just reconnect at arbitrary points in the protocol, only at the
> start of a request. The medium itself doesn't know about request
> boundaries so the higher-level client needs to be involved.

Actually, SmartClientMedium does know about requests: the are constructed via
its get_request[1] method, all the SmartClientStreamMedium classes track a
_current_request instance attribute. (This state has to be tracked so we can
generate TooManyConcurrentRequests errors.)

It's perhaps not clear in the API docs but _current_request is always set to
None when the current request/response cycle has been fully consumed, and I
think it would be reasonable to rely on this (and add a public API to expose
this fact cleanly, perhaps has_unfinished_request?). So you don't need to
involve the higher-level client necessarily.

> Retrying may be a bit complex if we're planning to send streaming data
> coming from a generator; so ideally we will find that the connection was
> closed and restart before we start taking anything from a generator. We
> could just leave this case unhandled and deal with it if it does come
> up: it wasn't here.

Also there's the tricky case of when you've fully sent a request but the
connection died before you got the response: did the request complete or not?
Some requests are read-only or might be idempotent, but not all of them, so you
can't retry a request without some higher-level knowledge about what can be
retried (or perhaps in some cases how to retry: e.g. if you don't know if your
lock_write call worked you can perhaps try a second lock_write and inspect the
LockContention details to see if it's actually your lock or not?).

And in the extreme there's no guarantee that the server hasn't started acting on
a request before receiving the complete request. Off the top of my head I don't
think we have any current requests where that's a practical issue (e.g.
insert_stream writes to a temporary file with a randomly generated name), but
probably worth quickly scanning the request implementations to make sure there's
no surprises here, and then documenting any new constraints this adds to the
protocol.

> An alternative approach might be check if the pipe/socket has closed
> before we try to write the start of a request. In the case that the
> remote end has timed out and closed the socket and we are using an
> external ssh it will probably have already noticed this.

At a guess we'll often first notice timed out connections when we start writing
a request (and the socket may even appear to be alive until we try that, e.g. if
a NAT table in a router has timed out our connection then our TCP stack may know
nothing about it). So perhaps a better alternative is “if the pipe/socket has
closed on our first attempt to write to it during a request, then try
reconnecting”?

[1] And at a glance I see a bug that SmartClientMedium doesn't actually define
    get_request, although it's intended to define it as an abstract method. It
    doesn't matter in practice because all subclasses do define it.