Bazaar

Bug #819604
Comment #8

Comment 8 for bug 819604

Revision history for this message

Martin Pool (mbp) wrote on 2011-08-11: Re: [Bug 819604] Re: when an idle ssh transport is interrupted, bzrlib errors

On 10 August 2011 17:11, Andrew Bennetts
<email address hidden> wrote:
> Martin Pool wrote:
>> We can't just reconnect at arbitrary points in the protocol, only at the
>> start of a request. The medium itself doesn't know about request
>> boundaries so the higher-level client needs to be involved.
>
> Actually, SmartClientMedium does know about requests: the are constructed via
> its get_request[1] method, all the SmartClientStreamMedium classes track a
> _current_request instance attribute. (This state has to be tracked so we can
> generate TooManyConcurrentRequests errors.)

And of course we want to generate them ;-)

>
> It's perhaps not clear in the API docs but _current_request is always set to
> None when the current request/response cycle has been fully consumed, and I
> think it would be reasonable to rely on this (and add a public API to expose
> this fact cleanly, perhaps has_unfinished_request?). So you don't need to
> involve the higher-level client necessarily.

I see, so it's not exactly the medium that knows, but it's something
quite intimately related.

> Also there's the tricky case of when you've fully sent a request but the
> connection died before you got the response: did the request complete or not?
> Some requests are read-only or might be idempotent, but not all of them, so you
> can't retry a request without some higher-level knowledge about what can be
> retried (or perhaps in some cases how to retry: e.g. if you don't know if your
> lock_write call worked you can perhaps try a second lock_write and inspect the
> LockContention details to see if it's actually your lock or not?).
>
> And in the extreme there's no guarantee that the server hasn't started acting on
> a request before receiving the complete request. Off the top of my head I don't
> think we have any current requests where that's a practical issue (e.g.
> insert_stream writes to a temporary file with a randomly generated name), but
> probably worth quickly scanning the request implementations to make sure there's
> no surprises here, and then documenting any new constraints this adds to the
> protocol.
>
>> An alternative approach might be check if the pipe/socket has closed
>> before we try to write the start of a request. In the case that the
>> remote end has timed out and closed the socket and we are using an
>> external ssh it will probably have already noticed this.
>
> At a guess we'll often first notice timed out connections when we start writing
> a request (and the socket may even appear to be alive until we try that, e.g. if
> a NAT table in a router has timed out our connection then our TCP stack may know
> nothing about it). So perhaps a better alternative is “if the pipe/socket has
> closed on our first attempt to write to it during a request, then try
> reconnecting”?

Yes, something like that. In the case we care about typically there
will be an external ssh process in the line, so there's a bit more
buffering: we could write several things to ssh before it starts
writing to the network and on the other hand ssh may discover the
connection has closed even before we actually write anything. I think
some trials in realistic cases will be needed.

On 10 August 2011 17:11, Andrew Bennetts
<andrew-launchpad@bemusement.org> wrote:
> Martin Pool wrote:
>> We can't just reconnect at arbitrary points in the protocol, only at the
>> start of a request.  The medium itself doesn't know about request
>> boundaries so the higher-level client needs to be involved.
>
> Actually, SmartClientMedium does know about requests: the are constructed via
> its get_request[1] method, all the SmartClientStreamMedium classes track a
> _current_request instance attribute.  (This state has to be tracked so we can
> generate TooManyConcurrentRequests errors.)

And of course we want to generate them ;-)

>
> It's perhaps not clear in the API docs but _current_request is always set to
> None when the current request/response cycle has been fully consumed, and I
> think it would be reasonable to rely on this (and add a public API to expose
> this fact cleanly, perhaps has_unfinished_request?).  So you don't need to
> involve the higher-level client necessarily.

I see, so it's not exactly the medium that knows, but it's something
quite intimately related.

> Also there's the tricky case of when you've fully sent a request but the
> connection died before you got the response: did the request complete or not?
> Some requests are read-only or might be idempotent, but not all of them, so you
> can't retry a request without some higher-level knowledge about what can be
> retried (or perhaps in some cases how to retry: e.g. if you don't know if your
> lock_write call worked you can perhaps try a second lock_write and inspect the
> LockContention details to see if it's actually your lock or not?).
>
> And in the extreme there's no guarantee that the server hasn't started acting on
> a request before receiving the complete request.  Off the top of my head I don't
> think we have any current requests where that's a practical issue (e.g.
> insert_stream writes to a temporary file with a randomly generated name), but
> probably worth quickly scanning the request implementations to make sure there's
> no surprises here, and then documenting any new constraints this adds to the
> protocol.
>
>> An alternative approach might be check if the pipe/socket has closed
>> before we try to write the start of a request.  In the case that the
>> remote end has timed out and closed the socket and we are using an
>> external ssh it will probably have already noticed this.
>
> At a guess we'll often first notice timed out connections when we start writing
> a request (and the socket may even appear to be alive until we try that, e.g. if
> a NAT table in a router has timed out our connection then our TCP stack may know
> nothing about it).  So perhaps a better alternative is “if the pipe/socket has
> closed on our first attempt to write to it during a request, then try
> reconnecting”?

Yes, something like that.  In the case we care about typically there
will be an external ssh process in the line, so there's a bit more
buffering: we could write several things to ssh before it starts
writing to the network and on the other hand ssh may discover the
connection has closed even before we actually write anything.  I think
some trials in realistic cases will be needed.