when an idle ssh transport is interrupted, bzrlib errors; should reconnect instead

Bug #819604 reported by Monty Taylor on 2011-08-02
This bug affects 3 people
Affects       Importance   Assigned to
Bazaar        High         John A Meinel
Bazaar 2.1    High         John A Meinel
Bazaar 2.2    High         John A Meinel
Bazaar 2.3    High         John A Meinel
Bazaar 2.4    High         John A Meinel

Bug Description

A recent haproxy rollout on Launchpad has illuminated a bzrlib bug:

20:29 <lifeless> mtaylor: wgrant: so, this is a bzrlib bug.
20:29 <lifeless> not the holding open
20:29 <lifeless> the handling of the disconnects.
20:29 <lifeless> transports are defined stateless.
20:29 <lifeless> so are RPC calls.
20:31 <lifeless> the haproxy stuff is showing this up very visibly, but the bug pre-existed - bzr didn't send keepalives, and if ssh wasn't configured to do so it could naturally happen anyhow

The following traceback shows the carnage which happens when tarmac holds open its connection around the post-merge hook:

Traceback (most recent call last):
  File "/usr/bin/tarmac", line 6, in <module>
    main()
  File "/usr/lib/pymodules/python2.6/tarmac/bin/__init__.py", line 30, in main
    registry.run(args)
  File "/usr/lib/pymodules/python2.6/tarmac/bin/registry.py", line 60, in run
    self._run(args)
  File "/usr/lib/pymodules/python2.6/tarmac/bin/registry.py", line 48, in _run
    run_bzr(args)
  File "/usr/lib/pymodules/python2.6/bzrlib/commands.py", line 1124, in run_bzr
    ret = run(*run_argv)
  File "/usr/lib/pymodules/python2.6/bzrlib/commands.py", line 689, in run_argv_aliases
    return self.run(**all_cmd_args)
  File "/usr/lib/pymodules/python2.6/bzrlib/commands.py", line 711, in run
    return self._operation.run_simple(*args, **kwargs)
  File "/usr/lib/pymodules/python2.6/bzrlib/cleanup.py", line 135, in run_simple
    self.cleanups, self.func, *args, **kwargs)
  File "/usr/lib/pymodules/python2.6/bzrlib/cleanup.py", line 165, in _do_with_cleanups
    result = func(*args, **kwargs)
  File "/usr/lib/pymodules/python2.6/tarmac/bin/commands.py", line 356, in run
    self._do_merges(branch_url)
  File "/usr/lib/pymodules/python2.6/tarmac/bin/commands.py", line 274, in _do_merges
    authors=source.authors,
  File "/usr/lib/pymodules/python2.6/tarmac/branch.py", line 163, in authors
    self.bzr_branch.lock_read()
  File "/usr/lib/pymodules/python2.6/bzrlib/remote.py", line 2430, in lock_read
    self.repository.lock_read()
  File "/usr/lib/pymodules/python2.6/bzrlib/remote.py", line 1021, in lock_read
    self._real_repository.lock_read()
  File "/usr/lib/pymodules/python2.6/bzrlib/repofmt/pack_repo.py", line 2402, in lock_read
    repo.lock_read()
  File "/usr/lib/pymodules/python2.6/bzrlib/remote.py", line 1021, in lock_read
    self._real_repository.lock_read()
  File "/usr/lib/pymodules/python2.6/bzrlib/repofmt/pack_repo.py", line 2403, in lock_read
    self._refresh_data()
  File "/usr/lib/pymodules/python2.6/bzrlib/repofmt/pack_repo.py", line 2327, in _refresh_data
    self._pack_collection.reload_pack_names()
  File "/usr/lib/pymodules/python2.6/bzrlib/repofmt/pack_repo.py", line 2040, in reload_pack_names
    orig_disk_nodes) = self._diff_pack_names()
  File "/usr/lib/pymodules/python2.6/bzrlib/repofmt/pack_repo.py", line 1906, in _diff_pack_names
    for index, key, value in self._iter_disk_pack_index():
  File "/usr/lib/pymodules/python2.6/bzrlib/btree_index.py", line 996, in iter_all_entries
    if not self.key_count():
  File "/usr/lib/pymodules/python2.6/bzrlib/btree_index.py", line 1443, in key_count
    self._get_root_node()
  File "/usr/lib/pymodules/python2.6/bzrlib/btree_index.py", line 940, in _get_root_node
    self._get_internal_nodes([0])
  File "/usr/lib/pymodules/python2.6/bzrlib/btree_index.py", line 965, in _get_internal_nodes
    return self._get_nodes(self._internal_node_cache, node_indexes)
  File "/usr/lib/pymodules/python2.6/bzrlib/btree_index.py", line 957, in _get_nodes
    found.update(self._get_and_cache_nodes(needed))
  File "/usr/lib/pymodules/python2.6/bzrlib/btree_index.py", line 734, in _get_and_cache_nodes
    for node_pos, node in self._read_nodes(sorted(nodes)):
  File "/usr/lib/pymodules/python2.6/bzrlib/btree_index.py", line 1530, in _read_nodes
    bytes = self._transport.get_bytes(self._name)
  File "/usr/lib/pymodules/python2.6/bzrlib/transport/remote.py", line 226, in get_bytes
    resp, response_handler = self._client.call_expecting_body('get', remote)
  File "/usr/lib/pymodules/python2.6/bzrlib/smart/client.py", line 145, in call_expecting_body
    method, args, expect_response_body=True)
  File "/usr/lib/pymodules/python2.6/bzrlib/smart/client.py", line 79, in _call_and_read_response
    readv_body=readv_body, body_stream=body_stream)
  File "/usr/lib/pymodules/python2.6/bzrlib/smart/client.py", line 63, in _send_request
    encoder.call(method, *args)
  File "/usr/lib/pymodules/python2.6/bzrlib/smart/protocol.py", line 1309, in call
    self._write_end()
  File "/usr/lib/pymodules/python2.6/bzrlib/smart/protocol.py", line 1133, in _write_end
    self.flush()
  File "/usr/lib/pymodules/python2.6/bzrlib/smart/protocol.py", line 1099, in flush
    self._real_write_func(''.join(self._buf))
  File "/usr/lib/pymodules/python2.6/bzrlib/smart/medium.py", line 395, in accept_bytes
    self._accept_bytes(bytes)
  File "/usr/lib/pymodules/python2.6/bzrlib/smart/medium.py", line 977, in _accept_bytes
    self._medium._accept_bytes(bytes)
  File "/usr/lib/pymodules/python2.6/bzrlib/smart/medium.py", line 794, in _accept_bytes
    self._real_medium.accept_bytes(bytes)
  File "/usr/lib/pymodules/python2.6/bzrlib/smart/medium.py", line 688, in accept_bytes
    self._accept_bytes(bytes)
  File "/usr/lib/pymodules/python2.6/bzrlib/smart/medium.py", line 861, in _accept_bytes
    osutils.send_all(self._socket, bytes, self._report_activity)
  File "/usr/lib/pymodules/python2.6/bzrlib/osutils.py", line 2075, in send_all
    sent = sock.send(buffer(bytes, sent_total, MAX_SOCKET_CHUNK))
socket.error: [Errno 32] Broken pipe

implementation plan:

 * catch the socket.error and change it into a higher-level "connection closed" error
 * this might need care to cover both internal/paramiko and external ssh clients; external clients might give different errors on Windows
 * in the client per-call method, catch this error if it happens while sending the initial request, log a message, and build a new medium(?) object
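The first two plan steps could look roughly like this sketch (the ConnectionReset class, the _DISCONNECT_ERRNOS set, and the send_all wrapper are illustrative names under my assumptions, not bzrlib's actual code):

```python
import errno
import socket


class ConnectionReset(Exception):
    """Higher-level 'connection closed' error (hypothetical name)."""


# errno values that indicate the peer went away mid-conversation
_DISCONNECT_ERRNOS = frozenset([
    errno.EPIPE, errno.ECONNRESET, errno.ECONNABORTED,
])


def raise_if_disconnect(exc):
    """Re-raise a socket.error as ConnectionReset when it means 'peer gone'."""
    if exc.errno in _DISCONNECT_ERRNOS:
        raise ConnectionReset("connection closed: %s" % (exc,))
    raise exc


def send_all(sock, data):
    """Like a send-all helper, but reporting disconnects at a higher level."""
    try:
        sock.sendall(data)
    except socket.error as e:
        raise_if_disconnect(e)
```

Covering external ssh clients (and Windows) would mostly mean extending the errno set and catching EOF-on-pipe as well.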

To test it automatically:

 * make a special medium that raises this error through monkeypatching
 * use existing ssh tests that start a real server to check that the right error is raised when the connection is interrupted

Interactive tests are probably a good idea, considering the interaction with external systems:
 * start a bzr ssh client from a python shell
 * kill the server process
 * send a new request and check that it logs and reconnects
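The monkeypatching idea above might look like this sketch: a stand-in medium whose first write fails as if the peer hung up, plus a per-call retry that rebuilds the medium (FlakyMedium, make_flaky_factory, and call_with_reconnect are hypothetical test scaffolding):

```python
import errno
import socket


class FlakyMedium(object):
    """Test double: the first connection ever made fails on its first write."""

    def __init__(self, shared_state):
        self._state = shared_state
        self.sent = []

    def accept_bytes(self, data):
        if not self._state['already_failed']:
            self._state['already_failed'] = True
            raise socket.error(errno.EPIPE, 'Broken pipe')
        self.sent.append(data)


def make_flaky_factory():
    # share the failure flag so only the first medium ever created fails
    state = {'already_failed': False}
    return lambda: FlakyMedium(state)


def call_with_reconnect(medium_factory, request_bytes):
    """Send one request; on a broken pipe at the start, rebuild the
    medium and retry exactly once."""
    medium = medium_factory()
    try:
        medium.accept_bytes(request_bytes)
    except socket.error as e:
        if e.errno != errno.EPIPE:
            raise
        # a real client would log "connection closed, reconnecting" here
        medium = medium_factory()
        medium.accept_bytes(request_bytes)
    return medium
```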

Related branches

Jelmer Vernooij (jelmer) on 2011-08-02
Changed in bzr:
status: New → Triaged
status: Triaged → Confirmed
importance: Undecided → High
Jelmer Vernooij (jelmer) on 2011-08-02
Changed in bzr:
importance: High → Critical
assignee: nobody → canonical-bazaar (canonical-bazaar)


On 8/2/2011 12:45 PM, Jelmer Vernooij wrote:
> ** Changed in: bzr Importance: High => Critical
>
> ** Changed in: bzr Assignee: (unassigned) => canonical-bazaar
> (canonical-bazaar)
>

Is this actually critical?

Also, the initial report was "idle ssh transport". Monty's traceback is
certainly not 'idle' given that we are actively fetching content.

And while the transfer-of-content is roughly stateless, I'm not 100%
sure we want to default-reconnect SSH connections.

I suppose we could use a reasonable try-again-but-not-forever, or
try-once-but-fail-if-last-connect-failed. People with flaky connections
would get slow-but-useful connections. Restarting codehosting would
allow things to continue roughly uninterrupted, but the network going down
wouldn't cause us to spin indefinitely.

John
=:->


Jelmer Vernooij (jelmer) wrote :

On 05/08/11 16:43, John A Meinel wrote:
> On 8/2/2011 12:45 PM, Jelmer Vernooij wrote:
>> ** Changed in: bzr Importance: High => Critical
>>
>> ** Changed in: bzr Assignee: (unassigned) => canonical-bazaar
>> (canonical-bazaar)
>>
> Is this actually critical?
I'm not sure if it is critical anymore. The timeout for codehosting
haproxy has now been increased, which prevents this from being a big
problem at the moment.
>
> Also, the initial report was "idle ssh transport". Monty's traceback is
> certainly not 'idle' given that we are actively fetching content.
I think the problem is that the connection was idle for a while - enough
for it to timeout. Trying to reuse it again later then triggered this
exception, at least that is my understanding of it.

> And while the transfer-of-content is roughly stateless, I'm not 100%
> sure we want to default-reconnect SSH connections.
>
> I suppose we could use a reasonable try-again-but-not-forever, or
> try-once-but-fail-if-last-connect-failed. People with flaky connections
> would get slow-but-useful connections. Restarting codehosting would
> allow things to continue roughly uninterrupted, but the network going down
> wouldn't cause us to spin indefinitely.
>
That sounds reasonable.

Cheers,

Jelmer

Robert Collins (lifeless) wrote :

This is critical for LP: we're currently unable to deploy to the
codehosting service again, until this is fixed and pushed out to SRUs.

The logic being:
 - we can't reasonably implement a server side 'please disconnect now'
trigger to the bzr server if it will cause all current clients to
error
 - we can't deploy to the service if it needs an admin to baby sit it for hours
 - and we can't set the idle time low enough to timeout on CI servers
if the timeout causes clients to error
 - and so with no timeout and no explicit 'please stop now' deploys to
that service always end up having to interrupt a connection that is in
an unknown state.

It's also critical path for the faster ssh daemon stuff, because until
we can deploy reliably we can't half-deploy and experiment with
confidence.

-Rob

Martin Pool (mbp) wrote :

This doesn't need to block the bzr 2.4.0 release, so I'm going to
downgrade it to high, and also open tasks for the backports. Now that
the timeout has been backed off enough that connections aren't dropping
all the time, this is no longer an operational incident.

> This is critical for LP: we're currently unable to deploy to the codehosting service again, until this is fixed and pushed out to SRUs.

I think you still can deploy, the same way you could deploy a month
ago. Clients will drop their connections and presumably recover in
some client-specific way, perhaps generating some noise.

Fixing this will get Launchpad closer to doing apparently-seamless
deployments, but we also have to fix things like bug 795025.

tags: added: error-reporting hpss launchpad ssh
Martin Pool (mbp) on 2011-08-10
Changed in bzr:
importance: Critical → High
assignee: canonical-bazaar (canonical-bazaar) → Martin Pool (mbp)
Martin Pool (mbp) on 2011-08-10
description: updated

The medium is a fairly thin wrapper around the socket/pipes for the connection. It seems it would be reasonable to either have the medium reconnect, or to throw away the medium and make a new one.

There are several _accept_bytes implementations; the most relevant ones here are SmartTCPClientMedium and (especially for Launchpad) SmartSSHClientMedium.

We can't just reconnect at arbitrary points in the protocol, only at the start of a request. The medium itself doesn't know about request boundaries so the higher-level client needs to be involved.

Retrying may be a bit complex if we're planning to send streaming data coming from a generator; so ideally we will find that the connection was closed and restart before we start taking anything from a generator. We could just leave this case unhandled and deal with it if it does come up: it wasn't the case in this traceback.

An alternative approach might be to check whether the pipe/socket has closed before we try to write the start of a request. In the case that the remote end has timed out and closed the socket and we are using an external ssh, it will probably have already noticed this.

Martin Pool wrote:
> We can't just reconnect at arbitrary points in the protocol, only at the
> start of a request. The medium itself doesn't know about request
> boundaries so the higher-level client needs to be involved.

Actually, SmartClientMedium does know about requests: they are constructed via
its get_request[1] method, and all the SmartClientStreamMedium classes track a
_current_request instance attribute. (This state has to be tracked so we can
generate TooManyConcurrentRequests errors.)

It's perhaps not clear in the API docs but _current_request is always set to
None when the current request/response cycle has been fully consumed, and I
think it would be reasonable to rely on this (and add a public API to expose
this fact cleanly, perhaps has_unfinished_request?). So you don't need to
involve the higher-level client necessarily.
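A sketch of how that proposed public API might look (class and method bodies are simplified stand-ins; only the _current_request bookkeeping mirrors the description above):

```python
class TooManyConcurrentRequests(Exception):
    pass


class _Request(object):
    """Placeholder for one request/response cycle."""

    def __init__(self, medium):
        self._medium = medium

    def finished(self):
        # called once the response has been fully consumed
        self._medium._current_request = None


class SmartClientStreamMedium(object):
    def __init__(self):
        self._current_request = None

    def get_request(self):
        if self._current_request is not None:
            raise TooManyConcurrentRequests()
        self._current_request = _Request(self)
        return self._current_request

    def has_unfinished_request(self):
        """Proposed public API: is a request/response cycle still open?"""
        return self._current_request is not None
```

A reconnect policy could then refuse to retry whenever has_unfinished_request() is true.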

> Retrying may be a bit complex if we're planning to send streaming data
> coming from a generator; so ideally we will find that the connection was
> closed and restart before we start taking anything from a generator. We
> could just leave this case unhandled and deal with it if it does come
> up: it wasn't here.

Also there's the tricky case of when you've fully sent a request but the
connection died before you got the response: did the request complete or not?
Some requests are read-only or might be idempotent, but not all of them, so you
can't retry a request without some higher-level knowledge about what can be
retried (or perhaps in some cases how to retry: e.g. if you don't know if your
lock_write call worked you can perhaps try a second lock_write and inspect the
LockContention details to see if it's actually your lock or not?).

And in the extreme there's no guarantee that the server hasn't started acting on
a request before receiving the complete request. Off the top of my head I don't
think we have any current requests where that's a practical issue (e.g.
insert_stream writes to a temporary file with a randomly generated name), but
probably worth quickly scanning the request implementations to make sure there's
no surprises here, and then documenting any new constraints this adds to the
protocol.

> An alternative approach might be to check if the pipe/socket has closed
> before we try to write the start of a request. In the case that the
> remote end has timed out and closed the socket and we are using an
> external ssh it will probably have already noticed this.

At a guess we'll often first notice timed out connections when we start writing
a request (and the socket may even appear to be alive until we try that, e.g. if
a NAT table in a router has timed out our connection then our TCP stack may know
nothing about it). So perhaps a better alternative is “if the pipe/socket has
closed on our first attempt to write to it during a request, then try
reconnecting”?

[1] And at a glance I see a bug that SmartClientMedium doesn't actually define
    get_request, although it's intended to define it as an abstract method. It
    doesn't matter in practice because all subclasses do define it.

here's an external black box test that breaks the ssh connection to Launchpad

On 10 August 2011 17:11, Andrew Bennetts
<email address hidden> wrote:
> Martin Pool wrote:
>> We can't just reconnect at arbitrary points in the protocol, only at the
>> start of a request.  The medium itself doesn't know about request
>> boundaries so the higher-level client needs to be involved.
>
> Actually, SmartClientMedium does know about requests: they are constructed via
> its get_request[1] method, all the SmartClientStreamMedium classes track a
> _current_request instance attribute.  (This state has to be tracked so we can
> generate TooManyConcurrentRequests errors.)

And of course we want to generate them ;-)

>
> It's perhaps not clear in the API docs but _current_request is always set to
> None when the current request/response cycle has been fully consumed, and I
> think it would be reasonable to rely on this (and add a public API to expose
> this fact cleanly, perhaps has_unfinished_request?).  So you don't need to
> involve the higher-level client necessarily.

I see, so it's not exactly the medium that knows, but it's something
quite intimately related.

> Also there's the tricky case of when you've fully sent a request but the
> connection died before you got the response: did the request complete or not?
> Some requests are read-only or might be idempotent, but not all of them, so you
> can't retry a request without some higher-level knowledge about what can be
> retried (or perhaps in some cases how to retry: e.g. if you don't know if your
> lock_write call worked you can perhaps try a second lock_write and inspect the
> LockContention details to see if it's actually your lock or not?).
>
> And in the extreme there's no guarantee that the server hasn't started acting on
> a request before receiving the complete request.  Off the top of my head I don't
> think we have any current requests where that's a practical issue (e.g.
> insert_stream writes to a temporary file with a randomly generated name), but
> probably worth quickly scanning the request implementations to make sure there's
> no surprises here, and then documenting any new constraints this adds to the
> protocol.
>
>> An alternative approach might be check if the pipe/socket has closed
>> before we try to write the start of a request.  In the case that the
>> remote end has timed out and closed the socket and we are using an
>> external ssh it will probably have already noticed this.
>
> At a guess we'll often first notice timed out connections when we start writing
> a request (and the socket may even appear to be alive until we try that, e.g. if
> a NAT table in a router has timed out our connection then our TCP stack may know
> nothing about it).  So perhaps a better alternative is “if the pipe/socket has
> closed on our first attempt to write to it during a request, then try
> reconnecting”?

Yes, something like that. In the case we care about typically there
will be an external ssh process in the line, so there's a bit more
buffering: we could write several things to ssh before it starts
writing to the network and on the other hand ssh may discover the
connection has closed even before we actually write anything. I think
some trials in realisti...

Andrew Bennetts (spiv) wrote :

Martin Pool wrote:
> > _current_request instance attribute.  (This state has to be tracked so we can
> > generate TooManyConcurrentRequests errors.)
>
> And of course we want to generate them ;-)

Something like that ;)

Martin Pool (mbp) on 2011-08-12
summary: - when an idle ssh transport is interrupted, bzrlib errors
+ when an idle ssh transport is interrupted, bzrlib errors; should
+ reconnect instead
John A Meinel (jameinel) wrote :

Related to bug #824797

Changed in bzr:
assignee: Martin Pool (mbp) → John A Meinel (jameinel)
John A Meinel (jameinel) on 2011-09-26
Changed in bzr:
status: Confirmed → Triaged
status: Triaged → In Progress
John A Meinel (jameinel) wrote :

In my simple testing, I actually found that we don't discover we are disconnected until we *read* from the closed socket. I'm guessing the issue is that the write buffer is large enough vs the amount of content we are writing ('get', 'filename') that it just gets buffered until we wait to read back the response, and then the local socket notices that the remote one closed its end. I even tried an early socket.shutdown(SHUT_RD) on the server side, but still the only failure I got was during "_read_more."

This certainly makes it a bit more clumsy. Also, we default to buffering ~1MB (until the end of 'this write') before we actually write anything onto the socket.

All the layers and indirection make this harder, not to mention that sometimes we are writing to SSH, sometimes to 'simple pipes', sometimes to a socket, sometimes over HTTP, etc, etc.

I'm focusing on "(get)" as a first draft, because after doing "get_record_stream()" the first thing (at least older clients) do is probe the remote object for the real format, which involves reading .bzr/branch-format directly.

And I'm focusing on 'first thing after stream', because that is likely to be the bulk of the time spent, so it is the most likely time that we would end up interrupted. (And the most frustrating, since you'll finish 10min of data transfer, only to be interrupted checking if there are tags you might need.) An example traceback (bzr-2.1):

...
  File "C:\dev\bzr\bzr.dev\bzrlib\fetch.py", line 103, in __fetch
    self._fetch_everything_for_search(search_result)
  File "C:\dev\bzr\bzr.dev\bzrlib\fetch.py", line 134, in _fetch_everything_for_search
    stream = source.get_stream_for_missing_keys(missing_keys)
  File "C:\dev\bzr\bzr.dev\bzrlib\remote.py", line 2197, in get_stream_for_missing_keys
    self.from_repository._ensure_real()
  File "C:\dev\bzr\bzr.dev\bzrlib\remote.py", line 1089, in _ensure_real
    self.bzrdir._ensure_real()
  File "C:\dev\bzr\bzr.dev\bzrlib\remote.py", line 410, in _ensure_real
    self.root_transport, _server_formats=False)
  File "C:\dev\bzr\bzr.dev\bzrlib\bzrdir.py", line 858, in open_from_transport
    redirected)
  File "C:\dev\bzr\bzr.dev\bzrlib\transport\__init__.py", line 1683, in do_catching_redirections
    return action(transport)
  File "C:\dev\bzr\bzr.dev\bzrlib\bzrdir.py", line 845, in find_format
    transport, _server_formats=_server_formats)
  File "C:\dev\bzr\bzr.dev\bzrlib\controldir.py", line 769, in find_format
    return prober.probe_transport(transport)
  File "C:\dev\bzr\bzr.dev\bzrlib\bzrdir.py", line 1346, in probe_transport
    format_string = transport.get_bytes(".bzr/branch-format")
  File "C:\dev\bzr\bzr.dev\bzrlib\transport\remote.py", line 226, in get_bytes
    resp, response_handler = self._client.call_expecting_body('get', remote)
  File "C:\dev\bzr\bzr.dev\bzrlib\smart\client.py", line 145, in call_expecting_body
    method, args, expect_response_body=True)
  File "C:\dev\bzr\bzr.dev\bzrlib\smart\client.py", line 81, in _call_and_read_response
    expect_body=expect_response_body),
  File "C:\dev\bzr\bzr.dev\bzrlib\smart\message.py", line 299, in read_response_tuple
    self._wait_for_response_args()
  File "C:\dev\bzr\bzr.dev\bzrlib\...

I agree that there's a lot of buffering and indirection so it will be
hard to detect it for writes.

I do wonder if you could try a nonblocking select-for-read before we
start writing: if the remote end has closed, this may tell us "there's
something readable" and then the thing we read turns out to be an EOF.
 I think that is the correct semantic for determining when the remote
end has closed. However, the layers may still trip us up, if it turns
out that ssh does not close the client's stdin until the entire
connection has closed. There is a complication also if we actually
get a non-eof byte, but that is unlikely and probably could be
handled. I think this would be worth a try.
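The select-for-read probe suggested above can be sketched for a plain socket (peer_has_closed is a hypothetical helper; an external ssh process in between adds the buffering caveats noted, and a real non-EOF byte still needs handling by the caller):

```python
import select
import socket


def peer_has_closed(sock):
    """Nonblocking probe: True if the remote end appears to have closed.

    A closed peer makes the socket readable with an EOF; MSG_PEEK leaves
    any real (non-EOF) byte in the buffer for the normal read path.
    """
    readable, _, _ = select.select([sock], [], [], 0)
    if not readable:
        return False
    try:
        data = sock.recv(1, socket.MSG_PEEK)
    except socket.error:
        return True
    return data == b''
```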

John A Meinel (jameinel) wrote :

There is an initial implementation in the linked branch: lp:~jameinel/bzr/2.1-client-reconnect-819604

I'm starting the work based on 2.1, since that is what is present in LTS Lucid. (Hardy only has 1.3, and I don't think we are going to bother backporting that far.)

ATM, it works only during the write phase. So if we get a ConnectionReset while writing out the request, we assume:

a) If we didn't finish writing before getting ConnectionReset, the server will throw away the request as being incomplete. Thus
b) It is always safe to just retry the request.

I tested it successfully by doing a branch from devpad (using bzr.dev on the server), and sending SIGHUP to the 'bzr serve --inet' process. It shut down gracefully on devpad, the client logged "hey I got disconnected, I'll retry", and the 'bzr branch' finished.

I did observe that if I did the SIGHUP during the 'get_parent_map' discovery phase, it actually failed during 'read_response' rather than 'write_end'. I'm assuming that is because get_parent_map happens fast enough that the time between reading the previous response and writing the next one was insufficient for the local ssh client to notice that the connection had been closed.

Compare that with "get_stream()" which has to spend some local time chewing on the final request stream (such as validating all the indexes refer to proper data, etc).

Now, I think the patch as it stands is already helpful, since most times if you are getting SIGHUP, it is going to occur during long-running requests.

The remaining concerns as I see it:

1) I can't re-try 'body_stream' requests at this level. I suppose at the lower level we could see if we got 'ConnectionReset' before we started consuming any of the body_stream. However, the way caching is implemented, we default to buffering 1MB of local content before we send any of it, which makes it pretty likely that at least some of body_stream (possibly all of it) has been consumed.
  a) If we know that we got all the request (we got all the way to _write_end), it seems like we could just re-send our local buffer. (If we also check that there is only one buffer that needs to be sent, etc.)
  b) We change the logic so that ConnectionReset is handled much higher, at least for cases where body_stream is involved. I like handling it at this layer if we can, because it means I don't have to update tons of call sites.

2) Some requests might not be obviously incomplete if you only write some of the request header. However, I'm pretty sure that ProtocolThree prefixes all requests with a number-of-bytes header. I think it can send 'chunked' body content, but if it does so, I'm hoping it clearly identifies "end-of-chunks". I'll try to validate that.
That would validate my assumption that "if we fail during write, we are guaranteed the server will ignore the request".
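Under that assumption, a byte-count prefix makes a partially written request detectable; a toy framing (not ProtocolThree's real wire format) shows the idea:

```python
import struct


def frame_request(payload):
    """Prefix the payload with its length so truncation is detectable."""
    return struct.pack('>I', len(payload)) + payload


def server_accepts(wire_bytes):
    """A server should ignore a request whose advertised length never
    fully arrived (i.e. the client died mid-write)."""
    if len(wire_bytes) < 4:
        return False
    (expected,) = struct.unpack('>I', wire_bytes[:4])
    return len(wire_bytes) - 4 >= expected
```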

3) I think some (many, in fact) of our requests are perfectly idempotent. Such as 'get' or 'get_parent_map', etc. I'm not sure yet how to tell whether a given request is ok to retry transparently. For example 'insert_stream' doesn't seem like it should be retried if we fail during reading the response. (Though here it is certainly much more likely to fail du...

John A Meinel (jameinel) wrote :

Some thoughts about idempotent requests.

1) Any read-only request should be idempotent.

2) This leaves requests like put(_bytes) and locking calls. Also, append. Note that RemoteTransport.open_write_stream uses an AppendBasedFileStream structure. So if we are using VFS to write out a remote pack file, we will call 'append $PATH content' repeatedly. 'insert_stream_1.19' is also a concern here.

2a) put_bytes should be an all-or-nothing call. If we repeat it, the final content is going to be the same, we may just be uploading the same content twice.

2b) append seems to be the one that can cause the most problems. bzrlib.pack.ContainerWriter does have start and end records that are potentially written as different append calls. It has a '.begin()' which calls write_func then a '.add_bytes_record()' that creates a string with headers and then calls 'write_func', then a final '.end()' that adds the trailer.

So imagine that any one of those got interrupted via a real network disconnect after the server had gotten the message and completed the request, before the client got the 'ok'. You would end up writing 'begin' two times before an actual record, etc.

To make matters even worse, ContainerWriter tracks 'current_offset'. Which means that if any 'write()' call ended up double sent, the index would no longer line up with the real content on disk.

I had originally thought we might be ok, because the indexes only reference the actual content in the pack file. So we could have ended up with bogus data in the middle, but it wouldn't be referenced. However:
 i) Pack files really do want to be readable without their associated index. They aren't 100% ok today, but leaving garbage in the middle wouldn't be a good way forward.
 ii) Our 'write_func' doesn't have any concept of number-of-bytes written and offset-they-were-written to. At least not at the level of ContainerWriter. We *do* have that concept as part of 'Transport.append()' but it isn't exposed to the other locations such that we could make use of it.
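The offset hazard can be shown with a few lines of arithmetic: one duplicated append shifts every later client-side offset (the byte strings here are made up for illustration).

```python
# Client-side bookkeeping, ContainerWriter-style:
records = [b'B', b'record-1', b'record-2']   # begin marker, then records
offsets = []
pos = 0
for rec in records:
    offsets.append(pos)      # offset the index will record
    pos += len(rec)          # client assumes exactly one copy was written

# If the server actually applied 'record-1' twice (a retry after a
# lost 'ok'), the real file content is:
on_disk = b'B' + b'record-1' + b'record-1' + b'record-2'

# The recorded offset for 'record-2' now points into the duplicate:
start = offsets[2]
assert on_disk[start:start + len(b'record-2')] != b'record-2'
```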

3) So I'm thinking to just figure out a way to flag specific remote 'method' calls as being illegal to retry.

I'm thinking to try and hook it into bzrlib.smart.request.request_handlers. That is a server-side registry, but it is the only place where we track what methods might be available. And even if a server implements different apis, a client won't ever try a request it doesn't know about. And I don't think a given request will change from being safe to unsafe.
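A sketch of that idea, using a toy registry whose register_lazy mirrors the 'info' parameter mentioned above (the Registry class and the 'safe'/'unsafe' flag values are illustrative, not bzrlib's actual registry code):

```python
class Registry(object):
    """Toy stand-in for bzrlib.smart.request.request_handlers."""

    def __init__(self):
        self._handlers = {}

    def register_lazy(self, verb, module, member, info='safe'):
        # 'info' carries the retry semantics alongside the lazy handler
        self._handlers[verb] = (module, member, info)

    def is_safe_to_retry(self, verb):
        return self._handlers[verb][2] == 'safe'


request_handlers = Registry()
request_handlers.register_lazy(
    'get', 'bzrlib.smart.vfs', 'GetRequest')             # read-only: safe
request_handlers.register_lazy(
    'append', 'bzrlib.smart.vfs', 'AppendRequest',
    info='unsafe')                                       # offsets: never retry
```

The reconnect code would then consult is_safe_to_retry(verb) before resending a request.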

Vincent Ladeuil (vila) wrote :

I like the sound of it.

A useful first step (IMHO) will be to retry the write requests at the highest possible level.

From there we'll know we have a safety net and can try to refine this at a finer-grained level.

John A Meinel (jameinel) wrote :

Analysis of whether any given RPC is safe to retry. Because of potential
corruption, I'm tempted to whitelist them instead of blacklist them. However,
99% of requests are safe, so whitelisting is a bit of a pain.

I'm thinking to use the 'info' parameter of register_lazy to signal the
retry-semantics of a given command.

    'append', 'bzrlib.smart.vfs', 'AppendRequest')
Not safe. The current location in the file is tracked on the client for some
callers of this function. An append which succeeded but then got retried will
put bad data in the middle of the file.

    'Branch.get_config_file', 'bzrlib.smart.branch',
    'SmartServerBranchGetConfigFile')
safe, read-only

    'Branch.get_parent', 'bzrlib.smart.branch', 'SmartServerBranchGetParent')
safe, read-only

    'Branch.get_tags_bytes', 'bzrlib.smart.branch',
    'SmartServerBranchGetTagsBytes')
safe, read-only

    'Branch.set_tags_bytes', 'bzrlib.smart.branch',
    'SmartServerBranchSetTagsBytes')
safe, this is an all-or-nothing set. so setting it two times to the same thing
still leaves you at the same content in the end.

    'Branch.get_stacked_on_url', 'bzrlib.smart.branch', 'SmartServerBranchRequestGetStackedOnURL')
safe, read-only

    'Branch.last_revision_info', 'bzrlib.smart.branch', 'SmartServerBranchRequestLastRevisionInfo')
safe, read-only

    'Branch.lock_write', 'bzrlib.smart.branch', 'SmartServerBranchRequestLockWrite')
safe, will fail if first succeeded
safe in that it won't cause corruption. not safe in that if the first request
succeeds and the connection dies (so we don't get to see the lock token), the
second request will fail because the branch is already locked.
I'm still willing to call this 'safe', since it won't silently corrupt
anything. The worst case is that the action will get interrupted, but we
already had that problem because of ConnectionReset. (Retrying is strictly
better than not retrying.)

    'Branch.revision_history', 'bzrlib.smart.branch', 'SmartServerRequestRevisionHistory')
safe, read-only

    'Branch.set_config_option', 'bzrlib.smart.branch', 'SmartServerBranchRequestSetConfigOption')
I think this is safe. Setting an option to the same value twice should be safe.
The config code already should handle multiple concurrent requests.

    'Branch.set_last_revision', 'bzrlib.smart.branch', 'SmartServerBranchRequestSetLastRevision')
deprecated function, since it uses branch._lefthand_history to determine the
revision number to store. However, still safe, as setting the tip revision to
the same value is safe.

    'Branch.set_last_revision_info', 'bzrlib.smart.branch', 'SmartServerBranchRequestSetLastRevisionInfo')
safe

    'Branch.set_last_revision_ex', 'bzrlib.smart.branch', 'SmartServerBranchRequestSetLastRevisionEx')
safe

    'Branch.set_parent_location', 'bzrlib.smart.branch', 'SmartServerBranchRequestSetParentLocation')
safe

    'Branch.unlock', 'bzrlib.smart.branch', 'SmartServerBranchRequestUnlock')
safe, will fail if first succeeded
same as the lock_write call. If the first one succeeds, the second one will
fail with a 'not locked' error and cause us to spuriously interrupt what we
are doing. However, it is still 'safe' in that we would ha...


John A Meinel (jameinel) wrote :

The alternative to putting it in the bzrlib.smart.request.request_handlers
registry under 'info' is to add it as an attribute of the individual classes.
The problem is that I would be tempted to use inheritance to avoid repeating
myself, and then new requests would inherit a safety marking without ever
having been analyzed as safe or unsafe.
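
A rough sketch of the registry-based option (the names and shape here are hypothetical simplifications, not the actual bzrlib Registry API): keeping the safety tag in a central table means an unreviewed verb defaults to "not retryable" instead of inheriting safety from a base class.

```python
# Hypothetical sketch: tag each verb's retry safety in a central
# registry rather than as a class attribute (which inheritance would
# silently propagate to unreviewed subclasses).
request_handlers = {}

def register(verb, handler_name, info):
    """info is one of 'safe', 'semi', 'unsafe' for retry purposes."""
    request_handlers[verb] = (handler_name, info)

register('Branch.get_parent', 'SmartServerBranchGetParent', 'safe')
register('append', 'AppendRequest', 'unsafe')

def retry_ok(verb):
    # Unknown or unreviewed verbs default to not retryable.
    return request_handlers.get(verb, (None, 'unsafe'))[1] == 'safe'
```

With this shape, forgetting to classify a new verb fails closed: `retry_ok` returns False for anything not explicitly marked 'safe'.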

Andrew Bennetts (spiv) wrote :

On Sat, Oct 08, 2011 at 11:27:13AM -0000, John A Meinel wrote:
[...]
> 'delete', 'bzrlib.smart.vfs', 'DeleteRequest')
> safe, will fail if first succeeded

I worry a little that repeating this (and other destructive VFS ops like move)
might be bad in a case like the “simultaneous identical pull” case that the
Twisted buildbot encountered. e.g.:

 * connection A tries to delete (or move to obsolete) some .pack or .?ix
   file, it succeeds but the network connection was lost so the success
   was not reported to the client
 * connection B, re-adds that file (I think this is possible under
   the right conditions)
 * connection A reconnects and retries the delete, and deletes B's
   file from under it.
 * now the pack-names written by B references a missing file
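
The race above can be replayed with a toy model (plain Python, not bzrlib's pack code), treating the pack directory as a dict:

```python
# Toy replay (not bzrlib code) of the delete-retry race sketched above.
packs = {'old.pack': b'v1'}

# Connection A: the delete succeeds, but the reply is lost.
del packs['old.pack']

# Connection B: re-adds a file under the same name and records it
# in its (notional) pack-names index.
packs['old.pack'] = b'v2'

# Connection A reconnects and blindly retries its delete, removing
# B's file even though B's index still references it.
packs.pop('old.pack', None)
```

After the retry, B's pack-names entry dangles: the data it points at no longer exists, which is the corruption the comment is warning about.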

I expect this case is rare (after all, I'm not sure any common network
filesystems have this particular kind of behaviour), and perhaps it is a
risk we already run on some network filesystems with unusual consistency
guarantees, but I think it's a real risk. I'm not sure if it's worth
avoiding retries for, but I think it is worth thinking carefully about
just in case.

Basically, we assume some kinds of atomicity & consistency from our
filesystem operations, and these retries possibly erode those
properties. So I'd be tempted to blacklist all non-readonly VFS ops out
of paranoia. We shouldn't be issuing many VFS calls anyway ;)
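
The blanket rule suggested here could be sketched as a policy that refuses to retry any mutating VFS verb regardless of its per-verb analysis (verb names below are drawn from the thread; the policy function itself is hypothetical):

```python
# Hypothetical sketch of "blacklist all non-readonly VFS ops":
# never retry a mutating verb, however safe its individual analysis
# claims to be.
MUTATING_VFS_VERBS = frozenset([
    'append', 'delete', 'mkdir', 'move', 'put', 'rename', 'rmdir',
])

def may_retry(verb):
    """Read-only verbs may be retried; anything mutating may not."""
    return verb not in MUTATING_VFS_VERBS
```

This trades a few spurious failures on mutating verbs for not having to reason about filesystem atomicity and consistency under retry at all.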

Andrew Bennetts (spiv) wrote :

On Sat, Oct 08, 2011 at 11:31:10AM -0000, John A Meinel wrote:
> The other option to putting it in bzrlib.smart.request.request_handlers
> registry under 'info' is to just add it as an attribute of the
> individual classes. The problem is I would be tempted to use inheritance
> to avoid repeating myself, and that would mean new requests have not
> been analyzed for safe/unsafe.

I know what you mean. (And another factor here is it's ok for e.g. a
plugin to override the registration of an existing verb, and perhaps
that would be done by subclassing.)

But I think in practice new verbs are added sufficiently rarely that it
will make little difference either way, assuming you don't set retry_ok
= True in the base class.

Martin Pool (mbp) wrote :

On 10 October 2011 15:58, Andrew Bennetts
<email address hidden> wrote:
> So I'd be tempted to blacklist all non-readonly VFS ops out
> of paranoia.  We shouldn't be issuing many VFS calls anyway ;)

+1

Perhaps it would be good to have a debug flag to control what can be
retried so that we can easily experiment on the real system, and I
think it would definitely be worth having a config option to turn it
off altogether in case it does turn out to bite some people.

I do wonder if we will really hit the case of a disconnect in the
middle of a write request in realistic cases. In the typical
Launchpad thing of it wanting to restart the connection, it should
reply to one request and then close the socket, and so then we will
see the connection close at the end of the previous request. Thus my
suggestion about doing a nonblocking select-for-read immediately
before starting to write. (In fact you could select for read and
write before writing, which in the case the write side is full will
let us wait a little longer for a disconnection.)
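
The select-for-read suggestion could look roughly like this, as a sketch against a plain socket rather than the actual bzrlib medium code (`reconnect()` in the usage comment is a hypothetical helper):

```python
import select
import socket

def connection_looks_dead(sock):
    """Nonblocking check before writing a new request.

    On an idle smart-server connection there should be nothing to
    read; if the socket is readable, the peer has most likely closed
    its end, and a zero-byte recv confirms EOF.
    """
    readable, _, _ = select.select([sock], [], [], 0)
    if not readable:
        return False          # nothing pending: probably still alive
    try:
        # MSG_PEEK so we don't consume real data if any exists.
        data = sock.recv(1, socket.MSG_PEEK)
    except OSError:
        return True           # e.g. connection reset
    return data == b""        # EOF: server closed its end

# Usage sketch: reconnect-and-retry instead of erroring out.
# if connection_looks_dead(sock):
#     sock = reconnect()      # hypothetical helper
# sock.sendall(request_bytes)
```

This catches exactly the "server replied to the previous request and then closed the socket" case described above, before any bytes of the next request are committed to the wire.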

Obviously if the connection drops because of a network error we might
be more likely to be in the middle of writing a request or getting a
response.

So, overall, I would not bother about retrying any write operations
that are not certain to be safe, until we see whether basic
reconnection will be ok.

John A Meinel (jameinel) wrote :


On 10/10/2011 6:58 AM, Andrew Bennetts wrote:
> On Sat, Oct 08, 2011 at 11:27:13AM -0000, John A Meinel wrote:
> [...]
>> 'delete', 'bzrlib.smart.vfs', 'DeleteRequest') safe, will fail if
>> first succeeded
>
> I worry a little that repeating this (and other destructive VFS ops
> like move) might be bad in a case like the “simultaneous identical
> pull” case that the Twisted buildbot encountered. e.g.:

...

>
> I expect this case is rare (after all, I'm not sure any common
> network filesystems have this particular kind of behaviour), and
> perhaps it is a risk we already run on some network filesystems
> with unusual consistency guarantees, but I think it's a real risk.
> I'm not sure if it's worth avoiding retries for, but I think it is
> worth thinking carefully about just in case.

I'm pretty sure we already run the risk because renaming over the file
is atomic.

>
> Basically, we assume some kinds of atomicity & consistency from
> our filesystem operations, and these retries possibly erode those
> properties. So I'd be tempted to blacklist all non-readonly VFS
> ops out of paranoia. We shouldn't be issuing many VFS calls anyway
> ;)
>

I like the idea, but I'm worried about the practice. Specifically, we want
to add retry code so that we can get no-downtime deployments on Launchpad
without causing large headaches for existing users.

I know that in 2.1 the first thing we do after 'get_stream' is
'get_stream_for_missing_keys', which in 2.1 is implemented in VFS
operations. Now, in this particular case they should all be read operations.

And in the end we have some sort of slider: at what point is it 'good
enough'?

It seems like we need to handle stuff like race conditions using the
existing primitives (like locking the repository for the final
mutating operations).

If you have the case today where process A is trying to
  rename pack/foo => ../obsolete_packs/foo
and process B is trying to
  rename upload/bar => ../pack/foo

They already race, and I'm not making that race window particularly
wider. Whatever final solution we have for that case will handle this
one as well. (IMO)

For example, *today* the second rename could complete successfully
first, and then the first rename succeeds, causing the newly uploaded
pack file to be removed. (Which is the Twisted case, AFAICT.) That
only triggers when we are inserting the same data in two processes and
obsoleting it at the same time; 'retry' doesn't seem relevant there.

Anyway, I'm still fine adding a bit more flexibility and just saying
'semi' is not considered safe enough. Or we can move it to mutating
vfs, but then put_bytes isn't safe, and that probably affects lots of
stuff (init still uses VFS I think, etc.)

Also if we land something we can tweak it from there :).

John
=:->


John A Meinel (jameinel) on 2011-10-10
Changed in bzr:
milestone: none → 2.5b3
Vincent Ladeuil (vila) on 2011-11-10
Changed in bzr:
milestone: 2.5b3 → 2.5b4
Vincent Ladeuil (vila) on 2011-11-29
Changed in bzr:
status: In Progress → Fix Released