Mir

[regression] Clients hang without explanation when connecting to older servers since r2730.

Bug #1486496 reported by Daniel van Vugt
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Mir
Fix Released
Medium
Alan Griffiths
0.15
Won't Fix
Medium
Unassigned
mir (Ubuntu)
Fix Released
Medium
Unassigned

Bug Description

Clients hang without explanation when connecting to older servers since r2730.

We intend to support side-by-side installations of different libmirserver versions. The client .so will correspond to the more recent server[1] but using a server application that links to the older server library will cause /all/ clients to fail connecting to that server through the new client library as described above.

The impact of this scenario is currently low as we are able to bundle updates to all downstream servers with the release of Mir - but it is indeed a problem we need to address.

~~~

Test case:
  1. Start a demo server installed on your system, from the package mir-demos
  2. Build lp:mir and try to run bin/mir_demo_client_*

Expected: Client runs and appears on screen.
Observed: Client starts but immediately freezes. Never appears on screen.

~~~~

[1] In 0.14 we landed breaking changes to the client ABI. A significant one of these was to hide the data structures that caused many previous ABI breaks - so client ABI breaks are much less likely than they've been in the past.

Tags: regression

Related branches

Revision history for this message
Daniel van Vugt (vanvugt) wrote :

I'm not sure where we broke it. Seems to be somewhere between r2697 and r2764.

Revision history for this message
Daniel van Vugt (vanvugt) wrote :

Can't bisect as easily because revisions prior to 2764 don't build with the new compiler (gcc5). And the old compiler (gcc-4.9) doesn't build Mir at all any more on wily due to package changes like protobuf. We might be able to bisect using a vivid machine...

Revision history for this message
Daniel van Vugt (vanvugt) wrote :

Bisected. The bug appeared in:

------------------------------------------------------------
revno: 2730 [merge]
author: Kevin DuBois <email address hidden>
committer: Tarmac
branch nick: development-branch
timestamp: Wed 2015-07-08 10:56:20 +0000
message:
  Use the submit_buffer() rpc call with asynchronously-arriving buffers by default. BufferQueue serves the requests for now. exchange_buffer and next_buffer still will work.

  Approved by PS Jenkins bot, Alexandros Frantzis, Chris Halse Rogers.
------------------------------------------------------------

Revision history for this message
Daniel van Vugt (vanvugt) wrote :

Using an older server from vivid's mir-demos (0.12.1+15.04.20150324-0ubuntu1), gives similar behaviour. r2729 works but r2730 fails, this time with error messages instead of hanging:

$ sudo bin/mir_demo_client_egltriangle
MIR_CLIENT_PLATFORM_PATH=bin/../lib/client-modules/
MIR_SERVER_PLATFORM_PATH=bin/../lib/server-modules/
MIR_SERVER_PLATFORM_INPUT_LIB=bin/../lib/server-modules/input-stub.so
LD_LIBRARY_PATH=bin/../lib
exec=bin/.mir_demo_client_egltriangle-uninstalled
Current active output is 1366x768 +0+0
Server supports 2 of 6 surface pixel formats. Using format: 4
Surface 0 DPI
Signal 1 received. Good night.
[1440045852.340914] <ERROR> Mesa/NativeSurface: Caught exception at Mir/EGL driver boundary (in advance_buffer): /home/dan/bzr/mir/r2730/src/client/buffer_stream.cpp(316): Throw in function virtual void mir::client::BufferStream::request_and_wait_for_next_buffer()
Dynamic exception type: N5boost16exception_detail10clone_implINS0_19error_info_injectorISt13runtime_errorEEEE
std::exception::what: new buffer unavailable

[1440045852.345271] <ERROR> MirSurfaceAPI: Caught exception at client library boundary (in mir_surface_release): /home/dan/bzr/mir/r2730/src/client/rpc/stream_socket_transport.cpp(168): Throw in function virtual void mir::client::rpc::StreamSocketTransport::send_message(const std::vector<unsigned char>&, const std::vector<mir::Fd>&)
Dynamic exception type: N5boost16exception_detail10clone_implINS0_19error_info_injectorIN3mir25socket_disconnected_errorEEEEE
std::exception::what: Failed to send message to server: Broken pipe
32, "Broken pipe"
[1440045852.345615] <ERROR> MirConnectionAPI: Caught exception at client library boundary (in release): /home/dan/bzr/mir/r2730/src/client/rpc/stream_socket_transport.cpp(168): Throw in function virtual void mir::client::rpc::StreamSocketTransport::send_message(const std::vector<unsigned char>&, const std::vector<mir::Fd>&)
Dynamic exception type: N5boost16exception_detail10clone_implINS0_19error_info_injectorIN3mir25socket_disconnected_errorEEEEE
std::exception::what: Failed to send message to server: Broken pipe
32, "Broken pipe"

Changed in mir:
status: New → Triaged
Revision history for this message
Daniel van Vugt (vanvugt) wrote :
Revision history for this message
Alan Griffiths (alan-griffiths) wrote :

Is this a bug we should take seriously?

We guarantee a binary API so that we *can* change the protocol underneath.

Mixing API implementations from different versions of Mir has never been actively supported (even if it often worked).

Revision history for this message
Daniel van Vugt (vanvugt) wrote :

We have never changed protocol backward compatibility except by accident. And it has mostly always worked.

Mixing API implementations is supported and the technologies we use are explicitly chosen for this reason.

Retaining protocol backward compatibility is the reason we use protobuf and have binary packages with the ABIs in their names (so that you can install multiple different versions at once).

Revision history for this message
Daniel van Vugt (vanvugt) wrote :

Even if we took the decision to knowingly break some protocol semantics, that should be done in a way that checks the protocol support of whatever is on the other end of the socket. And so return nice errors instead of just hanging waiting for messages that never come.

Revision history for this message
Alan Griffiths (alan-griffiths) wrote :

Of course, some recently built clients can't be run against the archive API:

$ objdump -T bin/mir_demo_client_eglplasma.bin | grep MIR_CLIENT_9.1
0000000000000000 DF *UND* 0000000000000000 MIR_CLIENT_9.1 mir_connection_get_egl_pixel_format

Revision history for this message
Alan Griffiths (alan-griffiths) wrote :

But clients that can link against the archive AP work. Vis:

$ sudo --vt 2 mir_demo_server --launch-client bin/mir_demo_client_multiwin.bin

Works (as expected).

$ sudo --vt 2 mir_demo_server --launch-client bin/mir_demo_client_multiwin

Fails as described.

Revision history for this message
Alexandros Frantzis (afrantzis) wrote :

> Is this a bug we should take seriously?
>
> We guarantee a binary API so that we *can* change the protocol underneath.
>
> Mixing API implementations from different versions of Mir has never been actively supported (even if it often worked).

Although we have never provided explicit guarantees, not providing backward compatibility can be frustrating for users. The problem is that client programs may stop working unexpectedly if we release a client library with an new ABI that also has a protocol break. In this case all client programs should be recompiled to work with the new server, but there is no indication or warning about this. The user just finds out the hard way and is confused.

Solutions:
 1. Stabilize client ABI (so clients will automatically used client libraries with updated protocol without relinking)
 2. Ensure protocol is backwards compatible.
 3. Introduce checks in the client/server to check protocol compatibility with server, and fail with an informative message

I think we should at least do (3).

Revision history for this message
Daniel van Vugt (vanvugt) wrote :

This bug is about the client bringing its own newer libmirclient (if required) that does work with it.

Yes, you have demonstrated things that are broken. None of them are an excuse for silently hanging. We should either just work or give an error message. Given kdub's already got a branch in progress to try and make things just work, let's try that first...

Changed in mir:
assignee: nobody → Kevin DuBois (kdub)
status: Triaged → In Progress
Revision history for this message
Alan Griffiths (alan-griffiths) wrote :

Alexandros, this isn't about needing to rebuild clients. I can break the system clients in the same way:

$ sudo mir_demo_server --vt 2&
$ LD_LIBRARY_PATH=lib MIR_CLIENT_PLATFORM_PATH=lib/server-modules/ mir_demo_client_multiwin

Revision history for this message
Alan Griffiths (alan-griffiths) wrote :

$ LD_LIBRARY_PATH=lib MIR_CLIENT_PLATFORM_PATH=lib/client-modules/ mir_demo_client_multiwin

actually.

Revision history for this message
Daniel van Vugt (vanvugt) wrote :

It's not the same way. In the scenario where I found this bug I had:
   A phone with Mir server 0.14.
   And a client with its own newer libmirclient from 0.15.

That might sound like a developer-only problem but it's not. In a Snappy world, a client may well bring its own newer libmirclient for some new function that it needs.

I totally get that you might want to change the underlying transport at some stage. Even not use sockets at all... However if that were true then you would have to start by tightly coupling libmirserver and libmirclient, probably using a single ABI number between them. To imply that libmirserver33 only speaks the same protocol as libmirclient9 is very bad communication. It needs to be more obvious.

Revision history for this message
Alan Griffiths (alan-griffiths) wrote : Re: [Bug 1486496] Re: [regression] Socket level compatibility has been lost - newer clients connect to a Mir 0.14 server but freeze, never rendering any frames

On 20/08/15 10:26, Daniel van Vugt wrote:
> It's not the same way. In the scenario where I found this bug I had:

It *is* the same way.

The description talks about running a "wrapper" executable that sets
LD_LIBRARY_PATH and then execs the ".bin".

Setting LD_LIBRARY_PATH and running the installed client is the same.

> A phone with Mir server 0.14.
> And a client with its own newer libmirclient from 0.15.

But surely in any sane world installing the ABI compatible 0.15
libmirclient replaces the 0.14 one?

> That might sound like a developer-only problem but it's not. In a Snappy
> world, a client may well bring its own newer libmirclient for some new
> function that it needs.

So *Snappy* requires socket-level compatibility? That doesn't work.

--
Alan Griffiths +44 (0)798 9938 758
Octopull Ltd http://www.octopull.co.uk/

Revision history for this message
Alan Griffiths (alan-griffiths) wrote : Re: [regression] Socket level compatibility has been lost - newer clients connect to a Mir 0.14 server but freeze, never rendering any frames

> a client may well bring its own newer libmirclient for some new
> function that it needs.

I don't understand how that is expected to work. In almost all cases a new function in the client will need to be supported by new functionality in the server. So will this client also bring a suitable server?

Revision history for this message
Kevin DuBois (kdub) wrote :

The linked branch wasn't aimed at addressing this problem, removed link

Revision history for this message
Kevin DuBois (kdub) wrote :

So, there were really two transitions:
The first transition was from using "rpc exchange_buffer" to using "rpc submit_buffer"+async-buffer-transfer.
The second is from using one-buffer-at-a-time rules to using multi-buffers. (this is the transition that I'm working on now).

This thread is about the first one. We don't have a handshake of some sort so that the client can figure out whether it should wait for an async buffer, or if it should wait for the return of the IPC call.

I don't think there's a strong driving case (or perhaps even a reasonable requirement) for full backwards compatibility, but I think the weaker case of being nice to people that are trying out weird things (alexandros's case 3) is something that strikes the right balance of effort and value to weird-thing-triers. To be more general, trying an old server with a new client (say, a non-protobuf-protocol), is something that doesn't seem worthwhile.

Maybe this bug is solved by prioritizing this card more highly https://trello.com/c/3YlSllt7 , and implementing that feature.

Revision history for this message
Alan Griffiths (alan-griffiths) wrote :

While not described explicitly the original description is about running clients that forcefully (via LD_LIBRARY_PATH) use a libmirclient that isn't the one supported by the system mirserver. In development we do run both servers and clients that use different libs than the system install like this, but if they work with the system version that is a happy accident, not something we try (or want) to support.

Daniel hypothesizes that a developer might package libmirclient with a client app "for some new function" in a Snappy package. But regardless of protocol or API compatibility it is wrong to expect that a Mir server provides support for client functions that didn't exist when the server was developed. The libmirclient /needs/ to be in step with the libmirserver and we avoid client ABI breaks so that new libmirclients can replace old when the system is updated. So, however Snappy packages are created, if the app bundles libmirclient it must also bundle the server it talks to (which I think is ridiculous).

In either case a *developer* is intentionally (or accidentally) doing something we cannot support.

I don't think it is worth /any/ effort to support this scenario. But if we can close the discussion by delivering an error message if libmirclient and libmirclient are not part of the same Mir release then that may be the easier path.

Revision history for this message
kevin gunn (kgunn72) wrote :

@snappy portion of the discussion

So for Ubuntu Personal, we're going to be image based...just like on the phone. So there really is no issue there for mirclient/server versioning. e.g. we'll be in control and always have the right combo's present. So no bug there.

Now, for what we've been calling "kiosk", so UbuntuCore + a mir snap + an app snap which is a mir client....it's conceivable. But! I'd say that in reality anyone will always dev on top of a specific mir snap version. And then in their finished product install the snaps they want together. Moreover, even in this case the "mir snap" today is a framework....and frameworks are still experiemental. heck we're not even 100% sure mir will be a framework, there's other ideas about collections of headers and libs, one buzz word "dev pack" which may or may not end up being a thing. Point being, it's either not an issue (e.g. finished product ships matched snaps) or it's too early (e.g. fmwk snaps are still unofficial)

Changed in mir:
assignee: Kevin DuBois (kdub) → nobody
importance: High → Wishlist
status: In Progress → Invalid
Revision history for this message
Daniel van Vugt (vanvugt) wrote :

I spent half a day bisecting this issue and found the exact point it stopped working. The bug is very real even if you think we shouldn't fix it.

Changed in mir:
status: Invalid → Triaged
importance: Wishlist → Medium
summary: - [regression] Socket level compatibility has been lost - newer clients
- connect to a Mir 0.14 server but freeze, never rendering any frames
+ [regression] Clients hang without explanation when connecting to older
+ servers since r2730.
Revision history for this message
Daniel van Vugt (vanvugt) wrote :

Adjusted the bug description so that a simple error message is enough to resolve it.

description: updated
Revision history for this message
Alan Griffiths (alan-griffiths) wrote :

I'm not happy that the new description reflects something that has an impact outside a small group of developers who should know better.

We *do* get a meaningful error message if we run a "new" client without messing with LD_LIBRARY_PATH:

    $ bin/mir_demo_client_eglplasma.bin
    bin/mir_demo_client_eglplasma.bin: /usr/lib/x86_64-linux-gnu/libmirclient.so.9: version `MIR_CLIENT_9.1' not found (required by bin/mir_demo_client_eglplasma.bin)

"New" clients that don't use APIs not supported the normal libmirclient run just fine:

    $ bin/mir_demo_client_multiwin.bin
    Signal 15 received. Good night.

The problem described only arises /when using a launcher to override the libraries used/ by the client program. (As you are aware w use this launcher in development to force servers and clients to use the development libraries - but that isn't a "normal" user scenario.)

I believe there is an underlying issue, but the problem scenario is different to both the original an current descriptions in the bug...

We intend to support side-by-side installations of different libmirserver versions.The client .so will correspond to the more recent server[1] but using a server application that links to the older server library will cause /all/ clients to fail connecting to that server through the new client library as described above.

The impact of this scenario is currently low as we are able to bundle updates to all downstream servers with the release of Mir - but it is indeed a problem we need to address. (I'm working on a short term "fix" that provides an error message, but more is required.)

--
[1] In 0.14 we landed breaking changes to the client ABI. A significant one of these was to hide the data structures that caused many previous ABI breaks - so client ABI breaks are much less likely than they've been in the past.

Revision history for this message
Daniel van Vugt (vanvugt) wrote :

There is another use case where this bug is important:

Both libmirclient8 and libmirclient9 are installed and some clients still use the older one. I've just discovered this is very real on a real device (arale). GTK3 is still bound to the older version. And that's not unreasonable either -- we don't expect all clients/toolkits to upgrade to the newest Mir version instantly. And while they are out of sync, those older clients still need to be able to talk to newer servers.

Revision history for this message
Daniel van Vugt (vanvugt) wrote :

Comment #25 shows this bug can affect real users too. If and when we bump the client ABI and the archive is not fully updated atomically, we need those different Mir releases to be socket/protocol compatible with each other. Even if only briefly.

Revision history for this message
Alan Griffiths (alan-griffiths) wrote :

*This* bug is about servers not being /forward/ compatible to newer clients. That is something we can't support fully as the server often won't have the functionality the client needs. A decent error message is, for the general case, the best we can achieve here.

My reading of comment #25 is that it is about the converse situation: servers being /backwards/ compatible to older clients. This is more tractable. However, our intended approach is to provide ABI compatibility via the client library and the libmirclient8 to libmirclient9 break while painful was intended to make this achievable in future.

Revision history for this message
Daniel van Vugt (vanvugt) wrote :

That's true, they're different things. I thought I found the bug occurred both ways the other day but can't reproduce that today.

"the best we can achieve" is semantic compatibility. And we did have it for some time up to r2729 (I only tested back to Mir release 0.12.1 though). An error message is a useful solution, but not the ideal solution.

Given we're capable and have the knowledge of exactly where the break occurred, I don't think it's out of the question to try and fix it completely. No error message required.

Changed in mir:
assignee: nobody → Alan Griffiths (alan-griffiths)
status: Triaged → In Progress
description: updated
Changed in mir:
status: In Progress → Fix Committed
Changed in mir:
status: Fix Committed → Fix Released
Revision history for this message
Daniel van Vugt (vanvugt) wrote :

mir (0.16.0+15.10.20150921.1-0ubuntu1) wily; urgency=medium

Changed in mir (Ubuntu):
importance: Undecided → Medium
status: New → Fix Released
Revision history for this message
Daniel van Vugt (vanvugt) wrote :

Related discussion now continues in bug 1507982.

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.