[regression] OTA7 broke previously working app

Bug #1507982 reported by dinamic
This bug affects 2 people
Affects                  Status     Importance  Assigned to  Milestone
Canonical System Image   Won't Fix  Undecided   kevin gunn
Mir                      Won't Fix  High        Unassigned
  0.16                   Won't Fix  High        Unassigned
  0.17                   Won't Fix  High        Unassigned
mir (Ubuntu)             Won't Fix  High        Unassigned

Bug Description

OTA7 broke previously working app
After the OTA7 update on a Meizu MX4, glmark2 stopped working: https://uappexplorer.com/app/glmark2.sturmflut
System updates should not break apps.

~/.cache/upstart$ cat application-click-glmark2.sturmflut_glmark2_0.4.1.log
libust[24877/24879]: Error: Error opening shm /lttng-ust-wait-5-32011 (in get_wait_shm() at lttng-ust-comm.c:958)
libust[24877/24879]: Error: Error opening shm /lttng-ust-wait-5-32011 (in get_wait_shm() at lttng-ust-comm.c:958)
libust[24877/24878]: Error: Error opening shm /lttng-ust-wait-5 (in get_wait_shm() at lttng-ust-comm.c:958)
libust[24877/24878]: Error: Error opening shm /lttng-ust-wait-5 (in get_wait_shm() at lttng-ust-comm.c:958)
Error: Couldn't connect to the Mir display server
Error: main: Could not initialize canvas
[1445336019.182277] <ERROR> MirConnectionAPI: Caught exception at client library boundary (in release): /build/buildd/mir-0.13.3+15.04.20150617/src/client/rpc/stream_socket_transport.cpp(168): Throw in function virtual void mir::client::rpc::StreamSocketTransport::send_message(const std::vector<unsigned char>&, const std::vector<mir::Fd>&)
Dynamic exception type: N5boost16exception_detail10clone_implINS0_19error_info_injectorIN3mir25socket_disconnected_errorEEEEE
std::exception::what: Failed to send message to server: Broken pipe
32, "Broken pipe"
libust[24922/24923]: Error: Error opening shm /lttng-ust-wait-5 (in get_wait_shm() at lttng-ust-comm.c:958)
libust[24922/24923]: Error: Error opening shm /lttng-ust-wait-5 (in get_wait_shm() at lttng-ust-comm.c:958)
libust[24922/24924]: Error: Error opening shm /lttng-ust-wait-5-32011 (in get_wait_shm() at lttng-ust-comm.c:958)
libust[24922/24924]: Error: Error opening shm /lttng-ust-wait-5-32011 (in get_wait_shm() at lttng-ust-comm.c:958)
Error: Couldn't connect to the Mir display server
Error: main: Could not initialize canvas
[1445336029.784398] <ERROR> MirConnectionAPI: Caught exception at client library boundary (in release): /build/buildd/mir-0.13.3+15.04.20150617/src/client/rpc/stream_socket_transport.cpp(168): Throw in function virtual void mir::client::rpc::StreamSocketTransport::send_message(const std::vector<unsigned char>&, const std::vector<mir::Fd>&)
Dynamic exception type: N5boost16exception_detail10clone_implINS0_19error_info_injectorIN3mir25socket_disconnected_errorEEEEE
std::exception::what: Failed to send message to server: Broken pipe
32, "Broken pipe"


Changed in canonical-devices-system-image:
assignee: nobody → kevin gunn (kgunn72)
status: New → Confirmed
Daniel van Vugt (vanvugt) wrote :

Hi dinamic, good to hear from you again.

We do know of a protocol compatibility break that occurred in Mir 0.15.0, which was first reported as bug 1486496 (which due to disagreements got reworded and watered down such that mir-team was not required to fix the regression). I also mentioned back in August that it could become a problem for Snappy: https://bugs.launchpad.net/mir/+bug/1486496/comments/15

Although I was trying for protocol backward-compatibility at the time, that's not something we can realistically maintain forever. And it was rightly pointed out that in a traditional packaging system the problem mostly would not exist as even old clients would start using the new (shared!) libmirclient/libmirprotobuf automatically.

So this really is mostly just an issue we'll see in Snappy where apps carry their own libmir* client libraries. Here are some possible ways forward for us:

  (a) App developers: Maybe ensure apps don't package libmir* and instead rely on the copy already installed by the system/Mir framework; or
  (b) ~mir-team: make a conscious decision to maintain socket protocol level compatibility for the lifetime of the given Ubuntu Touch series (which unfortunately also means supporting old buffer semantics too). Kind of analogous to what people usually do with ABIs; or
  (c) ~mir-team: As we know the precise point of regression (r2730), make an effort to reinstate protocol backward compatibility (although this suggestion was met with strong opposition previously, and admittedly would be difficult to maintain forever).

Options (a) and (b) are the most likely, and unfortunately both require your app packages to be rebuilt/re-released. Maybe there are more options?
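Option (a) is easy to audit mechanically. The Python sketch below walks a click package that has already been unpacked into a directory and lists any bundled libmir* shared objects. The function name and the "already extracted" assumption are illustrative; this is not part of any Ubuntu or Mir tooling.

```python
import fnmatch
import os
from typing import List


def find_bundled_mir_libs(package_root: str) -> List[str]:
    """List bundled libmir* shared objects inside an unpacked click package.

    Assumes package_root is a directory the package was already extracted to.
    Returned paths are relative to package_root and sorted for stable output.
    """
    hits = []
    for dirpath, _dirnames, filenames in os.walk(package_root):
        for name in filenames:
            # Matches e.g. libmirclient.so.8 or libmirclient.so.9
            if fnmatch.fnmatch(name, "libmir*.so*"):
                hits.append(os.path.relpath(os.path.join(dirpath, name), package_root))
    return sorted(hits)
```

An empty result suggests the app relies on the system libmirclient; any hit marks a package that carries its own client library and is therefore exposed to protocol changes in the system Mir server.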

Daniel van Vugt (vanvugt) wrote :

To clarify; I think this bug is effectively a duplicate of the problem I originally reported in:
   https://bugs.launchpad.net/mir/+bug/1486496/comments/0

However bug 1486496 was later reworded into a form that was less contentious and did not require any resolution of the protocol compatibility issue. So now this bug 1507982 actually represents that original problem. It's not a duplicate and still not really resolved.

Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in mir (Ubuntu):
status: New → Confirmed
Daniel van Vugt (vanvugt) wrote :

Regarding option (c): It's possible that r2730 is no longer the only offending commit. We may have landed other similar breaks since then... kdub is the expert on all the related protocol semantic changes of late.

Daniel van Vugt (vanvugt) wrote :

OK, I have now hacked together a test environment on wily desktop and tried:

   0.13.3 client --> 0.17.0 server (wily) FAILS
   0.13.3 client --> 0.16.2 server (lp:mir/0.16) FAILS
   0.13.3 client --> 0.15.2 server (lp:mir/0.15) WORKS

I can confirm the primary issue is the first error:

   Error: Couldn't connect to the Mir display server

which is probably the result of: lp:~alan-griffiths/mir/fix-1486496

Interestingly, it works with the server built from 0.15.2. I presume that's just because the "fix" for bug 1486496 is not present there. So perhaps the only hurdle here is the fix for bug 1486496: without it, a 0.13.3 client can connect and works fine.

Changed in mir:
status: New → Triaged
Changed in mir (Ubuntu):
status: Confirmed → Triaged
Changed in mir:
importance: Undecided → High
Changed in mir (Ubuntu):
importance: Undecided → High
summary: - OTA7 broke previously working app
+ [regression] OTA7 broke previously working app
Changed in mir:
milestone: none → 0.18.0
tags: added: regression
Daniel van Vugt (vanvugt) wrote :

Digging further explains why this bug appears in OTA-7. It's because OTA-7 upgraded the Mir server from 0.15.1 to 0.16.0:
   https://wiki.ubuntu.com/Touch/ReleaseNotes/OTA-7

Changed in mir:
assignee: nobody → Daniel van Vugt (vanvugt)
status: Triaged → In Progress
Daniel van Vugt (vanvugt) wrote :

Verified that removing the fix for bug 1486496 allows the client to start up and render. So that's the first hurdle.

Unfortunately there's a second hurdle we may not be able to overcome -- that is with Mir 0.16.0 we broke input event/protocol compatibility. So 0.13.3 clients never receive input when connected to a Mir 0.16/17/18 server. Not sure if we can fix that... anpok?

Changed in mir:
status: In Progress → Triaged
assignee: Daniel van Vugt (vanvugt) → nobody
Daniel van Vugt (vanvugt) wrote :

Perhaps if we want apps to keep working at least for the lifetime of UbuntuTouch-vivid, then we will need to wind back in OTA8 and keep Mir on the 0.15 series.

Daniel van Vugt (vanvugt) wrote :

Alternatively, we will need to ask all app developers to rebuild their packages with Mir 0.16.0 or later :(

Daniel van Vugt (vanvugt) wrote :

Actually comment #9 might be feasible. If we're only asking developers of native Mir apps (not Qt) to rebuild...

Daniel van Vugt (vanvugt) wrote :

*not QML, I mean.

dinamic (dinamic6661) wrote :

Hi Daniel o/

"Alternatively, we will need to ask all app developers to rebuild their packages with Mir 0.16.0 or later :("

I don't think there are many to ask, probably just sturmflut :>>

kevin gunn (kgunn72) wrote :

The good news is, the promise of ABI compatibility will be maintained with libmirclient9, so a one-time rebuild should fix it going forward.

Daniel van Vugt (vanvugt) wrote :

Bisected and found the source of the input regression in 0.16.0. It came from:

------------------------------------------------------------
revno: 2867 [merge]
author: Brandon Schaefer <email address hidden>
committer: Tarmac
branch nick: development-branch
timestamp: Thu 2015-08-20 23:33:42 +0000
message:
  Add a mac field for key/touch/pointer events.

  Approved by PS Jenkins bot, Alexandros Frantzis.
------------------------------------------------------------

Daniel van Vugt (vanvugt) wrote :

Sorry. It appears that in r2867 (which went into Mir 0.16.0) we broke the binary input event protocol (which is separate from protobuf). It's sufficiently broken that any Mir-0.15 or earlier client will reject and ignore all input messages it gets from v0.16 or newer servers.

http://bazaar.launchpad.net/~mir-team/mir/development-branch/revision/2867#3rd_party/android-input/android/frameworks/base/include/androidfw/InputTransport.h

It would be difficult and messy to repair the problem sufficiently such that the old and new binary formats could be supported simultaneously. Not impossible, but probably not something we're going to invest in right now.
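To illustrate how one added field can make old clients drop every input message, here is a hypothetical Python model of a fixed-layout binary event that gains an appended "mac" field. It assumes the old reader rejects any message whose size differs from the layout it was compiled with; that is one plausible mechanism, and the field list is invented rather than Mir's actual InputMessage layout.

```python
import struct
from typing import Optional, Tuple

# Hypothetical layouts: field names and order are illustrative only and are
# NOT Mir's real InputMessage. "<" means little-endian with no padding.
OLD_KEY_EVENT = struct.Struct("<iiiqq")   # type, device_id, key_code, event_time, down_time
NEW_KEY_EVENT = struct.Struct("<iiiqqQ")  # same fields plus an appended 64-bit "mac"


def new_server_pack(type_: int, device_id: int, key_code: int,
                    event_time: int, down_time: int, mac: int) -> bytes:
    """Serialize an event the way a post-change server would."""
    return NEW_KEY_EVENT.pack(type_, device_id, key_code, event_time, down_time, mac)


def old_client_receive(payload: bytes) -> Optional[Tuple[int, ...]]:
    """Decode as a pre-change client would, assuming a strict size check.

    Returns None when the payload does not match the expected layout, i.e.
    the event is rejected and never reaches the application.
    """
    if len(payload) != OLD_KEY_EVENT.size:
        return None
    return OLD_KEY_EVENT.unpack(payload)
```

Supporting both layouts at once would mean the server tracking which layout each connected client expects and packing accordingly, which is the kind of bookkeeping described above as difficult and messy.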

That plus r2893 made the problem worse by rejecting older clients outright, possibly without knowing the intricacies of the r2867 regression. Again, sorry this happened. The relevant people have been informed so hopefully it won't happen again.

Changed in mir:
status: Triaged → Won't Fix
Changed in mir (Ubuntu):
status: Triaged → Won't Fix
Changed in mir:
milestone: 0.18.0 → none
Daniel van Vugt (vanvugt) wrote :

Annoyingly comment #9 may not work. If app developers rebuild with the latest Mir (0.17) then the slightly older Mir 0.16.0 in OTA7 will also reject their connections because they're now too new! This is again due to the overly strict protocol range checking introduced in r2893.
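A minimal model of the range check as described here, using the Mir release numbers from this thread as stand-ins for protocol versions. Treating the OTA-7 server as accepting only its own release is a deliberate simplification, and r2893's actual code and version scheme are not reproduced.

```python
from typing import NamedTuple


class Version(NamedTuple):
    """Stand-in protocol version; real Mir protocol numbers differ."""
    major: int
    minor: int


def strict_range_check(client: Version, server_min: Version, server_max: Version) -> bool:
    """Accept only clients whose version lies inside [server_min, server_max].

    NamedTuple compares field by field, so (0, 17) > (0, 16) as expected.
    A client one release newer than server_max is refused even if it never
    calls any function the server lacks.
    """
    return server_min <= client <= server_max


OTA7_SERVER = Version(0, 16)  # stand-in for the Mir 0.16.0 server shipped in OTA-7

for candidate in (Version(0, 13), Version(0, 16), Version(0, 17)):
    print(candidate, strict_range_check(candidate, OTA7_SERVER, OTA7_SERVER))
```

Under this model both the 0.13 client (this bug) and a freshly rebuilt 0.17 client are refused by the 0.16 server; only an exact match gets through.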

Alan Griffiths (alan-griffiths) wrote :

Comment #16 is misleading:

There is only a "problem" if the app dynamically links against Mir-0.16 and attempts to run against a Mir-0.17 server. This should not happen as either:

    /1/ the matching server and client libraries are installed as part of the system (desktop, phone); or,
    /2/ the matching server and client libraries are installed as part of a snap (kiosk).

In both cases the client library matches the server and everything works.

The protocol range checking in r2893 correctly detects that the version of libmirclient.so.9 that is trying to connect supports features that are not supported by the server, and therefore that the system is misconfigured. This is not something that App developers need to be concerned with.

~~~~

For the avoidance of doubt, Apps built with the libmirclient.so.9 from Mir-0.17 and installed on a phone configured with Mir-0.16 will dynamically link and run correctly (case 1 above) unless they refer to the new functions introduced in 0.17. Viz.:

MIR_CLIENT_9v17 {
    mir_blob_from_display_configuration;
    mir_blob_size;
    mir_blob_data;
    mir_blob_release;
    mir_blob_onto_buffer;
    mir_blob_to_display_configuration;
    mir_blob_release;
    mir_buffer_stream_set_scale;
    mir_buffer_stream_set_scale_sync;
    mir_event_get_surface_output_event;
    mir_surface_output_event_get_dpi;
    mir_surface_output_event_get_form_factor;
    mir_surface_output_event_get_scale;
} MIR_CLIENT_9.2;

If the app requires these functions then it obviously can't use a libmirclient.so.9 from Mir-0.16 and will fail to run even before connecting to the server.

Similar arguments apply to running on a phone configured with Mir-0.15 or Mir-0.14.
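Whether a given libmirclient.so.9 exports the 0.17 additions can be probed at runtime. The Python sketch below uses ctypes purely as a convenient assumption; an app that links directly against a missing versioned symbol is normally stopped by the dynamic linker at load time, before any such check could run. The function name is hypothetical.

```python
import ctypes
from typing import List, Optional

# A subset of the names listed under MIR_CLIENT_9v17 in the version map above.
MIR_0_17_SYMBOLS = [
    "mir_blob_from_display_configuration",
    "mir_buffer_stream_set_scale",
    "mir_surface_output_event_get_dpi",
]


def missing_symbols(library: Optional[str], names: List[str]) -> List[str]:
    """Return the subset of names that the loaded library does not export.

    library=None searches the symbols already loaded into this process
    (POSIX dlopen(NULL) semantics). Raises OSError if the library cannot be
    loaded at all, e.g. when no libmirclient.so.9 is installed.
    """
    lib = ctypes.CDLL(library)
    # ctypes raises AttributeError for unknown symbols, which hasattr turns into False.
    return [name for name in names if not hasattr(lib, name)]
```

On a phone configured with Mir-0.16, `missing_symbols("libmirclient.so.9", MIR_0_17_SYMBOLS)` would be expected to list all three names, while against the 0.17 library it should return an empty list.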

Daniel van Vugt (vanvugt) wrote :

The "matching" you mention may not happen in the snappy case. So there is a problem.

In the pure snappy case, an app brings its own libmirclient and potentially its own libmirprotobuf and libmir*. So it is not a binary "match" for the server in question. And the app may be using Mir binaries that are newer (most likely) or older (e.g. this bug) than that of the server. The only commonality is the protocol, so negotiation needs to happen there. It's not related to libraries and ABIs at all. Unless.....

We could mitigate the problem by:
  (a) Never bumping the client ABI again; and
  (b) Ensuring app snaps don't bring their own libmir*, so do use the system copy.
However that only works for as long as we can ensure both (a) and (b) are true.
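Protocol-level negotiation of the kind suggested here could look roughly like the following hypothetical Python sketch: each side advertises the protocol versions it can speak and the connection uses the highest one they share. This is not Mir's handshake, and it only helps if the server still implements the older semantics, which is option (b) from earlier in the thread.

```python
from typing import Iterable, Optional, Tuple

Version = Tuple[int, int]  # stand-in (major, minor); not Mir's real numbering


def negotiate(client_supported: Iterable[Version],
              server_supported: Iterable[Version]) -> Optional[Version]:
    """Return the highest protocol version both peers support, or None.

    None means there is no common protocol, so the connection should be
    refused with a clear error instead of failing later (e.g. silently
    dropping input).
    """
    common = set(client_supported) & set(server_supported)
    return max(common) if common else None
```

With this scheme a newer app snap can still talk to an older server, provided its client library keeps the older protocol paths.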

Alan Griffiths (alan-griffiths) wrote :

In the pure snappy case there is no "system" Mir server. The Mir server needs to be included in the snap.

@a - see comment #13
@b - we can't police every wrong way to do things[*].

[*] A (hypothetical) idiot could create a .deb that includes libmirclient.so.9 instead of depending on libmirclient9 - it would break but we are not responsible for preventing it. I don't see that snaps are any different.

tags: added: regression-release
removed: regression
Daniel van Vugt (vanvugt) wrote :

Please keep that tag. I use it for periodic quality analysis.

tags: added: regression
Daniel van Vugt (vanvugt) wrote :

Alan,
I think it's likely in the future (if not already) that the Mir server/shell will be a separate snap from any app snaps. And certainly it's difficult to make anything completely foolproof, because fools are so ingenious. But we can build things that are more robust in the face of potential future mistakes. So if those future mistakes include bringing your own libmir* then it would be nice if that worked more often. Even if we don't like the idea, it's better than dealing with high severity bug reports.

Daniel van Vugt (vanvugt) wrote :

Also bringing your own libmir* is not a mistake and not "foolish". It may be different to traditional Linux packaging, but it's the way snappy is designed for good reason.

Alan Griffiths (alan-griffiths) wrote :

We don't know how things will be done in the future, but I do know what Mir supports in the present. That is maintaining backward compatibility for the libmirclient9 client ABI. We have tests to ensure this.

Mir does not support "bringing your own libmir*" on just the client side - you also have to "bring" the server side. That is the way it is.

If you disagree, please point me at any tests for this functionality. If you want to change it please convince the team that it is needed and provide the tests for a supported "feature".

kevin gunn (kgunn72)
Changed in canonical-devices-system-image:
status: Confirmed → Won't Fix
kevin gunn (kgunn72) wrote :

1) talked to sturmflut, he's agreed to rebuild against libmirclient9
2) did a scan of the store; so far only a couple of others use libmirclient, and they are already on libmirclient9

Daniel van Vugt (vanvugt) wrote :

Hmm, actually bug 1498281 might be the cause here (even if libmirclient8 isn't actually in the click package).

PS Jenkins bot (ps-jenkins) wrote :

Fix committed into lp:mir at revision None, scheduled for release in mir, milestone 0.20.0

Changed in mir:
status: Won't Fix → Fix Committed
Daniel van Vugt (vanvugt) wrote :

"Related branch", not a fix.

Changed in mir:
status: Fix Committed → Won't Fix