[MIR] thumbnailer

Bug #1613561 reported by Michi Henning
Affects: thumbnailer (Ubuntu)
Status: Fix Released
Importance: Undecided
Assigned to: Unassigned
Milestone: (none)

Bug Description

[Availability]
 * Available in universe

[Rationale]
 * This package is required by many Ubuntu Touch applications and scopes, such as gallery app, music app, today scope, music scope, etc.

[Security]
 * No known security issues at this time. It has been reviewed by security in the past for use on the phone.

[Quality assurance]
 * This package has unit tests.

[Dependencies]
 Most dependencies are already in main with the exception of the following:
 * libboost-filesystem-dev
 * libunity-api-dev (Bug #1613563)
 * persistent-cache-cpp-dev (Bug #1613560)

 Note that the package has other dependencies that are not in main, but these are used only for the tests and are not runtime dependencies.

[Standards compliance]
 * This package uses cmake.

[Maintenance]
 * This package is maintained by Canonical and actively in use on the phone images.

description: updated
Revision history for this message
Michael Terry (mterry) wrote :

- Why do we install an apport blacklist for vs-thumb?
- Are the ftbfs in -proposed something we have a fix for? (Is that the silo that is held up by a failing u8 test?)

Otherwise seems fine.

Changed in thumbnailer (Ubuntu):
status: New → Incomplete
Michi Henning (michihenning) wrote :

We are running gstreamer out of process (inside vs-thumb) because several codecs have problems and can cause infinite loops, hangs, or segfaults. If vs-thumb crashes, we don't want to generate a bug report because, sadly, the crashes happen quite often.
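To illustrate the out-of-process approach described above, here is a minimal sketch in Python. It is not the thumbnailer's actual C++ code; the name `run_isolated` and the status strings are invented for this example. The point is only that a crash or hang in the helper turns into an ordinary error result in the parent instead of taking the service down with it.

```python
import subprocess
import sys


def run_isolated(argv, timeout_s):
    """Run a helper in its own process and classify how it ended."""
    try:
        proc = subprocess.run(argv, capture_output=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # subprocess.run() kills the child before raising, so a hung helper
        # does not linger.
        return ("timeout", None)
    if proc.returncode < 0:
        # On POSIX a negative return code means the child died from a signal.
        return ("crashed", -proc.returncode)
    if proc.returncode != 0:
        return ("failed", proc.returncode)
    return ("ok", proc.stdout)


if __name__ == "__main__":
    # Stand-in for a misbehaving codec: the child aborts itself.
    print(run_isolated([sys.executable, "-c", "import os; os.abort()"], timeout_s=10))
```

With this shape, the caller can log a "crashed" or "timeout" result and move on, which is presumably also why a crash report for every such failure would be mostly noise.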

The ftbfs is the unity8 problem, yes.

Michi Henning (michihenning) wrote :

Unfortunately, silo 54 failed to build again; this time, anything Qt-related fell over on yakkety. Yesterday, the same silo built fine (it just didn't pass the autopkg tests). I'm getting the impression that not all of the Arm builders have the same Qt version installed. It's the only reason I can think of why the silo sometimes builds and sometimes fails on Arm.

Michi Henning (michihenning) wrote :

Silo 54 is approved for QA. I was told not to publish this until after OTA-13. If this needs to go in now, someone let me know please, and I'll publish it.

Matthias Klose (doko) wrote :

will stay as incomplete until the ftbfs is fixed.

Michi Henning (michihenning) wrote :

Well, it's not ftbfs anymore. More pertinently, should I publish now or wait until after OTA-13?

Matthias Klose (doko) wrote :
Michi Henning (michihenning) wrote :

I didn't even know that there was a failed build. Looking at the build log, the test failures are almost certainly caused by a flaky builder. We have seen this sort of thing recently on another project too.

Matthias Klose (doko) wrote :

still ftbfs on arm64 and ppc64el, where it built before:
https://launchpad.net/ubuntu/+source/thumbnailer/2.4+16.10.20160825-0ubuntu1

Michi Henning (michihenning) wrote :

The tests that are failing are all Qt related. I suspect a problem in Qt rather than our code. (The thumbnailer code hasn't changed in ages, and the fact that it fails only on yakkety suggests a problem with one of our dependencies.)

8: thumbnailer-service: [23:19:59.671] failure cache: 0 entries, 0 bytes, hit rate 0.00 (0/0), avg hit run 0.00, avg miss run 0.00
8: Segmentation fault (core dumped)
8:
8: ---- Xvfb log file ----
8: _XSERVTransmkdir: ERROR: euid != 0,directory /tmp/.X11-unix will not be created.
 8/26 Test #8: image-provider ..............................***Timeout 1500.00 sec

This might indicate some X11-related issue? We don't normally see this.

Similar trace here:

9: ********* Start testing of Thumbnailer *********
9: Config: Using QtTest library 5.6.1, Qt 5.6.1 (arm64-little_endian-lp64 shared (dynamic) release build; by GCC 6.2.0 20160830)
9: PASS : Thumbnailer::OnlineArt::initTestCase()
9: Segmentation fault (core dumped)
9:
9: ---- Xvfb log file ----
9: _XSERVTransmkdir: Owner of /tmp/.X11-unix should be set to root
 9/26 Test #9: qml .........................................***Timeout 1500.00 sec

The following looks like it simply got stuck and ended up being killed eventually. No trace at all:

10: thumbnailer-service: [00:09:59.863] failure cache: 0 entries, 0 bytes, hit rate 0.00 (0/0), avg hit run 0.00, avg miss run 0.00
10/26 Test #10: libthumbnailer-qt ...........................***Timeout 1500.00 sec
[==========] Running 20 tests from 1 test case.

I very much doubt that the problem is in our code. I have no idea how to track this down. I don't have an arm64 or ppc machine. Last time I tried to use one of the porter boxes, it wasn't possible to install the overlay on them, so that looks like a no-go as well :-(
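For anyone going through several of these build logs, a small script can pull out the tests that did not pass. This is only a sketch: it assumes the CTest summary lines look like the ones quoted above (index/total, test name, a run of dots, then `***` and a status), and the function name `failed_tests` is made up for the example.

```python
import re

# Matches CTest summary lines for tests that did not pass, for example:
#   " 8/26 Test #8: image-provider ......***Timeout 1500.00 sec"
# Passing tests have no "***" marker and are therefore ignored.
FAILED_LINE = re.compile(
    r"^\s*(?P<index>\d+)/(?P<total>\d+)\s+Test\s+#\d+:\s+(?P<name>\S+)\s+\.+"
    r"\s*\*\*\*(?P<status>.+?)\s+(?P<seconds>[\d.]+)\s+sec\s*$"
)


def failed_tests(log_text):
    """Return (name, status, seconds) for every non-passing test in a log."""
    results = []
    for line in log_text.splitlines():
        match = FAILED_LINE.match(line)
        if match:
            results.append(
                (match["name"], match["status"], float(match["seconds"]))
            )
    return results
```

Running it over the excerpts above would report image-provider, qml, and libthumbnailer-qt, each with status Timeout, which makes it easier to see whether the same tests fail from one build to the next.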

Michi Henning (michihenning) wrote :

So, as best as we can tell, this is caused by an upgrade to Qt. Specifically, anything that uses QDBus is falling over. We are seeing similar problems with unity-scope-click here:

https://bugs.launchpad.net/ubuntu/+source/qtbase-opensource-src/+bug/1618590

Basically, the code works with Qt 5.5, and doesn't with Qt 5.6. The ftbfs for thumbnailer here

https://launchpad.net/ubuntu/+source/thumbnailer/2.4+16.10.20160825-0ubuntu1

shows that the tests failed only on yakkety arm64 and ppc64el. That's because that silo was built before Qt 5.6 was added to the Xenial overlay. If I were to build in a silo now, I expect I'd see the same failures on both xenial and yakkety.

Not sure what we can do to help here. The problem isn't in the thumbnailer code, but somewhere in QDBus, by the looks of things.

Timo Jyrinki (timo-jyrinki) wrote :

As with unity-scope-click, you could try out https://requests.ci-train.ubuntu.com/#/ticket/1960 which has two pending QDBus patches that have, however, not yet been merged or approved upstream.

We also had problems with telepathy-qt and signon with QDBus, but two patches landed that made QDBus OK with those. However, do note that the QDBus rewrite in Qt 5.6, from event-loop based to threaded, might easily require changes in our code, as it did for telepathy-qt:

http://launchpadlibrarian.net/280893997/telepathy-qt_0.9.6.1-6ubuntu1_0.9.6.1-7ubuntu2.diff.gz
http://launchpadlibrarian.net/284273846/telepathy-qt_0.9.6.1-7ubuntu2_0.9.6.1-9ubuntu6.diff.gz

Maybe those could give some ideas on what to change in thumbnailer?

As Qt 5.6 is a long-term supported release, we can also write test cases and file bugs upstream at https://bugreports.qt-project.org/ to draw attention to the problems we have.

Timo Jyrinki (timo-jyrinki) wrote :

For the record, the post-5.6.1 patches included in our packages so far:

https://codereview.qt-project.org/#/c/167480/
https://codereview.qt-project.org/#/c/170356/

Thiago is the main contact in upstream.

Timo Jyrinki (timo-jyrinki) wrote :

And I'm sorry, I did do rebuilds of 38 other Qt-using packages in our phone stack in July/August in order to file bugs (like the signon and telepathy-qt ones), but it seems thumbnailer was indeed not in my package list before this.

Timo Jyrinki (timo-jyrinki) wrote :

OK, I've now finished a rebuild and realized it would not have raised any flags for me earlier, since my test silo was built only for amd64, armhf and i386 back then. I've now added arm64 to the mix.

This QDBus problem seems more architecture specific than the others were.

tags: added: qt5.6
no longer affects: qtbase-opensource-src (Ubuntu)
tags: removed: qt5.6
Timo Jyrinki (timo-jyrinki) wrote :

Since this is a MIR bug I've filed bug #1625930 separately.

Michi Henning (michihenning) wrote :

Silo 1968 contains a no-change rebuild of the thumbnailer. As expected, things are again blowing up for arm64 and ppc64el on yakkety.

Interestingly, things worked with xenial, even though the same QDBus is installed there, as far as I can see.

I don't think the telepathy code changes you linked to are related to what we are seeing. I'll double-check tomorrow.

I'm asking the trainguards to copy the packages from 1960 into 1968, so we can check if the upstream patches affect what we are seeing.

Timo Jyrinki (timo-jyrinki) wrote :

Building with those two additional QDBus patches (qtbase 5.6.1+dfsg-3ubuntu6~1) does not seem to help:

https://launchpad.net/~ci-train-ppa-service/+archive/ubuntu/1968/+build/10936648/+files/buildlog_ubuntu-yakkety-ppc64el.thumbnailer_2.4+16.10.20160921-0ubuntu1_BUILDING.txt.gz

Another full rebuild confirms that the problem is not there on xenial+overlay: https://launchpad.net/~ci-train-ppa-service/+archive/ubuntu/1969/+packages

Xenial+overlay and Yakkety have identical Qt packages, so this points in the direction of GCC6 or glibc 2.24, either via how they affect Qt or some other component.

Michi Henning (michihenning) wrote :

I installed a yakkety arm64 chroot on my M10 and built the thumbnailer. The tests run fine, over and over. I do not see any failures.

dpkg reports:

libc6:arm64 2.24-0ubuntu1 arm64
gcc-6 6.2.0-3ubuntu15 arm64
libqt5dbus5:arm64 5.6.1+dfsg-3ubuntu5~1 arm64

Is it possible that there is a problem with the builder we are running this on?

I'm fresh out of ideas what else I could check locally.
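One thing that can be checked mechanically is whether the local chroot and the builder really have the same package versions. The sketch below assumes each environment's package list is available as text in the three-column form shown above (package, version, architecture); the function names are made up for this example.

```python
def parse_versions(listing):
    """Parse 'package[:arch] version arch' lines into {package: version}."""
    versions = {}
    for line in listing.splitlines():
        fields = line.split()
        if len(fields) >= 2:
            name = fields[0].split(":")[0]  # drop a ':arm64'-style qualifier
            versions[name] = fields[1]
    return versions


def differing_packages(local, builder):
    """Return {package: (local_version, builder_version)} where they differ.

    A package missing on one side shows up with None for that side.
    """
    names = sorted(set(local) | set(builder))
    return {
        name: (local.get(name), builder.get(name))
        for name in names
        if local.get(name) != builder.get(name)
    }
```

For example, the chroot above reports libqt5dbus5 5.6.1+dfsg-3ubuntu5~1, while the silo 1968 build mentioned earlier used qtbase 5.6.1+dfsg-3ubuntu6~1, so a comparison like this would at least flag that the two runs were not against the same Qt build.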

Michi Henning (michihenning) wrote :

Another option: it could simply be a race that shows up in the train only because the builders are VMs (I believe) and the underlying hardware might be loaded more heavily. I'd expect CPU and I/O timings to be different in the train than on my M10. So, it is still possible that this is simply a race in QDBus that gets tickled only when we run the tests in the train.
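If it is a timing-dependent race, a single passing run says little; repeating the same test many times (ideally on a loaded machine) gives a rough failure rate. A minimal sketch, assuming the test can be started as a plain command; `repeat_test` and the outcome labels are invented for this example.

```python
import subprocess
import sys
from collections import Counter


def repeat_test(argv, runs, timeout_s):
    """Run the same test command `runs` times and tally how each run ended."""
    outcomes = Counter()
    for _ in range(runs):
        try:
            code = subprocess.run(
                argv, capture_output=True, timeout=timeout_s
            ).returncode
        except subprocess.TimeoutExpired:
            outcomes["timeout"] += 1
            continue
        if code == 0:
            outcomes["pass"] += 1
        elif code < 0:
            # On POSIX a negative code is the signal that killed the process.
            outcomes[f"signal {-code}"] += 1
        else:
            outcomes[f"exit {code}"] += 1
    return outcomes


if __name__ == "__main__":
    # Placeholder command; substitute the real test binary and arguments.
    print(repeat_test([sys.executable, "-c", "pass"], runs=5, timeout_s=30))
```

A tally that is all "pass" locally but shows "signal 11" or "timeout" entries on the builders would be consistent with the race theory, though it would not prove it.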

Michi Henning (michihenning) wrote :

I decided to kick the build in silo 1968 to see whether the fault would move. This time, it failed on yakkety on arm64 and ppc64el as before, but also failed on xenial powerpc. The failure on xenial is in a test that uses QDBus to talk between client and server. The client fails because it doesn't get any responses from DBus, suggesting that the server crashed.

Kicking off another build now. It sure looks more and more like a race in QDBus.

Michael Terry (mterry) wrote :

If we do believe that the test failures are specific to Qt and thus not something this package can do something about, I'm fine with a targeted disabling of tests until we can root-cause Qt. No reason to block on something we can't yet fix.

*Ideally* we'd disable only the flaky tests, only on yakkety, and only on arm64 and ppc64el. With a comment pointing to the qt bug so we know when we can re-enable them.

I'll take as much of that as I can get (i.e. if you have to disable all tests on those distro/arches instead, OK...)

ppc64el wouldn't normally bother me TOO much. It's not a targeted platform for this code at the moment. But arm64 is (or shortly will be), and the failure there is worrisome.
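The targeted disabling described above boils down to a small lookup: skip a test only if it is on the flaky list and the build is for an affected series and architecture. The sketch below is purely illustrative and is not how the package's build actually does it. The three test names are the ones that timed out in the logs quoted earlier (the full list of affected tests is not given here), and `should_run` and the constant names are made up.

```python
# Tests seen timing out in the build logs quoted earlier; treat this as an
# example list, not the complete set of affected tests.
FLAKY_TESTS = {"image-provider", "qml", "libthumbnailer-qt"}

# (series, architecture) combinations where the tests should be skipped.
AFFECTED_BUILDS = {("yakkety", "arm64"), ("yakkety", "ppc64el")}

# Keep the reason next to the table so it is obvious when to re-enable.
SKIP_REASON = "QDBus-related crashes; re-enable once the Qt bug is root-caused"


def should_run(test, series, arch):
    """Return True unless this test is flaky on this series/architecture."""
    return not (test in FLAKY_TESTS and (series, arch) in AFFECTED_BUILDS)
```

Keeping the three conditions (test, series, architecture) separate makes it easy to fall back to a coarser rule, for example skipping every test on the affected builds, by widening only one of the sets.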

I don't see evidence that this got a packaging look through. I'll do that now.

Michael Terry (mterry) wrote :

Oh, no I did earlier in comment #1. :P

OK, if you can whip up a targeted test-disabling with comment why, this should be fine.

Michi Henning (michihenning) wrote :

The failure happens in both yakkety and xenial (because both use Qt 5.6). The failure moves around. In the latest build, we got a failure on xenial ppc64el (which is a first).

If we disable tests, we have to disable *all* integration tests that use DBus for *both* xenial and yakkety. We haven't seen a failure on vivid so far, and I don't expect that we will, seeing that vivid uses an earlier version of Qt.

I'm uncomfortable about turning off tests wholesale like this though. In particular, we have a lurking time bomb here. Absolutely anything that uses QDBus with Qt 5.6 can blow up without warning. I'm fairly sure that this is a race condition that shows up mainly because the timings when we run on the build machines are different. So, we may end up with tons of seemingly random failures in the field for anything that uses QDBus, just because a machine is more heavily loaded than usual. Random segfaults are very hard to debug if they are not reproducible :-(

Is there any chance of backing out of the Qt 5.6 upgrade? It doesn't look like that version is ready for prime time yet.

Timo Jyrinki (timo-jyrinki) wrote :

I think disabling the tests on xenial is a bad idea since they seem to pass most of the time (every time I tried), while they never pass on yakkety. Since xenial + Qt 5.6 is the next stable platform and the tests pass there (possibly with some flakiness still), there's value in keeping them enabled.

The single xenial failure - https://launchpadlibrarian.net/285955337/buildlog_ubuntu-xenial-ppc64el.thumbnailer_2.4+16.04.20160922.1-0ubuntu1_BUILDING.txt.gz - was in one out of ten subtests and failed differently, while the yakkety failures are always similar and fail all subtests in 5 different tests.

It's currently unclear if the issue is due to Qt, GCC6, glibc 2.24 or the interaction between these, so saying it's because of Qt 5.6 is imprecise for now, although the refactored QDBus code may clearly still need further bug fixes. This thumbnailer problem seems just a bit different since it's architecture specific, which could point towards other toolchain problems. It seems there are no similar problems with the identical Qt version but with GCC5 and an older glibc (xenial).

One idea would be to try building with GCC5 on yakkety and see what the PPA builders would say about that. Older glibc cannot however be easily used.

It's not possible to back out Qt 5.6, as Qt 5.5 is already end-of-life while 5.6 is the upstream LTS release, supported until 2019. The best plan is to get bugs filed upstream for their supported release, with a test case that can produce the crash (without Ubuntu dependencies). And if the problem can be pinpointed to the compiler or C library, likewise file bug reports over there. Qt 5.6 is now also already the minimum version required by other flavors, moving to 5.7 by 17.04.

Please also discuss with boiko what he learned about the QDBus behavior changes when looking at telepathy-qt; maybe there's something to improve in thumbnailer regardless.

Michi Henning (michihenning) wrote :

Hmmm... The xenial failure is highly suspicious because that particular test has never failed before. I'm quite confident that the failure exists on xenial too, just doesn't manifest as often.

I did look at the telepathy code changes, and they address a completely different issue. What we are seeing is different, I believe.

If you consider what's proposed now unacceptable, what should we do? We can disable for yakkety only, or for yakkety arm64 and ppc64el only, or whatever. Please let me know, so we can get this unblocked.

Timo Jyrinki (timo-jyrinki) wrote :

I'd go for disabling yakkety arm64 and ppc64el, to disable the minimum required. This is usually the case that's wanted. That way we'd catch any further regressions while Ubuntu keeps on changing.

Michi Henning (michihenning) wrote :

OK, will do that on Monday.

Charles Kerr (charlesk) wrote :

FWIW, I'm seeing very similar issues on indicator-network. It too uses qdbus, and I'm seeing a high volume of test regressions on yakkety arm64 and ppc64el: https://bugs.launchpad.net/ubuntu/+source/indicator-network/+bug/1626767

Michi Henning (michihenning) wrote :

Tests are disabled on arm64, ppc64, and ppc64le for yakkety and xenial. That's where we've seen failures, and we don't particularly care about these arches at the moment.

Michi Henning (michihenning) wrote :

Changes are in silo 1991.

Steve Langasek (vorlon) wrote :

The package in the archive still depends on boost1.60, so this can't be promoted until ticket #1991 lands in the archive.

Michi Henning (michihenning) wrote :

Looks like 1991 has landed now.

Changed in thumbnailer (Ubuntu):
status: Incomplete → Fix Released