unity8 sometimes hangs on boot

Bug #1421009 reported by Francis Ginther on 2015-02-11
84
This bug affects 9 people
Affects Status Importance Assigned to Milestone
Canonical System Image
Critical
Unassigned
autopilot (Ubuntu)
Undecided
Unassigned
libusermetrics (Ubuntu)
Critical
Pete Woods
lxc-android-config (Ubuntu)
Undecided
Unassigned
qtbase-opensource-src (Ubuntu)
Medium
Unassigned
ubuntu-system-settings-online-accounts (Ubuntu)
Undecided
Unassigned
unity8 (Ubuntu)
Undecided
Unassigned

Bug Description

The following gdbus call is failing with a "Error: Timeout was reached" message:

gdbus call --session --dest com.canonical.UnityGreeter --object-path / --method org.freedesktop.DBus.Properties.Get com.canonical.UnityGreeter IsActive

This is being seen on krillin devices starting with image 106 from ubuntu-touch/devel-proposed. It doesn't happen every time, so far today, I've seen it 3 times from about 12 tests. On the most recent failure, I grabbed a console and tried repeatedly to run the command from the shell, even after 2 hours the timeout was still being returned (after about 28 seconds).

A copy of ~/.cache/upstart/unity8.log is here: http://paste.ubuntu.com/10179482/

I have 3 test cases where the problem was observed:
http://d-jenkins.ubuntu-ci:8080/job/vivid-boottest-qtchooser/1/console
http://d-jenkins.ubuntu-ci:8080/job/vivid-boottest-gsettings-ubuntu-touch-schemas/1/console
http://d-jenkins.ubuntu-ci:8080/job/fjg-boottest/3/console

In all cases, the test is using adt-run (from autopkgtest) to drive a test on the phone device. adt-run uses the above gdbus call to determine if the desktop is active. In all the examples, the device was freshly flashed.

== Test Case ==

# Prepare debugging
adb shell
sudo apt-get clean # so that you wouldn't run out of disk space
sudo apt install qtbase5-dbg libc6-dbg libdbus-glib-1-2-dbg dbus-1-dbg libglib2.0-0-dbg

# Add also libusermetrics debug symbols, unless you're testing a PPA version
echo "deb http://ddebs.ubuntu.com $(lsb_release -cs) main restricted universe multiverse" | sudo tee -a /etc/apt/sources.list.d/ddebs.list
sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys 428D7C01
sudo apt-get update
sudo apt install libusermetricsoutput1-dbgsym=1.1.1+15.04.20150219-0ubuntu1

# Start the reboot loop
# This reboots the device in a loop, and if this bug is not fixed by whatever proposed solution, it will hang eventually with Unity 8 having a black background. Other kind of hangs (like just Google logo showing, no adb) are not related to this bug. Current highest amount of reboots without errors is 54, so it's probable a 100 reboots is needed for testing.

bzr branch lp:unity8
cd unity8
while true; do adb shell rm -R "~phablet/.cache/QML"; ./tools/unlock-device || break; done

# When it fails
adb shell
sudo gdb -p $(pidof unity8)
bt full

--
At this point, the backtrace should show:
#0 syscall () at ../sysdeps/unix/sysv/linux/arm/syscall.S:37
#1 0xb6301e12 in _q_futex (op=0, val=3, timeout=0x0, addr=<optimized out>)
    at thread/qmutex_linux.cpp:146
#2 lockInternal_helper<false> (timeout=-1, elapsedTimer=0x0, d_ptr=...)
    at thread/qmutex_linux.cpp:187
#3 QBasicMutex::lockInternal (this=this@entry=0x1523b44)
    at thread/qmutex_linux.cpp:203
#4 0xb6301eb6 in lock (this=0x1523b44) at thread/qmutex.h:59
#5 lock (timeout=-1, this=0x1523b38) at thread/qmutex.cpp:620
#6 QMutex::lock (this=this@entry=0x1523d6c) at thread/qmutex.cpp:215
#7 0xb5f39586 in QDBusMutexLocker (m=0x1523d6c, s=0x1523d48,
    a=ToggleWatchAction, this=<synthetic pointer>) at qdbusthreaddebug_p.h:183
#8 QDBusDispatchLocker (s=0x1523d48, a=ToggleWatchAction,
    this=<synthetic pointer>) at qdbusthreaddebug_p.h:198
#9 qDBusRealToggleWatch (d=0x1523d48, watch=0x1524dd0, fd=46)
    at qdbusintegrator.cpp:346
#10 0xb5ae18f6 in ?? () from /lib/arm-linux-gnueabihf/libdbus-1.so.3

With this, it's know that it was a QDBus locking related problem.
--

---

Timeline/Updates:
2015-02-20: libusermetrics lands, causing (apparently) this boot problem to start happening rarely. http://people.canonical.com/~ogra/touch-image-stats/106.changes / http://launchpadlibrarian.net/198152771/libusermetrics_1.1.1%2B14.10.20141020-0ubuntu1_1.1.1%2B15.04.20150219-0ubuntu1.diff.gz ”I got a symbolic trace out of all the threads. It seems to be a dbus lock between usermetrics and networkmanager bits. We suspect a relation to QTBUG https://bugreports.qt.io/browse/QTBUG-44836.”
2015-03-25: qtbase dbus update to support threads (instead of one main thread) in PPA 018 fixes the boot issue, but autopilot test suites start failing randomly.
2015-03-27: an autopilot fix fixes a simple test case, and seems to fix UITK suite as a whole, but on krillin only
2015-04-10: Further patches from upstream fix all AP tests.
2015-04-23: Upstream continues to work on the patches but they have not yet been merged. AP:s pass, but U1 account gets removed usually after a reboot, even though apps can be installed after adding U1 account flawlessly for the duration of that boot.

Related branches

Michael Terry (mterry) wrote :
Download full text (15.5 KiB)

I just got this after 7 tries. On the screen is the Ubuntu spinner. Here's a stacktrace:

(gdb) bt full
#0 0xffffffff in syscall () at /lib/arm-linux-gnueabihf/libc.so.6
#1 0xffffffff in QBasicMutex::lockInternal() (op=0, val=3, timeout=0x0, addr=<optimized out>) at thread/qmutex_linux.cpp:154
        addr2 = 0x0
        val2 = 0
#2 0xffffffff in QBasicMutex::lockInternal() (timeout=-1, elapsedTimer=0x0, d_ptr=...) at thread/qmutex_linux.cpp:195
#3 0xffffffff in QBasicMutex::lockInternal() (this=this@entry=0x132b2e4)
    at thread/qmutex_linux.cpp:211
#4 0xffffffff in QMutex::lock() (this=0x132b2e4) at thread/qmutex.h:67
        self = 0xb6efd410
        success = true
        current = 0x132b2d8
#5 0xffffffff in QMutex::lock() (timeout=-1, this=0x132b2d8)
    at thread/qmutex.cpp:628
        self = 0xb6efd410
        success = true
        current = 0x132b2d8
#6 0xffffffff in QMutex::lock() (this=this@entry=0x132b414)
    at thread/qmutex.cpp:223
        current = 0x132b2d8
#7 0xffffffff in qDBusRealToggleWatch(QDBusConnectionPrivate*, DBusWatch*, int) (m=0x132b414, s=0x132b3f0, a=ToggleWatchAction, this=<synthetic pointer>)
    at qdbusthreaddebug_p.h:191
        locker =
            {<QDBusMutexLocker> = {<QDBusLockerBase> = {<No data fields>}, self = 0x132b3f0, mutex = 0x132b414, action = ToggleWatchAction}, <No data fields>}
        i = <optimized out>
#8 0xffffffff in qDBusRealToggleWatch(QDBusConnectionPrivate*, DBusWatch*, int) (s=0x132b3f0, a=ToggleWatchAction, this=<synthetic pointer>)
    at qdbusthreaddebug_p.h:206
        locker =
            {<QDBusMutexLocker> = {<QDBusLockerBase> = {<No data fields>}, self = 0x132b3f0, mutex = 0x132b414, action = ToggleWatchAction}, <No data fields>}
        i = <optimized out>
#9 0xffffffff in qDBusRealToggleWatch(QDBusConnectionPrivate*, DBusWatch*, int) (d=0x132b3f0, watch=0x132c480, fd=41) at qdbusintegrator.cpp:344
        locker =
            {<QDBusMutexLocker> = {<QDBusLockerBase> = {<No data fields>}, self = 0x132b3f0, mutex = 0x132b414, action = ToggleWatchAction}, <No data fields>}
        i = <optimized out>
#10 0xffffffff in check_write_watch (transport=0x132c3e8)
    at ../../dbus/dbus-transport-socket.c:167
        needed = <optimized out>
        transport = 0x132c3e8
        socket_transport = 0x132c3e8
#11 0xffffffff in socket_do_iteration (transport=0x132c3e8, flags=1, timeout_milliseconds=<optimized out>) at ../../dbus/dbus-transport-socket.c:1210
        socket_transport = 0x132c3e8
        poll_fd = {fd = 41, events = 0, revents = -16705}
        poll_res = <optimized out>
        poll_timeout = <optimized out>
#12 0xffffffff in _dbus_transport_do_iteration (transport=0x132c3e8, flags=1, timeout_milliseconds=-1) at ../../dbus/dbus-transport.c:1001
#13 0xffffffff in _dbus_connection_do_iteration_unlocked (connection=0x132c808, pending=<optimized out>, flags=1, timeout_milliseconds=-1)
    at ../../dbus/dbus-connection.c:1227
#14 0xffffffff in _dbus_connection_send_preallocated_unlocked_no_update (connection=connection@entry=0x132c808, preallocated=0x0, message=message@entry=0x14ee998, client_serial=client_serial@entry=0x0) at ../../dbus/dbus-connection.c:205...

Michael Terry (mterry) wrote :

Looks like a bad interaction between libusermetrics and dbus that results in a mutex lock...

Michael Terry (mterry) wrote :

Hmm, neither of those packages changed recently. But nor did unity8... I doubt this is unity8's direct fault.

summary: - unity8 sometimes fails to respond to com.canonical.UnityGreeter IsActive
+ unity8 sometimes hangs on boot
Michał Sawicz (saviq) wrote :

Bugs #1417773 and #1418707 show similar dbus-related lockups :/

Michael Terry (mterry) wrote :

Interesting. They may be the same bug. But thankfully, this version of it is easily reproducable. Just flash and boot until you hit it.

Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in unity8 (Ubuntu):
status: New → Confirmed
Michał Sawicz (saviq) wrote :

I got a symbolic trace out of all the threads. It seems to be a dbus lock between usermetrics and networkmanager bits.

We suspect a relation to QTBUG https://bugreports.qt.io/browse/QTBUG-44836.

Changed in unity8 (Ubuntu):
status: Confirmed → Triaged
importance: Undecided → High
assignee: nobody → Michał Sawicz (saviq)
Albert Astals Cid (aacid) wrote :

Yeah my current guess is that QDBusConnectionPrivate::connectSignal doesn't play well with dbus being used from other threads, but need more investigation

Michał Sawicz (saviq) on 2015-03-06
Changed in unity8 (Ubuntu):
assignee: Michał Sawicz (saviq) → Albert Astals Cid (aacid)
status: Triaged → In Progress
Albert Astals Cid (aacid) wrote :

Have been using https://codereview.qt-project.org/#/c/103738/ (and all the pre-requisite patches) for a while and they seem to fix the issues i've been having with dbus deadlocks

Michał Sawicz (saviq) wrote :

https://codereview.qt-project.org/#/c/107969/ is a simpler proposal that got merged into 5.4 already, we should backport that.

Changed in unity8 (Ubuntu):
status: In Progress → Invalid
assignee: Albert Astals Cid (aacid) → nobody
Changed in qtbase-opensource-src (Ubuntu):
assignee: nobody → Timo Jyrinki (timo-jyrinki)
Timo Jyrinki (timo-jyrinki) wrote :

So, 107969 broke things and couldn't be landed. We landed Qt 5.4.1 now, open to new suggestions that are bullet proof.

Changed in qtbase-opensource-src (Ubuntu):
status: New → Incomplete
Changed in qtbase-opensource-src (Ubuntu):
status: Incomplete → Confirmed
Francis Ginther (fginther) wrote :

Another boottest failure, this one seen with krillin image 152:

https://jenkins.qa.ubuntu.com/job/vivid-boottest-webbrowser-app/3/console

Timo Jyrinki (timo-jyrinki) wrote :

Albert's list of cherry-picked patches testable in vivid-021 landing silo together with bug #1431798 fix.

Michał Sawicz (saviq) on 2015-03-20
Changed in qtbase-opensource-src (Ubuntu):
status: Confirmed → In Progress
Michał Sawicz (saviq) wrote :

Without silo 21, I was able to get the deadlock with 6-8 reboots. With the silo, no deadlock in 30 reboots and a wipe.

Timo Jyrinki (timo-jyrinki) wrote :

The current set of qtbase DBus patches reliably regress when running UITK autopilot tests: http://people.canonical.com/~tjyrinki/qt5/fail/

The qtbase update - current version - has now moved to silo 018 to be able to land the qtdeclarative update from silo 021 still.

Changed in qtbase-opensource-src (Ubuntu):
status: In Progress → Incomplete
Albert Astals Cid (aacid) wrote :

Two more patches that are part of the patchset and would make sense having
https://codereview.qt-project.org/#/c/103739/
https://codereview.qt-project.org/#/c/103740/

Changed in qtbase-opensource-src (Ubuntu):
status: Incomplete → In Progress
Changed in qtbase-opensource-src (Ubuntu):
importance: Undecided → Critical
Timo Jyrinki (timo-jyrinki) wrote :

Adding libusermetrics as upgrade of it in http://people.canonical.com/~ogra/touch-image-stats/106.changes apparently caused the problem. However, it's probably just because libusermetrics started using DBus as explained in comment #7. The Qt bug itself is probably an old one.

If fixing Qt seems a time consuming process, an option would be to revert the part of the libusermetrics upgrade that triggers the bug.

Also adding autopilot as at least the current set of patches regressed on running Autopilot tests which may mean Autopilot is relying on old behavior and the patches themselves might be correct. We'll know more when the new 2 patches have built and tests run.

Changed in libusermetrics (Ubuntu):
status: New → Incomplete
Changed in autopilot (Ubuntu):
status: New → Incomplete
Timo Jyrinki (timo-jyrinki) wrote :

I ran some Autopilot testing last night and continuing a bit now . The two new patches do not fix the AP problems. As examples:

- Archive image (#145):
* UITK: 1-2 AP failures
* webbrowser-app: 1-2 AP failures
* calculator: 0 AP failures
* ubuntu-system-settings: 0 AP failures

- Image (#147) added with silo 018 (qtbase DBus patches)
* UITK: 7-8 failures (similar to earlier #145 + previous qtbase DBus fix build)
* webbrowser-app: 5-7 failures
* calculator: 2 failures
* ubuntu-system-settings: 16 AP failures (only 1 run done)

Details at http://people.canonical.com/~tjyrinki/qt5/fail/qtbase_ubuntu5/

Even if the qtbase patches themselves would be correct and functioning well and they just happen to make Autopilot break, we can't let Autopilot regress so badly. Either there should be something more done to Qt or an Autopilot landing fixing the issues should be put to the same silo.

As for libusermetrics, the diff for its landing in #106 is at http://launchpadlibrarian.net/198152771/libusermetrics_1.1.1%2B14.10.20141020-0ubuntu1_1.1.1%2B15.04.20150219-0ubuntu1.diff.gz

Changed in autopilot (Ubuntu):
status: Incomplete → New
Changed in qtbase-opensource-src (Ubuntu):
status: In Progress → Incomplete
Timo Jyrinki (timo-jyrinki) wrote :

All the failure traces seem to be stating that autopilot wouldn't be able to launch the application. In case of eg address-book-app once a test fails, all subsequent tests fail too. In case of eg UITK, it's just that random tests fail.

Here's a simplified test case:

while [ 1 ] ; do phablet-test-run ubuntuuitoolkit.tests.custom_proxy_objects.test_header.SectionsTestCase.test_select_sections_with_sections_disabled ; done

When the failure happens, the application seems to start fine, but autopilot thinks it did not. After waiting for some while (~38 seconds as opposed to ~6 seconds on successful run) it errors out.

Without silo 018 the above test case never fails.

description: updated
Albert Astals Cid (aacid) wrote :

Timo, can you try running the tests again with https://code.launchpad.net/~aacid/autopilot/dbus_search_no_seen_connections/+merge/254109 ? I've been running

while [ 1 ] ; do phablet-test-run ubuntuuitoolkit.tests.custom_proxy_objects.test_header.SectionsTestCase.test_select_sections_with_sections_disabled ; done

with it for over an hour and had no failure.

Richard Huddie (rhuddie) wrote :

I tried reproducing this issue on silo 18 using address_book_app.tests.test_add_contact.TestAddContact.test_go_to_add_contact.

The test failed on the 10th attempt in the same way as in the logs: http://people.canonical.com/~tjyrinki/qt5/fail/qtbase_ubuntu5/ap-2015_03_24-21_48_11-address_book_app-1-018.tests

Watching the test, the address-book-app froze as it was launching, displaying a frozen spinner on the loading screen. Autopilot then timed out waiting for the app to launch and reported the introspection failure.

Subsequent tests then failed with the maliit-server error as reported in the same logs.

The only crash file reported was: _usr_lib_telepathy_mission-control-5.32011.crash

Albert Astals Cid (aacid) wrote :

Richard, which hardware are you using?

Can you try reproducing the issue again and then gdb'ing in to the frozen app and posting a backtrace? I've been running it in a loop for minutes and had nothing.

Richard Huddie (rhuddie) wrote :

I was using mako devel-proposed image #148 with the silo installed.

I'll try and reproduce again and get some more info.

Timo Jyrinki (timo-jyrinki) wrote :

I've reflashed the device so this is with vivid #150 + 018. The MP in #23 (just adjusted for AP 1.5 branch) is now in the 018 PPA.

The simplified test case seems fixed for me too, but unfortunately it'd look like running a whole suite like phablet-test-run ubuntuuitoolkit simply hangs at some point - in the sense that test app loads but nothing further happens. Killing the process gets the tests to continue. This ended up with: http://people.canonical.com/~tjyrinki/qt5/fail/qtbase_ubuntu5/ap-2015_03_26-09_54_27-ubuntuuitoolkit-1-018.tests - potentially less failures (6) than before AP update, but more than before qtbase update.

The second UITK run with qtbase + AP seemingly went uninterrupted further, but still with more (7) AP failures compared to before updating qtbase. However, at the end of the run it seemed like the test app was hanging - I tried gdb attaching to it, realized I didn't have debug symbols and detached. After this it seemed the it continued and the testing completed immediately: http://people.canonical.com/~tjyrinki/qt5/fail/qtbase_ubuntu5/ap-2015_03_26-09_54_27-ubuntuuitoolkit-2-018.tests

So it'd seem the AP patch does change behavior and fixes the simple testcase, but is not the real fix. Or it's possible it fixes the AP end but there is some sort hanging issue with the updated Qt.

Richard Huddie (rhuddie) wrote :

I tried again to reproduce the issue, using #150 + silo 18 (including autopilot update). I've not been able to reproduce it again after many retries.

Albert Astals Cid (aacid) wrote :

running phablet-test-run ubuntuuitoolkit hanged for me with https://bugs.launchpad.net/ubuntu/+source/mir/+bug/1436860 not sure it's caused by these patches or just happens now more becuase things are faster, trying silo 13 as per mir guys suggetsion

Timo Jyrinki (timo-jyrinki) wrote :

webbrowser-app with the new qtbase + AP http://people.canonical.com/~tjyrinki/qt5/fail/qtbase_ubuntu5/ap-2015_03_26-15_57_41-webbrowser_app-1-018.tests (3 failures, between 1-2 without 018 and 5-7 with 018 before AP was updated)

Albert Astals Cid (aacid) wrote :

So i tried with silo 13 two times in a row and all the SDK autopilot tests passed fine.

Timo Jyrinki (timo-jyrinki) wrote :

Unfortunately I couldn't reproduce success with my mako last night with 018 (qtbase + autopilot) + 013 (Mir 0.12.1) + rsalveti (glibc fix). Here are the results until aborted: http://people.canonical.com/~tjyrinki/qt5/fail/vivid-ap-mir013/ - for example 6 + 8 failures in UITK.

I'll try again with the new #151 image including Mir, glibc and other things + 018, but I doubt it'll change this picture.

@Albert did you test on krillin or mako?

Albert Astals Cid (aacid) wrote :

krillin.

Timo Jyrinki (timo-jyrinki) wrote :

Looking in more detail, it seems the new Mir may have helped mako in the address-book-app "hang" case, even though for example UITK and webbrowser-app still have more failures. The address-book-app that failed reliably several times earlier, now in the comment #33 linked results failed only 1 test on both runs, similar to what happens on archive.

Rerunning now on image #151 + silo 018, UITK still seems problematic which is not a surprise since it should be similar to yesterday's combination of #150 + silo 018 + silo 013 + rsalveti PPA. This time executing tests seemed to have hanged at one point, and touching the screen interestingly resulted in Unity 8 crashing: http://people.canonical.com/~tjyrinki/qt5/fail/mako151_plus_018/_usr_bin_unity8.32011.crash

Albert Astals Cid (aacid) wrote :

The crash in http://people.canonical.com/~tjyrinki/qt5/fail/mako151_plus_018/_usr_bin_unity8.32011.crash is mir/unity8 crashing because they disagree on touch event delivery, nothing to do with dbus patches.

tags: added: lt-blocker lt-category-visible
Timo Jyrinki (timo-jyrinki) wrote :

The earlier simplified test case does not fail anymore or at least not easily. Whether on stock image or using the PPA, the problem with running the same test over and over is that Unity8 crashes or restarts sooner or later when running the command, for example that crasher bug noticed.

I ran a new baseline set of results on Saturday with image #153 to see how the situation looks like on mako after the Mir, UITK, glibc landings. It's still the same - UITK, webbrowser, calculator, ubuntu-system-settings, ubuntu-weather-app are good examples of suites that 100% pass on stock image. They gain respectively 4/9, 3/6, 1/1, 4/7, 3/4 tests failing based on two autopilot runs with the PPA 018 in use. The failures are random, ie the test failing changes depending on the run.

In other words, to reproduce the autopilot problem when using the PPA you can run any of the following on mako:
phablet-test-run ubuntuuitoolkit
phablet-test-run webbrowser_app
phablet-test-run ubuntu_calculator_app
phablet-test-run ubuntu_system_settings
phablet-test-run ubuntu_weather_app

Prepare the device by first running on device sudo apt install ubuntu-ui-toolkit-autopilot webbrowser-app-autopilot ubuntu-system-settings-autopilot python3-lxml

description: updated
description: updated
description: updated
description: updated
description: updated
description: updated
Albert Astals Cid (aacid) wrote :

My opinion is that all those failures where probably there already since the ones i've been able to find/replicate don't seem qdbus related

Now why is this happening with these patches? These patches make qdbus serialize much less on the main thread, i.e. things are executed more in different threads, this decreases the points of sincronization qdbus caused on the main thread without these patches and so threads run more "freely"

This can make this that were previously race hit the races and thus cause the autopilot tests to randomly fail or not depending if the race is fixed or not

Now i can't say that none of these failures are qdbus related since i've not been able to reproduce them all, would need help from the people with expertise on those particular apps to have a look at them.

Francis Ginther (fginther) wrote :
Timo Jyrinki (timo-jyrinki) wrote :

Addition to my previous comment, for weather + calculator apps you need to run phablet-click-test-setup from the host or one at a time:
phablet-click-test-setup --click com.ubuntu.calculator
phablet-click-test-setup --click com.ubuntu.weather

The other tests come from the mentioned deb packages.

Timo Jyrinki (timo-jyrinki) wrote :

And one more addition to anyone not constantly running autopilot and aware of the pre-requirements:

phablet-config autopilot --dbus-probe enable

is needed before running tests. This is documented at https://wiki.ubuntu.com/Touch/Testing#Running_Click_tests and also done automatically for me as I use UI Toolkit's test script to run my AP Testing: http://bazaar.launchpad.net/~ubuntu-sdk-team/ubuntu-ui-toolkit/trunk/view/head:/tests/uitk_test_plan.sh

Albert Astals Cid (aacid) wrote :

Ok, got one thread deadlock in the calculator app that seems dbus related, investigating a bit more

Timo Jyrinki (timo-jyrinki) wrote :

As per Albert's instructions I took three updated patches and five new patches from upstream. Four of them needed rebasing for Qt 5.4, but a local build now finished and a PPA build with version number ubuntu6~vivid1~test1 is now building in 018 PPA.

Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in autopilot (Ubuntu):
status: New → Confirmed
Timo Jyrinki (timo-jyrinki) wrote :

The bug is now finally confirmed as fixed without regressions in autopilot tests, in the silo 018 (qtbase version ending ubuntu6). The silo is now handed off to QA.

Changed in qtbase-opensource-src (Ubuntu):
status: Incomplete → In Progress
Changed in libusermetrics (Ubuntu):
status: Incomplete → Invalid
Albert Astals Cid (aacid) wrote :

Bad news, tried to start Plasma 5 with the patches to be greated by a zillion of crashing processes, here a backtrace of one of them

Thread 1 (Thread 0x7f658362a780 (LWP 10972)):
[KCrash Handler]
#6 0x00007f6583662168 in QDBusConnectionPrivate::shouldWatchService(QString const&) (this=this@entry=0x0, service=...) at qdbusintegrator.cpp:2310
#7 0x00007f6583662ce2 in QDBusConnectionPrivate::watchService(QString const&, QFlags<QDBusServiceWatcher::WatchModeFlag>, QObject*, char const*) (this=0x0, service=..., mode=..., obj=0xb226b8, member=member@entry=0x7f65836ac300 "1_q_serviceOwnerChanged(QString,QString,QString)") at qdbusintegrator.cpp:2315
#8 0x00007f65836a8695 in QDBusServiceWatcher::addWatchedService(QString const&) (service=..., this=<optimized out>) at qdbusservicewatcher.cpp:103
#9 0x00007f65836a8695 in QDBusServiceWatcher::addWatchedService(QString const&) (this=<optimized out>, newService=...) at qdbusservicewatcher.cpp:283
#10 0x00007f653fde3254 in () at /usr/lib/x86_64-linux-gnu/qt5/plugins/kded_touchpad.so
#11 0x00007f653fdcbd38 in () at /usr/lib/x86_64-linux-gnu/qt5/plugins/kded_touchpad.so
#12 0x00007f65823bd608 in KPluginFactory::create(char const*, QWidget*, QObject*, QList<QVariant> const&, QString const&) () at /usr/lib/x86_64-linux-gnu/libKF5CoreAddons.so.5
#13 0x00007f65832ece7d in () at /usr/lib/x86_64-linux-gnu/libkdeinit5_kded5.so
#14 0x00007f65832ed45f in () at /usr/lib/x86_64-linux-gnu/libkdeinit5_kded5.so
#15 0x00007f65832edc88 in () at /usr/lib/x86_64-linux-gnu/libkdeinit5_kded5.so
#16 0x00007f65832eec53 in () at /usr/lib/x86_64-linux-gnu/libkdeinit5_kded5.so
#17 0x00007f65832ef03a in kdemain () at /usr/lib/x86_64-linux-gnu/libkdeinit5_kded5.so
#18 0x00007f6582f38a40 in __libc_start_main (main=0x400720 <main>, argc=1, argv=0x7fff13734958, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7fff13734948) at libc-start.c:289
#19 0x000000000040074e in _start ()

Also konsole is now failing to start (this may be a misuse of api on KUniqueApplication) but still not good to regress like that

Timo Jyrinki (timo-jyrinki) wrote :

Indeed the testing was finished only on the phone + Unity 7 desktop (Qt 5 apps only). QA signoff is now put on hold once again because of the Plasma 5 regression.

The phone Autopilot results are at http://people.canonical.com/~tjyrinki/qt5/ap-2015-04-02-mako-156-silo-018/

Timo Jyrinki (timo-jyrinki) wrote :

Yesterday ubuntu7~test1 was done with http://paste.ubuntu.com/10759851/ - unfortunately no changes in Plasma 5 despite upstream thinking those would address the problem.

Now building ubuntu7~test2 with https://codereview.qt-project.org/#/c/109632/ updated to the latest patch set which was missed earlier.

Albert Astals Cid (aacid) wrote :

Seems the updated patch indeed fixes the crashes in Plasma 5 startup, but applications using KUniqueApplication like konsole or kfontview fail to start, these applications got a warning before (i.e. without the patchset) that they were misusing qdbus and i think that Qt changing to make them fail to start is "ok" but the problem is us introducing this before Qt does (qt 5.4 vs qt 5.5) and so suddenly/close to the vivid release.

Maybe this patchset could be applied either only to armhf or only to vivid-rtm (if we end up with that solution for this release) so the desktop apps using Qt5 have time to adapt?

Timo Jyrinki (timo-jyrinki) wrote :

The autopilot tests with ubuntu7~test3 are ready and looking (with unity8 and music-app flaky and could be rerun) good when compared to the stock archive results: http://people.canonical.com/~tjyrinki/qt5/ap-2015-04-09-mako-166-silo-018/. That means there's a good chance this is as good as before for Ubuntu Phone or even better, now that the KDE Plasma 5 also starts without problems too. The remaining problem are the KDE applications using deprecated behavior.

vivid-rtm should be available late next week so this could land there. Modifying the packaging to selectively patch only armhf might be an option but given vivid is <2 weeks from release even that's a bit far-fetched idea at this point. For vivid+1 that could be done so that we're not out of sync with rtm there.

Tony Espy (awe) wrote :

As the bug description seems to focus on test hangs, and mentions krillin and mako, I want to add that I see this on normal boots of mako ( vivid-proposed / #166 ) and arale ( vivid-proposed / #165 ); roughly 30% of the time.

Changed in canonical-devices-system-image:
status: New → In Progress
importance: Undecided → Critical
milestone: none → ww17-2015
Michał Sawicz (saviq) on 2015-04-15
Changed in autopilot (Ubuntu):
status: Confirmed → Fix Released
Timo Jyrinki (timo-jyrinki) wrote :

This is still pending on an upstream fix that doesn't break certain KDE apps using deprecated functions. Meanwhile, I can (and will) prepare a landing to vivid-rtm PPA if it seems a fix that can be landed to Ubuntu proper continues to take time.

Timo Jyrinki (timo-jyrinki) wrote :

With the current build 5.4.1+dfsg-2ubuntu6~rtm~test2 the KDE legacy app situation looks unchanged. Now the phone PPA is enabled so I will start evaluating if the current build would be suitable to there.

Timo Jyrinki (timo-jyrinki) wrote :

As an update ~test3 updated ppc64el symbols and "ubuntu7" is a no-change rebuild of ~test3 with a saner version number. Putting forward to QA for further testing and signoff.

Timo Jyrinki (timo-jyrinki) wrote :

QA has found an app installation problem sometimes happening with the silo 018, fixed at least by removing and re-adding U1 account. I did not see such problem, but I also flashed clean, updated and only then added the U1 account for installing apps so it maybe the reason I was able to install apps without problems.

I did see both with and without the PPA that sometimes adding U1 account in general stalls on vivid, so that's not a regression.

On IRC I got some info about how DBus is used with app installations: "dbus is used to monitor installation progress (for the progress bar in the preview). apps store scope just passes dbus object path to unity8 dash and doesn't interact with dbus directly (at least when it comes to installation); probably unity8/dash logs are first to look at"

I now updated to latest image, added U1 account, installed an app, and then upgraded to silo 018, and I can see the problem now. When I click "Install", dbus.log gets:
---
Activating service name='com.ubuntu.OnlineAccountsUi'
Successfully activated service 'com.ubuntu.OnlineAccountsUi'
---
but unity8-dash.log gets:
---
RequestAccess failed: QDBusError("org.freedesktop.DBus.Error.UnknownObject", "No such object path '/'")
---

Then when I clicked back arrow from the app page, I got a request for U1 credentials, which I canceled. When I go to u-s-s the U1 account is not there anymore. After adding U1 account, it appears that app installation work smoothly.

It seems to me the pattern is that the U1 account gets removed randomly when a reboot is involved. One time I observed the account had survived a reboot, but disappeared at the moment I clicked "Install". But it seems that without a reboot and with account added app installation seems to work for multiple apps without problems.

I did clean QML cache at one point so it does not seem to be related (also, qtdeclarative isn't touched in the PPA).

After ppa-purging the problem doesn't occur anymore. So the problem has to come from the QDBus changes in the silo, even if those provably fix the Unity 8 hanging issue during boot and no issues were seen in any of the autopilot tests.

description: updated
description: updated
Timo Jyrinki (timo-jyrinki) wrote :

https://code.launchpad.net/~mardy/ubuntu-system-settings-online-accounts/lp1421009/+merge/257267 got added to the PPA now. It seems to move the problem so that it never happens anymore during reboot - after reboot, the account is always still there.

But it'd seem possible that a similar fix to a racy code would be needed somewhere that gets called when "Install" button gets pressed for the first time after a reboot. It now often removes the account when it's pressed, although sometimes you can do a full "add U1 account, reboot, install app" cycle without problems.

Alejandro J. Cura (alecu) wrote :

With latest silo 18 on mako #183 I still got the account deleted after reboot.

And the spinner is still shown many times after clicking on "Sign In" in the online accounts screen, both when started from system-settings->account and from the apps scope "Install" button. Then it takes more than 10 minutes for the user to regain back control, after what looks like a unity8 crash.

While the spinner is shown, both online-accounts-ui and unity8 seem to go on a merry dance while chewing at the cpu. (perhaps it's online-accounts-lib the one inside the unity8 process, or it's qtdbus)

The top that alesage sees on arale looks similar to what I see on mako: http://pasteboard.co/2MKR1ed7.png

But, starting dbus-monitor doesn't show any unusual traffic that might explain the high cpu usage.

Michał Sawicz (saviq) wrote :

Note that the activity indicator itself might be the CPU hogger - see bug #1431957.

Timo Jyrinki (timo-jyrinki) wrote :

It seems I've found a workaround for the original unity8 hang on boot, which we can include in lxc-android-config without touching Qt while the upstream fixing of QDBus is not 100% complete.

Changed in lxc-android-config (Ubuntu):
assignee: nobody → Timo Jyrinki (timo-jyrinki)
status: New → In Progress
Timo Jyrinki (timo-jyrinki) wrote :

Scratch that. We started testing on arale, and it seems that on arale at least the delay workaround is not enough. It also seems 30 reboot is noy enough to reliably reproduce the hang, as once it happened after 54 reboots.

Changed in lxc-android-config (Ubuntu):
assignee: Timo Jyrinki (timo-jyrinki) → nobody
status: In Progress → Incomplete
description: updated
description: updated
Vincent Ladeuil (vila) wrote :

I've created https://code.launchpad.net/~canonical-platform-qa/unity8/1421009-hangs-on-boot/+merge/257408 to reproduce this bug.

Given the last comment above it may need to be tweaked s/30/60/ .

Vincent Ladeuil (vila) wrote :

Also reproduced on arale with the branch above after 25 reboots.

$ adb shell system-image-cli -i
current build number: 22
device name: arale
channel: ubuntu-touch/arale-vivid-proposed
last update: 2010-01-01 00:29:22
version version: 22
version ubuntu: 30e975dd988b3be7886017b6d46b2821da187d7b8e5e4c7a30b953114b50bb15
version device: 998baf0435d37e42421b39c34a27ed5f872c145f1d2016e08f477d9ad6b8a654
version custom: 76da72854bde96be6e9d546a607550a952b75ca9c3143e0002f457f826056e87

In addition, the first time I ran this test I ended up with a black screen with only the tiny ubuntu logo after two reboots only.
Rebooting the phone from the power button was enough to recover it though.

Looks like this test can reproduce more than one bug ;)

Pat McGowan (pat-mcgowan) wrote :

Tried running the new test but I get /tmp/autopkgtest-reboot: not found

Vincent Ladeuil (vila) wrote :

@Pat: Can you re-try adding '-d' to your adt-run options and share the log ?

/tmp/autopkgtest-reboot is created by adt-run on the testbed at run time.

I'm eager to learn how such a failure can happen and where it needs to be fixed.

Vincent Ladeuil (vila) wrote :

@Pat: Just in case, I'm on vivid and use autopkgtest 3.13, that may be related.

description: updated
description: updated
description: updated
Timo Jyrinki (timo-jyrinki) wrote :

Some weekend numbers of how many reboots needed to reproduce the problem on mako: 30, 12, 15, 59, 15, 9, 26, 10, 6. There's no clear upper limit.

In other testing, reverting the libusermetrics landing from February does not seem to cure the problem - I was able to reproduce the problem also with the revert I've pushed to https://launchpad.net/~ci-train-ppa-service/+archive/ubuntu/landing-007/+packages

I've improved the test case in the description several times to get fuller backtraces. With the latest version I was able to see that both in case of the current libusermetrics and the reverted one the backtrace leads back to usermetrics's DBus usage. Which is not to say there's anything wrong with libusermetrics, as the bug is in multi-threaded handling of DBus inside Qt. The DBus call being called is:
#17 0xffffffff in dbus_bus_add_match (connection=0xe9d140, rule=0x11ba818 "type='signal',sender='com.canonical.UserMetrics',path='/com/canonical/UserMetrics/DataSource/2',interface='com.canonical.usermetrics.DataSource',member='emptyDataStringChanged'", error=0x0) at ../../dbus/dbus-bus.c:1553
        msg = 0xf76058
        __FUNCTION__ = "dbus_bus_add_match"

Timo Jyrinki (timo-jyrinki) wrote :
Timo Jyrinki (timo-jyrinki) wrote :

(note that the "path" and "member" may be different on each hang)

description: updated
description: updated
description: updated
Pete Woods (pete-woods) on 2015-04-27
Changed in libusermetrics (Ubuntu):
status: Invalid → In Progress
importance: Undecided → Critical
assignee: nobody → Pete Woods (pete-woods)
Timo Jyrinki (timo-jyrinki) wrote :

There was a tip from tvoss to try to use QDBusConnection::connectToBus instead of QDBusConnection::sessionBus() / systemBus(), to minimize sharing of DBus connections.

There are 3 PPA:s now all doing that:

- https://launchpad.net/~ci-train-ppa-service/+archive/ubuntu/landing-007/ - Pete's changes to libusermetrics only (note, Pete is updating his branch but the current build is from his first revision that only updated systemBus calls)
- https://launchpad.net/~canonical-qt5-edgers/+archive/qt5-proper/ - Pete's and my changes to libusermetrics
- https://launchpad.net/~ci-train-ppa-service/+archive/ubuntu/landing-010/ - Pete's and my changes to libusermetrics + similar

Using 007, my mako is now at 105 reboots and counting.

Timo Jyrinki (timo-jyrinki) wrote :

"+ similar changes to unity8 from Albert", the 010 description was meant to be.

Timo Jyrinki (timo-jyrinki) wrote :

Ok 007 was validated on mako and arale (>= 100 reboots), but we're switching silos now. Also qt5-proper was tested but it won't be needed at this point.

A new commit on Pete's libusermetrics branch combines his and my changes in a better way and that is now being built in silo _017_. We're going to revalidate that and if that's good, put it to QA for sign-off. Unity 8 changes would not be needed at this point.

So, testing should be done with https://launchpad.net/~ci-train-ppa-service/+archive/ubuntu/landing-017 once it has finished building.

Timo Jyrinki (timo-jyrinki) wrote :

arale testing stopped at 127 successful reboots, mako at 83. Now in QA sign-off.

description: updated
Timo Jyrinki (timo-jyrinki) wrote :

This bug is now fixed (workarounded) in the overlay PPA, ie the unity8 doesn't hang anymore on boot:

libusermetrics (1.1.1+15.04.20150427-0ubuntu1) vivid; urgency=medium

  [ Pete Woods ]
  * Stop using shared DBus connections (LP: #1421009)

 -- CI Train Bot <email address hidden> Mon, 27 Apr 2015 15:38:43 +0000

The bug will stay open for the possibility of getting the reworked QDBus from upstream at some point.

Changed in libusermetrics (Ubuntu):
status: In Progress → Fix Released
Changed in canonical-devices-system-image:
status: In Progress → Fix Committed
Changed in lxc-android-config (Ubuntu):
status: Incomplete → Invalid
Changed in qtbase-opensource-src (Ubuntu):
importance: Critical → High
Changed in canonical-devices-system-image:
status: Fix Committed → Fix Released
tags: removed: lt-blocker
Changed in qtbase-opensource-src (Ubuntu):
importance: High → Medium
Changed in unity8 (Ubuntu):
importance: High → Undecided
Łukasz Zemczak (sil2100) wrote :

Removing it from the landing team tracker as in theory this is no longer a real issue.

Timo Jyrinki (timo-jyrinki) wrote :

Upstream has postponed their QDBus rework to Qt 5.6, so it seems there won't be anything to test for some time.

Changed in qtbase-opensource-src (Ubuntu):
status: In Progress → Incomplete
assignee: Timo Jyrinki (timo-jyrinki) → nobody
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package ubuntu-system-settings-online-accounts - 0.6+15.10.20150624-0ubuntu1

---------------
ubuntu-system-settings-online-accounts (0.6+15.10.20150624-0ubuntu1) wily; urgency=medium

  [ Alberto Mardegan ]
  * Fix build with Qt 5.5 (LP: #1387537, #1421009, #1448878, #1447175)
  * Return the error name to the client (LP: #1441873)

  [ CI Train Bot ]
  * New rebuild forced.

 -- CI Train Bot <email address hidden> Wed, 24 Jun 2015 13:36:39 +0000

Changed in ubuntu-system-settings-online-accounts (Ubuntu):
status: New → Fix Released
Timo Jyrinki (timo-jyrinki) wrote :

Qt 5.6 now in yakkety and xenial-overlay with largely re-engineered QDBus (...with new but different bugs).

Changed in qtbase-opensource-src (Ubuntu):
status: Incomplete → Fix Released
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers