Abort when blanking/unblanking screen; exception thrown from mir::graphics::android::HWCCommonDevice::mode(MirPowerMode)

Bug #1258655 reported by Ricardo Salveti on 2013-12-06
56
This bug affects 5 people
Affects Status Importance Assigned to Milestone
Mir
Invalid
Critical
Kevin DuBois
powerd
Critical
Seth Forshee
mir (Ubuntu)
Critical
Kevin DuBois
powerd (Ubuntu)
Undecided
Unassigned
unity8 (Ubuntu)
Undecided
Unassigned

Bug Description

Mako, image 50 + manual unity8 update (7.84+14.04.20131206.1-0ubuntu1).

Just press the power button quickly to blank/unblank the screen, and unity8 will crash with the following:

Program terminated with signal 6, Aborted.
#0 __libc_do_syscall () at ../ports/sysdeps/unix/sysv/linux/arm/libc-do-syscall.S:44
44 ../ports/sysdeps/unix/sysv/linux/arm/libc-do-syscall.S: No such file or directory.
(gdb) bt full
#0 __libc_do_syscall () at ../ports/sysdeps/unix/sysv/linux/arm/libc-do-syscall.S:44
No locals.
#1 0x40e215fe in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
        _a1 = <optimized out>
        _a3tmp = 6
        _a1tmp = 0
        _a3 = <optimized out>
        _nametmp = 268
        _a2tmp = 1520
        _a2 = <optimized out>
        _name = <optimized out>
        _sys_result = <optimized out>
        pd = 0x46b3c460
        pid = 0
        selftid = 1520
#2 0x40e23e1a in __GI_abort () at abort.c:90
        save_stage = 2
        act = {__sigaction_handler = {sa_handler = 0x40e83044 <write+52>, sa_sigaction = 0x40e83044 <write+52>}, sa_mask = {__val = {2, 1, 1089325599, 1088958560, 1088730173, 1126388064, 1186186336, 1126388224, 1186186336, 1086550016, 667796, 1186187552, 0, 3252,
              1186182968, 1073817549, 1126388224, 1, 5, 0, 85, 1088428904, 1, 1089326576, 1, 1441296276, 1086849877, 78, 85, 1073835572, 1444402408, 1}}, sa_flags = 2, sa_restorer = 0x1}
        sigs = {__val = {32, 0 <repeats 31 times>}}
#3 0x40c817b4 in __gnu_cxx::__verbose_terminate_handler () at ../../../../src/libstdc++-v3/libsupc++/vterminate.cc:95
        terminating = true
        t = <optimized out>
#4 0x40c801d4 in __cxxabiv1::__terminate (handler=<optimized out>) at ../../../../src/libstdc++-v3/libsupc++/eh_terminate.cc:38
No locals.
#5 0x40c801f0 in std::terminate () at ../../../../src/libstdc++-v3/libsupc++/eh_terminate.cc:48
No locals.
#6 0x40c8041a in __cxxabiv1::__cxa_throw (obj=0x5617d540, tinfo=<optimized out>, dest=<optimized out>) at ../../../../src/libstdc++-v3/libsupc++/eh_throw.cc:84
No locals.
#7 0x4142269e in void boost::throw_exception<boost::exception_detail::error_info_injector<std::runtime_error> >(boost::exception_detail::error_info_injector<std::runtime_error> const&) () from /usr/lib/arm-linux-gnueabihf/libmirplatformgraphics.so
No symbol table info available.
#8 0x41426840 in mir::graphics::android::HWCCommonDevice::mode(MirPowerMode) () from /usr/lib/arm-linux-gnueabihf/libmirplatformgraphics.so
No symbol table info available.
#9 0x41422f26 in mir::graphics::android::AndroidDisplay::configure(mir::graphics::DisplayConfiguration const&) () from /usr/lib/arm-linux-gnueabihf/libmirplatformgraphics.so
No symbol table info available.
#10 0x402efeae in DBusScreen::setScreenPowerMode(QString const&) () from /usr/lib/arm-linux-gnueabihf/libunity-mir.so.1
No symbol table info available.
#11 0x402f56e8 in DBusScreen::qt_metacall(QMetaObject::Call, int, void**) () from /usr/lib/arm-linux-gnueabihf/libunity-mir.so.1
No symbol table info available.
#12 0x402a01da in ?? () from /usr/lib/arm-linux-gnueabihf/libQt5DBus.so.5
No symbol table info available.
#13 0x402a01da in ?? () from /usr/lib/arm-linux-gnueabihf/libQt5DBus.so.5
No symbol table info available.
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

Related branches

Ricardo Salveti (rsalveti) wrote :
tags: added: avengers
Michał Sawicz (saviq) on 2013-12-06
Changed in unity8 (Ubuntu):
status: New → Confirmed
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in mir (Ubuntu):
status: New → Confirmed
Michał Sawicz (saviq) on 2013-12-06
Changed in unity8:
status: New → Triaged
importance: Undecided → High

This happens on #51 on mako.

tags: added: regression
tags: added: r51
summary: - Abort when blanking/unblanking screen
+ [regression] Abort when blanking/unblanking screen
Michał Sawicz (saviq) wrote :

This is not a regression, I've reported this in October (bug #1238287), it just got lost in the slew of other raise() bugs, which LP unfortunately marks duplicate over-eagerly - and then I've mistakenly marked it Incomplete...

summary: - [regression] Abort when blanking/unblanking screen
+ Abort when blanking/unblanking screen
Ricardo Salveti (rsalveti) wrote :

Not a regression but something happened with latest Unity8 that made it way easier to be reproduced. Just used latest image for a few minutes and was able to reproduce the crash a few times already.

kevin gunn (kgunn72) on 2013-12-09
Changed in mir (Ubuntu):
assignee: nobody → Kevin DuBois (kdub)
Kevin DuBois (kdub) on 2013-12-10
tags: removed: regression
Kevin DuBois (kdub) on 2013-12-10
Changed in mir (Ubuntu):
status: Confirmed → In Progress
summary: - Abort when blanking/unblanking screen
+ Abort when blanking/unblanking screen; exception thrown from
+ mir::graphics::android::HWCCommonDevice::mode(MirPowerMode)
affects: unity8 → mir
Changed in mir:
assignee: nobody → Kevin DuBois (kdub)
status: Triaged → In Progress
Changed in unity8 (Ubuntu):
status: Confirmed → Invalid
Changed in mir:
milestone: none → 0.1.3
kevin gunn (kgunn72) on 2013-12-12
Changed in mir:
importance: High → Critical
Kevin DuBois (kdub) wrote :

From mir's perspective, i see "start, stop, start, stop, ..., start, start" being called on the compositor. The double start, start causes an error in the compositor. (specifically, an egl context is current in two threads, causing an error BAD_ACCESS in the second thread).

powerd-cli listen just reports the on/off/on/off pattern when the double start happens from mir's perspective. Currently digging a bit into unity-mir to see what's going on up there.

Kevin DuBois (kdub) wrote :

Gerry collected this helpful dbus log http://pastebin.ubuntu.com/6567567/ (also attached) that shows we get an on, on message from dbus.

the error shown:
terminate called after throwing an instance of 'boost::exception_detail::clone_impl<boost::exception_detail::error_info_injector<std::runtime_error> >'
  what(): could not activate surface with eglMakeCurrent

will happen if mc::Compositor::start() is called twice in succession. Still trying to figure out if the on, on message from dbus is something we should handle, or if something is getting confused somehow.

Kevin DuBois (kdub) wrote :

I also did a mir-only test to make sure mir can perform a display on/off in quick succession in a loop (which it can), as well as on/display some frames/off in a loop (which it also can). So it doesn't look like the drivers are breaking on us, just something between mir/unity-mir/unity8/powerd is getting mixed up

Kevin DuBois (kdub) wrote :

mir needs to be "robustified" against the double on and double off calls. Even when I account for this though, the system will still run into problems.

I still have reason to suspect that when the dbus message to unity mir to turn the display on is received, the system is still in the suspend state. Still trying to verify that when the display on message is sent, the system is guaranteed to be in the 'active' state before the message is sent.

Changed in mir (Ubuntu):
status: In Progress → Triaged
importance: Undecided → Critical
Daniel van Vugt (vanvugt) wrote :

There is only one possible exception thrown from the offending mga::HWCCommonDevice::mode()

    if (err)
    {
        std::string blanking_status_msg = "Could not " +
            ((mode_request == mir_power_mode_off) ? std::string("blank") : std::string("unblank")) + " display";
        BOOST_THROW_EXCEPTION(
            boost::enable_error_info(
                std::runtime_error(blanking_status_msg)) <<
            boost::errinfo_errno(-err));
    }

Safe to say whatever the cause, failure to power off the screen should not be treated as a fatal exception. Perhaps powering on should, but even then we need to avoid the potential spurious errors of "the screen is already on".

Kevin DuBois (kdub) wrote :

regardless of whether we throw or not on unblank, the problem at the root of this bug is that mir and powerd are not synced up about the power state of the system (clocks, suspend, fb available, etc). This problem sometimes manifests in failure in the blank/unblank request, sometimes a hang waiting for frame posting, and sometimes failure in hwc's set() function.

If i hold "powerd-cli active" (this should prevent suspending, but not affect anything with the display requests), I can power button on/off rapidly all day.

Kevin DuBois (kdub) wrote :

i've identified some problems in powerd that cause double requests to be called. currently working to eliminate those in that code base.

Kevin DuBois (kdub) on 2013-12-18
Changed in powerd:
status: New → In Progress
assignee: nobody → Kevin DuBois (kdub)
Changed in mir (Ubuntu):
status: Triaged → Invalid
Kevin DuBois (kdub) wrote :

I've been able to avert the problem on mako by patching powerd. Currently cooking up a patch to propose to powerd that works in the generic case. The root of the problem was that devices with early suspend were turning mir on and off when they did not guarantee the framebuffer was in a usable state. The fix takes the form of ensuring thread safety around the display state in the different threads in powerd's src/display.c

when mir was turned on before the FB was available, (or when mir was turned off after the FB had been made unavailable) mir would fail with an error. my comment in #9 about mir/unity-mir being robustified against double on calls does not need to be done now, as fixing the locking also prevented ever being able to send 'double on' calls from being sent

Kevin DuBois (kdub) wrote :

this problem would only affect devices that use 'earlysuspend'. Other devices do not have a second thread messing with the display state tracking, only earlysuspend devices would spawn the thread that was causing problems.

Changed in mir:
milestone: 0.1.3 → none
Kevin DuBois (kdub) on 2013-12-18
Changed in mir:
status: In Progress → Invalid
Changed in powerd:
importance: Undecided → Critical
Seth Forshee (sforshee) wrote :

I've been poking at this a bit today, and I think it comes down to a race between powerd and the kernel's earlysuspend code. Here's the sequence of events I see in the logs for one case (there may also be other sequences that are problematic).

1. The initial state is that the screen is on. The power button is pressed.
2. powerd receives the button press, turns off the display, and releases its internal active state request.
3. powerd determines that suspend is possible and writes "mem" to /sys/power/state. The kernel schedules the earlysuspend work on an internal workqueue and the write call returns back to powerd.
4. The kernel starts processing the earlysuspend work.
5. Before the kernel earlysuspend work is complete, powerd receives another button press.
6. powerd looks to see whether it's safe to turn on the screen. At this point earlysuspend hasn't completed to the point where it's signaling that the framebuffer is asleep, so powerd determines that it is safe to turn on the display.
7. Unity tries to turn on the display, but the display device is in some partially or fully suspended state so something goes wrong.
8. Early suspend finishes and the fb state is updated, making powerd aware that the framebuffer is asleep.
8. Somewhere in this time frame powerd will also writes "on" to /sys/power/state to disable autosuspend. The kernel queues late resume work, but if early suspend has started then it must complete before late reusme runs and brings the framebuffer back to the awake state.

I'm not sure yet what the solution is. We can't simply synchronize the display state on wait_for_fb_* because it's at least theoretically possible to write "mem" then "on" to /sys/power/state quickly enough that early suspend never runs, and also because turning off the display doesn't necessarily imply enabling early suspend. One possible solution would be to make powerd's main thread block until the wait_for_fb_* status changes after writing to /sys/power/state for the earlysuspend implementation, but that would need to be done carefully to ensure that it wouldn't block indefinitely if something went wrong.

Daniel van Vugt (vanvugt) wrote :

Forgive my silly question, but:
Why can't we just be more fault-tolerant and avoid the offending throw from Mir? Log a warning at most, but don't intentionally crash on it.

Would that result in an interaction with powerd rendering the system unusable? Or would it be sufficient to allow some recovery/retries?

Changed in mir:
milestone: none → 0.1.3
Changed in mir:
milestone: 0.1.3 → 0.1.4

On Fri, Dec 20, 2013 at 01:59:20AM -0000, Daniel van Vugt wrote:
> Forgive my silly question, but:
> Why can't we just be more fault-tolerant and avoid the offending throw from Mir? Log a warning at most, but don't intentionally crash on it.

+1. However ...

> Would that result in an interaction with powerd rendering the system
> unusable? Or would it be sufficient to allow some recovery/retries?

On mako I fear that the driver has bugs that may still result in
unrecoverable problems. This is based on having found unity8 threads
stuck in uninterruptible sleep when in this state due to being blocked
on some mutex within the dispaly driver. We could try to fix this in the
driver, but we may encounter similar problems with other android
kernels, so I think we need to find some way to make powerd synchronize
wrt the kernel early suspend processing.

So I think we should do both: make Mir do something sane when the ioctl
returns an error, and add a workaround to powerd to avoid this race.

Kevin DuBois (kdub) wrote :

With the way things are now, I'm pretty sure if we simply retried in mir after experiencing an error, we would hang the driver. Once we get these throws, the driver has essentially been put into an unrecoverable state. It seems Seth's investigation aligned with my intuition about the driver.

Seth Forshee (sforshee) wrote :

I disagree. Mir shouldn't behave badly because the display driver is behaving badly. And I don't think crashing really helps anyway. Won't unity just be respawned and again try to use the display device without doing anything to fix the unrecoverable state?

If the display driver still hangs after powerd is changed to synchronize with early suspend / late resume, then we should try to fix the display driver.

Kevin DuBois (kdub) on 2014-01-03
Changed in mir:
milestone: 0.1.4 → none
Kevin DuBois (kdub) wrote :

If I understand, we have a fix, so lets move with that.

We can always improve mir to be more robust, but the heart of this problem was that mir/unity-mir was not given the guarantee that it needs (the system on enough to start operations).

Changed in powerd:
assignee: Kevin DuBois (kdub) → Seth Forshee (sforshee)
PS Jenkins bot (ps-jenkins) wrote :

Fix committed into lp:powerd at revision None, scheduled for release in powerd, milestone ubuntu-13.05

Changed in powerd:
status: In Progress → Fix Committed
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package powerd - 0.13+14.04.20140108.1-0ubuntu1

---------------
powerd (0.13+14.04.20140108.1-0ubuntu1) trusty; urgency=low

  [ Ricardo Salveti de Araujo ]
  * Using pkg-config files from hybris, and depending on latest version
    .

  [ Seth Forshee ]
  * Synchronously wait on the framebuffer state to change when
    entering/exiting suspend with kernels using the earlysuspend
    implementation to avoid races which can lead to attempting to use
    the display device while it is suspended. (LP: #1258655)

  [ Ubuntu daily release ]
  * Automatic snapshot from revision 105
 -- Ubuntu daily release <email address hidden> Wed, 08 Jan 2014 06:17:08 +0000

Changed in powerd (Ubuntu):
status: New → Fix Released
Changed in powerd:
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers