Ubuntu
unity8 package

Abort when blanking/unblanking screen; exception thrown from mir::graphics::android::HWCCommonDevice::mode(MirPowerMode)

Bug #1258655 reported by Ricardo Salveti on 2013-12-06

This bug affects 5 people

	Status	Importance	Assigned to
Mir	Invalid	Critical	Kevin DuBois
powerd	Fix Released	Critical	Seth Forshee
mir (Ubuntu)	Invalid	Critical	Kevin DuBois
powerd (Ubuntu)	Fix Released	Undecided	Unassigned
unity8 (Ubuntu)	Invalid	Undecided	Unassigned

Bug Description

Mako, image 50 + manual unity8 update (7.84+14.04.20131206.1-0ubuntu1).

Just press the power button quickly to blank/unblank the screen, and unity8 will crash with the following:

Program terminated with signal 6, Aborted.
#0 __libc_do_syscall () at ../ports/sysdeps/unix/sysv/linux/arm/libc-do-syscall.S:44
44 ../ports/sysdeps/unix/sysv/linux/arm/libc-do-syscall.S: No such file or directory.
(gdb) bt full
#0 __libc_do_syscall () at ../ports/sysdeps/unix/sysv/linux/arm/libc-do-syscall.S:44
No locals.
#1 0x40e215fe in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
        _a1 = <optimized out>
        _a3tmp = 6
        _a1tmp = 0
        _a3 = <optimized out>
        _nametmp = 268
        _a2tmp = 1520
        _a2 = <optimized out>
        _name = <optimized out>
        _sys_result = <optimized out>
        pd = 0x46b3c460
        pid = 0
        selftid = 1520
#2 0x40e23e1a in __GI_abort () at abort.c:90
        save_stage = 2
        act = {__sigaction_handler = {sa_handler = 0x40e83044 <write+52>, sa_sigaction = 0x40e83044 <write+52>}, sa_mask = {__val = {2, 1, 1089325599, 1088958560, 1088730173, 1126388064, 1186186336, 1126388224, 1186186336, 1086550016, 667796, 1186187552, 0, 3252,
              1186182968, 1073817549, 1126388224, 1, 5, 0, 85, 1088428904, 1, 1089326576, 1, 1441296276, 1086849877, 78, 85, 1073835572, 1444402408, 1}}, sa_flags = 2, sa_restorer = 0x1}
        sigs = {__val = {32, 0 <repeats 31 times>}}
#3 0x40c817b4 in __gnu_cxx::__verbose_terminate_handler () at ../../../../src/libstdc++-v3/libsupc++/vterminate.cc:95
        terminating = true
        t = <optimized out>
#4 0x40c801d4 in __cxxabiv1::__terminate (handler=<optimized out>) at ../../../../src/libstdc++-v3/libsupc++/eh_terminate.cc:38
No locals.
#5 0x40c801f0 in std::terminate () at ../../../../src/libstdc++-v3/libsupc++/eh_terminate.cc:48
No locals.
#6 0x40c8041a in __cxxabiv1::__cxa_throw (obj=0x5617d540, tinfo=<optimized out>, dest=<optimized out>) at ../../../../src/libstdc++-v3/libsupc++/eh_throw.cc:84
No locals.
#7 0x4142269e in void boost::throw_exception<boost::exception_detail::error_info_injector<std::runtime_error> >(boost::exception_detail::error_info_injector<std::runtime_error> const&) () from /usr/lib/arm-linux-gnueabihf/libmirplatformgraphics.so
No symbol table info available.
#8 0x41426840 in mir::graphics::android::HWCCommonDevice::mode(MirPowerMode) () from /usr/lib/arm-linux-gnueabihf/libmirplatformgraphics.so
No symbol table info available.
#9 0x41422f26 in mir::graphics::android::AndroidDisplay::configure(mir::graphics::DisplayConfiguration const&) () from /usr/lib/arm-linux-gnueabihf/libmirplatformgraphics.so
No symbol table info available.
#10 0x402efeae in DBusScreen::setScreenPowerMode(QString const&) () from /usr/lib/arm-linux-gnueabihf/libunity-mir.so.1
No symbol table info available.
#11 0x402f56e8 in DBusScreen::qt_metacall(QMetaObject::Call, int, void**) () from /usr/lib/arm-linux-gnueabihf/libunity-mir.so.1
No symbol table info available.
#12 0x402a01da in ?? () from /usr/lib/arm-linux-gnueabihf/libQt5DBus.so.5
No symbol table info available.
#13 0x402a01da in ?? () from /usr/lib/arm-linux-gnueabihf/libQt5DBus.so.5
No symbol table info available.
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

Tags:

Related branches

lp:~kdub/powerd/fix-1258655

Rejected for merging into lp:powerd

Seth Forshee (community): Disapprove on 2013-12-19

PS Jenkins bot: Approve (continuous-integration) on 2013-12-18

lp:~sforshee/powerd/fix-display-timing

Merged into lp:powerd at revision 105

Ricardo Salveti (community): Approve on 2014-01-08

PS Jenkins bot: Approve (continuous-integration) on 2014-01-06

Kevin DuBois (community): Approve (testing) on 2014-01-03

lp:ubuntu/trusty-proposed/powerd

Revision history for this message

Ricardo Salveti (rsalveti) wrote on 2013-12-06:

Crash file. Edit (14.2 MiB, text/plain)

tags:

added: avengers

Michał Sawicz (saviq) on 2013-12-06

Changed in unity8 (Ubuntu):
status:	New → Confirmed

Revision history for this message

Launchpad Janitor (janitor) wrote on 2013-12-06:

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in mir (Ubuntu):
status:	New → Confirmed

Michał Sawicz (saviq) on 2013-12-06

Changed in unity8:
status:	New → Triaged
importance:	Undecided → High

Revision history for this message

Alan Pope 🍺🐧🐱 🦄 (popey) wrote on 2013-12-06:

This happens on #51 on mako.

Ricardo Salveti (rsalveti) on 2013-12-07

tags:	added: regression
tags:	added: r51
summary:	- Abort when blanking/unblanking screen + [regression] Abort when blanking/unblanking screen

Revision history for this message

Michał Sawicz (saviq) wrote on 2013-12-07:

This is not a regression, I've reported this in October (bug #1238287), it just got lost in the slew of other raise() bugs, which LP unfortunately marks duplicate over-eagerly - and then I've mistakenly marked it Incomplete...

summary:

- [regression] Abort when blanking/unblanking screen
+ Abort when blanking/unblanking screen

Revision history for this message

Ricardo Salveti (rsalveti) wrote on 2013-12-09:

Not a regression but something happened with latest Unity8 that made it way easier to be reproduced. Just used latest image for a few minutes and was able to reproduce the crash a few times already.

kevin gunn (kgunn72) on 2013-12-09

Changed in mir (Ubuntu):
assignee:	nobody → Kevin DuBois (kdub)

Kevin DuBois (kdub) on 2013-12-10

tags:

removed: regression

Kevin DuBois (kdub) on 2013-12-10

Changed in mir (Ubuntu):
status:	Confirmed → In Progress

Daniel van Vugt (vanvugt) on 2013-12-11

summary:	- Abort when blanking/unblanking screen + Abort when blanking/unblanking screen; exception thrown from + mir::graphics::android::HWCCommonDevice::mode(MirPowerMode)
affects:	unity8 → mir
Changed in mir:
assignee:	nobody → Kevin DuBois (kdub)
status:	Triaged → In Progress
Changed in unity8 (Ubuntu):
status:	Confirmed → Invalid
Changed in mir:
milestone:	none → 0.1.3

kevin gunn (kgunn72) on 2013-12-12

Changed in mir:
importance:	High → Critical

Revision history for this message

Kevin DuBois (kdub) wrote on 2013-12-12:

From mir's perspective, i see "start, stop, start, stop, ..., start, start" being called on the compositor. The double start, start causes an error in the compositor. (specifically, an egl context is current in two threads, causing an error BAD_ACCESS in the second thread).

powerd-cli listen just reports the on/off/on/off pattern when the double start happens from mir's perspective. Currently digging a bit into unity-mir to see what's going on up there.

Revision history for this message

Kevin DuBois (kdub) wrote on 2013-12-13:

dbus-log.txt Edit (2.1 KiB, text/plain)

Gerry collected this helpful dbus log http://pastebin.ubuntu.com/6567567/ (also attached) that shows we get an on, on message from dbus.

the error shown:
terminate called after throwing an instance of 'boost::exception_detail::clone_impl<boost::exception_detail::error_info_injector<std::runtime_error> >'
what(): could not activate surface with eglMakeCurrent

will happen if mc::Compositor::start() is called twice in succession. Still trying to figure out if the on, on message from dbus is something we should handle, or if something is getting confused somehow.

Revision history for this message

Kevin DuBois (kdub) wrote on 2013-12-13:

I also did a mir-only test to make sure mir can perform a display on/off in quick succession in a loop (which it can), as well as on/display some frames/off in a loop (which it also can). So it doesn't look like the drivers are breaking on us, just something between mir/unity-mir/unity8/powerd is getting mixed up

Revision history for this message

Kevin DuBois (kdub) wrote on 2013-12-17:

mir needs to be "robustified" against the double on and double off calls. Even when I account for this though, the system will still run into problems.

I still have reason to suspect that when the dbus message to unity mir to turn the display on is received, the system is still in the suspend state. Still trying to verify that when the display on message is sent, the system is guaranteed to be in the 'active' state before the message is sent.

Daniel van Vugt (vanvugt) on 2013-12-17

Changed in mir (Ubuntu):
status:	In Progress → Triaged
importance:	Undecided → Critical

Revision history for this message

Daniel van Vugt (vanvugt) wrote on 2013-12-17:

#10

There is only one possible exception thrown from the offending mga::HWCCommonDevice::mode()

    if (err)
    {
        std::string blanking_status_msg = "Could not " +
            ((mode_request == mir_power_mode_off) ? std::string("blank") : std::string("unblank")) + " display";
        BOOST_THROW_EXCEPTION(
            boost::enable_error_info(
                std::runtime_error(blanking_status_msg)) <<
            boost::errinfo_errno(-err));
    }

Safe to say whatever the cause, failure to power off the screen should not be treated as a fatal exception. Perhaps powering on should, but even then we need to avoid the potential spurious errors of "the screen is already on".

Revision history for this message

Kevin DuBois (kdub) wrote on 2013-12-17:

#11

regardless of whether we throw or not on unblank, the problem at the root of this bug is that mir and powerd are not synced up about the power state of the system (clocks, suspend, fb available, etc). This problem sometimes manifests in failure in the blank/unblank request, sometimes a hang waiting for frame posting, and sometimes failure in hwc's set() function.

If i hold "powerd-cli active" (this should prevent suspending, but not affect anything with the display requests), I can power button on/off rapidly all day.

Revision history for this message

Kevin DuBois (kdub) wrote on 2013-12-18:

#12

i've identified some problems in powerd that cause double requests to be called. currently working to eliminate those in that code base.

Kevin DuBois (kdub) on 2013-12-18

Changed in powerd:
status:	New → In Progress
assignee:	nobody → Kevin DuBois (kdub)
Changed in mir (Ubuntu):
status:	Triaged → Invalid

Revision history for this message

Kevin DuBois (kdub) wrote on 2013-12-18:

#13

I've been able to avert the problem on mako by patching powerd. Currently cooking up a patch to propose to powerd that works in the generic case. The root of the problem was that devices with early suspend were turning mir on and off when they did not guarantee the framebuffer was in a usable state. The fix takes the form of ensuring thread safety around the display state in the different threads in powerd's src/display.c

when mir was turned on before the FB was available, (or when mir was turned off after the FB had been made unavailable) mir would fail with an error. my comment in #9 about mir/unity-mir being robustified against double on calls does not need to be done now, as fixing the locking also prevented ever being able to send 'double on' calls from being sent

Revision history for this message

Kevin DuBois (kdub) wrote on 2013-12-18:

#14

this problem would only affect devices that use 'earlysuspend'. Other devices do not have a second thread messing with the display state tracking, only earlysuspend devices would spawn the thread that was causing problems.

Changed in mir:
milestone:	0.1.3 → none

Kevin DuBois (kdub) on 2013-12-18

Changed in mir:
status:	In Progress → Invalid
Changed in powerd:
importance:	Undecided → Critical

Revision history for this message

Seth Forshee (sforshee) wrote on 2013-12-19:

#15

I've been poking at this a bit today, and I think it comes down to a race between powerd and the kernel's earlysuspend code. Here's the sequence of events I see in the logs for one case (there may also be other sequences that are problematic).

1. The initial state is that the screen is on. The power button is pressed.
2. powerd receives the button press, turns off the display, and releases its internal active state request.
3. powerd determines that suspend is possible and writes "mem" to /sys/power/state. The kernel schedules the earlysuspend work on an internal workqueue and the write call returns back to powerd.
4. The kernel starts processing the earlysuspend work.
5. Before the kernel earlysuspend work is complete, powerd receives another button press.
6. powerd looks to see whether it's safe to turn on the screen. At this point earlysuspend hasn't completed to the point where it's signaling that the framebuffer is asleep, so powerd determines that it is safe to turn on the display.
7. Unity tries to turn on the display, but the display device is in some partially or fully suspended state so something goes wrong.
8. Early suspend finishes and the fb state is updated, making powerd aware that the framebuffer is asleep.
8. Somewhere in this time frame powerd will also writes "on" to /sys/power/state to disable autosuspend. The kernel queues late resume work, but if early suspend has started then it must complete before late reusme runs and brings the framebuffer back to the awake state.

I'm not sure yet what the solution is. We can't simply synchronize the display state on wait_for_fb_* because it's at least theoretically possible to write "mem" then "on" to /sys/power/state quickly enough that early suspend never runs, and also because turning off the display doesn't necessarily imply enabling early suspend. One possible solution would be to make powerd's main thread block until the wait_for_fb_* status changes after writing to /sys/power/state for the earlysuspend implementation, but that would need to be done carefully to ensure that it wouldn't block indefinitely if something went wrong.

Revision history for this message

Daniel van Vugt (vanvugt) wrote on 2013-12-20:

#16

Forgive my silly question, but:
Why can't we just be more fault-tolerant and avoid the offending throw from Mir? Log a warning at most, but don't intentionally crash on it.

Would that result in an interaction with powerd rendering the system unusable? Or would it be sufficient to allow some recovery/retries?

Changed in mir:
milestone:	none → 0.1.3

Daniel van Vugt (vanvugt) on 2013-12-20

Changed in mir:
milestone:	0.1.3 → 0.1.4

Revision history for this message

Seth Forshee (sforshee) wrote on 2014-01-02: Re: [Bug 1258655] Re: Abort when blanking/unblanking screen; exception thrown from mir::graphics::android::HWCCommonDevice::mode(MirPowerMode)

#17

On Fri, Dec 20, 2013 at 01:59:20AM -0000, Daniel van Vugt wrote:
> Forgive my silly question, but:
> Why can't we just be more fault-tolerant and avoid the offending throw from Mir? Log a warning at most, but don't intentionally crash on it.

+1. However ...

> Would that result in an interaction with powerd rendering the system
> unusable? Or would it be sufficient to allow some recovery/retries?

On mako I fear that the driver has bugs that may still result in
unrecoverable problems. This is based on having found unity8 threads
stuck in uninterruptible sleep when in this state due to being blocked
on some mutex within the dispaly driver. We could try to fix this in the
driver, but we may encounter similar problems with other android
kernels, so I think we need to find some way to make powerd synchronize
wrt the kernel early suspend processing.

So I think we should do both: make Mir do something sane when the ioctl
returns an error, and add a workaround to powerd to avoid this race.

Revision history for this message

Kevin DuBois (kdub) wrote on 2014-01-03:

#18

With the way things are now, I'm pretty sure if we simply retried in mir after experiencing an error, we would hang the driver. Once we get these throws, the driver has essentially been put into an unrecoverable state. It seems Seth's investigation aligned with my intuition about the driver.

Revision history for this message

Seth Forshee (sforshee) wrote on 2014-01-03:

#19

I disagree. Mir shouldn't behave badly because the display driver is behaving badly. And I don't think crashing really helps anyway. Won't unity just be respawned and again try to use the display device without doing anything to fix the unrecoverable state?

If the display driver still hangs after powerd is changed to synchronize with early suspend / late resume, then we should try to fix the display driver.

Kevin DuBois (kdub) on 2014-01-03

Changed in mir:
milestone:	0.1.4 → none

Revision history for this message

Kevin DuBois (kdub) wrote on 2014-01-07:

#20

If I understand, we have a fix, so lets move with that.

We can always improve mir to be more robust, but the heart of this problem was that mir/unity-mir was not given the guarantee that it needs (the system on enough to start operations).

Ricardo Salveti (rsalveti) on 2014-01-08

Changed in powerd:
assignee:	Kevin DuBois (kdub) → Seth Forshee (sforshee)

Revision history for this message

PS Jenkins bot (ps-jenkins) wrote on 2014-01-08:

#21

Fix committed into lp:powerd at revision None, scheduled for release in powerd, milestone ubuntu-13.05

Changed in powerd:
status:	In Progress → Fix Committed

Revision history for this message

Launchpad Janitor (janitor) wrote on 2014-01-08:

#22

This bug was fixed in the package powerd - 0.13+14.04.20140108.1-0ubuntu1

---------------
powerd (0.13+14.04.20140108.1-0ubuntu1) trusty; urgency=low

  [ Ricardo Salveti de Araujo ]
  * Using pkg-config files from hybris, and depending on latest version
    .

  [ Seth Forshee ]
  * Synchronously wait on the framebuffer state to change when
    entering/exiting suspend with kernels using the earlysuspend
    implementation to avoid races which can lead to attempting to use
    the display device while it is suspended. (LP: #1258655)

[ Ubuntu daily release ]
* Automatic snapshot from revision 105
-- Ubuntu daily release <email address hidden> Wed, 08 Jan 2014 06:17:08 +0000

Changed in powerd (Ubuntu):
status:	New → Fix Released

Pat McGowan (pat-mcgowan) on 2014-01-16

Changed in powerd:
status:	Fix Committed → Fix Released

Report a bug

This report contains Public information

Everyone can see this information.

Duplicates of this bug

You are

Subscribing...

Edit bug mail

Other bug subscribers

Bug attachments

Add attachment

Remote bug watches

Bug watches keep track of this bug in other bug trackers.

Ubuntuunity8 package

Abort when blanking/unblanking screen; exception thrown from mir::graphics::android::HWCCommonDevice::mode(MirPowerMode)

Bug Description

Related branches

Duplicates of this bug

Other bug subscribers

Bug attachments

Remote bug watches

Ubuntu
unity8 package