apport takes too long to write crash report, appears to lock up phone

Bug #1278780 reported by Alan Pope 🍺🐧🐱 🦄 on 2014-02-11
This bug affects 12 people
Affects                  Importance  Assigned to
Apport                   Undecided   Unassigned
Canonical System Image   Critical    Steve Langasek
apport (Ubuntu)          High        Unassigned
qtmir (Ubuntu)           Undecided   Unassigned
qtubuntu (Ubuntu)        Undecided   Unassigned

Bug Description

I can trigger a crash easily on my phone via bug 1262711. Other bugs are available.

When that happens my phone appears to freeze. I am unable to do anything for approximately 1 to 1.5 minutes. As a user my initial gut reaction is to reboot the phone, thus losing the crash report, and wasting my time.

Having the phone lock up for 1.5 minutes is a terrible user experience. Can we fix/mitigate/workaround that?

ProblemType: Bug
DistroRelease: Ubuntu 14.04
Package: apport 2.13.2-0ubuntu2
Uname: Linux 3.4.0-3-mako armv7l
ApportVersion: 2.13.2-0ubuntu2
Architecture: armhf
CrashReports:
 664:32011:110:10083:2014-02-10 15:41:18.152893384 +0000:2014-02-10 15:11:09.169231740 +0000:/var/crash/_usr_lib_arm-linux-gnueabihf_upstart-app-launch_desktop-hook.32011.crash
 640:0:110:1681527:2014-02-10 15:12:10.985193887 +0000:2014-02-10 15:12:05.639489630 +0000:/var/crash/_usr_bin_powerd.0.crash
 640:0:110:21384:2014-02-11 07:58:44.876281991 +0000:2014-02-11 07:58:44.876281991 +0000:/var/crash/_usr_sbin_system-image-dbus.0.crash
 640:32011:110:17122318:2014-02-11 09:19:49.915478726 +0000:2014-02-11 09:18:20.850439824 +0000:/var/crash/_usr_bin_unity8.32011.crash
Date: Tue Feb 11 09:20:15 2014
InstallationDate: Installed on 2014-02-11 (0 days ago)
InstallationMedia: Ubuntu Trusty Tahr (development branch) - armhf (20140211)
PackageArchitecture: all
ProcEnviron:
 TERM=linux
 PATH=(custom, no user)
 XDG_RUNTIME_DIR=<set>
 SHELL=/bin/bash
SourcePackage: apport
UpgradeStatus: No upgrade log present (probably fresh install)

Dave Morley (davmor2) on 2014-02-12
Changed in apport (Ubuntu):
status: New → Confirmed

It's estimated to have a moderate impact on a large portion of Ubuntu users.

Changed in apport (Ubuntu):
importance: Undecided → High
status: Confirmed → Triaged
Changed in apport:
status: New → Confirmed

On Mon, Jun 22, 2015 at 02:48:01PM +0200, Oliver Grawert wrote:
> seems that whoopsie refuses to send your report because you are not on
> wifi, perhaps it should check for this earlier and not even run the
> collection, not sure ...

The only part that should be running when not on wifi is the crash
collection. We certainly *should* be running that when not on wifi; you
don't get a second chance to run the kernel crash handler, and we want to
know about crashes that only happen when not on wifi (including, possibly,
crashes that happen /because/ you're not on wifi).

On Mon, Jun 22, 2015 at 02:19:10PM +0100, Alan Pope wrote:

> Also, related to the discussion:-
> https://bugs.launchpad.net/ubuntu/+source/apport/+bug/1278780

The trouble is that, until the crash handler has finished consuming the
core file from the kernel and exited, the original process is blocked. So
the shell can't know that the process has died and move on until the crash
handler has finished running. (The same is true on the desktop, it's just
less impactful because the app is usually not full screen and blocking the
UI at the time.)

We have already taken steps to minimize the amount of processing done on the
core file as part of the apport crash handler, in favor of post-processing
the crash file from a queue. It's not clear what further optimization of
the crash handler is possible.

On Mon, Jun 22, 2015 at 02:39:04PM +0200, Oliver Grawert wrote:
> hi,
> On Monday, 22.06.2015, at 14:28 +0200, Marco F wrote:

> > Who decided that collection of debug information is the task with
> > highest priority on a phone which wishes to be used by the public?

> i think it already runs at a very low prio (obviously still too high for
> the phone though ... )

Running it with a low priority would NOT improve the user experience. It
would just cause the app to hang longer in an unkillable, memory-consuming
state.

Torsten Sachse (torsten-sachse) wrote :

On Mon, 22 Jun 2015, Steve Langasek wrote:
> The only part that should be running when not on wifi is the crash
> collection. We certainly *should* be running that when not on wifi; you
> don't get a second chance to run the kernel crash handler, and we want to
> know about crashes that only happen when not on wifi (including, possibly,
> crashes that happen /because/ you're not on wifi).

While that may be true, there has to be an option to switch off such crash
collection permanently or temporarily. This must not be enforced on the user,
which is the way it is right now, if I understood all this correctly. Not only
because some people might not feel comfortable with this information being sent
before they can review it.

> The trouble is that, until the crash handler has finished consuming the core
> file from the kernel and exited, the original process is blocked. So the
> shell can't know that the process has died and move on until the crash
> handler has finished running. (The same is true on the desktop, it's just
> less impactful because the app is usually not full screen and blocking the UI
> at the time.)

Thanks for the explanation. To me, this is just all the more reason to have an
option to disable it, maybe only for a day or a couple. Imagine being somewhere
and urgently needing your phone which then hangs because some crash logs are
being collected? The app that crashed might not even be the one you urgently
need. Crash collection should, imho, not take precedence over the dialog to
accept a call as this is still the main functionality of a phone, at least for
me.

Cheers,
Torsten

Steve Langasek (vorlon) wrote :

On Mon, Jun 22, 2015 at 08:21:44PM +0200, Torsten Sachse wrote:
> On Mon, 22 Jun 2015, Steve Langasek wrote:
> >The only part that should be running when not on wifi is the crash
> >collection. We certainly *should* be running that when not on wifi; you
> >don't get a second chance to run the kernel crash handler, and we want to
> >know about crashes that only happen when not on wifi (including, possibly,
> >crashes that happen /because/ you're not on wifi).

> While that may be true, there has to be an option to switch off such crash
> collection permanently or temporarily.

There is an option, and there is reportedly a bug in the handling of that
option, as was already mentioned in this thread.

  https://bugs.launchpad.net/ubuntu/+source/whoopsie-preferences/+bug/1437633

I was responding to the suggestion that the behavior should somehow be
dependent on whether the device is connected to wifi at the time of the
crash. That's just wrong.

> >The trouble is that, until the crash handler has finished consuming the core
> >file from the kernel and exited, the original process is blocked. So the
> >shell can't know that the process has died and move on until the crash
> >handler has finished running. (The same is true on the desktop, it's just
> >less impactful because the app is usually not full screen and blocking the UI
> >at the time.)

> Thanks for the explanation. To me, this is just all the more reason to
> have an option to disable it, maybe only for a day or a couple. Imagine
> being somewhere and urgently needing your phone which then hangs because
> some crash logs are being collected? The app that crashed might not even
> be the one you urgently need. Crash collection should, imho, not take
> precedence over the dialog to accept a call as this is still the main
> functionality of a phone, at least for me.

While the shell can't do anything with the crashed app until the crash
handler has finished, that *shouldn't* mean that it prevents the shell from,
e.g., switching apps. This is why people were asking about crash files for
unity8. It's unexpected that a crashed app should take out the /whole/ UI,
and if that happens that's a bug in unity8, not just a bug in the app that's
crashing.

--
Steve Langasek                   Give me a lever long enough and a Free OS
Debian Developer                   to set it on, and I can move the world.
Ubuntu Developer                                    http://www.debian.org/
<email address hidden> <email address hidden>

Torsten Sachse (torsten-sachse) wrote :

On Mon, 22 Jun 2015, Oliver Grawert wrote:
>On Monday, 22.06.2015, at 20:21 +0200, Torsten Sachse wrote:
>> On Mon, 22 Jun 2015, Steve Langasek wrote:
>> > The only part that should be running when not on wifi is the crash
>> > collection. We certainly *should* be running that when not on
>> > wifi; you don't get a second chance to run the kernel crash
>> > handler, and we want to know about crashes that only happen when
>> > not on wifi (including, possibly, crashes that happen /because/
>> > you're not on wifi).
>>
>> While that may be true, there has to be an option to switch off such
>> crash collection permanently or temporarily. This must not be
>> enforced on the user, which is the way it is right now, if I
>> understood all this correctly.
>
>there is a bug, bugs happen, people make mistakes ... you make it
>sound like this is intentional ...

I know that people make mistakes as I just made a stupid one myself. To
me, Steve's email sounded as if it was intentional due to the
highlighted "should" which I misunderstood as a "must". I know now that
I completely missed the point of the message.

On Mon, 22 Jun 2015, Steve Langasek wrote:
> There is an option, and there is reportedly a bug in the handling of
> that option, as was already mentioned in this thread.
> https://bugs.launchpad.net/ubuntu/+source/whoopsie-preferences/+bug/1437633

Thank you for the link, I must have overlooked that one somehow. Sorry
for that. However, I can confirm that the option sticks after making the
system writable. Actually, after remounting / as ro again, the option
can no longer be switched on (consistent with what's described in the
above thread).

> I was responding to the suggestion that the behavior should somehow be
> dependent on whether the device is connected to wifi at the time of
> the crash. That's just wrong.

I completely agree. The data should either be collected or not at all,
irrespective of the current network connectivity.

> While the shell can't do anything with the crashed app until the crash
> handler has finished, that *shouldn't* mean that it prevents the shell
> from, e.g., switching apps. This is why people were asking about
> crash files for unity8. It's unexpected that a crashed app should
> take out the /whole/ UI, and if that happens that's a bug in unity8,
> not just a bug in the app that's crashing.

Thanks for all the information.

Cheers,
Torsten

Changed in canonical-devices-system-image:
assignee: nobody → Steve Langasek (vorlon)
importance: Undecided → Critical
milestone: none → ww40-2015
status: New → Confirmed

Steve Langasek (vorlon) wrote :

some thoughts upon analysis of /usr/share/apport/apport:

- apport does have support for blacklisting crash reporting for specific programs. For programs that are critical for the UI on the phone and which cause a hard lockup of the UI while the crash report is running, we could use this blacklist as a workaround.
- apport currently calls os.nice(10) to avoid interfering with the running system while processing the crash file. This is counterproductive when the process being retraced is critical to the system.
- apport's crash dump handling is already carefully written to avoid bottlenecks in either memory or disk writes (the crash dump is read from the kernel 1MB at a time, compressed, and written to the crash file); this should already be close to optimal (1MB may not be the optimal write block size, but it will compress to variable size anyway and we're not flushing the disk between writes so the kernel should do a better job of optimizing for us). Thus, aside from possibly dropping the call to os.nice(), I confirm that there's not much of anywhere for us to go in improving the performance of the crash handler. The only other thing that might improve performance is for the crashing process to simply have less mapped into memory at crash time - I guess that a crash dump of the unity system compositor would be quite large due to having graphics buffers mapped into memory?
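
The streaming approach described in the last point can be sketched roughly as follows. This is a minimal illustration, not apport's actual code: the function name `write_compressed_core`, the use of plain zlib, and the exact output layout are assumptions made for the example. It reads the dump in 1 MB blocks, compresses each block as it arrives, and writes base64 text without ever holding the whole core in memory.

```python
import base64
import zlib

BLOCK_SIZE = 1024 * 1024  # 1 MB reads, as described above


def write_compressed_core(src, dst, block_size=BLOCK_SIZE):
    """Stream src to dst as base64-encoded zlib data; return bytes read.

    src and dst are binary file objects (hypothetical stand-ins for the
    kernel's core pipe and the crash file).
    """
    compressor = zlib.compressobj()
    pending = b""  # compressed bytes not yet base64-encoded
    total = 0
    while True:
        block = src.read(block_size)
        if not block:
            break
        total += len(block)
        pending += compressor.compress(block)
        # base64 works on 3-byte groups, so encode only whole groups now
        # and carry the remainder into the next iteration.
        cut = len(pending) - len(pending) % 3
        dst.write(base64.b64encode(pending[:cut]))
        pending = pending[cut:]
    pending += compressor.flush()
    dst.write(base64.b64encode(pending))
    return total
```

In a real handler `src` would be something like `sys.stdin.buffer`; memory use stays bounded by the block size regardless of how large the core is, which is why the remaining cost is dominated by the amount of data the crashing process has mapped.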

Martin Pitt (pitti) wrote :

Thanks for the nice summary Steve. I dropped the nice in http://bazaar.launchpad.net/~apport-hackers/apport/trunk/revision/3003 . It might certainly improve things a bit, but if it takes long now it will still take long with this; this is a little optimization, not really a complete solution.

We discussed some other approaches: the unity compositor could install a signal handler for SIGSEGV, and if it encounters a crash, release all its video buffers, and re-raise the signal. This could help a lot already, as presumably the mmapped video memory takes the majority of the core dumped area. This should be confirmed by experiments, though.

Steve Langasek (vorlon) wrote :

opening a task on unity8, for discussion with the unity team about the possibility of unmapping video buffers from a SIGSEGV handler.

Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Michael Zanetti (mzanetti) wrote :

Assigning down to QtMir as it does all the video buffer handling for unity8.

Changed in qtmir (Ubuntu):
status: New → Confirmed
affects: unity8 (Ubuntu) → qtmir (Ubuntu)

Steve Langasek (vorlon) wrote :

it would generally not be appropriate for a library to install a signal handler, however. Is this something that will need support up and down the unity8 stack?

Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in qtmir (Ubuntu):
status: New → Confirmed

Gerry Boland (gerboland) wrote :

Ok, task is for Qt clients to intercept the sigsegv signal, and in a handler close the mir connection which should release all its video buffers, then re-raise segv. Would it be of interest to have all mir clients exhibit this behaviour by default? If so, should put this behaviour in "mirclient"

If just for Qt, then Qtubuntu.

If unity8 crashes, there is the similar issue that it will hang until apport releases it. Mir/QtMir could gain similar ability to release its buffers before coredump, which would shrink the collected core. This could again be a Mir task.

QtMir's ApplicationManager will need adjusting to allow new instance of app to be launched, while the old instance is dumping core.

Changed in qtubuntu:
status: New → Confirmed

On Tue, Sep 15, 2015 at 08:45:39PM -0000, Gerry Boland wrote:
> Ok, task is for Qt clients to intercept the sigsegv signal, and in a
> handler close the mir connection which should release all its video
> buffers, then re-raise segv. Would it be of interest to have all mir
> clients exhibit this behaviour by default? If so, should put this
> behaviour in "mirclient"

Not Qt clients. The issue here is with crashes of the compositor itself.
Those are the ones that unavoidably lock up the UI while the kernel crash
handler is being run.

Gerry Boland (gerboland) on 2015-09-15
Changed in qtubuntu:
status: Confirmed → Invalid
Changed in canonical-devices-system-image:
milestone: ww40-2015 → ww46-2015

Pat McGowan (pat-mcgowan) wrote :

@thomas have you discussed a new approach for crash analysis?

Thomas Voß (thomas-voss) wrote :

@Pat: Not yet, on my list.

Thomas Voß (thomas-voss) wrote :

So a few thoughts:

  * Releasing graphics buffers in case of SIGSEGV seems to be quite dangerous as we are dealing with potentially corrupted memory. I don't think we should take this approach.
  * It would be nice to have the ability to skip core dumping and instead just produce a threaded stack trace (probably in a gray list maintained by apport). This obviously takes away some information, but it's probably a good tradeoff for the time being.
  * If we really want to dump, we could investigate into sendfile. I'm not entirely sure that it works with data coming in via stdin, but it's worth a try as we would avoid the kernel -> userspace copy.

Steve Langasek (vorlon) wrote :

On Thu, Oct 22, 2015 at 03:27:59PM -0000, Thomas Voß wrote:
> So a few thoughts:

> * Releasing graphics buffers in case of SIGSEGV seems to be quite
> dangerous as we are dealing with potentially corrupted memory. I
> don't think we should take this approach.

Why is this "quite dangerous"? Releasing the buffers should be a simple
matter of munmap(), shouldn't it? (We shouldn't do any kind of complex
"cleanup" of the buffers in the SIGSEGV handler, just drop them completely
from the process's memory.) If the references to the graphics buffers have
themselves been corrupted, then you can wind up unmapping the wrong area of
memory; but that is an unlikely scenario, and the worst case outcome is that
it causes a second segfault, which we can be sure to handle correctly (by
not handling it).
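
The sequence being proposed (drop the mappings, restore the default action, re-raise) can be illustrated with a small sketch. A real compositor would do this in C or C++ with munmap() in the handler; the Python below only demonstrates the order of operations for a SIGSEGV that the process sends to itself, and the buffer sizes and the `run_crash_demo` helper are made up for the example.

```python
import signal
import subprocess
import sys
import textwrap

# Source of a child process that "crashes"; run separately so the demo
# does not take down the caller.
CHILD_SOURCE = textwrap.dedent(
    """
    import mmap, os, resource, signal, time

    # Keep the demo from leaving a core file behind.
    resource.setrlimit(resource.RLIMIT_CORE, (0, 0))

    # Stand-ins for the mapped graphics buffers (sizes are made up).
    buffers = [mmap.mmap(-1, 8 * 1024 * 1024) for _ in range(4)]

    def on_segv(signum, frame):
        # Drop the mappings outright; no other cleanup is attempted.
        for buf in buffers:
            buf.close()
        # Restore the default action and re-raise, so the process still
        # dies from SIGSEGV and the kernel crash handler still runs.
        signal.signal(signum, signal.SIG_DFL)
        os.kill(os.getpid(), signum)

    signal.signal(signal.SIGSEGV, on_segv)
    os.kill(os.getpid(), signal.SIGSEGV)  # simulated crash
    time.sleep(5)  # only reached if the handler never ran
    """
)


def run_crash_demo():
    """Run the child and return its exit status as seen by the parent."""
    proc = subprocess.run([sys.executable, "-c", CHILD_SOURCE], timeout=30)
    return proc.returncode
```

The parent sees a negative return code equal to `-signal.SIGSEGV`, i.e. the child still terminates by the signal, so whatever crash handler is configured still fires; only the amount of mapped memory at that point has changed.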

> * It would be nice to have the ability to skip core dumping and instead
> just produce a threaded stack trace (probably in a gray list
> maintained by apport). This obviously takes away some information,
> but it's probably a good tradeoff for the time being.

That is fundamentally not possible without running the process under a
tracer (such as gdb). If you are using the kernel crash handler, the only
way to get this stack trace is by first reading the core from the kernel fd,
because this fd isn't going to be seekable.
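
The "not seekable" point is easy to check: a piped crash handler receives the core on a pipe, and a pipe rejects lseek() with ESPIPE. A minimal sketch (the helper name `is_seekable` is only for illustration):

```python
import errno
import os


def is_seekable(fd):
    """Return True if fd supports lseek(), False if it is pipe-like."""
    try:
        os.lseek(fd, 0, os.SEEK_CUR)
        return True
    except OSError as exc:
        if exc.errno == errno.ESPIPE:
            return False
        raise
```

Because the handler cannot jump to the parts of the core it needs, it has to consume the stream from the start, which is why extracting only a stack trace does not avoid reading the dump.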

> * If we really want to dump, we could investigate into sendfile. I'm not
> entirely sure that it works with data coming in via stdin, but it's
> worth a try as we would avoid the kernel -> userspace copy.

This is an interesting suggestion. It would imply requiring a second
post-processing stage, to combine this file with the rest of the crash
report in proper (compressed, base64-encoded) format; but that
post-processing is certainly something that could be done outside of the
kernel handler, letting the crashed process exit sooner.

IMHO this is worth investigating, but should be done in parallel with the
munmap() handling. Writing large uncompressed crash files to disk, even
with sendfile, is going to be unpleasant, and we still want to minimize the
amount of irrelevant information they contain.

Martin Pitt (pitti) wrote :

Steve Langasek [2015-10-22 18:11 -0000]:
> > * If we really want to dump, we could investigate into sendfile. I'm not
> > entirely sure that it works with data coming in via stdin, but it's
> > worth a try as we would avoid the kernel -> userspace copy.
>
> This is an interesting suggestion. It would imply requiring a second
> post-processing stage, to combine this file with the rest of the crash
> report in proper (compressed, base64-encoded) format;

How would that help? In any case you first need to put the core dump
from stdin onto the disk, and that's precisely the bit that we are
doing now (and not much more) and which takes so long. We also
compress it along the way of course, mostly because we don't have much
choice. Uncompressed core dumps are huuuge, and writing them to things
like slow and small MMC cards/phone memory wouldn't help in the
slightest.

Steve Langasek (vorlon) wrote :

On Fri, Oct 23, 2015 at 01:07:16PM -0000, Martin Pitt wrote:
> Steve Langasek [2015-10-22 18:11 -0000]:
> > > * If we really want to dump, we could investigate into sendfile. I'm not
> > > entirely sure that it works with data coming in via stdin, but it's
> > > worth a try as we would avoid the kernel -> userspace copy.

> > This is an interesting suggestion. It would imply requiring a second
> > post-processing stage, to combine this file with the rest of the crash
> > report in proper (compressed, base64-encoded) format;

> How would that help? In any case you first need to put the core dump
> from stdin onto the disk, and that's precisely the bit that we are
> doing now (and not much more) and which takes so long.

The difference is precisely in the putting of the core dump onto the disk.
Today we write it out from apport; that means we copy the data out from the
kernel into userspace, gzip it, base64 encode it, and then hand it back to
the kernel (by writing it to the file). With sendfile(), we would be
directly moving the core to the file of our choosing, *in kernel space*,
without any userspace processing. It's not even necessary that we flush the
disk buffers, so the slow disk writes don't necessarily happen before apport
returns.

This doesn't guarantee that sendfile will actually be faster. But I think
we need to measure this and find out, rather than assuming one way or the
other.
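
A measurement along those lines could start from a sketch like this. It is an assumption-laden illustration rather than a proposed patch: `copy_core` is a hypothetical helper, and whether the kernel accepts a pipe as the input of sendfile() is exactly the open question in this thread, so the code tries it and falls back to an ordinary read/write loop if the call is refused.

```python
import os
import sys

BLOCK_SIZE = 1024 * 1024


def copy_core(in_fd, out_fd, block_size=BLOCK_SIZE):
    """Copy in_fd to out_fd until EOF.

    Returns (bytes_copied, method), where method names the mechanism
    that finished the copy: "sendfile" or "read/write".
    """
    copied = 0
    if sys.platform.startswith("linux"):
        try:
            while True:
                # offset=None: read from the current position of in_fd.
                sent = os.sendfile(out_fd, in_fd, None, block_size)
                if sent == 0:
                    return copied, "sendfile"
                copied += sent
        except OSError:
            # The kernel refused this fd combination; fall back below.
            pass
    while True:
        block = os.read(in_fd, block_size)
        if not block:
            return copied, "read/write"
        view = memoryview(block)
        while view:
            written = os.write(out_fd, view)
            view = view[written:]
        copied += len(block)
```

Wrapping the call in `time.perf_counter()` on the target device, and comparing against the compress-while-reading path, would give the numbers asked for above. Note that this writes the core uncompressed, so the compression and base64 encoding would still have to happen in a later post-processing step.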

Changed in canonical-devices-system-image:
milestone: ww46-2015 → backlog

Michał Sawicz (saviq) on 2017-03-13
affects: qtubuntu → qtubuntu (Ubuntu)

Steve Langasek (vorlon) on 2017-06-12
Changed in canonical-devices-system-image:
status: Confirmed → Invalid