Too many crash files kill the device

Bug #1473562 reported by Jean-Baptiste Lallement
This bug affects 3 people
Affects                  Status        Importance  Assigned to     Milestone
Canonical System Image   Fix Released  Critical    Steve Langasek
apport (Ubuntu)          Fix Released  High        Brian Murray
  Vivid                  Fix Released  Undecided   Brian Murray

Bug Description

Tested on krillin.

TEST CASE:
1. adb shell to the phone and create a crash file
$ sh -c 'kill -SEGV $$'
2. Now create dozens
$ for n in $(seq 50); do ln /var/crash/_bin_dash.32011.crash /var/crash/_bin_dash_${n}.32011.crash; done
3. Remove any "upload" and "uploaded" files that have been created and reboot
$ sudo rm /var/crash/*upload* && sudo reboot

ACTUAL RESULT
Many whoopsie-upload-all and apport processes are created on boot. They consume all the resources of the system and leave the phone unbootable or only partially functional. The OOM killer kills random system tasks such as upstart. Depending on which processes are killed, the phone hangs on boot, reboots, or the dash doesn't come up.

The number of crashes in this test is a bit excessive, but we can imagine a scenario where a dozen crash files are not uploaded because the phone is on cellular data, and the phone then uploads everything when it connects to wifi, disabling the user session.

A way to recover is to go into recovery and clean /var/crash.

EXPECTED RESULT
Crash uploads are serialized: only one crash file is uploaded at a time.
If system resources are already low, the crash file is not uploaded.
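
The expected behaviour above could be sketched roughly as follows. This is an illustration only, not apport's or whoopsie's code: `upload_one`, `CRASH_DIR`, `MIN_FREE_KB`, `MEMINFO` and `LOCK_DIR` are hypothetical names, the 50 MB threshold is an arbitrary assumption, and a `mkdir`-based lock stands in for whatever locking a real fix would use.

```sh
# Sketch only: every name below is hypothetical, none comes from apport or whoopsie.
CRASH_DIR="${CRASH_DIR:-/var/crash}"
MIN_FREE_KB="${MIN_FREE_KB:-51200}"            # assumed threshold (50 MB)
MEMINFO="${MEMINFO:-/proc/meminfo}"
LOCK_DIR="${LOCK_DIR:-/tmp/crash-upload.lock}"

# Print MemAvailable in kB, falling back to MemFree on kernels without it.
available_kb() {
    awk '/^MemAvailable:/ {print $2; found=1; exit}
         /^MemFree:/      {free=$2}
         END              {if (!found) print free+0}' "$MEMINFO"
}

# Placeholder for the real upload; it only reports the file name.
upload_one() {
    echo "would upload $1"
}

# Upload pending crash files one at a time; defer the rest if memory is low.
upload_serially() {
    # mkdir is atomic, so only one caller can hold the lock at a time.
    if ! mkdir "$LOCK_DIR" 2>/dev/null; then
        echo "another uploader is running, exiting"
        return 0
    fi
    for crash in "$CRASH_DIR"/*.crash; do
        [ -e "$crash" ] || continue            # the glob matched nothing
        if [ "$(available_kb)" -lt "$MIN_FREE_KB" ]; then
            echo "low memory, deferring remaining uploads"
            break
        fi
        upload_one "$crash"
    done
    rmdir "$LOCK_DIR"
}
```

Calling `upload_serially` from a boot job would then walk `/var/crash` one file at a time. A stale lock directory left behind by a killed process would need separate handling.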

description: updated
Revision history for this message
John McAleely (john.mcaleely) wrote :

This is implicated in cases of 'OTA bricked my phone', as the reboot that follows the update triggers the process documented above.

Changed in canonical-devices-system-image:
assignee: nobody → Pat McGowan (pat-mcgowan)
importance: Undecided → Critical
milestone: none → ww34-2015
status: New → Confirmed
Changed in apport (Ubuntu):
assignee: nobody → Steve Langasek (vorlon)
Revision history for this message
Steve Langasek (vorlon) wrote :

The existing upstart job for whoopsie-upload-all is designed to ensure that only one whoopsie-upload-all process takes care of the upload processing on boot. However, it's true that if there are multiple crash files in the directory, separate whoopsie-upload-all processes will be /spawned/ for each crash file, and they will be spawned in parallel. So while the current behavior is responsible with its use of CPU time, it's less than ideal with memory usage; each instance of the whoopsie-upload-all python script uses about 50MB (on amd64), and while some of that is shared libraries, the stack usage will add up.

I suggest that a simple fix for this would be to change the apport-noui upstart job to wrap the calls to whoopsie-upload-all with /lib/udev/watershed. This would limit us to one running whoopsie-upload-all process at a time; there would be multiple watershed processes, but those processes take up much less memory (roughly 300kb each).

Brian, can you look at adding watershed to this job to see if that addresses the problems for the phone?
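
As a rough illustration of the memory figures quoted in this comment, the arithmetic can be sketched as below. It assumes about 50 MB per whoopsie-upload-all instance (the amd64 figure, which may differ on the phone) and about 300 kB per waiting watershed process, and it ignores shared library pages, so it overstates the parallel case. The function and variable names are made up for the example.

```sh
# Rough estimate only; the constants are the approximate figures quoted above.
PY_MB=50          # approx. memory per whoopsie-upload-all instance
WATERSHED_KB=300  # approx. memory per waiting watershed process

# Without serialization: one Python process per crash file.
parallel_mb() {
    echo $(( $1 * PY_MB ))
}

# With watershed: one Python process plus one small watershed process per crash file.
serialized_mb() {
    echo $(( PY_MB + ($1 * WATERSHED_KB + 1023) / 1024 ))
}

for n in 12 50; do
    echo "$n crash files: parallel ~$(parallel_mb "$n") MB, serialized ~$(serialized_mb "$n") MB"
done
```

For the 50 crash files of the test case this gives roughly 2500 MB in parallel against roughly 65 MB when serialized.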

Changed in apport (Ubuntu):
assignee: Steve Langasek (vorlon) → Brian Murray (brian-murray)
importance: Undecided → High
status: New → Triaged
Changed in canonical-devices-system-image:
assignee: Pat McGowan (pat-mcgowan) → Steve Langasek (vorlon)
Revision history for this message
Brian Murray (brian-murray) wrote :

To be able to recreate this I had to make sure that wifi was enabled and that a network was configured and available before rebooting.

Revision history for this message
Brian Murray (brian-murray) wrote :

It might also be worth noting that the duplicate crashes in the test case won't be reported since they are all the same.

[11:33:07] Parsing /var/crash/_bin_dash_46.32011.crash.
[11:33:07] Uploading /var/crash/_bin_dash_46.32011.crash.
[11:33:08] Sent; server replied with: No error
[11:33:08] Response code: 400
[11:33:08] Server replied with:
[11:33:08] Crash already reported.
[11:33:08] Parsing /var/crash/_bin_dash_47.32011.crash.
[11:33:08] Uploading /var/crash/_bin_dash_47.32011.crash.
[11:33:09] Sent; server replied with: No error
[11:33:09] Response code: 400
[11:33:09] Server replied with:
[11:33:09] Crash already reported.

Revision history for this message
Jean-Baptiste Lallement (jibel) wrote :

You can refer to bug 1473449, comment #9 for a real world scenario, with real crash files, in which case crash reports are all different.

Revision history for this message
Brian Murray (brian-murray) wrote :

Modifying the upstart job per Steve's suggestion and installing watershed on the device resolved the issue for me when I used the test case provided.

Changed in apport (Ubuntu):
status: Triaged → In Progress
Revision history for this message
Brian Murray (brian-murray) wrote :

I've uploaded this to the vivid proposed queue for review by the SRU team now.

Changed in apport (Ubuntu):
status: In Progress → Fix Released
Changed in apport (Ubuntu Vivid):
assignee: nobody → Brian Murray (brian-murray)
status: New → In Progress
Revision history for this message
Adam Conrad (adconrad) wrote : Please test proposed package

Hello Jean-Baptiste, or anyone else affected,

Accepted apport into vivid-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/apport/2.17.2-0ubuntu1.2 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us in getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-needed to verification-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

Changed in apport (Ubuntu Vivid):
status: In Progress → Fix Committed
tags: added: verification-needed
Changed in canonical-devices-system-image:
status: Confirmed → Fix Committed
Revision history for this message
Brian Murray (brian-murray) wrote :

Jean-Baptiste - Do you have any plans to test the SRU for this or shall I do it?

Revision history for this message
Dave Morley (davmor2) wrote :

I can confirm that this fixes the issue. All the crash files created had matching upload and uploaded files, and the phone booted as expected. I repeated the process 3 times, and each time it booted as expected \o/
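
A check along these lines could confirm the result described above. It is a minimal sketch that assumes each `NAME.crash` file gets `NAME.upload` and `NAME.uploaded` markers next to it, as the "upload" and "uploaded" files in the test case suggest. `CRASH_DIR` and `missing_markers` are illustrative names.

```sh
# Sketch: list crash files that lack a matching .upload or .uploaded marker.
# The marker naming is an assumption based on the files named in the test case.
CRASH_DIR="${CRASH_DIR:-/var/crash}"

missing_markers() {
    for crash in "$CRASH_DIR"/*.crash; do
        [ -e "$crash" ] || continue            # the glob matched nothing
        base="${crash%.crash}"
        [ -e "$base.upload" ]   || echo "$crash: no .upload marker"
        [ -e "$base.uploaded" ] || echo "$crash: no .uploaded marker"
    done
}
```

Running `missing_markers` after the reboot and getting no output would mean every crash file was picked up and processed.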

tags: added: verification-done
removed: verification-needed
Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package apport - 2.17.2-0ubuntu1.2

---------------
apport (2.17.2-0ubuntu1.2) vivid-proposed; urgency=medium

  * apport-noui.upstart: Utilize watershed to only launch one instance of
    whoopsie-upload-all at a time. (LP: #1473562)
  * apport-noui: Depend on watershed.

 -- Brian Murray <email address hidden> Fri, 24 Jul 2015 15:27:31 -0700

Changed in apport (Ubuntu Vivid):
status: Fix Committed → Fix Released
Revision history for this message
Brian Murray (brian-murray) wrote : Update Released

The verification of the Stable Release Update for apport has completed successfully and the package has now been released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

Changed in canonical-devices-system-image:
status: Fix Committed → Fix Released