Watchdog is too aggressive, can lead to unusable device

Bug #1498133 reported by Alan Pope 🍺🐧🐱 🦄
This bug affects 1 person
Affects                    Status        Importance  Assigned to        Milestone
Canonical System Image     Fix Released  Critical    Alejandro J. Cura
upstart-watchdog (Ubuntu)  Fix Released  Critical    dobey

Bug Description

upstart-watchdog reboots the phone if a process is restarting repeatedly.
In a situation where apport is disabled (which is the proposed situation on retail devices), a process which encounters a problem may restart repeatedly, causing watchdog to reboot the phone. If the situation hasn't changed then the next time the phone starts the same issue could happen.

I had this (bug 1498080) where a process - mediascanner - was crashing.
My phone was then stuck in a boot loop with watchdog rebooting due to repeated mediascanner crashes.

We should probably re-think this.

This solution was implemented to address bug #1394350

The original idea was to include a check on the number of reboots and go into recovery
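
The reboot-count check mentioned above could look roughly like the following. This is only a sketch of the idea, not code from upstart-watchdog: the names BootState, on_respawn_limit and on_healthy_boot, the "reboot"/"recovery" action strings, and the default limit are all hypothetical, and persisting the counter across reboots is left out.

```python
from dataclasses import dataclass


@dataclass
class BootState:
    # Consecutive reboots triggered by the watchdog (hypothetical counter;
    # a real implementation would have to persist this across reboots).
    watchdog_reboots: int = 0


def on_respawn_limit(state, max_reboots=1):
    """Decide what to do when a job hits its respawn limit.

    Returns "reboot" while the reboot budget is not used up,
    and "recovery" once it is.
    """
    if state.watchdog_reboots < max_reboots:
        state.watchdog_reboots += 1
        return "reboot"
    return "recovery"


def on_healthy_boot(state):
    """Reset the counter once a boot is considered healthy."""
    state.watchdog_reboots = 0
```

With the default limit of one, the first respawn-limit event asks for a reboot and the next one asks for recovery instead of another reboot, which is what breaks the loop.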

Tags: hotfix

description: updated
Changed in canonical-devices-system-image:
assignee: nobody → Pat McGowan (pat-mcgowan)
importance: Undecided → Critical
milestone: none → ww46-2015
status: New → Confirmed
Changed in canonical-devices-system-image:
milestone: ww46-2015 → ww40-2015
Pat McGowan (pat-mcgowan) wrote :

See also bug #1381075, "upstart should report applications that hit respawn limit to errors.ubuntu.com".

tags: added: hotfix
Pat McGowan (pat-mcgowan) wrote :

In the short term, let's not include this package. Please check with phonedations prior to disabling.

Changed in upstart-watchdog (Ubuntu):
assignee: nobody → Łukasz Zemczak (sil2100)
Łukasz Zemczak (sil2100) wrote :

I have the seed changes prepared for upload. Are we sure we want to remove upstart-watchdog from the images? I suppose it does save us from certain issues, like the device hanging completely, right? Anyway, waiting for the final +1.

Pat McGowan (pat-mcgowan) wrote :

The short-term decision is to disable the session watchdog and leave the system watchdog in place,
then work toward a more suitable solution.

Łukasz Zemczak (sil2100) wrote :

I have a package of upstart-watchdog ready that has the session watchdog disabled, waiting for QA to confirm that this does fix the particular issue at hand.

Jean-Baptiste Lallement (jibel) wrote :

This is a problem on krillin/devel-proposed 213

The phone reboots after a few minutes then enters a boot loop. Logs attached.

Changed in upstart-watchdog (Ubuntu):
importance: Undecided → Critical
Changed in canonical-devices-system-image:
status: Confirmed → Triaged
Changed in upstart-watchdog (Ubuntu):
status: New → Confirmed
Jean-Baptiste Lallement (jibel) wrote :

The log indicates that the system watchdog is causing the bootloop on devel-proposed

Oct 2 07:28:51 ubuntu-phablet watchdog: 'ubuntu-espoo-service' (instance '') hit respawn limit - rebooting

Moreover it is rebooting the system for a service that is important but not vital to critical features of the phone. In this case it means there will be no location, but dialler or data should still be working.

I think we should consider the option of disabling the system watchdog too.

Jean-Baptiste Lallement (jibel) wrote :

On devel-proposed I disabled both watchdogs. The services crash, but the shell starts and, for example, I can make a call and networking is up.

Here is what the watchdog noticed, and why it would have rebooted had it not been disabled:
Oct 2 08:05:47 ubuntu-phablet watchdog: 'ubuntu-espoo-service' (instance '') hit respawn limit but not rebooting
Oct 2 08:05:48 ubuntu-phablet watchdog: 'ubuntu-location-provider-here-posclientd' (instance '') hit respawn limit but not rebooting
Oct 2 08:05:48 ubuntu-phablet watchdog: 'ubuntu-location-provider-here-slpgwd' (instance '') hit respawn limit but not rebooting
Oct 2 08:06:11 ubuntu-phablet watchdog: 'ubuntu-location-provider-here-slpgwd' (instance '') hit respawn limit but not rebooting

A system service crash that would make the device unusable is no worse than rebooting it in a loop.
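
The watchdog lines quoted in this comment share a regular shape, so tallying which jobs hit the respawn limit is easy to script. A minimal sketch: the regular expression is inferred only from the log lines quoted in this report (other message forms are not covered), and the function name respawn_limit_hits is made up for illustration.

```python
import re
from collections import Counter

# Pattern inferred from the syslog lines quoted in this bug report, e.g.
#   ... watchdog: 'job-name' (instance '') hit respawn limit ...
LINE_RE = re.compile(
    r"watchdog: '(?P<job>[^']+)' \(instance '(?P<instance>[^']*)'\) "
    r"hit respawn limit"
)


def respawn_limit_hits(lines):
    """Count, per job name, how many lines report a respawn-limit hit."""
    hits = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if match:
            hits[match.group("job")] += 1
    return hits
```

Feeding it the lines from a syslog file (for example, iterating over an open file object) gives a per-job count and ignores lines that do not match.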

Changed in canonical-devices-system-image:
assignee: Pat McGowan (pat-mcgowan) → Alejandro J. Cura (alecu)
Pat McGowan (pat-mcgowan) wrote :

We are working on a limit of one reboot; then we can choose to keep watching both session and system services.
I will note, although without specifics, that:
- users have reported spontaneous reboots of their phones
- users have reported boot loops that were not explained

Changed in upstart-watchdog (Ubuntu):
assignee: Łukasz Zemczak (sil2100) → Rodney Dawes (dobey)
Changed in upstart-watchdog (Ubuntu):
status: Confirmed → In Progress
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package upstart-watchdog - 0.4

---------------
upstart-watchdog (0.4) wily; urgency=medium

  * Don't trigger reboot if reboot was triggered in last hour.
    (LP: #1498133)

 -- Rodney Dawes <email address hidden> Mon, 05 Oct 2015 15:22:53 -0400
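
The changelog entry describes a cooldown: skip the reboot if one was already triggered within the last hour. A rough sketch of that idea follows. It is not the actual upstart-watchdog 0.4 code; the stamp-file approach, the function names should_reboot and record_reboot, and the injectable `now` argument are assumptions made for illustration.

```python
import os
import tempfile
import time

COOLDOWN_SECONDS = 60 * 60  # "last hour", per the changelog entry


def should_reboot(stamp_path, now=None, cooldown=COOLDOWN_SECONDS):
    """Return True if no reboot was recorded within the cooldown window."""
    now = time.time() if now is None else now
    try:
        with open(stamp_path) as stamp:
            last = float(stamp.read().strip())
    except (OSError, ValueError):
        # Missing or unreadable stamp: treat as "no recent reboot".
        return True
    return (now - last) >= cooldown


def record_reboot(stamp_path, now=None):
    """Record the time of a watchdog-triggered reboot (hypothetical helper)."""
    now = time.time() if now is None else now
    directory = os.path.dirname(stamp_path) or "."
    # Write to a temporary file and rename, so a crash cannot leave a
    # half-written stamp behind.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as tmp:
        tmp.write(repr(now))
    os.replace(tmp_path, stamp_path)
```

For this to break a boot loop, the stamp has to live somewhere that survives a reboot, and the check relies on wall-clock time: if the clock jumps backwards after a reboot, this sketch errs on the side of not rebooting again.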

Changed in upstart-watchdog (Ubuntu):
status: In Progress → Fix Released
Changed in canonical-devices-system-image:
status: Triaged → Fix Committed
Changed in canonical-devices-system-image:
status: Fix Committed → Fix Released