A long startup fsck causes cyclic watchdog reboot

Bug #1093870 reported by Shaun Crampton
This bug affects 2 people
Affects: watchdog (Ubuntu)
Status: Confirmed
Importance: Undecided
Assigned to: Unassigned
Milestone: (none listed)

Bug Description

I'm using the watchdog timer on my FitPC 2 to reboot in case of a hard lockup. It worked well for several months, then the routine "fsck every N days" check was triggered after a reboot. The fsck took longer than the watchdog period (250 seconds, which is just about the maximum) and my machine rebooted mid-fsck. Of course, it then tried to fsck again on the next boot, causing a cyclic reboot.

Disabling the watchdog in the BIOS or cancelling the fsck allowed the machine to boot normally.

I suspect the watchdog daemon needs to be started earlier in the boot process, before any tasks that could potentially take a long time, such as an fsck.

ProblemType: Bug
DistroRelease: Ubuntu 12.04
Package: watchdog 5.11-1
ProcVersionSignature: Ubuntu 3.2.0-35.55-generic-pae 3.2.34
Uname: Linux 3.2.0-35-generic-pae i686
ApportVersion: 2.0.1-0ubuntu15.1
Architecture: i386
Date: Wed Dec 26 19:37:15 2012
InstallationMedia: Ubuntu-Server 10.10 "Maverick Meerkat" - Release i386 (20101007)
MarkForUpload: True
ProcEnviron:
 TERM=xterm
 PATH=(custom, no user)
 LANG=en_GB.UTF-8
 SHELL=/bin/bash
SourcePackage: watchdog
UpgradeStatus: Upgraded to precise on 2012-05-18 (222 days ago)
mtime.conffile..etc.watchdog.conf: 2012-05-24T03:51:16.033331

Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in watchdog (Ubuntu):
status: New → Confirmed
Paul Crawford (psc-sat) wrote :

Usually the wd_keepalive daemon would be started early for this reason, to keep the hardware from timing out. Once the machine is up it would hand over to the watchdog daemon that runs system tests.
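
As an illustration of what "keeping the hardware from timing out" involves, here is a minimal Python sketch of a keepalive loop. It is not how wd_keepalive itself is implemented. The function name `pet_watchdog` and its parameters are hypothetical, and it assumes a Linux watchdog driver exposed as a device file, where any write resets the countdown and writing the character `V` just before closing (the "magic close") asks drivers that support it to stop the timer.

```python
import time


def pet_watchdog(device_path, interval_s, beats, sleep=time.sleep):
    """Write `beats` keepalives to the watchdog device, then disarm it.

    Hypothetical helper for illustration; not part of the watchdog package.
    """
    # Unbuffered binary mode so each write reaches the driver immediately.
    with open(device_path, "wb", buffering=0) as dev:
        for _ in range(beats):
            dev.write(b"\0")   # any write resets the hardware countdown
            sleep(interval_s)  # must stay well below the hardware timeout
        # "Magic close": drivers that support it stop the timer on 'V'.
        dev.write(b"V")
```

On a real system this would be called with something like `pet_watchdog("/dev/watchdog", 10, 6)` as root, with an interval comfortably shorter than the 250 second period mentioned in the report. Opening the device arms the timer, so it should only be tried on a machine that can safely reboot.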

I am guessing that the at-boot fsck is run before anything else (daemons, etc.) is loaded, so that the file systems are free of open handles that might cause problems for repair actions. I'm not sure how easy it would be to get round that problem.

Paul Crawford (psc-sat) wrote :

I think this problem is down to the IPMI style of watchdog being configured when the machine boots, but the Linux system not starting the refresh action until much later. For example, if I run a test on my home PC (Ubuntu 12.04 64-bit desktop) after installing the watchdog package and using "touch /forcefsck" to simulate this, my syslog has this relevant part:

Jan 19 21:30:28 paul-ubuntu kernel: [ 9.492544] [drm] Initialized radeon 2.36.0 20080528 for 0000:01:05.0 on minor 0
Jan 19 21:30:28 paul-ubuntu kernel: [ 9.617760] HDMI ATI/AMD: no speaker allocation for ELD
Jan 19 21:30:28 paul-ubuntu kernel: [ 9.917460] HDMI ATI/AMD: no speaker allocation for ELD
Jan 19 21:30:28 paul-ubuntu kernel: [ 10.217249] HDMI ATI/AMD: no speaker allocation for ELD
Jan 19 21:30:28 paul-ubuntu kernel: [ 10.516992] HDMI ATI/AMD: no speaker allocation for ELD
Jan 19 21:30:28 paul-ubuntu kernel: [ 10.816578] HDMI ATI/AMD: no speaker allocation for ELD
Jan 19 21:30:28 paul-ubuntu kernel: [ 11.116403] HDMI ATI/AMD: no speaker allocation for ELD
Jan 19 21:30:28 paul-ubuntu kernel: [ 11.416101] HDMI ATI/AMD: no speaker allocation for ELD
Jan 19 21:30:28 paul-ubuntu kernel: [ 11.715686] HDMI ATI/AMD: no speaker allocation for ELD
Jan 19 21:30:28 paul-ubuntu kernel: [ 91.544151] EXT4-fs (md1): re-mounted. Opts: errors=remount-ro
Jan 19 21:30:28 paul-ubuntu kernel: [ 91.896020] EXT4-fs (md2): mounted filesystem with ordered data mode. Opts: (null)
Jan 19 21:30:28 paul-ubuntu kernel: [ 92.431631] EXT4-fs (md0): mounted filesystem with ordered data mode. Opts: (null)
Jan 19 21:30:28 paul-ubuntu kernel: [ 96.927417] EXT4-fs (md3): mounted filesystem with ordered data mode. Opts: (null)
Jan 19 21:30:28 paul-ubuntu kernel: [ 97.037128] RPC: Registered named UNIX socket transport module.
Jan 19 21:30:28 paul-ubuntu kernel: [ 97.037132] RPC: Registered udp transport module.
Jan 19 21:30:28 paul-ubuntu kernel: [ 97.037133] RPC: Registered tcp transport module.
Jan 19 21:30:28 paul-ubuntu kernel: [ 97.037136] RPC: Registered tcp NFSv4.1 backchannel transport module.
Jan 19 21:30:28 paul-ubuntu kernel: [ 97.084401] FS-Cache: Loaded
Jan 19 21:30:28 paul-ubuntu kernel: [ 97.274125] FS-Cache: Netfs 'nfs' registered for caching
Jan 19 21:30:28 paul-ubuntu kernel: [ 97.316611] init: failsafe main process (1320) killed by TERM signal
Jan 19 21:30:28 paul-ubuntu kernel: [ 97.382416] audit_printk_skb: 12 callbacks suppressed
Jan 19 21:30:28 paul-ubuntu kernel: [ 97.382419] type=1400 audit(1453239028.288:16): apparmor="STATUS" operation="profile_load" profile="unconfined" name="/usr/sbin/rsyslogd" pid=1352 comm="apparmor_parser"
Jan 19 21:30:28 paul-ubuntu kernel: [ 97.442406] Installing knfsd (copyright (C) 1996 <email address hidden>).
Jan 19 21:30:28 paul-ubuntu kernel: [ 97.945825] type=1400 audit(1453239028.852:17): apparmor="STATUS" operation="profile_replace" profile="unconfined" name="/sbin/dhclient" pid=1415 comm="apparmor_parser"
Jan 19 21:30:28 paul-ubuntu kernel: [ 97.945833] type=1400 audit(1453239028.852:18): apparmor="STATUS" operation="profile_replace" profile="unconfined" name="/usr/lib/NetworkManager/nm-dhcp-client.action" pid=1415 c...
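
To make the delay in an excerpt like this easier to spot, one could scan the kernel uptime stamps and report the largest jump between consecutive lines. The following is a rough Python sketch under that assumption. The function name `largest_gap` and the regular expression are hypothetical, and the two sample lines are copied from the log above.

```python
import re

# Matches the kernel uptime stamp, e.g. "kernel: [ 91.544151]".
STAMP = re.compile(r"kernel: \[\s*(\d+\.\d+)\]")


def largest_gap(lines):
    """Return (gap_seconds, before, after) for the biggest jump in uptime."""
    stamps = [float(m.group(1)) for m in map(STAMP.search, lines) if m]
    if len(stamps) < 2:
        return None  # not enough timestamps to form a gap
    pairs = zip(stamps, stamps[1:])
    before, after = max(pairs, key=lambda p: p[1] - p[0])
    return after - before, before, after


sample = [
    "Jan 19 21:30:28 paul-ubuntu kernel: [ 11.715686] HDMI ATI/AMD: no speaker allocation for ELD",
    "Jan 19 21:30:28 paul-ubuntu kernel: [ 91.544151] EXT4-fs (md1): re-mounted. Opts: errors=remount-ro",
]
gap, before, after = largest_gap(sample)
print(f"{gap:.1f} s gap between {before} and {after}")
```

For the two lines shown, the jump is roughly 80 seconds of uptime with nothing logged, presumably the window in which the forced fsck ran before the file systems were mounted.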
