boot occasionally hangs while "Checking battery state..."

Bug #1061149 reported by Barry Warsaw on 2012-10-03
This bug affects 10 people
Affects                    Status        Importance  Assigned to
landscape-client (Ubuntu)  Confirmed     Undecided   Unassigned
lightdm (Ubuntu)           Fix Released  Undecided   Unassigned
linux (Ubuntu)             Invalid       Medium      Unassigned

Bug Description

This is a bit of an odd environment, but I've recently noticed that boot will occasionally hang just after "Checking battery state...". I don't know for sure it's /etc/acpi/power.sh that's hanging, but that seems like the most likely culprit.

This is an up-to-date 12.10 guest running in VMware Fusion 5.0.1 (825449) on OS X 10.6.8. I've disabled the splash screen so I can watch the boot process. ctrl-C (and other keystrokes) seems to be echoed back to the console, but has no effect.

When I send ctrl-alt-delete (via the VMware "Virtual Machine" menu item), I see "acpid exiting" and the machine shuts down and reboots. It usually successfully reboots after this, but not always.

I suppose not surprisingly, this seems related to power status pass-through. If you go to VMware's settings->Other->Advanced and turn off "Pass power status to VM" then it seems like boot never hangs. Note though that I have had this setting enabled for ages, and enabling doesn't guarantee that the boot will hang. But it never seems to hang with the setting disabled.
---
ApportVersion: 2.6.1-0ubuntu3
Architecture: amd64
DistroRelease: Ubuntu 12.10
InstallationMedia: Ubuntu 12.10 "Quantal Quetzal" - Alpha amd64 (20120709.1)
Package: pm-utils
PackageArchitecture: amd64
ProcVersionSignature: Ubuntu 3.5.0-17.28-generic 3.5.5
Tags: quantal running-unity
Uname: Linux 3.5.0-17-generic x86_64
UpgradeStatus: No upgrade log present (probably fresh install)
UserGroups: adm cdrom dip lpadmin plugdev sambashare sudo
---
ApportVersion: 2.6.2-0ubuntu4
Architecture: amd64
AudioDevicesInUse:
 USER PID ACCESS COMMAND
 /dev/snd/controlC0: barry 3957 F.... pulseaudio
CRDA: Error: command ['iw', 'reg', 'get'] failed with exit code 1: nl80211 not found.
DistroRelease: Ubuntu 13.04
HibernationDevice: RESUME=UUID=d0d87984-8662-4d52-a2e5-fbb9dd3493e7
InstallationDate: Installed on 2012-07-10 (132 days ago)
InstallationMedia: Ubuntu 12.10 "Quantal Quetzal" - Alpha amd64 (20120709.1)
IwConfig:
 eth0 no wireless extensions.

 lo no wireless extensions.
Lsusb:
 Bus 002 Device 002: ID 0e0f:0003 VMware, Inc. Virtual Mouse
 Bus 002 Device 003: ID 0e0f:0002 VMware, Inc. Virtual USB Hub
 Bus 001 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
 Bus 002 Device 001: ID 1d6b:0001 Linux Foundation 1.1 root hub
 Bus 002 Device 004: ID 0e0f:0008 VMware, Inc.
MachineType: VMware, Inc. VMware Virtual Platform
MarkForUpload: True
Package: linux (not installed)
ProcFB: 0 svgadrmfb
ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-3.7.0-2-generic root=UUID=175800e0-cc7f-4820-8df3-7ae1df07f80e ro
ProcVersionSignature: Ubuntu 3.7.0-2.8-generic 3.7.0-rc5
RelatedPackageVersions:
 linux-restricted-modules-3.7.0-2-generic N/A
 linux-backports-modules-3.7.0-2-generic N/A
 linux-firmware 1.97
RfKill:
 0: hci0: Bluetooth
  Soft blocked: no
  Hard blocked: no
Tags: raring running-unity
Uname: Linux 3.7.0-2-generic x86_64
UpgradeStatus: No upgrade log present (probably fresh install)
UserGroups: adm cdrom dip lpadmin plugdev sambashare sudo
dmi.bios.date: 07/02/2012
dmi.bios.vendor: Phoenix Technologies LTD
dmi.bios.version: 6.00
dmi.board.name: 440BX Desktop Reference Platform
dmi.board.vendor: Intel Corporation
dmi.board.version: None
dmi.chassis.asset.tag: No Asset Tag
dmi.chassis.type: 1
dmi.chassis.vendor: No Enclosure
dmi.chassis.version: N/A
dmi.modalias: dmi:bvnPhoenixTechnologiesLTD:bvr6.00:bd07/02/2012:svnVMware,Inc.:pnVMwareVirtualPlatform:pvrNone:rvnIntelCorporation:rn440BXDesktopReferencePlatform:rvrNone:cvnNoEnclosure:ct1:cvrN/A:
dmi.product.name: VMware Virtual Platform
dmi.product.version: None
dmi.sys.vendor: VMware, Inc.

Brad Figg (brad-figg) on 2012-10-10
affects: linux-meta (Ubuntu) → linux (Ubuntu)
Joseph Salisbury (jsalisbury) wrote :

Would it be possible for you to test the latest upstream kernel? Refer to https://wiki.ubuntu.com/KernelMainlineBuilds . Please test the latest v3.6 kernel[0] (Not a kernel in the daily directory) and install both the linux-image and linux-image-extra .deb packages.

Once you've tested the upstream kernel, please remove the 'needs-upstream-testing' tag. Please only remove that one tag and leave the other tags. This can be done by clicking on the yellow pencil icon next to the tag located at the bottom of the bug description and deleting the 'needs-upstream-testing' text.

If this bug is fixed in the mainline kernel, please add the following tag 'kernel-fixed-upstream'.

If the mainline kernel does not fix this bug, please add the tag: 'kernel-bug-exists-upstream'.

If you are unable to test the mainline kernel, for example it will not boot, please add the tag: 'kernel-unable-to-test-upstream'.
Once testing of the upstream kernel is complete, please mark this bug as "Confirmed".

Thanks in advance.

[0] http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.6-quantal/

tags: added: kernel-da-key needs-upstream-testing
Changed in linux (Ubuntu):
importance: Undecided → Medium

This bug is missing log files that will aid in diagnosing the problem. From a terminal window please run:

apport-collect 1061149

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
Barry Warsaw (barry) wrote :

Testing with the upstream 3.6 kernel still hangs the boot, but in a different place. It freezes after this line:

* Starting VMware Tools services [fail]

(The failure isn't surprising because I didn't reinstall vmware-tools after the kernel upgrade.)

After sending ctrl-alt-del to the VM, I see this message:

acpid: exiting

then the VM shuts down and restarts. However, on the subsequent restart, I see the freeze after "Checking battery state..." again.

tags: added: kernel-bug-exists-upstream
removed: needs-upstream-testing
tags: added: apport-collected quantal running-unity
description: updated

apport information

Changed in linux (Ubuntu):
status: Incomplete → Confirmed
Joseph Salisbury (jsalisbury) wrote :

Do you happen to know if this issue happened with previous Quantal or Precise kernels?

Barry Warsaw (barry) wrote :

On Oct 11, 2012, at 03:59 PM, Joseph Salisbury wrote:

>Do you happen to know if this issue happened with previous Quantal or
>Precise kernels?

This started happening with a relatively recent Quantal update, though I can't
pinpoint exactly when, or what update introduced it.

Joseph Salisbury (jsalisbury) wrote :

I'd like to perform a bisect to figure out what commit caused this regression. It would be very helpful to know the earliest kernel where the issue started happening as well as the latest kernel that did not have this issue.

Can you test the following kernels and report back? We are looking for the first kernel version that doesn't have this bug:

3.5 final: http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.5-quantal/
3.5.1: http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.5.1-quantal/
3.5.3: http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.5.3-quantal/
3.5.5: http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.5.5-quantal/

You don't have to test every kernel, just up until the kernel that first has this bug. If the 3.5 final kernel has the bug, then we would need to test some of the earlier release candidates.

Thanks in advance!

tags: added: performing-bisect
no longer affects: acpi-support (Ubuntu)
no longer affects: pm-utils (Ubuntu)
Barry Warsaw (barry) wrote :

I found this much earlier bug report which exhibits the same behavior: LP: #289513

Barry Warsaw (barry) wrote :

I've upgraded the VM to Raring and I'm still seeing the issue. What suggestions do you have for debugging it now? It's currently running 3.7.0-0-generic x86_64.

Barry Warsaw (barry) wrote :

Running `initctl list | grep wait | sort` on the affected machine and comparing against a working (albeit running Quantal) non-VM, I see a few interesting things.

present on the hanging machine:

friendly-recovery stop/waiting
kmod stop/waiting
lightdm stop/waiting
vmware-tools stop/waiting

missing on the hanging machine:

hybrid-gfx stop/waiting
screen-cleanup stop/waiting
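
The comparison above can also be done mechanically. What follows is a minimal Python sketch, not something anyone in this thread ran. It assumes the `initctl list` output from each machine has been saved as text, and the helper names `waiting_jobs` and `compare_jobs` are hypothetical. The sample data is copied from the two lists above.

```python
def waiting_jobs(initctl_output):
    """Return the names of jobs reported as 'stop/waiting'.

    Only plain "<job> stop/waiting" lines are considered; running jobs
    and instance jobs such as "foo (bar) ..." are skipped.
    """
    jobs = set()
    for line in initctl_output.splitlines():
        fields = line.split()
        if len(fields) == 2 and fields[1] == "stop/waiting":
            jobs.add(fields[0])
    return jobs


def compare_jobs(hanging_output, working_output):
    """Return (only on hanging machine, only on working machine), both sorted."""
    hanging = waiting_jobs(hanging_output)
    working = waiting_jobs(working_output)
    return sorted(hanging - working), sorted(working - hanging)


# Sample data taken from the comment above.
HANGING_SAMPLE = """\
friendly-recovery stop/waiting
kmod stop/waiting
lightdm stop/waiting
vmware-tools stop/waiting
"""
WORKING_SAMPLE = """\
hybrid-gfx stop/waiting
screen-cleanup stop/waiting
"""

only_hanging, only_working = compare_jobs(HANGING_SAMPLE, WORKING_SAMPLE)
print("only on the hanging machine:", only_hanging)
print("only on the working machine:", only_working)
```

In practice each argument would be the full saved output from one machine, so jobs common to both drop out of the result.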

Joseph Salisbury (jsalisbury) wrote :

Were you able to test some of the kernels listed in comment #8? I can perform a kernel bisect if we can identify the last good kernel and the first bad kernel.

Changed in linux (Ubuntu):
status: Confirmed → Incomplete
Barry Warsaw (barry) wrote :

It turns out I still have some older kernels on my machine. I tried 3.5.0-7 and it hangs in the same place. Let me know if you still want me to try the 3.5 final quantal kernel referenced above (which I think is the oldest one on your list).

Changed in linux (Ubuntu):
status: Incomplete → New

This bug is missing log files that will aid in diagnosing the problem. From a terminal window please run:

apport-collect 1061149

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete

apport information

tags: added: raring
description: updated

Changed in linux (Ubuntu):
status: Incomplete → Confirmed

Barry Warsaw (barry) wrote :

apport-collect done

Joseph Salisbury (jsalisbury) wrote :

Can you also test some of the earlier 3.5 release candidates:

http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.5-rc3-quantal/
http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.5-rc5-quantal/
http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.5-rc7-quantal/

No need to test rc5 or rc7 if rc3 has the bug. We just want to figure out the earliest kernel that does not have the bug.

Barry Warsaw (barry) wrote :

Hi Joseph. Sadly, the v3.5-rc3-quantal kernel exhibits the same hanging behavior. Note that it doesn't hang every time, but probably 4 out of 5 reboots. Sure smells like a race condition.

Joseph Salisbury (jsalisbury) wrote :

Did you run Precise on this machine in the past? If so, did Precise also exhibit this bug? If possible, it would be good to know if the bug existed in the prior kernels:

v3.5-rc1: http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.5-rc1-quantal/
v3.4 final: http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.4-quantal/
v3.3 final: http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.3-precise/

Barry Warsaw (barry) wrote :

I've essentially been upgrading this machine all along since at least Precise, and I don't remember it hanging before mid-Quantal. The v3.3 kernel does exhibit the hangs.

I've subscribed James Hunt, just to get his feedback, because at this point, I'm really not sure it *is* a problem with the kernel. It smells more like a boot race condition, possibly pointing to a problem in upstart. In any case, the v3.3 Precise kernel hung, and as I mentioned I don't remember this hanging until Quantal.

One thing I will try to do in the next few days, is grab a Precise iso and do a fresh install to see if that exhibits the problem.

Barry Warsaw (barry) wrote :

I spent the good part of a day doing more testing.

I created a fresh 12.04 vm from the precise-desktop-amd64 ISO, dist-upgraded to the latest 12.04.1 (3.2.0-33-generic) kernel. I rebooted it 10 times in a row without hanging.

Then I took a disk snapshot and upgraded the VM to 12.10 (3.5.0-18-generic), and now it starts hanging intermittently again (say 4 hangs out of 10 reboots). What's interesting is that in the Quantal hangs, I always see something like

mountall: Disconnected from Plymouth
mountall: Plymouth command failed
...and then...
ctrl-alt-del always (as with raring previously) shows the acpid: exiting message

Fusion has an advanced option to "Pass power status to VM". In all the reboot tests in this comment, I made sure this option was always turned off. Now I turn the option on and see the hang 6 of 10 times, with the same Plymouth and acpid messages. Unscientifically, it seems like setting this option makes the hang more likely, but turning it off doesn't eliminate the hanging either. This is roughly what I observe in Raring too.

I guess that means we have a good lower bound for the bisect? ;)

Joseph Salisbury (jsalisbury) wrote :

Can you run some additional tests?

1. Test the Quantal kernel[0] on 12.04 vm. It can be a fresh 12.04 install or dist-upgraded to 12.04.1.
2. Test the Precise kernel[1] on the 12.10 (3.5.0-18-generic) vm.

This will tell us if the bug follows the Quantal kernel to Precise and if using the Precise kernel on Quantal makes the bug go away.

Also, Precise never actually used the v3.3 kernel, so this bug may have been introduced in one of the v3.3 release candidates. Can you also test the v3.3-rc4[2] kernel to see if it exhibits this bug? I picked v3.3-rc4 since it's in the middle of v3.3-rc1 and v3.3 final.

[0] https://launchpad.net/ubuntu/+source/linux/3.5.0-18.29
[1] https://launchpad.net/ubuntu/+source/linux/3.2.0-33.52
[2] http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.3-rc4-precise/

Barry Warsaw (barry) wrote :

On Nov 27, 2012, at 10:52 PM, Joseph Salisbury wrote:

>1. Test the Quantal kernel[0] on 12.04 vm. It can be a fresh 12.04 install
>or dist-upgraded to 12.04.1.

This combination, specifically dist-upgraded to 12.04.1 + the 3.5.0-18.29
kernel from Quantal boots every single time. 10-for-10 with no freezes.

Barry Warsaw (barry) wrote :

On Nov 27, 2012, at 10:52 PM, Joseph Salisbury wrote:

>2. Test the Precise kernel[1] on the 12.10 (3.5.0-18-generic) vm.

dist-upgraded quantal 12.10 VM + 3.2.0-33.52 kernel freezes 5-of-10 times.

Barry Warsaw (barry) wrote :

On Nov 27, 2012, at 10:52 PM, Joseph Salisbury wrote:

>Also, Precise never actually used the v3.3 kernel, so this bug may have
>been introduced in one of the v3.3 release candidates. Can you also
>test the v3.3-rc4[2] kernel to see if it exhibits this bug? I picked
>v3.3-rc4 since it's in the middle of v3.3-rc1 and v3.3 final.

I tested 3.3-rc4 on the Quantal VM and this froze 4-of-10 times.

I'm still not entirely convinced this is a kernel problem, since I've reliably
seen the freezes on Quantal with every suggested kernel combination, but not
with Precise with those suggested kernel combinations. I'm still open to any
other testing you need.

Joseph Salisbury (jsalisbury) wrote :

I think you are correct in #36, Barry. I think we ruled out the kernel. This seems like a userspace issue since the Precise kernel exhibits this bug on a Quantal install, but the Quantal kernel does not exhibit this bug on a Precise install.

Maybe an issue with X due to what you see in comment #11 ?
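
The reasoning above can be checked by tabulating the reboot counts reported earlier in this thread. This is only an illustrative Python sketch: the figures are the hang counts Barry reported (with "Pass power status to VM" turned off, and the first Quantal figure was given as an estimate), and the helper name `hang_rate_by` is hypothetical.

```python
from collections import defaultdict

# (userspace, kernel, hangs, reboots) as reported in the comments above.
RESULTS = [
    ("precise", "3.2.0-33", 0, 10),
    ("quantal", "3.5.0-18", 4, 10),  # reported as "say 4 hangs out of 10"
    ("precise", "3.5.0-18", 0, 10),
    ("quantal", "3.2.0-33", 5, 10),
    ("quantal", "3.3-rc4", 4, 10),
]


def hang_rate_by(results, column):
    """Pool hangs and reboots by one column (0 = userspace, 1 = kernel)."""
    totals = defaultdict(lambda: [0, 0])
    for row in results:
        totals[row[column]][0] += row[2]
        totals[row[column]][1] += row[3]
    return {key: hangs / reboots for key, (hangs, reboots) in totals.items()}


by_userspace = hang_rate_by(RESULTS, 0)
by_kernel = hang_rate_by(RESULTS, 1)

for name, rate in sorted(by_userspace.items()):
    print(f"userspace {name}: {rate:.2f}")
for name, rate in sorted(by_kernel.items()):
    print(f"kernel {name}: {rate:.2f}")
```

Pooled this way, every Precise userspace run is hang-free regardless of kernel, while the same 3.5.0-18 kernel hangs only under the Quantal userspace. With ten reboots per combination this is suggestive rather than conclusive.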

Changed in linux (Ubuntu):
status: Confirmed → Incomplete
Barry Warsaw (barry) wrote :

On Dec 04, 2012, at 06:35 PM, Joseph Salisbury wrote:

>I think you are correct in #36, Barry. I think we ruled out the kernel.
>This seems like a userspace issue since the Precise kernel exhibits this
>bug on a Quantal install, but the Quantal kernel does not exhibit this
>bug on a Precise install.
>
>Maybe an issue with X due to what you see in comment #11 ?

I'm not so sure, given that I've tried this with a fresh-precise -> quantal
upgrade path.

I still want to rule out upstart (or possibly a job race condition) so I'll
ping James in the morning to see if he can provide some debugging
suggestions.

Barry Warsaw (barry) wrote :

as requested by slangasek:

% ls -l /etc/rc2.d
total 4
-rw-r--r-- 1 root root 677 Jul 20 22:42 README
lrwxrwxrwx 1 root root 22 Jul 10 09:51 S19spamassassin -> ../init.d/spamassassin*
lrwxrwxrwx 1 root root 26 Jul 10 09:50 S20clamav-freshclam -> ../init.d/clamav-freshclam*
lrwxrwxrwx 1 root root 20 Jul 10 09:26 S20kerneloops -> ../init.d/kerneloops*
lrwxrwxrwx 1 root root 17 Oct 9 21:45 S20postfix -> ../init.d/postfix*
lrwxrwxrwx 1 root root 17 Aug 13 13:47 S20schroot -> ../init.d/schroot*
lrwxrwxrwx 1 root root 27 Jul 10 09:26 S20speech-dispatcher -> ../init.d/speech-dispatcher*
lrwxrwxrwx 1 root root 13 Aug 7 08:56 S23ntp -> ../init.d/ntp*
lrwxrwxrwx 1 root root 23 Aug 20 17:57 S38open-vm-tools -> ../init.d/open-vm-tools*
lrwxrwxrwx 1 root root 26 Jul 10 09:51 S45landscape-client -> ../init.d/landscape-client*
lrwxrwxrwx 1 root root 15 Jul 10 09:26 S50rsync -> ../init.d/rsync*
lrwxrwxrwx 1 root root 15 Jul 10 09:26 S50saned -> ../init.d/saned*
lrwxrwxrwx 1 root root 19 Jul 10 09:26 S70dns-clean -> ../init.d/dns-clean*
lrwxrwxrwx 1 root root 18 Jul 10 09:26 S70pppd-dns -> ../init.d/pppd-dns*
lrwxrwxrwx 1 root root 14 Nov 17 18:11 S75sudo -> ../init.d/sudo*
lrwxrwxrwx 1 root root 22 Jul 10 09:26 S99acpi-support -> ../init.d/acpi-support*
lrwxrwxrwx 1 root root 21 Jul 10 09:26 S99grub-common -> ../init.d/grub-common*
lrwxrwxrwx 1 root root 18 Jul 10 09:26 S99ondemand -> ../init.d/ondemand*
lrwxrwxrwx 1 root root 18 Jul 10 09:26 S99rc.local -> ../init.d/rc.local*

Barry Warsaw (barry) wrote :

After chatting w/slangasek and jodh on IRC, they feel it could be an X or lightdm bug, possibly related to LP: #969489. I'm going to attach some logs.

Changed in linux (Ubuntu):
status: Incomplete → New
Steve Langasek (vorlon) wrote :

The lightdm log shows this:

[+3.67s] DEBUG: Got signal 15 from process 939
[+3.67s] DEBUG: Caught Terminated signal, shutting down

So, something is killing lightdm with a signal.

From the PID, and from an 'initctl list' on the same system, it's clear that the signal is coming from some process that has started on runlevel 2 (cron, atd, and acpid all have PIDs of 93x).
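
The cross-referencing described above can be sketched in Python. This is an assumption-laden illustration, not the procedure actually used: the two log lines are the ones quoted above, but the `initctl list` sample is hypothetical (the thread only says that cron, atd, and acpid had PIDs of 93x), and the helper names are invented for this example.

```python
import re
import signal

# Matches lines such as "[+3.67s] DEBUG: Got signal 15 from process 939".
SIGNAL_RE = re.compile(
    r"\[\+(?P<secs>\d+\.\d+)s?\]\s+(?:DEBUG:\s+)?"
    r"Got signal (?P<signum>\d+) from process (?P<pid>\d+)"
)
# Matches lines such as "cron start/running, process 935".
JOB_PID_RE = re.compile(r"^(?P<job>\S+) .*\bprocess (?P<pid>\d+)$")


def signal_events(log_text):
    """Return (seconds, signal number, sender PID) for each signal logged."""
    events = []
    for line in log_text.splitlines():
        match = SIGNAL_RE.search(line)
        if match:
            events.append(
                (float(match["secs"]), int(match["signum"]), int(match["pid"]))
            )
    return events


def job_pids(initctl_output):
    """Map PID to job name for every running job in 'initctl list' output."""
    pids = {}
    for line in initctl_output.splitlines():
        match = JOB_PID_RE.match(line.strip())
        if match:
            pids[int(match["pid"])] = match["job"]
    return pids


def jobs_near(pid, pids, window=10):
    """Return jobs whose PID is within `window` of `pid`, ordered by PID."""
    return [job for job_pid, job in sorted(pids.items())
            if abs(job_pid - pid) <= window]


LIGHTDM_LOG = """\
[+3.67s] DEBUG: Got signal 15 from process 939
[+3.67s] DEBUG: Caught Terminated signal, shutting down
"""
# Hypothetical sample: the exact PIDs below are made up for illustration.
INITCTL_SAMPLE = """\
rsyslog start/running, process 412
cron start/running, process 935
atd start/running, process 936
acpid start/running, process 938
lightdm stop/waiting
"""

for secs, signum, pid in signal_events(LIGHTDM_LOG):
    nearby = jobs_near(pid, job_pids(INITCTL_SAMPLE))
    print(f"{signal.Signals(signum).name} from PID {pid} at +{secs}s; "
          f"jobs with nearby PIDs: {nearby}")
```

Nearby PIDs only indicate which jobs started at about the same time as the sender. They do not identify the sending process itself.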

This change was made by a bot.

Changed in linux (Ubuntu):
status: New → Confirmed
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in lightdm (Ubuntu):
status: New → Confirmed
Steve Langasek (vorlon) wrote :

What it comes down to, I think, is that something non-standard is killing lightdm that shouldn't be; /var/log/lightdm.log gives us enough info to know the pid of what's doing the killing, but not the name. Given that at and cron have just started, it *could* be a strange cronjob or at job; it's more likely to be something called from an upstart job; booting with --verbose and cross-referencing that output against the 'initctl list' output and the /var/log/lightdm.log from the same boot *may* help isolate this.

Further things to investigate:

 - is this reproducible with a freshly-installed raring userspace?
 - are there any modified or orphaned config files under /etc/? (sudo apt-get install debsums; sudo debsums -s -e; sudo find /etc -type f -print0 | sudo xargs -0 dpkg -S > /dev/null)
 - can you capture the guilty process on a bootchart? (apt-get install bootchart)

Changed in linux (Ubuntu):
status: Confirmed → Incomplete
Barry Warsaw (barry) wrote :

Quick follow-up: a completely fresh Raring install from a 64-bit ISO downloaded 2 days ago stalls as described on the first reboot after install. After that, 10 straight reboots and no stall. `sudo apt-get dist-upgrade` and reboot 10 times, no stalls.

So for now, let's ignore the first stall after installation, and assume the problem is something weird with my previous 3-version-upgrades VM. I'll keep using this new VM and try the additional suggestions above if the problem crops up again.

Barry Warsaw (barry) wrote :

A process of elimination seems to indicate it's Landscape-related. By going through my new machine setup <https://wiki.ubuntu.com/BarryWarsaw> and doing multiple reboots with disk snapshotting, it seems like things are fine until I install Landscape and register the machine with the Landscape server. What packages does that yield? dpkg.log says:

python-gdbm
landscape-common
landscape-client
landscape-client-ui

Now, clearly I'm using landscape on many other machines without any boot problems, although this VM is the only one with landscape-client-ui.

Even stranger: purging the above packages does not eliminate the stalls. It does reduce them, but they still happen occasionally. But I've verified that taking a fresh install of Raring through to just before installing landscape will not cause a stall. Once landscape is installed, the stalls start to happen.

Changed in linux (Ubuntu):
status: Incomplete → Invalid

Hi everybody.
I'm experiencing the same problem, with the difference that I'm not running a virtual machine. It's a fresh install of Ubuntu 12.10 on an iMac 7.1. It hangs, let's say, 6 or 7 times out of 10, anywhere from the first "Checking battery state" (and my desktop obviously doesn't have a battery).
I've googled around and found a workaround (sort of), but it's by no means a real fix: I hit ctrl-alt-f1, log in, run "sudo su", and finally "pkill X". This restarts X and I get my login screen back.
By the way, I'm EFI dual-booting alongside Mac OS X, in case this information helps.

Thanks in advance for any help

tags: removed: performing-bisect

Hi again everybody.
I've run some more tests and discovered something that I suppose could be a hint. As I've already said, I'm able to reach the lightdm login screen only, let's say, 3 or 4 times out of 10, but every time, i.e. ten out of ten, I cannot see the "ubuntu starting" screen with its typical five flashing dots. This leads me to think that it could be a problem with KMS in EFI mode, because I was able to see this screen every time during the installation. The only difference, then, is probably that the CD media I used to boot the installer started in BIOS mode (but I cannot be sure about that; I don't actually remember how I started the installer).
As a further hint, I can add that I started to see the same strange behavior in Fedora 17, with the difference that in Fedora I could never reach a login screen, while in Ubuntu it (occasionally) works :-).
I hope this information can help.

Thanks again

Hi. Fresh additional information.
Googling around, I stumbled on another bug that resembles this one. Apparently lightdm is (occasionally) "killing itself".
Browsing the lightdm.log file, I found this:
[+03.05] Got signal 15 from process [the PID of lightdm itself!]

So lightdm is apparently killing itself.

Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in landscape-client (Ubuntu):
status: New → Confirmed
Robert Ancell (robert-ancell) wrote :

Closing assumed fixed.

Changed in lightdm (Ubuntu):
status: Confirmed → Fix Released