kernel crashes at boot on a dell inspiron 531.

Bug #295091 reported by gbon on 2008-11-07
This bug affects 1 person
Affects         Importance  Assigned to
linux (Ubuntu)  High        Andy Whitcroft
Hardy           High        Andy Whitcroft

Bug Description

Binary package hint: linux-image-2.6.24-21-generic

I've been using ubuntu 8.04 (32 bit) on a dell inspiron 531 for a while.

All was fine with kernels up to 2.6.24-19, but since the upgrade to 2.6.24-21, the machine crashes
at every boot. The problem persists even after several minor upgrades in the 2.6.24-21.x series.

===

SRU Justification

Impact: systems with AMD X2 processors will panic in early boot

Fix Description: prevent creation of sysfs links for un-startable processors

Patch: http://kernel.ubuntu.com/git?p=ubuntu/ubuntu-hardy.git;a=commit;h=f64c4a155a90228fc6415c3dea4e1bf14d82c99b

Risks: this code is run on all systems with ACPI, but the patch is very simple and obvious

TEST CASE: boot the latest Hardy kernel on an AMD X2 system

Leann wrote :

Hi gbon,

Can you boot with the 'quiet' and 'splash' options removed? Are there any messages printed when the machine crashes? Would you be able to take a digital picture of these messages and attach it to this bug report? I'm also curious whether you've tested the more recent Intrepid Ibex 8.10 release. Does this newer release have the same issue? Please let us know. Thanks.

Changed in linux:
status: New → Incomplete
gbon (gbon) wrote :

Hi Leann.

Please find in attachment a sequence of snapshots taken during the boot process. Sorry for the bad quality.

8.10 works perfectly on the inspiron 531. The only missing bit is the LTS flag...

regards.

gbon (gbon) wrote :

This kernel broke an LTS release. I believe it's a serious matter.

Changed in linux:
status: Incomplete → Confirmed
Changed in linux:
assignee: nobody → ubuntu-kernel-team
importance: Undecided → High
status: Confirmed → Triaged
Stefan Bader (smb) on 2008-12-03
Changed in linux:
assignee: ubuntu-kernel-team → stefan-bader-canonical
status: Triaged → In Progress
Stefan Bader (smb) wrote :

Since the traces include i2c and sysfs, I would like to check whether one of the fixes upstream will help. I am currently uploading a -22.45smb1 kernel to my PPA at https://launchpad.net/~stefan-bader-canonical/+archive. If that has completed building, could you try this and check whether this does any good? Thank you.

Stefan Bader (smb) wrote :

The patch used is actually nonsense since it applies to I2O and not I2C. So please just wait for the next update to -proposed for Hardy, which should happen soon (-23.46), and then test with that. Please let me know whether the problem still persists or not and whether the backtrace is the same or different.

On Thu, Dec 04, 2008 at 08:56:18AM -0000, Stefan Bader wrote:
> The patch used is actually nonsense since it applies to I2O and not
> I2C. So if you could just wait for the next update to -proposed for
> Hardy which should happen soon (-23.46) and then test with that. Please
> let me know whether the problem still persists or not and whether the
> backtrace is the same or different.
>

Hi Stefan,
in fact, I've followed Leann's suggestion to try Intrepid Ibex; and
since it worked flawlessly, I upgraded to 8.10.
However, if the upcoming 2.6.24-23.x kernel can be installed on Ibex
(dependency-wise), I will be glad to try it out, in a week or two.
Best regards,
-- g.b.

Stefan Bader (smb) wrote :

Hi gbon,

the kernel should be installable. The only downside is that this probably has to be done by downloading the deb packages manually and installing them with dpkg instead of apt. But I would be grateful if you could do that, since I would like to resolve any potential regressions in Hardy.

gbon (gbon) wrote :

Hi Stefan.

I'll be able to try it out in a few days.

gbon (gbon) wrote :

hi Stefan,

I'm not sure about which kernel you mentioned. I tried http://archive.ubuntu.com/ubuntu/pool/main/l/linux/linux-image-2.6.24-23-generic_2.6.24-23.46_i386.deb with no luck: see the attached photos.

sorry.
-- g.

Andy Whitcroft (apw) wrote :

@gbon -- thanks for testing that.

From the two sets of photos you have attached, the behaviour looks the same. It's pretty clear something occurs when trying to register a link in sysfs: the initial panics in both cases have EIP in sysfs_create_link. We later get other soft lockups trying to take a lock in the same routine, so we clearly failed in sysfs_create_link with some lock held and it is now lost. We can ignore the later panics as they are merely victims; their call chains are not particularly interesting.

Sadly, in both cases we do not have the initial lines of the first dump (I am assuming it zooms past rather fast). It is this part of the output which is most important (of course). If you do get a chance to try and reproduce this, could you add the "pause_on_oops=120" boot option (to get a two minute pause when the first panic occurs)? Then you could attempt to scroll up with Shift-PgUp and see if you can get a picture of the start of the panic.

Even if you are not able to test again, it would be useful to have dmesg output, lsmod output, and lspci -nnvv output from the system when working ok on Intrepid so we can see what this machine has in it.
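
The boot-option changes requested so far in this report (dropping 'quiet' and 'splash', adding "pause_on_oops=120") can be sketched as a small string transformation. This is only an illustration: the function name and the example command line (including the root= value) are hypothetical, and the real edit would be made by hand at the boot loader.

```python
def debug_boot_options(cmdline, pause_seconds=120):
    """Return a kernel command line suited to capturing an early panic.

    Hypothetical helper: it only rewrites a string and does not touch
    any boot loader configuration.
    """
    # Drop the options that hide boot messages.
    kept = [opt for opt in cmdline.split() if opt not in ("quiet", "splash")]
    # Drop any existing pause_on_oops so the option is not duplicated.
    kept = [opt for opt in kept if not opt.startswith("pause_on_oops=")]
    # Ask the kernel to pause after the first oops so it can be photographed.
    kept.append("pause_on_oops=%d" % pause_seconds)
    return " ".join(kept)


# Example command line; the root device here is made up.
print(debug_boot_options("root=/dev/sda1 ro quiet splash"))
```

Everything else on the command line is passed through unchanged, so the same helper could also be applied to a line that already carries other options such as acpi=off.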

Changed in linux:
assignee: stefan-bader-canonical → apw
gbon (gbon) wrote :

system info from intrepid's point of view, plus photos of the boot sequence.

Andy Whitcroft (apw) wrote :

@gbon -- cool that latest photo very nearly gets the top. It also gives us much more of a clue as to what was going on when it went bang as we do have the top of the stack at least. Thanks for testing.

I have attempted to add debug to the place where I think things are going wrong and built a test kernel for you. Perhaps you could test that kernel and let me know what, if anything, you see. I have made it check some things and then print some debug and pause for 20s. It will do this twice on a successful boot; pictures of those would help. If I have targeted the error correctly we may see only one. Let me know what you see. Kernel at the URL below:

    http://people.ubuntu.com/~apw/lp295091-hardy/

Changed in linux:
assignee: nobody → apw
importance: Undecided → High
status: New → In Progress
Andy Whitcroft (apw) wrote :

This is confirmed as not occurring on Intrepid. Closing out the non-Hardy tasks which remain open.

Changed in linux:
status: In Progress → Invalid
Andy Whitcroft (apw) on 2009-01-23
Changed in linux:
status: In Progress → Incomplete
gbon (gbon) wrote :

Hi Andy,

I'll try your kernel in the next few days.

In the meantime, I've been wondering about the timestamps in the 2.6.24-23 kernel output: the kernel stops after a few seconds, but the timestamp reads 180. Does that mean 3 minutes? If this is the case, the kernel timer is off by a factor of ten w.r.t. real time. Is it possible that 2.6.24-23 fails to configure some hw timer? Perhaps wrong timings can explain problems in thread synchronization and spurious timeouts during device probes.

Is there anything I can do to try other time sources or check that the source is working correctly?

-- g

Arv3n (xarv3nx) wrote :

Hi.

I've found a workaround: add acpi=off to grub.

I'll test the kernel in a sec., currently installing 8.04.2.

Andy Whitcroft (apw) on 2009-01-26
Changed in linux:
status: Incomplete → In Progress
gbon (gbon) wrote :

@Arv3n
in fact it works on my hw too. unfortunately disabling acpi has unpleasant side effects, e.g. no cpufreq support. I just prefer to run Intrepid.

@Andy Whitcroft.
your kernel pauses twice around the sysfs_create_link() call, as you can see in the photos.

regards
-- g

Andy Whitcroft (apw) wrote :

@gbon -- sorry for the long delay. I have respun the debug to try and get more information and to also slow down the panic itself when it occurs to make it more likely we get the top of the panic for you to take a picture of. New kernels are at the URL below, if you could give them a test and report back here:

    http://people.ubuntu.com/~apw/lp295091-hardy/

Changed in linux:
status: In Progress → Incomplete
gbon (gbon) wrote :

@andy

here are the photos. Immag005 is very interesting...
have a nice weekend

Andy Whitcroft (apw) on 2009-02-27
Changed in linux:
status: Incomplete → In Progress
Andy Whitcroft (apw) wrote :

@gbon -- thanks for those .. yes that photo is very enlightening. Will spin another kernel to try and get more directed debug on that. Will let you know when they are done.

Andy Whitcroft (apw) wrote :

Right, from the images (specifically photo 0007 in the foto6 archive) we can see that we panicked the second time we called acpi_processor_start(). In that case it appears that the sysdev was not initialised correctly. Since you didn't take a picture of the initialisation, which also has a pause, I assume that it was not emitted. Interesting.

I have built you a new test kernel, this one hopefully avoids the panic and continues (not fixed, just avoided). If so I'd like the whole dmesg from the boot. Either way if it gets further and we see further pauses I'd like pictures of those. Let me know. Kernels in their usual place:

    http://people.ubuntu.com/~apw/lp295091-hardy/

Changed in linux:
status: In Progress → Incomplete
gbon (gbon) wrote :

Andy,
here is dmesg and a set of photos. regards.
-- giuseppe

Andy Whitcroft (apw) on 2009-03-16
Changed in linux:
status: Incomplete → In Progress
Andy Whitcroft (apw) wrote :

@gbon -- thanks for that. The dmesg is most enlightening. It seems we have gotten into a spot of bother because we were unable to start the second CPU:

    [ 59.187433] CPU0: AMD Athlon(tm) 64 X2 Dual Core Processor 5600+ stepping 03
    [ 59.187585] SMP alternatives: switching to SMP code
    [ 59.187949] Booting processor 1/1 eip 3000
    [ 78.776093] Not responding.
    [ 78.776147] Inquiring remote APIC #1...
    [ 78.776193] ... APIC #1 ID: 1000000
    [ 78.776382] ... APIC #1 VERSION: 80050010
    [ 78.776571] ... APIC #1 SPIV: ff
    [ 78.776768] CPU #1 not responding - cannot use it.

This then leads us to only initialise the 0th CPU:

    [ 79.002377] APW: registering cpus:
    [ 79.002424] APW: cpu=0
    [ 79.002470] APW: register_cpu cpu=0 sysdev=c0483e68
    [ 79.002531] APW: sysdev_register error=0
    [ 99.002016] APW: registering cpus complete

So when we try and make links for the active CPUs listed in ACPI we find we never initialised the second one, and blow up trying to use its sysfs object:

    [ 121.440692] APW: device<f7c5f800> sysdev<c0483e68> pr->id<0>
    [ 121.440744] APW: after link
    [ 121.440798] APW: acpi_processor_start complete
    [ 141.439407] APW: device<f7c5fc00> sysdev<00000000> pr->id<1>
    [ 141.439455] APW: sysdev NULL
    [ 161.439192] APW: link AVOIDED
    [ 161.439238] APW: after link

Bad. So there are two bugs. First, the system should boot your second CPU, which I assume we can see normally in your older/newer kernels. Second, the system should not explode if the CPUs do not start correctly.
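
The two symptoms above can be picked out of a saved dmesg mechanically. The following is a minimal sketch, assuming the dmesg text is already in a string; the function name is hypothetical, and the "APW:" lines only exist in the debug kernels built for this bug, so on a stock kernel only the first pattern can match.

```python
import re

# "CPU #1 not responding - cannot use it." from the SMP bring-up code.
NOT_RESPONDING = re.compile(r"CPU #(\d+) not responding")
# Debug line from the test kernel: an all-zero sysdev pointer means the
# processor's sysfs object was never registered.
NULL_SYSDEV = re.compile(r"sysdev<0+> pr->id<(\d+)>")


def summarize(dmesg_text):
    """Return (CPUs that failed to start, CPUs with a NULL sysdev)."""
    failed = sorted({int(m.group(1)) for m in NOT_RESPONDING.finditer(dmesg_text)})
    null_sysdev = sorted({int(m.group(1)) for m in NULL_SYSDEV.finditer(dmesg_text)})
    return failed, null_sysdev


# Lines copied from the dmesg excerpts quoted in this comment.
sample = """\
[ 78.776768] CPU #1 not responding - cannot use it.
[ 121.440692] APW: device<f7c5f800> sysdev<c0483e68> pr->id<0>
[ 141.439407] APW: device<f7c5fc00> sysdev<00000000> pr->id<1>
"""
print(summarize(sample))
```

On the excerpts above both lists contain only CPU 1, which matches the reading that the processor that failed to start is the same one whose sysfs object is missing.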

Andy Whitcroft (apw) wrote :

@gbon -- can you confirm how many CPUs are seen on the good Hardy kernel, and also on your Intrepid kernels?

Andy Whitcroft (apw) wrote :

Ok, I think I understand why this regression has appeared. It looks very much as though the Hardy kernels are not correctly finding and booting the second CPU on AMD X2. This appears to have been separately reported under bug #213011, which was FIX_COMMITTED _only_ in Intrepid.

In the window between Ubuntu-2.6.24-19.33 and Ubuntu-2.6.24-21.43 the commit below was added which triggers these new sysfs files, and it appears that this code in combination with the failure to start the second CPU leads to a non-booting system:

  commit c1d87cd9e086138b3d2326b75ae6d4d000d7c041
  Author: Zhang Rui <email address hidden>
  Date: Tue Apr 29 02:36:07 2008 -0400

    create sysfs link from acpi device to sysdev for cpu

    Bug: #248509

    Sys I/F under acpi device node and sysdev device node are both
    needed for cpu hot-removal. User space need this link so that
    they know they are poking the sys I/F for the same cpu.
    http://bugzilla.kernel.org/show_bug.cgi?id=9772

We can confirm this with the cpu counts as reported by the good hardy kernel, and by the good intrepid kernel, which I have already requested. If that confirms, then the fix for this bug will be to prevent the panic on creation of these sysfs files. Will spin a patch for the latter.
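
The interaction described above can be modelled in a few lines. This is a toy model in Python, not the kernel patch: the function name and data shapes are hypothetical, and the only assumption taken from this report is that a processor listed by ACPI may have no registered sysdev, in which case link creation should be skipped rather than attempted.

```python
def create_sysdev_links(acpi_processor_ids, registered_sysdevs):
    """Toy model: link only processors that have a registered sysdev.

    acpi_processor_ids: processor ids listed by ACPI.
    registered_sysdevs: mapping of processor id -> sysdev handle for the
    processors that actually booted.
    """
    linked, skipped = [], []
    for cpu_id in acpi_processor_ids:
        sysdev = registered_sysdevs.get(cpu_id)
        if sysdev is None:
            # Listed by ACPI but never started: creating the link here is
            # what the unguarded code would have blown up on.
            skipped.append(cpu_id)
            continue
        linked.append((cpu_id, sysdev))
    return linked, skipped


# ACPI lists two processors, but only CPU 0 was registered.
linked, skipped = create_sysdev_links([0, 1], {0: "c0483e68"})
print(linked, skipped)
```

The guard does not bring the second CPU up; it only keeps a machine in that state bootable, which is the narrower of the two problems identified in this report.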

Andy Whitcroft (apw) wrote :

@gbon -- could you try the latest kernels below for me. I have hopefully fixed up your panic properly without all the delays and debug. If you could confirm this boots for you ok that would help muchly. Kernels at the URL below:

    http://people.ubuntu.com/~apw/lp295091-hardy/

Also if you could get the /proc/cpuinfo output from this kernel, from the good hardy kernel (-19) and from your intrepid kernel and attach them here that would help prove my theory on this bug.

Changed in linux:
status: In Progress → Incomplete
gbon (gbon) wrote :

@andy

here is the info you requested, relative to 2.6.24-19 (last good for hardy), 2.6.24-24 (your latest build) and 2.6.27-11 (intrepid current).

Your analysis is very accurate. I'm sorry for not noticing the missing core in the 2.6.24-19 kernel. That would have made things much easier.

best regards
-- gbon

Andy Whitcroft (apw) wrote :

@gbon -- excellent, thanks. No need to apologise, it's not something we would naturally look for until we are close to the trigger. Thank you for all your debug efforts even though you no longer needed to use Hardy.

From the cpuinfo we can clearly see we do not get two cpus in any Hardy kernel, working or not:

    cpuinfo-2.6.24-19-generic:processor : 0
    cpuinfo-2.6.24-24-generic:processor : 0

But we do get two in Intrepid kernels:

    cpuinfo-2.6.27-11-generic:processor : 0
    cpuinfo-2.6.27-11-generic:processor : 1

So this would trigger the bug I have found in the sysfs handling. Will push this patch to our Hardy tree.
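
The comparison above amounts to counting "processor" entries in each saved /proc/cpuinfo. A minimal sketch, assuming the file contents are already in a string; the function name and the sample strings are illustrative and only mirror the grep output quoted in this comment.

```python
def count_processors(cpuinfo_text):
    """Count the 'processor : N' entries in /proc/cpuinfo-style text."""
    count = 0
    for line in cpuinfo_text.splitlines():
        key, sep, _value = line.partition(":")
        # Match the key exactly so that lines such as "model name" are ignored.
        if sep and key.strip() == "processor":
            count += 1
    return count


# Illustrative samples shaped like the cpuinfo attachments.
hardy = "processor\t: 0\nmodel name\t: AMD Athlon(tm) 64 X2 Dual Core Processor 5600+\n"
intrepid = hardy + "\nprocessor\t: 1\nmodel name\t: AMD Athlon(tm) 64 X2 Dual Core Processor 5600+\n"
print(count_processors(hardy), count_processors(intrepid))
```

A count of one on a dual-core machine is the hint that the second core never came up, even on a kernel that otherwise boots.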

Andy Whitcroft (apw) on 2009-03-25
Changed in linux:
status: Incomplete → In Progress
Andy Whitcroft (apw) wrote :

Patch applied.

Changed in linux:
status: In Progress → Fix Committed
description: updated
Arv3n (xarv3nx) wrote :

Unfortunately, this problem still exists on my Dell Inspiron 531, and I still have to apply the acpi=off boot option to allow my system to boot, even after upgrading all my packages to hardy-proposed.

We obviously have the same machine.. why isn't the "fix" working?

gbon (gbon) wrote :

@Arv3n

what kernel are you running, exactly? can you send the output of
cat /proc/version, please?

anyway, I suggest that you try the latest kernel from Andy. it's at
http://people.ubuntu.com/~apw/lp295091-hardy/linux-image-2.6.24-24-generic_2.6.24-24.51~lp295091apw1_i386.deb

regards.
-- gb

Arv3n (xarv3nx) wrote :

cat /proc/version:

Linux version 2.6.24-24-generic (buildd@rothera) (gcc version 4.2.4 (Ubuntu 4.2.4-1ubuntu4)) #1 SMP Wed Apr 15 15:54:25 UTC 2009

trying his kernel now..

Arv3n (xarv3nx) wrote :

when i try to install it via gdebi it says a later version is already installed..

gbon (gbon) wrote :

@Arv3n

I've no idea what gdebi is. you can just
    sudo dpkg -i andys_kernel.deb

regards
-- gb

Martin Pitt (pitti) wrote :

Accepted linux into hardy-proposed, the package will build now and be available in a few hours. Please test and give feedback here. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Thank you in advance!

gbon (gbon) wrote :

@Martin

I tried the hardy-proposed kernel on top of my intrepid installation.

The machine boots, but the resulting (hybrid) system has poor graphics resolution and in general is not very usable. Probably, there are incompatibilities between the 2.6.24 kernel and the rest of the system (compiled against a later kernel).

I suggest that Arv3n tries the proposed kernel on his hardy installation and reports the results.

regards
-- gb

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 2.6.24-24.56

---------------
linux (2.6.24-24.56) hardy-proposed; urgency=low

  [Stefan Bader]

  * Rebuild of 2.6.24-24.54 with 2.6.24-24.55 security release applied

linux (2.6.24-24.54) hardy-proposed; urgency=low

  [Andy Whitcroft]

  * SAUCE: do not make sysdev links for processors which are not booted
    - LP: #295091

  [Brad Figg]

  * SAUCE: Add information to recognize Toshiba Satellite Pro M10 Alps Touchpad
    - LP: #330885
  * SAUCE: Add signatures to airprime driver to support newer Novatel devices
    - LP: #365291

  [Stefan Bader]

  * SAUCE: vgacon: Return the upper half of 512 character fonts
    - LP: #355057

  [Upstream Kernel Changes]

  * SUNRPC: Fix autobind on cloned rpc clients
    - LP: #341783, #212485
  * Input: atkbd - mark keyboard as disabled when suspending/unloading
    - LP: #213988
  * x86: mtrr: don't modify RdDram/WrDram bits of fixed MTRRs
    - LP: #292619
  * sis190: add identifier for Atheros AR8021 PHY
    - LP: #247889
  * bluetooth hid: enable quirk handling for Apple Wireless Keyboards in
    2.6.24
    - LP: #227501
  * nfsd: move callback rpc_client creation into separate thread
    - LP: #253004
  * nfsd4: probe callback channel only once
    - LP: #253004

 -- Stefan Bader <email address hidden> Sat, 20 Jun 2009 00:14:36 +0200

Changed in linux (Ubuntu Hardy):
status: Fix Committed → Fix Released