Bug #279186 “x86_64 kernel oops on boot (dual-core Atom 330 boar...” : Bugs : linux package : Ubuntu

Luka Renko (lure) wrote on 2008-10-06:

#1

Oops do_page_fault in IRQ (5.7 MiB, image/jpeg)

Luka Renko (lure) wrote on 2008-10-06:

#2

Oops call_softirq (5.6 MiB, image/jpeg)

Luka Renko (lure) wrote on 2008-10-06:

#3

dmesg.txt (38.4 KiB, text/plain)

Luka Renko (lure) wrote on 2008-10-06:

#4

lspci-vvnn.log (17.6 KiB, text/plain)

Luka Renko (lure) wrote on 2008-10-06:

#5

oops.tar (6.2 MiB, application/x-tar)

I have tested this further. It looks like I can reproduce the oops 100% of the time (but with different stack traces) when I use the "quiet" boot option. The "splash" boot option does not seem to be the cause, as I can reliably boot without problems if I remove "quiet" but leave "splash".

Will change bug description to match this.
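
To confirm which of the two options a given boot actually used, the kernel command line can be inspected. The following is a minimal sketch and not part of the original report: the helper name has_boot_option is made up for illustration, and it assumes a Linux system that exposes /proc/cmdline.

```shell
# has_boot_option CMDLINE OPTION
# Succeeds if OPTION appears as a whole word in the kernel command line CMDLINE.
# (Hypothetical helper, for illustration only.)
has_boot_option() {
    case " $1 " in
        *" $2 "*) return 0 ;;
        *) return 1 ;;
    esac
}

# On a running system the command line of the current boot is in /proc/cmdline.
if [ -r /proc/cmdline ]; then
    cmdline=$(cat /proc/cmdline)
    for opt in quiet splash; do
        if has_boot_option "$cmdline" "$opt"; then
            echo "$opt: present"
        else
            echo "$opt: absent"
        fi
    done
fi
```

Matching on whole words avoids false hits on longer options that merely contain "quiet" or "splash" as a substring.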

I have also attached more oops photos where you can see different stack traces reported. It looks like there are at least three distinct functions where I get oops:

1. run_timer_softirq / clockevents_program_event / __do_softirq ... / smp_apic_timer_interrupt

2. neigh_periodic_timer / clockevents_program_event / __do_softirq ... / smp_apic_timer_interrupt

3. oops_begin / do_page_fault / enqueue_task_fair ... / error_exit / _i8042_interrupt
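
To compare oopses like the three above without reading photos side by side, the function names can be pulled out of a captured log (for example a serial console dump). This is a rough sketch under the assumption that trace entries look like "[<address>] symbol+0xOFF/0xLEN", the usual x86_64 format in kernels of this era. trace_symbols is a hypothetical helper name and oops.txt is a placeholder file.

```shell
# trace_symbols: read oops text on stdin and print the function name of each
# stack-trace entry of the assumed form "[<address>] symbol+0xOFF/0xLEN".
# An optional "? " marker (unreliable frame) before the symbol is skipped.
trace_symbols() {
    sed -n 's/.*\[<[0-9a-f]*>\] *?\{0,1\} *\([A-Za-z_][A-Za-z0-9_.]*\)+0x[0-9a-f]*\/0x[0-9a-f]*.*/\1/p'
}

# Example use: count how often each function shows up in a saved log.
if [ -r oops.txt ]; then
    trace_symbols < oops.txt | sort | uniq -c | sort -rn
fi
```

Lines that do not match the assumed format are ignored, so the register dump and banner lines of an oops do not pollute the result.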

Luka Renko (lure) wrote on 2008-10-06: Re: kernel oops on boot with "quiet" option

#6

The run_timer_softirq issue has been reported on LKML and is supposed to be fixed in recent 2.6.27-rc kernels:
http://kerneltrap.org/mailarchive/linux-scsi/2008/9/28/3433344
http://www.kerneloops.org/searchweek.php?search=run_timer_softirq

richard mullens (richard-mullens) wrote on 2008-10-06:

#7

I have one of these boards also.
I'm running Ubuntu Intrepid Alpha 6 - x64.
I get kernel panics if I power off and then power on again within, say, 30 seconds.
If I wait before rebooting, the system comes up fine.
Once, I saw a diagnostic to the effect that there had been a thermal problem - but I can hardly believe this as the system runs really cool - though I suppose that the heatsink(s) may be badly mounted.

I'm using a picoPSU-60-WI fed by an 80W Sony laptop "brick" - and I've got a P4 connector plugged in of course.
The proximity of the PSU inductors to the memory card worries me a little.

Leann Ogasawara (leannogasawara) on 2008-10-06

Changed in linux:
assignee:	nobody → ubuntu-kernel-team
importance:	Undecided → High
status:	New → Triaged

Luka Renko (lure) wrote on 2008-10-07:

#8

Richard: do you get kernel panics only during boot (like me), or also once the system has booted properly?
Actually, you may be right about boots succeeding after the machine has been powered off for a longer time: that may explain my impression that it did not panic 100% of the time.

Otherwise, the new kernel currently being built has some softirq/hrtimer fixes that may be related:
  * hrtimer: migrate pending list on cpu offline
  * hrtimer: fix migration of CB_IRQSAFE_NO_SOFTIRQ hrtimers
  * hrtimer: mark migration state
  * hrtimer: prevent migration of per CPU hrtimers
https://lists.ubuntu.com/archives/intrepid-changes/2008-October/008062.html
Will test again when this kernel hits the archives.

Just for completeness, I am using picoPSU-90 (with power adapter):
http://www.mini-box.com/picoPSU-90-power-kit

I use it in M300 enclosure with compact flash reader (but not used):
http://www.mini-box.com/M300-Enclosure-w-Bootable-CF-Reader_2

richard mullens (richard-mullens) wrote on 2008-10-07:

#9

Luka: I only get kernel panics during the boot.
Sometimes the system will stop during the boot while the Ubuntu process screen is being displayed. I'm not sure if that is a "panic" - but usually I get a screenful of messages when the problem occurs prior to the Ubuntu process screen.

I also see the problem if I do a "restart" of the system - like after updates have been applied.

My system has been running all night using bittorrent without a problem - and I'm typing this on it now.

Thanks for your suggestions regarding the kernel.

For completeness I should say that the changes I made to my bios settings seem not to be permanent:
Booting from USB - seemed to be lost - as did automatically starting after a power failure. I guess this is unrelated.

My system isn't in a box yet.

Luka Renko (lure) wrote on 2008-10-07:

#10

oops-2.6.27-6.tar (1.0 MiB, application/x-tar)

No luck with new kernel (2.6.27-6). :-(

It is slightly better: I was able to boot after three panics, so it does not always happen. I have attached the two photos that I managed to take (one panic rebooted before my phone camera took the picture).

One oops is the same as before (call_softirq); the other seems to be similar to neigh_periodic_timer.

Luka Renko (lure) wrote on 2008-10-13:

#11

Same problem with latest kernel 2.6.27-7-generic

Leann Ogasawara (leannogasawara) wrote on 2008-10-17:

#12

Per some discussion with the kernel team I'm assigning a few bugs directly to a developer.

Changed in linux:
assignee:	ubuntu-kernel-team → amitk

Luka Renko (lure) wrote on 2008-10-19:

#13

oops photo with 2.6.26-7.12 (492.8 KiB, image/jpeg)

This is getting worse with latest kernel 2.6.27-7.12: now I always get Oops on boot of the kernel and even removal of quiet and splash options does not help. Luckily I still had 2.6.27-6 on the system so that I can boot it (without "quiet" though).

Good thing (I think) is that now the Oops looks the same all the time - see attached photo.

richard mullens (richard-mullens) wrote on 2008-10-19:

#14

Leann:
Are the developers able to reproduce these crashes?
It looks like a duplicate of this has been reported - bug 285518 - with additional useful information.
Thanks
Richard

Robert Nelson (robertcnelson) wrote on 2008-10-19:

#15

2.6.27-7.12 console boot (30.2 KiB, text/plain)

Hi guys, I took a look at your screenshots, and this looks to be the same bug I posted as https://bugs.launchpad.net/bugs/285518

I'm attaching my serial terminal dump which shows a lot more information of what is happening.

I rebuilt Linus's mainline 2.6.27 kernel (release) and my Atom 330 (D945GCLF2) boots just fine. I'm not sure if the splash screen actually comes up, since my VGA->TV adapter doesn't really kick in till GNOME loads.

I'd post the linux-image deb, however it's in the neighborhood of 235 MB...

Quick fix, till ubuntu's kernel is fixed. (not sure if it's compatible with any of ubuntu's extra drivers, but the atom 330 only comes on a D945GCLF2 at this point) (anyone want to do a git bisect between linus's 2.6.27 and ubuntu's ;) )

# Quick fix
git clone git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6.git
cd linux-2.6/
git checkout v2.6.27
cp /boot/config-2.6.27-7-generic .config
make oldconfig    # answer "no" to the new options
make menuconfig   # uncheck "Paravirtualized guest support"
make-kpkg clean
CONCURRENCY_LEVEL=2 fakeroot make-kpkg --initrd --append-to-version=-custom kernel_image kernel_headers
# wait about 3-4 hours; make-kpkg writes the .deb files to the parent directory
sudo dpkg -i ../linux-*.deb
sudo reboot
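
The menuconfig step can in principle be scripted. The sketch below assumes that the "Paravirtualized guest support" menu entry corresponds to the symbol CONFIG_PARAVIRT_GUEST (an assumption; check arch/x86/Kconfig in your tree), and disable_config_symbol is a hypothetical helper, not a kernel tool.

```shell
# disable_config_symbol FILE SYMBOL
# Rewrite "CONFIG_SYMBOL=y" or "CONFIG_SYMBOL=m" in FILE to the
# "# CONFIG_SYMBOL is not set" form that kconfig uses for disabled options.
# (Hypothetical helper; the symbol name passed in is the caller's assumption.)
disable_config_symbol() {
    file=$1
    sym=$2
    tmp=$(mktemp) || return 1
    sed "s/^CONFIG_${sym}=[ym]\$/# CONFIG_${sym} is not set/" "$file" > "$tmp" &&
        cat "$tmp" > "$file"
    status=$?
    rm -f "$tmp"
    return $status
}

# Example use inside the kernel tree (PARAVIRT_GUEST is an assumed symbol name):
#   disable_config_symbol .config PARAVIRT_GUEST && make oldconfig
```

Running make oldconfig afterwards lets kconfig resolve any options that depended on the disabled symbol, which a plain text edit cannot do.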

Related to : http://webui.sourcelabs.com/kernel/issues/11157

Regards,

Robert

Uwe Helm (1forthedoctor) wrote on 2008-10-20:

#16

Robert,

Ubuntu's git kernel tree has vanilla tags as well, so you could do an easier bisect there:
git://kernel.ubuntu.com/ubuntu/ubuntu-intrepid.git

Robert Nelson (robertcnelson) wrote on 2008-10-20: Re: kernel oops on boot (dual-core Atom 330 board D945GCLF2)

#17

Untested diff for alternative.c (950 bytes, text/plain)

Taking a shot in the dark: once you remove the drivers/ubuntu and debian directories there are 141 changed files (385K diff) between Ubuntu-2.6.27-7.12 and Linus's 2.6.27 release. Based on the logs and screenshots it looks IRQ/SMP related, so I'm going to attempt building the kernel with this patch over my lunch break (patch-alternative.c-linus-2.6.27.diff). It removes a couple of changes made to Ubuntu-2.6.27-7.12's alternative.c file. I should have an answer later tonight (3-4 hours to build with make-kpkg).

Regards,

Robert

Robert Nelson (robertcnelson) wrote on 2008-10-20:

#18

Nope, alternative.c isn't the problem. I really hope it isn't the fuse layer; there are quite a few changes there. Did any version of 2.6.27 work perfectly for anyone? I'll start a git bisect from there.

Robert

richard mullens (richard-mullens) wrote on 2008-10-21:

#19

I am using kernel Linux 2.6.27-3-generic (that's what System Monitor says).
With that kernel (from Intrepid Alpha 6), I can boot providing my system has been powered off for some time (shall we say a minute). With the most recent kernels it has been impossible to boot it seems.

I have not tried any earlier versions of the kernel. When I received my board I downloaded Alpha 6.

Probably irrelevant but my BIOS is LF94510J.86A.0099.2008.0731.0303
There is a more recent BIOS - LF94510J.86A.0103.2008.0814.1910
but I have not installed it.

Perhaps the module at fault (if it is just one) is one which has had multiple changes made to it.

Robert Nelson (robertcnelson) wrote on 2008-10-21:

#20

Just a quick update on git bisecting, 10 left to go:

Last Good: 184df5deeb7e41dbb610712d0233a4442cfb1ee6
Last Bad: be63c2707b64da6b2391d03d8e91805793d7adfe

http://kernel.ubuntu.com/git?p=ubuntu/ubuntu-intrepid.git;a=tags
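
For a sense of scale: each bisect step roughly halves the candidate range, so ten remaining steps corresponds to on the order of a thousand commits (2^10 = 1024), and at 3-4 hours per build that is well over a day of compiling. A small sketch of that arithmetic follows; bisect_steps is a made-up helper, and git's own "roughly N steps" estimate may differ slightly because the history is not a straight line.

```shell
# bisect_steps N
# Print the smallest k with 2^k >= N, i.e. the approximate number of
# build-and-boot rounds a bisection needs for N candidate commits.
# (Illustrative helper; not a git command.)
bisect_steps() {
    n=$1
    steps=0
    span=1
    while [ "$span" -lt "$n" ]; do
        span=$((span * 2))
        steps=$((steps + 1))
    done
    echo "$steps"
}

# Example use inside a git checkout, with the two commits quoted above:
#   bisect_steps "$(git rev-list --count 184df5d..be63c27)"
```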

There are quite a few vesafb changes between those commits, which would make sense, especially since non-splash modes boot correctly for some kernels.

Richard -
The bios doesn't help in this case.. (Although the newer one will allow you to actually select a different boot order, USB device (cd-rom) before the main harddrive. quite useful in my case about a week ago..)

Regards,
Robert

Robert Nelson (robertcnelson) wrote on 2008-10-22:

#21

Just a quick update: down to the last few bisects, building them to be 100% sure.

Last Good: 5b28a2ae632698758bf84364a5d87c6fcaaff03b

UBUNTU: Remove depmod created files from packages.
Bug: #250511

All depmod files are created immediately upon installation.

Signed-off-by: Tim Gardner <email address hidden>

Last Bad: 67d9b90a1c844bf1c6daaffd2c60561fc8c445f7

disable CONFIG_DYNAMIC_FTRACE due to possible memory corruption on module unload

    While debugging the e1000e corruption bug with Intel, we discovered
    today that the dynamic ftrace code in mainline is the likely source of
    this bug.

This leaves as the first possible bad commits:

0020f5cf510e3225fc4b0eefdcfdc0caba2f2097

x86: register a platform RTC device if PNP doesn't describe it

Most if not all x86 platforms have an RTC device, but sometimes the RTC
is not exposed as a PNP0b00/PNP0b01/PNP0b02 device in PNPBIOS or ACPI:

http://bugzilla.kernel.org/show_bug.cgi?id=11580
https://bugzilla.redhat.com/show_bug.cgi?id=451188

It's best if we can discover the RTC via PNP because then we know
which flavor of device it is, where it lives, and which IRQ it uses.

But if we can't, we should register a platform device using the
compiled-in RTC_PORT/RTC_IRQ resource assumptions.

    Signed-off-by: Bjorn Helgaas <email address hidden>
    Acked-by: Rafael J. Wysocki <email address hidden>
    Acked-by: David Brownell <email address hidden>
    Reported-by: Rik Theys <email address hidden>
    Reported-by: <email address hidden>
    Signed-off-by: Linus Torvalds <email address hidden>
    Signed-off-by: Tim Gardner <email address hidden>

or:

c25072126a591fec9481197aab361469661c050e

UBUNTU: SAUCE: Add back in lost commit for Apple BT Wireless Keyboard

    OriginalAuthor: Dean McCarron
    OriginalLocation: http://sourceforge.net/mailarchive/message.php?msg_id=473118CB.5090804%40mercuryresearch.com
    Bug: #162083
    Ignore: no
    This patch was present in Hardy, but got dropped by accident in Intrepid.
    I've submitted it upstream so it can get into Jaunty, but will need to be
    carried as extra baggage for now in Intrepid.

Signed-off-by: Mario Limonciello <email address hidden>
Signed-off-by: Tim Gardner <email address hidden>

Since c25072126a591fec9481197aab361469661c050e only changes a keyboard driver, it is more than likely that 0020f5cf510e3225fc4b0eefdcfdc0caba2f2097 "x86: register a platform RTC device if PNP doesn't describe it" is what does not work on the Intel Atom 330 D945GCLF2.

Like I mentioned, I'm building these last few commits to be 100% sure, and will have a final update tomorrow.

Regards,

Robert

Robert Nelson (robertcnelson) wrote on 2008-10-22:

#22

Okay, this really doesn't make sense (config-2.6.27-7.12 has it disabled), and it is not what I was hoping for. The git bisect results (Atom 330 D945GCLF2, x86_64) are below. This more than likely will not get reverted (since the bug it guards against can damage hardware), but it will have to be modified to work with this CPU.

Any Ideas?

Regards,
Robert

67d9b90a1c844bf1c6daaffd2c60561fc8c445f7 is first bad commit
commit 67d9b90a1c844bf1c6daaffd2c60561fc8c445f7
Author: Steven Rostedt <rostedt@goodmis.org>
Date:   Wed Oct 15 18:21:44 2008 -0400

disable CONFIG_DYNAMIC_FTRACE due to possible memory corruption on module unload
    
    While debugging the e1000e corruption bug with Intel, we discovered
    today that the dynamic ftrace code in mainline is the likely source of
    this bug.
    
    For the stable kernel we are providing the only viable fix patch: labeling
    CONFIG_DYNAMIC_FTRACE as broken. (see the patch below)
    
    We will follow up with a backport patch that contains the fixes. But since
    the fixes are not a one liner, the safest approach for now is to
    disable the code in question.
    
    The cause of the bug is due to the way the current code in mainline
    handles dynamic ftrace.  When dynamic ftrace is turned on, it also
    turns on CONFIG_FTRACE which enables the -pg config in gcc that places
    a call to mcount at every function call. With just CONFIG_FTRACE this
    causes a noticeable overhead.  CONFIG_DYNAMIC_FTRACE works to ease this
    overhead by dynamically updating the mcount call sites into nops.
    
    The problem arises when we trace functions and modules are unloaded.
    The first time a function is called, it will call mcount and the mcount
    call will call ftrace_record_ip. This records the calling site and
    stores it in a preallocated hash table. Later on a daemon will
    wake up and call kstop_machine and convert any mcount callers into
    nops.
    
    The evolution of this code first tried to do this without the kstop_machine
    and used cmpxchg to update the callers as they were called. But I
    was informed that this is dangerous to do on SMP machines if another
    CPU is running that same code. The solution was to do this with
    kstop_machine.
    
    We still used cmpxchg to test if the code that we are modifying is
    indeed code that we expect to be before updating it - as a final
    line of defense.
    
    But on 32bit machines, ioremapped memory and modules share the same
    address space. When a module would load its code into memory and execute
    some code, that would register the function.
    
    On module unload, ftrace incorrectly did not zap these functions from
    its hash (this was the bug). The cmpxchg could have saved us in most
    cases (via luck) - but with ioremap-ed memory that was exactly the wrong
    thing to do - the results of cmpxchg on device memory are undefined.
    (and will likely result in a write)
    
    The pending .28 ftrace tree does not have this bug anymore, as a general push
    towards more robustness of code patching, this is done differently: we do not
    use cmpxchg and we do a WARN_ON and turn the tracer off if anything deviates
    from its expected state. Furthermore, patch sites are statically identified
    during build time so there's no runtime discovery of dynamic code areas
    anymore, and no room for code unmaps to cause the hash to become out of date.
    
    We believe the fragility of dynamic patching has been sufficiently
    addressed in the development code via the static patching method, but further
    suggestions to make it more robust are welcome.
    
    Signed-off-by: Steven Rostedt <srostedt@goodmis.org>
    Acked-by: Ingo Molnar <mingo@elte.hu>
    Acked-by: Thomas Gleixner <tglx@linutronix.de>
    Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
    Signed-off-by: Tim Gardner <tim.gardner@canonical.com>

:040000 040000 22d7188976a687f1a02ca43ee8a55e2202e10397 e7e8cf17d28e09c98efac31848bae727973ca8b8 M	kernel
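
Since the puzzle here is that the shipped config reportedly already has dynamic ftrace disabled, it may help to check the state of the two symbols named in the commit message (CONFIG_FTRACE and CONFIG_DYNAMIC_FTRACE) in the config file actually used for the build. A minimal sketch, assuming the standard kconfig file format; config_symbol_state is a hypothetical helper name.

```shell
# config_symbol_state FILE SYMBOL
# Print y, m, n, or "absent" depending on how CONFIG_SYMBOL appears in FILE.
# (Hypothetical helper; assumes the usual kconfig .config format.)
config_symbol_state() {
    if grep -q "^CONFIG_$2=y\$" "$1"; then
        echo y
    elif grep -q "^CONFIG_$2=m\$" "$1"; then
        echo m
    elif grep -q "^# CONFIG_$2 is not set\$" "$1"; then
        echo n
    else
        echo absent
    fi
}

# Example use with the config file mentioned in this thread:
#   for sym in FTRACE DYNAMIC_FTRACE; do
#       echo "$sym: $(config_symbol_state /boot/config-2.6.27-7-generic "$sym")"
#   done
```

A result of "absent" is worth distinguishing from "n": it can mean the option is hidden by an unmet dependency, which is what marking an option as broken would be expected to produce.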

Amit Kucheria (amitk) wrote on 2008-10-24:

#23

Are you still able to reproduce it with 2.6.27-7.14?

Robert Nelson (robertcnelson) wrote on 2008-10-24:

#24

Hi Amit,

It still exists in 2.6.27-7.14 (x86_64)

Regards,
Robert

Robert Nelson (robertcnelson) wrote on 2008-10-30:

#25

No Change (bug still exists) with 2.6.27-7.15 (x86_64), now that 8.04 is released i'll update the sources to follow the intrepid-proposed tree.. For Ubuntu kernel maintainers, let me know when you need another serial console dump.

Otherwise commit 67d9b90a1c844bf1c6daaffd2c60561fc8c445f7 breaks Atom 330 board D945GCLF2.

Regards,
Robert

Robert Nelson (robertcnelson) wrote on 2008-10-30:

#26

Obvious s/8.04/8.10/

Robert

Luka Renko (lure) wrote on 2008-11-08:

#27

No change with 2.6.27-8.17 (x86_64) which is currently in intrepid-proposed. :-(

Luka Renko (lure) wrote on 2008-11-09:

#28

I have just installed Kubuntu 8.10 release, but this time i386 packages - everything works without problems.
It seems that this issue is clearly related to 64-bit version of kernel only (x86_64).

Robert Nelson (robertcnelson) wrote on 2008-11-09:

#29

Retested with 2.6.27-8.17, Reverting 67d9b90a1 still fixes the issue... 2.6.27-8.17 (x86_64)

commit c39815a9d77967dd058175b89457db5ca81d80cf
Author: Robert Nelson <voodoo@myth-living.(none)>
Date: Sun Nov 9 07:39:56 2008 -0600

Revert "disable CONFIG_DYNAMIC_FTRACE due to possible memory corruption on module unload"

This reverts commit 67d9b90a1c844bf1c6daaffd2c60561fc8c445f7.

commit f36789c71cde4689aeb9b80aafc6895b270b05d2
Author: Tim Gardner <email address hidden>
Date: Wed Nov 5 13:42:32 2008 -0700

UBUNTU: Ubuntu-2.6.27-8.17
Ignore: yes

Signed-off-by: Tim Gardner <email address hidden>

Yusef Maali (usef) wrote on 2008-11-12:

#30

I'm not a kernel expert but I'm having the same issue with the "little sister" of your board, the D945GCLF with the Intel Atom 230 (single core, HT capable).

I've tried to install the intrepid x86_64 server edition and:
- with the iso, I get an early kernel panic with a hard lock.
- with the netboot install I'm able to install the system only with the 20080522ubuntu21 version (passing acpi=off). The other two versions, including "current", always give a kernel panic.
As said, with that particular version of netboot installer I'm able to install the system, but after the first reboot I get only kernel panics.

With x86 kernel I have no problem to install ubuntu and also boot a live cd.

If it is useful, the Debian etch netboot install for x86_64 works perfectly (but it has a different kernel version).

Thanks.
Yusef Maali

Dan (ldskjdfjsl83) wrote on 2008-11-12:

#31

I want to corroborate Yusef Maali's comment; I have the D945GCLF single core atom 230 motherboard and the latest distro (downloaded last night) gives a kernel panic when trying to run intrepid x86_64 from the live CD and even when trying to verify the live CD from the Live boot menu. For some reason the Live CD memory check works; it probably uses the 32-bit kernel. I get exactly the same screen reported in Luka Renko's 10/19 post: http://launchpadlibrarian.net/18682151/18102008.jpg.

I sure am glad this community is here and knows about the problem. I tried two different motherboards, two different ram sticks and downloaded the distro twice from two different sources and checked MD5 sums of the distros and the CD's burned from them - all identical.

To Amit Kucheria: I'm a software developer and I know how difficult these kinds of bugs are to find and fix. I'm rootin' for ya.

Thanks,
Dan

Jon Jennings (jon100) wrote on 2008-11-13:

#32

See also my comments & screenshots attached to Bug #54020

As with Yusef & Dan I'm using the single core D945GCLF board/processor.
In brief, 8.04.1 32&64 bit are fine, 8.10 64 bit panics immediately after boot-up menu for installation or CD check.

One further piece of information... I still get the panics if I disable hyper-threading.

Robert Nelson (robertcnelson) wrote on 2008-11-13:

#33

Hi Jon, Dan & Yusef,

Through git bisecting on the Atom 330 I found commit 67d9b90a1c844bf1c6daaffd2c60561fc8c445f7 to be the issue with x86_64. Can you confirm this on the single-core Atom? (You may have to install with alpha-5/6 x86_64 to get an initial bootable kernel...)

git clone git://kernel.ubuntu.com/ubuntu/ubuntu-intrepid.git
cd ubuntu-intrepid/
git checkout Ubuntu-2.6.27-8.17
cp /boot/config-2.6.27-7-generic .config
make oldconfig    # answer "no" to the new options
make menuconfig   # uncheck "Paravirtualized guest support"
make-kpkg clean
fakeroot make-kpkg --initrd --append-to-version=-custom kernel_image kernel_headers
# wait about 3-4 hours; make-kpkg writes the .deb files to the parent directory
sudo dpkg -i ../linux-*.deb
sudo reboot

Sadly, even if you confirm this, commit 67d9b90a is a failsafe to keep the kernel from overwriting/breaking some Intel 1Gb network adapters.

Regards,
Robert

Robert Nelson (robertcnelson) wrote on 2008-11-13:

#34

git checkout Ubuntu-2.6.27-8.17
git revert 67d9b90   # oops, don't forget this step
cp /boot/config-2.6.27-7-generic .config

Dan (ldskjdfjsl83) wrote on 2008-11-14:

#35

I'm a newbie to linux development so I'd have to learn quite a bit to follow the above instruction. I'm guessing that 'git' means something like 'get latest version of code from source repository' and 'make' invokes the compiler and linker but I don't know what the "(no's)" refers to. There's some other things I'm puzzled by as well. Also, If building the code on your system takes 3-4 hours, on the atom it might take 8 or longer.

I'm guessing that to run these instructions in the terminal application, I'd need to be running 8.04.1 on the hard drive (not the live CD) and that after running the instructions, the system might not boot any longer. So I'd need to image the drive first and then restore the drive from the drive image after the test. I don't actually have a spare machine, so I could see this putting my main machine out of commission for a day or longer.

It sure would be a lot easier for me if someone has a live CD iso with the build already on it. I could very quickly burn a CD with the iso and report back what happens when I try to boot off of it.

Andre Blum (andre-blum-home) wrote on 2008-11-16:

#36

Robert, Dan,

I confirm that reverting 67d9b90 according to your instructions gives a bootable system also on a single core atom board:

andre@atom:~$ cat /proc/cpuinfo | grep model
model : 28
model name : Intel(R) Atom(TM) CPU 230 @ 1.60GHz
model : 28
model name : Intel(R) Atom(TM) CPU 230 @ 1.60GHz
andre@atom:~$ uname -a
Linux atom 2.6.27.4-custom #1 SMP Sun Nov 16 12:13:24 CET 2008 x86_64 GNU/Linux
andre@atom:~$

(Note that two processors are listed due to hyperthreading.)
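
The grep above cannot by itself tell two hyperthreads from two cores. On x86, /proc/cpuinfo normally also carries "siblings" and "cpu cores" fields that make the distinction explicit. The sketch below assumes those fields are present (they depend on the kernel configuration); cpuinfo_field and summarize_cpus are hypothetical helper names.

```shell
# cpuinfo_field NAME
# Print the value of the first "NAME : value" line read from stdin.
cpuinfo_field() {
    awk -F': *' -v name="$1" '
        { key = $1; sub(/[ \t]+$/, "", key) }
        key == name && !seen { print $2; seen = 1 }'
}

# summarize_cpus: read /proc/cpuinfo-style text on stdin and print the number
# of logical processors plus the siblings and core counts of the first package.
summarize_cpus() {
    info=$(cat)
    logical=$(printf '%s\n' "$info" | grep -c '^processor')
    siblings=$(printf '%s\n' "$info" | cpuinfo_field siblings)
    cores=$(printf '%s\n' "$info" | cpuinfo_field "cpu cores")
    echo "logical=$logical siblings=$siblings cores=$cores"
}

# Example use on a running system:
#   summarize_cpus < /proc/cpuinfo
```

When siblings is larger than cores, the extra logical processors come from hyperthreading rather than additional physical cores.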

Thanks! Now the question remains how this needs to be resolved in mainstream kernels

Regards,
Andre

Andre Blum (andre-blum-home) wrote on 2008-11-16:

#37

and it took 5 hours 50 minutes ;-)

Andre

Robert Nelson (robertcnelson) wrote on 2008-11-16:

#38

Thanks for the update Andre and taking the time to recompile the kernel.

- Amit/Luka Can you bump the heading to include the Atom 230 (x86_64)...

It's been a while (I've been playing with mainline 2.6.28-rc's for gem/kms/etc.), but Linus's original 2.6.27 release did not show this problem (I never tested 2.6.27.1+), and that's where I started the git bisect from.

Anyone know the best way to bring this up directly to the kernel-dev?

I'd like to send a message like:

- Kernel-dev's

Commit 67d9b90 enables a patch that prevents the flash corruption in specific Intel 1Gb LAN adapters. From what we found in Bug 279186 it also prevents both the Atom 230 and 330 from booting (x86_64). Now that it's been fixed in mainline 2.6.27.1+ and Ubuntu's kernel is sync'd to 2.6.27.4 (8.17), can it be fully reverted?

Regards,

Robert

Yusef Maali (usef) wrote on 2008-11-16:

#39

Robert,

in my atom box the "patched" kernel still doesn't work...

I have to admit that I used an unconventional method:
I compiled the kernel with a live 8.10 x86_64 Kubuntu on a Core2Duo (and it compiles in less than 5 hours ;)
I booted the Atom box with a netboot Debian live system (because it doesn't have a CD-ROM drive)
I installed the new kernel by chrooting into the broken installation.

I get, again, a kernel panic.

Anyway, now I will try to compile directly on the atom box, hoping this gives a working kernel...
BTW, where can I find an alpha5/6 iso to download? I'm not able to find it on the net... :(

Regards,
Yusef

Robert Nelson (robertcnelson) wrote on 2008-11-16:

#40

Yusef, strange...

Well, the only pre-final amd64 I could find was alpha-3. PS, I saw no difference when building it on an Athlon X2 vs an Atom 330, so don't waste time building on the slower platform; just build on the faster machine and transfer the *.deb over.

http://old-releases.ubuntu.com/releases/intrepid/

Regards,

Robert

Yusef Maali (usef) wrote on 2008-11-17:

#41

I have done my test with an Ubuntu Hardy server (8.04.1) x86_64.

The new kernel doesn't work... :(
I get the same kernel panic...

It is behaving in a very strange way, because the third time I've tried to boot, it worked!!
After that, only kernel panics.

Very very strange...
I need to buy a serial cable to log the kernel startup.

- Andre: it took exactly 3 hours to compile all the stuff... Maybe you didn't pass CONCURRENCY_LEVEL=2 to make-kpkg?

- Robert: if you would like to do more tests, just tell me. And if you would like to try the kernel packages I have created, I can upload them somewhere.

Anyway, the only difference between my board and a brand new D945GCLF is the BIOS: I have upgraded it to the latest version, 0103 (LF94510J.86A.0103.2008.0814.1910).

Yusef

Robert Nelson (robertcnelson) wrote on 2008-11-17:

#42

config-2.6.27-7-generic (83.3 KiB, text/plain)

Hi Yusef,

That boot pattern is similar to what I saw in alpha-5/6 (50/50 chance it would work).

The kernel in my case was around 250-300 MB, so I won't ask you to post it. Can you post the .config you used, and the first 30 or so lines of "git log"? That way I can rebuild an exact match. For reference, I've attached the one I was using, config-2.6.27-7-generic, pulled from 2.6.27-7.

3 hours? Hum, I should really try CONCURRENCY_LEVEL=4... Make sure the serial cable is a null modem.

Thanks

Robert

Yusef Maali (usef) wrote on 2008-11-17:

#43

config-2.6.27.4-custom (88.1 KiB, text/plain)

Robert,
thanks for your help!

Yes, it is a random pattern... sometimes it boots, sometimes it doesn't...
In my case the kernel is also around 250 MB, so as you suggested I have attached my .config

I'm rebuilding with your config file. With a simple diff, I'm not able to see if there are differences between the two configs.
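A plain diff of two .config files is noisy because options move around between versions. A minimal sketch (not something used in this thread) that compares two kernel configs option by option, assuming the usual "CONFIG_X=value" and "# CONFIG_X is not set" line formats; the names parse_config and diff_configs are made up for illustration:

```python
import re

# Matches "CONFIG_FOO=value" and "# CONFIG_FOO is not set".
_SET = re.compile(r"^(CONFIG_\w+)=(.*)$")
_UNSET = re.compile(r"^# (CONFIG_\w+) is not set$")


def parse_config(text):
    """Return {option: value}; options marked 'is not set' map to 'n'."""
    options = {}
    for line in text.splitlines():
        line = line.strip()
        m = _SET.match(line)
        if m:
            options[m.group(1)] = m.group(2)
            continue
        m = _UNSET.match(line)
        if m:
            options[m.group(1)] = "n"
    return options


def diff_configs(old_text, new_text):
    """Return sorted (option, old_value, new_value) tuples that differ.

    A value of None means the option does not appear in that file at all.
    """
    old, new = parse_config(old_text), parse_config(new_text)
    changes = []
    for name in sorted(set(old) | set(new)):
        if old.get(name) != new.get(name):
            changes.append((name, old.get(name), new.get(name)))
    return changes
```

Feeding it the contents of config-2.6.27.4-custom and config-2.6.27-7-generic would list only the options whose values differ, regardless of where they sit in each file.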

Yep, 3 hours and 2 minutes.
Of course, a null modem ;)

Yusef

Yusef Maali (usef) wrote on 2008-11-17:

#44

git.log (4.0 KiB, text/plain)

and the git log (~100 lines...)

Thanks again.
Yusef

Yusef Maali (usef) wrote on 2008-11-17:

#45

But is the config file you attached the one installed with a clean Intrepid x86_64 server edition?
I started from that one.

Thanks,
Yusef

Robert Nelson (robertcnelson) wrote on 2008-11-17:

#46

Thanks Yusef!

I'm not an expert on kernel options, but there's one thing in the config file that might have hit the bug:

--- config-2.6.27.4-custom 2008-11-16 21:33:14.000000000 -0600
+++ config-2.6.27-7-generic 2008-11-16 20:14:44.000000000 -0600
(lots of lines)
CONFIG_TRACING=y
-CONFIG_FTRACE=y <- This could be related?
-# CONFIG_IRQSOFF_TRACER is not set
-# CONFIG_SYSPROF_TRACER is not set
-# CONFIG_SCHED_TRACER is not set
-CONFIG_CONTEXT_SWITCH_TRACER=y
-# CONFIG_DYNAMIC_FTRACE is not set <- this was disabled to begin with, but the git bisect/revert pointed to this as the issue..

Regards,

Robert

Robert Nelson (robertcnelson) wrote on 2008-11-17:

#47

Hi Yusef,

That's right, you are using the server edition (with the server kernel); the one I posted was for the standard Intrepid desktop kernel (amd64 7.16). That might require a full git bisect from a working kernel. Do the standard and server editions share the same git tree?

Regards,

Robert

Andre Blum (andre-blum-home) wrote on 2008-11-17:

#48

Over 10 reboots I have a 100% success rate.
My BIOS is 2008.0427.2223
I followed Robert's exact steps, using 2.6.27-7-generic (not server) kernel config

Yanko Kaneti (yaneti) wrote on 2008-11-21:

#49

People still experiencing this issue might try the new BIOS update from Intel, version 0122. It seems to have helped here on a D945GCLF with Fedora rawhide x86_64.

Robert Nelson (robertcnelson) wrote on 2008-11-21:

#50

Thanks Yanko,

The release notes look promising, I'll give it a try over lunch on my Atom 330, now just to find a winxp harddrive and floppy drive... ;)

http://downloadcenter.intel.com/filter_results.aspx?strTypes=all&ProductID=2926&OSFullName=OS+Independent&lang=eng&strOSs=38&submit=Go

BIOS Version 0122

About This Release:
• November 18, 2008
• LF94510J.86A.0122.2008.1117.0113
• VBIOS info: Build Number: 1374 PC 14.12 08/28/2006 16:30:16
• PXE info: Realtek* RTL8111B/8111C/8101E/8102E Ethernet Controller v2.171 (080703)

New Fixes/Features:
• Fixed a potential system hang issue due to infinite loop in the variable function.

Robert Nelson (robertcnelson) wrote on 2008-11-21:

#51

Thanks Yanko,

BIOS LF94510J.86A.0122.2008.1117.0113 seems to do the trick: so far, after 2-3 reboots, no issues.

Linux myth-living 2.6.27-7-generic #1 SMP Tue Nov 4 19:33:06 UTC 2008 x86_64 GNU/Linux

Everyone else, give this a try, and report your results.
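When reporting results, it helps to state the BIOS version unambiguously. A rough sketch that splits an Intel BIOS ID string into its fields, assuming the layout is board.oem.version.year.monthday.time; that layout is inferred only from the strings quoted in this thread, and treating the last field as an HHMM build time is a guess. The names BiosId and parse_bios_id are hypothetical:

```python
from datetime import datetime
from typing import NamedTuple


class BiosId(NamedTuple):
    board: str       # e.g. "LF94510J"
    oem: str         # e.g. "86A"
    version: int     # e.g. 122 for "0122"
    built: datetime  # assumed build timestamp (year, month/day, HHMM)


def parse_bios_id(bios_id):
    """Split a string such as 'LF94510J.86A.0122.2008.1117.0113'.

    Assumes six dot-separated fields; raises ValueError otherwise.
    """
    parts = bios_id.strip().split(".")
    if len(parts) != 6:
        raise ValueError("expected 6 dot-separated fields: %r" % bios_id)
    board, oem, version, year, monthday, hhmm = parts
    # The last three fields are read together as YYYY, MMDD, HHMM.
    built = datetime.strptime(year + monthday + hhmm, "%Y%m%d%H%M")
    return BiosId(board, oem, int(version), built)
```

With this, the 0103 string posted earlier and the 0122 string above can be compared by their version field. On a running system the string itself can usually be read with "dmidecode -s bios-version".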

Regards,

Robert

Andre Blum (andre-blum-home) wrote on 2008-11-21:

#52

Robert, All,

Yes, with this BIOS upgrade I was able to boot the ubuntu generic kernel on the atom 230 (single core) as well.
Great work.

Regards
Andre

Jon Jennings (jon100) wrote on 2008-11-21:

#53

Fantastic.
With the new BIOS, 8.10 Alternate Desktop 64-bit CD now installs, reboots and md5 checks its CD perfectly.

This on an Intel D945GCLF motherboard/CPU

Many thanks to whoever here or at Intel has worked on this. This is going to make a very cute little file server.

Dan (ldskjdfjsl83) wrote on 2008-11-22:

#54

Cool, the BIOS upgrade fixed it for me too! I can now fully boot the 64-bit live CD all the way to the desktop. Thanks, everybody!

Now, if I could just get the integrated Ethernet working... The 32-bit version of 8.10 works fine with the integrated Ethernet. Well, that's a different problem for another day.

Cheers.

Andy Whitcroft (apw) wrote on 2008-11-28:

#55

This seems to have been fixed via a BIOS update from Intel, so I am closing this bug. I am marking it Invalid because it was not a bug in Linux, not because it was not a bug. If you are still seeing issues with this BIOS update installed, please re-open this bug by moving the linux task to 'New'. Thanks.

Changed in linux:
status:	Triaged → Invalid

Robstarusa (rob-naseca) wrote on 2008-12-22:

#56

This does not fix it for those of us NOT using the intel board. I have an MSI wind PC ("nettop") and I am still having this issue.

Robstarusa (rob-naseca) wrote on 2008-12-22:

#57

Whoops, I have the atom230. Same issue however.

Yusef Maali (usef) wrote on 2008-12-22:

#58

For the Intel board it was purely a BIOS issue.
Please check if there is an upgrade for your nettop.
Otherwise I think you should write to MSI support, pointing them to this bug report. MSI may have to update their board's BIOS.

Yusef.

J. Alexander Jacocks (jjacocks) wrote on 2009-01-02:

#59

As an additional note, I have the same Intel LF2 Atom 330 board, and the current BIOS (0137, dated 12/19/2008) causes the same panic. Does anyone else here see the same behavior?

Robert Nelson (robertcnelson) wrote on 2009-01-02:

#60

J. Alexander,

I wonder if they re'broke' it with the 12/19/2008 Bios.

All my testing was done with the Intel D945GCLF2 Atom 330 based motherboard, and the BIOS version "0122 - 11/18/2008" fixed all issues. (I never flashed the 12/19 update.)

Intel's archive versions can be found here:

http://downloadcenter.intel.com/Filter_Results.aspx?strOSs=38&strTypes=all&ProductID=2926&OSFullName=OS%20Independent&lang=eng&sType=prev

Can you confirm if the older 11/18 bios works in your D945GCLF2 board? If it fixes the problem, it might be time to email intel on this issue and include this bug report..

Regards,
Robert

J. Alexander Jacocks (jjacocks) wrote on 2009-01-02:

#61

Well, it looks like the problem might be in the 0137 flash image, because flashing the specific version listed in this bug allows the system to boot.

MarkG (movieman523) wrote on 2009-01-07:

#62

I'm just building an Atom system for MythTV based on the D945GCLF2 motherboard, and it wouldn't boot Mythbuntu 8.10 x64 with the 099 BIOS, but I upgraded to the latest 140 BIOS and now it's going through the CD check prior to installing. So that version appears to be OK.

MarkG (movieman523) wrote on 2009-01-07:

#63

For the record, that system is now installed, booting and running Mythbuntu 8.10 x64 quite happily with the 140 BIOS.

Robstarusa (rob-naseca) wrote on 2009-01-08:

#64

MSI won't FIX (WIND PC bought @ egg). I'd avoid the MSI. I guess I'm running i386 only....

aeneas (aeneascarver) wrote on 2009-03-03:

#65

This also affects the new Shuttle X27D Barebone (Dual Core Intel Atom 330). I hope they will update their BIOS :-( (http://global.shuttle.com/download03.jsp?PI=1209&PL=1)

I found no way to boot this machine with X86_64

GlenB (glenbirkbeck) wrote on 2009-03-03:

#66

I have been getting the same error for some time - I continue to boot from the last working version of the amd64 kernel (2.6.24-21). Whenever I do try and boot from the latest kernel version (currently a 2.6.27-11) I get the following error at boot:
PANIC: early exception 0e rip 10:ffffffff8022d611 error 0 cr2 ffffffffffffff5fc0f0

I had hoped that a subsequent kernel update would resolve it but it never has, so I continue to boot from the .24 kernel. My BIOS details are as follows:

# dmidecode 2.9
SMBIOS 2.3 present.
35 structures occupying 1479 bytes.
Table at 0x000F0000.

Handle 0x0000, DMI type 0, 20 bytes
BIOS Information
Vendor: Phoenix Technologies, LTD
Version: 641W1P24
Release Date: 08/30/2006
Address: 0xE0000
Runtime Size: 128 kB
ROM Size: 512 kB

Is there any workaround or resolution to this issue?
Many thanks
Glen

Giuseppe Dia (giusedia) wrote on 2009-04-18:

#67

I can confirm the issue was solved for me with Intel BIOS Update [LF94510J.86A] on an Intel® Desktop Board D945GCLF2
Linux ***** 2.6.27-11-generic #1 SMP Wed Apr 1 20:53:41 UTC 2009 x86_64 GNU/Linux

Amit Kucheria (amitk) on 2009-04-19

Changed in linux (Ubuntu):
assignee:	Amit Kucheria (amitk) → nobody

Adam Thompson (athompso) wrote on 2009-05-14:

#68

Also confirming that with Intel BIOS LF94510J.86A, Ubuntu 9.04 (64-bit) installs and operates correctly.
Note that even with updated BIOS, I am unable to correctly boot or operate the 8.10 OS release.
On the other hand, going WAY back to an old v6 disc I found, works fine (presumably due to the *lack* of ACPI support in that vintage).

It appears that a workaround is still desirable and possible for the 8.x LTS stream, and likely for the 9.x stream also: we have multiple hardware vendors represented in this tree of bug reports, not all of whom are providing updated BIOSes. The problem appears to be isolated to ACPI initialization: booting non-ACPI-aware kernels seems to work just fine, as do other OSes that do not rely on the ACPI DTs for hardware initialization, e.g. older OpenBSD and NetBSD, DOS, etc.

I think this should be re-escalated to kernel development; there's already precedent for in-kernel ACPI fixups (e.g. ASUS P4B-DS motherboard, off the top of my head) and it looks like this will likely affect a wide variety of Atom-based systems. Odd that it's Atom-specific, but chipset-independent, though.

x86_64 kernel oops on boot (dual-core Atom 330 board D945GCLF2)