[Geode LX] [ION603] kernels >= 2.6.31 fail to boot [initramfs]

Reported by Martin-Éric Racine on 2009-07-06
This bug affects 2 people
Affects: Linux. Status: Confirmed. Importance: Medium.
Affects: linux (Ubuntu). Importance: High. Assigned to: Unassigned.
Nominated for Karmic, Lucid and Maverick by Martin-Éric Racine.

Bug Description

linux-image-2.6.31-2-generic oopses on this FIC ION603 (Geode LX800), right near the end of executing the contents of the initramfs.

Reverting to linux-image-2.6.30-10-generic works; the system boots all the way to GDM as expected.

ProblemType: Bug
Architecture: i386
Date: Tue Jul 7 01:12:41 2009
DistroRelease: Ubuntu 9.10
HibernationDevice: RESUME=UUID=5ffade8f-b837-49eb-bb44-225617349ca3
Lsusb:
 Bus 001 Device 004: ID 0ace:1215 ZyDAS WLA-54L WiFi
 Bus 001 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
 Bus 002 Device 003: ID 03f9:0100 KeyTronic Corp. Keyboard
 Bus 002 Device 002: ID 046d:c00e Logitech, Inc. M-BJ58/M-BJ69 Optical Wheel Mouse
 Bus 002 Device 001: ID 1d6b:0001 Linux Foundation 1.1 root hub
MachineType: First International Computer, Inc. ION603
Package: linux-image-2.6.31-2-generic 2.6.31-2.15
ProcCmdLine: root=UUID=97b2628b-28a5-49f2-85f7-495728b3bef8 ro quiet splash
ProcEnviron:
 PATH=(custom, user)
 LANG=fi_FI.UTF-8
 SHELL=/bin/bash
ProcVersionSignature: Ubuntu 2.6.30-10.12-generic
RelatedPackageVersions: linux-backports-modules-2.6.30-10-generic N/A
SourcePackage: linux
Uname: Linux 2.6.30-10-generic i586
dmi.bios.date: 11/08/2007
dmi.bios.vendor: Phoenix Technologies, LTD
dmi.bios.version: 6.00 PG
dmi.board.name: ION603
dmi.board.vendor: First International Computer, Inc.
dmi.board.version: PCB 2.X
dmi.chassis.type: 3
dmi.modalias: dmi:bvnPhoenixTechnologies,LTD:bvr6.00PG:bd11/08/2007:svnFirstInternationalComputer,Inc.:pnION603:pvrVER2.X:rvnFirstInternationalComputer,Inc.:rnION603:rvrPCB2.X:cvn:ct3:cvr:
dmi.product.name: ION603
dmi.product.version: VER 2.X
dmi.sys.vendor: First International Computer, Inc.

Martin-Éric Racine (q-funk) wrote :

Architecture: i386
DistroRelease: Ubuntu 9.10
HibernationDevice: RESUME=UUID=5ffade8f-b837-49eb-bb44-225617349ca3
Lsusb:
 Bus 002 Device 003: ID 03f9:0100 KeyTronic Corp. Keyboard
 Bus 002 Device 002: ID 046d:c00e Logitech, Inc. M-BJ58/M-BJ69 Optical Wheel Mouse
 Bus 002 Device 001: ID 1d6b:0001 Linux Foundation 1.1 root hub
 Bus 001 Device 004: ID 0ace:1215 ZyDAS WLA-54L WiFi
 Bus 001 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
MachineType: First International Computer, Inc. ION603
Package: linux-image-2.6.31-2-generic 2.6.31-2.16
PackageArchitecture: i386
ProcCmdLine: root=UUID=97b2628b-28a5-49f2-85f7-495728b3bef8 ro quiet splash
ProcEnviron:
 SHELL=/bin/bash
 PATH=(custom, user)
 LANG=fi_FI.UTF-8
ProcVersionSignature: Ubuntu 2.6.30-10.12-generic
RelatedPackageVersions: linux-backports-modules-2.6.30-10-generic N/A
Uname: Linux 2.6.30-10-generic i586
UserGroups: adm admin cdrom dialout lpadmin plugdev sambashare sudo
dmi.bios.date: 11/08/2007
dmi.bios.vendor: Phoenix Technologies, LTD
dmi.bios.version: 6.00 PG
dmi.board.name: ION603
dmi.board.vendor: First International Computer, Inc.
dmi.board.version: PCB 2.X
dmi.chassis.type: 3
dmi.modalias: dmi:bvnPhoenixTechnologies,LTD:bvr6.00PG:bd11/08/2007:svnFirstInternationalComputer,Inc.:pnION603:pvrVER2.X:rvnFirstInternationalComputer,Inc.:rnION603:rvrPCB2.X:cvn:ct3:cvr:
dmi.product.name: ION603
dmi.product.version: VER 2.X
dmi.sys.vendor: First International Computer, Inc.

Martin-Éric Racine (q-funk) wrote :

It appears that running "apport-collect -p linux-image-2.6.31-2-generic 396286" attached the logs from a boot with the last good kernel (2.6.30) rather than logs from the kernel that fails.

Is there any way to dump the log for the kernel that fails during the initramfs stage instead?

Andy Whitcroft (apw) wrote :

If you are getting an oops in the initramfs and the system does not boot, then no, you won't reach a point where you can easily run apport-collect. You will normally see the panic on the screen, or can get it there with the dmesg command. If so, a digital photo is often an effective solution.

Martin-Éric Racine (q-funk) wrote :

Here's a screenshot of what I get on an 80x60 console.

Martin-Éric Racine (q-funk) wrote :

Still not fixed as of linux-image-2.6.31-4-generic. Is there any missing information that I can attach to this bug?

Martin-Éric Racine (q-funk) wrote :

Someone on the LKML reported successful booting on fairly similar hardware, when running a vanilla kernel compiled with the following .config options.

I would have loved to compare this with Ubuntu's kernel config to help track down the source of this issue, except that /boot/config-2.6.31-4-generic is only a partial config, because Ubuntu uses a config splitter to prepare its build targets, and /proc/config.gz is not enabled on Ubuntu kernels. :(

I still hope that the above config can be of use to the Ubuntu kernel team to try and track the source of the issue. :)
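
For anyone who does get hold of two complete configs, the comparison could be sketched roughly as follows. This is only a minimal illustration, not a tool used in this bug: the function names are made up for the example, and it assumes both inputs are ordinary kernel .config text with "CONFIG_X=value" and "# CONFIG_X is not set" lines. With a partial config, options missing on one side show up as None.

```python
import re

# Matches "CONFIG_FOO=y" and "# CONFIG_FOO is not set" lines.
_SET = re.compile(r"^(CONFIG_\w+)=(.*)$")
_UNSET = re.compile(r"^# (CONFIG_\w+) is not set$")


def parse_kconfig(text):
    """Return {option: value}; options marked 'is not set' map to 'n'."""
    options = {}
    for line in text.splitlines():
        line = line.strip()
        m = _SET.match(line)
        if m:
            options[m.group(1)] = m.group(2)
            continue
        m = _UNSET.match(line)
        if m:
            options[m.group(1)] = "n"
    return options


def diff_kconfig(a_text, b_text):
    """Return sorted (option, value_in_a, value_in_b) for every difference.

    An option absent from one side is reported as None on that side.
    """
    a, b = parse_kconfig(a_text), parse_kconfig(b_text)
    return [
        (name, a.get(name), b.get(name))
        for name in sorted(set(a) | set(b))
        if a.get(name) != b.get(name)
    ]
```

Reading the working vanilla config and the Ubuntu config into two strings and passing them to diff_kconfig() would give a short list of candidate options to look at first.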

Martin-Éric Racine (q-funk) wrote :

As requested by Leann Ogasawara:

I tested linux-image-2.6.31-020631rc5-generic (2.6.31-020631rc5)
from http://kernel.ubuntu.com/~kernel-ppa/mainline/v2.6.31-rc5/

I get the same kernel panic as above.

Changed in linux (Ubuntu):
importance: Undecided → High
status: New → Triaged
tags: added: regression-potential

Martin noted he's also using EXT3.

I'm working with Martin to do a rough bisect right now.

Changed in linux (Ubuntu):
assignee: nobody → Leann Ogasawara (leannogasawara)
Changed in linux (Ubuntu):
status: Triaged → In Progress
Martin-Éric Racine (q-funk) wrote :

To continue the series of mainline kernel tests Leann suggested:

2.6.30-020630-generic: works fine.
2.6.31-020631rc1gc0d1117-generic: kernel panic.

summary: - kernel 2.6.31-generic oops after loading initramfs
+ 2.6.31-generic: kernel panic near the end of initramfs execution
summary: - 2.6.31-generic: kernel panic near the end of initramfs execution
+ 2.6.31-generic: kernel panic near the end of initramfs run
summary: - 2.6.31-generic: kernel panic near the end of initramfs run
+ 2.6.31-generic: kernel panic near the end of initramfs

Hi Martin-Éric,

Thanks for testing and the feedback. We're going to try to put together some additional test kernels for you to try to continue bisecting between 2.6.30 and 2.6.31-rc1. We'll let you know when they're ready.
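
For readers unfamiliar with bisecting, the idea behind the series of test kernels can be sketched as a plain binary search. This is only an illustration under stated assumptions: the commit labels and the boots() callback are hypothetical, the first entry is assumed to boot, the last is assumed to panic, and the failure is assumed to persist in every build after the one that introduced it.

```python
def first_bad(builds, boots):
    """Binary-search an ordered list of builds for the first one that fails.

    Assumes builds[0] boots, builds[-1] does not, and that the failure,
    once introduced, is present in every later build.
    Returns the adjacent (last_good, first_bad) pair.
    """
    lo, hi = 0, len(builds) - 1  # lo: known good, hi: known bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if boots(builds[mid]):
            lo = mid
        else:
            hi = mid
    return builds[lo], builds[hi]


# Hypothetical history: eight commits, with the regression introduced in "c5".
commits = ["c0", "c1", "c2", "c3", "c4", "c5", "c6", "c7"]
last_good, culprit = first_bad(commits, lambda c: c < "c5")
```

Each round halves the remaining range, so every test kernel that gets booted and reported here removes half of the remaining candidate commits.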

Linux geode 2.6.30-999-generic #200908041153 SMP Tue Aug 4 12:48:19 UTC 2009 i586

This one booted successfully. Hurray!

I'm curious, what was the change that enabled it? Could someone attach a unified diff?

Martin-Éric Racine (q-funk) wrote :

FYI, this is the kernel module set that is pulled in by udev. I thought that it might be useful to add it here.

Martin-Éric Racine (q-funk) wrote :

Linux geode 2.6.30-999-generic #200908041829 SMP Wed Aug 5 08:58:04 UTC 2009 i586

Boots successfully.

Martin-Éric Racine (q-funk) wrote :

Linux geode 2.6.30-999-generic #200908051216 SMP Wed Aug 5 11:59:01 UTC 2009 i586

Boots successfully.

Thanks, I'll queue up the next one. Will post when we have an image.

Martin-Éric Racine (q-funk) wrote :

Linux geode 2.6.30-999-generic #200908061755 SMP Thu Aug 6 17:39:31 UTC 2009 i586

Boots successfully.

Thanks for the quick testing and feedback. Queuing next build.

Martin-Éric Racine (q-funk) wrote :

Linux geode 2.6.30-999-generic #200908071146 SMP Fri Aug 7 11:29:56 UTC 2009 i586

Boots successfully.

While waiting for the final test build, it might not hurt to verify that this still occurs with the latest 2.6.31-5 kernel. Thanks.

Martin-Éric Racine (q-funk) wrote :

2.6.31-5 has already been tested, as have all other 2.6.31 kernels pulled in by linux-generic. Kernel panic.

Martin-Éric Racine (q-funk) wrote :

Linux geode 2.6.30-999-generic #200908071658 SMP Fri Aug 7 16:40:56 UTC 2009 i586

Boots successfully.

Martin-Éric Racine (q-funk) wrote :

2.6.30-999.200908110142 does NOT boot.

It also seems to fail at an earlier stage than 2.6.31-5 does. See enclosed snapshot.

Changed in linux:
status: Unknown → Confirmed
tags: added: apport-collected
Stefan Bader (smb) on 2009-10-07
Changed in linux (Ubuntu):
assignee: Leann Ogasawara (leannogasawara) → Stefan Bader (stefan-bader-canonical)
summary: - 2.6.31-generic: kernel panic near the end of initramfs
+ [Geode LX] [OLPC] 2.6.31-generic: kernel panic near the end of initramfs
tags: added: regression-karmic
Stefan Bader (smb) wrote :

This actually looks interesting. Maybe some light somewhere? :) Again the corruption seems not to have happened at all. And this time the structure was not modified. I only moved the init statements somewhat. But before getting too excited, could you give v4 a go (and again post the dmesg of that)? I hope this sheds a bit more light on it.

Stefan Bader (smb) wrote :

Somehow those results do not really make sense. In v4, the init code is back in the place it was before, and the only difference between the code that results in corrupted pointers and this one (which does not show any corruption at all) is that there were a few more calls for validating the pointers after the init function supposedly set the right values.
Now this really raises the question of what the heck is going on there. It happens without SMP, so this would rule out a race condition. The same code works on different hardware (I run it without any problems). And running the same procedure on the Geode seems to produce different results when the code path changes a bit, even without really changing the effective way things are done. I wonder whether v5 (which does only a limited number of checks and makes no function call to do the check) brings back the failure warnings or still runs without any output. Could you do yet another run and post the dmesg?

Martin-Éric Racine (q-funk) wrote :

As a point of information, I recently changed the fstab entries to state ext4, to benefit from the slightly faster performance that new features common to both ext3 and ext4 make possible. In principle, this should not change anything, since both ext3 and ext4 call the same common fsattr functions but, in case there was a cut&paste error that only affected ext3 and not ext4, this could have some impact.


To make sure the change of the fstab did not have an impact, go back to the v2 version
(which previously showed the corruption) and check whether you find bad acl messages
now. If not, then it would be most interesting to get back to the old state.

Stefan Bader (smb) wrote :

So both (ext3 and ext4) show the corruption messages when running with v2, but v5 never hits them. For better understanding I am attaching the diff between v2 and v5. As one can see there is no real change in that. In __iget there is just a printk added for the case that inode_check_acl() finds something. And all the changes to init_inode() just query i_acl and i_default_acl, which are usually set in inode_init_always(). There are just two cases where the acl values are not initialized and in both the test in init_inode() should then return NULL. And of course either way I would expect either a debug message here or at least hitting the problem in destroy_inode(). But as soon as this code is added, all problems go away. This is something I have a hard time explaining.

Martin-Éric Racine (q-funk) wrote :

Just to confirm, this issue still applies to 2.6.32-2-generic.

Stefan, how about attaching all of your diffs to the upstream bug and asking the LKML for advice? I think that you and Leann have already done a fine job of narrowing down the issue and, at this point, the authors of the upstream code really need to step in and contribute their share in fixing this regression.

Martin-Éric Racine (q-funk) wrote :

I'll also add that a Debian developer (dilinger) who is also a member of the OLPC kernel team would be willing to help, but he cannot do much until you've attached your diffs to the upstream bug.

Stefan Bader (smb) wrote :

I added the two patches with some comments to the upstream bug report. At this point I guess it would be interesting to have a second confirmation with a different Geode of the same model, to rule out a single misbehaving machine. I heard of others saying they have problems, but was that exactly the same crash, with the acl pointers corrupted in that particular way?

tags: added: regression-release
removed: regression-potential
tags: added: karmic
Martin-Éric Racine (q-funk) wrote :

Added tags for Lucid, since this issue is still unresolved and will blow up in people's faces when they upgrade from Hardy.

tags: added: lucid regression-lucid
Martin-Éric Racine (q-funk) wrote :

I'm really wondering what to do about this one since LKML has been rather uncooperative and yet it already affects those upgrading from Jaunty to Karmic. However, the real concern is for LTS->LTS+1 upgrades. Geode support in Hardy is rock-solid, whereas this show-stopper affects Lucid.

Martin-Éric Racine (q-funk) wrote :

Peter Anvin suggested in the upstream report that enforcing -march=i386 as a compiler option might be all that's required to fix this. Could new test packages be built using the following patch?

http://git.kernel.org/tip/17a2a9b57a9a7d2fd8f97df951b5e63e0bd56ef5

Martin-Éric Racine (q-funk) wrote :

Repeatedly trying to rebuild 2.6.32-9-generic with Peter Anvin's patch following instructions at https://help.ubuntu.com/community/Kernel/Compile consistently fails this way:

  CC arch/x86/kernel/alternative.o
  CC arch/x86/kernel/i8253.o
  CC arch/x86/kernel/pci-nommu.o
  CC arch/x86/kernel/tsc.o
  CC arch/x86/kernel/io_delay.o
  CC arch/x86/kernel/rtc.o
  CC arch/x86/kernel/trampoline.o
  CC arch/x86/kernel/process.o
arch/x86/kernel/process.o: final close failed: File truncated
make[5]: *** [arch/x86/kernel/process.o] Error 1
make[4]: *** [arch/x86/kernel] Error 2
make[3]: *** [arch/x86] Error 2
make[2]: *** [sub-make] Error 2
make[1]: *** [/home/q-funk/Projektit/linux-2.6.32/debian/stamps/stamp-build-generic] Error 2
make: *** [binary-generic] Error 2

This is on a system with 1 GB of RAM, so I'm really not sure how this "file truncated" keeps on showing up.

Stefan Bader (smb) wrote :

Placed test kernel (2.6.31-17.54 + patch mentioned by hpa in the upstream bug) to http://people.canonical.com/~smb/bug396286/

Martin-Éric Racine (q-funk) wrote :

Tested. Crashes as before.

Could we apply this and your extra debug message patch to something 2.6.32 as well and build test packages with that? It seems that 2.6.31 cannot work with Plymouth and some other novelties found in Lucid, plus upstream probably wants us to try against the latest and greatest.

Stefan Bader (smb) wrote :

Uploaded linux-image-2.6.32-10-generic_2.6.32-10.14+bug396286v2_i386.deb to http://people.canonical.com/~smb/bug396286/

Martin-Éric Racine (q-funk) wrote :

2.6.32-10.14+bug396286v2 boots. dmesg attached.

Martin-Éric Racine (q-funk) wrote :

I'm wondering if the patch that was used to produce 2.6.32-10.14+bug396286v2 could be added to the Lucid -generic kernel?

While I realize that it's not a proper fix, let's keep in mind that Lucid is the next LTS and, as such, the last thing we want is a massive wave of complaints from users of thin clients (most of which are based on some Geode variant) whose whole classroom of LX800-based devices can no longer boot after the upgrade from Hardy to Lucid.

This of course doesn't excuse us from finding the real cause of the issue and fixing it properly but, if anyone asks me, a piece of gaffer tape that somehow prevents a hardware management disaster from taking place is better than no solution at all.

Martin-Éric Racine (q-funk) wrote :

Leann? Stefan?

Martin-Éric Racine (q-funk) wrote :

It seems that we have some progress.

In an attempt to debug this issue, I compared notes with someone on Fedora for whom the same hardware works. As a test, I used their kernel 2.6.31 config (with a couple of small modifications to build specific drivers as built-in) as attached to build my own kernel using make-kpkg. Much to my amazement, this kernel boots fine, as long as I specify root=/dev/sda1 on the GRUB cmdline.

However, for some reason, kernel-package no longer creates an initrd.img, even when the --initrd option was specified to build the kernel-image target. Yet, as soon as I created one using "sudo update-initramfs -k 2.6.31.12-geodelx -c" and rebooted, the kernel failed to boot as before.

Just to be safe, I deleted the initrd and rebooted again, letting udev perform its work after /sbin/init has been launched by the kernel. Lo and behold, it worked again!

As such, it seems that something included in the initramfs image is what interferes with the ACL code, destroys some inodes, and makes the kernel crash in a non-recoverable way.

Interestingly enough, we still get the previous error messages when booting with this barebone kernel, without an initrd.img, but the error is non-fatal. The output of dmesg -r is attached next.

Martin-Éric Racine (q-funk) wrote :

Performing the same test as above (removal of initrd.img and modification of GRUB's menu.lst) using 2.6.32-14-generic further confirms that the issue is with something that gets included in the initramfs image.

The relevant GRUB menu excerpt:

title Ubuntu lucid (development branch), kernel 2.6.32-14-generic
kernel /boot/vmlinuz-2.6.32-14-generic rootfstype=ext4 root=/dev/sda1 ro vga=795 quiet splash
quiet

Martin-Éric Racine (q-funk) wrote :

The resulting raw dmesg output.

Martin-Éric Racine (q-funk) wrote :

Apparently, in all cases, something that touches sysfs does something nasty, but what?

Instructions on how to locate exactly which part of the initramfs image payload causes this are welcome.
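
One possible way to go about this, sketched under assumptions: treat each item that goes into the initramfs image (module, script, hook) as a candidate, rebuild the image with only half of the candidates, and see whether the crash still occurs. The crashes() callback below stands for that manual rebuild-and-reboot step, the item names are placeholders, and the approach only works if a single item triggers the crash on its own.

```python
def find_culprit(candidates, crashes):
    """Narrow an unordered candidate list down to one item by halving.

    crashes(subset) must report whether an image built from only that
    subset still fails. Assumes exactly one candidate causes the failure
    independently of the others.
    """
    pool = list(candidates)
    while len(pool) > 1:
        half = pool[: len(pool) // 2]
        # Keep whichever half still reproduces the crash.
        pool = half if crashes(half) else pool[len(pool) // 2:]
    return pool[0]


# Placeholder item names; "item_d" plays the culprit in this illustration.
items = ["item_a", "item_b", "item_c", "item_d", "item_e"]
suspect = find_culprit(items, lambda subset: "item_d" in subset)
```

If the failure needs two items together, this halving will not converge on the right answer, so the result should be double-checked by removing only the suspected item from an otherwise complete image.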

Martin-Éric Racine (q-funk) wrote :

Stefan suggested trying mem=nopentium, but this did not have any apparent effect. However, raid=noautodetect did: when booting without any initramfs image, the kernel no longer shows any paging error or destroyed inode at all.

Martin-Éric Racine (q-funk) wrote :

Testing the mainline 2.6.32 kernel on other LX800-based hardware (Artec ThinCan DBE61C-USB), I notice that everything boots as normal.

It thus appears that some recent changes in the kernel might have succeeded in exposing BIOS issues on some specific hardware.

I suppose that we could choose to ignore this bug and move on, but doing so would pose a problem: an awful lot of the Geode-based hardware sold by different vendors consists of branded versions of this same FIC ION603. This includes the Linutop-2, Inveneo desktop, Koolu ... and many others that came with Ubuntu pre-installed.

Personally, I think that a more positive outcome would involve a combination of Canonical's technical support (who are known to have certified some of the above hardware for Ubuntu) and some of the above vendors contacting First International Computer to work at finding a common solution together, which could possibly involve releasing an updated BIOS along with Ubuntu tools to flash the EPROM from the command line.

summary: - [Geode LX] [OLPC] 2.6.31-generic: kernel panic near the end of initramfs
+ [Geode LX] [ION603] 2.6.31-generic: kernel panic near the end of
+ initramfs

FYI it appears that AMD decided to jump in and contact the FIC engineering team themselves. I'll keep everyone informed on any progress via this bug and the upstream one.

Martin-Éric Racine (q-funk) wrote :

Here's the info provided by dmidecode, in case it can help someone figure out what is going on.

Martin-Éric Racine (q-funk) wrote :

Here's a snapshot of the BIOS splash, as requested.

summary: - [Geode LX] [ION603] 2.6.31-generic: kernel panic near the end of
- initramfs
+ [Geode LX] [ION603] kernels >= 2.6.31 fail to boot [initramfs]
tags: added: kernel-core kernel-reviewed
Stefan Bader (smb) wrote :

Setting this bug back to Triaged, as I don't know how to make progress here and don't want to block anybody with better ideas from picking it up.

Changed in linux (Ubuntu):
assignee: Stefan Bader (stefan-bader-canonical) → nobody
status: In Progress → Triaged
Jeremy Foshee (jeremyfoshee) wrote :

Changed status to WontFix per the dropping of support for Geode in Maverick. This is per discussion with the Ubuntu Kernel Team as to the ongoing status of this issue.

~JFo

Changed in linux (Ubuntu):
status: Triaged → Won't Fix
Martin-Éric Racine (q-funk) wrote :

It would be a very good idea for the kernel team to discuss this sort of far-reaching decision with the user community BEFORE making it. It is also worth noting that the decision directly contradicts the existing decision to accommodate the Geode in the libc6 compilation options.

Martin-Éric Racine (q-funk) wrote :

I'd like to point out that the linux-image-2.6.36-0-generic that popped up into Natty suddenly resolves this issue.

Meanwhile, the 2.6.35-22.34 that is currently pulled by linux-generic does not.

This seems to indicate that this has indeed been a kernel issue all along and NOT a BIOS issue!

It would be highly desirable for the kernel team to find out exactly what fixed it and to backport the patch into Lucid.

Changed in linux (Ubuntu):
status: Won't Fix → Triaged
Changed in linux (Ubuntu):
status: Triaged → Won't Fix
Changed in linux:
status: Confirmed → Invalid
Changed in linux:
importance: Unknown → Medium
Martin-Éric Racine (q-funk) wrote :

Fixed by 2.6.36-1-generic and broken again as of 2.6.38-1-generic, which makes 2.6.37-12 the last good kernel on this hardware.

Changed in linux:
status: Invalid → Confirmed
Martin-Éric Racine (q-funk) wrote :

Working again as of 2.6.38-7-generic.

Changed in linux:
status: Confirmed → Invalid
Changed in linux:
status: Invalid → Confirmed
Displaying first 40 and last 40 of 167 comments.