[Geode LX] [ION603] kernels >= 2.6.31 fail to boot [initramfs]

Bug #396286 reported by Martin-Éric Racine on 2009-07-06
This bug affects 2 people
Affects Status Importance Assigned to Milestone
Linux
Confirmed
Medium
linux (Ubuntu)
High
Unassigned
Nominated for Karmic by Martin-Éric Racine
Nominated for Lucid by Martin-Éric Racine
Nominated for Maverick by Martin-Éric Racine

Bug Description

linux-image-2.6.31-2-generic oops on this FIC ION 603 (Geode LX800), right near the end of executing the content of the initramfs.

Reverting to linux-image-2.6.30-10-generic works; the system boots all the way to GDM as expected.

ProblemType: Bug
Architecture: i386
Date: Tue Jul 7 01:12:41 2009
DistroRelease: Ubuntu 9.10
HibernationDevice: RESUME=UUID=5ffade8f-b837-49eb-bb44-225617349ca3
Lsusb:
 Bus 001 Device 004: ID 0ace:1215 ZyDAS WLA-54L WiFi
 Bus 001 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
 Bus 002 Device 003: ID 03f9:0100 KeyTronic Corp. Keyboard
 Bus 002 Device 002: ID 046d:c00e Logitech, Inc. M-BJ58/M-BJ69 Optical Wheel Mouse
 Bus 002 Device 001: ID 1d6b:0001 Linux Foundation 1.1 root hub
MachineType: First International Computer, Inc. ION603
Package: linux-image-2.6.31-2-generic 2.6.31-2.15
ProcCmdLine: root=UUID=97b2628b-28a5-49f2-85f7-495728b3bef8 ro quiet splash
ProcEnviron:
 PATH=(custom, user)
 LANG=fi_FI.UTF-8
 SHELL=/bin/bash
ProcVersionSignature: Ubuntu 2.6.30-10.12-generic
RelatedPackageVersions: linux-backports-modules-2.6.30-10-generic N/A
SourcePackage: linux
Uname: Linux 2.6.30-10-generic i586
dmi.bios.date: 11/08/2007
dmi.bios.vendor: Phoenix Technologies, LTD
dmi.bios.version: 6.00 PG
dmi.board.name: ION603
dmi.board.vendor: First International Computer, Inc.
dmi.board.version: PCB 2.X
dmi.chassis.type: 3
dmi.modalias: dmi:bvnPhoenixTechnologies,LTD:bvr6.00PG:bd11/08/2007:svnFirstInternationalComputer,Inc.:pnION603:pvrVER2.X:rvnFirstInternationalComputer,Inc.:rnION603:rvrPCB2.X:cvn:ct3:cvr:
dmi.product.name: ION603
dmi.product.version: VER 2.X
dmi.sys.vendor: First International Computer, Inc.

Martin-Éric Racine (q-funk) wrote :

Architecture: i386
DistroRelease: Ubuntu 9.10
HibernationDevice: RESUME=UUID=5ffade8f-b837-49eb-bb44-225617349ca3
Lsusb:
 Bus 002 Device 003: ID 03f9:0100 KeyTronic Corp. Keyboard
 Bus 002 Device 002: ID 046d:c00e Logitech, Inc. M-BJ58/M-BJ69 Optical Wheel Mouse
 Bus 002 Device 001: ID 1d6b:0001 Linux Foundation 1.1 root hub
 Bus 001 Device 004: ID 0ace:1215 ZyDAS WLA-54L WiFi
 Bus 001 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
MachineType: First International Computer, Inc. ION603
Package: linux-image-2.6.31-2-generic 2.6.31-2.16
PackageArchitecture: i386
ProcCmdLine: root=UUID=97b2628b-28a5-49f2-85f7-495728b3bef8 ro quiet splash
ProcEnviron:
 SHELL=/bin/bash
 PATH=(custom, user)
 LANG=fi_FI.UTF-8
ProcVersionSignature: Ubuntu 2.6.30-10.12-generic
RelatedPackageVersions: linux-backports-modules-2.6.30-10-generic N/A
Uname: Linux 2.6.30-10-generic i586
UserGroups: adm admin cdrom dialout lpadmin plugdev sambashare sudo
dmi.bios.date: 11/08/2007
dmi.bios.vendor: Phoenix Technologies, LTD
dmi.bios.version: 6.00 PG
dmi.board.name: ION603
dmi.board.vendor: First International Computer, Inc.
dmi.board.version: PCB 2.X
dmi.chassis.type: 3
dmi.modalias: dmi:bvnPhoenixTechnologies,LTD:bvr6.00PG:bd11/08/2007:svnFirstInternationalComputer,Inc.:pnION603:pvrVER2.X:rvnFirstInternationalComputer,Inc.:rnION603:rvrPCB2.X:cvn:ct3:cvr:
dmi.product.name: ION603
dmi.product.version: VER 2.X
dmi.sys.vendor: First International Computer, Inc.

Martin-Éric Racine (q-funk) wrote :

It appears that using "apport-collect -p linux-image-2.6.31-2-generic 396286" provided the logs from booting using the last good kernel (2.6.30) rather than the one from the failed log.

Is there any way to dump the log for the kernel that fails during the initramfs stage instead?

Andy Whitcroft (apw) wrote :

If you are getting an oops in the initramfs and the system does not boot, then no, you won't reach a point where you can easily run apport-collect. Normally you will see the panic on the screen, or you can get it there with the dmesg command. If so, a digital photo is often an effective solution here.

Martin-Éric Racine (q-funk) wrote :

Here's a screenshot of what I get on a 80x60 console.

Martin-Éric Racine (q-funk) wrote :

Still not fixed as of linux-image-2.6.31-4-generic. Is there any missing information that I can attach to this bug?

Martin-Éric Racine (q-funk) wrote :

Someone on the LKML reported successful booting on fairly similar hardware, when running a vanilla kernel compiled with the following .config options.

I would have loved to compare this with Ubuntu's kernel config to help track the source of this issue, except that /boot/config-2.6.31-4-generic only is a partial config, because Ubuntu uses config splitter to prepare its build targets, and /proc/config.gz is not enabled on Ubuntu kernels. :(

I still hope that the above config can be of use to the Ubuntu kernel team to try and track the source of the issue. :)

Martin-Éric Racine (q-funk) wrote :

As requested by Leann Ogasawara:

I tested linux-image-2.6.31-020631rc5-generic (2.6.31-020631rc5)
from http://kernel.ubuntu.com/~kernel-ppa/mainline/v2.6.31-rc5/

I get the same kernel panic as above.

Changed in linux (Ubuntu):
importance: Undecided → High
status: New → Triaged
tags: added: regression-potential

Martin noted he's also using EXT3.

I'm working with Martin to do a rough bisect right now.

Changed in linux (Ubuntu):
assignee: nobody → Leann Ogasawara (leannogasawara)
Changed in linux (Ubuntu):
status: Triaged → In Progress
Martin-Éric Racine (q-funk) wrote :

To continue the series of mainline kernel tests Leann suggested:

2.6.30-020630-generic: works fine.
2.6.31-020631rc1gc0d1117-generic: kernel panic.

summary: - kernel 2.6.31-generic oops after loading initramfs
+ 2.6.31-generic: kernel panic near the end of initramfs execution
summary: - 2.6.31-generic: kernel panic near the end of initramfs execution
+ 2.6.31-generic: kernel panic near the end of initramfs run
summary: - 2.6.31-generic: kernel panic near the end of initramfs run
+ 2.6.31-generic: kernel panic near the end of initramfs

Hi Martin-Éric,

Thanks for testing and the feedback. We're going to try to put together some additional test kernels for you to try to continue bisecting between 2.6.30 and 2.6.31-rc1. We'll let you know when they're ready.

Linux geode 2.6.30-999-generic #200908041153 SMP Tue Aug 4 12:48:19 UTC 2009 i586

This one booted successfully. Hurray!

I'm curious, what was the change that enabled it? Could someone attach a unified diff?

Martin-Éric Racine (q-funk) wrote :

FYI, this is the kernel module set that is pulled in by udev. I thought that it might be useful to add it here.

Martin-Éric Racine (q-funk) wrote :

Linux geode 2.6.30-999-generic #200908041829 SMP Wed Aug 5 08:58:04 UTC 2009 i586

Boots successfully.

Martin-Éric Racine (q-funk) wrote :

Linux geode 2.6.30-999-generic #200908051216 SMP Wed Aug 5 11:59:01 UTC 2009 i586

Boots successfully.

Thanks, I'll queue up the next one. Will post when we have an image.

Martin-Éric Racine (q-funk) wrote :

Linux geode 2.6.30-999-generic #200908061755 SMP Thu Aug 6 17:39:31 UTC 2009 i586

Boots successfully.

Thanks for the quick testing and feedback. Queuing next build.

Martin-Éric Racine (q-funk) wrote :

Linux geode 2.6.30-999-generic #200908071146 SMP Fri Aug 7 11:29:56 UTC 2009 i586

Boots successfully.

While waiting for the final test build, might not hurt to verify this remains with the latest 2.6.31-5 kernel. Thanks.

Martin-Éric Racine (q-funk) wrote :

2.6.31-5 has already been tested, as have all the other 2.6.31 kernels that get pulled in by linux-generic. Kernel panic.

Martin-Éric Racine (q-funk) wrote :

Linux geode 2.6.30-999-generic #200908071658 SMP Fri Aug 7 16:40:56 UTC 2009 i586

Boots successfully.

Martin-Éric Racine (q-funk) wrote :

2.6.30-999.200908110142 does NOT boot.

It also seems to fail at an earlier stage than 2.6.31-5 does. See enclosed snapshot.

Martin-Éric Racine (q-funk) wrote :

Linux geode 2.6.30-999-generic #200908112132 SMP Tue Aug 11 21:12:50 UTC 2009 i586

Boots successfully.

Martin-Éric Racine (q-funk) wrote :

Linux geode 2.6.30-999-generic #200908121741 SMP Wed Aug 12 17:22:08 UTC 2009 i586

Boots successfully.

Changed in linux:
status: Unknown → Confirmed
Martin-Éric Racine (q-funk) wrote :

2.6.30-999.200908122359 does NOT boot. Snapshot attached.

Martin-Éric Racine (q-funk) wrote :

Ingo Molnar pointed out that the only Geode-specific commit he can spot is this one:

d6c585a: x86: geode: Mark mfgpt irq IRQF_TIMER to prevent resume failure

Could this be our suspect?

Martin-Éric Racine (q-funk) wrote :

Here's a larger snapshot of what I get with linux-image-2.6.31-5-generic version 2.6.31-5.24 (based on upstream 2.6.31-rc5), thanks to vesafb and a 1280x1024 framebuffer. The main advantage over the initial snapshot is that it fits more lines of the crash into the visible area.

Next image:

http://kernel.ubuntu.com/~ogasawara/mainline/daily/lp396286/bisectf21f622/linux-image-2.6.30-999-generic_2.6.30-999.200908131643_i386.deb

I unfortunately don't see the commit Ingo pointed out in the remaining list of commits we're bisecting.

Martin-Éric Racine (q-funk) wrote :

Someone else said on the LKML:

> > http://launchpadlibrarian.net/30267494/2.6.31-5.24.jpg
>
> Hmm. This looks like a sysfs oops to my untrained eye.

Martin-Éric Racine (q-funk) wrote :

Linux geode 2.6.30-999-generic #200908131643 SMP Thu Aug 13 16:25:22 UTC 2009 i586

Boots successfully.

Martin-Éric Racine (q-funk) wrote :

2.6.30-999.200908132259 does NOT boot. Snapshot attached.

Martin-Éric Racine (q-funk) wrote :

PS: would it be possible to include vesafb as a module in all test kernels? Thank you!

Hrm, that's odd. vesafb should be getting built as a module for the test kernels, but apparently it isn't happening as you've noted. We'll investigate.

So I'm told we're using the Jaunty config within our mainline build scripts. Mainly because if we used Karmic's config it would enable KMS if someone happened to install it within Jaunty. So there is definitely some discrepancy. Seeing as we have 1-2 more test builds to go I'd like to finish isolating the patch and then I'll build you a final mainline and Karmic test kernel with the patch reverted for you to confirm it is indeed the offending patch regardless of the different configs that we've been using to build.

Martin-Éric Racine (q-funk) wrote :

Noted and understood.

Martin-Éric Racine (q-funk) wrote :

Linux geode 2.6.30-999-generic #200908141609 SMP Fri Aug 14 15:53:21 UTC 2009 i586

Boots successfully.

Ok, the bisect has narrowed down the following:

f19d4a8fa6f9b6ccf54df0971c97ffcaa390b7b0 is first bad commit
commit f19d4a8fa6f9b6ccf54df0971c97ffcaa390b7b0
Author: Al Viro <email address hidden>
Date: Mon Jun 8 19:50:45 2009 -0400

    add caching of ACLs in struct inode

    No helpers, no conversions yet.

    Signed-off-by: Al Viro <email address hidden>

Martin-Éric Racine (q-funk) wrote :

What if we revert that specific commit against 2.6.31-rc6, as a test (reverse-apply the change as a patch)?

Martin-Éric Racine (q-funk) wrote :

New snapshot, showing the current kernel panic on 2.6.31-6.

Martin-Éric Racine (q-funk) wrote :

Still not fixed as of 2.6.31-7. Panic output is the same as in 2.6.31-6.

Martin-Éric Racine (q-funk) wrote :

Still not fixed as of 2.6.31-9. Panic output similar to 2.6.31-6.

Hi,

Can you try the following kernel build?

http://kernel.ubuntu.com/~kernel-ppa/mainline/daily/2009-06-19a/

It's a build of f19d4a8fa6f9b6ccf54df0971c97ffcaa390b7b0 (the first bad commit). I'm expecting it to crash. However it'll confirm http://lkml.org/lkml/2009/8/16/252:

  f19d4a8fa6f9b6ccf54df0971c97ffcaa390b7b0 crashes
  f19d4a8fa6f9b6ccf54df0971c97ffcaa390b7b0~1 boots fine

f19d4a8fa6f9b6ccf54df0971c97ffcaa390b7b0~1 was 3e63cbb1efca7dd3137de1bb475e2e068e38ef23 which you tested and confirmed was booting fine in comment #59.

Martin-Éric Racine (q-funk) wrote :

Linux geode 2.6.30-999-generic #200909032144 SMP Thu Sep 3 21:35:39 UTC 2009 i586

Boots fine.

Bah, seeing you say it booted fine when I was expecting it to fail I looked and realized I had the wrong patch queued, ugh :( So you just tested f19d4a8fa6f9b6ccf54df0971c97ffcaa390b7b0~1 and indeed re-confirmed what we already knew, that it boots fine. I'm sooo sorry I wasted your time on that one. I'm going to requeue f19d4a8fa6f9b6ccf54df0971c97ffcaa390b7b0 for reals this time . . .

Martin-Éric Racine (q-funk) wrote :

Indeed, doesn't boot.

Martin-Éric Racine (q-funk) wrote :

Still not fixed as of 2.6.31-10.31 a.k.a. upstream 2.6.31 final.

Martin-Éric Racine (q-funk) wrote :

Just to recap, this is on a host where / is an ext3 file system.

# /etc/fstab: static file system information.
#
# <file system> <mount point> <type> <options> <dump> <pass>
proc /proc proc defaults 0 0
# /dev/sda1
UUID=97b2628b-28a5-49f2-85f7-495728b3bef8 / ext3 relatime,errors=remount-ro 0 1
# /dev/sda5
UUID=5ffade8f-b837-49eb-bb44-225617349ca3 none swap sw 0 0

At Stefan Bader's request, here is what /proc/cpuinfo says:

processor : 0
vendor_id : AuthenticAMD
cpu family : 5
model : 10
model name : Geode(TM) Integrated Processor by AMD PCS
stepping : 2
cpu MHz : 497.996
cache size : 128 KB
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 1
wp : yes
flags : fpu de pse tsc msr cx8 sep pge cmov clflush mmx mmxext 3dnowext 3dnow up
bogomips : 995.99
clflush size : 32
power management:

Martin-Éric Racine (q-funk) wrote :

BUG: unable to handle kernel paging request at ffffb4ff
IP: [<c01f716b>] __destroy_inode+0x4b/0x80
*pde = 00810067 *pte = 00000000
Oops: 0000 [#1] SMP
last sysfs file: /sys/power/resume

Stefan Bader (smb) wrote :

Thanks Martin-Éric,

checking against the code this confirms that the bug occurs in __destroy_inode at the following position:

void __destroy_inode(struct inode *inode)
{
    BUG_ON(inode_has_buffers(inode));
    ima_inode_free(inode);
    security_inode_free(inode);
    fsnotify_inode_delete(inode);
#ifdef CONFIG_FS_POSIX_ACL
    if (inode->i_acl && inode->i_acl != ACL_NOT_CACHED)
        posix_acl_release(inode->i_acl); /* here */
    if (inode->i_default_acl && inode->i_default_acl != ACL_NOT_CACHED)
        posix_acl_release(inode->i_default_acl);
#endif
}

EAX holds the i_acl pointer, so it looks like it is (repeatably) 0xffffb4ff. In theory i_acl is either a pointer to an acl structure, or 0xffffffff (ACL_NOT_CACHED), or 0x0 (uninitialized). The address causing the bug seems a bit high for a valid pointer. But just to be completely sure, I put a kernel at http://people.canonical.com/~smb/bug396286/ which tries to catch a double-free case.
On the other hand, 0xffffb4ff might be caused by something writing either 0xb4ff into the first 16-bit word (little endian) or 0xb4 at offset 1 into the area that holds the pointer. Before the change that added i_acl and i_default_acl, the last field was a private pointer. Could something (this would need to be an externally built module) still be using the wrong header file?
One thing to try next would be to check whether the other pointer is corrupted too. I will try to get something sensible up and then post here.
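
To make the byte-level reasoning above concrete, here is a minimal user-space sketch. It is not kernel code, and the helper name differing_bytes_le() is hypothetical. It assumes the slot originally held ACL_NOT_CACHED (0xffffffff on 32-bit) and reports which little-endian byte offsets differ from the observed 0xffffb4ff.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sentinel quoted in the thread for 32-bit: ACL_NOT_CACHED == 0xffffffff. */
#define ACL_NOT_CACHED_32 0xffffffffu

/*
 * Hypothetical helper (illustration only, not from the kernel):
 * compare two 32-bit values byte by byte, in the order the bytes sit in
 * little-endian memory. For every differing byte, store its offset and the
 * observed byte value. Returns the number of differing bytes.
 */
static int differing_bytes_le(uint32_t expected, uint32_t observed,
                              int offsets[4], uint8_t values[4])
{
    int n = 0;

    assert(offsets != NULL && values != NULL);

    for (int i = 0; i < 4; i++) {
        uint8_t e = (uint8_t)(expected >> (8 * i)); /* byte i in LE memory */
        uint8_t o = (uint8_t)(observed >> (8 * i));

        if (e != o) {
            offsets[n] = i;
            values[n] = o;
            n++;
        }
    }
    return n;
}
```

Calling differing_bytes_le(ACL_NOT_CACHED_32, 0xffffb4ffu, ...) reports a single differing byte, 0xb4 at offset 1, which matches the "0xb4 at offset 1" reading. It cannot tell that apart from a 16-bit write of 0xb4ff, since the low byte would be 0xff either way.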

Martin-Éric Racine (q-funk) wrote :

With that kernel, the result is:

BUG: unable to handle kernel paging request at ffffb4ff
IP: [<c01f5902>] __destroy_inode+0x72/0x110
*pde = 00817067 *pte = 00000000
Oops: 0000 [#1] SMP
last sysfs file: /sys/power/resume

Stefan Bader (smb) wrote :

Just as memo in order to remember it after the weekend: The second kernel was uploaded and did boot further but seemed to have other problems with apparmor then. Not sure whether this is related or not.
The only change was to move the i_acl and i_default_acl to the end of the inode structure, so that the i_private pointer comes to the relative offset it was before. So the corruption of that memory location might still happen but as it is used differently it does not lead to the immediate panic.

Stefan, could you submit those changes to the LKML, referring to the
above kernel.org bug number, but emphasizing that this might only be a
partial fix that only masks the real problem? I'm sure that Ingo
Molnar and Al Viro would have some constructive feedback.

Martin-Éric, I added some info to the upstream bug. I believe the relevant
people are subscribed there as well and I don't think that change really is
something near a solution as this only prevents the immediate visibility of the
corruption. It still might happen but go unnoticed as the private pointer might
be used differently (at other times). So I would not say it is a fix.

In order to hopefully find out more about this, I created a new kernel that will (if things go as intended) catch the corruption cases without crashing and also tries to gather more info. It replaces the other kernels at http://people.canonical.com/~smb/bug396286/. Can you try that and, if it boots, post the dmesg that gets produced? Or even if not, whatever can be seen on the screen. Thanks.

Martin-Éric Racine (q-funk) wrote :

Stefan, sorry for not replying to this any sooner.

This one crashes in similar ways as before. However, there's one interesting development: fsck of the root filesystem succeeds in launching and it fixes errors. Then, the kernel crashes as follows:

BUG: unable to handle kernel paging request at ffffb4ff
IP: [<c01f595f>] __destroy_inode+0x6f/0x110
*pde = 00819067 *pte = 00000000
Oops: 0000 [#1] SMP
last sysfs file: /sys/power/resume

Martin-Éric Racine (q-funk) wrote :

Stefan, could you please attach your 2.6.31-10.32bug396286v2 diff to this bug?

Martin-Éric Racine wrote:
> Stefan, could you please attach your 2.6.31-10.32bug396286v2 diff to

I am traveling this week and unfortunately seem to have the patch on another
box. Getting that crash means it is not really working as I intended it to do
and I have to have another look at it. But I won't get back until next Monday.

Stefan, we're already aware that simply shuffling the structure as you did for 2.6.31-10.32bug396286v2 probably only masks the real issue rather than fixes it, but it would already be a good start to attach this as a patch and to send it upstream for comments.

Stefan Bader (smb) wrote :

This is the patch used to avoid the crash by moving the new pointers to the end of the inode structure. Though I would think this won't give new reactions from upstream. They pretty much should have guessed this.

Stefan Bader (smb) wrote :

As for the last debug kernel: it unfortunately contained a copy-and-paste error, so it failed to check the i_default_acl pointer. Interestingly, I would have expected this to be no problem, as the previous tests seemed to indicate that only the first of the two got corrupted. I am currently uploading a revised kernel build which hopefully runs without crashing. It will be at the known location in a few minutes.

Martin-Éric Racine (q-funk) wrote :

Linux geode 2.6.31-11-generic #37bug396286v1 SMP Tue Sep 29 13:43:37 UTC 2009 i586

Boots fine.

AplayDevices:
 **** List of PLAYBACK Hardware Devices ****
 card 0: Audio [CS5535 Audio], device 0: CS5535 Audio [CS5535 Audio]
   Subdevices: 1/1
   Subdevice #0: subdevice #0
Architecture: i386
ArecordDevices:
 **** List of CAPTURE Hardware Devices ****
 card 0: Audio [CS5535 Audio], device 0: CS5535 Audio [CS5535 Audio]
   Subdevices: 1/1
   Subdevice #0: subdevice #0
AudioDevicesInUse: Error: command ['fuser', '-v', '/dev/snd/by-path', '/dev/snd/controlC0', '/dev/snd/pcmC0D0p', '/dev/snd/pcmC0D0c', '/dev/snd/seq', '/dev/snd/timer'] failed with exit code 1:
CRDA: Error: [Errno 2] No such file or directory
Card0.Amixer.info:
 Card hw:0 'Audio'/'CS5535 Audio cs5535audio at 0xfe00, irq 11'
   Mixer name : 'Realtek ALC203 rev 0'
   Components : 'AC97a:414c4770'
   Controls : 33
   Simple ctrls : 21
DistroRelease: Ubuntu 9.10
HibernationDevice: RESUME=UUID=5ffade8f-b837-49eb-bb44-225617349ca3
IwConfig:
 lo no wireless extensions.

 eth0 no wireless extensions.
Lsusb:
 Bus 001 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
 Bus 002 Device 003: ID 046d:c00e Logitech, Inc. M-BJ58/M-BJ69 Optical Wheel Mouse
 Bus 002 Device 002: ID 03f9:0100 KeyTronic Corp. Keyboard
 Bus 002 Device 001: ID 1d6b:0001 Linux Foundation 1.1 root hub
MachineType: First International Computer, Inc. ION603
Package: linux 2.6.31.11.22
PackageArchitecture: i386
ProcCmdLine: root=UUID=97b2628b-28a5-49f2-85f7-495728b3bef8 ro vga=795 quiet splash crashkernel=384M-2G:64M,2G-:128M
ProcEnviron:
 SHELL=/bin/bash
 PATH=(custom, user)
 LANG=fi_FI.UTF-8
 LANGUAGE=fi_FI:fi:en_US:en
ProcVersionSignature: Ubuntu 2.6.31-11.37bug396286v1-generic
RelatedPackageVersions:
 linux-backports-modules-2.6.31-11-generic N/A
 linux-firmware 1.19
RfKill:

Uname: Linux 2.6.31-11-generic i586
UserGroups: adm admin audio cdrom dialout lpadmin operator plugdev pulse pulse-access sambashare staff sudo
WpaSupplicantLog:

dmi.bios.date: 11/08/2007
dmi.bios.vendor: Phoenix Technologies, LTD
dmi.bios.version: 6.00 PG
dmi.board.name: ION603
dmi.board.vendor: First International Computer, Inc.
dmi.board.version: PCB 2.X
dmi.chassis.type: 3
dmi.modalias: dmi:bvnPhoenixTechnologies,LTD:bvr6.00PG:bd11/08/2007:svnFirstInternationalComputer,Inc.:pnION603:pvrVER2.X:rvnFirstInternationalComputer,Inc.:rnION603:rvrPCB2.X:cvn:ct3:cvr:
dmi.product.name: ION603
dmi.product.version: VER 2.X
dmi.sys.vendor: First International Computer, Inc.

tags: added: apport-collected

PS: I just added dmesg.boot and dmesg.current using apport-collect, based on output from 37bug396286v1. I hope this provides useful information.

Stefan Bader (smb) wrote :

Hm, unfortunately my filename printing was not very successful. Though somehow it looks related to apparmor. Could you try to boot with "apparmor=0" and check dmesg for those bad pointer messages?

Akdo (menoft) wrote :

Hi, I reported the duplicate bug after this one (although I did some research beforehand), and I think I have more information on this issue: Bug #406484

We just have the same Wireless device ! Evil device !

Bus 001 Device 002: ID 0ace:1215 ZyDAS WLA-54L 802.11bg

Martin-Éric Racine (q-funk) wrote :

Akdo, your issue is completely unrelated to this one.

Martin-Éric Racine (q-funk) wrote :

Stefan, adding "apparmor=0" to cmdline did not produce any noticeable change:

[ 6.627144] EXT3-fs: mounted filesystem with writeback data mode.
[ 8.308111] bad i_default_acl pointer = ffffb4ff
[ 8.308133] on
[ 8.308689] bad i_default_acl pointer = ffffb4ff
[ 8.308705] on
[ 8.317678] bad i_default_acl pointer = ffffb4ff
[ 8.317697] on

Martin-Éric Racine wrote:
> Stefan, adding "apparmor=0" to cmdline did not produce any noticeable
> change:

Alright, it was probably unlikely as this would have affected more people. And
the same code seems to run well on other systems. But it was worth a try, just
as it happens so close to those messages.
I need to rework the part that tries to find and print the associate file.
Maybe that gives a better indication. Not sure how quickly I get that done, though.

Hi Martin-Éric, looks like at the time of destroy_inode there is no path information left anymore. So I created a new version which checks on every inode access. Maybe this gives a little more insight. Could you try to run the v2 version of the kernel for me? Thanks

Stefan Bader (smb) wrote :

I might as well assign that to me by now.

Changed in linux (Ubuntu):
assignee: Leann Ogasawara (leannogasawara) → Stefan Bader (stefan-bader-canonical)
Martin-Éric Racine (q-funk) wrote :

Stefan, thanks for the updated kernel. I'll test this shortly and report here on the results. Meanwhile, could this please be re-based against kernel 2.6.31-12-generic while we're at it?

Martin-Éric Racine (q-funk) wrote :

Occurrences spotted with 37v2:

<4>[ 3.180243] sda1 sda2 < sda5 >
<5>[ 3.209382] sd 0:0:0:0: [sda] Attached SCSI disk
<6>[ 3.209462] Freeing unused kernel memory: 540k freed
<6>[ 3.210813] Write protecting the kernel text: 4548k
<6>[ 3.211102] Write protecting the kernel read-only data: 1840k
<6>[ 3.588117] usb 2-3: new low speed USB device using ohci_hcd and address 2
<6>[ 3.798731] usb 2-3: configuration #1 chosen from 1 choice
<3>[ 3.917274] bad i_default_acl pointer = ffffb4ff
<3>[ 3.917314] ipath /lib/udev/rules.d
<6>[ 4.124184] usb 2-4: new low speed USB device using ohci_hcd and address 3
<6>[ 4.349780] usb 2-4: configuration #1 chosen from 1 choice

...

<6>[ 6.809857] kjournald starting. Commit interval 5 seconds
<6>[ 6.809915] EXT3-fs: mounted filesystem with writeback data mode.
<5>[ 8.264349] type=1505 audit(1254918705.672:2): operation="profile_load" pid=321 name=/sbin/dhclient3
<5>[ 8.265847] type=1505 audit(1254918705.672:3): operation="profile_load" pid=321 name=/usr/lib/NetworkManager/nm-dhcp-client.action
<5>[ 8.266749] type=1505 audit(1254918705.672:4): operation="profile_load" pid=321 name=/usr/lib/connman/scripts/dhclient-script
<5>[ 8.409711] type=1505 audit(1254918705.816:5): operation="profile_load" pid=322 name=/usr/bin/evince
<5>[ 8.440386] type=1505 audit(1254918705.848:6): operation="profile_load" pid=322 name=/usr/bin/evince-previewer
<5>[ 8.454846] type=1505 audit(1254918705.860:7): operation="profile_load" pid=322 name=/usr/bin/evince-thumbnailer
<5>[ 8.495981] type=1505 audit(1254918705.900:8): operation="profile_load" pid=323 name=/usr/lib/cups/backend/cups-pdf
<5>[ 8.498157] type=1505 audit(1254918705.904:9): operation="profile_load" pid=323 name=/usr/sbin/cupsd
<5>[ 8.515403] type=1505 audit(1254918705.920:10): operation="profile_load" pid=324 name=/usr/sbin/tcpdump
<3>[ 8.621156] bad i_default_acl pointer = ffffb4ff
<3>[ 8.621904] bad i_default_acl pointer = ffffb4ff
<3>[ 8.623094] bad i_default_acl pointer = ffffb4ff
<3>[ 8.636382] bad i_default_acl pointer = ffffb4ff
<3>[ 8.637708] bad i_default_acl pointer = ffffb4ff
<3>[ 8.638109] bad i_default_acl pointer = ffffb4ff
<3>[ 8.640120] bad i_default_acl pointer = ffffb4ff
<3>[ 8.641532] bad i_default_acl pointer = ffffb4ff
<3>[ 8.642390] bad i_default_acl pointer = ffffb4ff
<3>[ 8.652979] bad i_default_acl pointer = ffffb4ff
<6>[ 11.047595] udev: starting version 147
<7>[ 12.395906] cs5535_gpio: base=0x6100 mask=0xb003c66 major=251
<6>[ 12.403445] AMD Geode RNG detected

Martin-Éric Racine (q-funk) wrote :

Thanks for the full log. Unfortunately won't be able to look at it today. But
updated the kernel at least.

summary: - 2.6.31-generic: kernel panic near the end of initramfs
+ [Geode LX] [OLPC] 2.6.31-generic: kernel panic near the end of initramfs

I think this issue is preventing boot on the OLPC XO-1 hardware, however as the screen never unfreezes it is hard to be certain.

Stefan Bader (smb) wrote :

New kernel same place. Please try to boot and if it comes up add the dmesg as usual. Thanks.

Martin-Éric Racine (q-funk) wrote :

Nice, except that Geode really is an i386 architecture, so an amd64 kernel won't be of any use here. :)

Stefan Bader (smb) wrote :

Grr, wrong build chroot. OK, on the other hand I'd have been surprised if nothing had gone wrong when doing things quickly in the evening. Correct architecture uploaded.

Martin-Éric Racine (q-funk) wrote :

dmesg attached.

Stefan Bader (smb) wrote :

That looks somewhat crazy. There is not a single error message in this dmesg. This is completely unexpected and actually weird. Just because this is so unbelievable, could you please also try the -14.46 v2 which I uploaded (and attach the dmesg here)? And for the sake of completeness, please verify and confirm in this report that an unmodified Ubuntu 2.6.31-14.46 kernel crashes on boot.

For explanation: the old debug used the current struct inode which looks like this:

struct inode {
  ...
  struct posix_acl *i_acl;
  struct posix_acl *i_default_acl;
  void *i_private;
}

With that we saw a corrupted value in i_default_acl. For the latest debug I added two dummy pointers before and after the acl pointers. So the structure looks like this:

struct inode {
  ...
  void *i_dbg1;
  struct posix_acl *i_acl;
  struct posix_acl *i_default_acl;
  void *i_dbg2;
  void *i_private;
}

The expected behavior would have been that by adding those pointers either i_acl or i_dbg2 or i_default_acl (depending on whether the corruption is relative to the start or the end or direct to i_default_acl) would see the corruption. But certainly not that nothing gets triggered.
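
The expectation described above can be checked with a small user-space sketch. The two structs below are simplified stand-ins for the layouts quoted in this comment: the other_fields array is a hypothetical placeholder for the elided "..." members, and the helper names are invented for illustration. The sketch computes which field of the padded layout a stray write would land on if the write were fixed relative to the start or to the end of the structure.

```c
#include <assert.h>
#include <stddef.h>

struct posix_acl;                 /* opaque; only pointers are needed */

/* Simplified stand-in for the unpadded layout quoted above. */
struct inode_plain {
    long other_fields[8];         /* hypothetical placeholder for "..." */
    struct posix_acl *i_acl;
    struct posix_acl *i_default_acl;
    void *i_private;
};

/* Same layout with the two debug pointers around the acl pointers. */
struct inode_padded {
    long other_fields[8];
    void *i_dbg1;
    struct posix_acl *i_acl;
    struct posix_acl *i_default_acl;
    void *i_dbg2;
    void *i_private;
};

enum field { F_OTHER, F_DBG1, F_ACL, F_DEFAULT_ACL, F_DBG2, F_PRIVATE };

/* Map a byte offset in the padded layout to the field that starts there. */
static enum field padded_field_at(size_t off)
{
    if (off == offsetof(struct inode_padded, i_dbg1)) return F_DBG1;
    if (off == offsetof(struct inode_padded, i_acl)) return F_ACL;
    if (off == offsetof(struct inode_padded, i_default_acl)) return F_DEFAULT_ACL;
    if (off == offsetof(struct inode_padded, i_dbg2)) return F_DBG2;
    if (off == offsetof(struct inode_padded, i_private)) return F_PRIVATE;
    return F_OTHER;
}

/* Stray write at the old i_default_acl offset, counted from the start. */
static enum field hit_if_fixed_from_start(void)
{
    return padded_field_at(offsetof(struct inode_plain, i_default_acl));
}

/* Stray write at the old i_default_acl distance, counted from the end. */
static enum field hit_if_fixed_from_end(void)
{
    size_t from_end = sizeof(struct inode_plain)
                    - offsetof(struct inode_plain, i_default_acl);

    assert(from_end <= sizeof(struct inode_padded));
    return padded_field_at(sizeof(struct inode_padded) - from_end);
}
```

Under these assumptions, a write fixed relative to the start lands on i_acl and a write fixed relative to the end lands on i_dbg2, so a checker watching all four pointers should have seen one of them change. That none did is what makes the result surprising.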

Martin-Éric Racine (q-funk) wrote :

Here's dmesg for 46 v2.

Martin-Éric Racine (q-funk) wrote :

To compare, I tried booting with the stock 47 (46 is no longer available) and, sure enough, it freezes during boot as the other unpatched releases before.

tags: added: regression-karmic
Stefan Bader (smb) wrote :

As this issue seems to vanish when we prod around with the size of the inode structure, this needs a bit more prodding around. I added two more test kernels to my people.canonical.com page:

2.6.31-14.38*v1: This one is the stock kernel, but with SMP disabled in config (which removes some code replacement magic)
2.6.31-14.38*v2: This has the bad pointer catcher without the padding pointers, but with additional information printed about the bad pointers.

Could you boot both of these and add the dmesg or the info that it crashes on boot for both? Thanks.

Martin-Éric Racine (q-funk) wrote :

2.6.31-14-generic_48+bug396286v1.txt attached.

Martin-Éric Racine (q-funk) wrote :

2.6.31-14-generic_48+bug396286v2.txt attached.

Martin-Éric Racine (q-funk) wrote :

Sorry, please disregard the previous 48v2 attachment. See this one instead.

Stefan Bader (smb) wrote :

So the setting of SMP has no effect here and looking at the addresses of the bad pointers, there seems to be no obvious pattern in those. Neither the offsets within the inode structure are really showing any suspicious placements. So one step further, I added some code to immediately check after the values are supposed to be set (a v3 in the usual place). Could you check that for me and post the resulting dmesg? Thanks.

Stefan Bader (smb) wrote :

This actually looks interesting. Maybe some light somewhere? :) Again the corruption seems not to have happened at all. And this time the structure was not modified. I only moved the init statements somewhat. But before getting too excited, could you take a go at v4 (and again post me the dmesg of that)? I hope this sheds a bit more light on it.

Stefan Bader (smb) wrote :

Somehow those results do not really make sense. In v4, the init code is back in the place it was before, and the only difference between the code that results in corrupted pointers and this one (which does not show any corruption at all) is that there were a few more calls for validating the pointers after the init function supposedly set the right values.
Now this really brings up the question of what the heck is going on there. It happens without SMP, so this would rule out the option of a race condition. The same code works on different hardware (I run it without any problems). And running the same procedure on the Geode seems to produce different results when the codepath changes a bit, even without really changing the effective way things are done. I wonder whether the v5 I added (which does only a limited number of checks and uses no function call to do the check) brings back the failure warnings or still runs without any output. Could you do yet another run and post the dmesg?

Martin-Éric Racine (q-funk) wrote :

As a point of information, I recently changed the fstab entries to state ext4, to benefit from the slightly faster performance that new features common to both ext3 and ext4 make possible. In principle, this should not change anything, since both ext3 and ext4 call the same common fsattr functions but, in case there was a cut&paste error that only affected ext3 and not ext4, this could have some impacts.

To make sure the change of the fstab did not have an impact, go back to the v2 version
(which previously showed the corruption) and check whether you find bad acl messages
now. If not, then it would be most interesting to get back to the old state.

Stefan Bader (smb) wrote :

So both (ext3 and ext4) show the corruption messages when running with v2, but v5 never hits them. For better understanding I am attaching the diff between v2 and v5. As one can see, there is no real change in it. In __iget there is just a printk added for the case that inode_check_acl() finds something. And all the changes to init_inode() just query i_acl and i_default_acl, which usually are set in inode_init_always(). There are just two cases where the acl values are not initialized, and in both the test in init_inode() should then return NULL. And of course, either way, I would expect either a debug message here or at least hitting the problem in destroy_inode(). But as soon as this code is added, all problems go away. This is something I have a hard time explaining.
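For readers who have not looked at the diff, the kind of sanity check discussed above can be modelled in a few lines of Python. This is only an illustrative user-space sketch, not the code from the v2 or v5 patches: the names Inode, Acl, NOT_CACHED and bad_acl_fields are hypothetical, and the sentinel merely stands in for whatever "not cached yet" marker the kernel assigns when an inode is initialised. The assumption is that each of the two ACL fields may legitimately hold only that marker, NULL, or a real ACL object, and that anything else counts as corruption.

```python
# Illustrative model only; every name here is hypothetical.
NOT_CACHED = object()  # stands in for the kernel's "ACL not cached" marker


class Acl:
    """Placeholder for a valid, reference-counted ACL object."""


class Inode:
    def __init__(self):
        # Mirrors the initialisation step: both fields start out
        # as "not cached" rather than as random memory contents.
        self.i_acl = NOT_CACHED
        self.i_default_acl = NOT_CACHED


def bad_acl_fields(inode):
    """Return the names of ACL fields holding an unexpected value."""
    bad = []
    for name in ("i_acl", "i_default_acl"):
        value = getattr(inode, name)
        # Legitimate states: NULL (None), the sentinel, or a real ACL.
        if value is None or value is NOT_CACHED or isinstance(value, Acl):
            continue
        bad.append(name)
    return bad


inode = Inode()
print(bad_acl_fields(inode))       # freshly initialised inode

inode.i_default_acl = 0x6B6B6B6B   # simulate a stray pointer value
print(bad_acl_fields(inode))
```

The puzzle in this thread is that, on the affected hardware, merely adding checks of this kind after initialisation appears to make the bad values disappear, which a model like this cannot reproduce.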

Martin-Éric Racine (q-funk) wrote :

Just to confirm, this issue still applies to 2.6.32-2-generic.

Stefan, how about attaching all of your diffs to the upstream bug and asking the LKML for advice? I think that you and Leann have already done a fine job of narrowing down the issue and, at this point, the authors of the upstream code really need to step in and contribute their share in fixing this regression.

Martin-Éric Racine (q-funk) wrote :

I'll also add that a Debian developer (dilinger) who is also a member of the OLPC kernel team would be willing to help, but he cannot do much until you've attached your diffs to the upstream bug.

Stefan Bader (smb) wrote :

I added the two patches with some comments to the upstream bug report. At this point I guess it would be interesting to have a second confirmation with a different Geode (but of the same model), to rule out a single misbehaving piece of hardware. I heard of others saying they have problems, but were those exactly the same crash, with the acl pointers corrupted in that particular way?

tags: added: regression-release
removed: regression-potential
tags: added: karmic
Martin-Éric Racine (q-funk) wrote :

Added tags for Lucid, since this issue is still unresolved and will blow up in people's faces when they upgrade from Hardy.

tags: added: lucid regression-lucid
Martin-Éric Racine (q-funk) wrote :

I'm really wondering what to do about this one since LKML has been rather uncooperative and yet it already affects those upgrading from Jaunty to Karmic. However, the real concern is for LTS->LTS+1 upgrades. Geode support in Hardy is rock-solid, whereas this show-stopper affects Lucid.

Martin-Éric Racine (q-funk) wrote :

Peter Anvin suggested in the upstream report that enforcing -march=i386 as compiler options might be all that's required to fix this. Could new test packages be built using the following patch?

http://git.kernel.org/tip/17a2a9b57a9a7d2fd8f97df951b5e63e0bd56ef5

Martin-Éric Racine (q-funk) wrote :

Repeatedly trying to rebuild 2.6.32-9-generic with Peter Anvin's patch following instructions at https://help.ubuntu.com/community/Kernel/Compile consistently fails this way:

  CC arch/x86/kernel/alternative.o
  CC arch/x86/kernel/i8253.o
  CC arch/x86/kernel/pci-nommu.o
  CC arch/x86/kernel/tsc.o
  CC arch/x86/kernel/io_delay.o
  CC arch/x86/kernel/rtc.o
  CC arch/x86/kernel/trampoline.o
  CC arch/x86/kernel/process.o
arch/x86/kernel/process.o: final close failed: File truncated
make[5]: *** [arch/x86/kernel/process.o] Error 1
make[4]: *** [arch/x86/kernel] Error 2
make[3]: *** [arch/x86] Error 2
make[2]: *** [sub-make] Error 2
make[1]: *** [/home/q-funk/Projektit/linux-2.6.32/debian/stamps/stamp-build-generic] Error 2
make: *** [binary-generic] Error 2

This is on a system with 1 GB of RAM, so I'm really not sure how this "file truncated" keeps on showing up.

Stefan Bader (smb) wrote :

Placed test kernel (2.6.31-17.54 + patch mentioned by hpa in the upstream bug) to http://people.canonical.com/~smb/bug396286/

Martin-Éric Racine (q-funk) wrote :

Tested. Crashes as before.

Could we apply this and your extra debug message patch to something 2.6.32 as well and build test packages with that? It seems that 2.6.31 cannot work with Plymouth and some other novelties found in Lucid, plus upstream probably wants us to try against the latest and greatest.

Stefan Bader (smb) wrote :

Uploaded linux-image-2.6.32-10-generic_2.6.32-10.14+bug396286v2_i386.deb to http://people.canonical.com/~smb/bug396286/

Martin-Éric Racine (q-funk) wrote :

2.6.32-10.14+bug396286v2 boots. dmesg attached.

Martin-Éric Racine (q-funk) wrote :

I'm wondering if the patch that was used to produce 2.6.32-10.14+bug396286v2 could be added to the Lucid -generic kernel?

While I realize that it's not a proper fix, let's keep in mind that Lucid is the next LTS and, as such, the last thing we want is a massive wave of complaints from users of thin clients (most of which are based on some Geode variant) upgrading from Hardy that their whole classroom of LX800-based thin client devices can no longer boot since the upgrade from Hardy to Lucid.

This of course doesn't dispense us from finding the real cause of the issues and fixing it properly but, if anyone asks me, a piece of gaffer tape that somehow prevents a hardware management disaster from taking place is better than no solution at all.

Martin-Éric Racine (q-funk) wrote :

Leann? Stefan?

Martin-Éric Racine (q-funk) wrote :

It seems that we have some progress.

In an attempt to debug this issue, I compared notes with someone on Fedora for whom the same hardware works. As a test, I used their kernel 2.6.31 config (with a couple of small modifications to build specific drivers as built-in) as attached to build my own kernel using make-kpkg. Much to my amazement, this kernel boots fine, as long as I specify root=/dev/sda1 on the GRUB cmdline.

However, for some reason, kernel-package no longer creates an initrd.img, even when the --initrd option was specified to build the kernel-image target. Yet, as soon as I created one using "sudo update-initramfs -k 2.6.31.12-geodelx -c" and rebooted, the kernel failed to boot as before.

Just to be safe, I deleted the initrd and rebooted again, letting udev perform its work after /sbin/init has been launched by the kernel. Lo and behold, it worked again!

As such, it seems that something that gets included in the initramfs image is what messes with the ACL code and destroys some inodes and makes the kernel crash in a non-recoverable way.

Interestingly enough, we still get the previous error messages when booting with this barebone kernel, without an initrd.img, but the error is non-fatal. The output of dmesg -r is attached next.

Martin-Éric Racine (q-funk) wrote :
Martin-Éric Racine (q-funk) wrote :

Performing the same test as above (removal of initrd.img and modification of GRUB's menu.lst) using 2.6.32-14-generic further confirms that the issue is with something that gets included in the initramfs image.

The relevant GRUB menu excerpt:

title Ubuntu lucid (development branch), kernel 2.6.32-14-generic
kernel /boot/vmlinuz-2.6.32-14-generic rootfstype=ext4 root=/dev/sda1 ro vga=795 quiet splash
quiet

Martin-Éric Racine (q-funk) wrote :

The resulting raw dmesg output.

Martin-Éric Racine (q-funk) wrote :

Apparently, in all cases, something that touches sysfs does something nasty, but what?

Instructions on how to locate exactly which part of the initramfs image payload causes this are welcome.
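One generic way to approach this is to bisect the initramfs payload: rebuild the image with only half of the suspect items, boot it, and keep whichever half still fails. The following is a minimal sketch of that search, assuming a single culprit, a reproducible failure, and items that can be left out independently of each other (plausible here only because the machine boots without any initramfs when root=/dev/sda1 is given). The function name find_culprit and the boots_with callback are hypothetical; in practice the callback would be a person rebuilding the image with the listed items, rebooting, and reporting the result.

```python
from typing import Callable, List, Sequence


def find_culprit(items: Sequence[str],
                 boots_with: Callable[[List[str]], bool]) -> str:
    """Binary-search for the one item whose presence breaks the boot.

    boots_with(subset) must return True if an image containing only
    `subset` boots, and False if it crashes.
    """
    candidates = list(items)
    if boots_with(candidates):
        raise ValueError("boot succeeds with everything included")
    while len(candidates) > 1:
        half = candidates[: len(candidates) // 2]
        if not boots_with(half):
            # Still crashes with only this half: the culprit is in it.
            candidates = half
        else:
            # This half is clean, so the culprit is in the remainder.
            candidates = candidates[len(half):]
    return candidates[0]


# Simulated run with placeholder item names; "item-4" plays the culprit.
payload = ["item-%d" % n for n in range(1, 9)]
print(find_culprit(payload, lambda subset: "item-4" not in subset))
```

With N suspect items this needs roughly log2(N) reboots instead of N. If the crash depends on a combination of items, or is not reproducible on every boot, the single-culprit assumption breaks down and the result cannot be trusted.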

Martin-Éric Racine (q-funk) wrote :

Stefan suggested trying mem=nopentium but this did not have any apparent effect. However raid=noautodetect did. When booting without any initramfs image, the kernel no longer shows any paging error or destroyed inode at all.

Martin-Éric Racine (q-funk) wrote :

Testing the mainline 2.6.32 kernel on other LX800-based hardware (Artec ThinCan DBE61C-USB), I notice that everything boots as normal.

It thus appears that some recent changes in the kernel might have succeeded in exposing BIOS issues on some specific hardware.

I suppose that we could choose to ignore this bug and move on, but doing so would pose a problem: an awful lot of Geode-based hardware sold by different hardware vendors are branded versions of this same FIC ION603. This includes the Linutop-2, Inveneo desktop, Koolu ... and many others that came with Ubuntu pre-installed.

Personally, I think that a more positive outcome would involve a combination of Canonical's technical support (which is known to have certified some of the above hardware for Ubuntu) and of some of the above vendors contacting First International Computer to work at finding a common solution together, which could possibly involve releasing an updated BIOS along with Ubuntu tools to flash the EPROM from the command line.

summary: - [Geode LX] [OLPC] 2.6.31-generic: kernel panic near the end of initramfs
+ [Geode LX] [ION603] 2.6.31-generic: kernel panic near the end of
+ initramfs

FYI it appears that AMD decided to jump in and contact the FIC engineering team themselves. I'll keep everyone informed on any progress via this bug and the upstream one.

Martin-Éric Racine (q-funk) wrote :
Martin-Éric Racine (q-funk) wrote :

Here's the info provided by dmidecode, in case it can help someone figure out what is going on.

Martin-Éric Racine (q-funk) wrote :

Here's a snapshot of the BIOS splash, as requested.

summary: - [Geode LX] [ION603] 2.6.31-generic: kernel panic near the end of
- initramfs
+ [Geode LX] [ION603] kernels >= 2.6.31 fail to boot [initramfs]
tags: added: kernel-core kernel-reviewed
Stefan Bader (smb) wrote :

Setting this bug back to triaged, as I don't know how to make progress here and don't want to block anybody with better ideas from picking it up.

Changed in linux (Ubuntu):
assignee: Stefan Bader (stefan-bader-canonical) → nobody
status: In Progress → Triaged
Jeremy Foshee (jeremyfoshee) wrote :

Changed status to WontFix per the dropping of support for Geode in Maverick. This is per discussion with the Ubuntu Kernel Team as to the ongoing status of this issue.

~JFo

Changed in linux (Ubuntu):
status: Triaged → Won't Fix
Martin-Éric Racine (q-funk) wrote :

It would be a very good idea for the kernel team to discuss far-reaching ideas of this sort with the user community BEFORE coming to a decision. It is also worth noting that the decision directly contradicts the existing decision to accommodate the Geode in the libc6 compilation options.

Martin-Éric Racine (q-funk) wrote :

I'd like to point out that the linux-image-2.6.36-0-generic that popped up into Natty suddenly resolves this issue.

Meanwhile, the 2.6.35-22.34 that is currently pulled by linux-generic does not.

This seems to indicate that this has indeed been a kernel issue all along and NOT a BIOS issue!

It would be highly desirable for the kernel team to find out exactly what fixed it and to backport the patch into Lucid.

Changed in linux (Ubuntu):
status: Won't Fix → Triaged
Changed in linux (Ubuntu):
status: Triaged → Won't Fix
Changed in linux:
status: Confirmed → Invalid
Changed in linux:
importance: Unknown → Medium
Martin-Éric Racine (q-funk) wrote :

Fixed by 2.6.36-1-generic and broken again as of 2.6.38-1-generic, which makes 2.6.37-12 the last good kernel on this hardware.

Changed in linux:
status: Invalid → Confirmed
Martin-Éric Racine (q-funk) wrote :

Working again as of 2.6.38-7-generic.

Changed in linux:
status: Confirmed → Invalid
Changed in linux:
status: Invalid → Confirmed