maverick toolchain producing unbootable (hanging) kernels

Bug #673236 reported by m4t
This bug affects 2 people
Affects                     Status        Importance  Assigned to
Linux                       New           Undecided   Unassigned
binutils (Ubuntu)           Fix Released  Undecided   Unassigned
binutils (Ubuntu Maverick)  Won't Fix     Undecided   Unassigned
linux (Ubuntu)              Fix Released  Undecided   Unassigned
linux (Ubuntu Maverick)     Invalid       Undecided   Unassigned

Bug Description

Binary package hint: gcc-4.4

when attempting to compile recent vanilla kernels from source (2.6.35.7, 2.6.35.8, 2.6.36), the resulting bzImages all hang at the same point in the boot process. this doesn't seem to be an error with the kernel config or source, because using identical kernel source and .config produces working kernels when compiled inside a debootstrap'd lucid chroot.

i had originally suspected this had something to do with a specific cpu setting (CONFIG_MPENTIUMM), but this also happens after changing to the 'generic' CONFIG_M686. i also suspected that perhaps something was broken with my install. however, a bzImage produced remotely from someone's fresh 32bit 10.10 install also hung at the exact same place during boot. compiling using ubuntu's generic .config and patchset (linux-image-2.6.35-22-generic) produced a working kernel.

the boot failure occurs both on real hardware (ibm t42p) as well as inside of qemu. the broken bzImages hang at rtc_cmos in qemu, and at ehci initialization on real hw. the working bzImages from lucid (as they ought to) end with a kernel panic due to missing rootfs.

the broken bzImages are produced with both kernel-package 12.036 and a simple 'make' in the kernel source tree.

i had some luck using bzImage/vmlinux with qemu/gdb to see where it was stuck, but now the symbols don't seem to match up. i would otherwise provide this info.

this problem has only appeared since installing 10.10. 8.10, 9.04, 9.10, and 10.04 did not present this issue. lucid's build environment, used from a chroot inside maverick, also does not present this issue. although this bug has been placed under gcc-4.4, maverick's gcc-4.5 package has the same issue.

running gcc 4.4.4's testsuite shows failure in several places. the binutils testsuite shows no failures.

i've attached a minimized 2.6.36 .config which produces a broken bzImage.

any insight appreciated.

thanks,
-matt

ProblemType: Bug
DistroRelease: Ubuntu 10.10
Package: gcc-4.4 4.4.4-14ubuntu5
Uname: Linux 2.6.36deep-thought i686
Architecture: i386
Date: Tue Nov 9 15:28:56 2010
InstallationMedia: Ubuntu 10.10 "Maverick Meerkat" - Release i386 (20101007)
ProcEnviron:
 LANG=en_US.utf8
 SHELL=/bin/bash
SourcePackage: gcc-4.4

Revision history for this message
m4t (m4t) wrote :

after some further narrowing of kernel config, it appears that 'CONFIG_RELOCATABLE=y' is causing the kernel to become unbootable with maverick's 32bit toolchain. unsetting this parameter produces a bootable kernel on maverick. again, lucid toolchain produces bootable kernels with or without 'CONFIG_RELOCATABLE=y'.

so this is what's causing it in the kernel config. according to LKDDB, this parameter does:

'The kernel is linked as a position-independent executable (PIE) and contains dynamic relocations which are processed early in the bootup process.'

i'm able to reproduce this consistently now, by toggling CONFIG_RELOCATABLE on and off in kernel compiles.

thanks,
-matt
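The toggle m4t describes can be scripted. A minimal sketch, using a stand-in .config (in a real kernel tree, `scripts/config --disable RELOCATABLE` or menuconfig is the robust way to do this):

```shell
# Sketch: flip CONFIG_RELOCATABLE off in a kernel .config.
# The .config written here is a stand-in for a real kernel config.
set -e
cat > .config <<'EOF'
CONFIG_M686=y
CONFIG_RELOCATABLE=y
EOF
# Record the option the way kconfig writes a disabled bool:
sed -i 's/^CONFIG_RELOCATABLE=y$/# CONFIG_RELOCATABLE is not set/' .config
grep RELOCATABLE .config   # -> "# CONFIG_RELOCATABLE is not set"
```

Rebuilding with and without this line toggled is what makes the failure reproducible on the Maverick toolchain.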

Revision history for this message
m4t (m4t) wrote :

this actually seems to be directly related to the maverick binutils package, binutils_2.20.51.20100908-0ubuntu2_i386.deb

replacing lucid's binutils with this inside a chroot causes the same boot failures with CONFIG_RELOCATABLE=y, when used alongside the rest of lucid's toolchain.

-matt

Revision history for this message
m4t (m4t) wrote :

further investigation shows that maverick binutils is more likely the culprit than maverick gcc.

affects: gcc-4.4 (Ubuntu) → binutils (Ubuntu)
Revision history for this message
m4t (m4t) wrote :

brief testing of natty's binutils_2.20.90.20101105-0ubuntu1_i386.deb shows the bug is also present there.

-matt

Revision history for this message
Matthias Klose (doko) wrote :

please could you check with binutils from Debian/experimental too? The last armel build in natty didn't pass the testsuite.

Revision history for this message
m4t (m4t) wrote :

i tested with debian experimental (2.20.90.20101105-1) in lucid chroot with same results.

i've attached a very minimal .config for 2.6.36 that results in a hang at 'Calibrating delay loop...' when run like: qemu -cpu coreduo -m 512M -kernel arch/x86/boot/bzImage -s

unsetting CONFIG_RELOCATABLE with the attached config will panic at 'VFS: Unable to mount root fs' like it ought to.

thanks,
-matt

Revision history for this message
Jamie Lokier (jamie-shareable) wrote :

I can confirm this problem, although my symptoms are slightly different.

I've been using vanilla kernels (tracking kernel.org) since Lucid on my x86 32-bit laptop.
After upgrading to Maverick, subsequently built kernels all failed to boot, panicking early with a message about timer IRQ routing and the APIC.

After spending a few days (:-/) bisecting, because I thought a new kernel revision must be the cause, I eventually found that the exact same git revision and kernel config failed to boot when rebuilt, compared with a previously built kernel that works fine (I'm using it at the moment).

I downgraded all the GCC packages to Lucid's version, and the resulting build still failed to boot, in the same way.

Then I downgraded Binutils to Lucid's version, rebuilt, and the resulting kernel (a) booted fine (using it at the moment), and (b) had the same symbol map as the kernel built previously before I upgraded Lucid to Maverick.

This means Binutils is definitely affecting the kernel build, in a way which causes various boot failures.

Comments earlier in this bug, my different symptom, and a mention elsewhere (https://bugs.launchpad.net/ubuntu/+source/linux/+bug/633983/comments/36), imply that the symptoms aren't consistent, except that booting fails.

The variation in symptoms is entirely consistent with CONFIG_RELOCATABLE having something to do with it. That is, it looks like random code corruption ;-)

When I compare the kernel's System.map built with Lucid's vs. Maverick's binutils, and Lucid's gcc (for both), with my particular kernel + config, I see this:

    - The version built with Maverick's binutils has slightly larger kallsyms tables (just a few bytes).
      Everything else in System.map looks to be the same size.
    - In arch/x86/boot/compressed/vmlinux, the ELF section sizes (objdump -h) are the same size
      in both versions, except .rodata..compressed is about 1.6k smaller in the version
      built with Maverick.
    - In arch/x86/boot/compressed/vmlinux.bin, the ELF section sizes are the same size, except
      in the Maverick-binutils-built version, .rodata is 16 bytes smaller, .param is 32 bytes smaller,
      and .init.data is 32 bytes larger.

I'm still looking at the reason for the differences.
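The section-size comparison above can be mechanised by diffing `objdump -h` output from the two builds. A self-contained sketch, with here-docs standing in for the real dumps (columns: index, name, size):

```shell
# Sketch: report ELF sections whose size differs between two builds.
# lucid.txt / maverick.txt stand in for real `objdump -h vmlinux.bin` output.
cat > lucid.txt <<'EOF'
0 .text 00003000
1 .rodata 00000400
EOF
cat > maverick.txt <<'EOF'
0 .text 00003000
1 .rodata 000003f0
EOF
# First pass stores sizes by section name; second pass prints mismatches:
awk 'NR==FNR { size[$2] = $3; next }
     size[$2] != $3 { print $2, size[$2], "->", $3 }' lucid.txt maverick.txt
```

On real dumps this flags exactly the .rodata/.param/.init.data deltas Jamie lists above.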

Revision history for this message
Jamie Lokier (jamie-shareable) wrote :

Ok, I've found the problem. Binutils changed the meaning of symbol assignments in linker scripts. It is actually documented, but rather obscurely in /usr/share/doc/binutils/ld/NEWS.gz.

This breaks the relocation of "jiffies". It also makes got.plt go away in each vdso (a small improvement).

This means there *isn't* a binutils bug.

Upstream kernels need a small patch, which I have got ready with some proper explanation and will send after I've had some sleep.
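The construct in question lives in the x86 kernel linker script (arch/x86/kernel/vmlinux.lds.S). A simplified sketch of the problem, paraphrased rather than verbatim kernel source:

```ld
/* Simplified sketch of the x86_32 kernel linker script.  The assignment
 * sits OUTSIDE any output section, so newer ld (the 2.20.51.0.12 to
 * 2.21.51.0.3 range cited later in this bug) emits 'jiffies' as an
 * absolute symbol; with CONFIG_RELOCATABLE=y it is never relocated and
 * resolves to virtual address 0 at runtime. */
jiffies = jiffies_64;

SECTIONS
{
    .text : {
        _text = .;
        *(.text)
    }
}
```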

Matthias Klose (doko)
Changed in binutils (Ubuntu):
status: New → Invalid
Revision history for this message
m4t (m4t) wrote :

hey jamie,
thanks for bringing those changes to light. please do post the patch.
-matt

Revision history for this message
Jeremy Foshee (jeremyfoshee) wrote :

Hi m4t,

This bug was reported a while ago and there hasn't been any activity in it recently. We were wondering if this is still an issue? Can you try with the latest development release of Ubuntu? ISO CD images are available from http://cdimage.ubuntu.com/releases/ .

If it remains an issue, could you run the following command from a Terminal (Applications->Accessories->Terminal). It will automatically gather and attach updated debug information to this report.

apport-collect -p linux 673236

Also, if you could test the latest upstream kernel available that would be great. It will allow additional upstream developers to examine the issue. Refer to https://wiki.ubuntu.com/KernelMainlineBuilds . Once you've tested the upstream kernel, please remove the 'needs-upstream-testing' tag. This can be done by clicking on the yellow pencil icon next to the tag located at the bottom of the bug description and deleting the 'needs-upstream-testing' text. Please let us know your results.

Thanks in advance.

    [This is an automated message. Apologies if it has reached you inappropriately; please just reply to this message indicating so.]

tags: added: needs-kernel-logs
tags: added: needs-upstream-testing
tags: added: kj-triage
Changed in linux (Ubuntu):
status: New → Incomplete
Revision history for this message
m4t (m4t) wrote :

yes, this is still an issue. i've been compiling kernels in a lucid chroot as noted in previous posts. this issue did not present itself using squeeze/amd64 and compiling 2.6.36.2. it may be limited to 32bit x86.

-matt

Revision history for this message
m4t (m4t) wrote :

also, i'm a little unclear as to why this was marked as being invalid, without any comment or explanation. if it's not a problem with binutils, please explain why, and what is necessary to have upstream vanilla kernels build properly.

thanks,
-matt

Changed in binutils (Ubuntu):
status: Invalid → New
Revision history for this message
Jools Wills (jools) wrote :

I can confirm this issue. I currently build kernels on a lucid virtualbox to use on maverick/natty.

Changed in linux (Ubuntu):
status: Incomplete → Confirmed
tags: removed: needs-kernel-logs needs-upstream-testing
Revision history for this message
m4t (m4t) wrote :

thanks jools. hopefully this won't be an issue in natty. i'll find out in the next week.
-matt

Revision history for this message
Jamie Lokier (jamie-shareable) wrote :

Working kernels can be built in Maverick with any one of these:

1. Disable CONFIG_RELOCATABLE. That makes the problem go away,
    and it makes a bunch of build warnings go away too (they indicate the
    problem, and they really should cause the build to fail, but they just warn).

2. Use an older Binutils linker. The problem is caused by a change in how the linker
    handles symbol assignments in linker scripts in a particular corner case, which the x86 Linux linker script hits,
    but only affects CONFIG_RELOCATABLE=y.

3. Judging by the patch comments below, use a newer Binutils linker. I haven't tried this.

4. Take a look at mainline kernel commits. For me a patch similar to the older one below
    fixed booting with Maverick built kernels on *32-bit* x86. The "reported boot failures"
    were x86_64 only as far as I can tell, so this might not help with x86_64.

        6b35eb9ddcddde7b510726de03fae071178f1ec4
        Date: Wed Jan 19 10:09:42 2011 +0100
        Revert "x86: Make relocatable kernel work with new binutils"

        This reverts commit 86b1e8dd83cb ("x86: Make relocatable kernel work with
        new binutils").

        Markus Trippelsdorf reported a boot failure caused by this patch.

        The real solution to the original patch will likely involve an
        arch-generic solution to define an overlaid jiffies_64 and jiffies
        variables.

        Until that's done and tested on all architectures revert this commit to
        solve the regression.

   which reverts the Maverick build fix you are looking for....

        86b1e8dd83cbb0fcbf3d61d2b461df8be1f528cf
        Date: Tue Jan 18 08:57:49 2011 +0800
        x86: Make relocatable kernel work with new binutils

        The CONFIG_RELOCATABLE=y option is broken with new binutils, which will make
        boot panic.

        According to Lu Hongjiu, the affected binutils are from 2.20.51.0.12 to
        2.21.51.0.3, which are release since Oct 22 this year. At least ubuntu 10.10 is
        using such binutils. See:

           http://sourceware.org/bugzilla/show_bug.cgi?id=12327

        The reason of the boot panic is that we have 'jiffies = jiffies_64;' in
        vmlinux.lds.S. The jiffies isn't in any section. In kernel build, there is
        warning saying jiffies is an absolute address and can't be relocatable. At
        runtime, jiffies will have virtual address 0.

5. Enjoy the thread at http://www.gossamer-threads.com/lists/linux/kernel/1328685
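For option 4, the approach taken by 86b1e8dd83cb amounts to moving the assignment inside an output section so the symbol becomes section-relative. A paraphrased sketch, not the verbatim hunk:

```diff
-/* outside any section: absolute symbol, cannot be relocated */
-jiffies = jiffies_64;
 SECTIONS
 {
     .text : {
         _text = .;
+        /* inside .text: section-relative, so it is relocated normally */
+        jiffies = jiffies_64;
         *(.text)
     }
 }
```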

Revision history for this message
m4t (m4t) wrote :

thanks. the patch mentioned, 86b1e8dd83cbb0fcbf3d61d2b461df8be1f528cf, currently available at https://patchwork.kernel.org/patch/485751/ fixes the issue. i've attached the patch also.
-matt

Revision history for this message
Leann Ogasawara (leannogasawara) wrote :

With regards to the kernel, it appears the patch noted in comment #17 is included as of upstream v2.6.38-rc1:

$ git describe --contains 86b1e8dd83cbb0fcbf3d61d2b461df8be1f528cf
v2.6.38-rc1~3^2~1

As such I'm marking the actively developed linux task (i.e. currently Oneiric) as Fix Released. I'll open a Maverick nomination for this to be considered for SRU. Thanks.
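For readers unfamiliar with the check Leann ran: `git describe --contains` names the earliest tag reachable from (i.e. containing) a commit. A self-contained sketch on a throwaway repo:

```shell
# Sketch: `git describe --contains` reports the first tag that contains
# a given commit, relative to that tag.
set -e
rm -rf demo && git init -q demo && cd demo
git -c user.email=a@b -c user.name=a commit -q --allow-empty -m 'the fix'
fix=$(git rev-parse HEAD)
git -c user.email=a@b -c user.name=a commit -q --allow-empty -m 'later work'
git tag v2.6.38-rc1
git describe --contains "$fix"   # -> v2.6.38-rc1~1 (one commit before the tag)
```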

Changed in linux (Ubuntu):
status: Confirmed → Fix Released
Revision history for this message
Jamie Lokier (jamie-shareable) wrote :

Hi Leann,

The patch you said is included upstream was reverted the next day in v2.6.38-rc2, so the reason given for Fix Released status is not applicable.

I don't have permission to revert the status change, so please do that if you agree.

$ git describe --contains 6b35eb9ddcddde7b510726de03fae071178f1ec4
v2.6.38-rc2~32^2~1

commit 6b35eb9ddcddde7b510726de03fae071178f1ec4
Author: Ingo Molnar <email address hidden>
Date: Wed Jan 19 10:09:42 2011 +0100

    Revert "x86: Make relocatable kernel work with new binutils"

    This reverts commit 86b1e8dd83cb ("x86: Make relocatable kernel work with
    new binutils").

    Markus Trippelsdorf reported a boot failure caused by this patch.

It might be that the issue does not occur with the binutils in Oneiric anyway, as the reverted kernel patch (86b1e8dd83cb) mentions a binutils version range with an upper limit. I have not tried building a kernel with CONFIG_RELOCATABLE=y on Oneiric to find out. If binutils behaviour was changed a second time, fixing this issue, it might be more helpful to backport the binutils change to Maverick instead of a kernel change.

Revision history for this message
Matthias Klose (doko) wrote :

newer kernels do work with the changed binutils.
maverick isn't supported anymore.

Changed in binutils (Ubuntu):
status: New → Fix Released
Changed in binutils (Ubuntu Maverick):
status: New → Won't Fix
Revision history for this message
Julian Wiedmann (jwiedmann) wrote :

This release has reached end-of-life [0].

[0] https://wiki.ubuntu.com/Releases

Changed in linux (Ubuntu Maverick):
status: New → Invalid