dpkg segfaults during debootstrap on natty armel

Bug #674146 reported by Oliver Grawert
18
This bug affects 2 people
Affects Status Importance Assigned to Milestone
Linaro GCC
Fix Released
High
Chung-Lin Tang
debootstrap (Ubuntu)
Undecided
Loïc Minier
Natty
Undecided
Loïc Minier
dpkg (Ubuntu)
High
James Hunt
Natty
High
James Hunt
gcc-4.5 (Ubuntu)
High
Unassigned
Natty
High
Unassigned

Bug Description

Binary package hint: dpkg

during debootstrap when trying to Install core packages, dpkg dies with a segfault like:
...
I: Extracting util-linux...
I: Extracting xz-utils...
I: Extracting zlib1g...
I: Installing core packages...
W: Failure trying to run: chroot /build/chroot-livecd dpkg --force-depends --install /var/cache/apt/archives/base-files_5.0.0ubuntu25_armel.deb /var/cache/apt/archives/base-passwd_3.5.22_armel.deb

researching the cause it seems that the Maintainer and Description fields are expected in /var/lib/dpkg/status
it is segfaulting with a status file as created by debootstrap:

Package: dpkg
Version: 1.15.8.5ubuntu1
Status: install ok installed

while the following status file makes it possible to go on with debootstrap calling the necessary commands manually:

Package: dpkg
Essential: yes
Status: install ok installed
Maintainer: bar
Version: 1.15.8.5ubuntu1
Description: foo

Related branches

Oliver Grawert (ogra)
Changed in dpkg (Ubuntu):
importance: Undecided → High
Changed in dpkg (Ubuntu Natty):
milestone: none → natty-alpha-1
Colin Watson (cjwatson)
Changed in dpkg (Ubuntu Natty):
assignee: nobody → James Hunt (jamesodhunt)
status: New → Triaged
Revision history for this message
James Hunt (jamesodhunt) wrote :

There do appear to be potentially 2 problems here:

1) the issue whereby /var/lib/dpkg/status is somehow truncated

    "dpkg-query -l" indeed expects to find the following fields for a package as a minimum:

    name, description, maintainer, version, status (required to actually get any output from "dpkg -l")

2) "dpkg-query -l" is segfaulting when given the invalid "status" file

The code detects the problem but rather than issuing an error message, it gets a SIGSEGV. What we expect to see from "dpkg -l" under the conditions in (1) is:

warning, in file '/var/lib/dpkg/status' near line 3 package 'dpkg':
 missing description
warning, in file '/var/lib/dpkg/status' near line 3 package 'dpkg':
 missing maintainer
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name Version Description
+++-=============================-=============================-==========================================================================
ii dpkg 1.15.8.5ubuntu1
______

I've looked at (2) first: the problem is observed when va_end() is called in lib/dpkg/parsehelp.c:parse_warn(). Call hierarchy is:
modstatdb_init() -> parsedb() -> parse_ensure_have_field() -> parse_warn(). Once the function returns (just after calling va_end()), the stack has been trashed. valgrind is alas no help here as it fails on armel ("valgrind: the 'impossible' happened" [bug to be raised]).

Revision history for this message
James Hunt (jamesodhunt) wrote :

Confirmed problem exists with version dpkg 1.15.8.5ubuntu1 (armel). Certainly appears to be armel specific at this point (this version of dpkg works fine on i386, AMD64, PPC).

Revision history for this message
Loïc Minier (lool) wrote :

James, try the valgrind in:
https://launchpad.net/~linaro-maintainers/+archive/overlay/+packages

it's a more recent snapshot with fairly complete ARM support

Revision history for this message
James Hunt (jamesodhunt) wrote :

Problem is caused by gcc optimizations ("-O2"). Building without optimizations using DEB_BUILD_OPTIONS="nostrip noopt" and gcc-4.5 (4.5.1-9ubuntu1) results in expected behaviour of "dpkg-query -l".

This looks like a bug in gcc since a rebuild using gcc-4.4 (4.4.5-2ubuntu1) *with* "-O2" gives expected behaviour (ie no SIGSEGV).

Tested with libc-2.12.1-0ubuntu9.

Revision history for this message
James Hunt (jamesodhunt) wrote :

To be clear, building with "-O2" and gcc-4.5 (4.5.1-9ubuntu1) causes the crash.

Revision history for this message
James Hunt (jamesodhunt) wrote :

I may be heading down the wrong path, but essentially the summary so far is:

- building dpkg with gcc-4.5 with -O2 results in SIGSEGV behaviour
- building dpkg with gcc-4.5 with -O0 results in correct behaviour
- building dpkg with gcc-4.4 with -O2 results in correct behaviour
- building dpkg with gcc-4.4 with -O0 results in correct behaviour

It may still be an application bug: Scott mentioned to me earlier that
there have been changes in gcc/libc? for ARM which affect the way in which
memcpy is implemented (farming the heavy lifting to optimized machine
instructions). These can uncover application bugs where memcpy is
technically being misused since (src+dst overlap so dev should have used
memmove). I'm planning to follow this up using an interposed memcpy with
an assert to check for overlap.

Revision history for this message
Colin Watson (cjwatson) wrote :

Thanks, James! You're right that sometimes optimisation changes can simply expose application bugs, but in this case I agree with you that this appears to be a compiler bug. I did a little bit of single-stepping through parse_warn once you pointed that out as the place where everything was going wrong, and it seems to be setting up the input buffer for vfprintf wrongly. The first few lines of code in that function are:

  parse_error_msg(ps, pigp, _("warning"), buf1);
  q = str_escape_fmt(buf2, buf1);
  strcat(q, fmt);

The strcat is compiled to this assembly (according to 'disas' in gdb):

   0x00010258 <+80>: mov r1, r8
   0x0001025a <+82>: mov r7, r0
   0x0001025c <+84>: blx 0x9a40 <strcat>

r8 is set up earlier, before the call to parse_error_msg, like this:

   0x00010232 <+42>: ldr.w r8, [r6], #4

At that point, according to 'info registers', it's pointing to fmt. But r8 is spilled by the call to parse_error_msg, so strcat is not in fact called with the arguments it's supposed to be called with. You can see even with C-level single-stepping that it's entirely wrong - buf2 ends up with two copies of the error message generated by parse_error_msg, rather than one copy followed by fmt.

As the next step, it might be worth looking in 'info gcc' to find the individual -ffoo optimisation flags enabled by -O2 vs. -O0, and bisect to find which ones generate incorrect assembly. For a compiler bug, in general, we should also be trying to produce a reduced test case, as this will need to be a bit smaller than "compile dpkg" in order to pass to a compiler engineer.

Revision history for this message
Matthias Klose (doko) wrote :

before bisecting, maybe recheck with
 - gcc-4.5 from debian experimental (using -march=armv7-a)
 - gcc-4.5 from natty using -march=armv5t

tags: added: armel
Revision history for this message
Matthias Klose (doko) wrote :

Linaro, please could you have a look?

Revision history for this message
Loïc Minier (lool) wrote :

Because there were sprintf/strcat etc. calls around, I tried rebuilding dpkg with -fno-stack-protector -U_FORTIFY_SOURCE, but this didn't help; parse_error_msg corrupts r6 instead of r8.

With a corrupted status file and Debian sid's dpkg I would get for instance:
warning, in file '/var/lib/dpkg/status' near line 18 package 'libsepol1-dev':
 missing maintainer
dpkg-query: parse error, in file '/var/lib/dpkg/status' near line 26 package 'libtext-wrapi18n-perl':
 duplicate value for `Maintainer' field

I've build Ubuntu's dpkg under Debian sid + experimental gcc-4.5/g++-4.5/deps + -mthumb -march=armv7-a and got:
dpkg-query: H�

so apparently, memory corruption; I apparently could run through parse_warn(), but it died somewhere in parsedb().

under Ubuntu natty with -marm -march=armv5t I got:
*** stack smashing detected ***: /usr/bin/dpkg-query terminated
qemu: uncaught target signal 6 (Aborted) - core dumped
Aborted

The stack smashing occurs in parse_warn(); after the call to str_escape_fmt() and the subsequent strcat(), but corruption would likely have been earlier.

I am not sure how to interpret the datapoints; Ubuntu has different issues with different flags, Debian has no issue with 4.4 and the Debian default flags, but has issues with 4.5 and our flags.

I guess I could try Debian + 4.5 with Debian's default flags.

In any case, I agre the Ubuntu issues need to be reduced to simpler test cases

(Is anybody working on avoiding the corrupted status in the first place?)

Revision history for this message
Loïc Minier (lool) wrote :

BTW I wanted to share my setup a bit:
I'm creating qemu-arm-static chroots with:
sudo qemu-debootstrap --arch=armel --variant=minbase sid sid-armel http://mirror/debian

and chrooting into them with plain "chroot sid-armel /bin/bash" to build + run binaries.

To debug, I'm using the armel cross gdb package at http://hudson.dooz.org/job/armel-gdb/; in the chroot I launch:
qemu-arm-static -g 1234 /usr/bin/dpkg-query -l

and I run arm-linux-gnueabi-gdb /chroot/usr/bin/dpkg-query, and then "target remote 127.0.0.1:1234" and "c" to start the program; since I build dpkg with nostrip, I get debug symbols in this gdb session

Revision history for this message
Loïc Minier (lool) wrote :

I've built Ubuntu's dpkg under Debian sid + experimental gcc-4.5/g++-4.5/deps again, but without touching the flags and I get a similar corruption as with Ubuntu's gcc-4.5.

NB: I get some weird build failures in the Debian chroot like:
dpkg-shlibdeps: error: no dependency information found for /lib/libc.so.6 (used by debian/dpkg/usr/bin/dpkg-deb).
but I don't think these relate to this bug; I also got a dpkg build failure in Ubuntu with linux-libc-dev 2.6.37; I've sent a patch at http://lkml.org/lkml/2010/11/12/351

Loïc Minier (lool)
Changed in debootstrap (Ubuntu Natty):
assignee: nobody → Loïc Minier (lool)
Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package debootstrap - 1.0.25ubuntu1

---------------
debootstrap (1.0.25ubuntu1) natty; urgency=low

  * scripts/gutsy, scripts/sid: add dummy Maintainer: and Description: to the
    dpkg status file entry for dpkg as to avoid dpkg -l warnings; LP: #674146.
 -- Loic Minier <email address hidden> Sat, 13 Nov 2010 12:34:58 +0100

Changed in debootstrap (Ubuntu Natty):
status: New → Fix Released
Revision history for this message
Colin Watson (cjwatson) wrote :

Loïc, I'm not sure I'm happy about this debootstrap change. 'dpkg -l' was just an artificial test case: 'dpkg -i', which is the thing that debootstrap actually calls, doesn't require Description and Maintainer fields. I think we should keep debootstrap's status file strictly minimal.

Revision history for this message
James Hunt (jamesodhunt) wrote :

Update:

- dmalloc shows up nothing.
- memcpy() calls seem legitimate after checking with an interpositioning tool.

I've attempted both a top-down and bottom-up approach to hone in on this issue:

- I've attempted to stimulate a segv by writing an app doing "similar" things to "dpkg -l" but no joy in recreating issue yet.
- I am currently stripping dpkg down to a "minimal testcase" which exhibits the stack smashing behaviour...

James Hunt (jamesodhunt)
Changed in dpkg (Ubuntu Natty):
status: Triaged → In Progress
Revision history for this message
Oliver Grawert (ogra) wrote :

while i agree with colin, it helps the arm team to have daily builds again.
i was assuming that the debootstrap change is only meant as temporary fix.

Revision history for this message
Loïc Minier (lool) wrote :

Colin, I confirmed that the debootstrap change fixed debootstrap on armel; without the change, debootstrap still fails. So dpkg is running similar consistency checks as dpkg-query -l.

The debootstrap output is:
I: Installing core packages...
W: Failure trying to run: dpkg --force-depends --install /var/cache/apt/archives/base-files_5.0.0ubuntu25_armel.deb /var/cache/apt/archives/base-passwd_3.5.22_armel.deb

Checking debootstrap.log I only see:
qemu: uncaught target signal 11 (Segmentation fault) - core dumped
Segmentation fault

Running dpkg --force-depends --install /var/cache/apt/archives/base-files_5.0.0ubuntu25_armel.deb /var/cache/apt/archives/base-passwd_3.5.22_armel.deb in the chroot reproduces the problem.
Only adding Maintainer or Description was not enough to allow this dpkg -i to succeed.

Are you saying that you don't mind triggering the warnings in dpkg because dpkg has strictly enough data to complete? I guess that's fair as the warning is not user-visible; I personally prefer matching the interface as well as possible and avoiding warnings were I can to avoid clutter while debugging, but it's also sensible to keep debootstrap as minimal as possible.

We should keep the change until dpkg is fixed though.

Would you close the Debian bug with the debootstrap change with your rationale so that Debian folks read it directly from you?

Revision history for this message
James Hunt (jamesodhunt) wrote :
tags: added: patch
Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package dpkg - 1.15.8.5ubuntu2

---------------
dpkg (1.15.8.5ubuntu2) natty; urgency=low

  * debian/rules: Disabled compiler optimizations on arch "armel" (LP: #674146)
    This is a temporary work-around until we can identify the cause of the
    stack corruption and subsequent SIGSEGV.
 -- James Hunt <email address hidden> Tue, 16 Nov 2010 13:54:56 +0000

Changed in dpkg (Ubuntu Natty):
status: In Progress → Fix Released
Revision history for this message
James Hunt (jamesodhunt) wrote :

The plot thickens...

Noted that "-O1" does not cause SIGSEGV with gcc-4.5. Confirmed that the equivalent set of gcc options for "-O1"...

  gcc-4.5 -Q -O1 --help=optimizers|grep enabled|awk '{print $1}'|tr '\n' ' '

... does not cause SIGSEGV. However, it appears that "-O2" is not the same as...

  gcc-4.5 -Q -O2 --help=optimizers|grep enabled|awk '{print $1}'|tr '\n' ' '

... since the latter does *NOT* cause a SIGSEGV. Looks like another gcc bug.

(FWIW "clang -O2" works fine :)

Steve Langasek (vorlon)
Changed in gcc-4.5 (Ubuntu Natty):
status: New → Confirmed
importance: Undecided → High
Revision history for this message
Steve Langasek (vorlon) wrote :

If this is a compiler bug, then we should have a task open for the compiler so we can get this fixed as well.

Revision history for this message
Andrew Stubbs (ams-codesourcery) wrote :

@ James Hunt

Trying to reproduce -O1 or -O2 by listing a set of optimization flags is invalid.

a) most optimization flags are no-ops at the default -O0 (optimization disabled);

b) There are some optimizations that are enabled at a given optimization level, but do not have a named flag.

However, it is valid to say "-O2 -fno-whatever" to see if a specific optimization pass is at fault (although disabling an earlier one can hide a bug in a later one, so you can't make any firm statements).

Revision history for this message
Michael Hope (michaelh1) wrote :

Also occurs with plain Linaro GCC 4.5-2010.11-1.

Revision history for this message
Michael Hope (michaelh1) wrote :

This appears to be a compiler bug. Using plain gcc-linaro-4.5-2010.11-1, parse_warn includes the following assembly:

 266: f7ff fffe bl 0 <dcgettext>
   266: R_ARM_THM_CALL dcgettext
 26a: 4632 mov r2, r6
 26c: f104 0108 add.w r1, r4, #8
 270: aefd add r6, sp, #1012 ; 0x3f4
 272: 9600 str r6, [sp, #0]
 274: 4603 mov r3, r0
 276: 1d20 adds r0, r4, #4
 278: f7ff ff8a bl 190 <parse_error_msg.clone.1>
 27c: 4631 mov r1, r6
 27e: a803 add r0, sp, #12
 280: f7ff fffe bl 0 <str_escape_fmt>

Note that r6 is set to &buf at 0x270, stored on the stack at 0x272, and put into r1 at 0x27c. parse_error_msg() however corrupts r6:

00000190 <parse_error_msg.clone.1>:
 190: b580 push {r7, lr}
 192: b082 sub sp, #8
 194: 4614 mov r4, r2
 196: 4606 mov r6, r0
 198: 460d mov r5, r1

which causes str_escape_fmt() to stomp on memory.

Either the compiler should call parse_error_msg (not the clone), or save r6 in the clone, or restore r6 from the stack before calling str_escape_fmt().

Changed in gcc-linaro:
status: New → Triaged
importance: Undecided → High
Revision history for this message
Michael Hope (michaelh1) wrote :

Note that the above is the same as Colin Watson's diagnosis. Attached is the preprocessed source that shows the problem when compiled with gcc -O2.

tags: added: bad-code
Chung-Lin Tang (cltang)
Changed in gcc-linaro:
assignee: nobody → Chung-Lin Tang (cltang)
Revision history for this message
Chung-Lin Tang (cltang) wrote :

This seems to be a case of PR44768, http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44768
Basically, early IPA SRA did not properly set up context, causing the parse_error_msg.clone.1 function getting marked as 'volatile' (no return), thus assuming callee-save registers can be clobbered...

Mainline trunk revision 161947 fixes this, I'll get this backported.

Michael Hope (michaelh1)
Changed in gcc-linaro:
status: Triaged → In Progress
Revision history for this message
Matthias Klose (doko) wrote :

with the backport of PR44768, this looks better:

--- parsehelp.s 2010-11-26 15:19:47.000000000 +0000
+++ ../parsehelp.s 2010-11-26 15:08:31.000000000 +0000
@@ -208,16 +208,15 @@
        .thumb_func
        .type parse_error_msg.clone.1, %function
 parse_error_msg.clone.1:
- @ Volatile: function does not return.
        @ args = 4, pretend = 0, frame = 0
        @ frame_needed = 0, uses_anonymous_args = 0
- push {r7, lr}
+ push {r4, r5, r6, r7, r8, lr}
        sub sp, sp, #8
        mov r4, r2
        mov r6, r0
        mov r5, r1
        mov r7, r3
- ldr r8, [sp, #16]
+ ldr r8, [sp, #32]
        cbz r2, .L41
        ldr r3, [r2, #4]
        cbz r3, .L41
@@ -236,7 +235,7 @@
        mov r0, r8
        bl sprintf
        add sp, sp, #8
- pop {r7, pc}
+ pop {r4, r5, r6, r7, r8, pc}
 .L41:
        movw r1, #:lower16:.LC10
        movs r2, #5
@@ -246,11 +245,11 @@
        ldr r4, [r5, #0]
        ldr r3, [r6, #0]
        mov r2, r7
- str r4, [sp, #16]
+ str r4, [sp, #32]
        mov r1, r0
        mov r0, r8
        add sp, sp, #8
- pop {r7, lr}
+ pop {r4, r5, r6, r7, r8, lr}
        b sprintf
        .size parse_error_msg.clone.1, .-parse_error_msg.clone.1
        .align 2

Revision history for this message
James Hunt (jamesodhunt) wrote :

Tested with Matthias' gcc-4.5 (4.5.1-10ubuntu3): when compiling with "-O2" seg fault no longer observed.

Revision history for this message
James Hunt (jamesodhunt) wrote :

@cltang: thank you very much for identifying the cause+fix!

Revision history for this message
Matthias Klose (doko) wrote :

no regressions on x86_64/i686.

Revision history for this message
Matthias Klose (doko) wrote :

dpkg should be rebuilt after the fixed gcc-4.5 is in the archive

Changed in dpkg (Ubuntu Natty):
status: Fix Released → Confirmed
Revision history for this message
Matthias Klose (doko) wrote :

and the debootstrap change reverted after dpkg is rebuilt

Changed in debootstrap (Ubuntu Natty):
status: Fix Released → Confirmed
Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package gcc-4.5 - 4.5.1-10ubuntu3

---------------
gcc-4.5 (4.5.1-10ubuntu3) natty; urgency=low

  * Update to SVN 20101126 (r167167) from the gcc-4_5-branch.
    - Fix PR rtl-optimization/46571, PR target/31100, PR c/46547,
      PR fortran/46638.

  [ Matthias Klose ]
  * Fix powerpc and s390 builds when biarch is disabled.
  * Revert Linaro issue #1259.
  * Backport PR bootstrap/44768, miscompilation of dpkg on ARM
    with -O2 (Chung-Lin Tang). LP: #674146.

  [ Konstantinos Margaritis ]
  * Add support for new target architecture `armhf'. Closes: #603948.
 -- Matthias Klose <email address hidden> Fri, 26 Nov 2010 17:29:48 +0100

Changed in gcc-4.5 (Ubuntu Natty):
status: Confirmed → Fix Released
Revision history for this message
Oliver Grawert (ogra) wrote :

debootstrap was reverted three days ago already

debootstrap (1.0.25ubuntu2) natty; urgency=low

  * revert addition of dummy Maintainer: and Description:, the LP bug #674146
    is being worked around by dropping dpkg optimization on armel instead.

Changed in debootstrap (Ubuntu Natty):
status: Confirmed → Invalid
Michael Hope (michaelh1)
Changed in gcc-linaro:
milestone: none → 4.5-2010.12-0
status: In Progress → Fix Committed
Revision history for this message
James Hunt (jamesodhunt) wrote :

Issue is not with dpkg, but with gcc so change status for dpkg.

Changed in dpkg (Ubuntu Natty):
status: Confirmed → Invalid
Revision history for this message
Colin Watson (cjwatson) wrote :

I've restored optimisations to dpkg as of 1.15.8.6ubuntu1, just uploaded.

Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package dpkg - 1.15.8.6ubuntu1

---------------
dpkg (1.15.8.6ubuntu1) natty; urgency=low

  * Resynchronise with Debian. Remaining changes:
    - Adjust versioned emacs22 conflicts to cope with versions in Ubuntu.
    - cputable: Set cpu to i686 for arch i386.
  * Restore optimisations on armel, now that the compiler is believed to be
    fixed (LP: #674146).

dpkg (1.15.8.6) unstable; urgency=low

  [ Raphaël Hertzog ]
  * Ensure debian/source/local-options is always excluded from the source
    package even if the user provides customized -i or -I options.
    Closes: #597023
  * Fix Dpkg::Version's handling of version with a debian revision but an
    empty version (e.g. "-0.1"). Thanks to James Vega <email address hidden>
    for the patch. Closes: #597651
  * With "3.0 (quilt)" source package, create .pc/.quilt_series with the
    correct series file if the source package provides vendor specific patch
    sets.

  [ Guillem Jover ]
  * Disable by default usage of synchronous sync(2), as it causes undesired
    I/O on unrelated file systems. Closes: #588339, #595927, #600075
  * Add new --force-unsafe-io to disable safe I/O operations on unpack.
    Closes: #584254

  [ Updated man page translations ]
  * French (Christian Perrier). Including a typo fix and a typographical
    change reported by Vincent Danjean. Closes: #601852
  * Spanish (Omar Campagne). Closes: #596519

  [ Updated programs translations ]
  * Basque (Iñaki Larrañaga Murgoitio). Closes: #599923
  * Catalan (Jordi Mallach).
  * Danish (Ask Hjorth Larsen). Closes: #600240
  * German (Sven Joachim). Improved by Holger Wansing.
  * Italian (Pietro Battiston). Fix translation of "however". Closes: #602518
  * Portuguese (Miguel Figueiredo). Closes: #596168
  * Romanian (Andrei Popescu). Closes: #604769
  * Russian (Yuri Kozlov). Closes: #595455
  * Vietnamese (Clytie Siddall). Closes: #598473

  [ Updated scripts translations ]
  * Catalan (Jordi Mallach).
  * German (Sven Joachim).

  [ Updated dselect translations ]
  * Catalan (Jordi Mallach).
  * German (Sven Joachim).
 -- Colin Watson <email address hidden> Fri, 03 Dec 2010 16:47:01 +0000

Changed in dpkg (Ubuntu Natty):
status: Invalid → Fix Released
Michael Hope (michaelh1)
Changed in gcc-linaro:
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.