stack corruption running "go install launchpad.net/juju-core/..."

Bug #1279620 reported by Michael Hudson-Doyle on 2014-02-13
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
eglibc
Confirmed
Medium
eglibc (Ubuntu)
High
Adam Conrad
Trusty
High
Adam Conrad

Bug Description

This is the sort of bug I was hoping we were going to escape in this porting effort :(

It's not frequent, but maybe 10% of the time go install launchpad.net/juju-core/... will fail with some kind of memory error. It's not exactly the same error every time, but it's consistent with something scribbling on the stack. It doesn't seem to happen if you run pass -p 1 to install, which limits the parallelism inside the install command, so it's probably some kind of race. Yay!

Michael Hudson-Doyle (mwhudson) wrote :

So, some playing around reveals:

1) it's not associated with gc (running with GOGC=off and it still fails)

2) it's associated with processes (if you just create a file in the expected place using Go code rather than running gccgo -- or even touch! -- to create it, it doesn't fail)

3) gdb has all sorts of strange problems trying to probe the heap structures :/

Michael Hudson-Doyle (mwhudson) wrote :

Another clue: I think it's signal handling related because (a) one of the values that gets smashed onto the stack is 0x11, which is SIGCHLD, and (b) if you use gdb to prevent SIGCHLD from being passed to the process (handle SIGCHLD nopass) it works.

Michael Hudson-Doyle (mwhudson) wrote :

OK, so it's definitely signal handling: for whatever reason, signals are not being executed on the stack requested via sigaltstack. Sooner or later, one gets executed where writing to the stack causes real problems.

You can make this happen a LOT more frequently by invoking "touch" instead of whatever program go is running, this sort of thing in build.go:runOut:

+ ncl := []string{"touch"}
+ for i, c := range cmdline {
+ if c == "-o" || c == "cru" {
+ f := cmdline[i+1]
+ if f[0] != '/' { f = dir + "/" + f }
+ ncl = append(ncl, f)
+ }
+ }
+ cmdline = ncl

So I'm suspecting a kernel bug now. The go runtime uses getcontext/makecontext/setcontext a lot and something must be getting confused.

Michael Hudson-Doyle (mwhudson) wrote :

In fact, the signal gets executed on another thread's stack. Not sure if that's just coincidence or not, but it happens reasonably often.

Created attachment 7435
adapted from example in http://pubs.opengroup.org/onlinepubs/009695399/functions/makecontext.html

I'm attaching a simple demo program for makecontext/swapcontext that I found somewhere with the addition of code to print the alternate signal stack before and after calling swapcontext. On aarch64 it prints this:

start f2
start f1
finish f2
finish f1
{ss_sp: (nil), ss_flags: 2, ss_size: 0}
{ss_sp: 0x7ffbe931f8, ss_flags: 0, ss_size: 8192}

It turns out that because setcontext is implemented in terms of the rt_sigreturn syscall it ends up copying the uc_stack data from the passed context into the (kernel) task's sigaltstack parameters. Hilarity ensues. Specifically it means that programs linked against the gccgo runtime sometimes handle signals with SP pointing at memory that another thread is using for its stack, with predictably bad results.

Matthias Klose (doko) on 2014-02-24
affects: gcc-4.8 (Ubuntu) → eglibc (Ubuntu)
Changed in eglibc (Ubuntu Trusty):
importance: Undecided → High
milestone: none → ubuntu-14.03
status: New → Confirmed
Changed in eglibc:
importance: Unknown → Medium
status: Unknown → Confirmed
Michael Hudson-Doyle (mwhudson) wrote :

Will has posted a patch to the glibc bug that appears to fix the problem. I'm currently building some debs to make available for testing.

Michael Hudson-Doyle (mwhudson) wrote :

And I'm attaching a debdiff that fixes this problem, adds a test and has a fix for another problem that doesn't affect the buildds but does affect builds on newer kernels.

I still fail at quilt, so here is where the patches come from:

fix: https://sourceware.org/ml/libc-alpha/2014-04/msg00006.html
test: https://sourceware.org/ml/libc-alpha/2014-04/msg00008.html
alignment fix: https://sourceware.org/bugzilla/show_bug.cgi?id=16796#c1

I hope this is useful!

Adam Conrad (adconrad) on 2014-04-03
Changed in eglibc (Ubuntu Trusty):
assignee: nobody → Adam Conrad (adconrad)
status: Confirmed → In Progress

The attachment "debdiff-context-align-test.diff" seems to be a debdiff. The ubuntu-sponsors team has been subscribed to the bug report so that they can review and hopefully sponsor the debdiff. If the attachment isn't a patch, please remove the "patch" flag from the attachment, remove the "patch" tag, and if you are member of the ~ubuntu-sponsors, unsubscribe the team.

[This is an automated message performed by a Launchpad user owned by ~brian-murray, for any issue please contact him.]

tags: added: patch
Chris Halse Rogers (raof) wrote :

Removing sponsors; looks like Adam's got this.

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package eglibc - 2.19-0ubuntu5

---------------
eglibc (2.19-0ubuntu5) trusty; urgency=medium

  * Merge with unreleased 2.19 from Debian experimental, fixing more bugs:
    - Pull in arm64 patches to fix setcontext corruption (LP: #1279620)
    - Apply the IBM 2.19 branch for POWER8 bug fixes and optimizations.
    - Change M_CHECK_ACTION to abort if first MALLOC_CHECK_ bit is set.
 -- Adam Conrad <email address hidden> Wed, 09 Apr 2014 18:27:57 -0600

Changed in eglibc (Ubuntu Trusty):
status: In Progress → Fix Released
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.