gccgo has issues when page size is not 4kB

Bug #1304754 reported by Dave Cheney on 2014-04-09
Affects Status Importance Assigned to Milestone
gcc
Fix Released
Medium
gcc-4.9 (Ubuntu)
Medium
Unassigned
Trusty
Undecided
Unassigned
Utopic
Medium
Unassigned
gccgo-4.9 (Ubuntu)
Medium
Unassigned
Trusty
High
Matthias Klose
Utopic
Medium
Unassigned

Bug Description

On kernels 3.13-18 and 3.13-23 (there may be others) the kernel is killing gccgo compiled binaries

[18519.444748] jujud[19277]: bad frame in setup_rt_frame:
0000000000000000 nip 0000000000000000 lr 0000000000000000
[18519.673632] init: juju-agent-ubuntu-local main process (19220)
killed by SEGV signal
[18519.673651] init: juju-agent-ubuntu-local main process ended, respawning

In powerpc/kernel/signal_64.c:

sys_rt_sigreturn is jumping to the badframe: label and executing an unconditional force_sigsegv which is delivered to the userland process. Like C++, gccgo tries to decode SIGSEGV as a nil pointer access and blame some random function that happened to be the top stack frame.

Reverting to the 3.13-08 kernel appears to resolve the issue which (weakly) points the finger at the recent switch to 64k pages.

This bug is missing log files that will aid in diagnosing the problem. From a terminal window please run:

apport-collect 1304754

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
Joseph Salisbury (jsalisbury) wrote :

Would it be possible for you to test the latest upstream kernel? Refer to https://wiki.ubuntu.com/KernelMainlineBuilds . Please test the latest v3.14 kernel[0].

If this bug is fixed in the mainline kernel, please add the following tag 'kernel-fixed-upstream'.

If the mainline kernel does not fix this bug, please add the tag: 'kernel-bug-exists-upstream'.

If you are unable to test the mainline kernel, for example it will not boot, please add the tag: 'kernel-unable-to-test-upstream'.
Once testing of the upstream kernel is complete, please mark this bug as "Confirmed".

Thanks in advance.

[0] http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.14-trusty/

Changed in linux (Ubuntu):
importance: Undecided → Medium
tags: added: kernel-da-key ppc64el trusty
Joseph Salisbury (jsalisbury) wrote :

If the bug still exists with the latest mainline kernel, we can perform a bisect to identify the commit that introduced it. However, if the mainline kernel resolves this bug, we can perform a "reverse" bisect to identify the commit that fixes it.

tags: added: performing-bisect
Anton Blanchard (anton-samba) wrote :

Based on the fail, I took a look at how gccgo handles stacks. It relies on the split stack feature in gold, which doesn't appear to be implemented for ppc64.

Running one of the go recursion testcases (attached) shows what happens when we run out of stack and don't have the split stack feature to save us:

# gccgo -g -O2 -o peano peano.go
# ./peano
Segmentation fault

And we get the setup_rt_frame error in dmesg:

peano[4538]: bad frame in setup_rt_frame: 000000c20ff7f000 nip 0000000010001018 lr 0000000010001024

As expected, we just keep recursing without checking our stack pointer for overflow:

   0x0000000010001008 <+8>: cmpdi r3,0
   0x000000001000100c <+12>: beq 0x10001040 <main.count+64>
   0x0000000010001010 <+16>: mflr r0
   0x0000000010001014 <+20>: std r0,16(r1)
   0x0000000010001018 <+24>: stdu r1,-32(r1)
   0x000000001000101c <+28>: ld r3,0(r3)
   0x0000000010001020 <+32>: bl 0x10001008 <main.count+8>
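
For readers who do not have the attachment handy, here is a minimal sketch of the kind of recursion that produces a loop like the one above. It is not the attached peano.go; the `node` type and the `build` and `count` helpers are hypothetical names chosen for illustration. Each call to `count` follows one pointer and pushes one stack frame, so with a fixed-size stack and no split-stack check the depth is bounded only by the stack allocation.

```go
package main

import "fmt"

// node is a hypothetical singly linked list cell; it is not the type used
// in the attached peano.go.
type node struct {
	next *node
}

// build returns a list with n cells.
func build(n int) *node {
	var head *node
	for i := 0; i < n; i++ {
		head = &node{next: head}
	}
	return head
}

// count walks the list recursively, using one stack frame per cell.
// The addition happens after the recursive call returns, so this is not
// a tail call and the frames cannot be reused.
func count(p *node) int {
	if p == nil {
		return 0
	}
	return 1 + count(p.next)
}

func main() {
	fmt.Println(count(build(100000)))
}
```

With the gc toolchain, which grows goroutine stacks on demand, this runs to completion. Whether a gccgo build without split stacks survives a given depth depends on how large a stack each goroutine was given.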

Thanks Anton, this is great debugging.

I tried the peano experiment on my -8 (4k) kernel and it failed as expected.

I talked to the upstream who said that ./configure should detect that
-fsplit-stack isn't supported on PPC and fall back to giving each
goroutine a full stack.

I will investigate this today.

With this said, should this bug be reassigned to gccgo (trusty)?

FWIW, my guess is that the reason this appears to be "fixed" when dropping to a smaller page size is simply that you're exhausting the stack more slowly. Given that it takes a long while to reproduce with juju (or so I've been told), it also raises the question of whether juju has a slow memory leak.

affects: linux (Ubuntu) → gccgo-4.9 (Ubuntu)
Changed in gccgo-4.9 (Ubuntu):
status: Incomplete → Confirmed
summary: - gccgo compiled binaries are killed by SEGV on 64k ppc64el kernels
+ gccgo on ppc64el using split stacks when not supported
tags: removed: kernel-da-key performing-bisect
Dave Cheney (dave-cheney) wrote :

Anton:

I've done some experiments with the peano.go test and confirmed that gccgo on ppc is correctly configured not to use -fsplit-stack. It turns out that peano.go can't pass without split stacks. On gccgo/ppc64 the program crashes at the stack depth shown in this backtrace:

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x3fffb7722220 (LWP 24713)]
0x0000000010004e0c in main.is_zero ()
(gdb) bt
#0 0x0000000010004e0c in main.is_zero ()
#1 0x00000000100051fc in main.count ()
#2 0x000000001000522c in main.count ()
...
#31380 0x000000001000522c in main.count ()
#31381 0x0000000010005854 in main.main ()

I think the peano example is just a straight 'fall off the stack' type error; it also generates a slightly different dmesg message:
ubuntu@winton-02:~/go/test$ ./a.out
Segmentation fault (core dumped)
ubuntu@winton-02:~/go/test$ dmesg | tail -n1
[501663.078093] a.out[25679]: bad frame in setup_rt_frame: 000000c20ffaf0e0 nip 0000000010004e0c lr 00000000100051fc

Anton Blanchard (anton-samba) wrote :

I've made some progress with these fails. A lot of the confusion is around the way gccgo hooks the SEGV handler and attempts to backtrace all goroutines (the code is in runtime_tracebackothers())

It does this by calling runtime_gogo() which temporarily switches to the goroutine using setcontext(). If the context is bad in any way, this will cause us to SEGV again. I printed out the stack pointer (r1) and the NIA during this stack backtracing, and we see where things go south just as we are about to dump goroutine 0:

goroutine 0 [idle]:
DEBUG: runtime_gogo r1 0 nia 0

r1 = 0, nia = 0. When we call setcontext on this invalid context we die with:

jujud[5258]: bad frame in setup_rt_frame: 0000000000000000 nip 0000000000000000 lr 0000000000000000

Perhaps we aren't saving away the context for goroutine 0 correctly.

Anton Blanchard (anton-samba) wrote :

This doesn't explain why we failed in the first place however. Using gdb, I have seen a couple of SEGVs in:

* 1 Thread 0x3fffa8c447e0 (LWP 5562) "jujud" timerproc (dummy=<optimized out>) at ../../../gcc/libgo/runtime/time.goc:217

ie:

                        f = (void*)t->fv->fn;

Perhaps a stale timer that we aren't cancelling?

I've also seen a fail here:

fatal error: runtime_lock: lock count

goroutine 2 [running]:
runtime_dopanic
        ../../../gcc/libgo/runtime/panic.c:78
runtime_throw
        ../../../gcc/libgo/runtime/panic.c:116
runtime_lock
        ../../../gcc/libgo/runtime/lock_futex.c:41
runtime_allocmcache
        ../../../gcc/libgo/runtime/malloc.goc:337
runtime_startpanic
        ../../../gcc/libgo/runtime/panic.c:46
runtime_throw
        ../../../gcc/libgo/runtime/panic.c:114
runtime_unlock
        ../../../gcc/libgo/runtime/lock_futex.c:101
runtime_MHeap_Scavenger
        ../../../gcc/libgo/runtime/mheap.c:482
kickoff
        ../../../gcc/libgo/runtime/proc.c:237

        :0

        :0
created by runtime_main
        ../../../gcc/libgo/runtime/proc.c:565

On Wed, Apr 16, 2014 at 4:26 PM, Anton Blanchard <email address hidden> wrote:
> Perhaps we aren't saving away the context for goroutine 0 correctly.

Hmm, could be. It looks like the process was crashing anyway.

Dave Cheney (dave-cheney) wrote :

Hi Anton,

I've been looking at another angle via a different crash. I see a
crash if a child process gets a signal, which sort of reflects back on
the parent.

Are there any alignment requirements for signal handling on 64k kernels ?

Dave

There shouldn't be any difference in terms of signal handling.

I've now seen a couple of failures in mongodb/TLS networking code:

panic: runtime error: invalid memory address or nil pointer dereference
[signal 0xb code=0x1 addr=0x38]

goroutine 16 [running]:
crypto_tls.SetWriteDeadline.pN15_crypto_tls.Conn
        ../../../gcc/libgo/go/crypto/tls/conn.go:111
labix.org_v2_mgo.updateDeadline.pN28_labix.org_v2_mgo.mongoSocket
        /home/anton/juju-core-1.18.1/src/labix.org/v2/mgo/socket.go:273
labix.org_v2_mgo.Query.pN28_labix.org_v2_mgo.mongoSocket
        /home/anton/juju-core-1.18.1/src/labix.org/v2/mgo/socket.go:474
labix.org_v2_mgo.SimpleQuery.pN28_labix.org_v2_mgo.mongoSocket
        /home/anton/juju-core-1.18.1/src/labix.org/v2/mgo/socket.go:320
labix.org_v2_mgo.pinger.pN28_labix.org_v2_mgo.mongoServer
        /home/anton/juju-core-1.18.1/src/labix.org/v2/mgo/server.go:278
created by mgo.newServer
        /home/anton/juju-core-1.18.1/src/labix.org/v2/mgo/server.go:80

which is:

func (c *Conn) SetWriteDeadline(t time.Time) error {
        return c.conn.SetWriteDeadline(t)
}

SetWriteDeadline will end up in timer code, and I've previously seen failures in the timer code.

An excellent point. Timers are managed by a single goroutine and a
priority queue of events to wait on and channels to send the timer
event. It should be doable to write some code that stresses timers.

However, I don't believe that SIGALRM is used, at least not in gc,
from which most of the gccgo standard library is derived; gccgo might
be slightly different.

The event that crashes the go process is related to a watchdog timer
that expires and tries to kill the subprocess.
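
A rough sketch of what such a timer stress program might look like follows. It is an assumption about the shape of the test, not code that was run in this thread: `stressTimers` and `sink` are hypothetical names, and the timer count and GC frequency are arbitrary. It starts many short timers, cancels every other one, and forces garbage collections while timers are pending.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
	"sync/atomic"
	"time"
)

// sink keeps the most recent allocation reachable so earlier ones
// become garbage for the collector.
var sink []byte

// stressTimers starts n short timers, tries to cancel every other one,
// and forces a garbage collection every 100 timers. It returns how many
// timers fired and how many were stopped before firing.
func stressTimers(n int) (fired, stopped int64) {
	var wg sync.WaitGroup
	var firedCount int64
	for i := 0; i < n; i++ {
		wg.Add(1)
		t := time.AfterFunc(time.Millisecond, func() {
			atomic.AddInt64(&firedCount, 1)
			wg.Done()
		})
		if i%2 == 0 && t.Stop() {
			// Stop returned true, so the callback will never run.
			stopped++
			wg.Done()
		}
		if i%100 == 0 {
			sink = make([]byte, 1<<16)
			runtime.GC()
		}
	}
	wg.Wait()
	return atomic.LoadInt64(&firedCount), stopped
}

func main() {
	fired, stopped := stressTimers(10000)
	fmt.Println("fired:", fired, "stopped:", stopped)
}
```

Every timer either fires or is stopped, so the two counts should always add up to the number of timers started. A crash or a mismatch on an affected system would be a lead worth following.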

Created attachment 32659
Bump page size to 64kB

We are seeing random failures with go programs on a 64kB page size ppc64 box. It looks like garbage collection issues - sometimes we SEGV in timer code, sometimes we SEGV in the code that wraps a kernel read syscall. If I prevent the garbage collector from running, the programs work.

The libgo malloc hard codes the page size so I wrote a quick hack to bump this (and a few other dependent variables). This makes the problem go away, but we will need to come up with a better way to do this at runtime.

Hi Dave,

It does look like a page size issue. I submitted the following bug:

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=60931

This is going to be true on AARCH64 also where most distros are going to be using 64k pages (some might use 4k pages if they also support AARCH32). MIPS has many different page sizes too (4k, 8k, 16k, 32k, and 64k). So hard coding the page size seems wrong, maybe you should call getpagesize instead.

I agree, but when I tried this I found a few places that expect PageSize to be a compile time constant so it is not as trivial as I had hoped.

It would be extremely helpful if you could find a test case that can recreate this problem with some reliability. There is no obvious dependency on the system page size in libgo. The PageSize constant is the unit that the memory allocator deals in, and should have no direct relationship to the system page size. I believe that there is a bug, but we need to track it down.

If you set the environment variable GOGC=1 the garbage collector will run much more frequently; perhaps that will help get a reproducible test case.

Created attachment 32669
Don't use madvise(DONT_NEED) on sub pages

I think I see it:

19112 madvise(0xc211030000, 4096, MADV_DONTNEED) = 0

That 4kB madvise(MADV_DONTNEED) gets rounded up to the system page size of 64kB, and we end up covering memory that is still in use.
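
The arithmetic can be checked with a short Go sketch. The address in the strace line is already 64kB-aligned, so only the length rounding matters here. `roundUp` is a hypothetical helper, not a function from libgo or the kernel.

```go
package main

import "fmt"

// roundUp rounds n up to the next multiple of pageSize.
// pageSize must be a power of two.
func roundUp(n, pageSize uint64) uint64 {
	return (n + pageSize - 1) &^ (pageSize - 1)
}

func main() {
	const length, pageSize = 4096, 65536
	discarded := roundUp(length, pageSize)
	fmt.Printf("requested %d bytes, discarded %d bytes, %d kB too many\n",
		length, discarded, (discarded-length)/1024)
}
```

The difference between what was requested and what the kernel acts on is 60kB per call in this case.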

The following patch fixes it for me, but it just ignores any sub pages. We should keep them around so later calls have a chance at consolidating regions up to a system page size.

Perhaps, instead of skipping the madvise entirely when start or length isn't page aligned, it would be better to round the start up to the next page boundary and the end down to the previous page boundary, and then call madvise only if the rounded end is above the rounded start.
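
A minimal sketch of that rounding, written in Go for illustration (the actual change would be in C in libgo). `alignedSubrange` is a hypothetical name, and the second address in `main` is made up to show a case where something can be released.

```go
package main

import "fmt"

// alignedSubrange shrinks [start, end) to the largest range made of whole
// system pages. ok is false when no whole page fits inside the range.
// pageSize must be a power of two.
func alignedSubrange(start, end, pageSize uint64) (lo, hi uint64, ok bool) {
	lo = (start + pageSize - 1) &^ (pageSize - 1) // round start up
	hi = end &^ (pageSize - 1)                    // round end down
	return lo, hi, lo < hi
}

func main() {
	const pageSize = 65536

	// The 4kB request from the strace output: no whole 64kB page fits,
	// so nothing should be passed to madvise.
	_, _, ok := alignedSubrange(0xc211030000, 0xc211030000+4096, pageSize)
	fmt.Println("4kB request releasable:", ok)

	// A hypothetical 256kB span that starts 4kB into a system page.
	lo, hi, ok := alignedSubrange(0xc211031000, 0xc211031000+256*1024, pageSize)
	fmt.Printf("release [%#x, %#x): %v\n", lo, hi, ok)
}
```

The caller would pass `lo` and `hi-lo` to madvise only when `ok` is true, so partial pages at either end are left untouched.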

Changed in gcc:
importance: Unknown → Medium
status: Unknown → New

Created attachment 32679
runtime: Fix garbage collector issue with non 4kB system page size

The go garbage collector tracks memory in terms of 4kB pages. Most of
the code checks getpagesize() at runtime and does the right thing.

On a 64kB ppc64 box I see SEGVs in long running processes which has
been diagnosed as a bug in scavengelist. scavengelist does a
madvise(MADV_DONTNEED) without rounding the arguments to the system
page size. A strace of one of the failures shows the problem:

madvise(0xc211030000, 4096, MADV_DONTNEED) = 0

The kernel rounds the length up to 64kB and we mark 60kB of valid data
as no longer needed.

Round start up to a system page and end down before calling madvise.

Author: ian
Date: Fri Apr 25 04:28:48 2014
New Revision: 209776

URL: http://gcc.gnu.org/viewcvs?rev=209776&root=gcc&view=rev
Log:
 PR go/60931

runtime: Fix garbage collector issue with non 4kB system page size

The go garbage collector tracks memory in terms of 4kB pages.
Most of the code checks getpagesize() at runtime and does the
right thing.

On a 64kB ppc64 box I see SEGVs in long running processes
which has been diagnosed as a bug in scavengelist.
scavengelist does a madvise(MADV_DONTNEED) without rounding
the arguments to the system page size. A strace of one of the
failures shows the problem:

madvise(0xc211030000, 4096, MADV_DONTNEED) = 0

The kernel rounds the length up to 64kB and we mark 60kB of
valid data as no longer needed.

Round start up to a system page and end down before calling
madvise.

Modified:
    branches/gcc-4_9-branch/libgo/runtime/mheap.c

Author: ian
Date: Fri Apr 25 04:29:07 2014
New Revision: 209777

URL: http://gcc.gnu.org/viewcvs?rev=209777&root=gcc&view=rev
Log:
 PR go/60931

Modified:
    trunk/libgo/runtime/mheap.c

Thanks for the patch. I committed a version of it to mainline and 4.9 branch.

summary: - gccgo on ppc64el using split stacks when not supported
+ gccgo has issues when page size is not 4kB
Anton Blanchard (anton-samba) wrote :

A fix has made it into mainline and the 4.9 branch:

http://gcc.gnu.org/viewcvs/gcc?view=revision&revision=209776

Matthias Klose (doko) on 2014-04-25
Changed in gcc-4.9 (Ubuntu Trusty):
status: New → Invalid
Changed in gccgo-4.9 (Ubuntu Trusty):
importance: Undecided → Medium
milestone: none → trusty-updates
status: New → Confirmed
Changed in gccgo-4.9 (Ubuntu Utopic):
status: Confirmed → Invalid
Changed in gcc-4.9 (Ubuntu Utopic):
importance: Undecided → Medium
status: New → Confirmed
Matthias Klose (doko) wrote :

the fix is now in gccgo-4.9 in the doko/toolchain PPA (for trusty).

Changed in gcc:
status: New → Fix Released
Matthias Klose (doko) wrote :

now fixed in gcc-4.9 in utopic

Changed in gcc-4.9 (Ubuntu Utopic):
status: Confirmed → Fix Released
Matthias Klose (doko) wrote :

updated package in the ubuntu-toolchain-r/ppa PPA, removed the one in the doko/toolchain PPA.

Changed in gccgo-4.9 (Ubuntu Trusty):
assignee: nobody → Matthias Klose (doko)

I've been testing this, it's looking good so far. I'd like to run one
more test overnight before giving a thumbs up/down.

On Sun, May 4, 2014 at 2:29 PM, Launchpad Bug Tracker
<email address hidden> wrote:
> ** Branch linked: lp:debian/gcc-4.9

Dave Cheney (dave-cheney) wrote :

Matthias, I've tested this fix with Juju and it looks like it has fixed the problem with 64k kernels.

I've moved this to fix committed, I hope this is the correct status.

Changed in gccgo-4.9 (Ubuntu Trusty):
status: Confirmed → Fix Committed
Dave Cheney (dave-cheney) wrote :

Hi Matt,

Can you please post the output of dmesg, that is the canonical way to
diagnose this issue atm.

On Thu, Jul 10, 2014 at 5:23 AM, Matt Bruzek
<email address hidden> wrote:
> I installed the debian packages from the CI server http://juju-ci.vapour.ws:8080/job/publish-revision/588/
> My understanding is these deb packages were built with the PPA toolchain that has the fix installed.
>
> I destroyed the environment, rebooted the machine (for good measure) and
> find that I still get an error.
>
> https://pastebin.canonical.com/113210/
>
> Are we sure this fixes the problem?

Steve Langasek (vorlon) wrote :

Moving this back to "in progress". Matthias, this seems to be a major issue for go on ppc64el in 14.04. Are you planning to SRU this fix in? Do you know the timeline when this might happen?

Thanks!

Changed in gccgo-4.9 (Ubuntu Trusty):
importance: Medium → High
status: Fix Committed → In Progress
Matthias Klose (doko) wrote :

I'd like to prepare that on Jul 17, based on 4.9.1, including all accumulated ABI fixes and warnings for ABI changes targeted for 4.10.

Patricia Gaughen (gaughen) wrote :

Matthias - does that mean it will show up in proposed on Jul 17? I think there are folks who would hope it would land in Trusty sooner.

Antonio Rosales (arosales) wrote :

@Matthias,

Thanks for the work on this. We are blocked running Juju on power8le. Will gcc be updated in the archives as of July 17, or go into proposed at that time?

-thanks,
Antonio

Matthias Klose (doko) wrote :

the gccgo-4.9 package is now updated and built in this PPA (not yet on all architectures). Please confirm that juju-core built with this package works, then we can copy the package including the binaries into -proposed.

Matthias Klose (doko) wrote :

for your convenience, juju-core is now built in this PPA using gccgo-4.9 4.9.1

Matt Bruzek (mbruzek) wrote :

Is there any way to verify if a Debian package has been built with this gccgo fix?

Procedure:
I downloaded the 14.04 debian packages from:
http://juju-ci.vapour.ws:8080/job/publish-revision/690/
These packages were built on stilson-07 which has the PPA and the fixed version of gccgo.

Then I ran apt-get upgrade/update
sudo dpkg -i *1.20.2*

juju bootstrap -v -e local --debug
juju deploy local:trusty/ubuntu
juju ssh ubuntu/0

Segmentation fault on stilson-01:
http://pastebin.ubuntu.com/7832678/

According to line 372 the dmesg output:
http://paste.ubuntu.com/7832700/
This appears to be the same problem, the deb packages were not built with the fix.

I am also able to reproduce this Segmentation fault on the CI machine stilson-06.
http://pastebin.ubuntu.com/7836776/

The CI build machine has the toolchain-r PPA and the fixed level of gccgo:
http://pastebin.ubuntu.com/7836812/

The dmesg output is attached.

Matthias Klose (doko) wrote :

On 22.07.2014 18:05, Matt Bruzek wrote:
> Is there any way to verify if a Debian package has been built with this
> gccgo fix?

sure, make sure to build-depend on gccgo-4.9 (>= 4.9.1).

> Procedure:
> I downloaded the 14.04 debian packages from:
> http://juju-ci.vapour.ws:8080/job/publish-revision/690/
> These packages were built on stilson-07 which has the PPA and the fixed version of gccgo.

I can't see this in the build logs.

according to the build log in
https://launchpad.net/~ubuntu-toolchain-r/+archive/ubuntu/ppa/+build/6192686

the package is built using gccgo-4.9 4.9.1-0ubuntu1.

Matt Bruzek (mbruzek) wrote :

I tried the package that Matthias left on the link in the last comment (1.18.1). I received a similar but different error:

goroutine 3 [syscall]:
goroutine in C code; stack unavailable

goroutine 10 [IO wait]:
code.google.com_p_go.net_websocket.ReadByte.N57_code.google.com_p_go.net_websocket.hybiFrameReaderFactory
    /build/buildd/juju-core-1.18.1/src/code.google.com/p/go.net/websocket/hybi.go:113
code.google.com_p_go.net_websocket.NewFrameReader.N57_code.google.com_p_go.net_websocket.hybiFrameReaderFactory
    /build/buildd/juju-core-1.18.1/src/code.google.com/p/go.net/websocket/hybi.go:126
code.google.com_p_go.net_websocket.Receive.N40_code.google.com_p_go.net_websocket.Codec
    /build/buildd/juju-core-1.18.1/src/code.google.com/p/go.net/websocket/websocket.go:314
launchpad.net_juju_core_rpc_jsoncodec.Receive.N48_launchpad.net_juju_core_rpc_jsoncodec.wsJSONConn
    /build/buildd/juju-core-1.18.1/src/launchpad.net/juju-core/rpc/jsoncodec/conn.go:25
launchpad.net_juju_core_rpc_jsoncodec.ReadHeader.pN43_launchpad.net_juju_core_rpc_jsoncodec.Codec
    /build/buildd/juju-core-1.18.1/src/launchpad.net/juju-core/rpc/jsoncodec/codec.go:113
launchpad.net_juju_core_rpc.loop.pN32_launchpad.net_juju_core_rpc.Conn
    /build/buildd/juju-core-1.18.1/src/launchpad.net/juju-core/rpc/server.go:344
launchpad.net_juju_core_rpc.input.pN32_launchpad.net_juju_core_rpc.Conn
    /build/buildd/juju-core-1.18.1/src/launchpad.net/juju-core/rpc/server.go:317
created by launchpad.net_juju_core_rpc.Start.pN32_launchpad.net_juju_core_rpc.Conn
    /build/buildd/juju-core-1.18.1/src/launchpad.net/juju-core/rpc/server.go:200
ubuntu@stilson-01:~$

ubuntu@stilson-01:~$ uname -a
Linux stilson-01 3.13.0-32-generic #57-Ubuntu SMP Tue Jul 15 03:50:31 UTC 2014 ppc64le ppc64le ppc64le GNU/Linux
ubuntu@stilson-01:~$ dpkg -l | grep juju
ii juju 1.18.1-0ubuntu1.1 all next generation service orchestration system
ii juju-core 1.18.1-0ubuntu1.1 ppc64el Juju is devops distilled - client
ii juju-deployer 0.3.6-0ubuntu2 all Deploy complex stacks of ...

Dave Cheney (dave-cheney) wrote :

This new failure looks different (based on the dmesg output), please
open a new issue.

On Wed, Jul 23, 2014 at 8:36 AM, Matt Bruzek
<email address hidden> wrote:
> ** Attachment added: "the dmesg output from stilson-01 with version 1.18.1 installed."
> https://bugs.launchpad.net/ubuntu/+source/gccgo-4.9/+bug/1304754/+attachment/4160288/+files/dmesg_1.18_output.txt

Antonio Rosales (arosales) wrote :

For reference the new bug Matt opened is:
https://bugs.launchpad.net/ubuntu/+source/juju-core/+bug/1347322

-thanks,
Antonio

Matthias Klose (doko) wrote :

I copied the gccgo-4.9 package from the ubuntu-toolchain-r/ppa package to trusty-proposed (now waiting for approval). The following testing was done:

 - libgcc1: the package contains the shared library and exports the same symbols;
   no code changes were made to libgcc1 itself.

 - the testsuite doesn't show regressions on any architecture (although we should
   only be interested in regressions in gccgo and libgo; the package in trusty
   didn't ship a cc1, so we don't have any regressions for the C compiler).

 - The only package build-depending on gccgo in trusty (juju-core) was successfully rebuilt.

Matt Bruzek (mbruzek) wrote :

I believe the gccgo 4.9.1 compiler fix is a good thing. After rebuilding Juju with gccgo-4.9_4.9.1-1ubuntu3_ppc64el.deb I was unable to reproduce the "juju ssh" problem I had seen previously. I also did some load testing on juju with multiple bootstrap, deploy, and destroy cycles without any juju-related problems.

See https://bugs.launchpad.net/ubuntu/+source/juju-core/+bug/1347322 Comment #7 for the details of how we built with the gccgo 4.9.1 compiler.

Thanks.

Steve Langasek (vorlon) wrote :

On Fri, Jul 25, 2014 at 10:35:41PM -0000, Matt Bruzek wrote:
> I believe the gccgo 4.9.1 compiler fix is a good thing. After
> rebuilding Juju with gccgo-4.9_4.9.1-1ubuntu3_ppc64el.deb I was unable
> to get the "juju ssh" problem I had seen previously.

For purposes of this SRU bug, please verify the juju-core
1.18.4+dfsg-0ubuntu0.14.04.1 package in trusty-proposed, not a locally-built
juju-core package built using the compiler from utopic.

Matthias Klose (doko) wrote :

new version 4.9.1-13ubuntu1 copied to trusty-proposed, requested by mwhudson. waiting for review.

Tim Penhey (thumper) wrote :

I can verify that with the above package from trusty-proposed, I was able to bootstrap a local juju environment on rockne-02.

tags: added: verification-done
Tim Penhey (thumper) wrote :

However rockne-02 has a 4k page :-(

tags: removed: verification-done
Tim Penhey (thumper) wrote :

Gah, must have looked at the page size of my laptop instead:

ubuntu@rockne-02:~$ juju status
environment: local
machines:
  "0":
    agent-state: down
    agent-state-info: (started)
    agent-version: 1.18.4.1
    dns-name: localhost
    instance-id: localhost
    series: trusty
services: {}
ubuntu@rockne-02:~$ uname -a
Linux rockne-02 3.13.0-18-generic #38-Ubuntu SMP Mon Mar 17 21:41:16 UTC 2014 ppc64le ppc64le ppc64le GNU/Linux
ubuntu@rockne-02:~$ getconf PAGESIZE
65536

tags: added: verification-done
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package gccgo-4.9 - 4.9.1-0ubuntu1

---------------
gccgo-4.9 (4.9.1-0ubuntu1) trusty-proposed; urgency=medium

  * Upload the final GCC 4.9.1 release.
  * Merge changes from gcc-4.9 4.9.0-2ubuntu1, including:
    - Fix PR go/60931, garbage collector issue with non 4kB system page size.
      LP: #1304754.
    - Fix wrong-code issue in the little endian vector API (ppc64el).
      LP: #1311128.
    - Fix ABI incompatibility between POWER and Z HTM builtins and intrinsics.
      LP: #1320292.
    - Fix an ICE with invalid code. PR c++/61046. LP: #1313102.
    - gccgo: Don't overwrite memory if an archive has a bad file name.
  * Include the cc1 binary into the gccgo-4.9 package.
  * Do not build-depend on sdt-systemtap for the trusty upload.
  * Warn about ppc ELFv2 ABI issues, which will change in GCC 4.10.
 -- Matthias Klose <email address hidden> Thu, 17 Jul 2014 15:51:15 +0200

Changed in gccgo-4.9 (Ubuntu Trusty):
status: In Progress → Fix Released

The verification of the Stable Release Update for gccgo-4.9 has completed successfully and the package has now been released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

Hello Dave, or anyone else affected,

Accepted gccgo-4.9 into trusty-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/gccgo-4.9/4.9.3-0ubuntu4 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us in getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-needed to verification-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

tags: removed: verification-done
tags: added: verification-needed
Steve Langasek (vorlon) wrote :

This change was already included in the previous sru of gccgo-4.9, 4.9.1-0ubuntu1; no re-verification is required.

tags: added: verification-done
removed: verification-needed