Bug #1304754 “gccgo has issues when page size is not 4kB” : Bugs : gccgo-4.9 package : Ubuntu

Brad Figg (brad-figg) wrote on 2014-04-09: Missing required logs.

#1

This bug is missing log files that will aid in diagnosing the problem. From a terminal window please run:

apport-collect 1304754

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status:	New → Incomplete

Dave Cheney (dave-cheney) wrote on 2014-04-09: Re: gccgo compiled binaries are killed by SEGV on 64k ppc64el kernels

#2

See also, https://bugs.launchpad.net/juju-core/+bug/1303787

Joseph Salisbury (jsalisbury) wrote on 2014-04-09:

#3

Would it be possible for you to test the latest upstream kernel? Refer to https://wiki.ubuntu.com/KernelMainlineBuilds . Please test the latest v3.14 kernel[0].

If this bug is fixed in the mainline kernel, please add the following tag 'kernel-fixed-upstream'.

If the mainline kernel does not fix this bug, please add the tag: 'kernel-bug-exists-upstream'.

If you are unable to test the mainline kernel, for example it will not boot, please add the tag: 'kernel-unable-to-test-upstream'.
Once testing of the upstream kernel is complete, please mark this bug as "Confirmed".

Thanks in advance.

[0] http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.14-trusty/

Changed in linux (Ubuntu):
importance:	Undecided → Medium
tags:	added: kernel-da-key ppc64el trusty

Joseph Salisbury (jsalisbury) wrote on 2014-04-09:

#4

If the bug still exists with the latest mainline kernel, we can perform a bisect to identify the commit that introduced this. However, if the mainline kernel resolves this bug, we can perform a "Reverse" bisect to identify the commit that fixes this.

tags:

added: performing-bisect

Anton Blanchard (anton-samba) wrote on 2014-04-10:

#5

Attachment: peano.go (2.1 KiB, text/plain)

Based on the fail, I took a look at how gccgo handles stacks. It relies on the split stack feature in gold, which doesn't appear to be implemented for ppc64.

Running one of the go recursion testcases (attached) shows what happens when we run out of stack and don't have the split stack feature to save us:

# gccgo -g -O2 -o peano peano.go
# ./peano
Segmentation fault

And we get the setup_rt_frame error in dmesg:

peano[4538]: bad frame in setup_rt_frame: 000000c20ff7f000 nip 0000000010001018 lr 0000000010001024

As expected, we just continually recurse without checking our stack pointer for overflow:

   0x0000000010001008 <+8>: cmpdi r3,0
   0x000000001000100c <+12>: beq 0x10001040 <main.count+64>
   0x0000000010001010 <+16>: mflr r0
   0x0000000010001014 <+20>: std r0,16(r1)
   0x0000000010001018 <+24>: stdu r1,-32(r1)
   0x000000001000101c <+28>: ld r3,0(r3)
   0x0000000010001020 <+32>: bl 0x10001008 <main.count+8>
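For readers without the attachment, the recursion pattern in the disassembly can be sketched in Go roughly as follows. This is not the attached peano.go; the `node` type and the `build` and `count` helpers are hypothetical names chosen to mirror the load-predecessor-and-recurse loop shown above.

```go
package main

import "fmt"

// node is a hypothetical stand-in for a Peano number: each value points at
// its predecessor, and nil represents zero.
type node struct {
	pred *node
}

// build returns the Peano representation of n.
func build(n int) *node {
	var p *node
	for i := 0; i < n; i++ {
		p = &node{pred: p}
	}
	return p
}

// count recurses once per predecessor, so the call depth equals the value.
// It is deliberately not tail-recursive: every level keeps a live frame.
func count(p *node) int {
	if p == nil {
		return 0
	}
	return 1 + count(p.pred)
}

func main() {
	// Harmless when stacks can grow, but deep enough to matter when each
	// goroutine has a fixed-size stack and no overflow check.
	fmt.Println(count(build(100000)))
}
```

With growable or split stacks the depth is limited only by memory; with a fixed stack and no overflow check, a large enough argument runs off the end of the stack.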

Dave Cheney (dave-cheney) wrote on 2014-04-10: Re: [Bug 1304754] Re: gccgo compiled binaries are killed by SEGV on 64k ppc64el kernels

#6

Thanks Anton, this is great debugging.

I tried the peano experiment on my -8 (4k) kernel and it failed as expected.

I talked to upstream, who said that ./configure should detect that
-fsplit-stack isn't supported on PPC and fall back to giving each
goroutine a full stack.

I will investigate this today.

With this said, should this bug be reassigned to gccgo (trusty)?

On Thu, Apr 10, 2014 at 8:44 PM, Anton Blanchard <email address hidden> wrote:
> Based on the fail, I took a look at how gccgo handles stacks. It relies
> on the split stack feature in gold, which doesn't appear to be
> implemented for ppc64.
>
> Running one of the go recursion testcases (attached) shows what happens
> when we run out of stack and don't have the split stack feature to save
> us:
>
> #gccgo -g -O2 -o peano peano.go
> # ./peano
> Segmentation fault
>
> And we get the setup_rt_frame error in dmesg:
>
> peano[4538]: bad frame in setup_rt_frame: 000000c20ff7f000 nip
> 0000000010001018 lr 0000000010001024
>
> As expected, we are just continually recurse without checking out stack
> pointer for overflow:
>
> 0x0000000010001008 <+8>: cmpdi r3,0
> 0x000000001000100c <+12>: beq 0x10001040 <main.count+64>
> 0x0000000010001010 <+16>: mflr r0
> 0x0000000010001014 <+20>: std r0,16(r1)
> 0x0000000010001018 <+24>: stdu r1,-32(r1)
> 0x000000001000101c <+28>: ld r3,0(r3)
> 0x0000000010001020 <+32>: bl 0x10001008 <main.count+8>
>
>
> ** Attachment added: "peano.go"
> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1304754/+attachment/4079310/+files/peano.go
>
> --
> You received this bug notification because you are subscribed to the bug
> report.
> https://bugs.launchpad.net/bugs/1304754
>
> Title:
> gccgo compiled binaries are killed by SEGV on 64k ppc64el kernels
>
> Status in "linux" package in Ubuntu:
> Incomplete
>
> Bug description:
> On kernels 3.13-18 and 3.13-23 (there may be others) the kernel is
> killing gccgo compiled binaries
>
> [18519.444748] jujud[19277]: bad frame in setup_rt_frame:
> 0000000000000000 nip 0000000000000000 lr 0000000000000000
> [18519.673632] init: juju-agent-ubuntu-local main process (19220)
> killed by SEGV signal
> [18519.673651] init: juju-agent-ubuntu-local main process ended, respawning
>
> In powerpc/kernel/signal_64.c:
>
> sys_rt_sigreturn is jumping to the badframe: label and executing an
> unconditional force_sigsegv which is delivered to the userland
> process. Like C++, gccgo tries to decode SIGSEGV as a nil pointer
> access and blame some random function that happened to be the top
> stack frame.
>
> Reverting to the 3.13-08 kernel appears to resolve the issue which
> (weakly) points the finger at the recent switch to 64k pages.
>
> To manage notifications about this bug go to:
> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1304754/+subscriptions

Adam Conrad (adconrad) wrote on 2014-04-12: Re: gccgo on ppc64el using split stacks when not supported

#7

FWIW, my guess is that the reason this appears to be "fixed" when dropping to a smaller pagesize is simply because you're exhausting the stack more slowly. Given that it takes a long while to reproduce with juju (or so I've been told), it does also raise the question of whether juju has a slow memory leak.

affects:	linux (Ubuntu) → gccgo-4.9 (Ubuntu)
Changed in gccgo-4.9 (Ubuntu):
status:	Incomplete → Confirmed
summary:	- gccgo compiled binaries are killed by SEGV on 64k ppc64el kernels
	+ gccgo on ppc64el using split stacks when not supported
tags:	removed: kernel-da-key performing-bisect

Dave Cheney (dave-cheney) wrote on 2014-04-14:

#8

Anton:

I've done some experiments with the peano.go test and confirmed that gccgo on ppc is correctly configured not to use -fsplit-stack. It turns out that peano.go can't pass without split stacks. On gccgo/ppc64 the program crashes at a stack depth of just over 31,000 frames:

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x3fffb7722220 (LWP 24713)]
0x0000000010004e0c in main.is_zero ()
(gdb) bt
#0 0x0000000010004e0c in main.is_zero ()
#1 0x00000000100051fc in main.count ()
#2 0x000000001000522c in main.count ()
...
#31380 0x000000001000522c in main.count ()
#31381 0x0000000010005854 in main.main ()

I think the peano example is just a straight 'fall off the stack' type error; it also generates a slightly different dmesg message:
ubuntu@winton-02:~/go/test$ ./a.out
Segmentation fault (core dumped)
ubuntu@winton-02:~/go/test$ dmesg | tail -n1
[501663.078093] a.out[25679]: bad frame in setup_rt_frame: 000000c20ffaf0e0 nip 0000000010004e0c lr 00000000100051fc

Anton Blanchard (anton-samba) wrote on 2014-04-16:

#9

I've made some progress with these fails. A lot of the confusion is around the way gccgo hooks the SEGV handler and attempts to backtrace all goroutines (the code is in runtime_tracebackothers())

It does this by calling runtime_gogo() which temporarily switches to the goroutine using setcontext(). If the context is bad in any way, this will cause us to SEGV again. I printed out the stack pointer (r1) and the NIA during this stack backtracing, and we see where things go south just as we are about to dump goroutine 0:

goroutine 0 [idle]:
DEBUG: runtime_gogo r1 0 nia 0

r1 = 0, nia = 0. When we call setcontext on this invalid context we die with:

jujud[5258]: bad frame in setup_rt_frame: 0000000000000000 nip 0000000000000000 lr 0000000000000000

Perhaps we aren't saving away the context for goroutine 0 correctly.

Anton Blanchard (anton-samba) wrote on 2014-04-16:

#10

This doesn't explain why we failed in the first place however. Using gdb, I have seen a couple of SEGVs in:

* 1 Thread 0x3fffa8c447e0 (LWP 5562) "jujud" timerproc (dummy=<optimized out>) at ../../../gcc/libgo/runtime/time.goc:217

ie:

f = (void*)t->fv->fn;

Perhaps a stale timer that we aren't cancelling?

I've also seen a fail here:

fatal error: runtime_lock: lock count

goroutine 2 [running]:
runtime_dopanic
        ../../../gcc/libgo/runtime/panic.c:78
runtime_throw
        ../../../gcc/libgo/runtime/panic.c:116
runtime_lock
        ../../../gcc/libgo/runtime/lock_futex.c:41
runtime_allocmcache
        ../../../gcc/libgo/runtime/malloc.goc:337
runtime_startpanic
        ../../../gcc/libgo/runtime/panic.c:46
runtime_throw
        ../../../gcc/libgo/runtime/panic.c:114
runtime_unlock
        ../../../gcc/libgo/runtime/lock_futex.c:101
runtime_MHeap_Scavenger
        ../../../gcc/libgo/runtime/mheap.c:482
kickoff
        ../../../gcc/libgo/runtime/proc.c:237

:0

:0
created by runtime_main
../../../gcc/libgo/runtime/proc.c:565

Dave Cheney (dave-cheney) wrote on 2014-04-16: Re: [Bug 1304754] Re: gccgo on ppc64el using split stacks when not supported

#11

On Wed, Apr 16, 2014 at 4:26 PM, Anton Blanchard <email address hidden> wrote:
> I've made some progress with these fails. A lot of the confusion is
> around the way gccgo hooks the SEGV handler and attempts to backtrace
> all goroutines (the code is in runtime_tracebackothers())
>
> It does this by calling runtime_gogo() which temporarily switches to the
> goroutine using setcontext(). If the context is bad in any way, this
> will cause us to SEGV again. I printed out the stack pointer (r1) and
> the NIA during this stack backtracing, and we see where things go south
> just as we are about to dump goroutine 0:
>
> goroutine 0 [idle]:
> DEBUG: runtime_gogo r1 0 nia 0
>
> r1 = 0, nia = 0. When we call setcontext on this invalid context we die
> with:
>
> jujud[5258]: bad frame in setup_rt_frame: 0000000000000000 nip
> 0000000000000000 lr 0000000000000000
>
> Perhaps we aren't saving away the context for goroutine 0 correctly.

Hmm, could be. It looks like the process was crashing anyway.


Dave Cheney (dave-cheney) wrote on 2014-04-16:

#12

Hi Anton,

I've been looking at another angle via a different crash. I see a
crash if a child process gets a signal, which sort of reflects back on
the parent.

Are there any alignment requirements for signal handling on 64k kernels ?

Dave

On Wed, Apr 16, 2014 at 4:28 PM, Anton Blanchard <email address hidden> wrote:
> This doesn't explain why we failed in the first place however. Using
> gdb, I have seen a couple of SEGVs in:
>
> * 1 Thread 0x3fffa8c447e0 (LWP 5562) "jujud" timerproc
> (dummy=<optimized out>) at ../../../gcc/libgo/runtime/time.goc:217
>
> ie:
>
> f = (void*)t->fv->fn;
>
> Perhaps a stale timer that we aren't cancelling?
>
> I've also seen a fail here:
>
> fatal error: runtime_lock: lock count
>
> goroutine 2 [running]:
> runtime_dopanic
> ../../../gcc/libgo/runtime/panic.c:78
> runtime_throw
> ../../../gcc/libgo/runtime/panic.c:116
> runtime_lock
> ../../../gcc/libgo/runtime/lock_futex.c:41
> runtime_allocmcache
> ../../../gcc/libgo/runtime/malloc.goc:337
> runtime_startpanic
> ../../../gcc/libgo/runtime/panic.c:46
> runtime_throw
> ../../../gcc/libgo/runtime/panic.c:114
> runtime_unlock
> ../../../gcc/libgo/runtime/lock_futex.c:101
> runtime_MHeap_Scavenger
> ../../../gcc/libgo/runtime/mheap.c:482
> kickoff
> ../../../gcc/libgo/runtime/proc.c:237
>
> :0
>
> :0
> created by runtime_main
> ../../../gcc/libgo/runtime/proc.c:565

Anton Blanchard (anton-samba) wrote on 2014-04-16: Re: gccgo on ppc64el using split stacks when not supported

#13

There shouldn't be any difference in terms of signal handling.

I've now seen a couple of failures in mongodb/TLS networking code:

panic: runtime error: invalid memory address or nil pointer dereference
[signal 0xb code=0x1 addr=0x38]

goroutine 16 [running]:
crypto_tls.SetWriteDeadline.pN15_crypto_tls.Conn
        ../../../gcc/libgo/go/crypto/tls/conn.go:111
labix.org_v2_mgo.updateDeadline.pN28_labix.org_v2_mgo.mongoSocket
        /home/anton/juju-core-1.18.1/src/labix.org/v2/mgo/socket.go:273
labix.org_v2_mgo.Query.pN28_labix.org_v2_mgo.mongoSocket
        /home/anton/juju-core-1.18.1/src/labix.org/v2/mgo/socket.go:474
labix.org_v2_mgo.SimpleQuery.pN28_labix.org_v2_mgo.mongoSocket
        /home/anton/juju-core-1.18.1/src/labix.org/v2/mgo/socket.go:320
labix.org_v2_mgo.pinger.pN28_labix.org_v2_mgo.mongoServer
        /home/anton/juju-core-1.18.1/src/labix.org/v2/mgo/server.go:278
created by mgo.newServer
        /home/anton/juju-core-1.18.1/src/labix.org/v2/mgo/server.go:80

which is:

func (c *Conn) SetWriteDeadline(t time.Time) error {
	return c.conn.SetWriteDeadline(t)
}

SetWriteDeadline will end up in timer code, and I've previously seen failures in the timer code.

Dave Cheney (dave-cheney) wrote on 2014-04-16: Re: [Bug 1304754] Re: gccgo on ppc64el using split stacks when not supported

#14

An excellent point. Timers are managed by a single goroutine and a
priority queue of events to wait on and channels to send the timer
event. It should be doable to write some code that stresses timers.

However, I don't believe that SIGALRM is used, at least not in gc,
which most of the gccgo standard library derives from; gccgo might be
slightly different.

The event that crashes the go process is related to a watchdog timer
that expires and tries to kill the subprocess.
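A timer stress test along these lines could look something like the sketch below. It is only an illustration: `stressTimers` and its parameters are hypothetical names, and nothing here is known to reproduce the jujud failure. It creates many short timers from several goroutines, stops half of them, and checks that every timer is accounted for exactly once.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// stressTimers starts `workers` goroutines that each create `iterations`
// short timers and try to stop every other one. It returns how many timers
// fired and how many were stopped before firing.
func stressTimers(workers, iterations int) (fired, stopped int64) {
	var pending sync.WaitGroup     // one count per timer, released on fire or stop
	var workersDone sync.WaitGroup // one count per worker goroutine

	for w := 0; w < workers; w++ {
		workersDone.Add(1)
		go func() {
			defer workersDone.Done()
			for i := 0; i < iterations; i++ {
				pending.Add(1)
				t := time.AfterFunc(time.Duration(i%5)*time.Microsecond, func() {
					atomic.AddInt64(&fired, 1)
					pending.Done()
				})
				// Stop reports true only if the timer had not fired yet,
				// so each timer is counted exactly once.
				if i%2 == 0 && t.Stop() {
					atomic.AddInt64(&stopped, 1)
					pending.Done()
				}
			}
		}()
	}
	workersDone.Wait() // all timers have been created
	pending.Wait()     // all timers have fired or been stopped
	return atomic.LoadInt64(&fired), atomic.LoadInt64(&stopped)
}

func main() {
	fired, stopped := stressTimers(8, 10000)
	fmt.Printf("fired=%d stopped=%d total=%d\n", fired, stopped, fired+stopped)
}
```

A lost or doubly-delivered timer would show up as a hang in `pending.Wait()` or a WaitGroup panic rather than as a silent miscount.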

On Wed, Apr 16, 2014 at 6:04 PM, Anton Blanchard <email address hidden> wrote:
> There shouldn't be any difference in terms of signal handling.
>
> I've now seen a couple of failures in mongodb/TLS networking code:
>
> panic: runtime error: invalid memory address or nil pointer dereference
> [signal 0xb code=0x1 addr=0x38]
>
> goroutine 16 [running]:
> crypto_tls.SetWriteDeadline.pN15_crypto_tls.Conn
> ../../../gcc/libgo/go/crypto/tls/conn.go:111
> labix.org_v2_mgo.updateDeadline.pN28_labix.org_v2_mgo.mongoSocket
> /home/anton/juju-core-1.18.1/src/labix.org/v2/mgo/socket.go:273
> labix.org_v2_mgo.Query.pN28_labix.org_v2_mgo.mongoSocket
> /home/anton/juju-core-1.18.1/src/labix.org/v2/mgo/socket.go:474
> labix.org_v2_mgo.SimpleQuery.pN28_labix.org_v2_mgo.mongoSocket
> /home/anton/juju-core-1.18.1/src/labix.org/v2/mgo/socket.go:320
> labix.org_v2_mgo.pinger.pN28_labix.org_v2_mgo.mongoServer
> /home/anton/juju-core-1.18.1/src/labix.org/v2/mgo/server.go:278
> created by mgo.newServer
> /home/anton/juju-core-1.18.1/src/labix.org/v2/mgo/server.go:80
>
> which is:
>
> func (c *Conn) SetWriteDeadline(t time.Time) error {
> return c.conn.SetWriteDeadline(t)
> }
>
> SetWriteDeadline will end up in timer code, and I've previously seen
> failures in the timer code.

In GCC Bugzilla #60931, Anton Blanchard (anton-samba) wrote on 2014-04-23:

#16

Created attachment 32659
Bump page size to 64kB

We are seeing random failures with go programs on a 64kB page size ppc64 box. It looks like garbage collection issues - sometimes we SEGV in timer code, sometimes we SEGV in the code that wraps a kernel read syscall. If I prevent the garbage collector from running, the programs work.

The libgo malloc hard codes the page size so I wrote a quick hack to bump this (and a few other dependent variables). This makes the problem go away, but we will need to come up with a better way to do this at runtime.

Anton Blanchard (anton-samba) wrote on 2014-04-23: Re: gccgo on ppc64el using split stacks when not supported

#15

Hi Dave,

It does look like a page size issue. I submitted the following bug:

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=60931

In GCC Bugzilla #60931, Pinskia (pinskia) wrote on 2014-04-23:

#17

This is going to be true on AARCH64 also, where most distros are going to be using 64k pages (some might use 4k pages if they also support AARCH32). MIPS has many different page sizes too (4k, 8k, 16k, 32k, and 64k). So hard-coding the page size seems wrong; maybe you should call getpagesize instead.
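As a rough illustration of querying the page size at run time, the Go sketch below compares the value from `os.Getpagesize` with a fixed 4kB unit. The constant `allocatorPageSize` and the helper `pagesPerSystemPage` are hypothetical names for this example, not identifiers from libgo.

```go
package main

import (
	"fmt"
	"os"
)

// allocatorPageSize is a hypothetical stand-in for a hard-coded 4kB unit.
const allocatorPageSize = 4096

// pagesPerSystemPage reports how many allocator-sized pages share one
// system page of the given size.
func pagesPerSystemPage(systemPage int) int {
	return systemPage / allocatorPageSize
}

func main() {
	sys := os.Getpagesize() // queried at run time, not compiled in
	fmt.Printf("system page: %d bytes, allocator pages per system page: %d\n",
		sys, pagesPerSystemPage(sys))
}
```

On a 64kB-page kernel the ratio is 16, which is the case where per-4kB operations can spill over onto neighbouring allocator pages.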

Revision history for this message

In GCC Bugzilla #60931, Anton Blanchard (anton-samba) wrote on 2014-04-23:

#18

I agree, but when I tried this I found a few places that expect PageSize to be a compile-time constant, so it is not as trivial as I had hoped.

Revision history for this message

In GCC Bugzilla #60931, Ian Lance Taylor (ianlancetaylor) wrote on 2014-04-23:

#19

It would be extremely helpful if you could find a test case that can recreate this problem with some reliability. There is no obvious dependency on the system page size in libgo. The PageSize constant is the unit that the memory allocator deals in, and should have no direct relationship to the system page size. I believe that there is a bug, but we need to track it down.

If you set the environment variable GOGC=1 the garbage collector will run much more frequently; perhaps that will help get a reproducible test case.
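A small sketch of how a test harness might apply that suggestion, assuming the harness is written in Python. The helper `gc_stress_env` is hypothetical; it only builds an environment with GOGC set and does not run anything.

```python
import os


def gc_stress_env(base=None, gogc="1"):
    """Return a copy of an environment mapping with GOGC set.

    With GOGC=1 the Go garbage collector runs much more frequently.
    `base` defaults to the current process environment; the mapping
    passed in is never modified.
    """
    env = dict(os.environ if base is None else base)
    env["GOGC"] = gogc
    return env
```

The result can be passed as the `env` argument of `subprocess.run` when launching the program under test, for example in a loop that stops at the first non-zero exit status.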

Revision history for this message

In GCC Bugzilla #60931, Anton Blanchard (anton-samba) wrote on 2014-04-24:

#20

Created attachment 32669
Don't use madvise(DONT_NEED) on sub pages

Revision history for this message

In GCC Bugzilla #60931, Anton Blanchard (anton-samba) wrote on 2014-04-24:

#21

I think I see it:

19112 madvise(0xc211030000, 4096, MADV_DONTNEED) = 0

That 4kB madvise(MADV_DONTNEED) gets rounded up to the system page size of 64kB and we end up covering memory that is still in use.

The following patch fixes it for me, but it just ignores any sub pages. We should keep them around so later calls have a chance at consolidating regions up to a system page size.
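The arithmetic behind that strace line can be checked with a short model. This is a sketch in Python, not kernel or libgo code. It assumes the Linux behaviour documented in madvise(2): the address must be page aligned and the length is rounded up to a multiple of the page size. The function name is hypothetical.

```python
SYSTEM_PAGE = 64 * 1024  # 64kB system page, as on the affected ppc64 box


def kernel_madvise_range(addr, length, page=SYSTEM_PAGE):
    """Model the byte range a madvise() call acts on.

    The length is rounded up to a whole number of system pages, so a
    request smaller than one page still covers the entire page.
    """
    if addr % page != 0:
        raise ValueError("addr is not page aligned (madvise would fail with EINVAL)")
    rounded = (length + page - 1) // page * page
    return addr, addr + rounded


# The call from the strace output: a 4kB request on a 64kB-page kernel.
start, end = kernel_madvise_range(0xC211030000, 4096)
collateral = (end - start) - 4096  # bytes covered beyond the requested 4kB

if __name__ == "__main__":
    print(f"covered {hex(start)}..{hex(end)}, {collateral // 1024}kB beyond the request")
```

The extra 60kB is the memory that may still be in use.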

Revision history for this message

In GCC Bugzilla #60931, Jakub-gcc (jakub-gcc) wrote on 2014-04-24:

#22

Perhaps it would be better, instead of not doing the madvise at all when start or length isn't page aligned, to round the start up to the next page boundary and the end down to the previous page boundary, and to madvise only if the rounded end is above the rounded start.
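A minimal sketch of that rounding, in Python for illustration. It is not the code that was committed to mheap.c, and the function name is hypothetical. It assumes the page size is a power of two.

```python
def inward_page_range(start, length, page):
    """Shrink [start, start + length) to whole system pages.

    Rounds the start up and the end down to multiples of `page`
    (which must be a power of two). Returns (aligned_start,
    aligned_length), or None when no whole page fits inside the
    range, in which case madvise should be skipped.
    """
    end = start + length
    aligned_start = (start + page - 1) & ~(page - 1)  # round up
    aligned_end = end & ~(page - 1)                   # round down
    if aligned_end > aligned_start:
        return aligned_start, aligned_end - aligned_start
    return None
```

For the failing call in this report, `inward_page_range(0xC211030000, 4096, 65536)` returns None, so nothing would be released until the free region grows to cover at least one full 64kB page.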

Bug Watch Updater (bug-watch-updater) on 2014-04-24

Changed in gcc:
importance:	Unknown → Medium
status:	Unknown → New

Revision history for this message

In GCC Bugzilla #60931, Anton Blanchard (anton-samba) wrote on 2014-04-25:

#25

Created attachment 32679
runtime: Fix garbage collector issue with non 4kB system page size

The go garbage collector tracks memory in terms of 4kB pages. Most of
the code checks getpagesize() at runtime and does the right thing.

On a 64kB ppc64 box I see SEGVs in long running processes which has
been diagnosed as a bug in scavengelist. scavengelist does a
madvise(MADV_DONTNEED) without rounding the arguments to the system
page size. A strace of one of the failures shows the problem:

madvise(0xc211030000, 4096, MADV_DONTNEED) = 0

The kernel rounds the length up to 64kB and we mark 60kB of valid data
as no longer needed.

Round start up to a system page and end down before calling madvise.

Revision history for this message

In GCC Bugzilla #60931, I-ian-1 (i-ian-1) wrote on 2014-04-25:

#26

Author: ian
Date: Fri Apr 25 04:28:48 2014
New Revision: 209776

URL: http://gcc.gnu.org/viewcvs?rev=209776&root=gcc&view=rev
Log:
PR go/60931

runtime: Fix garbage collector issue with non 4kB system page size

The go garbage collector tracks memory in terms of 4kB pages.
Most of the code checks getpagesize() at runtime and does the
right thing.

On a 64kB ppc64 box I see SEGVs in long running processes
which has been diagnosed as a bug in scavengelist.
scavengelist does a madvise(MADV_DONTNEED) without rounding
the arguments to the system page size. A strace of one of the
failures shows the problem:

madvise(0xc211030000, 4096, MADV_DONTNEED) = 0

The kernel rounds the length up to 64kB and we mark 60kB of
valid data as no longer needed.

Round start up to a system page and end down before calling
madvise.

Modified:
branches/gcc-4_9-branch/libgo/runtime/mheap.c

Revision history for this message

In GCC Bugzilla #60931, I-ian-1 (i-ian-1) wrote on 2014-04-25:

#27

Author: ian
Date: Fri Apr 25 04:29:07 2014
New Revision: 209777

URL: http://gcc.gnu.org/viewcvs?rev=209777&root=gcc&view=rev
Log:
PR go/60931

runtime: Fix garbage collector issue with non 4kB system page size

The go garbage collector tracks memory in terms of 4kB pages.
Most of the code checks getpagesize() at runtime and does the
right thing.

On a 64kB ppc64 box I see SEGVs in long running processes
which has been diagnosed as a bug in scavengelist.
scavengelist does a madvise(MADV_DONTNEED) without rounding
the arguments to the system page size. A strace of one of the
failures shows the problem:

madvise(0xc211030000, 4096, MADV_DONTNEED) = 0

The kernel rounds the length up to 64kB and we mark 60kB of
valid data as no longer needed.

Round start up to a system page and end down before calling
madvise.

Modified:
trunk/libgo/runtime/mheap.c

Revision history for this message

In GCC Bugzilla #60931, Ian Lance Taylor (ianlancetaylor) wrote on 2014-04-25:

#28

Thanks for the patch. I committed a version of it to mainline and 4.9 branch.

Anton Blanchard (anton-samba) on 2014-04-25

summary:

- gccgo on ppc64el using split stacks when not supported
+ gccgo has issues when page size is not 4kB

Revision history for this message

Anton Blanchard (anton-samba) wrote on 2014-04-25:

#23

A fix has made it into mainline and the 4.9 branch:

http://gcc.gnu.org/viewcvs/gcc?view=revision&revision=209776

Matthias Klose (doko) on 2014-04-25

Changed in gcc-4.9 (Ubuntu Trusty):
status:	New → Invalid
Changed in gccgo-4.9 (Ubuntu Trusty):
importance:	Undecided → Medium
milestone:	none → trusty-updates
status:	New → Confirmed
Changed in gccgo-4.9 (Ubuntu Utopic):
status:	Confirmed → Invalid
Changed in gcc-4.9 (Ubuntu Utopic):
importance:	Undecided → Medium
status:	New → Confirmed

Revision history for this message

Matthias Klose (doko) wrote on 2014-04-25:

#24

the fix is now in gccgo-4.9 in the doko/toolchain PPA (for trusty).

Bug Watch Updater (bug-watch-updater) on 2014-04-26

Changed in gcc:
status:	New → Fix Released

Revision history for this message

Matthias Klose (doko) wrote on 2014-04-28:

#29

now fixed in gcc-4.9 in utopic

Changed in gcc-4.9 (Ubuntu Utopic):
status:	Confirmed → Fix Released

Revision history for this message

Matthias Klose (doko) wrote on 2014-05-03:

#30

updated package in the ubuntu-toolchain-r/ppa PPA, removed the one in the doko/toolchain PPA.

Changed in gccgo-4.9 (Ubuntu Trusty):
assignee:	nobody → Matthias Klose (doko)

Revision history for this message

Dave Cheney (dave-cheney) wrote on 2014-05-13: Re: [Bug 1304754] Re: gccgo has issues when page size is not 4kB

#31

I've been testing this, it's looking good so far. I'd like to run one
more test overnight before giving a thumbs up/down.

On Sun, May 4, 2014 at 2:29 PM, Launchpad Bug Tracker
<email address hidden> wrote:
> ** Branch linked: lp:debian/gcc-4.9

Revision history for this message

Dave Cheney (dave-cheney) wrote on 2014-05-14:

#32

Matthias, I've tested this fix with Juju and it looks like it has fixed the problem with 64k kernels.

I've moved this to fix committed, I hope this is the correct status.

Changed in gccgo-4.9 (Ubuntu Trusty):
status:	Confirmed → Fix Committed

Revision history for this message

Dave Cheney (dave-cheney) wrote on 2014-07-10:

#34

Hi Matt,

Can you please post the output of dmesg? That is the canonical way to
diagnose this issue at the moment.

On Thu, Jul 10, 2014 at 5:23 AM, Matt Bruzek
<email address hidden> wrote:
> I installed the debian packages from the CI server http://juju-ci.vapour.ws:8080/job/publish-revision/588/
> My understanding is these deb packages were built with the PPA toolchain that has the fix installed.
>
> I destroyed the environment, rebooted the machine (for good measure) and
> find that I still get an error.
>
> https://pastebin.canonical.com/113210/
>
> Are we sure this fixes the problem?

Revision history for this message

Matt Bruzek (mbruzek) wrote on 2014-07-10:

#35

The output from dmesg on stilson-01 after installing the deb packages. (25.5 KiB, text/plain)

Revision history for this message

Steve Langasek (vorlon) wrote on 2014-07-11:

#36

Moving this back to "in progress". Matthias, this seems to be a major issue for go on ppc64el in 14.04. Are you planning to SRU this fix in? Do you know the timeline when this might happen?

Thanks!

Changed in gccgo-4.9 (Ubuntu Trusty):
importance:	Medium → High
status:	Fix Committed → In Progress

Revision history for this message

Matthias Klose (doko) wrote on 2014-07-11:

#37

I'd like to prepare that on Jul 17, based on 4.9.1, including all accumulated ABI fixes and warnings for ABI changes targeted for 4.10.

Revision history for this message

Patricia Gaughen (gaughen) wrote on 2014-07-11:

#38

Matthias - does that mean it will show up in proposed on Jul 17? I think there are folks who would hope it would land in Trusty sooner.

Revision history for this message

Antonio Rosales (arosales) wrote on 2014-07-14:

#39

@Matthias,

Thanks for the work on this. We are blocked running Juju on power8le. Will gcc be updated in the archives as of July 17, or go into proposed at that time?

-thanks,
Antonio

Revision history for this message

Matthias Klose (doko) wrote on 2014-07-17:

#40

the gccgo-4.9 package is now updated and built in this PPA (not yet on all architectures). Please confirm that juju-core built with this package works, then we can copy the package including the binaries into -proposed.

Revision history for this message

Matthias Klose (doko) wrote on 2014-07-17:

#41

for your convenience, juju-core is now built in this PPA using gccgo-4.9 4.9.1

Revision history for this message

Matt Bruzek (mbruzek) wrote on 2014-07-22:

#42

The dmesg output from stilson-06 (CI) machine. (122.6 KiB, text/plain)

Revision history for this message

Matt Bruzek (mbruzek) wrote on 2014-07-22:

#43

Is there any way to verify if a Debian package has been built with this gccgo fix?

Procedure:
I downloaded the 14.04 debian packages from:
http://juju-ci.vapour.ws:8080/job/publish-revision/690/
These packages were built on stilson-07 which has the PPA and the fixed version of gccgo.

Then I ran apt-get upgrade/update
sudo dpkg -i *1.20.2*

juju bootstrap -v -e local --debug
juju deploy local:trusty/ubuntu
juju ssh ubuntu/0

Segmentation fault on stilson-01:
http://pastebin.ubuntu.com/7832678/

According to line 372 of the dmesg output:
http://paste.ubuntu.com/7832700/
This appears to be the same problem, the deb packages were not built with the fix.

I am also able to reproduce this Segmentation fault on the CI machine stilson-06.
http://pastebin.ubuntu.com/7836776/

The CI build machine has the toolchain-r PPA and the fixed level of gccgo:
http://pastebin.ubuntu.com/7836812/

The dmesg output is attached.

Revision history for this message

Matthias Klose (doko) wrote on 2014-07-22:

#44

On 22.07.2014 18:05, Matt Bruzek wrote:
> Is there any way to verify if a Debian package has been built with this
> gccgo fix?

sure, make sure to build-depend on gccgo-4.9 (>= 4.9.1).

> Procedure:
> I downloaded the 14.04 debian packages from:
> http://juju-ci.vapour.ws:8080/job/publish-revision/690/
> These packages were built on stilson-07 which has the PPA and the fixed version of gccgo.

I can't see this in the build logs.

> Then I ran apt-get upgrade/update
> sudo dpkg -i *1.20.2*
>
> juju bootstrap -v -e local --debug
> juju deploy local:trusty/ubuntu
> juju ssh ubuntu/0
>
> Segmentation fault on stilson-01:
> http://pastebin.ubuntu.com/7832678/
>
> According to line 372 the dmesg output:
> http://paste.ubuntu.com/7832700/
> This appears to be the same problem, the deb packages were not built with the fix.
>
> I am also able to reproduce this Segmentation fault on the CI machine stilson-06.
> http://pastebin.ubuntu.com/7836776/
>
> The CI build machine has the toolchain-r PPA and the fixed level of gccgo:
> http://pastebin.ubuntu.com/7836812/
>
> The dmesg output is attached.
>

according to the build log in
https://launchpad.net/~ubuntu-toolchain-r/+archive/ubuntu/ppa/+build/6192686

the package is built using gccgo-4.9 4.9.1-0ubuntu1.

Revision history for this message

Matt Bruzek (mbruzek) wrote on 2014-07-22:

#45

I tried the package that Matthias left on the link in the last comment (1.18.1).  I received a similar but different error:

goroutine 3 [syscall]:
goroutine in C code; stack unavailable

goroutine 10 [IO wait]:
code.google.com_p_go.net_websocket.ReadByte.N57_code.google.com_p_go.net_websocket.hybiFrameReaderFactory
    /build/buildd/juju-core-1.18.1/src/code.google.com/p/go.net/websocket/hybi.go:113
code.google.com_p_go.net_websocket.NewFrameReader.N57_code.google.com_p_go.net_websocket.hybiFrameReaderFactory
    /build/buildd/juju-core-1.18.1/src/code.google.com/p/go.net/websocket/hybi.go:126
code.google.com_p_go.net_websocket.Receive.N40_code.google.com_p_go.net_websocket.Codec
    /build/buildd/juju-core-1.18.1/src/code.google.com/p/go.net/websocket/websocket.go:314
launchpad.net_juju_core_rpc_jsoncodec.Receive.N48_launchpad.net_juju_core_rpc_jsoncodec.wsJSONConn
    /build/buildd/juju-core-1.18.1/src/launchpad.net/juju-core/rpc/jsoncodec/conn.go:25
launchpad.net_juju_core_rpc_jsoncodec.ReadHeader.pN43_launchpad.net_juju_core_rpc_jsoncodec.Codec
    /build/buildd/juju-core-1.18.1/src/launchpad.net/juju-core/rpc/jsoncodec/codec.go:113
launchpad.net_juju_core_rpc.loop.pN32_launchpad.net_juju_core_rpc.Conn
    /build/buildd/juju-core-1.18.1/src/launchpad.net/juju-core/rpc/server.go:344
launchpad.net_juju_core_rpc.input.pN32_launchpad.net_juju_core_rpc.Conn
    /build/buildd/juju-core-1.18.1/src/launchpad.net/juju-core/rpc/server.go:317
created by launchpad.net_juju_core_rpc.Start.pN32_launchpad.net_juju_core_rpc.Conn
    /build/buildd/juju-core-1.18.1/src/launchpad.net/juju-core/rpc/server.go:200
ubuntu@stilson-01:~$

ubuntu@stilson-01:~$ uname -a
Linux stilson-01 3.13.0-32-generic #57-Ubuntu SMP Tue Jul 15 03:50:31 UTC 2014 ppc64le ppc64le ppc64le GNU/Linux
ubuntu@stilson-01:~$ dpkg -l | grep juju
ii  juju                                   1.18.1-0ubuntu1.1                 all          next generation service orchestration system
ii  juju-core                              1.18.1-0ubuntu1.1                 ppc64el      Juju is devops distilled - client
ii  juju-deployer                          0.3.6-0ubuntu2                    all          Deploy complex stacks of services using Juju
ii  juju-jitsu                             0.20-1                            all          external tools to enhance juju
ii  juju-local                             1.18.1-0ubuntu1.1                 all          dependency package for the Juju local provider
ii  juju-mongodb                           2.4.9-0ubuntu3                    ppc64el      MongoDB object/document-oriented database for Juju
ii  juju-quickstart                        1.4.1+bzr88+ppa25~ubuntu14.04.1   all          Easy configuration of Juju environments
ii  python-jujuclient                      0.17.5-0ubuntu2                   all          Python API client for juju
ubuntu@stilson-01:~$ getconf PAGE_SIZE
65536

The dmesg output looks different; I will attach that as well.

Revision history for this message

Matt Bruzek (mbruzek) wrote on 2014-07-22:

#46

the dmesg output from stilson-01 with version 1.18.1 installed. (16.6 KiB, text/plain)

Revision history for this message

Dave Cheney (dave-cheney) wrote on 2014-07-22:

#47

This new failure looks different (based on the dmesg output), please
open a new issue.

On Wed, Jul 23, 2014 at 8:36 AM, Matt Bruzek
<email address hidden> wrote:
> ** Attachment added: "the dmesg output from stilson-01 with version 1.18.1 installed."
> https://bugs.launchpad.net/ubuntu/+source/gccgo-4.9/+bug/1304754/+attachment/4160288/+files/dmesg_1.18_output.txt

Revision history for this message

Antonio Rosales (arosales) wrote on 2014-07-23:

#48

For reference the new bug Matt opened is:
https://bugs.launchpad.net/ubuntu/+source/juju-core/+bug/1347322

-thanks,
Antonio

Revision history for this message

Matthias Klose (doko) wrote on 2014-07-24:

#49

I copied the gccgo-4.9 package from the ubuntu-toolchain-r/ppa package to trusty-proposed (now waiting for approval). The following testing was done:

- libgcc1: the package contains the shared library, exports the same symbols,
no code changes were done for libgcc1 itself.

- the testsuite doesn't show regressions on any architectures (although we should
only be interested in regressions in gccgo and libgo; the package in trusty
didn't ship a cc1, so we don't have any regressions for the C compiler).

- The only package build-depending on gccgo in trusty (juju-core) was successfully rebuilt.

Revision history for this message

Matt Bruzek (mbruzek) wrote on 2014-07-25:

#50

I believe the gccgo 4.9.1 compiler fix is a good thing. After rebuilding Juju with gccgo-4.9_4.9.1-1ubuntu3_ppc64el.deb I was unable to reproduce the "juju ssh" problem I had seen previously. I also did some load testing on Juju with multiple bootstrap, deploy, and destroy cycles without any Juju-related problems.

See https://bugs.launchpad.net/ubuntu/+source/juju-core/+bug/1347322 Comment #7 for the details of how we built with the gccgo 4.9.1 compiler.

Thanks.

Revision history for this message

Steve Langasek (vorlon) wrote on 2014-07-25:

#51

On Fri, Jul 25, 2014 at 10:35:41PM -0000, Matt Bruzek wrote:
> I believe the gccgo 4.9.1 compiler fix is a good thing. After
> rebuilding Juju with gccgo-4.9_4.9.1-1ubuntu3_ppc64el.deb I was unable
> to get the "juju ssh" problem I had seen previously.

For purposes of this SRU bug, please verify the juju-core
1.18.4+dfsg-0ubuntu0.14.04.1 package in trusty-proposed, not a locally-built
juju-core package built using the compiler from utopic.

Revision history for this message

Matthias Klose (doko) wrote on 2014-09-09:

#52

new version 4.9.1-13ubuntu1 copied to trusty-proposed, requested by mwhudson. waiting for review.

Revision history for this message

Tim Penhey (thumper) wrote on 2014-09-09:

#53

I can verify that with the above package from trusty-proposed, I was able to bootstrap a local juju environment on rockne-02.

tags:

added: verification-done

Revision history for this message

Tim Penhey (thumper) wrote on 2014-09-09:

#54

However rockne-02 has a 4k page :-(

tags:

removed: verification-done

Revision history for this message

Tim Penhey (thumper) wrote on 2014-09-09:

#55

Gah, must have looked at the page size of my laptop instead:

ubuntu@rockne-02:~$ juju status
environment: local
machines:
  "0":
    agent-state: down
    agent-state-info: (started)
    agent-version: 1.18.4.1
    dns-name: localhost
    instance-id: localhost
    series: trusty
services: {}
ubuntu@rockne-02:~$ uname -a
Linux rockne-02 3.13.0-18-generic #38-Ubuntu SMP Mon Mar 17 21:41:16 UTC 2014 ppc64le ppc64le ppc64le GNU/Linux
ubuntu@rockne-02:~$ getconf PAGESIZE
65536

tags:

added: verification-done

Revision history for this message

Launchpad Janitor (janitor) wrote on 2014-09-12:

#56

This bug was fixed in the package gccgo-4.9 - 4.9.1-0ubuntu1

---------------
gccgo-4.9 (4.9.1-0ubuntu1) trusty-proposed; urgency=medium

  * Upload the final GCC 4.9.1 release.
  * Merge changes from gcc-4.9 4.9.0-2ubuntu1, including:
    - Fix PR go/60931, garbage collector issue with non 4kB system page size.
      LP: #1304754.
    - Fix wrong-code issue in the little endian vector API (ppc64el).
      LP: #1311128.
    - Fix ABI incompatibility between POWER and Z HTM builtins and intrinsics.
      LP: #1320292.
    - Fix an ICE with invalid code. PR c++/61046. LP: #1313102.
    - gccgo: Don't overwrite memory if an archive has a bad file name.
  * Include the cc1 binary into the gccgo-4.9 package.
  * Do not build-depend on sdt-systemtap for the trusty upload.
  * Warn about ppc ELFv2 ABI issues, which will change in GCC 4.10.
-- Matthias Klose <email address hidden> Thu, 17 Jul 2014 15:51:15 +0200

Changed in gccgo-4.9 (Ubuntu Trusty):
status:	In Progress → Fix Released

Steve Langasek (vorlon) wrote on 2014-09-12: Update Released

#57

The verification of the Stable Release Update for gccgo-4.9 has completed successfully and the package has now been released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. If you encounter a regression using the package from -updates, please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

Steve Langasek (vorlon) wrote on 2015-11-24: Please test proposed package

#58

Hello Dave, or anyone else affected,

Accepted gccgo-4.9 into trusty-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/gccgo-4.9/4.9.3-0ubuntu4 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us in getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-needed to verification-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

tags:	removed: verification-done
tags:	added: verification-needed

Steve Langasek (vorlon) wrote on 2016-01-06:

#59

This change was already included in the previous SRU of gccgo-4.9, 4.9.1-0ubuntu1; no re-verification is required.

tags:	added: verification-done
tags:	removed: verification-needed

Affects	Status	Importance	Assigned to	Milestone
gcc	Fix Released	Medium	gcc-bugzilla #60931	
gcc-4.9 (Ubuntu)	Fix Released	Medium	Unassigned	
gcc-4.9 (Ubuntu Trusty)	Invalid	Undecided	Unassigned	
gcc-4.9 (Ubuntu Utopic)	Fix Released	Medium	Unassigned	
gccgo-4.9 (Ubuntu)	Invalid	Medium	Unassigned	
gccgo-4.9 (Ubuntu Trusty)	Fix Released	High	Matthias Klose	Ubuntu trusty-updates
gccgo-4.9 (Ubuntu Utopic)	Invalid	Medium	Unassigned	
