NS_InvokeByIndex in xptcinvoke_arm.cpp is not Thumb-2 safe for Lucid

Bug #488354 reported by Dave Martin on 2009-11-25
10
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Mozilla Firefox
Fix Released
Medium
xulrunner-1.9.1 (Ubuntu)
High
Alexander Sack
Lucid
High
Alexander Sack

Bug Description

Binary package hint: xulrunner-1.9.1

Two issues:

1) Either this file should be built individually as ARM code using -marm, or the code needs to be reviewed and modified to be Thumb-2 safe.

Specifically, calling a function using:
  mov lr, pc
  mov pc, ip

...will not interwork properly between ARM and Thumb state, depending on the function called.

Instead, something like the following is needed:
#ifdef __thumb__
  blx ip
0:
#else
  mov lr, pc
  mov pc, ip
#endif

I'll try and propose a suitable patch.

I'm currently scanning to see whether there is other problematic code in this package.

Patch posted below.

2) The kernel user helpers are not currently used for implementing atomic operations. (This is the same backend used by the GCC __sync_*() atomic intrinsics.)

There is a configure option --with-arm-kuser to enable this, but it's not currently bassed by the build rules. It's been supported since ff 3.5 / linux 2.6.17 (if I remember correctly)

summary: - NS_InvokeByIndex in xptcinvoke_arm.cpp is not Thumb-2 safe
+ NS_InvokeByIndex in xptcinvoke_arm.cpp is not Thumb-2 safe for Lucid
Dave Martin (dave-martin-arm) wrote :

Looks like a patch was already done for the ARM Linux Internet Platform (see http://linux.onarm.com/bugzilla/show_bug?id=36)

patch: http://linux.onarm.com/bugzilla/attachment.cgi?id=5

I'm currently trying to do a build with this patch.
Generally, I'm also assuming that https://bugs.launchpad.net/ubuntu/+bug/488302 will get fixed.

Dave Martin (dave-martin-arm) wrote :

Note: the above patch is not upstream yet, but should get upstreamed in the meantime.

It would be good to pull this into Ubuntu now though, if possible.

description: updated
Dave Martin (dave-martin-arm) wrote :

Got a build and tested it against lucid firefox-3.5.

I didn't appear to get any problems (I'm reporting this bug using the Thumb-2 xulrunner library).
Attached is a gdb session transcript, where I set a breakpoint on the NS_InvokeByIndex implementation: the code is definitely Thumb-2, as can be observed by the instruction offsets not divisible by 4, and the mixture of instruction sizes (among other things).

I step some calls and returns through NS_InvokeByIndex; it seems to be doing the right thing.

Changed in xulrunner-1.9.1 (Ubuntu):
assignee: nobody → Alexander Sack (asac)

Created an attachment (id=415476)
patch - v1

Upstreaming a patch from Kevin Welton <email address hidden> to fix NS_InvokeByIndex() to use BLX for method invocations when compiling for Thumb. This comes from http://linux.onarm.com/bugzilla/show_bug.cgi?id=36.

looks like that's patching a patch that was already backed out. http://hg.mozilla.org/mozilla-central/rev/daa5346f19d6

We can use the #ifdef _thumb_ for future work though.

(In reply to comment #1)
> looks like that's patching a patch that was already backed out.
> http://hg.mozilla.org/mozilla-central/rev/daa5346f19d6

That's the backout commit...

This patch applies fine to trunk with a few minor offsets.

There's no reason to not use BLX all the time, no? I thought that's what the updated xptcinvoke patch did.

description: updated

(From update of attachment 415476)
for lazy people, that's bug 530087

>+ "blx ip \n\t" /* call it... */
> "mov lr, pc \n\t" /* call it... */

comments should match the conventions of our file, that means the exact same whitespace, no compression.

>+ "blx ip \n\t" /* copy args to the stack like the
>+ * compiler would */

i think:
        /* .... */
        /* .... */

>+#ifdef __thumb__
>+ "blx ip \n\t" /* call method */
>+ "mov lr, pc \n\t" /* call method */

Changed in firefox:
status: Unknown → Confirmed
Alexander Sack (asac) on 2009-12-04
Changed in xulrunner-1.9.1 (Ubuntu):
status: New → Triaged
importance: Undecided → High
Changed in xulrunner-1.9.1 (Ubuntu Lucid):
milestone: none → lucid-alpha-2

You can (and should) use BLX all the time in this case, but it won't then work on ARMv4. The JIT doesn't work on ARMv4 anyway, and I don't think we care about such old machines.

Having said that, some distributions might care about ARMv4, and they might want support, even if it's slow. (Don't ask me why!) A solution here would be to use the old-style "MOV pc, ip" on ARMv4 (using a compile-time switch of some kind) and BLX everywhere else. We would also have to disable the JIT for ARMv4, but ARMv4 distributions must do that anyway because the JIT hasn't had ARMv4 support for a long time.

NOTE: Using "MOV pc, ip" will break the return stack prediction on newer cores, so even if you don't want to interwork, use BLX <reg>. The performance difference should be significant.

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package xulrunner-1.9.1 - 1.9.1.7+nobinonly-0ubuntu1

---------------
xulrunner-1.9.1 (1.9.1.7+nobinonly-0ubuntu1) lucid; urgency=low

  * New upstream release v1.9.1.7 (FIREFOX_3_5_7_RELEASE)
    - see USN-878-1

  [ Micah Gersten <email address hidden> ]
  * Drop patch after upstream landing of (bmo: 521780) aka
    extension upgrade with a moved location breaks extension manager
    - drop debian/patches/lp441552_bz521780_att407108.patch
    - update debian/patches/series
  * Add patch to make Thumb-2 Safe for Lucid (LP: #488354)
    - add debian/patches/bz532198_lp488354_ns_invokebyindex_not_thumb2_safe.patch
    - update debian/patches/series

  [ Dmitrijs Ledkovs <email address hidden> ]
  * Added dh_xulrunner from Debian (LP: #498973)
    - add debian/dh/dh_xulrunner.in
    - add debian/dh/xulrunner.pm
    - add debian/xulrunner-1.9.1-dev.manpages
    - update debian/rules
    - update debian/xulrunner-1.9.1-dev.install
 -- Micah Gersten <email address hidden> Tue, 05 Jan 2010 17:50:47 -0600

Changed in xulrunner-1.9.1 (Ubuntu Lucid):
status: Triaged → Fix Released

Created an attachment (id=428899)
Use BLX instead of "MOV pc" function calls.

The new patch uses BLX all the time, and thus imposes an ARMv5T baseline requirement for XPCOM. (We previously had this requirement for the JIT anyway, but that could be trivially disabled for crazy people who wanted to compile for ARMv4.)

I've (barely) tested this with the ARM build, but couldn't get GCC 4.3.3 to produce a Thumb-2 build. (The error looked like a compiler error in an unrelated module.) I could massage it into building just the bits we care about with -mthumb, but it takes a long time to build anyway, let alone with me hacking the build system.

In any case, I've delayed enough for such a trivial patch, so here it is. Whilst it's trivial, it wants pushing to the try server or something like that before it's actually pushed.

(In reply to comment #6)
> Having said that, some distributions might care about ARMv4, and they might
> want support, even if it's slow. (Don't ask me why!)

Here is why: openmoko is armv4. Debian supports openmoko.

(In reply to comment #8)
> (In reply to comment #6)
> Here is why: openmoko is armv4. Debian supports openmoko.

Hmm. The Neo FreeRunner uses an ARM920T, which is an ARMv4T device. However, it sits at 400MHz and has just 128MB of RAM. I admit that this is the closest ARMv4(T) device I've seen to being a realistic platform, but is it really worth going out of our way to support? We would have to include #ifdef blocks in our code to support ARMv4T, and these blocks will need testing.

Yeah, it's highly unlikely that we'll support any ARMv4 platforms going forward. (Even ARMv5 is borderline, but there's little cost in supporting it if we want to support ARMv6 anyway.)

Created an attachment (id=433119)
(Almost) C-only version

How about "simply" replacing the assembly with (almost) C code only ? The compiler will take care of using the proper opcode for the target architecture. It will also create more efficient code and also inline the invoke_* functions.

I get a 25% speed improvement on the TestXPTCInvoke speed test with an ARMv4T target. (note it ends up using bl on armv4t, and blx when using e.g. an armv5t target)

Original assembly:
 direct took 3.78 seconds
 invoke took 50.84 seconds
 So, invoke overhead was ~ 47.06 seconds (~ 93%)

With C, without inlining:
 direct took 3.62 seconds
 invoke took 38.90 seconds
 So, invoke overhead was ~ 35.28 seconds (~ 91%)

With inlined invoke functions:
 direct took 3.62 seconds
 invoke took 37.53 seconds
 So, invoke overhead was ~ 33.91 seconds (~ 90%)

Note I'm not entirely sure the attached patch is totally safe for any compiler, but it works properly when built with gcc 4.4. If this is not the case, I'm pretty sure this can be made safer.

Also note invoke_count_words could also be removed, to improve speed even more, as per bug 530087.

(In reply to comment #11)
> I get a 25% speed improvement on the TestXPTCInvoke speed test with an ARMv4T
> target. (note it ends up using bl on armv4t, and blx when using e.g. an armv5t
> target)

(Note this was compiled for ARMv4T, but run on an ARMv5T)

That's a neat solution. The only worry I have is the fiddling of the stack pointer within C code; that is likely to break even in the same compiler with unrelated code changes causing it to prod the stack immediately before calling "func".

Actually, I'd be interested in seeing the assembly for that, as I'd like to see what it's doing differently.

(In reply to comment #11)
> Note I'm not entirely sure the attached patch is totally safe for any compiler,
> but it works properly when built with gcc 4.4. If this is not the case, I'm
> pretty sure this can be made safer.

That doesn't matter; the original assembly was extremely compiler-dependent!

> Also note invoke_count_words could also be removed, to improve speed even more,
> as per bug 530087.

Not without sacrificing memory: invoke_copy_to_stack on WinCE fills the stack downwards and avoids calling invoke_count_words to work out how much space to reserve. On Linux, with EABI, 64-bit arguments must be 64-bit aligned, so it's impossible for invoke_copy_to_stack to know where to put each argument (with one pass) if it's filling them backwards; the alignment requirements on an argument early in the list can shift everything above it.

The best solution I have is to reserve arg_count*2 words on the stack and waste stack space if we have 32-bit arguments. That will allow us to stick to one pass, skipping invoke_copy_to_stack entirely and filling the arguments forwards with proper alignment, at the cost of some stack space.

(In reply to comment #12)
> (In reply to comment #11)
> > I get a 25% speed improvement on the TestXPTCInvoke speed test with an ARMv4T
> > target. (note it ends up using bl on armv4t, and blx when using e.g. an armv5t
> > target)
>
> (Note this was compiled for ARMv4T, but run on an ARMv5T)

"bl" and "blx" will exhibit equivalent performance; the only difference is when interworking is involved. Return stack prediction is used for both (but maybe not on an ARMv5T device).

There are two issues here. I think you know these anyway, but for the sake of documentation & clarity in the bug:
 * Correctness on a system with mixed ARM & Thumb code: "blx" _must_ be used here.
 * Performance:
   - ARMv4T interworking branches aren't return-stack-predicted on newer cores, and this costs a huge amount of performance. (I've seen figures ranging from 5% to 50%, but I don't know what would happen in this particular case.)
   - ARMv4 long-range branches involve doing "mov pc, <dst_reg>". That's what the old code used to do. ("bl" can't branch to an address in a register and is restricted to PC±32MB.)

----

I was planning to work on this (and 530087) further over the next few days because it's been in my "to do" list for ages, but you're welcome to take it if you want to.

(In reply to comment #13)
> That's a neat solution. The only worry I have is the fiddling of the stack
> pointer within C code; that is likely to break even in the same compiler with
> unrelated code changes causing it to prod the stack immediately before calling
> "func".

I think it should be safe, actually, because the compiler knows the stack pointer is being fiddled with by the assembly, and avoids using it. Without forcing the stack pointer with the assembly, the compiled code modifies sp to read the first 3 arguments. (iirc, once with add and once with ldmia)
While this would actually do the same thing, it wouldn't have the same guarantees.

> Actually, I'd be interested in seeing the assembly for that, as I'd like to see
> what it's doing differently.

Will attach.

> There are two issues here. I think you know these anyway, but for the sake of
> documentation & clarity in the bug:
> * Correctness on a system with mixed ARM & Thumb code: "blx" _must_ be used
> here.
> * Performance:
> - ARMv4T interworking branches aren't return-stack-predicted on newer cores,
> and this costs a huge amount of performance. (I've seen figures ranging from 5%
> to 50%, but I don't know what would happen in this particular case.)
> - ARMv4 long-range branches involve doing "mov pc, <dst_reg>". That's what
> the old code used to do. ("bl" can't branch to an address in a register and is
> restricted to PC±32MB.)

The undeniable advantage of the C code is that the compiler will just use the right construct depending on the target architecture. No need to #ifdef __thumb__ or whatever.

> I was planning to work on this (and 530087) further over the next few days
> because it's been in my "to do" list for ages, but you're welcome to take it if
> you want to.

Since I've gone this far already, I think I'll just take it ;)

(In reply to comment #14)
> The undeniable advantage of the C code is that the compiler will just use the
> right construct depending on the target architecture. No need to #ifdef
> __thumb__ or whatever.

Oh, certainly. C code is always preferable to assembly for that very reason, unless you need to do something that C code can't express efficiently (though that isn't the case here).

I'm still not entirely convinced about the SP-fiddling though. It seems unsafe to me. For instance, how do you know that the stack_space array is at the bottom of your frame? C doesn't guarantee anything there (even though you declare it last). Yes, it probably works, but it feels unsafe. That's why the original was in assembly, so it was clear what was happening.

In the case of writing to SP in the asm block: Yes, the compiler knows you've clobbered it, but C mandates that you have a stack and I'd be surprised if the compiler copes well with you doing that. If it isn't using a frame pointer, for example, it won't know how to wind back the stack pointer. Perhaps writing to SP in this way forces it to use a frame pointer, but none of that behaviour is documented (as far as I know) so it's likely to break between even minor compiler revisions.

I like the C implementation, but it does make me feel nervous :-)

Download full text (3.7 KiB)

(In reply to comment #15)
> (In reply to comment #14)
> > The undeniable advantage of the C code is that the compiler will just use the
> > right construct depending on the target architecture. No need to #ifdef
> > __thumb__ or whatever.
>
> Oh, certainly. C code is always preferable to assembly for that very reason,
> unless you need to do something that C code can't express efficiently (though
> that isn't the case here).
>
> I'm still not entirely convinced about the SP-fiddling though. It seems unsafe
> to me. For instance, how do you know that the stack_space array is at the
> bottom of your frame?

Because it is the only thing that is allocated on stack in the function. But we can probably use __builtin_alloca instead if you're uncomfortable with that, it generates the same code.

> C doesn't guarantee anything there (even though you
> declare it last). Yes, it probably works, but it feels unsafe. That's why the
> original was in assembly, so it was clear what was happening.
>
> In the case of writing to SP in the asm block: Yes, the compiler knows you've
> clobbered it, but C mandates that you have a stack and I'd be surprised if the
> compiler copes well with you doing that. If it isn't using a frame pointer, for
> example, it won't know how to wind back the stack pointer. Perhaps writing to
> SP in this way forces it to use a frame pointer, but none of that behaviour is
> documented (as far as I know) so it's likely to break between even minor
> compiler revisions.

I don't know if -fomit-frame-pointer is supposed to work on gcc arm, but i can't get gcc to stop using the frame pointer. The function is always marked with "frame_needed = 1", while other functions from the same source are marked as "frame_needed = 0", and as such don't use a frame pointer (even without -fomit-frame-pointer ; I guess it is included in -O2).

Anyways, I can think of several other approaches using a bit more assembly, but still leaving most of the work to the C code, I'll check what kind of code get generated with that.

Here is the assembly i get with the already attached C code (forcing -fno-inline), with an armv4t target:
        stmfd sp!, {r4, r5, r6, r7, r8, r9, fp, lr}
        mov r5, r0
        add fp, sp, #28
        mov r6, r1
        mov r0, r2
        mov r1, r3
        mov r7, r2
        mov r8, r3
        bl _ZL18invoke_count_wordsjP13nsXPTCVariant(PLT)
        mov r3, r0, asl #2
        add r3, r3, #18
        bic r3, r3, #7
        sub sp, sp, r3
        mov r1, r7
        mov r2, r8
        add r0, sp, #4
        bl _ZL20invoke_copy_to_stackPjjP13nsXPTCVariant(PLT)
        mov r4, sp
        add r3, sp, #16
#APP
@ 200 "xptcinvoke_arm.cpp" 1
        mov sp, r3
@ 0 "" 2
        mov r0, r5
        ldr r3, [r4, #12]
        ldr ip, [r5, #0]
        ldmib r4, {r1, r2} @ phole ldm
        ldr ip, [ip, r6, asl #2]
        mov lr, pc
        bx ip
        sub sp, fp, #28
        ldmfd sp!, {r4, r5, r6, r7, r8, r9, fp, lr}
        bx lr

Here it uses bl for the function calls and bx for the method invoke.

With a armv5t ta...

Read more...

> Because it is the only thing that is allocated on stack in the function. But we
> can probably use __builtin_alloca instead if you're uncomfortable with that, it
> generates the same code.

Not necessarily: For one thing, the "register" qualfier is just a hint, but
there may also be compiler temporaries that go on the stack.

Admittedly, for a function of this size and complexity, it's a fairly safe
assumption, but I think perhaps it needs a comment warning others from adding
code without considering the consequences, as it would be dreadfully easy to
break.

Is there any way we can ASSERT this? I don't think there's an easy way, but I
suppose in the assembly snippet we could assert that sp == stack_space before
assigning it:

#ifdef DEBUG
__asm__ __volatile__ (
    "cmp sp, %1"
    "moveq %0, #1"
    "movne %0, #0"
    "add sp, #(4*4)"
    : "=r" (stack_ok)
    : "r" (stack_space)
    : "sp"
);
#else
__asm__ __volatile__ (
    "add sp, #(4*4)" /* sp is at stack_space[0], so advance it to [4].
                             * (This saves &stack_space[4] from being allocated
                             * a register. */
    :
    :
    : "sp"
);
#endif

> I don't know if -fomit-frame-pointer is supposed to work on gcc arm, [...]

It sounds like touching the SP does force frame_needed to 1, so it's probably
safe.

> With a armv5t target, the end changes to the following:
> ldmib r4, {r1, r2} @ phole ldm
> mov lr, pc
> ldr pc, [ip, r6, asl #2]
> sub sp, fp, #28
> ldmfd sp!, {r4, r5, r6, r7, r8, r9, fp, pc}

Ouch. I don't know why the compiler is emiting "ldr pc, [...]". It'd be
preferable to load into a temporary (such as "ip") and then blx to that. Hmm.
What do you get if you specify "-mcpu=cortex-a8"? It's possible that the
compiler is assuming that it doesn't matter for ARMv5T devices (and it's
probably right).

(In any case, this is a compiler-related issue that will occur in all the
compiled code.)

> Without the assembly, this is how the first three 32-bits words are read from
> the stack:
> ldr r3, [sp, #12]
> ldr ip, [r5, #0]
> ldmib sp, {r1, r2} @ phole ldm
>
> In the end, sp would point on the third 32-bits word, instead of the fourth.

Nope, sp isn't updated there. It's a strange way to load the values as a single
ldm would suffice, but I suppose it would work and wouldn't affect performance
much.

----

I just have one final concern, and then I'll leave you alone :-) We need to
guarantee that the stack space is 64-bit aligned (for EABI compliance), but I'm
not sure how you do that. Indeed, you removed some alignment code from
invoke_count_words. Have you confirmed that this is actually correct? Any issues
would only show up with 64-bit arguments (and only then if they don't fall
across an alignment boundary).

I missed the fairly obvious call to "ASSERT(stack_ok)" in my DEBUG example :-)

(In reply to comment #17)
> I just have one final concern, and then I'll leave you alone :-) We need to
> guarantee that the stack space is 64-bit aligned (for EABI compliance), but I'm
> not sure how you do that. Indeed, you removed some alignment code from
> invoke_count_words. Have you confirmed that this is actually correct?

Stack allocation already guarantees the stack pointer is going to be aligned. That's why the allocation on stack is space + 1, and the pointer passed to invoke_copy_to_stack is &space_stack[1], so that we start filling the stack on an unaligned word, and that the fourth argument is correctly aligned.

Anyways, I am currently testing a rather different approach (combined with the removal of invoke_count_words, because it makes things easier) that doesn't do too much magic with the stack.

I should be done (including full xpcshell-tests run) by tonight or tomorrow.

Right-o. This will also resolve bug 530087 :-)

I have a incomplete patch for the moment but the results are already good:
(somehow the machine I'm using got faster since yesterday, so I reiterated all the benchmarks for comparison)

original code:
 direct took 2.20 seconds
 invoke took 29.81 seconds
 So, invoke overhead was ~ 27.61 seconds (~ 93%)

with attachment 433119:
 direct took 2.20 seconds
 invoke took 22.66 seconds
 So, invoke overhead was ~ 20.45 seconds (~ 90%)

with the WIP patch:
 direct took 2.22 seconds
 invoke took 16.94 seconds
 So, invoke overhead was ~ 14.72 seconds (~ 87%)

We're looking at almost 2x improvement, though the speed test is actually biased, as it only tries one type of calls, that doesn't even need to use the stack.

The current status of the WIP is that it is clean of any assembly and works properly, except that it lacks proper comments, and more importantly, breaks the non EABI case (which I unfortunately can't test, but I can at least try to make it theoretically work).

Some values for a modified speed test, using 10000000 iterations (10 times less) over AddMixedFloats instead of AddTwoInts:

original code:
 direct took 10.72 seconds
 invoke took 17.41 seconds
 So, invoke overhead was ~ 6.69 seconds (~ 38%)

with attachment 433119:
 direct took 10.79 seconds
 invoke took 16.69 seconds
 So, invoke overhead was ~ 5.90 seconds (~ 35%)

with the WIP patch:
 direct took 10.78 seconds
 invoke took 14.58 seconds
 So, invoke overhead was ~ 3.80 seconds (~ 26%)

I wouldn't worry about supporting non-EABI in Linux. It might be nice to have a conditional #error in case someone tries it. GCC still supports non-EABI for now, at least, though I don't know of anyone using it.

A 2* improvement is fantastic! I'm guessing that's because you only make one pass of the argument list; I saw similar effects on the WinCE version with that modification (which was really simple for WinCE's calling convention).

Created an attachment (id=433581)
C-only version

The patch is unfortunately not easily readable. It's probably better to apply it to see the resulting file. Please tell me if the comments are not clear enough.

It is now free of assembly, and I think it is now perfectly safe.

FYI, the main reason I went with the copy_double_word function is that somehow, gcc would generate 2 almost identical assembly snippets for the double word copies when I just inlined the code like it is in the original code: 1 for nsXPTType::T_I64 and nsXPTType::T_U64 and another one for nsXPTType::T_DOUBLE.
And when I say almost identical, it's almost pathetic that it can't factor:
        add r0, r0, #7 add r0, r0, #7
        bic r7, r0, #7 bic r7, r0, #7
        cmp r7, r5 cmp r5, r7
        moveq r7, lr moveq r7, lr
        sub r9, r3, #16 sub r9, r3, #16
        ldmia r9, {r8-r9} ldmia r9, {r8-r9}
        add r0, r7, #4 add r0, r7, #4
        stmia r7, {r8-r9} stmia r7, {r8-r9}
        b .L5 b .L5

As a nice side effect, it made it cleaner to keep non-eabi support, so I just implemented it.

During the development of this patch, I hit some other pathetic cases of stupid gcc optimizations that I'll blog about.

Mike - I will apply the patch and test on an N900 and N810. Are there tests apps you have been using to get your measurements?

I was going to test using some talos suites too.

When building for N900:

c++ -o xptcinvoke_arm.o -c -fvisibility=hidden -DMOZILLA_INTERNAL_API -DOSTYPE=\"Linux2.6\" -DOSARCH=Linux -DEXPORT_XPTC_API -D_IMPL_NS_COM -I/home/mfinkle/mozilla-192/mozilla/xpcom/reflect/xptcall/src/md/unix/../.. -I/home/mfinkle/mozilla-192/mozilla/xpcom/reflect/xptcall/src/md/unix/../../../../xptinfo/src -I/home/mfinkle/mozilla-192/mozilla/xpcom/reflect/xptcall/src/md/unix -I. -I../../../../../../dist/include -I../../../../../../dist/include/nsprpub -I/home/mfinkle/mozilla-192/mobile/xulrunner/dist/include/nspr -I/home/mfinkle/mozilla-192/mobile/xulrunner/dist/include/nss -fPIC -fno-rtti -fno-exceptions -Wall -Wpointer-arith -Woverloaded-virtual -Wsynth -Wno-ctor-dtor-privacy -Wno-non-virtual-dtor -Wcast-align -Wno-invalid-offsetof -Wno-long-long -pedantic -fno-strict-aliasing -pthread -pipe -DNDEBUG -DTRIMMED -g -fno-inline -Os -freorder-blocks -fno-reorder-functions -finline-limit=50 -O2 -DMOZILLA_CLIENT -include ../../../../../../mozilla-config.h -Wp,-MD,.deps/xptcinvoke_arm.pp /home/mfinkle/mozilla-192/mozilla/xpcom/reflect/xptcall/src/md/unix/xptcinvoke_arm.cpp
/home/mfinkle/mozilla-192/mozilla/xpcom/reflect/xptcall/src/md/unix/xptcinvoke_arm.cpp: In function 'PRUint32* copy_double_word(PRUint32*, PRUint32*, PRUint32*, PRUint64*)':
/home/mfinkle/mozilla-192/mozilla/xpcom/reflect/xptcall/src/md/unix/xptcinvoke_arm.cpp:75: warning: cast from 'PRUint32*' to 'PRUint64*' increases required alignment of target type
/home/mfinkle/mozilla-192/mozilla/xpcom/reflect/xptcall/src/md/unix/xptcinvoke_arm.cpp: In function 'nsresult NS_InvokeByIndex_P(nsISupports*, PRUint32, PRUint32, nsXPTCVariant*)':
/home/mfinkle/mozilla-192/mozilla/xpcom/reflect/xptcall/src/md/unix/xptcinvoke_arm.cpp:158: error: ISO C++ forbids variable-size array 'stack_space'
make[8]: *** [xptcinvoke_arm.o] Error 1

Created an attachment (id=433641)
Additional patch

Mark, can you try squashing this patch ?

(In reply to comment #25)
> Mike - I will apply the patch and test on an N900 and N810. Are there tests
> apps you have been using to get your measurements?

The measurements in e.g. comment 11 come from TestXPTCInvoke in xpcom/reflect/xptcinvoke/tests (you need to uncomment the DoSpeedTest call)

(In reply to comment #27)
> Created an attachment (id=433641) [details]
> Additional patch
>
> Mark, can you try squashing this patch ?

With this patch applied on the first patch, I get a successful build.

However, when attempting to launch Firefox Mobile on my N900, I get a Segmentation fault.

(In reply to comment #29)
> However, when attempting to launch Firefox Mobile on my N900, I get a
> Segmentation fault.

Can you try building the TestXPTCInvoke program in xpcom/reflect/xptcinvoke/tests (just make -C xpcom/reflect/xptcinvoke/tests) and then run it with:
dist/bin/run-mozilla.sh dist/bin/TestXPTCInvoke ? How far does it go ?

Also, is the N900 toolchain building EABI binaries ?

(In reply to comment #30)

> Can you try building the TestXPTCInvoke program in
> xpcom/reflect/xptcinvoke/tests (just make -C xpcom/reflect/xptcinvoke/tests)
> and then run it with:
> dist/bin/run-mozilla.sh dist/bin/TestXPTCInvoke ? How far does it go ?

Completes the whole thing. Even with DoSpeedTests uncommented.

> Also, is the N900 toolchain building EABI binaries ?

I think so. The arch in the packaged filenames says "guneabi"

I have not been able to install gdb on my N900, but I have seen that the segfault is happening in invoke_copy_to_stack. I'll try to narrow it down.

(From update of attachment 433581)
>+ current = (PRUint32 *)(((PRUint32)current + 7) & 0xfffffff8);

'~7' is clearer to me than '0xfffffff8'. I suppose that's subjective,
though.

>+ *((PRUint64*) current) = *dw;

I don't think that's valid in the non-EABI case where 'current' isn't
8-byte aligned. It will probably work anyway, but I think C assumes that
64-bit values have 64-bit alignment; LDRD and STRD had this restriction
on older architectures. I think the non-EABI and EABI cases should be
entirely contained in the #ifdef to avoid these issues.

>+ /* Wrap when reaching the end of the stack buffer */
>+ if (d == end) d = stk;

It might be wise to add a debug assertion here to check that 'd' is
within 'stk' and '&stk[end]'. That will catch some logic errors that can
be hard to spot, and easy to introduce in future patches.

> * The routine will issue calls to count the number of words
> * required for argument passing and to copy the arguments to
> * the stack.

We don't do this any more. Well, not explicitly anyway. Can you clobber
this bit of the comment please?

----

I like the look of this patch (with the 'Additional patch' on top), and it appears correct so I've marked it 'r+'. However, the N900 crash obviously needs to diagnosed & fixed before this is pushed.

Thanks,
Jacob

(In reply to comment #34)
> '~7' is clearer to me than '0xfffffff8'. I suppose that's subjective,
> though.

I prefer, too, but I just left what already was there.

> >+ *((PRUint64*) current) = *dw;
>
> I don't think that's valid in the non-EABI case where 'current' isn't
> 8-byte aligned. It will probably work anyway, but I think C assumes that
> 64-bit values have 64-bit alignment; LDRD and STRD had this restriction
> on older architectures. I think the non-EABI and EABI cases should be
> entirely contained in the #ifdef to avoid these issues.

I don't think there is a 64-bit alignment requirement on architectures that matter. FWIW, in Debian we target armv4t, and upstream Mozilla codebase already targets at least armv5.

> >+ /* Wrap when reaching the end of the stack buffer */
> >+ if (d == end) d = stk;
>
> It might be wise to add a debug assertion here to check that 'd' is
> within 'stk' and '&stk[end]'. That will catch some logic errors that can
> be hard to spot, and easy to introduce in future patches.

Fair enough.

> I like the look of this patch (with the 'Additional patch' on top), and it
> appears correct so I've marked it 'r+'. However, the N900 crash obviously needs
> to diagnosed & fixed before this is pushed.

Totally agreed. I'd like some more feedback from Mark if possible.

Any update, Mark ?

We are looking into this. Brian Crowder was able to get a stack from a build:

#0 0xbe9ff4a8 in ?? ()
#1 0x43162a2c in invoke_copy_to_stack (stk=0x12, paramCount=3198154224, s=0x8) at /home/mfinkle/mozilla-192/mozilla/xpcom/reflect/xptcall/src/md/unix/xptcinvoke_arm.cpp:92
#2 0x42693488 in XPCWrappedNative::CallMethod (ccx=..., mode=<value optimized out>)
    at /home/mfinkle/mozilla-192/mozilla/js/src/xpconnect/src/xpcwrappednative.cpp:2722
#3 0x4269c2c0 in XPCWrappedNative::GetAttribute (ccx=...) at /home/mfinkle/mozilla-192/mozilla/js/src/xpconnect/src/xpcprivate.h:2524
#4 0x4269993c in XPC_WN_GetterSetter (cx=0x41c05800, obj=<value optimized out>, argc=0, argv=0x40455608, vp=0xbe9ff708)
    at /home/mfinkle/mozilla-192/mozilla/js/src/xpconnect/src/xpcwrappednativejsops.cpp:1784
#5 0x406d7c0c in js_Invoke (cx=0x41c05800, argc=0, vp=0x40455600, flags=<value optimized out>)

I have been testing on mozilla-192. I will start testing on mozilla-central as well and report.

The xptcinvoke code is the same on 1.9.2 and trunk. It would be interesting to have the values for d, stk and paramCount on your stacktrace, but that would either need to break on the invoke_copy_to_stack function entry, or run an unoptimized build.

I made a non-optimized build from mozilla-central. The build works fine on my N900.

I am now going to make a non-optimized build from mozilla-192 and try it.

Non optimized build from mozilla-192 segfaults on startup. Here is the stack:

Starting program: /home/opt/fennec/fennec

Program received signal SIGSEGV, Segmentation fault.
0xbe9efb70 in ?? ()
0xbe9efb70: andeq r0, r0, r0
(gdb) bt
#0 0xbe9efb70 in ?? ()
#1 0x43a98b30 in invoke_copy_to_stack (stk=0x421d2200, paramCount=1109287616, s=0x8)
    at /home/mfinkle/mozilla-192/mozilla/xpcom/reflect/xptcall/src/md/unix/xptcinvoke_arm.cpp:92
#2 0x426db3f8 in XPCWrappedNative::CallMethod (ccx=..., mode=XPCWrappedNative::CALL_GETTER)
    at /home/mfinkle/mozilla-192/mozilla/js/src/xpconnect/src/xpcwrappednative.cpp:2722
#3 0x426ebdf4 in XPCWrappedNative::GetAttribute (ccx=...)
    at /home/mfinkle/mozilla-192/mozilla/js/src/xpconnect/src/xpcprivate.h:2524
#4 0x426e5fa0 in XPC_WN_GetterSetter (cx=0x40446200, obj=0x421d2200, argc=0, argv=0x404b05a0, vp=0xbe9eff3c)
    at /home/mfinkle/mozilla-192/mozilla/js/src/xpconnect/src/xpcwrappednativejsops.cpp:1784
#5 0x4076948c in js_Invoke (cx=0x40446200, argc=0, vp=0x404b0598, flags=2)
    at /home/mfinkle/mozilla-192/mozilla/js/src/jsinterp.cpp:1360

I tried stepping through invoke_copy_to_stack:

(gdb) step 1
80 in /home/mfinkle/mozilla-192/mozilla/xpcom/reflect/xptcall/src/md/unix/xptcinvoke_arm.cpp
(gdb) step 1
88 in /home/mfinkle/mozilla-192/mozilla/xpcom/reflect/xptcall/src/md/unix/xptcinvoke_arm.cpp
(gdb) step 1
86 in /home/mfinkle/mozilla-192/mozilla/xpcom/reflect/xptcall/src/md/unix/xptcinvoke_arm.cpp
(gdb) step 1
87 in /home/mfinkle/mozilla-192/mozilla/xpcom/reflect/xptcall/src/md/unix/xptcinvoke_arm.cpp
(gdb) step 1
92 in /home/mfinkle/mozilla-192/mozilla/xpcom/reflect/xptcall/src/md/unix/xptcinvoke_arm.cpp
(gdb) step 1
nsXPTCVariant::IsPtrData (this=<value optimized out>) at ../../../../../../dist/include/xptcall.h:110
110 ../../../../../../dist/include/xptcall.h: No such file or directory.
 in ../../../../../../dist/include/xptcall.h
(gdb) step 1
invoke_copy_to_stack (stk=0xbe809940, paramCount=1, s=0xbe809b70)
    at /home/mfinkle/mozilla-192/mozilla/xpcom/reflect/xptcall/src/md/unix/xptcinvoke_arm.cpp:91
91 /home/mfinkle/mozilla-192/mozilla/xpcom/reflect/xptcall/src/md/unix/xptcinvoke_arm.cpp: No such file or directory.
 in /home/mfinkle/mozilla-192/mozilla/xpcom/reflect/xptcall/src/md/unix/xptcinvoke_arm.cpp
(gdb) step 1
92 in /home/mfinkle/mozilla-192/mozilla/xpcom/reflect/xptcall/src/md/unix/xptcinvoke_arm.cpp
(gdb) step 1
94 in /home/mfinkle/mozilla-192/mozilla/xpcom/reflect/xptcall/src/md/unix/xptcinvoke_arm.cpp
(gdb) step 1
92 in /home/mfinkle/mozilla-192/mozilla/xpcom/reflect/xptcall/src/md/unix/xptcinvoke_arm.cpp
(gdb) step 1
88 in /home/mfinkle/mozilla-192/mozilla/xpcom/reflect/xptcall/src/md/unix/xptcinvoke_arm.cpp
(gdb) step 1

Program received signal SIGSEGV, Segmentation fault.
0xbe809b70 in ?? ()
0xbe809b70: andeq r0, r0, r0

Looks like the if(s->IsPtrData()) test is true and we attempt to continue. I should have been checking locals after each step. More coming.

Mike - One thing I overlooked: This appears to be broken only on mozilla-192, but not on mozilla-central. So it should be OK to checkin the patches you have and we can continue to work on getting this to mozilla-192.

Jacob & Brian - Do you guys agree?

I would personally prefer to understand the 192 broken-ness before landing it anywhere.

(In reply to comment #44)
> I would personally prefer to understand the 192 broken-ness before landing it
> anywhere.

Especially considering that really, there should be no difference between both.

(In reply to comment #44)
> I would personally prefer to understand the 192 broken-ness before landing it
> anywhere.

I certainly agree! The brokenness is probably still there in mozilla-central, but just not exercised. It might be fine, but until we understand what's happening, I don't trust it.

mfinkle, is this crashing on any/the first xptcall invocation? Or just a specific one? Stepping through it an instruction at a time and printing registers would be helpful; a gdb command that does 'si ; x/i $pc ; info reg' would probably be helpful, instead of 'step 1' above.

(In reply to comment #47)
> mfinkle, is this crashing on any/the first xptcall invocation? Or just a
> specific one?

Happens on the first invocation

> Stepping through it an instruction at a time and printing
> registers would be helpful; a gdb command that does 'si ; x/i $pc ; info reg'
> would probably be helpful, instead of 'step 1' above.

I am collecting the data now. Will attach a dump of the session.

Created an attachment (id=438574)
gdb session from N900

Here is the gdb session leading up to the crash - using the "si ; x/i $pc ; info reg" commands

Created an attachment (id=438688)
Patch v3

D'oh, the problem was so obvious and so problematic that I wonder how it could work without crashes on my builds, or with Mark's m-c builds O_o
The problem was that the stack buffer was too small in the paramCount == 1 case, and invoke_copy_to_stack would copy outside the buffer boundaries (it would write *before* the buffer, where the saved pc is for function return, in non optimized builds at least). The added assertion would catch this if this ever happens again.
Mark, can you check this patch ? I'll also try it on my end.

Created an attachment (id=438697)
Patch v3.1

This version actually builds. I'm still waiting for the build to fully finish to actually test.

Created an attachment (id=438700)
Patch v3.2

With a brown paper bag fix, this one seems to work properly. More testing undergoing.

(From update of attachment 438700)
I verified this one works for me. I think it will work for Mark, too.

(From update of attachment 438700)
This patch works for me! Nice job Mike.

Running Fennec on my N900. It launches fine and I am browsing pages and running sunspider.

I tested Ts on my own Fennec build with and without the patch:

(averaged over 10 runs)

Ts without: 5912ms
Ts with: 5787ms

Nice 14% improvement

Problem in Google Spreadsheet formula. Correct percentage: 2%

Not as good, but still positive!

(From update of attachment 438700)
Looks good to me. Nice job!

(From update of attachment 438700)
Looks good to me! Note that I'm not an xpcom peer -- bsmedberg is though, so asking him just to confirm this patch. Code looks fine to me, and is much cleaner. Thanks!

(From update of attachment 438700)
I'll delegate to people who know!

*** Bug 530087 has been marked as a duplicate of this bug. ***

Changed in firefox:
status: Confirmed → Fix Released
Micah Gersten (micahg) on 2010-04-21
Changed in firefox:
milestone: none → 3.7
Dave Martin (dave-martin-arm) wrote :

Note that the upstream fix is a cleanup which replaces the assembler code with C(++) veneers. This is nice for the future, but the alternate fix already present in firefox 3.6 for lucid should be adequate for now.

On Wed, Apr 21, 2010 at 09:13:09AM -0000, Dave Martin wrote:
> Note that the upstream fix is a cleanup which replaces the assembler
> code with C(++) veneers. This is nice for the future, but the alternate
> fix already present in firefox 3.6 for lucid should be adequate for now.

Thanks for your evaluation. I agree that we shouldnt re-backport this.

 - Alexander

(From update of attachment 438700)
Code is baking on trunk and would be helpful on 192 for Fennec 1.1

(From update of attachment 438700)
dropping for fennec 1.1

Changed in firefox:
importance: Unknown → Medium

*** Bug 532194 has been marked as a duplicate of this bug. ***

To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.