ARM/Thumb interworking support missing from nanojit

Bug #490792 reported by Dave Martin on 2009-12-01
14
This bug affects 1 person
Affects Status Importance Assigned to Milestone
XULRunner
Fix Released
Medium
xulrunner-1.9.1 (Ubuntu)
Medium
Unassigned
Lucid
Medium
Alexander Sack

Bug Description

Binary package hint: xulrunner-1.9.1

The problem here is that the nanojit output tries to call functions in xulrunner using plain BL instructions, causing the CPU to interpret the called (Thumb-2) code as ARM... which doesn't work.

The symptoms are random SIGILLs and segfaults in firefox-3.5.

Apparently, the required support has been implemented in mozilla-central for the last few months, but unfortunately there is not a straightforward patch against the current Ubuntu source to fix this :(

Possible workarounds:
 1) Update to the firefox 3.6 / mozilla 1.9.2 branch on mozilla-central (http://hg.mozilla.org/releases/mozilla-1.9.2/ ... it's still in beta though, and there are unresolved issues about openjdk java plugin support --- see https://blueprints.launchpad.net/ubuntu/+spec/desktop-lucid-new-firefox-support-model)
 2) Build all of xulrunner with -marm (firefox may also need -marm --- it depends on whether the nanojit output might call absolutely anything or just support functions in xulrunner)

The most recent relevant commit on mozilla-central is http://hg.mozilla.org/mozilla-central/rev/f8ce8b4d6ce1
It looks like the relevant changes weren't on the mozilla 1.9.1 / firefox 3.5 branch.

$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu lucid (development branch)
Release: 10.04
Codename: lucid

$ apt-cache policy xulrunner-1.9.1 firefox-3.5
xulrunner-1.9.1:
  Installed: 1.9.1.5+nobinonly-0ubuntu2
  Candidate: 1.9.1.5+nobinonly-0ubuntu2
  Version table:
 *** 1.9.1.5+nobinonly-0ubuntu2 0
        500 http://ports.ubuntu.com/main Packages
        100 /var/lib/dpkg/status
firefox-3.5:
  Installed: 3.5.5+nobinonly-0ubuntu2
  Candidate: 3.5.5+nobinonly-0ubuntu2
  Version table:
 *** 3.5.5+nobinonly-0ubuntu2 0
        500 http://ports.ubuntu.com lucid/main Packages
        100 /var/lib/dpkg/status

In previous discussions with Adobe we decided that Thumb1 would not be a good target for a JIT. The code cache is supposed to contain code that is inherently hot, so trading off compactness for performance for hot code seems silly. Hence we removed Thumb1. We also discussed that Thumb2 most likely would actually be an improvement, so we want a Thumb2 target. I am not sure how to do the configuration. We seem to be drifting towards runtime configuration (vlad just added VFP detection). If the backend is small enough we might tolerate having a Thumb2 code generation mode in parallel to regular ARM. Any comments? vlad? ed?

I'm talking about the VM itself, not the JIT.

I think that a Thumb2 _JIT_ would certainly be beneficial, but I'm talking about compiling TM itself for Thumb2 (but generating ARM code, as it currently does).

the tamarin nanojit arm backend contains interworking support so the C code can be compiled for arm or thumb. but, NativeARM.cpp/h may have diverged too much in the two branches (tamarin / tracemonkey)

I agree, Thumb2 would be great. It should be a separate backend from NativeARM.cpp, however (clean slate).

Yep, I'm working from one of Andrew's patches for Tamarin. Whilst it won't apply cleanly to TM, I know how the changes work and it wouldn't take me long to implement equivalent changes; the build system modifications might take me longer. The question is whether or not it matters; if TM is only intended to be compiled as ARM (regardless of the JIT output), there is probably no need to generate interworking code.

---

A clean slate would be nice for a Thumb2 JIT, but I'd like to move that discussion out of this bug as it's confusing enough already :-) (I'm on #jsapi as "Jacob" if you want to discuss it.)

> * _If_ we ever expect to compile the VM itself for Thumb (or Thumb-2), we
> _need_ to address this issue so that branches from nanojit-generated ARM code
> can interwork properly.

I don't think we'll do this for spidermonkey.

> * If we don't ever expect to compile the VM for Thumb, this isn't an issue at
> all, provided that the generated code guarantees never to call anything outside
> of TM. For example, if generated code calls into the C library and the C
> library is compiled for Thumb-2, interworking calls will be required.

The second part is more interesting -- we'll be mostly calling mozilla code, but I can see a future world where we'd want to make more and more direct calls on trace. I don't think there's a big cost to generating iterworking branches, is there?

Though... I haven't thought this through all that much, but intra-trace branches would just be normal, and we already restore pc on return via ldmia which will handle the mode switch (on new enough architectures). I guess for ins_call we don't generate bx but just b? Is there any other change that would be needed?

There is a slight complexity cost if we want to support ARMv4 and ARMv4T cores.
 * ARMv4 (~ARM7) didn't have Thumb support, and doesn't understand the BX instruction.
 * ARMv4T (~ARM7TDMI) understands BX, but not BLX, so interworking has an associated cost on ARMv4T.
 * ARMv5 and onwards are much easier to deal with. They support BLX, which does everything in one instruction, but to honest I'm not sure if interworking impacts performance at all. I would guess that it doesn't, but I have no figure to back this up.

More details are available on a handy table on this page: http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0204i/Cihfddaf.html

--- ARMv4(T)

ARMv4 is pretty unusual, and I doubt that anyone would want to run TM on it. However, I have no figure to back this up so I can't say for sure. I think the best-known ARMv4 chip is the StrongARM. ARMv4T includes most ARM7 variants (such as the ARM7TDMI) and a few ARM9 variants. The GP2X, for example, has an ARMv4T ARM9 in it (according to Wikipedia).

--- Code

The code itself is easy enough to write, and I already have an old Tamarin patch to work from so it won't take me long to implement the interwork code itself. I won't be able to generate simple or automatable tests, though.

Intra-trace branches are no problem at all because the target is known to be ARM code.

--- Thumb2 for Trace Monkey Itself

I do think that compiling TM for Thumb-2 would be beneficial. Even if we don't do this now, it could be useful in the future. T2 gives you almost the performance of ARM with the code size of T1, so even though it's not supported on ARM1176, it could be worth doing.

In any case, if we want to support interworking branches anyway we will automatically support interworking into a potential future T2 TM.

Created an attachment (id=371456)
Generate interworking branches.

--- Patch Summary

* asm_call uses the new BLX method to generate interworking branches.
  - BLX() will generate interworking branches on ARMv4T, which doesn't support the actual BLX instruction.
  - BLX() is clever enough to generate simple (non-interworking) branches for ARMv4(T) when the target is not Thumb. This is always the case for plain ARMv4.

* The processor feature detection has been modified to allow detection of ARMv5 support (along with other features).
  - TODO: I can't test the WinCE code here. It looks Ok, but could someone try this out please?

* The patch works only if asm_call is the only function used to call out of generated code. That appears to be the case as it handles LIR_call, which is generated by insCall, but I'm not entirely familiar with the code so there could be some other way that calls can be generated.

* Existing branch mechanisms for branching within (and directly between) generated code fragments are unchanged, as the target will always be ARM code and interworking is not required.

* The exit branches were generated as "BX"; this won't work for ARMv4, so I've added a fix for this too.

* A number of asm_output() calls have been tidied up to be more informative of what's actually generated.

--- Bug Reproduction

Whilst I could hack the generated makefiles to get TM itself in Thumb, the generated code always seems to be far enough away from the native code that a simple BLX is never generated. The original code used "LDR PC, =addr" in this case, and on ARMv5+ this type of branch supports interworking so the behaviour is correct. (I don't have an ARMv4T platform to test on.) However, this is still a potential bug as we can't guarantee the memory locations of each code element.

To clarify:
 * The bug will manifest on ARMv5+ if the target address is within 32MB of the branch instruction _and_ the target is Thumb (or Thumb-2) code.
 * The bug will manifest on ARMv4T regardless, as LDR PC does not interwork on this architecture.
 * A slightly different bug will manifest on ARMv4, but that should be fixed by my patch.

--- Old Architectures

There's a lot of code in there to handle ARMv4(T). This does make me wonder if it's worth supporting these. What is the policy on this? Can we assume ARMv5+, or do we need the ARMv4 support code?

If we don't need to support ARMv4, I'll strip that stuff out of the patch to make it simpler and cleaner.

Note: I think Tamarin supported ARMv4T, but I don't know about ARMv4.

--- Testing

No additional failures arise from adding this patch on either trace-tests.js or js/tests.

(From update of attachment 371456)
Most of the changes in the patch actually relate to platform detection, so I'll split that out as a separate patch.

Created an attachment (id=371687)
Simplified patch.

Bug 487416 now covers the feature detection, so this new patch is cleaner than the previous one.

(From update of attachment 371687)
Looks great!

Created an attachment (id=371815)
A couple of minor corrections.

Oops... My previous patch had an erroneous underrunProtect and some broken comments. The new one fixes this.

It probably doesn't need another review but I've tagged it anyway to be on the safe side.

(From update of attachment 371815)
Bah, I even looked at that underrunProtect and convinced myself that it was right somehow :p

Created an attachment (id=376694)
Remove ARM4(T) support.

We have previously decided that we don't need to support ARMv4(T). My previous patch included a significant amount of code for handling ARMv4(T), so I would like to remove this. The attached patch will work on ARMv5+.

Created an attachment (id=376695)
Removes ARMv4(T) support. (Fixes a comment.)

vlad, could we get a review on the final patch here?

(From update of attachment 376695)
Oh, perfect -- I actually forgot about this patch. The previous BL impl for a long offset was horrible for perf; note that nothing uses BL() any more, so can you just nuke it?

I'm not sure if the BLX impl with a 24-bit offset is correct though; BLX with an immediate offset will always exchange instruction modes, as best I can tell. I think we need to check whether we're branching to Thumb or ARM code, and emit BLX or BL as appropriate, right? Also, for the case where we need to do a LD32, make the underrunProtect(4+LD32_size) instead of hardcoding 12.

Here's the code that I had:

    intptr_t offs = PC_OFFSET_FROM((addr&~3),_nIns-1);

    if (isS24(offs>>2)) {
        // we already did an underrunProtect(4) above

        if (AvmCore::config.thumb &&
            ((int32_t)addr) & 0x1 == 1)
        {
            // we need to branch to thumb, so emit BLX here
            int32_t h = (((int32_t) addr) & 0x2) >> 1;
            // BLX addr (via offs & H)
            *(--_nIns) = (NIns)( (0xFA | h)<<24 | ((offs>>2) & 0xFFFFFF) );
            asm_output("blx %p", addr);
        } else {
            // just a normal BL
            // BL addr (via offs)
            *(--_nIns) = (NIns)( COND_AL | (0xB<<24) | ((offs>>2) & 0xFFFFFF) );
            asm_output("bl %p", addr);
        }
    } else {
        if (AvmCore::config.thumb) {
            underrunProtect(4+LD32_size);

            // BLX IP
            *(--_nIns) = (NIns)( COND_AL | (0x12FFF3<<4) | IP );
            asm_output("blx ip");
        } else {
            underrunProtect(8+LD32_size);

            MOV(PC, IP);
            MOV(LR, PC);
        }

        LD32_nochk(IP, (int32_t)addr);
    }

Yep, BL can be removed, as in bug 487607.

---

Argh! Yes, you're right! Not sure how I managed that. I'll fix the patch.

---

Using 4+LD32_size is good; I didn't know about LD32_size when I wrote the patch.

Created an attachment (id=378068)
Fixes BLX for ARM targets.

Mostly the same but adjusts some comments and emits BL if the target is ARM.

(From update of attachment 378068)

>+ }
>+ } else {
>+ // BLX IP
>+ *(--_nIns) = (NIns)( (COND_AL) | (0x12<<20) | (0x3<<4) | (IP) );
>+ asm_output("blx IP (=%p)", addr);

I missed this earlier -- this needs to be 0x12<<20 | 0xFFF << 8 | 0x3<<4 (or just 0x12FFF3 << 4) -- the 12 bits there are SBO, not SBZ, right?

Also, nit -- can you make that IP lowercase in the output? (The other register names are all lowercase in the asm output)

Created an attachment (id=378301)
Fixes BLX encoding and uses lower case register names in debug output.

Yep, you're right again! I read "SBO" as "should be zero", even though it clearly doesn't mean that!

Interestingly, the code still worked before; it seems that current processors don't really care what those bits are set to, though this may break if they are used in the future.

(From update of attachment 378301)
Looks good, thanks for making the changes -- sorry I didn't catch the SBO issue last review time!

Vlad, will this work on WinCE? I hear that the WinCE emulator only supports ARMv4 with one or two ARMv5 instructions, but I don't know if that includes BLX.

If not, we'll have to wrap it in an #ifdef or something ugly like that, and emit BL for WinCE (since it doesn't use Thumb anyway).

The emulator's BLX implementation is actually broken or somesuch. However, for us at least, I don't think we care -- it's basically impossible to run Fennec/Firefox/whatever in Microsoft's emulator; it's just too slow.

Dave Martin (dave-martin-arm) wrote :

Binary package hint: xulrunner-1.9.1

The problem here is that the nanojit output tries to call functions in xulrunner using plain BL instructions, causing the CPU to interpret the called (Thumb-2) code as ARM... which doesn't work.

The symptoms are random SIGILLs and segfaults in firefox-3.5.

Apparently, the required support has been implemented in mozilla-central for the last few months, but unfortunately there is not a straightforward patch against the current Ubuntu source to fix this :(

Possible workarounds:
 1) Update to the firefox 3.6 / mozilla 1.9.2 branch on mozilla-central (http://hg.mozilla.org/releases/mozilla-1.9.2/ ... it's still in beta though, and there are unresolved issues about openjdk java plugin support --- see https://blueprints.launchpad.net/ubuntu/+spec/desktop-lucid-new-firefox-support-model)
 2) Build all of xulrunner with -marm (firefox may also need -marm --- it depends on whether the nanojit output might call absolutely anything or just support functions in xulrunner)

The most recent relevant commit on mozilla-central is http://hg.mozilla.org/mozilla-central/rev/f8ce8b4d6ce1
It looks like the relevant changes weren't on the mozilla 1.9.1 / firefox 3.5 branch.

$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu lucid (development branch)
Release: 10.04
Codename: lucid

$ apt-cache policy xulrunner-1.9.1 firefox-3.5
xulrunner-1.9.1:
  Installed: 1.9.1.5+nobinonly-0ubuntu2
  Candidate: 1.9.1.5+nobinonly-0ubuntu2
  Version table:
 *** 1.9.1.5+nobinonly-0ubuntu2 0
        500 http://ports.ubuntu.com/main Packages
        100 /var/lib/dpkg/status
firefox-3.5:
  Installed: 3.5.5+nobinonly-0ubuntu2
  Candidate: 3.5.5+nobinonly-0ubuntu2
  Version table:
 *** 3.5.5+nobinonly-0ubuntu2 0
        500 http://ports.ubuntu.com lucid/main Packages
        100 /var/lib/dpkg/status

Changed in xulrunner-1.9.1 (Ubuntu):
assignee: nobody → Alexander Sack (asac)
Paul Larson (pwlars) on 2009-12-01
Changed in xulrunner-1.9.1 (Ubuntu):
importance: Undecided → Medium
status: New → Triaged
Changed in xulrunner:
status: Unknown → Fix Released
Paul Larson (pwlars) on 2009-12-04
Changed in xulrunner-1.9.1 (Ubuntu Lucid):
milestone: none → lucid-alpha-2
tags: added: iso-testing
Paul Larson (pwlars) on 2010-01-15
Changed in xulrunner-1.9.1 (Ubuntu Lucid):
milestone: lucid-alpha-2 → lucid-alpha-3
Alexander Sack (asac) on 2010-02-12
Changed in xulrunner-1.9.1 (Ubuntu Lucid):
milestone: lucid-alpha-3 → none
status: Triaged → Fix Committed
Alexander Sack (asac) on 2010-04-09
Changed in xulrunner-1.9.1 (Ubuntu Lucid):
status: Fix Committed → Won't Fix
Loïc Minier (lool) on 2010-07-12
tags: added: thumb
Changed in xulrunner:
importance: Unknown → Medium
Alexander Sack (asac) on 2011-01-06
Changed in xulrunner-1.9.1 (Ubuntu):
assignee: Alexander Sack (asac) → Chris Coulson (chrisccoulson)
Steve Langasek (vorlon) on 2011-02-15
tags: added: arm-porting-queue
Martin Pitt (pitti) wrote :

Fixed upstream long ago, assuming fixed in current xulrunner 2.0.

Changed in xulrunner-1.9.1 (Ubuntu):
assignee: Chris Coulson (chrisccoulson) → nobody
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.