Bug #303232 “armel gcc default optimisations” : Bugs : gcc-4.4 package : Ubuntu

Matthias Klose (doko) wrote on 2008-11-28: Re: [Bug 303232] [NEW] armel gcc default optimisations

#1

Catalin Marinas wrote:
> The current patches from Debian applied to gcc change the default target
> to ARMv4t (arm9tdmi) in the gcc/config/arm/linux-eabi.h file (used,
> generally, by gcc/config/arm/arm.c to define the default architecture
> target and CPU tuning).

The Debian patch is not applied in the Ubuntu build.

Loïc Minier (lool) on 2009-01-20

Changed in gcc-4.3:
importance:	Undecided → High
status:	New → Triaged

Matthias Klose (doko) wrote on 2009-01-20:

#2

no, it's not triaged.

- is -mtune=cortex-a8 our choice? how does it affect other processors?
- is -march=armv5t needed? afaiu we cannot run anymore on the thecus (Xscale)

Changed in gcc-4.3:
status:	Triaged → Incomplete

Catalin Marinas (catalin-marinas) wrote on 2009-01-20:

#3

The -mtune=cortex-a8 option optimises the generated code for the Cortex-A8 (ARMv7 processor) pipeline without affecting the instruction set used (which is still ARMv5T). The resulting code will be optimal on ARMv7, but there may be a slight performance drop (if any) on other architecture versions.

Without the Debian gcc patches for forcing ARMv4T, the default in gcc I think is ARMv5T anyway.

AFAIK, the Xscale platform is ARMv5T but without the VFP coprocessor (affected by the -mfpu=vfp option). Is Thecus planned to be a supported platform by the upcoming ARM Ubuntu port?

Loïc Minier (lool) wrote on 2009-01-20:

#4

Catalin, we'd like to support the Thecus N2100 and other XScale IOP32x with the -iop32x kernel flavour. The N2100 has an IOP 80219 which is ARMv5TE and like you I think it lacks an FPU.

Debian uses -mfloat-abi=soft; I understand that using -mfloat-abi=softfp creates binaries which are compatible with Debian's, but there's a performance hit on systems which don't have a FPU as floating point instructions generate a kernel trap to emulate them.

Matthias, what's the current setup?

I don't mind tuning for ARMv7 at all (probably makes little difference), but the FPU question is harder: it's a big hit for systems without a FPU to meet fp instructions, and it's probably a comparable big win for systems with a FPU to use fp instructions instead of full soft emulation.
Would it be possible to have two libgccs with one doing full software emulation, and another one using the FPU? This would probably allow us to use -mfloat-abi=soft and still benefit from the FPU to some degree on systems having one.

Changed in gcc-4.3:
status:	Incomplete → Confirmed

Catalin Marinas (catalin-marinas) wrote on 2009-01-22:

#5

By default, gcc generates software floating point, so for this particular case no additional command line options are needed.

Using -mfloat-abi=softfp indeed creates binaries compatible with Debian (soft-float) since the function calling convention uses standard registers and stack for passing floating point arguments rather than VFP registers.

The VFP (hard-float) instructions are not emulated by the kernel (only the older FPA instructions are, but these are no longer used by Debian armel), though much of the support for emulation is already in there. Currently, the kernel generates a SIGILL if such a VFP instruction is encountered and the CPU doesn't support it.
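
The trade-off between the float ABIs discussed here can be summarised in a small table. The sketch below is illustrative only: the `soft` and `softfp` rows follow the description in this thread, the `hard` row is taken from the GCC documentation for -mfloat-abi rather than from the thread, and the helper names are made up for the example.

```python
# Summary of the GCC -mfloat-abi values as described in this thread.
# The "hard" row comes from the GCC documentation, not from the thread.
FLOAT_ABIS = {
    "soft":   {"may_emit_fpu_insns": False, "fp_args_in": "core registers/stack"},
    "softfp": {"may_emit_fpu_insns": True,  "fp_args_in": "core registers/stack"},
    "hard":   {"may_emit_fpu_insns": True,  "fp_args_in": "VFP registers"},
}


def link_compatible(abi_a, abi_b):
    """Two objects can call each other if FP arguments are passed the same way."""
    return FLOAT_ABIS[abi_a]["fp_args_in"] == FLOAT_ABIS[abi_b]["fp_args_in"]


def safe_without_fpu(abi):
    """Code is safe on an FPU-less CPU only if no FPU instructions are emitted."""
    return not FLOAT_ABIS[abi]["may_emit_fpu_insns"]


if __name__ == "__main__":
    for abi in FLOAT_ABIS:
        print(abi,
              "| links with soft:", link_compatible(abi, "soft"),
              "| safe without FPU:", safe_without_fpu(abi))
```

This captures the point made above: softfp binaries interoperate with Debian's soft-float binaries, but they are not safe to run on a CPU without VFP.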

Loïc Minier (lool) wrote on 2009-01-22:

#6

Right, so -mfloat-abi=softfp generates binaries which use compatible calling conventions but do require a VFP.

I don't think we want this; instead we should rather optimize libs and programs to select VFP at runtime if available or provide alternate packages for VFP versus non VFP systems. One obvious candidate is libc.

I wonder whether it's possible and useful to build a libgcc which implements soft float computations with the FPU? That would seem like a good thing to do to optimize all soft float calls on systems which have a FPU.

Loïc Minier (lool) wrote on 2009-01-22:

#7

(Typo s/VFP/FPU on the first line above, sorry.)

Does someone have pointers on the VFP trap handling kernel patches and on the issues with them (I guess there are issues if the patches aren't in the mainline)?

Catalin Marinas (catalin-marinas) wrote on 2009-01-22: Re: [Bug 303232] Re: armel gcc default optimisations

#8

If -mfpu=vfp is enabled, the compiler will generate VFP instructions in
the asm code directly rather than calls to the libgcc soft-float code.
Even if the libgcc soft-float functions could be replaced with VFP
instructions, you still get an additional branch to those operations and
probably lower performance than complete VFP optimisation. Please note
that I haven't tried this approach, so I cannot comment further on the
performance.

The libm is a candidate for this optimisation but there are applications
that would themselves benefit from being compiled with VFP. As I
understand, there are difficulties in maintaining two separate variants
for some packages (like OpenOffice).

There are no patches to enable full VFP emulation. AFAIK, the Linux
kernel community weren't keen to get such patches merged into mainline
because of emulation performance reasons.

Just for clarification, the kernel currently needs to trap the VFP
exceptions for 3 reasons:

1. The VFP is disabled at a context switch and the first encounter of a
VFP instruction triggers an undefined exception. At this point, the
kernel saves the VFP registers for the old application and loads those
for the new one.

2. On VFPv2 (found on ARMv5 and ARMv6 processors), the hardware does not
implement full IEEE754 compliance. There are corner cases (like
denormalised numbers) which aren't supported by hardware but the kernel
traps and emulates them. Note that the VFPv3 (on ARMv7 processors) has
full support for the IEEE754 compliance and there is no need for
additional kernel emulation (though the code is still there since it's
harmless).

3. Floating-point exceptions (e.g. divide by zero), if the user
application has enabled them and the hardware supports them.

Because of point 2 above, we have almost all the emulation code needed.
The only missing part is the emulation of the VFP registers (the kernel
currently reads the hardware ones) and maybe some optimisation to read
ahead and emulate more than one instruction in the exception handler
before returning to user.
--
IMPORTANT NOTICE: The contents of this email and any attachments are confidential and may also be privileged. If you are not the intended recipient, please notify the sender immediately and do not disclose the contents to any other person, use it for any purpose, or store or copy the information in any medium. Thank you.
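
Point 1 above (lazy VFP state switching) can be illustrated with a toy model. This is a simplified sketch, not the kernel's implementation: the class names, the four-register bank and the owner-tracking detail are assumptions made for the example.

```python
class Task:
    """A user process with a saved copy of its VFP registers."""

    def __init__(self, name):
        self.name = name
        self.saved_vfp = [0.0] * 4  # hypothetical 4-register bank


class Cpu:
    """Toy CPU: the VFP unit is disabled on every context switch."""

    def __init__(self):
        self.vfp_regs = [0.0] * 4
        self.vfp_enabled = False
        self.vfp_owner = None  # task whose state is live in the hardware
        self.current = None
        self.traps = 0

    def context_switch(self, task):
        # No registers are saved or restored here; the VFP is only disabled.
        self.current = task
        self.vfp_enabled = False

    def _undefined_trap(self):
        # The first VFP instruction after a switch lands here.
        self.traps += 1
        if self.vfp_owner is not self.current:
            if self.vfp_owner is not None:
                self.vfp_owner.saved_vfp = list(self.vfp_regs)  # save old state
            self.vfp_regs = list(self.current.saved_vfp)        # load new state
            self.vfp_owner = self.current
        self.vfp_enabled = True

    def vfp_write(self, reg, value):
        if not self.vfp_enabled:
            self._undefined_trap()
        self.vfp_regs[reg] = value

    def vfp_read(self, reg):
        if not self.vfp_enabled:
            self._undefined_trap()
        return self.vfp_regs[reg]


if __name__ == "__main__":
    cpu, a, b = Cpu(), Task("a"), Task("b")
    cpu.context_switch(a)
    cpu.vfp_write(0, 1.5)   # traps once, then runs at full speed
    cpu.context_switch(b)   # b never touches the VFP: nothing is saved
    cpu.context_switch(a)
    print(cpu.vfp_read(0), "traps:", cpu.traps)
```

The benefit of the scheme is visible in the model: a task that never executes a VFP instruction never causes a register save or restore.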

Catalin Marinas (catalin-marinas) wrote on 2009-01-22:

#9

Please ignore the legal disclaimer at the end of the previous post (I used the wrong SMTP server and it got appended automatically). Thanks.

Loïc Minier (lool) wrote on 2009-01-22:

#10

[Ack, I do understand there's a performance hit when going via libgcc, but I was hoping this could be a good compromise between not using hardware FPU at all and generating traps on systems without FPU.]

Concerning libm as a candidate for opts: libm is in libc which is definitely a candidate for an optimized version; for example we have an i686 version for the i386 arch which provides an alternate libm: http://packages.ubuntu.com/jaunty/i386/libc6-i686/filelist

Thanks for the details on supported traps; allow me to recap to make sure I understand this correctly.

The mainline kernel (we're primarily looking at 2.6.28 in jaunty) supports some traps out of the box, but only:
- to handle VFP state saving and restoring across context switches (I guess this is to avoid saving/restoring the FPU regs when the switched to context doesn't need that)
- to emulate some corner cases not supported by all hardware FPUs
- on math errors
and we don't need to patch anything to get the above when we're using VFP instructions in programs. Perhaps we need to turn on some kernel CONFIG_s in our armel flavours though?

Otherwise, the mainline kernel can't emulate the base set of VFP instructions which ARMv5 and v6 cores with a FPU support on systems which lack a FPU (such as the Xscale example); we could perhaps patch this in, but it's not wanted in the mainline because it's too slow. I take it that the response from Linux developers is that we're supposed to not use VFP instructions in userspace on systems without a FPU?

I didn't find any flags in the gcc man page about VFPv2 or v3: I guess one can only tell gcc to generate instructions for the full VFP set or not at all.

Michael Casadevall (mcasadevall) wrote on 2009-01-26:

#11

I did some performance benchmarks with pybench on an ARMv7 board. To prevent any third-party processes from interfering, the board was running Ubuntu in single user mode, with the stock glibc. I'll run another set of benchmarks with our glibc tuned with the proposed flags, and also do another set of benchmarks on my NSLU2 (XScale/ARMv5) to see what sort of performance hit we're going to see.

With our current CFLAGS:
* Round 1 done in 62.165 seconds.
* Round 2 done in 62.229 seconds.
* Round 3 done in 61.994 seconds.
* Round 4 done in 61.616 seconds.
* Round 5 done in 62.371 seconds.
* Round 6 done in 63.191 seconds.
* Round 7 done in 62.180 seconds.
* Round 8 done in 62.165 seconds.
* Round 9 done in 61.906 seconds.
* Round 10 done in 62.977 seconds.

Test                             minimum  average  operation  overhead
-------------------------------------------------------------------------------
          BuiltinFunctionCalls:   1302ms   1317ms    2.58us    3.114ms
           BuiltinMethodLookup:    871ms    871ms    0.83us    3.645ms
                 CompareFloats:    837ms    974ms    0.81us    4.171ms
         CompareFloatsIntegers:    963ms   1052ms    1.17us    3.112ms
               CompareIntegers:    657ms    659ms    0.37us    6.304ms
        CompareInternedStrings:    667ms    670ms    0.45us   16.016ms
                  CompareLongs:    564ms    566ms    0.54us    3.641ms
                CompareStrings:    550ms    556ms    0.56us   10.780ms
                CompareUnicode:    548ms    551ms    0.74us    8.163ms
    ComplexPythonFunctionCalls:   1699ms   1750ms    8.75us    5.260ms
                 ConcatStrings:   1017ms   1099ms    2.20us    6.975ms
                 ConcatUnicode:   4336ms   4720ms   15.73us    4.897ms
               CreateInstances:   1454ms   1463ms   13.06us    4.242ms
            CreateNewInstances:   1267ms   1283ms   15.28us    3.627ms
       CreateStringsWithConcat:    742ms    750ms    0.75us   10.567ms
       CreateUnicodeWithConcat:    772ms    778ms    1.94us    4.172ms
                  DictCreation:    695ms    695ms    1.74us    4.172ms
             DictWithFloatKeys:   1081ms   1086ms    1.21us    7.899ms
           DictWithIntegerKeys:    767ms    771ms    0.64us   10.566ms
            DictWithStringKeys:    668ms    672ms    0.56us   10.567ms
                      ForLoops:    671ms    672ms   26.88us    0.685ms
                    IfThenElse:    533ms    534ms    0.40us    7.898ms
                   ListSlicing:    557ms    563ms   40.19us    0.642ms
                NestedForLoops:    764ms    771ms    0.51us    0.247ms
      NestedListComprehensions:   1129ms   1148ms   95.63us    1.012ms
          NormalClassAttribute:    761ms    771ms    0.64us    5.271ms
       NormalInstanceAttribute:    697ms    697ms    0.58us    5.278ms
           PythonFunctionCalls:    693ms    698ms    2.11us    3.125ms
             PythonMethodCalls:   1568ms   1574ms    7.00us    1.566ms
                     Recursion:   1037ms   1043ms   20.86us    5.249ms
                  SecondImport:   1380ms   1382ms   13.82us    2.052ms
           SecondPackageImport:   1408ms   1411ms   14.11us    2.052ms
         SecondSubmoduleImport:   1600ms   1602ms   16.02us    2.052ms
       SimpleComplexArithmetic:   1581ms   1584ms    1.80us    4.171ms
        SimpleDictManipulation:    778ms    782ms    0.65us    5.248ms
         SimpleFloatArithmetic:   1415ms   1418ms    1.07us    6.303ms
      SimpleIntFloatArithmetic:    659ms    660ms    0.50us    6.304ms
       SimpleIntegerArithmetic:    659ms    661ms    0.50us    6.305ms
      SimpleListComprehensions:    932ms    947ms   78.88us    1.015ms
        SimpleListManipulation:    642ms    646ms    0.55us    6.840ms
          SimpleLongArithmetic:    621ms    637ms    0.97us    3.111ms
                    SmallLists:    984ms   1000ms    1.47us    4.172ms
                   SmallTuples:   1038ms   1043ms    1.93us    4.700ms
         SpecialClassAttribute:    754ms    755ms    0.63us    5.272ms
      SpecialInstanceAttribute:    818ms    819ms    0.68us    5.277ms
                StringMappings:   1426ms   1428ms    5.66us    4.474ms
              StringPredicates:   1303ms   1326ms    1.89us   15.920ms
                 StringSlicing:    809ms    864ms    1.54us    9.299ms
                     TryExcept:    658ms    660ms    0.29us    7.895ms
                    TryFinally:   1530ms   1532ms    9.58us    4.263ms
                TryRaiseExcept:   1196ms   1203ms   18.79us    4.172ms
                  TupleSlicing:    728ms    733ms    2.79us    0.424ms
               UnicodeMappings:    703ms    705ms   19.59us    3.839ms
             UnicodePredicates:   1390ms   1396ms    2.58us   19.107ms
             UnicodeProperties:   1723ms   1729ms    4.32us   15.925ms
                UnicodeSlicing:    973ms   1467ms    2.99us    8.250ms
                   WithFinally:   1457ms   1458ms    9.11us    4.259ms
               WithRaiseExcept:   1663ms   1681ms   21.02us    5.357ms

-------------------------------------------------------------------------------
Totals:                          60691ms  62279ms

With the proposed CFLAGS:
* Round 1 done in 60.513 seconds.
* Round 2 done in 60.353 seconds.
* Round 3 done in 61.784 seconds.
* Round 4 done in 60.537 seconds.
* Round 5 done in 60.090 seconds.
* Round 6 done in 59.704 seconds.
* Round 7 done in 60.323 seconds.
* Round 8 done in 60.244 seconds.
* Round 9 done in 60.026 seconds.
* Round 10 done in 58.853 seconds.
Average of 60.243 seconds per test run

Test                             minimum  average  operation  overhead
-------------------------------------------------------------------------------
          BuiltinFunctionCalls:   1234ms   1242ms    2.43us    1.986ms
           BuiltinMethodLookup:    846ms    867ms    0.83us    2.322ms
                 CompareFloats:    918ms   1066ms    0.89us    2.654ms
         CompareFloatsIntegers:    876ms    974ms    1.08us    1.986ms
               CompareIntegers:    681ms    682ms    0.38us    3.988ms
        CompareInternedStrings:    694ms    694ms    0.46us   10.148ms
                  CompareLongs:    548ms    548ms    0.52us    2.320ms
                CompareStrings:    564ms    564ms    0.56us    6.895ms
                CompareUnicode:    562ms    562ms    0.75us    5.259ms
    ComplexPythonFunctionCalls:   1632ms   1710ms    8.55us    3.338ms
                 ConcatStrings:    960ms   1099ms    2.20us    5.021ms
                 ConcatUnicode:   4146ms   4732ms   15.77us    3.836ms
               CreateInstances:   1411ms   1433ms   12.79us    2.719ms
            CreateNewInstances:   1296ms   1314ms   15.65us    2.496ms
       CreateStringsWithConcat:    760ms    763ms    0.76us    6.674ms
       CreateUnicodeWithConcat:    728ms    751ms    1.88us    2.655ms
                  DictCreation:    644ms    645ms    1.61us    2.653ms
             DictWithFloatKeys:    914ms    916ms    1.02us    4.999ms
           DictWithIntegerKeys:    723ms    723ms    0.60us    6.674ms
            DictWithStringKeys:    690ms    690ms    0.57us    6.676ms
                      ForLoops:    687ms    687ms   27.48us    0.460ms
                    IfThenElse:    553ms    553ms    0.41us    4.999ms
                   ListSlicing:    558ms    560ms   40.00us    0.653ms
                NestedForLoops:    780ms    781ms    0.52us    0.165ms
      NestedListComprehensions:    987ms   1044ms   87.01us    0.669ms
          NormalClassAttribute:    779ms    780ms    0.65us    3.342ms
       NormalInstanceAttribute:    719ms    721ms    0.60us    3.350ms
           PythonFunctionCalls:    713ms    715ms    2.17us    1.998ms
             PythonMethodCalls:   1509ms   1523ms    6.77us    1.025ms
                     Recursion:    944ms    945ms   18.90us    3.322ms
                  SecondImport:   1343ms   1346ms   13.46us    1.342ms
           SecondPackageImport:   1352ms   1355ms   13.55us    1.375ms
         SecondSubmoduleImport:   1589ms   1594ms   15.94us    1.373ms
       SimpleComplexArithmetic:    909ms    914ms    1.04us    2.764ms
        SimpleDictManipulation:    712ms    714ms    0.59us    3.474ms
         SimpleFloatArithmetic:    961ms    966ms    0.73us    4.156ms
      SimpleIntFloatArithmetic:    527ms    527ms    0.40us    4.162ms
       SimpleIntegerArithmetic:    527ms    527ms    0.40us    4.165ms
      SimpleListComprehensions:    874ms    887ms   73.88us    0.701ms
        SimpleListManipulation:    588ms    589ms    0.50us    4.509ms
          SimpleLongArithmetic:    610ms    611ms    0.93us    2.068ms
                    SmallLists:    962ms    968ms    1.42us    2.768ms
                   SmallTuples:   1024ms   1039ms    1.92us    3.112ms
         SpecialClassAttribute:    775ms    776ms    0.65us    3.344ms
      SpecialInstanceAttribute:    893ms    899ms    0.75us    3.349ms
                StringMappings:   1379ms   1396ms    5.54us    3.128ms
              StringPredicates:   2549ms   2549ms    3.64us   12.080ms
                 StringSlicing:    743ms    846ms    1.51us    6.583ms
                     TryExcept:    728ms    728ms    0.32us    5.211ms
                    TryFinally:   1459ms   1461ms    9.13us    2.770ms
                TryRaiseExcept:   1217ms   1235ms   19.29us    2.763ms
                  TupleSlicing:    650ms    679ms    2.59us    0.449ms
               UnicodeMappings:    689ms    692ms   19.22us    4.300ms
             UnicodePredicates:   1027ms   1028ms    1.90us   14.498ms
             UnicodeProperties:   1294ms   1305ms    3.26us   12.075ms
                UnicodeSlicing:    928ms   1316ms    2.69us    5.849ms
                   WithFinally:   1377ms   1379ms    8.62us    2.769ms
               WithRaiseExcept:   1626ms   1631ms   20.39us    3.467ms

That being said, the initial results, while showing an improvement, are not very impressive, and I suspect we'll be seeing a reduction in performance on the NSLU2 due to being tuned for features its core doesn't have. I'll post more results once I have rebuilt glibc.
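
To put a number on "not very impressive", the round times and a few per-test minimums from the two tables above can be compared directly. The sketch below only re-does arithmetic on figures already posted; the helper names are made up for the example.

```python
# Round times in seconds, copied from the two pybench runs above.
current_rounds = [62.165, 62.229, 61.994, 61.616, 62.371,
                  63.191, 62.180, 62.165, 61.906, 62.977]
proposed_rounds = [60.513, 60.353, 61.784, 60.537, 60.090,
                   59.704, 60.323, 60.244, 60.026, 58.853]

# (current, proposed) minimum times in ms for a few tests from the tables.
minimum_ms = {
    "SimpleFloatArithmetic": (1415, 961),
    "SimpleComplexArithmetic": (1581, 909),
    "StringPredicates": (1303, 2549),
}


def mean(values):
    return sum(values) / len(values)


def improvement_pct(before, after):
    """Positive means the proposed flags are faster; negative means slower."""
    return (before - after) / before * 100.0


avg_current = mean(current_rounds)
avg_proposed = mean(proposed_rounds)
overall = improvement_pct(avg_current, avg_proposed)
per_test = {name: improvement_pct(before, after)
            for name, (before, after) in minimum_ms.items()}

if __name__ == "__main__":
    print(f"current:  {avg_current:.3f} s per round")
    print(f"proposed: {avg_proposed:.3f} s per round")
    print(f"overall improvement: {overall:.1f}%")
    for name, pct in per_test.items():
        print(f"{name:>24}: {pct:+.1f}%")
```

The overall gain is roughly 3%, while the floating-point tests improve by a third or more and StringPredicates gets markedly slower, so the total hides a lot of per-test variation.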

Dave Martin (dave-martin-arm) wrote on 2009-01-27:

#12

Interesting results, but I guess we would not expect to see much of a performance increase by building python with hardware floating-point. It's an interpreter, so I expect that floating-point instructions are going to account for only a small proportion of the code executed, even when the python program being executed by the interpreter is doing some floating-point number-crunching.

If you can post the XScale results then that would be interesting: this would give a good indication of how the performance of general-purpose code will be affected when running code built with the proposed options on ARM9 platforms.

To get a better idea of the effect of building specific components for VFP, a package which does a lot of heavy floating point internally would be more interesting. If we can obtain benchmarks for some backend libraries such as the following, this may give us a better overall idea of what performance improvements would be possible by building VFP-enabled versions of some components:
    freetype
    pango
    cairo
    media backends (not so sure here, but possibly libraries such as ffmpeg, vorbis, fftw)
    spidermonkey JavaScript engine (http://ftp.mozilla.org/pub/mozilla.org/js/)

I haven't looked into these in detail yet.

Catalin Marinas (catalin-marinas) wrote on 2009-01-28:

#13

On Thu, 2009-01-22 at 15:58 +0000, Loïc Minier wrote:
> [Ack, I do understand there's a performance hit when going via libgcc,
> but I was hoping this could be a good compromise between not using
> hardware FPU at all and generating traps on systems without FPU.]

I haven't looked but I think there is more work here as the softfloat
functions in libgcc would need to be rewritten to use VFP instructions.
Compiling with -mfpu=vfp simply ignores those functions.

> Concerning libm as a candidate for opts: libm is in libc which is
> definitely a candidate for an optimized version; for example we have
> an i686 version for the i386 arch which provides an alternate libm:
> http://packages.ubuntu.com/jaunty/i386/libc6-i686/filelist

Could we not have the same for other packages that would benefit from
VFP? I'm not too familiar with the version conflicts in the .deb
packages but, for example, could we have something like a dummy
openoffice package which depends on either openoffice-softfp or
openoffice-vfp?

> The mainline kernel (we're primarily looking at 2.6.28 in jaunty)
> supports some traps out of the box, but only:
> - to handle VFP state saving and restoring across context switches (I
> guess this is to avoid saving/restoring the FPU regs when the switched
> to context doesn't need that)
> - to emulate some corner cases not supported by all hardware FPUs
> - on math errors
> and we don't need to patch anything to get the above when we're using
> VFP instructions in programs. Perhaps we need to turn on some kernel
> CONFIG_s in our armel flavours though?

You need to have CONFIG_VFP on. This could always be on by default as
the kernel checks for the hardware feature before enabling the
corresponding hwcap bit.
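
From userspace, one way to see whether the kernel reports VFP is the Features line of /proc/cpuinfo on ARM Linux. The following is a minimal sketch; the function names and the two sample strings are illustrative and were not captured from real boards.

```python
def cpu_features(cpuinfo_text):
    """Return the set of tokens on the 'Features' line of /proc/cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "features":
            return set(value.split())
    return set()


def has_vfp(cpuinfo_text):
    """True if the kernel advertises the 'vfp' feature."""
    return "vfp" in cpu_features(cpuinfo_text)


# Illustrative samples only (assumed layout, not real captures).
SAMPLE_WITH_VFP = (
    "Processor\t: ARMv7 Processor rev 1 (v7l)\n"
    "Features\t: swp half thumb fastmult vfp edsp\n"
)
SAMPLE_NO_VFP = (
    "Processor\t: XScale-80219 rev 0 (v5l)\n"
    "Features\t: swp half thumb fastmult edsp\n"
)

if __name__ == "__main__":
    print(has_vfp(SAMPLE_WITH_VFP))
    print(has_vfp(SAMPLE_NO_VFP))
```

On a real system the text would come from reading /proc/cpuinfo; a program or installer could use such a check to pick a VFP-enabled code path or package.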

> Otherwise, the mainline kernel can't emulate the base set of VFP
> instructions which ARMv5 and v6 cores with a FPU support on systems
> which lack a FPU (such as the Xscale example); we could perhaps patch
> this in, but it's not wanted in the mainline because it's too slow. I
> take it that the response from Linux developers is that we're supposed
> to not use VFP instructions in userspace on systems without a FPU?

Probably that's the reason as we now have a fast softfloat
implementation. This is the relevant thread (though not many opinions):

http://thread.gmane.org/gmane.linux.ports.arm.kernel/47219

> I didn't find any flags in the gcc man page about VFPv2 or v3: I guess
> one can only tell gcc to generate instructions for the full VFP set or
> not at all.

By default, with -mfpu=vfp, the compiler generates VFPv2 code and I
think that's what should be used as VFPv3 only comes with ARMv7.

Colin Watson (cjwatson) wrote on 2009-01-28:

#14

On Wed, Jan 28, 2009 at 11:17:27AM -0000, Catalin Marinas wrote:
> On Thu, 2009-01-22 at 15:58 +0000, Loïc Minier wrote:
> > Concerning libm as a candidate for opts: libm is in libc which is
> > definitely a candidate for an optimized version; for example we have
> > an i686 version for the i386 arch which provides an alternate libm:
> > http://packages.ubuntu.com/jaunty/i386/libc6-i686/filelist
>
> Could we not have the same for other packages that would benefit from
> VFP? I'm not too familiar with the version conflicts in the .deb
> packages but, for example, could we have something like a dummy
> openoffice package which depends on either openoffice-softfp or
> openoffice-vfp?

OpenOffice.org's packaging is already very complex, consisting of 57
binary packages with complex dependencies both among its own packages
and between them and others. I think it is unlikely that adding FP
variants of many of those with different binary package names would
successfully meet anyone's requirements.

Even elsewhere, this approach is fraught with practical problems. We can
take this approach successfully with glibc, but glibc is a very special
case. We can cope with having libc6-i686, libc6-sparcv9, and the like
because libc6 is always installed and essentially never changes its
SONAME, and so the installer can include special-case code to make sure
that optimised variants of it are installed; we can easily extend that
code for ARM, and I expect it will make sense to do so. This isn't
something we can generalise in any straightforward way, though. The
package management system used on the system after installation (i.e.
dpkg, apt, and friends) doesn't know how to identify and select packages
appropriate for variants of a particular architecture, and so there
would be no way to guarantee that the proper optimised packages get
installed when a user asks for the dummy name.

(Attempting to add this feature would run into a number of interesting
and long-standing roadblocks such as the lack of versioned Provides, and
I suspect would start to approach "multiarch", which is a huge and
somewhat technically-controversial project.)

This problem applies to separate binary packages, but not to hwcap
optimisations in general, of course. As long as hwcap works (whether in
the runtime linker or a similar implementation elsewhere, e.g.
GStreamer), we can just put the optimised objects in the same package as
the unoptimised ones, as long as there aren't too many variants and you
don't mind the binary package size growing a bit. Since I understand
that we aren't likely to be supporting normal CD images for ARM, this
isn't as much of a problem as it would be for other architectures.

(OOo itself might still be a problem due to sheer size and the utter
horribleness of its build system, but I assume you meant it more as an
example than as an initial target!)

Revision history for this message

Matthias Klose (doko) wrote on 2009-01-28:

#15

Catalin Marinas wrote:
>> I didn't find any flags in the gcc man page about VFPv2 or v3: I guess
>> one can only tell gcc to generate instructions for the full VFP set or
>> not at all.
>
> By default, with -mfpu=vfp, the compiler generates VFPv2 code and I
> think that's what should be used as VFPv3 only comes with ARMv7.

Is there a known GCC multilib config for non-vfp/vfp?

Revision history for this message

Matthias Klose (doko) wrote on 2009-01-28:

#16

I compared pybench and pystone benchmark results. Relative to a plain armv5 default, building with the proposed flags, together with a glibc built with the same flags, gives a 4-5% improvement in the benchmarks. Note that these benchmarks don't use floating point.

Revision history for this message

Dave Martin (dave-martin-arm) wrote on 2009-01-28:

#17

Which board was this on?

Revision history for this message

Matthias Klose (doko) wrote on 2009-01-28:

#18

Dave Martin wrote:
> Which board was this on?

my python benchmarks were done on the babbage board.

Revision history for this message

Dave Martin (dave-martin-arm) wrote on 2009-01-28:

#19

This seems like a plausible amount of improvement for Cortex-A8 tuning on a non-floating-point-intensive package and library stack.

Did you get a chance to try interoperable non-vfp code on ARM9/XScale?
(i.e., -march=armv5 -mtune=cortex-a8... I'm assuming -msoft-float is present by default).

It would be worth seeing whether the Cortex-A8 tuning has any appreciable impact on performance on these platforms.

Revision history for this message

Catalin Marinas (catalin-marinas) wrote on 2009-01-29:

#20

Thinking a bit more, I think the VFP emulation for older hardware could
be a viable option to have all the packages compiled with -mfpu=vfp. I
can't estimate the performance difference between softfloat and VFP
emulation but, if the performance on Xscale is visibly affected, we
still have the option of two builds of glibc (and other libraries) where
the default one is VFP and the additional one is softfloat. Since we use
-mfloat-abi=softfp, mixing VFP and softfloat packages is still possible.

As I said, all the VFP floating point operations are already implemented
in the kernel. There are around 20 additional instructions (actually
variants of a smaller number) for transferring data between VFP registers
and memory or ARM registers which aren't handled; in addition, the VFP
register bank would have to be accessed from memory rather than from the
coprocessor registers.
As a rough estimate, I think it would take me around 2 weeks to get this
done.

Does this sound feasible? Thanks.

Revision history for this message

Loïc Minier (lool) wrote on 2009-02-04:

#21

Catalin, thanks for your proposal; we discussed this option this week since we were meeting with other Canonical people, notably Matthias and Colin.

On the idea to handle all VFP instructions as kernel traps on systems without a FPU, we think it will be too slow. In my experience, even alignment traps are a huge hit on performance when they happen, and FPU instructions would appear in a lot of random places. So while it would technically "work", it wouldn't be working at a decent level.

We're currently trying to assemble good benchmarks to decide about another solution, and we will get back to you as soon as we manage to have enough data. (We're using the mojo prebuilt hardy archives as a base to compare the impact of various opts in various combinations.)

However, the kernel traps would be an excellent safety net in case some binary with VFP instructions ends up on such systems, and it might be a good idea to have this support; but it won't be enough by itself to turn VFP instruction generation on in jaunty. :-/

Revision history for this message

Dave Martin (dave-martin-arm) wrote on 2009-02-13:

#22

I've been taking a look at package lists and the dependencies of key apps and libraries. The following look worth investigating with respect to VFP optimisations:

I haven't tidied this list up fully, so there is a mix of binary and source package names... I've done no benchmarking or profiling on these yet, so I can't guarantee that they're all relevant... anyway, here's my current list:

Desktop/Universal (applicable to pretty much any install with a GUI)
gcc libraries (libgcc* libstdc++* libgcj* etc.)
libc6 (i.e., glibc)
cairo2
freetype6
pango
pixman (uncertain how much floating-point this contains)
libjpeg* (uncertain how much floating-point this contains)
libpng* (uncertain how much floating-point this contains)
libgtk+* (uncertain how much floating-point this contains)
libgnome* (uncertain how much floating-point this contains)
libqt* (uncertain how much floating-point this contains; is this used outside kubuntu?)
libgl1-mesa-* (Probably we're expecting OEM's own GL acceleration to be used instead?)
libglu1-mesa (Probably we're expecting OEM's own GL acceleration to be used instead?)
xorg (probably not a priority; most of the potential benefit will come from SoC manufacturers' X drivers. Only certain extensions such as RENDER, COMPOSITE and GLX, and the core font rendering which is not much used these days, are likely to make much use of floating-point)
xscreensaver* (Not directly useful, but may impact user perception of performance. There is a separate, related issue: xscreensaver can be bad for battery-powered devices... I notice that xscreensaver-gl is in ubuntu-netbook-remix by default, which can be bad news especially where 3D acceleration is partial or absent. See also threads such as https://bugs.launchpad.net/ubuntu/+source/xscreensaver/+bug/174191)

Web (applicable to all standard installs)
firefox
xulrunner
openjdk-6*

Generic acceleration libraries (desirable)
liboil (number-crunching acceleration library used by pulseaudio, gstreamer etc.)

Media (desirable)
gstreamer --- I don't know the architecture of gstreamer, so I can't comment on all the various subordinate packages here
gstreamer*-ffmpeg
gstreamer*-mp3
pulseaudio
libsamplerate
libvorbis
libtheora ? (Don't know how much theora video is really out there)

Office / apps (desirable)
openoffice.org
gimp

I've intentionally disregarded whether packages are libraries or not, in order to produce a more complete picture. However, many of these packages _are_ libraries, so there is probably significant benefit to be had even if only library packages are built for VFP.

Cheers

Revision history for this message

Dave Martin (dave-martin-arm) wrote on 2009-02-13:

#23

Argh, tabs get entered in the thread as   --- if anyone finds that last post hard to read, let me know and I'll repost it in a more readable way.

Revision history for this message

Dave Martin (dave-martin-arm) wrote on 2009-02-13:

#24

I was also having a chat with Catalin about an alternative way of making VFP optimisations available.

I observed that Handhelds Mojo addresses architecture variants by having special package servers which are placed at the start of the apt sources list. See http://mojo.handhelds.org/hasty-armv6el-vfp/ for details of how it's done.

I haven't looked into the details, but it's feasible to share most packages between a specialised package server location and the standard server.

This approach could be used to make VFP-optimised versions of application packages available as well as libraries without the need to install all variants on each platform and choose at run-time.

Does anyone have a view on this?

Revision history for this message

Emmet Hikory (persia) wrote on 2009-02-13:

#25

Given the nature of the Ubuntu archives, it's probably easier to have differently named packages (e.g. libpangomm-1.4-1-vfp) distributed in the same archive, and maybe have a metapackage that pulled all the -vfp variants to ease the user experience than to try to configure multiple parallel repositories. I'm a fan of runtime detection (as done, for example, in liboil), but where that ends up using too much disk space, separate packages also work.

Revision history for this message

Loïc Minier (lool) wrote on 2009-02-16:

#26

Dave, first, thanks for your list of libs; I'd like to refine the list here.
I think Matthias looked at building multiple libgcc variants but that didn't seem to work too well; it might not be interesting either, as libgcc is meant to be a software implementation and hence doesn't have code to use a hardware FPU.
Concerning libstdc++ this might be an interesting and easier target albeit I don't know how much fp math it does; it might face the same issue as libgcc though, I'm not sure.

I don't think libgnome* are going to be too interesting.

I don't know how useful qt and mesa would be; I think these can be better optimized by implementing custom backends for the GPU one targets. I've seen recent activity on the beagleboard list to enable a SGX backend for Qt for instance, and I know the intrepid PowerVR/Poulsbo drivers use some new mesa path.

Everything font related is likely to be very floating-point bound, so pango and freetype are high on my list.

I'm a bit worried for Pango and Gtk+ as these dlopen some plugins and I don't know whether hwcap can be used there; the plugins aren't too important for most of the use of the libs though.

cairo/pixman: would certainly benefit as well; I think these can be tuned to use a DSP or GPU-specific opts.

libpng/libjpeg: no strong opinion.

xorg: no idea what in xorg would benefit from it exactly; perhaps Xft?

Revision history for this message

Loïc Minier (lool) wrote on 2009-02-16:

#27

Dave, on the use of multiple sources.list entries:
Note that http://mojo.handhelds.org/hasty-armv6el-vfp/ only cascades deb-src entries. We discussed the idea of another archive, but this is quite a stretch to set up and maintain; of course you need infrastructure (disk space, buildds, bandwidth etc.), but it also needs maintenance (watching the builds) and separate QA, and it is quite confusing to end users. Ideally, you also want a mechanism to prevent mismatches of packages on target systems. The way to do this would be to add subarch support to dpkg/apt and enforce subarch equality when installing binary packages (the subarch of the binary package needs to match the subarch of the system).
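
To make the subarch equality rule concrete, here is a small sketch of the check an installer could apply. It is purely illustrative: dpkg and apt have no "subarch" field in this discussion, so the field, the subarch names and the function are hypothetical. It also assumes that a package with no subarch is generic and may be installed on any system of the same architecture.

```python
from typing import NamedTuple, Optional


class Target(NamedTuple):
    arch: str
    subarch: Optional[str] = None  # hypothetical field, not part of dpkg


def may_install(system: Target, package: Target) -> bool:
    """Apply the proposed rule: same arch, and matching subarch if one is set."""
    if package.arch != system.arch:
        return False
    # Assumption: packages without a subarch are generic builds.
    if package.subarch is None:
        return True
    return package.subarch == system.subarch
```

For example, `may_install(Target("armel", "vfp"), Target("armel", "vfp"))` would be accepted, while the same package on a system with no subarch set would be refused.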

[Note that the mojo folks "abused" the "arm" port; AFAIK they built binaries for EABI using "arm" as Debian architecture, so they offer .debs which dpkg will allow to install on a Debian arm install despite being another ABI. That was the easiest path to get these interesting archives, but it breaks ABI, so not something to do officially IMO. What's possible however is to rebuild the armel archive with higher CPU requirements, keeping the same ABI.]

Revision history for this message

Loïc Minier (lool) wrote on 2009-02-16:

#28

Emmet, I don't think runtime detection is a good idea for FPU; you would end up reimplementing the hwcap mechanism to select between two flavours of the library IMO -- unless all floating point is already in a dlopen-ed plugin, but that's not really the case for the libs here.

Also, I think we might not split in separate packages for all libs; only the largest ones. For instance a separate libc6-i686 is warranted since libc6 is installed in all chroots / installs and /lib/tls/i686/cmov is 2.6 MB, but libspeex1 ships both /usr/lib/sse2/libspeex.so.1.5.0 and /usr/lib/libspeex.so.1.5.0 because the SSE2 variant is only 116 kB. Note: on Ubuntu systems, libc6-i686 is a dep of ubuntu-minimal anyway.

Revision history for this message

Loïc Minier (lool) wrote on 2009-03-05:

#29

So I counted the number of "float" and "double" words in the source code of the various libraries proposed here with:
egrep -hrwc double\|float . | xargs | sed 's/ / + /g' | bc

that gave:
- pixman: 66 => likely no vfp version
- cairo: 2423 => will have a vfp version
- pango1.0: 683 => will have a vfp version
- gtk+2.0: 1430 => will have a vfp version
- ffmpeg-debian: 1679 => will have a vfp version
- xft: 46 => likely no vfp version
- freetype: 76 => likely no vfp version

This all assumes that benchmarks don't contradict the list above: if benchmarks show a strong gain with vfp, or no gain at all, they override it.
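
For reference, the egrep pipeline above sums, over all files in a tree, the number of lines that contain "double" or "float" as a whole word (grep -c counts matching lines, not individual occurrences). A rough Python equivalent, as a sketch only; the function name is made up, and edge cases such as binary files or symlinks may be handled differently from grep:

```python
import os
import re

# Whole-word match, approximating grep -w for these two keywords.
WORD = re.compile(r"\b(?:double|float)\b")


def count_matching_lines(root):
    """Sum, over every file under root, the lines mentioning double or float."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                with open(path, encoding="utf-8", errors="replace") as f:
                    total += sum(1 for line in f if WORD.search(line))
            except OSError:
                continue  # unreadable file: skip it
    return total
```

Such a count is only a crude proxy for how floating-point-bound a library is, which is why the benchmarks get the final say.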

Note that for ffmpeg-debian the approach might be to build a vfp+neon version using shlibs alternate dependencies and providing separate packages.

Concerning liboil it needs runtime opts, not a new build.

Revision history for this message

Loïc Minier (lool) wrote on 2009-03-27:

#30

Let's discuss opts for karmic here; https://bugs.launchpad.net/ubuntu/+bugs?field.tag=arm-vfp lists the bugs for the VFP optimized libs.

Changed in gcc-4.3:
status:	Confirmed → Won't Fix

Revision history for this message

Loïc Minier (lool) wrote on 2009-04-14:

#31

So for karmic we're aiming at:
-march=armv6 -mtune=cortex-a8 -mfpu=vfp -mfloat-abi=softfp

Changed in gcc-4.3 (Ubuntu):
assignee:	nobody → doko
status:	Confirmed → In Progress

Paul Larson (pwlars) on 2009-06-15

tags:

added: armel

Revision history for this message

Loïc Minier (lool) wrote on 2009-09-08:

#32

This was implemented in karmic gcc-4.4 4.4.1-3ubuntu2

affects:	gcc-4.3 (Ubuntu) → gcc-4.4 (Ubuntu)
Changed in gcc-4.4 (Ubuntu):
status:	In Progress → Fix Released

Affects		Status	Importance	Assigned to	Milestone
	gcc-4.4 (Ubuntu)	Fix Released	High	Matthias Klose
	Jaunty	Won't Fix	High	Unassigned
