armel gcc default optimisations

Bug #303232 reported by Catalin Marinas
Affects           Status        Importance  Assigned to     Milestone
gcc-4.4 (Ubuntu)  Fix Released  High        Matthias Klose
Jaunty            Won't Fix     High        Unassigned

Bug Description

Binary package hint: gcc-4.3

Since the armel Ubuntu port is optimised for ARMv7, it would be good to have gcc generate code equivalent to the flags below by default, so that all applications benefit regardless of whether CFLAGS is set:

-march=armv5t -mtune=cortex-a8 -mfpu=vfp -mfloat-abi=softfp

The above are compliant with the ARM EABI standard and also allow the code to run on ARMv5 hardware with VFP (e.g. ARM926). The generated code would be tuned for Cortex-A8 (ARMv7 architecture). The -mfloat-abi=softfp is required by the ARM EABI so that objects compiled for soft floating point could be linked (statically or dynamically) with hard floating point (VFP) objects.

The current patches from Debian applied to gcc change the default target to ARMv4t (arm9tdmi) in the gcc/config/arm/linux-eabi.h file (used, generally, by gcc/config/arm/arm.c to define the default architecture target and CPU tuning).

Thanks.

Tags: armel
Matthias Klose (doko) wrote : Re: [Bug 303232] [NEW] armel gcc default optimisations

Catalin Marinas wrote:
> The current patches from Debian applied to gcc change the default target
> to ARMv4t (arm9tdmi) in the gcc/config/arm/linux-eabi.h file (used,
> generally, by gcc/config/arm/arm.c to define the default architecture
> target and CPU tuning).

the debian patch is not applied in the ubuntu build.

Loïc Minier (lool)
Changed in gcc-4.3:
importance: Undecided → High
status: New → Triaged
Matthias Klose (doko) wrote :

no, it's not triaged.

 - is -mtune=cortex-a8 our choice? how does it affect other processors?
 - is -march=armv5t needed? afaiu we cannot run anymore on the thecus (Xscale)

Changed in gcc-4.3:
status: Triaged → Incomplete
Catalin Marinas (catalin-marinas) wrote :

The -mtune=cortex-a8 option optimises the generated code for the Cortex-A8 (ARMv7 processor) pipeline without affecting the instruction set used (which is still ARMv5T). The resulting code will be optimal on ARMv7, but there may be a slight performance drop (if any) on other architecture versions.

Without the Debian gcc patches for forcing ARMv4T, the default in gcc I think is ARMv5T anyway.

AFAIK, the Xscale platform is ARMv5T but without the VFP coprocessor (affected by the -mfpu=vfp option). Is Thecus planned to be a supported platform by the upcoming ARM Ubuntu port?

Loïc Minier (lool) wrote :

Catalin, we'd like to support the Thecus N2100 and other XScale IOP32x with the -iop32x kernel flavour. The N2100 has an IOP 80219 which is ARMv5TE and like you I think it lacks an FPU.

Debian uses -mfloat-abi=soft; I understand that using -mfloat-abi=softfp creates binaries which are compatible with Debian's, but there's a performance hit on systems which don't have a FPU as floating point instructions generate a kernel trap to emulate them.

Matthias, what's the current setup?

I don't mind tuning for ARMv7 at all (probably makes little difference), but the FPU question is harder: it's a big hit for systems without a FPU to meet fp instructions, and it's probably a comparable big win for systems with a FPU to use fp instructions instead of full soft emulation.
  Would it be possible to have two libgccs with one doing full software emulation, and another one using the FPU? This would probably allow us to use -mfloat-abi=soft and still benefit from the FPU to some degree on systems having one.

Changed in gcc-4.3:
status: Incomplete → Confirmed
Catalin Marinas (catalin-marinas) wrote :

By default, gcc generates software floating point, so for this particular case no additional command line options are needed.

Using -mfloat-abi=softfp indeed creates binaries compatible with Debian (soft-float) since the function calling convention uses standard registers and stack for passing floating point arguments rather than VFP registers.

The VFP (hard-float) instructions are not emulated by the kernel (only the older FPA instructions are, but they are no longer used by Debian armel), though much of the support for emulation is already in there. Currently, the kernel generates a SIGILL if such a VFP instruction is encountered and the CPU doesn't support it.
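
The difference between the -mfloat-abi values discussed here can be summarised in a short Python sketch. It tabulates two properties: how floating-point arguments are passed, and whether the compiler emits FPU instructions. The table follows GCC's documented behaviour for ARM; the names FLOAT_ABI, link_compatible and needs_fpu are made up for this illustration.

```python
# Illustrative summary of GCC's -mfloat-abi values on ARM.
# The dictionary and helper names are hypothetical; only the flag
# values themselves (soft, softfp, hard) come from GCC.
FLOAT_ABI = {
    # Library calls for FP maths, FP arguments in core registers/stack.
    "soft": {"args": "core", "fpu_instructions": False},
    # FPU instructions, but the same calling convention as "soft".
    "softfp": {"args": "core", "fpu_instructions": True},
    # FPU instructions and FP arguments passed in FPU registers.
    "hard": {"args": "fpu", "fpu_instructions": True},
}


def link_compatible(a, b):
    """Objects can be mixed when they pass FP arguments the same way."""
    return FLOAT_ABI[a]["args"] == FLOAT_ABI[b]["args"]


def needs_fpu(abi):
    """True when the generated code contains FPU instructions."""
    return FLOAT_ABI[abi]["fpu_instructions"]


if __name__ == "__main__":
    for abi in FLOAT_ABI:
        print(abi,
              "| links with soft:", link_compatible(abi, "soft"),
              "| needs FPU:", needs_fpu(abi))
```

Read this way, softfp keeps link compatibility with Debian's soft-float binaries but, unlike soft, produces code that raises SIGILL on a CPU without VFP.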

Loïc Minier (lool) wrote :

Right, so -mfloat-abi=softfp generates binaries which use compatible calling conventions but do require a VFP.

I don't think we want this; instead we should rather optimize libs and programs to select VFP at runtime if available or provide alternate packages for VFP versus non VFP systems. One obvious candidate is libc.

I wonder whether it's possible and useful to build a libgcc which implements soft float computations with the FPU? That would seem like a good thing to do to optimize all soft float calls on systems which have a FPU.

Loïc Minier (lool) wrote :

(Typo s/VFP/FPU on the first line above, sorry.)

Does someone have pointers on the VFP trap handling kernel patches and on the issues with them (I guess there are issues if the patches aren't in the mainline)?

Catalin Marinas (catalin-marinas) wrote : Re: [Bug 303232] Re: armel gcc default optimisations

If -mfpu=vfp is enabled, the compiler will generate VFP instructions in
the asm code directly rather than calls to the libgcc soft-float code.
Even if the libgcc soft-float function could be replaced with the VFP
instructions, you still get an additional branch to those operations and
probably lower performance than complete VFP optimisation. Please note
that I haven't tried this approach, so I can't comment further on the performance.

The libm is a candidate for this optimisation but there are applications
that would themselves benefit from being compiled with VFP. As I
understand, there are difficulties in maintaining two separate variants
for some packages (like OpenOffice).

There are no patches to enable full VFP emulation. AFAIK, the Linux
kernel community weren't keen to get such patches merged into mainline
because of emulation performance reasons.

Just for clarification, the kernel currently needs to trap the VFP
exceptions for 3 reasons:

1. The VFP is disabled at a context switch and the first encounter of a
VFP instruction triggers an undefined exception. At this point, the
kernel saves the VFP registers for the old application and loads those
for the new one.

2. On VFPv2 (found on ARMv5 and ARMv6 processors), the hardware does not
implement full IEEE754 compliance. There are corner cases (like
denormalised numbers) which aren't supported by hardware but the kernel
traps and emulates them. Note that the VFPv3 (on ARMv7 processors) has
full support for the IEEE754 compliance and there is no need for
additional kernel emulation (though the code is still there since it's
harmless).

3. Floating point operations exception (e.g. divide by zero) if the user
application enabled them and the hardware supports them.

Because of point 2 above, we have almost all the emulation code needed.
The only missing part is the emulation of the VFP registers (the kernel
currently reads the hardware ones) and maybe some optimisation to read
ahead and emulate more than one instruction in the exception handler
before returning to user.
--
IMPORTANT NOTICE: The contents of this email and any attachments are confidential and may also be privileged. If you are not the intended recipient, please notify the sender immediately and do not disclose the contents to any other person, use it for any purpose, or store or copy the information in any medium. Thank you.

Catalin Marinas (catalin-marinas) wrote :

Please ignore the legal disclaimer at the end of the previous post (I used the wrong SMTP server and it got appended automatically). Thanks.

Loïc Minier (lool) wrote :

[Ack, I do understand there's a performance hit when going via libgcc, but I was hoping this could be a good compromise between not using hardware FPU at all and generating traps on systems without FPU.]

Concerning libm as a candidate for opts: libm is in libc which is definitely a candidate for an optimized version; for example we have an i686 version for the i386 arch which provides an alternate libm: http://packages.ubuntu.com/jaunty/i386/libc6-i686/filelist

Thanks for the details on supported traps; allow me to recap to make sure I understand this correctly:

The mainline kernel (we're primarily looking at 2.6.28 in jaunty) supports some traps out of the box, but only:
- to handle VFP state saving and restoring across context switches (I guess this is to avoid saving/restoring the FPU regs when the switched to context doesn't need that)
- to emulate some corner cases not supported by all hardware FPUs
- on math errors
and we don't need to patch anything to get the above when we're using VFP instructions in programs. Perhaps we need to turn on some kernel CONFIG_s in our armel flavours though?

Otherwise, the mainline kernel can't emulate the base set of VFP instructions which ARMv5 and v6 cores with a FPU support on systems which lack a FPU (such as the Xscale example); we could perhaps patch this in, but it's not wanted in the mainline because it's too slow. I take it that the response from Linux developers is that we're supposed to not use VFP instructions in userspace on systems without a FPU?

I didn't find any flags in the gcc man page about VFPv2 or v3: I guess one can only tell gcc to generate instructions for the full VFP set or not at all.

Michael Casadevall (mcasadevall) wrote :

I did some performance benchmarks with pybench on an ARMv7 board. To prevent any third party processes from interfering, the board was running Ubuntu in single user mode, and the stock glibc. I'll run another set of benchmarks with our glibc tuned with the proposed flags, and also do another set of benchmarks on my NSLU2 (XScale/ARMv5) to see what sorta performance hit we're going to see.

With our current CFLAGS:
* Round 1 done in 62.165 seconds.
* Round 2 done in 62.229 seconds.
* Round 3 done in 61.994 seconds.
* Round 4 done in 61.616 seconds.
* Round 5 done in 62.371 seconds.
* Round 6 done in 63.191 seconds.
* Round 7 done in 62.180 seconds.
* Round 8 done in 62.165 seconds.
* Round 9 done in 61.906 seconds.
* Round 10 done in 62.977 seconds.

Test                             minimum  average  operation  overhead
-------------------------------------------------------------------------------
          BuiltinFunctionCalls:   1302ms   1317ms     2.58us   3.114ms
           BuiltinMethodLookup:    871ms    871ms     0.83us   3.645ms
                 CompareFloats:    837ms    974ms     0.81us   4.171ms
         CompareFloatsIntegers:    963ms   1052ms     1.17us   3.112ms
               CompareIntegers:    657ms    659ms     0.37us   6.304ms
        CompareInternedStrings:    667ms    670ms     0.45us  16.016ms
                  CompareLongs:    564ms    566ms     0.54us   3.641ms
                CompareStrings:    550ms    556ms     0.56us  10.780ms
                CompareUnicode:    548ms    551ms     0.74us   8.163ms
    ComplexPythonFunctionCalls:   1699ms   1750ms     8.75us   5.260ms
                 ConcatStrings:   1017ms   1099ms     2.20us   6.975ms
                 ConcatUnicode:   4336ms   4720ms    15.73us   4.897ms
               CreateInstances:   1454ms   1463ms    13.06us   4.242ms
            CreateNewInstances:   1267ms   1283ms    15.28us   3.627ms
       CreateStringsWithConcat:    742ms    750ms     0.75us  10.567ms
       CreateUnicodeWithConcat:    772ms    778ms     1.94us   4.172ms
                  DictCreation:    695ms    695ms     1.74us   4.172ms
             DictWithFloatKeys:   1081ms   1086ms     1.21us   7.899ms
           DictWithIntegerKeys:    767ms    771ms     0.64us  10.566ms
            DictWithStringKeys:    668ms    672ms     0.56us  10.567ms
                      ForLoops:    671ms    672ms    26.88us   0.685ms
                    IfThenElse:    533ms    534ms     0.40us   7.898ms
                   ListSlicing:    557ms    563ms    40.19us   0.642ms
                NestedForLoops:    764ms    771ms     0.51us   0.247ms
      NestedListComprehensions:   1129ms   1148ms    95.63us   1.012ms
          NormalClassAttribute:    761ms    771ms     0.64us   5.271ms
       NormalInstanceAttribute:    697ms    697ms     0.58us   5.278ms
           PythonFunctionCalls:    693ms    698ms     2.11us   3.125ms
             PythonMethodCalls:   1568ms   1574ms     7.00us   1.566ms
                     Recursion:   1037ms   1043ms    20.86us   5.249ms
                  SecondImport:   1380ms   1382ms    13.82us   2.052ms
           SecondPackageImport:   1408ms   1411ms    14.11us   2.052ms
         SecondSu...

Dave Martin (dave-martin-arm) wrote :

Interesting results, but I guess we would not expect to see much of a performance increase by building python with hardware floating-point. It's an interpreter, so I expect that floating-point instructions are going to account for only a small proportion of the code executed, even when the python program being executed by the interpreter is doing some floating-point number-crunching.

If you can post the XScale results then that would be interesting: this would give a good indication of how the performance of general-purpose code will be affected when running code built with the proposed options on ARM9 platforms.

To get a better idea of the effect of building specific components for VFP, a package which does a lot of heavy floating point internally would be more interesting--- if we can obtain benchmarks for some backend libraries such as the following, this may give us a better overall idea what performance improvements would be possible by building VFP-enabled versions of some components:
    freetype
    pango
    cairo
    media backends (not so sure here, but possibly libraries such as ffmpeg, vorbis, fftw)
    spidermonkey JavaScript engine (http://ftp.mozilla.org/pub/mozilla.org/js/)

I haven't looked into these in detail yet.

Catalin Marinas (catalin-marinas) wrote :

On Thu, 2009-01-22 at 15:58 +0000, Loïc Minier wrote:
> [Ack, I do understand there's a performance hit when going via libgcc,
> but I was hoping this could be a good compromise between not using
> hardware FPU at all and generating traps on systems without FPU.]

I haven't looked but I think there is more work here as the softfloat
functions in libgcc would need to be rewritten to use VFP instructions.
Compiling with -mfpu=vfp simply ignores those functions.

> Concerning libm as a candidate for opts: libm is in libc which is
> definitely a candidate for an optimized version; for example we have
> an i686 version for the i386 arch which provides an alternate libm:
> http://packages.ubuntu.com/jaunty/i386/libc6-i686/filelist

Could we not have the same for other packages that would benefit from
VFP? I'm not too familiar with the version conflicts in the .deb
packages but, for example, could we have something like a dummy
openoffice package which depends on either openoffice-softfp or
openoffice-vfp?

> The mainline kernel (we're primarily looking at 2.6.28 in jaunty)
> supports some traps out of the box, but only:
> - to handle VFP state saving and restoring across context switches (I
> guess this is to avoid saving/restoring the FPU regs when the switched
> to context doesn't need that)
> - to emulate some corner cases not supported by all hardware FPUs
> - on math errors
> and we don't need to patch anything to get the above when we're using
> VFP instructions in programs. Perhaps we need to turn on some kernel
> CONFIG_s in our armel flavours though?

You need to have CONFIG_VFP on. This could always be on by default as
the kernel checks for the hardware feature before enabling the
corresponding hwcap bit.
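
As a minimal sketch of how userspace could check for this at run time, the Python snippet below looks for "vfp" in the Features line that the ARM kernel prints in /proc/cpuinfo. The helper names and the two sample strings are made up for illustration and are not copied from real boards.

```python
def cpu_features(cpuinfo_text):
    """Return the feature names from the first 'Features' line, if any."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "Features":
            return set(value.split())
    return set()


def has_vfp(cpuinfo_text):
    """True when the kernel advertises the VFP hardware capability."""
    return "vfp" in cpu_features(cpuinfo_text)


# Hypothetical /proc/cpuinfo excerpts, for illustration only.
SAMPLE_WITH_VFP = (
    "Processor\t: ARMv7 Processor rev 1 (v7l)\n"
    "Features\t: swp half thumb fastmult vfp edsp neon vfpv3\n"
)
SAMPLE_WITHOUT_VFP = (
    "Processor\t: XScale-80219 rev 0 (v5l)\n"
    "Features\t: swp half thumb fastmult edsp\n"
)

if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as f:
            text = f.read()
    except OSError:
        # Fall back to the made-up sample when /proc is unavailable.
        text = SAMPLE_WITH_VFP
    print("VFP available:", has_vfp(text))
```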

> Otherwise, the mainline kernel can't emulate the base set of VFP
> instructions which ARMv5 and v6 cores with a FPU support on systems
> which lack a FPU (such as the Xscale example); we could perhaps patch
> this in, but it's not wanted in the mainline because it's too slow. I
> take it that the response from Linux developers is that we're supposed
> to not use VFP instructions in userspace on systems without a FPU?

Probably that's the reason as we now have a fast softfloat
implementation. This is the relevant thread (though not many opinions):

http://thread.gmane.org/gmane.linux.ports.arm.kernel/47219

> I didn't find any flags in the gcc man page about VFPv2 or v3: I guess
> one can only tell gcc to generate instructions for the full VFP set or
> not at all.

By default, with -mfpu=vfp, the compiler generates VFPv2 code and I
think that's what should be used as VFPv3 only comes with ARMv7.

Colin Watson (cjwatson) wrote :

On Wed, Jan 28, 2009 at 11:17:27AM -0000, Catalin Marinas wrote:
> On Thu, 2009-01-22 at 15:58 +0000, Loïc Minier wrote:
> > Concerning libm as a candidate for opts: libm is in libc which is
> > definitely a candidate for an optimized version; for example we have
> > an i686 version for the i386 arch which provides an alternate libm:
> > http://packages.ubuntu.com/jaunty/i386/libc6-i686/filelist
>
> Could we not have the same for other packages that would benefit from
> VFP? I'm not too familiar with the version conflicts in the .deb
> packages but, for example, could we have something like a dummy
> openoffice package which depends on either openoffice-softfp or
> openoffice-vfp?

OpenOffice.org's packaging is already very complex, consisting of 57
binary packages with complex dependencies both among its own packages
and between them and others. I think it is unlikely that adding FP
variants of many of those with different binary package names would
successfully meet anyone's requirements.

Even elsewhere, this approach is fraught with practical problems. We can
take this approach successfully with glibc, but glibc is a very special
case. We can cope with having libc6-i686, libc6-sparcv9, and the like
because libc6 is always installed and essentially never changes its
SONAME, and so the installer can include special-case code to make sure
that optimised variants of it are installed; we can easily extend that
code for ARM, and I expect it will make sense to do so. This isn't
something we can generalise in any straightforward way, though. The
package management system used on the system after installation (i.e.
dpkg, apt, and friends) doesn't know how to identify and select packages
appropriate for variants of a particular architecture, and so there
would be no way to guarantee that the proper optimised packages get
installed when a user asks for the dummy name.

(Attempting to add this feature would run into a number of interesting
and long-standing roadblocks such as the lack of versioned Provides, and
I suspect would start to approach "multiarch", which is a huge and
somewhat technically-controversial project.)

This problem applies to separate binary packages, but not to hwcap
optimisations in general, of course. As long as hwcap works (whether in
the runtime linker or a similar implementation elsewhere, e.g.
GStreamer), we can just put the optimised objects in the same package as
the unoptimised ones, as long as there aren't too many variants and you
don't mind the binary package size growing a bit. Since I understand
that we aren't likely to be supporting normal CD images for ARM, this
isn't as much of a problem as it would be for other architectures.

(OOo itself might still be a problem due to sheer size and the utter
horribleness of its build system, but I assume you meant it more as an
example than as an initial target!)

Matthias Klose (doko) wrote :

Catalin Marinas wrote:
>> I didn't find any flags in the gcc man page about VFPv2 or v3: I guess
>> one can only tell gcc to generate instructions for the full VFP set or
>> not at all.
>
> By default, with -mfpu=vfp, the compiler generates VFPv2 code and I
> think that's what should be used as VFPv3 only comes with ARMv7.

Is there a known GCC multilib config for non-vfp/vfp?

Matthias Klose (doko) wrote :

I compared pybench and pystone benchmark results. Compared to the current armv5 default, building with the proposed flags, together with a glibc built with the proposed flags, gives a 4-5% improvement in the benchmarks. Note that these benchmarks don't use floating point.

Dave Martin (dave-martin-arm) wrote :

Which board was this on?

Matthias Klose (doko) wrote :

Dave Martin wrote:
> Which board was this on?

my python benchmarks were done on the babbage board.

Dave Martin (dave-martin-arm) wrote :

This seems like a plausible amount of improvement for Cortex-A8 tuning on a non-floating-point-intensive package and library stack.

Did you get a chance to try interoperable non-vfp code on ARM9/XScale?
(i.e., -march=armv5 -mtune=cortex-a8... I'm assuming -msoft-float is present by default).

It would be worth seeing whether the Cortex-A8 tuning has any appreciable impact on performance on these platforms.

Catalin Marinas (catalin-marinas) wrote :

Thinking a bit more, I think the VFP emulation for older hardware could
be a viable option to have all the packages compiled with -mfpu=vfp. I
can't estimate the performance difference between softfloat and VFP
emulation but, if the performance on Xscale is visibly affected, we
still have the option of two builds of glibc (and other libraries) where
the default one is VFP and the additional one is softfloat. Since we would
build with -mfloat-abi=softfp, mixing VFP and softfloat packages is still possible.

As I said, all the VFP floating point operations are already implemented
in the kernel. There are around 20 additional instructions (actually
variants of a fewer number) for transferring data between VFP registers
and memory or ARM registers which aren't handled plus the VFP register
bank to be accessed from memory rather than the coprocessor registers.
As a rough estimate, I think it would take me around 2 weeks to get this
done.

Does this sound feasible? Thanks.

Loïc Minier (lool) wrote :

Catalin, thanks for your proposal; we discussed this option this week since we were meeting with other Canonical people, notably Matthias and Colin.

On the idea to handle all VFP instructions as kernel traps on systems without a FPU, we think it will be too slow. In my experience, even alignment traps are a huge hit on performance when they happen, and FPU instructions would appear in a lot of random places. So while it would technically "work", it wouldn't be working at a decent level.

We're currently trying to assemble good benchmarks to decide about another solution, and we will get back to you as soon as we manage to have enough data. (We're using the mojo prebuilt hardy archives as a base to compare the impact of various opts in various combinations.)

However, the kernel traps would be an excellent safety net in case some binary with VFP instructions ends up on such systems, and it might be a good idea to have this support; but it won't be enough by itself to turn VFP instructions generation on in jaunty. :-/

Dave Martin (dave-martin-arm) wrote :

I've been taking a look at package lists and the dependencies of key apps and libraries. The following look worth investigating with respect to VFP optimisations:

I haven't tidied this list up fully, so there is a mix of binary and source package names... I've done no benchmarking or profiling on these yet, so I can't guarantee that they're all relevant... anyway, here's my current list:

Desktop/Universal (applicable to pretty much any install with a GUI)
 gcc libraries (libgcc* libstdc++* libgcj* etc.)
 libc6 (i.e., glibc)
 cairo2
 freetype6
 pango
 pixman (uncertain how much floating-point this contains)
 libjpeg* (uncertain how much floating-point this contains)
 libpng* (uncertain how much floating-point this contains)
 libgtk+* (uncertain how much floating-point this contains)
 libgnome* (uncertain how much floating-point this contains)
 libqt* (uncertain how much floating-point this contains; is this used outside kubuntu?)
 libgl1-mesa-* (Probably we're expecting OEM's own GL acceleration to be used instead?)
 libglu1-mesa (Probably we're expecting OEM's own GL acceleration to be used instead?)
 xorg (probably not a priority; most of the potential benefit will come from SoC manufacturers' X drivers. Only certain extensions such as RENDER, COMPOSITE and GLX, and the core font rendering which is not much used these days, are likely to make much use of floating-point)
 xscreensaver* (Not directly useful, but may impact user perception of performance. There is a separate, related issue: xscreensaver can be bad for battery-powered devices... I notice that xscreensaver-gl is in ubuntu-netbook-remix by default, which can be bad news especially where 3D acceleration is partial or absent. See also threads such as https://bugs.launchpad.net/ubuntu/+source/xscreensaver/+bug/174191)

Web (applicable to all standard installs)
 firefox
 xulrunner
 openjdk-6*

Generic acceleration libraries (desirable)
 liboil (number-crunching acceleration library used by pulseaudio, gstreamer etc.)

Media (desirable)
 gstreamer --- I don't know the architecture of gstreamer, so I can't comment on all the various subordinate packages here
 gstreamer*-ffmpeg
 gstreamer*-mp3
 pulseaudio
 libsamplerate
 libvorbis
 libtheora ? (Don't know how much theora video is really out there)

Office / apps (desirable)
 openoffice.org
 gimp

I've intentionally disregarded whether packages are libraries or not, in order to produce a more complete picture. However, many of these packages _are_ libraries, so there is probably significant benefit to be had even if only library packages are built for VFP.

Cheers

Dave Martin (dave-martin-arm) wrote :

Argh, tabs get entered in the thread as   --- if anyone finds that last post hard to read, let me know and I'll repost it in a more readable way.

Dave Martin (dave-martin-arm) wrote :

I was also having a chat with Catalin about an alternative way of making VFP optimisations available.

I observed that Handhelds Mojo addresses architecture variants by having special package servers which are placed at the start of the apt sources list. See http://mojo.handhelds.org/hasty-armv6el-vfp/ for details of how it's done.

I haven't looked into the details, but it's feasible to share most packages between a specialised package server location and the standard server.

This approach could be used to make VFP-optimised versions of application packages available as well as libraries without the need to install all variants on each platform and choose at run-time.

Does anyone have a view on this?

Emmet Hikory (persia) wrote :

Given the nature of the Ubuntu archives, it's probably easier to have differently named packages (e.g. libpangomm-1.4-1-vfp) distributed in the same archive, and maybe have a metapackage that pulled all the -vfp variants to ease the user experience than to try to configure multiple parallel repositories. I'm a fan of runtime detection (as done, for example, in liboil), but where that ends up using too much disk space, separate packages also work.

Loïc Minier (lool) wrote :

Dave, first, thanks for your list of libs; I'd like to refine the list here.
  I think Matthias looked at building multiple libgcc but that didn't seem to work too well; it might not be interesting either as libgcc is meant to be a software implementation and hence doesn't have code to use a hardware FPU.
  Concerning libstdc++ this might be an interesting and easier target albeit I don't know how much fp math it does; it might face the same issue as libgcc though, I'm not sure.

I don't think libgnome* are going to be too interesting.

I don't know how useful qt and mesa would be, I think these can be better optimized by implementing custom backends for the GPU one targets. I've seen recent activity on the beagleboard list to enable an SGX backend for Qt for instance, and I know the intrepid PowerVR/Poulsbo drivers use some new mesa path.

Everything font related is likely to be very floating-point bound, so pango and freetype are high on my list.

I'm a bit worried for Pango and Gtk+ as these dlopen some plugins and I don't know whether hwcap can be used there; the plugins aren't too important for most of the use of the libs though.

cairo/pixman: would certainly benefit as well; I think these can be tuned to use a DSP or GPU-specific opts.

libpng/libjpeg: no strong opinion.

xorg: no idea what in xorg would benefit from it exactly; perhaps Xft?

Loïc Minier (lool) wrote :

Dave, on the use of multiple sources.list entries:
Note that http://mojo.handhelds.org/hasty-armv6el-vfp/ only cascades deb-src entries. We discussed the idea of another archive, but this is quite a stretch to set up and maintain; of course you need infrastructure (disk space, buildds, bandwidth etc.), but it also needs maintenance (watching the builds) and separate QA, and is quite confusing to end users. Ideally, you also want a mechanism to prevent mismatches of packages on target systems. The way to do this would be to add subarch support to dpkg/apt and enforce subarch equality when installing binary packages (the subarch of the binary package needs to match the subarch of the system).

[Note that the mojo folks "abused" the "arm" port; AFAIK they built binaries for EABI using "arm" as Debian architecture, so they offer .debs which dpkg will allow to install on a Debian arm install despite being another ABI. That was the easiest path to get these interesting archives, but it breaks ABI, so not something to do officially IMO. What's possible however is to rebuild the armel archive with higher CPU requirements, keeping the same ABI.]

Loïc Minier (lool) wrote :

Emmet, I don't think runtime detection is a good idea for FPU; you would end up reimplementing the hwcap mechanism to select between two flavours of the library IMO -- unless all floating point is already in a dlopen-ed plugin, but that's not really the case for the libs here.

Also, I think we might not split in separate packages for all libs; only the largest ones. For instance a separate libc6-i686 is warranted since libc6 is installed in all chroots / installs and /lib/tls/i686/cmov is 2.6 MB, but libspeex1 ships both /usr/lib/sse2/libspeex.so.1.5.0 and /usr/lib/libspeex.so.1.5.0 because the SSE2 variant is only 116 kB. Note: on Ubuntu systems, libc6-i686 is a dep of ubuntu-minimal anyway.
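
The idea of shipping both variants in one package and picking one by hardware capability can be sketched in Python as below. This is a simplified model built around the libspeex paths above, not the runtime linker's actual search algorithm; the function names are made up, and a "vfp" subdirectory for ARM is only an assumption by analogy with "sse2".

```python
import os
import posixpath


def candidate_paths(libdir, soname, hwcaps, preferred=("sse2",)):
    """List paths to try, optimised variants first, generic one last.

    `preferred` names the capability subdirectories to consider; a
    subdirectory is only tried when the CPU reports that capability.
    """
    paths = [posixpath.join(libdir, cap, soname)
             for cap in preferred if cap in hwcaps]
    paths.append(posixpath.join(libdir, soname))
    return paths


def pick_library(libdir, soname, hwcaps, preferred=("sse2",),
                 exists=os.path.exists):
    """Return the first candidate that exists, or None."""
    for path in candidate_paths(libdir, soname, hwcaps, preferred):
        if exists(path):
            return path
    return None


if __name__ == "__main__":
    # With SSE2 reported, the optimised copy is tried before the generic one.
    print(candidate_paths("/usr/lib", "libspeex.so.1.5.0", {"sse2"}))
    # Hypothetical ARM equivalent, assuming a "vfp" subdirectory.
    print(candidate_paths("/usr/lib", "libspeex.so.1.5.0", {"vfp"},
                          preferred=("vfp",)))
```

A system without the capability simply falls through to the generic library, which is why this scheme needs no support from dpkg or apt.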

Loïc Minier (lool) wrote :

So I counted the number of "float" and "double" words in the source code of the various libraries proposed here with:
egrep -hrwc double\|float . | xargs | sed 's/ / + /g' | bc

that gave:
- pixman: 66 => likely no vfp version
- cairo: 2423 => will have a vfp version
- pango1.0: 683 => will have a vfp version
- gtk+2.0: 1430 => will have a vfp version
- ffmpeg-debian: 1679 => will have a vfp version
- xft: 46 => likely no vfp version
- freetype: 76 => likely no vfp version

this is all assuming that benchmarks don't contradict the above assumption (if benchmarks show a strong progress with vfp or no progress with vfp, it overrides the above list)
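
For anyone wanting to repeat the count without the shell pipeline, here is a rough Python equivalent. Like egrep -c, it counts lines containing "double" or "float" as a whole word rather than individual occurrences; Python's \b is only an approximation of grep's -w, and the function name is made up for this sketch.

```python
import os
import re

# Whole-word match for either keyword, approximating `egrep -w`.
WORD = re.compile(r"\b(?:double|float)\b")


def count_matching_lines(root):
    """Sum, over all files under root, the lines mentioning float/double."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                # errors="replace" keeps binary or oddly encoded files
                # from aborting the scan.
                with open(path, errors="replace") as f:
                    total += sum(1 for line in f if WORD.search(line))
            except OSError:
                # Unreadable files are skipped.
                continue
    return total


if __name__ == "__main__":
    print(count_matching_lines("."))
```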

Note that for ffmpeg-debian the approach might be to build a vfp+neon version using shlibs alternate dependencies and providing separate packages.

Concerning liboil it needs runtime opts, not a new build.

Loïc Minier (lool) wrote :

Let's discuss opts for karmic here; https://bugs.launchpad.net/ubuntu/+bugs?field.tag=arm-vfp lists the bugs for the VFP optimized libs.

Changed in gcc-4.3:
status: Confirmed → Won't Fix
Loïc Minier (lool) wrote :

So for karmic we're aiming at:
-march=armv6 -mtune=cortex-a8 -mfpu=vfp -mfloat-abi=softfp

Changed in gcc-4.3 (Ubuntu):
assignee: nobody → doko
status: Confirmed → In Progress
Paul Larson (pwlars)
tags: added: armel
Loïc Minier (lool) wrote :

This was implemented in karmic gcc-4.4 4.4.1-3ubuntu2

affects: gcc-4.3 (Ubuntu) → gcc-4.4 (Ubuntu)
Changed in gcc-4.4 (Ubuntu):
status: In Progress → Fix Released