armel gcc default optimisations
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
gcc-4.4 (Ubuntu) | Fix Released | High | Matthias Klose |
Jaunty | Won't Fix | High | Unassigned |
Bug Description
Binary package hint: gcc-4.3
For the armel Ubuntu port, since it is optimised for ARMv7, it would be good to have gcc automatically generate code equivalent to the flags below, so that all applications benefit regardless of whether CFLAGS is set:
-march=armv5t -mtune=cortex-a8 -mfpu=vfp -mfloat-abi=softfp
The above are compliant with the ARM EABI standard and also allow the code to run on ARMv5 hardware with VFP (e.g. ARM926). The generated code would be tuned for Cortex-A8 (ARMv7 architecture). The -mfloat-abi=softfp is required by the ARM EABI so that objects compiled for soft floating point could be linked (statically or dynamically) with hard floating point (VFP) objects.
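As a rough illustration of what "equivalent to the flags below" means in practice, here is a minimal Python sketch that spells out the explicit command line a build would otherwise need. The flag descriptions paraphrase the bug description; the helper name `gcc_command` and the file names are hypothetical.

```python
# Flags proposed in the bug description, with the effect attributed to each.
PROPOSED_FLAGS = {
    "-march=armv5t": "instruction set stays ARMv5T, so ARMv5 hardware can run the code",
    "-mtune=cortex-a8": "instruction scheduling is tuned for the Cortex-A8 pipeline",
    "-mfpu=vfp": "floating point is done with VFP instructions",
    "-mfloat-abi=softfp": "soft-float calling convention, so objects link with soft-float code",
}


def gcc_command(source, output, flags=PROPOSED_FLAGS):
    """Build the argv for compiling one file with the proposed flags (hypothetical helper)."""
    # Iterating a dict yields its keys in insertion order, i.e. the flags themselves.
    return ["gcc", *flags, "-c", source, "-o", output]


if __name__ == "__main__":
    print(" ".join(gcc_command("example.c", "example.o")))
    for flag, effect in PROPOSED_FLAGS.items():
        print(f"{flag}: {effect}")
```

If gcc used these as its built-in defaults, a plain `gcc -c example.c -o example.o` would behave like the longer command printed here.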
The current patches from Debian applied to gcc change the default target to ARMv4t (arm9tdmi) in the gcc/config/
Thanks.
Matthias Klose (doko) wrote : Re: [Bug 303232] [NEW] armel gcc default optimisations | #1 |
Changed in gcc-4.3: | |
importance: | Undecided → High |
status: | New → Triaged |
Matthias Klose (doko) wrote : | #2 |
no, it's not triaged.
- is -mtune=cortex-a8 our choice? how does it affect other processors?
- is -march=armv5t needed? afaiu we cannot run anymore on the thecus (Xscale)
Changed in gcc-4.3: | |
status: | Triaged → Incomplete |
Catalin Marinas (catalin-marinas) wrote : | #3 |
The -mtune=cortex-a8 option will optimise the generated code for the Cortex-A8 (ARMv7 processor) pipeline without affecting the instruction set used (which is still ARMv5T). The resulting code will be optimal on ARMv7, but there may be a slight performance drop (if any) on other architecture versions.
Without the Debian gcc patches for forcing ARMv4T, the default in gcc I think is ARMv5T anyway.
AFAIK, the Xscale platform is ARMv5T but without the VFP coprocessor (affected by the -mfpu=vfp option). Is Thecus planned to be a supported platform by the upcoming ARM Ubuntu port?
Loïc Minier (lool) wrote : | #4 |
Catalin, we'd like to support the Thecus N2100 and other XScale IOP32x with the -iop32x kernel flavour. The N2100 has an IOP 80219 which is ARMv5TE and like you I think it lacks an FPU.
Debian uses -mfloat-abi=soft; I understand that using -mfloat-abi=softfp creates binaries which are compatible with Debian's, but there's a performance hit on systems which don't have a FPU as floating point instructions generate a kernel trap to emulate them.
Matthias, what's the current setup?
I don't mind tuning for ARMv7 at all (probably makes little difference), but the FPU question is harder: it's a big hit for systems without a FPU to meet fp instructions, and it's probably a comparable big win for systems with a FPU to use fp instructions instead of full soft emulation.
Would it be possible to have two libgccs with one doing full software emulation, and another one using the FPU? This would probably allow us to use -mfloat-abi=soft and still benefit from the FPU to some degree on systems having one.
Changed in gcc-4.3: | |
status: | Incomplete → Confirmed |
Catalin Marinas (catalin-marinas) wrote : | #5 |
By default, gcc generates software floating point, so for this particular case no additional command line options are needed.
Using -mfloat-abi=softfp indeed creates binaries compatible with Debian (soft-float) since the function calling convention uses standard registers and stack for passing floating point arguments rather than VFP registers.
The VFP (hard-float) instructions are not emulated by the kernel (only the older FPA ones are, but they are no longer used by Debian armel), though much of the support for emulation is already in there. Currently, the kernel generates a SIGILL if such a VFP instruction is encountered and the CPU doesn't support it.
Loïc Minier (lool) wrote : | #6 |
Right, so -mfloat-abi=softfp generates binaries which use compatible calling conventions but do require a VFP.
I don't think we want this; instead we should rather optimize libs and programs to select VFP at runtime if available or provide alternate packages for VFP versus non VFP systems. One obvious candidate is libc.
I wonder whether it's possible and useful to build a libgcc which implements soft float computations with the FPU? That would seem like a good thing to do to optimize all soft float calls on systems which have a FPU.
Loïc Minier (lool) wrote : | #7 |
(Typo s/VFP/FPU on the first line above, sorry.)
Does someone have pointers on the VFP trap handling kernel patches and on the issues with them (I guess there are issues if the patches aren't in the mainline)?
Catalin Marinas (catalin-marinas) wrote : Re: [Bug 303232] Re: armel gcc default optimisations | #8 |
If -mfpu=vfp is enabled, the compiler will generate VFP instructions in
the asm code directly rather than calls to the libgcc soft-float code.
Even if the libgcc soft-float function could be replaced with the VFP
instructions, you still get an additional branch to those operations and
probably lower performance than complete VFP optimisation. Please note
that I haven't tried this approach, so I can't comment more on the performance.
The libm is a candidate for this optimisation but there are applications
that would themselves benefit from being compiled with VFP. As I
understand, there are difficulties in maintaining two separate variants
for some packages (like OpenOffice).
There are no patches to enable full VFP emulation. AFAIK, the Linux
kernel community weren't keen to get such patches merged into mainline
because of emulation performance reasons.
Just for clarification, the kernel currently needs to trap the VFP
exceptions for 3 reasons:
1. The VFP is disabled at a context switch and the first encounter of a
VFP instruction triggers an undefined exception. At this point, the
kernel saves the VFP registers for the old application and loads those
for the new one.
2. On VFPv2 (found on ARMv5 and ARMv6 processors), the hardware does not
implement full IEEE754 compliance. There are corner cases (like
denormalised numbers) which aren't supported by hardware but the kernel
traps and emulates them. Note that the VFPv3 (on ARMv7 processors) has
full support for the IEEE754 compliance and there is no need for
additional kernel emulation (though the code is still there since it's
harmless).
3. Floating point operation exceptions (e.g. divide by zero), if the user
application enabled them and the hardware supports them.
Because of point 2 above, we have almost all the emulation code needed.
The only missing part is the emulation of the VFP registers (the kernel
currently reads the hardware ones) and maybe some optimisation to read
ahead and emulate more than one instruction in the exception handler
before returning to user.
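Point 1 above (lazy VFP switching) can be sketched as a toy Python model. This is only an illustration of the sequence described in this comment, not kernel code: the class and method names are invented, and a single float stands in for the whole VFP register bank.

```python
class LazyVfpModel:
    """Toy model of lazy VFP context switching; names are hypothetical."""

    def __init__(self):
        self.current = None    # task currently running
        self.owner = None      # task whose values sit in the "hardware" bank
        self.enabled = False   # can the current task use VFP without trapping?
        self.hw_reg = 0.0      # stand-in for the VFP register bank
        self.saved = {}        # per-task saved copy of the register bank
        self.traps = 0         # undefined-instruction traps taken so far

    def context_switch(self, task):
        # A context switch only disables the VFP; nothing is saved or loaded yet.
        self.current = task
        self.enabled = False

    def vfp_add(self, value):
        # The first VFP instruction after a switch traps into the "kernel".
        if not self.enabled:
            self._trap()
        self.hw_reg += value
        return self.hw_reg

    def _trap(self):
        self.traps += 1
        if self.owner != self.current:
            # Save the old task's registers, then load those of the new one.
            if self.owner is not None:
                self.saved[self.owner] = self.hw_reg
            self.hw_reg = self.saved.get(self.current, 0.0)
            self.owner = self.current
        self.enabled = True


if __name__ == "__main__":
    m = LazyVfpModel()
    m.context_switch("A")
    m.vfp_add(1.5)
    m.context_switch("B")   # B never touches the VFP, so nothing is saved
    m.context_switch("A")
    m.vfp_add(0.5)
    print(m.traps, m.saved)
```

The point of the design is visible in the demo: a task that never executes a VFP instruction costs no register save or restore at all.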
--
IMPORTANT NOTICE: The contents of this email and any attachments are confidential and may also be privileged. If you are not the intended recipient, please notify the sender immediately and do not disclose the contents to any other person, use it for any purpose, or store or copy the information in any medium. Thank you.
Catalin Marinas (catalin-marinas) wrote : | #9 |
Please ignore the legal disclaimer at the end of the previous post (I used the wrong SMTP server and it got appended automatically). Thanks.
Loïc Minier (lool) wrote : | #10 |
[Ack, I do understand there's a performance hit when going via libgcc, but I was hoping this could be a good compromise between not using hardware FPU at all and generating traps on systems without FPU.]
Concerning libm as a candidate for opts: libm is in libc which is definitely a candidate for an optimized version; for example we have an i686 version for the i386 arch which provides an alternate libm: http://
Thanks for the details on supported traps; allow me to recap to make sure I understand this correctly:
The mainline kernel (we're primarily looking at 2.6.28 in jaunty) supports some traps out of the box, but only:
- to handle VFP state saving and restoring across context switches (I guess this is to avoid saving/restoring the FPU regs when the switched to context doesn't need that)
- to emulate some corner cases not supported by all hardware FPUs
- on math errors
and we don't need to patch anything to get the above when we're using VFP instructions in programs. Perhaps we need to turn on some kernel CONFIG_s in our armel flavours though?
Otherwise, the mainline kernel can't emulate the base set of VFP instructions which ARMv5 and v6 cores with a FPU support on systems which lack a FPU (such as the Xscale example); we could perhaps patch this in, but it's not wanted in the mainline because it's too slow. I take it that the response from Linux developers is that we're supposed to not use VFP instructions in userspace on systems without a FPU?
I didn't find any flags in the gcc man page about VFPv2 or v3: I guess one can only tell gcc to generate instructions for the full VFP set or not at all.
Michael Casadevall (mcasadevall) wrote : | #11 |
I did some performance benchmarks with pybench on an ARMv7 board. To prevent any third party processes from interfering, the board was running Ubuntu in single user mode, and the stock glibc. I'll run another set of benchmarks with our glibc tuned with the proposed flags, and also do another set of benchmarks on my NSLU2 (XScale/ARMv5) to see what sorta performance hit we're going to see.
With our current CFLAGS:
* Round 1 done in 62.165 seconds.
* Round 2 done in 62.229 seconds.
* Round 3 done in 61.994 seconds.
* Round 4 done in 61.616 seconds.
* Round 5 done in 62.371 seconds.
* Round 6 done in 63.191 seconds.
* Round 7 done in 62.180 seconds.
* Round 8 done in 62.165 seconds.
* Round 9 done in 61.906 seconds.
* Round 10 done in 62.977 seconds.
Test minimum average operation overhead
-------
ComplexPyth
NestedLis
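To make later comparisons against these ten rounds easier, the timings above can be summarised with a short Python sketch. The numbers are copied from the rounds listed in this comment; the helper name `summarise` is hypothetical.

```python
from statistics import mean, stdev

# Round times in seconds, copied from the pybench output above.
ROUND_SECONDS = [62.165, 62.229, 61.994, 61.616, 62.371,
                 63.191, 62.180, 62.165, 61.906, 62.977]


def summarise(times):
    """Return basic summary statistics for a list of round times."""
    return {
        "mean": mean(times),
        "stdev": stdev(times),   # sample standard deviation
        "min": min(times),
        "max": max(times),
    }


summary = summarise(ROUND_SECONDS)

if __name__ == "__main__":
    print("mean %.3f s, stdev %.3f s, min %.3f s, max %.3f s" % (
        summary["mean"], summary["stdev"], summary["min"], summary["max"]))
```

The mean works out to about 62.28 s, with the fastest and slowest rounds roughly 1.6 s apart, so differences much smaller than that between two builds would be hard to distinguish from noise.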
Dave Martin (dave-martin-arm) wrote : | #12 |
Interesting results, but I guess we would not expect to see much of a performance increase by building python with hardware floating-point. It's an interpreter, so I expect that floating-point instructions are going to account for only a small proportion of the code executed, even when the python program being executed by the interpreter is doing some floating-point number-crunching.
If you can post the XScale results then that would be interesting: this would give a good indication of how the performance of general-purpose code will be affected when running code built with the proposed options on ARM9 platforms.
To get a better idea of the effect of building specific components for VFP, a package which does a lot of heavy floating point internally would be more interesting--- if we can obtain benchmarks for some backend libraries such as the following, this may give us a better overall idea what performance improvements would be possible by building VFP-enabled versions of some components:
freetype
pango
cairo
media backends (not so sure here, but possibly libraries such as ffmpeg, vorbis, fftw)
spidermonkey JavaScript engine (http://
I haven't looked into these in detail yet.
Catalin Marinas (catalin-marinas) wrote : | #13 |
On Thu, 2009-01-22 at 15:58 +0000, Loïc Minier wrote:
> [Ack, I do understand there's a performance hit when going via libgcc,
> but I was hoping this could be a good compromise between not using
> hardware FPU at all and generating traps on systems without FPU.]
I haven't looked but I think there is more work here as the softfloat
functions in libgcc would need to be rewritten to use VFP instructions.
Compiling with -mfpu=vfp simply ignores those functions.
> Concerning libm as a candidate for opts: libm is in libc which is
> definitely a candidate for an optimized version; for example we have
> an i686 version for the i386 arch which provides an alternate libm:
> http://
Could we not have the same for other packages that would benefit from
VFP? I'm not too familiar with the version conflicts in the .deb
packages but, for example, could we have something like a dummy
openoffice package which depends on either openoffice-softfp or
openoffice-vfp?
> The mainline kernel (we're primarily looking at 2.6.28 in jaunty)
> supports some traps out of the box, but only:
> - to handle VFP state saving and restoring across context switches (I
> guess this is to avoid saving/restoring the FPU regs when the switched
> to context doesn't need that)
> - to emulate some corner cases not supported by all hardware FPUs
> - on math errors
> and we don't need to patch anything to get the above when we're using
> VFP instructions in programs. Perhaps we need to turn on some kernel
> CONFIG_s in our armel flavours though?
You need to have CONFIG_VFP on. This could always be on by default as
the kernel checks for the hardware feature before enabling the
corresponding hwcap bit.
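From userspace, one simple way to see whether the kernel reported VFP is to look at the "Features" line of /proc/cpuinfo on ARM. A minimal Python sketch follows; the function names are hypothetical and the sample text is illustrative, not output captured from a real board.

```python
def cpu_features(cpuinfo_text):
    """Return the names on the first 'Features' line, or an empty set."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "Features":
            return set(value.split())
    return set()


def has_vfp(cpuinfo_text):
    """True if the feature list contains 'vfp'."""
    return "vfp" in cpu_features(cpuinfo_text)


# Illustrative input only (made up for this sketch).
SAMPLE = "Processor\t: example\nFeatures\t: swp half thumb fastmult vfp edsp\n"

if __name__ == "__main__":
    print(has_vfp(SAMPLE))
```

On a real system the same check would be `has_vfp(open("/proc/cpuinfo").read())`.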
> Otherwise, the mainline kernel can't emulate the base set of VFP
> instructions which ARMv5 and v6 cores with a FPU support on systems
> which lack a FPU (such as the Xscale example); we could perhaps patch
> this in, but it's not wanted in the mainline because it's too slow. I
> take it that the response from Linux developers is that we're supposed
> to not use VFP instructions in userspace on systems without a FPU?
Probably that's the reason as we now have a fast softfloat
implementation. This is the relevant thread (though not many opinions):
http://
> I didn't find any flags in the gcc man page about VFPv2 or v3: I guess
> one can only tell gcc to generate instructions for the full VFP set or
> not at all.
By default, with -mfpu=vfp, the compiler generates VFPv2 code and I
think that's what should be used as VFPv3 only comes with ARMv7.
Colin Watson (cjwatson) wrote : | #14 |
On Wed, Jan 28, 2009 at 11:17:27AM -0000, Catalin Marinas wrote:
> On Thu, 2009-01-22 at 15:58 +0000, Loïc Minier wrote:
> > Concerning libm as a candidate for opts: libm is in libc which is
> > definitely a candidate for an optimized version; for example we have
> > an i686 version for the i386 arch which provides an alternate libm:
> > http://
>
> Could we not have the same for other packages that would benefit from
> VFP? I'm not too familiar with the version conflicts in the .deb
> packages but, for example, could we have something like a dummy
> openoffice package which depends on either openoffice-softfp or
> openoffice-vfp?
OpenOffice.org's packaging is already very complex, consisting of 57
binary packages with complex dependencies both among its own packages
and between them and others. I think it is unlikely that adding FP
variants of many of those with different binary package names would
successfully meet anyone's requirements.
Even elsewhere, this approach is fraught with practical problems. We can
take this approach successfully with glibc, but glibc is a very special
case. We can cope with having libc6-i686, libc6-sparcv9, and the like
because libc6 is always installed and essentially never changes its
SONAME, and so the installer can include special-case code to make sure
that optimised variants of it are installed; we can easily extend that
code for ARM, and I expect it will make sense to do so. This isn't
something we can generalise in any straightforward way, though. The
package management system used on the system after installation (i.e.
dpkg, apt, and friends) doesn't know how to identify and select packages
appropriate for variants of a particular architecture, and so there
would be no way to guarantee that the proper optimised packages get
installed when a user asks for the dummy name.
(Attempting to add this feature would run into a number of interesting
and long-standing roadblocks such as the lack of versioned Provides, and
I suspect would start to approach "multiarch", which is a huge and
somewhat technically-
This problem applies to separate binary packages, but not to hwcap
optimisations in general, of course. As long as hwcap works (whether in
the runtime linker or a similar implementation elsewhere, e.g.
GStreamer), we can just put the optimised objects in the same package as
the unoptimised ones, as long as there aren't too many variants and you
don't mind the binary package size growing a bit. Since I understand
that we aren't likely to be supporting normal CD images for ARM, this
isn't as much of a problem as it would be for other architectures.
(OOo itself might still be a problem due to sheer size and the utter
horribleness of its build system, but I assume you meant it more as an
example than as an initial target!)
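The "optimised objects in the same package" idea can be pictured with a small Python sketch of hwcap-style lookup: try a capability-specific subdirectory first, then fall back to the generic library. This is a simplification and not how the runtime linker is actually implemented; the `vfp` directory name, the library name and the function name are all assumptions made for illustration.

```python
import os
import posixpath


def pick_library(base_dir, soname, hwcaps, exists=os.path.exists):
    """Return the most specific existing variant of a library (hypothetical helper).

    hwcaps lists the capabilities of the running CPU, most preferred first.
    """
    for cap in hwcaps:
        candidate = posixpath.join(base_dir, cap, soname)
        if exists(candidate):
            return candidate
    # No optimised variant shipped for this CPU: use the generic build.
    return posixpath.join(base_dir, soname)


if __name__ == "__main__":
    # Pretend file system: one package shipping both variants.
    shipped = {"/usr/lib/libexample.so.1", "/usr/lib/vfp/libexample.so.1"}
    print(pick_library("/usr/lib", "libexample.so.1", ["vfp"], exists=shipped.__contains__))
    print(pick_library("/usr/lib", "libexample.so.1", [], exists=shipped.__contains__))
```

Because both files come from one binary package, the package manager never has to know which variant a given machine needs; the cost is the larger package size mentioned above.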
Matthias Klose (doko) wrote : | #15 |
Catalin Marinas wrote:
>> I didn't find any flags in the gcc man page about VFPv2 or v3: I guess
>> one can only tell gcc to generate instructions for the full VFP set or
>> not at all.
>
> By default, with -mfpu=vfp, the compiler generates VFPv2 code and I
> think that's what should be used as VFPv3 only comes with ARMv7.
Is there a known GCC multilib config for non-vfp/vfp?
Matthias Klose (doko) wrote : | #16 |
I compared pybench and pystone benchmark results. Relative to the armv5 default, building with the proposed flags, together with a glibc built with the proposed flags, gives a 4-5% improvement in the benchmarks. Note that these benchmarks don't use floating point.
Dave Martin (dave-martin-arm) wrote : | #17 |
Which board was this on?
Matthias Klose (doko) wrote : | #18 |
Dave Martin wrote:
> Which board was this on?
my python benchmarks were done on the babbage board.
Dave Martin (dave-martin-arm) wrote : | #19 |
This seems like a plausible amount of improvement for Cortex-A8 tuning on a non-floating-
Did you get a chance to try interoperable non-vfp code on ARM9/XScale?
(i.e., -march=armv5 -mtune=cortex-a8... I'm assuming -msoft-float is present by default).
It would be worth seeing whether the Cortex-A8 tuning has any appreciable impact on performance on these platforms.
Catalin Marinas (catalin-marinas) wrote : | #20 |
Thinking a bit more, I think the VFP emulation for older hardware could
be a viable option to have all the packages compiled with -mfpu=vfp. I
can't estimate the performance difference between softfloat and VFP
emulation but, if the performance on Xscale is visibly affected, we
still have the option of two builds of glibc (and other libraries) where
the default one is VFP and the additional one is softfloat. Since
-mfloat-abi=softfp keeps the soft-float calling convention, mixing VFP
and softfloat packages is still possible.
As I said, all the VFP floating point operations are already implemented
in the kernel. There are around 20 additional instructions (actually
variants of a fewer number) for transferring data between VFP registers
and memory or ARM registers which aren't handled plus the VFP register
bank to be accessed from memory rather than the coprocessor registers.
As a rough estimate, I think it would take me around 2 weeks to get this
done.
Does this sound feasible? Thanks.
Loïc Minier (lool) wrote : | #21 |
Catalin, thanks for your proposal; we discussed this option this week since we were meeting with other Canonical people, notably Matthias and Colin.
On the idea to handle all VFP instructions as kernel traps on systems without a FPU, we think it will be too slow. In my experience, even alignment traps are a huge hit on performance when they happen, and FPU instructions would appear in a lot of random places. So while it would technically "work", it wouldn't be working at a decent level.
We're currently trying to assemble good benchmarks to decide about another solution, and we will get back to you as soon as we manage to have enough data. (We're using the mojo prebuilt hardy archives as a base to compare the impact of various opts in various combinations.)
However, the kernel traps would be an excellent safety net in case some binary with VFP instructions ends up on such systems, and it might be a good idea to have this support; but it won't be enough by itself to turn VFP instructions generation on in jaunty. :-/
Dave Martin (dave-martin-arm) wrote : | #22 |
I've been taking a look at package lists and the dependencies of key apps and libraries. The following look worth investigating with respect to VFP optimisations:
I haven't tidied this list up fully, so there is a mix of binary and source package names... I've done no benchmarking or profiling on these yet, so I can't guarantee that they're all relevant... anyway, here's my current list:
Desktop/Universal (applicable to pretty much any install with a GUI)
gcc libraries (libgcc* libstdc++* libgcj* etc.)
libc6 (i.e., glibc)
cairo2
freetype6
pango
pixman (uncertain how much floating-point this contains)
libjpeg* (uncertain how much floating-point this contains)
libpng* (uncertain how much floating-point this contains)
libgtk+* (uncertain how much floating-point this contains)
libgnome* (uncertain how much floating-point this contains)
libqt* (uncertain how much floating-point this contains; is this used outside kubuntu?)
libgl1-mesa-* (Probably we're expecting OEM's own GL acceleration to be used instead?)
libglu1-mesa (Probably we're expecting OEM's own GL acceleration to be used instead?)
xorg (probably not a priority; most of the potential benefit will come from SoC manufacturers' X drivers. Only certain extensions such as RENDER, COMPOSITE and GLX, and the core font rendering which is not much used these days, are likely to make much use of floating-point)
xscreensaver* (Not directly useful, but may impact user perception of performance. There is a separate, related issue: xscreensaver can be bad for battery-powered devices... I notice that xscreensaver-gl is in ubuntu-
Web (applicable to all standard installs)
firefox
xulrunner
openjdk-6*
Generic acceleration libraries (desirable)
liboil (number-crunching acceleration library used by pulseaudio, gstreamer etc.)
Media (desirable)
gstreamer --- I don't know the architecture of gstreamer, so I can't comment on all the various subordinate packages here
gstreamer*-ffmpeg
gstreamer*-mp3
pulseaudio
libsamplerate
libvorbis
libtheora ? (Don't know how much theora video is really out there)
Office / apps (desirable)
openoffice.org
gimp
I've intentionally disregarded whether packages are libraries or not, in order to produce a more complete picture. However, many of these packages _are_ libraries, so there is probably significant benefit to be had even if only library packages are built for VFP.
Cheers
Dave Martin (dave-martin-arm) wrote : | #23 |
Argh, tabs get entered in the thread as --- if anyone finds that last post hard to read, let me know and I'll repost it in a more readable way.
Dave Martin (dave-martin-arm) wrote : | #24 |
I was also having a chat with Catalin about an alternative way of making VFP optimisations available.
I observed that Handhelds Mojo addresses architecture variants by having special package servers which are placed at the start of the apt sources list. See http://
I haven't looked into the details, but it's feasible to share most packages between a specialised package server location and the standard server.
This approach could be used to make VFP-optimised versions of application packages available as well as libraries without the need to install all variants on each platform and choose at run-time.
Does anyone have a view on this?
Emmet Hikory (persia) wrote : | #25 |
Given the nature of the Ubuntu archives, it's probably easier to have differently named packages (e.g. libpangomm-
Loïc Minier (lool) wrote : | #26 |
Dave, first, thanks for your list of libs; I'd like to refine the list here.
I think Matthias looked at building multiple libgcc but that didn't seem to work too well; it might not be interesting either as libgcc is meant to be a software implementation and hence doesn't have code to use a hardware FPU.
Concerning libstdc++ this might be an interesting and easier target albeit I don't know how much fp math it does; it might face the same issue as libgcc though, I'm not sure.
I don't think libgnome* are going to be too interesting.
I don't know how useful qt and mesa would be, I think these can be better optimized by implementing custom backends for the GPU one targets. I've seen recent activity on the beagleboard list to enable a SGX backend for Qt for instance, and I know the intrepid PowerVR/Poulsbo drivers use some new mesa path.
Everything font related is likely to be very floating-point bound, so pango and freetype are high on my list.
I'm a bit worried for Pango and Gtk+ as these dlopen some plugins and I don't know whether hwcap can be used there; the plugins aren't too important for most of the use of the libs though.
cairo/pixman: would certainly benefit as well; I think these can be tuned to use a DSP or GPU-specific opts.
libpng/libjpeg: no strong opinion.
xorg: no idea what in xorg would benefit from it exactly; perhaps Xft?
Loïc Minier (lool) wrote : | #27 |
Dave, on the use of multiple sources.list entries:
Note that http://
[Note that the mojo folks "abused" the "arm" port; AFAIK they built binaries for EABI using "arm" as Debian architecture, so they offer .debs which dpkg will allow to install on a Debian arm install despite being another ABI. That was the easiest path to get these interesting archives, but it breaks ABI, so not something to do officially IMO. What's possible however is to rebuild the armel archive with higher CPU requirements, keeping the same ABI.]
Loïc Minier (lool) wrote : | #28 |
Emmet, I don't think runtime detection is a good idea for FPU; you would end up reimplementing the hwcap mechanism to select between two flavours of the library IMO -- unless all floating point is already in a dlopen-ed plugin, but that's not really the case for the libs here.
Also, I think we might not split in separate packages for all libs; only the largest ones. For instance a separate libc6-i686 is warranted since libc6 is installed in all chroots / installs and /lib/tls/i686/cmov is 2.6 MB, but libspeex1 ships both /usr/lib/
Loïc Minier (lool) wrote : | #29 |
So I counted the number of "float" and "double" words in the source code of the various libraries proposed here with:
egrep -hrwc double\|float . | xargs | sed 's/ / + /g' | bc
that gave:
- pixman: 66 => likely no vfp version
- cairo: 2423 => will have a vfp version
- pango1.0: 683 => will have a vfp version
- gtk+2.0: 1430 => will have a vfp version
- ffmpeg-debian: 1679 => will have a vfp version
- xft: 46 => likely no vfp version
- freetype: 76 => likely no vfp version
this is all assuming that benchmarks don't contradict the above assumption (if benchmarks show a strong progress with vfp or no progress with vfp, it overrides the above list)
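For anyone who wants to repeat the count on other packages, here is a rough Python equivalent of the egrep pipeline above: it counts the lines that contain "float" or "double" as whole words, summed over every file under a directory. It is a sketch, not a drop-in replacement (grep's handling of binary files and encodings differs), and the function names are hypothetical.

```python
import os
import re

# Whole-word match, similar in spirit to grep -w.
FP_WORD = re.compile(r"\b(?:double|float)\b")


def count_fp_lines(lines):
    """Count lines mentioning float or double as a whole word."""
    return sum(1 for line in lines if FP_WORD.search(line))


def count_fp_lines_in_tree(root):
    """Sum count_fp_lines over every regular file below root."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Undecodable bytes are replaced rather than raising an error.
            with open(path, errors="replace") as handle:
                total += count_fp_lines(handle)
    return total


if __name__ == "__main__":
    print(count_fp_lines_in_tree("."))
```

Like the original command, this counts matching lines rather than individual occurrences, so a line declaring two doubles counts once. The counts are only a crude hint about where floating point is used, which is why benchmarks take precedence.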
Note that for ffmpeg-debian the approach might be to build a vfp+neon version using shlibs alternate dependencies and providing separate packages.
Concerning liboil it needs runtime opts, not a new build.
Loïc Minier (lool) wrote : | #30 |
Let's discuss opts for karmic here; https:/
Changed in gcc-4.3: | |
status: | Confirmed → Won't Fix |
Loïc Minier (lool) wrote : | #31 |
So for karmic we're aiming at:
-march=armv6 -mtune=cortex-a8 -mfpu=vfp -mfloat-abi=softfp
Changed in gcc-4.3 (Ubuntu): | |
assignee: | nobody → doko |
status: | Confirmed → In Progress |
tags: | added: armel |
Loïc Minier (lool) wrote : | #32 |
This was implemented in karmic gcc-4.4 4.4.1-3ubuntu2
affects: | gcc-4.3 (Ubuntu) → gcc-4.4 (Ubuntu) |
Changed in gcc-4.4 (Ubuntu): | |
status: | In Progress → Fix Released |
Catalin Marinas wrote:
> The current patches from Debian applied to gcc change the default target
> to ARMv4t (arm9tdmi) in the gcc/config/arm/linux-eabi.h file (used,
> generally, by gcc/config/arm/arm.c to define the default architecture
> target and CPU tuning).
the debian patch is not applied in the ubuntu build.