Add Denormal prevention in engine code

Bug #1404401 reported by Daniel Schürmann
Affects: Mixxx
Status: Fix Released
Importance: Medium
Assigned to: Daniel Schürmann

Bug Description

It looks like we need to add it, because processing denormals may cost 100 times more CPU.

Daniel Schürmann (daschuer) wrote :
Changed in mixxx:
milestone: none → 1.12.0
Owen Williams (ywwg) wrote :

I removed our denormal code earlier this year because it caused horrible audible spikes in the EQ filters as the wave approached zero, due to waveform discontinuity. Partly this was because our denormal code kicked in way outside the range where this CPU penalty actually applies. When RJ and I looked closer, we found that we already use a gcc flag to disable ultra-small values anyway, so the compiler is handling denormals for us.

Owen Williams (ywwg) wrote :

If you do want to pursue this change, I would require a battery of tests that show that none of our filters are thrown off by the change in the sound wave. But I would urge you to confirm that it's actually a problem by demonstrating the CPU impact first before writing a bunch of new code.

Daniel Schürmann (daschuer) wrote :

Can you recall which gcc flag it is? I cannot find it.

http://frozenfractal.com/blog/2010/3/11/optimization-story/ says:

"
Luckily, there is an instruction to change the CPU’s behaviour: instead of storing denormalized values, these can simply be flushed to 0. Unfortunately, there is no standard library function for this. On Visual C++, we can do this:

_controlfp(_MCW_DN, _DN_FLUSH);

On gcc, we need some inline assembly. This was my first x86 assembly ever:

int mxcsr;
__asm__("stmxcsr %0" : "=m"(mxcsr) : :);
mxcsr |= (1 << 15); // set bit 15: flush-to-zero mode
__asm__("ldmxcsr %0" : : "m"(mxcsr) :);
"

RJ Skerry-Ryan (rryan) wrote :

Hm, I had forgotten about that Owen:

The flag is -ffast-math -- according to the GCC docs here it enables flush-to-zero on some platforms though it isn't specific about which ones:
https://gcc.gnu.org/wiki/FloatingPointMath

RJ Skerry-Ryan (rryan) wrote :

On my MBP (x86_64) adding FTZ in an experiment didn't have much effect.

1) Turn off waveforms
2) Load song, wait for analysis to complete
3) adjust EQs to non-neutral
4) play song
5) wait
6) record 40 seconds of base
7) record 40 seconds of experiment

RJ Skerry-Ryan (rryan) wrote :

Patch for the above experiment.

Owen Williams (ywwg) wrote :

Here's the other documentation I was using -- ffast-math + sse flags: http://carlh.net/plugins/denormals.php

Daniel Schürmann (daschuer) wrote :

The filter code suffers from denormals.
I have checked this by adding:

    if (!std::isnormal(buf[3])) {
        qDebug() << "denormal";
    }
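
Note that !std::isnormal(x) is also true for zero, infinity and NaN, so the check above can fire on plain silence as well as on denormals. A minimal sketch of a narrower check using std::fpclassify follows; the helper name isSubnormal is made up for illustration and is not part of Mixxx.

```cpp
#include <cmath>

// Hypothetical helper: true only for subnormal (denormal) values.
// Zero, infinity and NaN all return false here, unlike !std::isnormal().
inline bool isSubnormal(float value) {
    return std::fpclassify(value) == FP_SUBNORMAL;
}
```

Used in place of the condition above, this would only log when the buffer value is a genuine denormal.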

Changed in mixxx:
status: New → Confirmed
Daniel Schürmann (daschuer) wrote :

@RJ: Did you use an SSE3 build?
It is possible that your code does nothing on default builds,
see:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=21408

Owen Williams (ywwg) wrote :

The last time someone tried to "fix" denormals, it caused major audio artifacts. If we're going to try to do this again, we're going to need:
* proof that denormals are causing detectable CPU hits
* proof that the fix does not cause audio artifacts.

This work should wait for post-release.

Changed in mixxx:
milestone: 1.12.0 → 2.1
Daniel Schürmann (daschuer) wrote :

@Owen: What did you do last time to flush denormals?

I have just read that denormals are flushed by default in the Mac OS audio callback.
So it can't be that bad.
@RJ, can you verify that?

Owen Williams (ywwg) wrote :

The previous fix checked whether the absolute value was below .000001, or some similarly insufficiently small number, and set the value to 0 if so. It was a really bad fix that was not tested well.
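
For scale, assuming 32-bit float samples: the smallest normal float is about 1.18e-38, so a cutoff of .000001 (roughly -120 dB relative to a full scale of 1.0) zeroes values some 32 orders of magnitude above the range where the denormal penalty applies. Below is a hedged sketch of a clamp that only touches true subnormals. The name flushSubnormal is hypothetical, and whether such a clamp avoids the artifacts described above is not shown here.

```cpp
#include <cmath>
#include <limits>

// Hypothetical helper: replace only true subnormals with zero.
// std::numeric_limits<float>::min() is the smallest positive normal float,
// so every value of normal magnitude passes through unchanged.
inline float flushSubnormal(float value) {
    return std::fabs(value) < std::numeric_limits<float>::min() ? 0.0f : value;
}
```

A value such as 1e-7f, which the old cutoff would have zeroed, is left alone by this version.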

Daniel Schürmann (daschuer) wrote :

It looks like we can rely on this:
http://carlh.net/plugins/denormals.php

@RJ, what does it mean for the default Mixxx optimization flags?

Daniel Schürmann (daschuer) wrote :

My results:

Thread model: posix
gcc version 4.8.2 (Ubuntu 4.8.2-19ubuntu1)
model name : Intel(R) Core(TM) i5-3317U CPU @ 1.70GHz
subnormal 12.0642 times slower.
-ffast-math enabled subnormal 1.14059 times slower.
SSE enabled FTZ=0 DAZ=0 subnormal 12.0405 times slower.
SSE enabled FTZ=0 DAZ=1 subnormal 0.999144 times slower.
SSE enabled FTZ=1 DAZ=0 subnormal 12.0839 times slower.
SSE enabled FTZ=0 DAZ=0 -ffast-math enabled subnormal 1.00153 times slower.
SSE enabled FTZ=0 DAZ=1 -ffast-math enabled subnormal 1.01593 times slower.
SSE enabled FTZ=1 DAZ=0 -ffast-math enabled subnormal 1.0017 times slower.
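
The program that produced these ratios is not included in the thread. As a rough sketch of how such a ratio can be measured, the following times the same recursion once with a normal and once with a subnormal input. The function names, the recursion and the iteration count are assumptions for illustration, and absolute numbers will differ by CPU and compiler flags.

```cpp
#include <chrono>

// Runs a simple feedback recursion whose state settles near 2 * input and
// returns the elapsed time in seconds. The volatile state keeps the compiler
// from optimizing the loop away.
inline double timeRecursion(float input, int iterations) {
    volatile float state = input;
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i) {
        state = state * 0.5f + input;
    }
    const std::chrono::duration<double> elapsed =
            std::chrono::steady_clock::now() - start;
    return elapsed.count();
}

// A ratio above 1 means the subnormal run was slower than the normal run.
inline double subnormalSlowdown(int iterations) {
    const double normalTime = timeRecursion(1.0f, iterations);
    // 1e-40f is below the smallest normal float, so the state stays subnormal
    // unless FTZ/DAZ or equivalent flags are active.
    const double subnormalTime = timeRecursion(1e-40f, iterations);
    return subnormalTime / normalTime;
}
```

Calling subnormalSlowdown with a few million iterations, once per combination of build flags and MXCSR settings, would give a table of the same shape as the one above.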

Daniel Schürmann (daschuer) wrote :

Thread model: posix
gcc version 4.6.3 (Ubuntu/Linaro 4.6.3-1ubuntu5)
model name : Intel(R) Core(TM) i5 CPU M 560 @ 2.67GHz
subnormal 12.0694 times slower.
-ffast-math enabled subnormal 1.01273 times slower.
SSE enabled FTZ=0 DAZ=0 subnormal 11.8485 times slower.
SSE enabled FTZ=0 DAZ=1 subnormal 0.991461 times slower.
SSE enabled FTZ=1 DAZ=0 subnormal 13.0207 times slower.
SSE enabled FTZ=0 DAZ=0 -ffast-math enabled subnormal 0.96152 times slower.
SSE enabled FTZ=0 DAZ=1 -ffast-math enabled subnormal 1.00095 times slower.
SSE enabled FTZ=1 DAZ=0 -ffast-math enabled subnormal 1.04673 times slower.

Daniel Schürmann (daschuer) wrote :

On the same device, but in a virtual 32-bit OS:

Thread model: posix
gcc version 4.3.1 20080507 (prerelease) [gcc-4_3-branch revision 135036] (SUSE Linux)
model name : Intel(R) Core(TM) i5 CPU M 560 @ 2.67GHz
subnormal 23.8905 times slower.
-ffast-math enabled subnormal 24.4369 times slower.
SSE enabled FTZ=0 DAZ=0 subnormal 10.5012 times slower.
SSE enabled FTZ=0 DAZ=1 subnormal 0.958384 times slower.
SSE enabled FTZ=1 DAZ=0 subnormal 11.4803 times slower.
SSE enabled FTZ=0 DAZ=0 -ffast-math enabled subnormal 0.998909 times slower.
SSE enabled FTZ=0 DAZ=1 -ffast-math enabled subnormal 0.990018 times slower.
SSE enabled FTZ=1 DAZ=0 -ffast-math enabled subnormal 1.00707 times slower.

Owen Williams (ywwg) wrote :

I'm not quite sure what all the acronyms mean. Would you suggest a change in our build flags?

Daniel Schürmann (daschuer) wrote :

Probably yes. I am still missing a test on my 32-bit Atom netbook to prove it, but I think we already have enough data for a conclusion.

1.) Since our filters have an infinite impulse response, they will produce denormals. I have proved it with a test.
2.) Because of the -ffast-math flag, Mixxx 64-bit builds have no penalty from denormals. I have proved this on my devices, and the column at http://carlh.net/plugins/denormals.php is green for 64-bit CPUs with -ffast-math only.
3.) There is a performance penalty on 32-bit Mixxx builds even though the -ffast-math flag is set. We need to enable SSE and set the DAZ flag to get the same benefit as the 64-bit build.

I do not know how large a share of the entire CPU time in the audio callback this is (I will test it later), but since we have a solution for 64-bit, we should solve the issue for 32-bit as well.
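
For reference, FTZ (flush-to-zero) and DAZ (denormals-are-zero) are bits 15 and 6 of the SSE control register MXCSR. A minimal sketch of switching both on through the _mm_getcsr / _mm_setcsr intrinsics follows. The helper names are hypothetical and this is not the patch attached to this bug.

```cpp
#include <cstdint>

#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>  // _mm_getcsr, _mm_setcsr
#define SKETCH_HAVE_SSE 1
#else
#define SKETCH_HAVE_SSE 0
#endif

// DAZ: subnormal inputs are treated as zero.
constexpr std::uint32_t kMxcsrDaz = 1u << 6;
// FTZ: subnormal results are replaced with zero.
constexpr std::uint32_t kMxcsrFtz = 1u << 15;

// Pure helper: the given MXCSR value with DAZ and FTZ switched on.
constexpr std::uint32_t withDenormalsDisabled(std::uint32_t mxcsr) {
    return mxcsr | kMxcsrDaz | kMxcsrFtz;
}

// Applies the setting to the calling thread only (MXCSR is per thread).
// Returns false when compiled without SSE support.
// Caution: on a CPU that lacks DAZ, writing that bit raises a fault.
inline bool disableDenormalsOnThisThread() {
#if SKETCH_HAVE_SSE
    _mm_setcsr(withDenormalsDisabled(_mm_getcsr()));
    return true;
#else
    return false;
#endif
}
```

Because the register is per thread, such a call would have to run on the thread that executes the audio callback, not just once at startup.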

Daniel Schürmann (daschuer) wrote :

Thread model: posix
gcc version 4.8.2 (Ubuntu 4.8.2-19ubuntu1)
model name : Intel(R) Atom(TM) CPU N270 @ 1.60GHz
subnormal 45.0023 times slower.
-ffast-math enabled subnormal 37.7372 times slower.
SSE enabled FTZ=0 DAZ=0 subnormal 19.1181 times slower.
SSE enabled FTZ=0 DAZ=1 subnormal 0.998239 times slower.
SSE enabled FTZ=1 DAZ=0 subnormal 18.8671 times slower.
SSE enabled FTZ=0 DAZ=0 -ffast-math enabled subnormal 0.999303 times slower.
SSE enabled FTZ=0 DAZ=1 -ffast-math enabled subnormal 0.999885 times slower.
SSE enabled FTZ=1 DAZ=0 -ffast-math enabled subnormal 0.998566 times slower.

Daniel Schürmann (daschuer) wrote :

Test results from my Atom Notebook

Build with default settings

Debug [Main]: Stat("LinkwitzRiley8EQEffect","count=2393,sum=7.27838e+09ns,average=3.04153e+06ns,min=910102ns,max=5.17381e+07ns,variance=5.35149e+13ns^2,stddev=7.31539e+06ns")
Debug [Main]: Stat("EngineMaster::process_duration","count=2678,average=3.44708e+06ns,min=207010ns,max=6.8616e+07ns,variance=5.4347e+13ns^2,stddev=7.37204e+06ns")

Build with scons -j2 optimize=2

Debug [Main]: Stat("LinkwitzRiley8EQEffect","count=2200,sum=3.39283e+09ns,average=1.5422e+06ns,min=699111ns,max=2.511e+07ns,variance=7.1921e+12ns^2,stddev=2.68181e+06ns")
Debug [Main]: Stat("EngineMaster::process_duration","count=2448,average=2.35673e+06ns,min=188362ns,max=2.64581e+07ns,variance=8.66118e+12ns^2,stddev=2.94299e+06ns")

The average time for the EQ is nearly half that of the non-SSE version.
Interestingly, the max value is also only about twice as high without SSE, not 20 times as we might expect from the denormal calculations.

Conclusion:
SSE brings a BIG benefit on 32-bit builds.
This should be the default for source builds.

For binary distributions, we should strongly consider dropping Pentium 3 support,
... or offer SSE and non-SSE builds.

It might be a problem for the Linux distros to drop Pentium 3 :-/

Owen Williams (ywwg) wrote :

I would have no problem with dropping Pentium 3. Even "old" netbooks are still going to have an Atom, a Celeron, or some other CPU more modern than a Pentium 3.

Owen Williams (ywwg) wrote :

Thanks for doing this research, Daniel!

Daniel Schürmann (daschuer) wrote :

It looks like DAZ is standard on armhf builds ...

https://gcc.gnu.org/onlinedocs/gcc/ARM-Options.html

"
If the selected floating-point hardware includes the NEON extension (e.g. -mfpu=‘neon’), note that floating-point operations are not generated by GCC's auto-vectorization pass unless -funsafe-math-optimizations is also specified. This is because NEON hardware does not fully implement the IEEE 754 standard for floating-point arithmetic (in particular denormal values are treated as zero), so the use of NEON instructions may lead to a loss of precision.
"

http://stackoverflow.com/questions/7346521/subnormal-ieee-754-floating-point-numbers-support-on-ios-arm-devices-iphone-4

Daniel Schürmann (daschuer) wrote :

This is the result on the same hardware as above using "optimize=2" and just -O2

Debug [Main]: Stat("LinkwitzRiley8EQEffect","count=4418,sum=2.54514e+10ns,average=5.76084e+06ns,min=743041ns,max=3.95725e+07ns,variance=8.41122e+13ns^2,stddev=9.17127e+06ns")
Debug [Main]: Stat("EngineMaster::process_duration","count=4727,average=6.27221e+06ns,min=201701ns,max=4.46743e+07ns,variance=8.15258e+13ns^2,stddev=9.02916e+06ns")

The filter and engine code take ~3 times longer.
Conclusion: it is a good idea to use -O3 + -funroll-loops.

Daniel Schürmann (daschuer) wrote :

And yes, we need RJ's patch.

I get heavy load if I play a track in one deck using the Linkwitz-Riley EQ and turn the gain to zero.
With the patch, there is no load change when turning it to zero.
I can see similar results on an i5 notebook (x64) with small audio buffers.

Enabling DAZ helps. I do not have a clue why this happens on an SSE2 build. According to the test above, it should not happen ...
So there seems to be another issue.

Debug [Main]: =====================================
Debug [Main]: BASE STATS
Debug [Main]: =====================================
Debug [Main]: Stat("AnalyserQueue process","count=1")
Debug [Main]: Stat("CachingReaderWorker [Channel1]","count=574")
Debug [Main]: Stat("CachingReaderWorker [Channel2]","count=1")
Debug [Main]: Stat("CachingReaderWorker [PreviewDeck1]","count=1")
Debug [Main]: Stat("CachingReaderWorker [Sampler1]","count=1")
Debug [Main]: Stat("CachingReaderWorker [Sampler2]","count=1")
Debug [Main]: Stat("CachingReaderWorker [Sampler3]","count=1")
Debug [Main]: Stat("CachingReaderWorker [Sampler4]","count=1")
Debug [Main]: Stat("EngineBuffer::process_pauselock","count=2481,sum=1.97225e+08ns,average=79494.2ns,min=30311ns,max=1.09239e+06ns,variance=3.04507e+09ns^2,stddev=55182.2ns")
Debug [Main]: Stat("EngineMaster::mixChannels_0active","count=7443,sum=4.85571e+07ns,average=6523.86ns,min=3282ns,max=730540ns,variance=1.4778e+08ns^2,stddev=12156.5ns")
Debug [Main]: Stat("EngineMaster::mixChannels_1active","count=2481,sum=3.467e+07ns,average=13974.2ns,min=7333ns,max=833346ns,variance=6.11274e+08ns^2,stddev=24724ns")
Debug [Main]: Stat("EngineMaster::process","count=4962")
Debug [Main]: Stat("EngineMaster::processChannels","count=2480,sum=2.2165e+10ns,average=8.9375e+06ns,min=960248ns,max=5.28434e+07ns,variance=1.24014e+14ns^2,stddev=1.11361e+07ns")
Debug [Main]: Stat("EngineSideChain","count=191")
Debug [Main]: Stat("EngineSideChain::process","count=192")
Debug [Main]: Stat("EngineSideChain::writeSamples","count=4962")
Debug [Main]: Stat("EngineSideChain::writeSamples wake up","count=190")
Debug [Main]: Stat("EngineWorkerScheduler","count=572")
Debug [Main]: Stat("LinkwitzRiley8EQEffect","count=2480,sum=2.11762e+10ns,average=8.5388e+06ns,min=743949ns,max=5.25722e+07ns,variance=1.24491e+14ns^2,stddev=1.11575e+07ns")
Debug [Main]: Stat("MixxxMainWindow::~MixxxMainWindow","count=1,sum=1.13623e+09ns,average=1.13623e+09ns,min=1.13623e+09ns,max=1.13623e+09ns,variance=0ns^2,stddev=0ns")
Debug [Main]: Stat("SoundDevicePortAudio::callbackProcess output 0, HDA Intel: ALC269 Analog (hw:0,0)","count=2481,sum=1.89776e+08ns,average=76491.7ns,min=22628ns,max=1.06907e+07ns,variance=2.21524e+11ns^2,stddev=470664ns")
Debug [Main]: Stat("SoundDevicePortAudio::callbackProcess prepare 0, HDA Intel: ALC269 Analog (hw:0,0)","count=2480,sum=2.34364e+10ns,average=9.45015e+06ns,min=1.14386e+06ns,max=5.31427e+07ns,variance=1.24151e+14ns^2,stddev=1.11423e+07ns")
Debug [Main]: Stat("SoundDevicePortAudio::callbackProcessClkRef 0, HDA Intel: ALC269 Analog (hw:0,0)","count=4962")
Debug [Main]: Stat("VsyncThread real time error","count=17,sum=17,average=1,min=1,max=1,variance=0^2,stddev=0")
Debug [Main]: Stat("VsyncThread usleep for VSync","coun...


Daniel Schürmann (daschuer) wrote :

There seems to be a mess around SSE2 / SSE3 and Pentium 4.

See:

http://sourceforge.net/p/lmms/mailman/message/32988535/
"
3. Some (not sure how many) 32-bit CPUs with SSE2 don't have the DAZ
flag and will crash the program trying to set it.
"

https://software.intel.com/en-us/articles/x87-and-sse-floating-point-assists-in-ia-32-flush-to-zero-ftz-and-denormals-are-zero-daz
"
Initial steppings of Pentium® 4 processors did not support DAZ
"

Owen Williams (ywwg) wrote :

Are there any Pentium 4 laptops out there? Realistically, how many people might this affect? I would venture to guess ~zero.

Owen Williams (ywwg) wrote :

I couldn't find a good hardware survey that showed a breakdown by processor model. I also googled around to find out which steppings do or do not support DAZ. The Pentium 4 was in production from 2000-2008, but I'd guess that it's the really, really old models that would have trouble with this mode.

This document does show how to detect if the mode is supported, if it comes to that:
http://datasheets.chipdb.org/Intel/x86/CPUID/24161817.pdf
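
As a sketch of the kind of check involved: the MXCSR_MASK field of an FXSAVE image (bytes 28 to 31) has bit 6 set when DAZ is supported, and a mask of 0 stands for the default 0x0000FFBF, where bit 6 is clear. The helper below only interprets a mask value that has already been read. Obtaining it via FXSAVE is left out, and the names are assumptions for illustration.

```cpp
#include <cstdint>

// Mask to assume when the MXCSR_MASK field of the FXSAVE image reads 0.
constexpr std::uint32_t kDefaultMxcsrMask = 0x0000FFBFu;
// Bit 6 of MXCSR is the DAZ flag.
constexpr std::uint32_t kDazBit = 1u << 6;

// Hypothetical helper: savedMask is the MXCSR_MASK field taken from an
// FXSAVE image. Returns true when the DAZ bit may be set safely.
constexpr bool dazSupported(std::uint32_t savedMask) {
    return ((savedMask == 0 ? kDefaultMxcsrMask : savedMask) & kDazBit) != 0;
}
```

A build could run this once at startup and skip setting DAZ when it returns false, instead of risking a crash on early Pentium 4 steppings.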

We should have at least one build of Mixxx that is a super-safe 32-bit build with no special flags, just in case. But I'm still leaning toward the default build using DAZ mode.

Daniel Schürmann (daschuer) wrote :

Cool, this doc verifies the DAZ issue :-/
It could be a lot of fun to port this dazdetect.asm to gcc and MSVC.
But I doubt it is worth the time.

For now we have the "portable" build for SSE2 CPUs that either support the DAZ flag or at least do not crash when it is enabled, and the "legacy" build for all older CPUs.

Daniel Schürmann (daschuer) wrote :
Changed in mixxx:
status: Confirmed → Fix Committed
milestone: 2.1 → 1.12.0
assignee: nobody → Daniel Schürmann (daschuer)
importance: Undecided → Medium
RJ Skerry-Ryan (rryan)
Changed in mixxx:
status: Fix Committed → Fix Released
Swiftb0y (swiftb0y) wrote :

Mixxx now uses GitHub for bug tracking. This bug has been migrated to:
https://github.com/mixxxdj/mixxx/issues/7747
