Add Denormal prevention in engine code
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
Mixxx | Fix Released | Medium | Daniel Schürmann | | |
Bug Description
It looks like we need to add it, because processing denormals may cost 100 times more CPU.
Daniel Schürmann (daschuer) wrote : | #1 |
Daniel Schürmann (daschuer) wrote : | #2 |
Changed in mixxx: | |
milestone: | none → 1.12.0 |
Owen Williams (ywwg) wrote : | #3 |
I removed our denormal code earlier this year because it caused horrible audible spikes in the EQ filters as the wave approached zero, due to waveform discontinuity. Part of the problem was that our denormal code kicked in far outside the range where the CPU penalty actually applies. When RJ and I looked closer, we found that we already use a gcc flag to disable ultra-small values anyway, so the compiler is handling denormals for us.
Owen Williams (ywwg) wrote : | #4 |
If you do want to pursue this change, I would require a battery of tests that show that none of our filters are thrown off by the change in the sound wave. But I would urge you to confirm that it's actually a problem by demonstrating the CPU impact first before writing a bunch of new code.
Daniel Schürmann (daschuer) wrote : | #5 |
Can you recall which gcc Flag it is? I cannot find it.
http://
"
Luckily, there is an instruction to change the CPU’s behaviour: instead of storing denormalized values, these can simply be flushed to 0. Unfortunately, there is no standard library function for this. On Visual C++, we can do this:
_controlfp(_MCW_DN, _DN_FLUSH);
On gcc, we need some inline assembly. This was my first x86 assembly ever:
int mxcsr;
__asm__("stmxcsr %0" : "=m"(mxcsr) : :);
mxcsr |= (1 << 15); // set bit 15: flush-to-zero mode
__asm__("ldmxcsr %0" : : "m"(mxcsr) :);
"
RJ Skerry-Ryan (rryan) wrote : | #6 |
Hm, I had forgotten about that Owen:
The flag is -ffast-math -- according to the GCC docs here it enables flush-to-zero on some platforms though it isn't specific about which ones:
https:/
RJ Skerry-Ryan (rryan) wrote : | #7 |
RJ Skerry-Ryan (rryan) wrote : | #8 |
Owen Williams (ywwg) wrote : | #9 |
Here's the other documentation I was using -- ffast-math + sse flags: http://
Daniel Schürmann (daschuer) wrote : | #10 |
The filter code suffers from denormals.
I have checked this by adding
if (!std::
qDebug() << "denormal";
}
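The check above is cut off in this copy of the thread, so the exact call that was used is unknown. As a hedged illustration only (not the code that was added to Mixxx), the sketch below uses `std::fpclassify` to detect subnormal values and a hypothetical one-pole recursive filter, `OnePole`, to show how a decaying filter state drifts into the subnormal range once the input goes silent. It assumes the default floating-point environment, without flush-to-zero and without -ffast-math.

```cpp
#include <cmath>

// True only for subnormal (denormal) values; zero, normal values,
// infinities and NaN all return false.
inline bool isSubnormal(float x) {
    return std::fpclassify(x) == FP_SUBNORMAL;
}

// Hypothetical one-pole recursive (IIR) filter: y[n] = x[n] + a * y[n-1].
struct OnePole {
    explicit OnePole(float coeff) : a(coeff), y(0.0f) {}
    float process(float x) {
        y = x + a * y;
        return y;
    }
    float a;
    float y;
};

// Feeds one impulse followed by silence. Returns the number of silent
// samples processed until the output first becomes subnormal, or -1 if the
// output reaches exactly zero or maxSamples is exhausted first.
inline int silentSamplesUntilSubnormal(float coeff, int maxSamples) {
    OnePole filter(coeff);
    filter.process(1.0f);
    for (int n = 1; n <= maxSamples; ++n) {
        const float out = filter.process(0.0f);
        if (isSubnormal(out)) {
            return n;
        }
        if (out == 0.0f) {
            return -1;
        }
    }
    return -1;
}
```

A check like `isSubnormal` is only meant for diagnostics; calling it per sample in the audio path would add its own cost.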
Changed in mixxx: | |
status: | New → Confirmed |
Daniel Schürmann (daschuer) wrote : | #11 |
@RJ: Did you use a sse3 build?
It might be possible that your code does nothing on default builds
see:
https:/
Owen Williams (ywwg) wrote : | #12 |
The last time someone tried to "fix" denormals, it caused major audio artifacts. If we're going to try to do this again, we're going to need:
* proof that denormals are causing detectable CPU hits
* proof that the fix does not cause audio artifacts.
This work should wait for post-release.
Changed in mixxx: | |
milestone: | 1.12.0 → 2.1 |
Daniel Schürmann (daschuer) wrote : | #13 |
@Owen: What did you do last time to flush denormals?
I have just read that denormals are flushed by default in the Mac OS audio callback.
So it can't be that bad.
@RJ, can you verify that?
Owen Williams (ywwg) wrote : | #14 |
The previous fix checked whether the value was within abs(.000001), or some other number that was nowhere near small enough, and just set the value to 0 if so. It was a really bad fix that was not tested well.
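To put the threshold in perspective: 0.000001 is about -120 dB relative to full scale, while the smallest normal single-precision float is roughly 1.18e-38, so a cutoff near 1e-6 zeroes many values that are nowhere near denormal. The sketch below contrasts the two approaches. `flushBelowThreshold` is a hypothetical reconstruction of the kind of check described above (the removed Mixxx code is not shown in this thread), and `flushSubnormal` is an assumed alternative that only touches true subnormals.

```cpp
#include <cmath>

// Hypothetical reconstruction of the old approach: anything with a magnitude
// below the threshold is forced to zero, including perfectly normal floats.
inline float flushBelowThreshold(float x, float threshold = 1e-6f) {
    return std::fabs(x) < threshold ? 0.0f : x;
}

// Assumed alternative: only values the FPU classifies as subnormal are
// zeroed, so every normal float passes through unchanged.
inline float flushSubnormal(float x) {
    return std::fpclassify(x) == FP_SUBNORMAL ? 0.0f : x;
}
```

A sample of 5e-7 is a normal float: the threshold version changes it to zero, which introduces a step in the waveform, while the classification-based version leaves it alone.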
Daniel Schürmann (daschuer) wrote : | #15 |
It looks like we can rely on this:
http://
@RJ, what does it mean for the default Mixxx optimization flags?
Daniel Schürmann (daschuer) wrote : | #16 |
My results:
Thread model: posix
gcc version 4.8.2 (Ubuntu 4.8.2-19ubuntu1)
model name : Intel(R) Core(TM) i5-3317U CPU @ 1.70GHz
model name : Intel(R) Core(TM) i5-3317U CPU @ 1.70GHz
model name : Intel(R) Core(TM) i5-3317U CPU @ 1.70GHz
model name : Intel(R) Core(TM) i5-3317U CPU @ 1.70GHz
subnormal 12.0642 times slower.
-ffast-math enabled subnormal 1.14059 times slower.
SSE enabled FTZ=0 DAZ=0 subnormal 12.0405 times slower.
SSE enabled FTZ=0 DAZ=1 subnormal 0.999144 times slower.
SSE enabled FTZ=1 DAZ=0 subnormal 12.0839 times slower.
SSE enabled FTZ=0 DAZ=0 -ffast-math enabled subnormal 1.00153 times slower.
SSE enabled FTZ=0 DAZ=1 -ffast-math enabled subnormal 1.01593 times slower.
SSE enabled FTZ=1 DAZ=0 -ffast-math enabled subnormal 1.0017 times slower.
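The benchmark program that produced these numbers is not included in this thread. As a rough sketch of how such a ratio might be measured (all function names here are hypothetical, and this is not the benchmark used above), the code below times the same multiply loop once with a normal operand and once with a subnormal one. The result depends on the CPU, the compiler flags and the MXCSR state of the calling thread.

```cpp
#include <chrono>
#include <cstddef>
#include <limits>

// Times `iterations` multiplications of `value` by 0.5f and returns seconds.
// The volatile qualifiers force a real load, multiply and store on each pass
// so the compiler cannot fold the loop away.
inline double secondsForMultiplies(float value, std::size_t iterations) {
    volatile float input = value;
    volatile float output = 0.0f;
    const auto start = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < iterations; ++i) {
        output = input * 0.5f;
    }
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}

// Returns how many times slower the loop runs with a subnormal operand than
// with a normal one. Values near 1 mean no measurable penalty.
inline double subnormalSlowdown(std::size_t iterations) {
    const float normalValue = 1.0f;
    // 2^-139: subnormal, and still subnormal after being halved.
    const float subnormalValue = std::numeric_limits<float>::denorm_min() * 1024.0f;
    const double normalTime = secondsForMultiplies(normalValue, iterations);
    const double subnormalTime = secondsForMultiplies(subnormalValue, iterations);
    return subnormalTime / normalTime;
}
```

Because the operand is already subnormal when it is loaded, a run like this mainly exercises the handling of subnormal inputs.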
Daniel Schürmann (daschuer) wrote : | #17 |
Thread model: posix
gcc version 4.6.3 (Ubuntu/Linaro 4.6.3-1ubuntu5)
model name : Intel(R) Core(TM) i5 CPU M 560 @ 2.67GHz
model name : Intel(R) Core(TM) i5 CPU M 560 @ 2.67GHz
model name : Intel(R) Core(TM) i5 CPU M 560 @ 2.67GHz
model name : Intel(R) Core(TM) i5 CPU M 560 @ 2.67GHz
subnormal 12.0694 times slower.
-ffast-math enabled subnormal 1.01273 times slower.
SSE enabled FTZ=0 DAZ=0 subnormal 11.8485 times slower.
SSE enabled FTZ=0 DAZ=1 subnormal 0.991461 times slower.
SSE enabled FTZ=1 DAZ=0 subnormal 13.0207 times slower.
SSE enabled FTZ=0 DAZ=0 -ffast-math enabled subnormal 0.96152 times slower.
SSE enabled FTZ=0 DAZ=1 -ffast-math enabled subnormal 1.00095 times slower.
SSE enabled FTZ=1 DAZ=0 -ffast-math enabled subnormal 1.04673 times slower.
Daniel Schürmann (daschuer) wrote : | #18 |
On the same device, but a Virtual 32 bit OS:
Thread model: posix
gcc version 4.3.1 20080507 (prerelease) [gcc-4_3-branch revision 135036] (SUSE Linux)
model name : Intel(R) Core(TM) i5 CPU M 560 @ 2.67GHz
subnormal 23.8905 times slower.
-ffast-math enabled subnormal 24.4369 times slower.
SSE enabled FTZ=0 DAZ=0 subnormal 10.5012 times slower.
SSE enabled FTZ=0 DAZ=1 subnormal 0.958384 times slower.
SSE enabled FTZ=1 DAZ=0 subnormal 11.4803 times slower.
SSE enabled FTZ=0 DAZ=0 -ffast-math enabled subnormal 0.998909 times slower.
SSE enabled FTZ=0 DAZ=1 -ffast-math enabled subnormal 0.990018 times slower.
SSE enabled FTZ=1 DAZ=0 -ffast-math enabled subnormal 1.00707 times slower.
Owen Williams (ywwg) wrote : | #19 |
I'm not quite sure what all the acronyms mean. Would you suggest a change in our build flags?
Daniel Schürmann (daschuer) wrote : | #20 |
Probably yes. I am still missing a test on my 32 bit Atom netbook to prove it. But I think we already have enough data for a conclusion.
1.) Since our filters are infinite impulse response (IIR) filters, they will produce denormals. I have proved it by a test.
2.) Because of the -ffast-math flag, 64 bit Mixxx builds have no penalty from denormals. I have proved this on my devices and the column at http://
3.) There is a performance penalty on 32 bit Mixxx builds even though the -ffast-math flag is set. We need to enable SSE and set the DAZ flag to get the same benefit as the 64 bit build.
I do not know how big the share of the entire CPU time in the audio callback is (I will test it later), but since we have a solution for 64 bit, we should solve the issue for 32 bit as well.
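For point 3, one possible way to set the flags from C++ is through the MXCSR register, using the `_mm_getcsr` and `_mm_setcsr` intrinsics from `<xmmintrin.h>`. This is a minimal sketch, not the patch that went into Mixxx: the function names are hypothetical, it assumes the build has SSE enabled (for example -msse on 32 bit GCC), and it assumes the CPU accepts the DAZ bit. MXCSR is per-thread state, so a call like this would have to run in the audio thread itself.

```cpp
#include <cstdint>

#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>
#define SKETCH_HAS_MXCSR 1
#else
#define SKETCH_HAS_MXCSR 0
#endif

// MXCSR bit 15, flush-to-zero (FTZ): subnormal results are replaced by 0.
constexpr std::uint32_t kFtzBit = 1u << 15;
// MXCSR bit 6, denormals-are-zero (DAZ): subnormal inputs are read as 0.
constexpr std::uint32_t kDazBit = 1u << 6;

// Pure helper: returns the given MXCSR value with FTZ and DAZ switched on.
inline std::uint32_t withDenormalsDisabled(std::uint32_t mxcsr) {
    return mxcsr | kFtzBit | kDazBit;
}

// Sets FTZ and DAZ for the calling thread. Returns false when the build has
// no SSE support, in which case nothing is changed.
inline bool disableDenormalsForThisThread() {
#if SKETCH_HAS_MXCSR
    _mm_setcsr(withDenormalsDisabled(_mm_getcsr()));
    return true;
#else
    return false;
#endif
}
```

These bits only affect SSE arithmetic; code compiled for the x87 FPU is not covered by them.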
Daniel Schürmann (daschuer) wrote : | #21 |
Thread model: posix
gcc version 4.8.2 (Ubuntu 4.8.2-19ubuntu1)
model name : Intel(R) Atom(TM) CPU N270 @ 1.60GHz
model name : Intel(R) Atom(TM) CPU N270 @ 1.60GHz
subnormal 45.0023 times slower.
-ffast-math enabled subnormal 37.7372 times slower.
SSE enabled FTZ=0 DAZ=0 subnormal 19.1181 times slower.
SSE enabled FTZ=0 DAZ=1 subnormal 0.998239 times slower.
SSE enabled FTZ=1 DAZ=0 subnormal 18.8671 times slower.
SSE enabled FTZ=0 DAZ=0 -ffast-math enabled subnormal 0.999303 times slower.
SSE enabled FTZ=0 DAZ=1 -ffast-math enabled subnormal 0.999885 times slower.
SSE enabled FTZ=1 DAZ=0 -ffast-math enabled subnormal 0.998566 times slower.
Daniel Schürmann (daschuer) wrote : | #22 |
Test results from my Atom Notebook
Build with default settings
Debug [Main]:
Stat("LinkwitzR
Debug [Main]: Stat("EngineMas
Build with scons -j2 optimize=2
Debug [Main]: Stat("LinkwitzR
Debug [Main]: Stat("EngineMas
The average time for the EQ is nearly half that of the non-SSE version.
Interestingly, the max value also differs only by a factor of 2, not by the factor of 20 we might expect from the denormal calculations.
Conclusion:
SSE brings a BIG benefit for 32 bit builds.
This should be the default for source builds.
For binary distributions, we should strongly consider dropping Pentium 3 support,
... or offer SSE and non-SSE builds.
Dropping Pentium 3 might be a problem for the Linux distros :-/
Owen Williams (ywwg) wrote : | #23 |
I would have no problem with dropping pentium 3. Even "old" netbooks are still going to have an Atom or Celeron or more modern CPU than a pentium 3.
Owen Williams (ywwg) wrote : | #24 |
Thanks for doing this research, daniel!
Daniel Schürmann (daschuer) wrote : | #25 |
It looks like DAZ is standard on armhf builds ...
https:/
"
If the selected floating-point hardware includes the NEON extension (e.g. -mfpu=‘neon’), note that floating-point operations are not generated by GCC's auto-vectorization pass unless -funsafe-
"
Daniel Schürmann (daschuer) wrote : | #26 |
This is the result on the same hardware as above using "optimize=2" and just -O2
Debug [Main]: Stat("LinkwitzR
Debug [Main]: Debug [Main]: Stat("EngineMas
The filter and engine code take ~3 times longer.
Conclusion: it is a good idea to use -O3 + -funroll-loops
Daniel Schürmann (daschuer) wrote : | #27 |
And yes, we need RJ's patch.
I get heavy load if I play a track in one deck using the Linkwitz-Riley EQ and turn the gain to zero.
With the patch, there is no load change when turning to zero.
I can see similar results on an i5 notebook (x64) with small audio buffers.
Enabling DAZ helps. I do not have a clue why this happens on an SSE2 build. According to the test above, this should not happen ...
So there seems to be another issue.
Debug [Main]: =======
Debug [Main]: BASE STATS
Debug [Main]: =======
Debug [Main]: Stat("AnalyserQueue process","count=1")
Debug [Main]: Stat("CachingRe
Debug [Main]: Stat("CachingRe
Debug [Main]: Stat("CachingRe
Debug [Main]: Stat("CachingRe
Debug [Main]: Stat("CachingRe
Debug [Main]: Stat("CachingRe
Debug [Main]: Stat("CachingRe
Debug [Main]: Stat("EngineBuf
Debug [Main]: Stat("EngineMas
Debug [Main]: Stat("EngineMas
Debug [Main]: Stat("EngineMas
Debug [Main]: Stat("EngineMas
Debug [Main]: Stat("EngineSid
Debug [Main]: Stat("EngineSid
Debug [Main]: Stat("EngineSid
Debug [Main]: Stat("EngineSid
Debug [Main]: Stat("EngineWor
Debug [Main]: Stat("LinkwitzR
Debug [Main]: Stat("MixxxMain
Debug [Main]: Stat("SoundDevi
Debug [Main]: Stat("SoundDevi
Debug [Main]: Stat("SoundDevi
Debug [Main]: Stat("VsyncThread real time error",
Debug [Main]: Stat("VsyncThread usleep for VSync","coun...
Daniel Schürmann (daschuer) wrote : | #28 |
There seems to be a mess around SSE2 / SSE3 and Pentium 4.
See:
http://
"
3. Some (not sure how many) 32-bit CPUs with SSE2 don't have the DAZ
flag and will crash the program trying to set it.
"
https:/
"
Initial steppings of Pentium® 4 processors did not support DAZ
"
Owen Williams (ywwg) wrote : | #29 |
Are there any pentium 4 laptops out there? Realistically, how many people might this affect? I would venture to guess ~zero
Owen Williams (ywwg) wrote : | #30 |
I couldn't find a good hardware survey that showed breakdown by processor model. I also googled around to find out which steppings do or do not support DAZ. The pentium4 was in production from 2000-2008, but I'd guess that it's the really really old models that would have trouble with this mode.
This document does show how to detect if the mode is supported, if it comes to that:
http://
We should have at least one build of mixxx that is super-safe 32bit no special flags, just in case. But I'm still leaning toward the default build using DAZ mode.
Daniel Schürmann (daschuer) wrote : | #31 |
Cool, this doc verifies the DAZ issue :-/
It could be a lot of fun to port this dazdetect.asm to gcc and MSVC.
But I doubt that it is worth the time.
For now we have the "portable" build for SSE2 CPUs, which either support the DAZ flag or lack it without crashing when it is enabled, and the "legacy" build for all older CPUs.
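For reference, a runtime check does not have to be large. The sketch below is not a port of dazdetect.asm, which is not reproduced in this thread. It is an assumption-labeled outline based on the FXSAVE layout in Intel's manuals: the MXCSR_MASK field sits at byte offset 28 of the FXSAVE image, a value of 0 means the default mask 0xFFBF applies, and DAZ is available only if bit 6 of the mask is set. The function names are hypothetical, only GCC and Clang on x86 are covered, and the CPU is assumed to support the FXSAVE instruction.

```cpp
#include <cstdint>
#include <cstring>

// Bit 6 of MXCSR is the denormals-are-zero (DAZ) flag.
constexpr std::uint32_t kDazMaskBit = 1u << 6;

// Interprets an MXCSR_MASK value. A mask of 0 stands for the default mask
// 0xFFBF, which has bit 6 clear, so DAZ is not available in that case.
inline bool dazSupportedFromMask(std::uint32_t mxcsrMask) {
    if (mxcsrMask == 0) {
        mxcsrMask = 0xFFBFu;
    }
    return (mxcsrMask & kDazMaskBit) != 0;
}

// Returns true if the running CPU reports DAZ support. Implemented only for
// GCC/Clang on x86; every other target conservatively returns false.
inline bool cpuSupportsDaz() {
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    // FXSAVE needs a 512-byte, 16-byte-aligned memory area.
    struct alignas(16) FxsaveArea {
        unsigned char bytes[512];
    };
    FxsaveArea area = {};
    __asm__ volatile("fxsave %0" : "+m"(area));
    std::uint32_t mask = 0;
    std::memcpy(&mask, area.bytes + 28, sizeof(mask));
    return dazSupportedFromMask(mask);
#else
    return false;
#endif
}
```

A caller could consult `cpuSupportsDaz()` once at startup and skip setting the DAZ bit when it returns false.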
Daniel Schürmann (daschuer) wrote : | #32 |
Changed in mixxx: | |
status: | Confirmed → Fix Committed |
milestone: | 2.1 → 1.12.0 |
assignee: | nobody → Daniel Schürmann (daschuer) |
importance: | Undecided → Medium |
Changed in mixxx: | |
status: | Fix Committed → Fix Released |
Swiftb0y (swiftb0y) wrote : | #33 |
Mixxx now uses GitHub for bug tracking. This bug has been migrated to:
https:/
lock status: | Metadata changes locked and limited to project staff |
Links from IRC:
musicdsp.org/files/denormal.pdf
ldesoras.free.fr/doc/articles/denormal-en.pdf
http://
http://