Activity log for bug #1663280

Date Who What changed Old value New value Message
2017-02-09 15:39:56 Oleg Strikov bug added bug
2017-02-09 15:43:00 Oleg Strikov description Old value: the original description, which began with the summary line "Serious performance degradation of math functions in 16.04/16.10/17.04 due to known Glibc bug" followed by the same text as the new value below. New value:

Bug [0] was introduced in Glibc 2.23 [1] and fixed in Glibc 2.25 [2]. All Ubuntu versions starting from 16.04 are affected because they use either Glibc 2.23 or 2.24.

The bug introduces a serious (2x-4x) performance degradation of the math functions (pow, exp/exp2/exp10, log/log2/log10, sin/cos/sincos/tan, asin/acos/atan/atan2, sinh/cosh/tanh, asinh/acosh/atanh) provided by libm. It can be reproduced on any AVX-capable x86-64 machine.

This bug is all about the AVX-SSE transition penalty [3]. The 256-bit YMM registers used by AVX-256 instructions extend the 128-bit registers used by SSE (XMM0 is the low half of YMM0, and so on). Every time the CPU executes an SSE instruction after an AVX-256 instruction, it has to store the upper halves of the YMM registers to an internal buffer and then restore them when execution returns to AVX instructions. The store/restore is required because old-fashioned SSE knows nothing about the upper halves of its registers and may damage them. This store/restore operation is time consuming (several tens of clock cycles each time).

To deal with this issue, Intel introduced AVX-128 instructions, which operate on the same 128-bit XMM registers as SSE but take the upper halves of the YMM registers into account. Hence, no store/restore is required. Practically speaking, AVX-128 instructions are a new, smarter form of SSE instructions which can be used together with full-size AVX-256 instructions without any penalty. Intel recommends using AVX-128 instructions instead of SSE instructions wherever possible.

To sum things up, it's okay to mix SSE with AVX-128 and AVX-128 with AVX-256. Mixing AVX-128 with AVX-256 is allowed because both types of instructions are aware of the 256-bit YMM registers. Mixing SSE with AVX-128 is okay because the CPU can guarantee that the upper halves of the YMM registers don't contain any meaningful data (how could one put anything there without using AVX-256 instructions?) and can skip the store/restore operation (why care about random trash in the upper halves of the YMM registers?). It is not okay to mix SSE with AVX-256 because of the transition penalty.

The scalar floating-point instructions used by the routines mentioned above are implemented as a subset of the SSE and AVX-128 instruction sets. They operate on a small fraction of a 128-bit register but are still considered SSE/AVX-128 instructions, so they suffer from the SSE/AVX transition penalty as well.

Glibc inadvertently triggers a chain of AVX/SSE transition penalties through inappropriate use of AVX-256 instructions inside the _dl_runtime_resolve() procedure. By using AVX-256 instructions to push/pop the YMM registers [4], Glibc makes the CPU think that the upper halves of the XMM registers contain meaningful data which needs to be preserved while SSE instructions execute. With such a 'dirty' flag set, every switch between SSE and AVX instructions (AVX-128 or AVX-256) leads to a time-consuming store/restore procedure. This 'dirty' flag never gets cleared during the whole program execution, which leads to a serious overall slowdown. The fixed implementation [2] of _dl_runtime_resolve() avoids using AVX-256 instructions where possible.

The buggy _dl_runtime_resolve() gets called every time the dynamic linker tries to resolve a symbol (any symbol, not just the ones mentioned above). It is enough for _dl_runtime_resolve() to be called just once to touch the upper halves of the YMM registers and provoke AVX/SSE transition penalties from then on. It's safe to say that every dynamically linked application calls _dl_runtime_resolve() at least once, which means that all of them may experience the slowdown. The performance degradation shows up when such an application mixes AVX and SSE instructions (switches from AVX to SSE or back).

There are two types of math routines provided by libm:
(a) ones that have an AVX-optimized version (exp, sin/cos, tan, atan, log and others)
(b) ones that don't have an AVX-optimized version and rely on the general-purpose SSE implementation (pow, exp2/exp10, asin/acos, sinh/cosh/tanh, asinh/acosh/atanh and others)

For the former group the slowdown happens when they get called from SSE code (i.e. from an application compiled with -mno-avx), because an SSE -> AVX transition takes place. For the latter group the slowdown happens when the routines get called from AVX code (i.e. from an application compiled with -mavx), because an AVX -> SSE transition takes place. Both situations are realistic: gcc generates SSE code when targeting plain x86-64, and gcc -march=native generates AVX-optimized code on AVX-capable machines.

============================================================================

Let's take one routine from group (a) and try to reproduce the slowdown.

#include <math.h>
#include <stdio.h>

int main ()
{
  double a, b;
  for (a = b = 0.0; b < 2.0; b += 0.00000005) a += exp(b);
  printf("%f\n", a);
  return 0;
}

$ gcc -O3 -march=x86-64 -o exp exp.c -lm
$ time ./exp
<..> 2.801s <..>
$ time LD_BIND_NOW=1 ./exp
<..> 0.660s <..>

The application shows 4x better performance when _dl_runtime_resolve() doesn't get called. That's how serious the impact of AVX/SSE transitions can be.

============================================================================

Let's take one routine from group (b) and try to reproduce the slowdown.

#include <math.h>
#include <stdio.h>

int main ()
{
  double a, b;
  for (a = b = 0.0; b < 2.0; b += 0.00000005) a += pow(M_PI, b);
  printf("%f\n", a);
  return 0;
}

# note that the -mavx option has been passed
$ gcc -O3 -march=x86-64 -mavx -o pow pow.c -lm
$ time ./pow
<..> 4.157s <..>
$ time LD_BIND_NOW=1 ./pow
<..> 2.123s <..>

The application shows 2x better performance when _dl_runtime_resolve() doesn't get called.

============================================================================

[!] It's important to mention that the scope of this bug might be even wider. After a call to the buggy _dl_runtime_resolve(), any transition between AVX-128 and SSE (otherwise legitimate) will suffer from the performance degradation. Any application which mixes AVX-128 floating-point code with SSE floating-point code (e.g. by using an external SSE-only library) will experience a serious slowdown.

[0] https://sourceware.org/bugzilla/show_bug.cgi?id=20495
[1] https://sourceware.org/git/?p=glibc.git;a=commit;h=f3dcae82d54e5097e18e1d6ef4ff55c2ea4e621e
[2] https://sourceware.org/git/?p=glibc.git;a=commit;h=fb0f7a6755c1bfaec38f490fbfcaa39a66ee3604
[3] https://software.intel.com/en-us/articles/intel-avx-state-transitions-migrating-sse-code-to-avx
[4] https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/x86_64/dl-trampoline.h;h=d6c7f989b5e74442cacd75963efdc6785ac6549d;hb=fb0f7a6755c1bfaec38f490fbfcaa39a66ee3604#l182
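The reproducers above use LD_BIND_NOW=1 to bypass lazy symbol resolution for a single run. As a minimal, illustrative sketch (not part of the original report): roughly the same effect can be baked into a binary at link time with the BIND_NOW flag, and the glibc version a process runs against can be queried from C. The version string only identifies the upstream release (2.23/2.24 affected, 2.25 fixed); it cannot tell whether a distribution has backported the fix, so the timing comparison above remains the definitive test.

/* check_glibc.c - print the glibc version this process is linked against.
 * Illustrative helper; gnu_get_libc_version() is a GNU extension. */
#include <gnu/libc-version.h>
#include <stdio.h>

int main (void)
{
  /* 2.23 and 2.24 are the upstream releases affected by this bug. */
  printf("running against glibc %s\n", gnu_get_libc_version());
  return 0;
}

$ gcc -O2 -o check_glibc check_glibc.c && ./check_glibc
# Link-time alternative to LD_BIND_NOW=1 (eager binding for this binary only):
$ gcc -O3 -march=x86-64 -o exp exp.c -lm -Wl,-z,now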
2017-02-09 16:30:51 Launchpad Janitor glibc (Ubuntu): status New Confirmed
2017-02-09 17:44:41 Marcel Stimberg bug watch added https://sourceware.org/bugzilla/show_bug.cgi?id=20508
2017-02-09 17:44:41 Marcel Stimberg bug task added glibc
2017-02-09 18:29:54 Bug Watch Updater glibc: status Unknown Fix Released
2017-02-09 18:29:54 Bug Watch Updater glibc: importance Unknown Medium
2017-02-10 13:40:56 Marcel Stimberg bug watch added https://bugzilla.redhat.com/show_bug.cgi?id=1421121
2017-02-10 13:40:56 Marcel Stimberg bug task added glibc (Fedora)
2017-02-11 14:34:03 Oleg Strikov description Old value: identical to the new value of the 2017-02-09 description change above. New value: the same text with reference [5] appended and the following two notes inserted after the second paragraph:

@strikov: According to a quite reliable source [5], all AMD CPUs and the latest Intel CPUs (Skylake and Knights Landing) don't suffer from the AVX/SSE transition penalty. This shrinks the scope of the bug to the following generations of Intel CPUs: Sandy Bridge, Ivy Bridge, Haswell, and Broadwell. The scope still remains quite large, though.

@strikov: Ubuntu 16.10/17.04, which use Glibc 2.24, may receive the fix from the upstream 2.24 branch (as Marcel pointed out, the fix has been backported to the 2.24 branch, from which Fedora took it successfully) if such a synchronization takes place. Ubuntu 16.04 (the main target of this bug) uses Glibc 2.23, which has not been patched upstream and will suffer from the performance degradation until we fix it manually.

[5] http://www.agner.org/optimize/blog/read.php?i=761#761
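The first note above narrows the worst-affected hardware to AVX-capable Intel CPUs from Sandy Bridge through Broadwell. A minimal sketch for checking AVX capability from C (assuming GCC; the builtin only reports AVX support, not the exact CPU generation, which the model name in /proc/cpuinfo gives):

/* has_avx.c - report whether this CPU supports AVX at all; only AVX-capable
 * machines can hit the transition penalty described in this bug. */
#include <stdio.h>

int main (void)
{
  __builtin_cpu_init ();                     /* GCC builtin: populate CPU feature data */
  if (__builtin_cpu_supports ("avx"))
    printf("AVX supported - this machine can be affected\n");
  else
    printf("no AVX - the penalty described here does not apply\n");
  return 0;
}

$ gcc -O2 -o has_avx has_avx.c && ./has_avx
$ grep -m1 'model name' /proc/cpuinfo   # identify the CPU generation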
2017-02-14 16:17:06 dino99 tags upgrade-software-version xenial yakkety zesty
2017-02-14 18:08:57 Brian Murray nominated for series Ubuntu Zesty
2017-02-14 18:08:57 Brian Murray bug task added glibc (Ubuntu Zesty)
2017-02-15 16:55:54 Alberto Salvia Novella summary Serious performance degradation of math functions in 16.04/16.10/17.04 due to known Glibc bug Serious performance degradation of math functions
2017-02-15 16:57:33 Alberto Salvia Novella glibc (Ubuntu Zesty): importance Undecided Medium
2017-05-01 21:42:33 Brian Murray glibc (Ubuntu): assignee Matthias Klose (doko)
2017-07-23 07:13:42 Vinson Lee bug added subscriber Vinson Lee
2017-08-22 16:08:53 Luke Faraone glibc (Ubuntu): status Confirmed Triaged
2017-08-22 16:08:59 Luke Faraone glibc (Ubuntu Zesty): status Confirmed Triaged
2017-08-22 16:10:01 Luke Faraone nominated for series Ubuntu Xenial
2017-09-07 14:17:25 ake sandgren bug added subscriber ake sandgren
2017-09-07 16:32:34 Jesse Johnson bug added subscriber Jesse Johnson
2017-09-28 06:51:14 Vinson Lee bug added subscriber Steve Beattie
2017-10-27 08:11:56 Bug Watch Updater glibc (Fedora): status Unknown Fix Released
2017-10-27 08:11:56 Bug Watch Updater glibc (Fedora): importance Unknown Undecided
2017-10-27 08:12:00 Bug Watch Updater bug watch added https://sourceware.org/bugzilla/show_bug.cgi?id=20495
2018-04-04 16:40:01 Benjamin Peterson bug added subscriber Benjamin Peterson
2018-06-07 04:41:18 Daniel Axtens bug added subscriber Daniel Axtens
2018-06-08 13:50:35 David Coronel bug added subscriber David Coronel
2018-06-13 01:59:30 Daniel Axtens bug watch added https://sourceware.org/bugzilla/show_bug.cgi?id=20139
2018-06-13 04:58:28 Florian Weimer bug watch added https://sourceware.org/bugzilla/show_bug.cgi?id=21265
2018-06-21 15:42:45 Dimitri John Ledkov bug task added glibc (Ubuntu Xenial)
2018-06-21 15:43:52 Dimitri John Ledkov glibc (Ubuntu Zesty): status Triaged Won't Fix
2018-06-21 15:43:56 Dimitri John Ledkov glibc (Ubuntu): status Triaged Fix Released
2018-07-03 14:36:07 Dan Streetman bug added subscriber Dan Streetman
2018-10-02 06:12:45 Daniel Axtens glibc (Ubuntu Xenial): status New Confirmed
2018-10-02 06:12:50 Daniel Axtens glibc (Ubuntu Xenial): assignee Daniel Axtens (daxtens)
2018-10-16 03:55:28 Daniel Axtens description Old value: identical to the new value of the 2017-02-11 description change above. New value: an SRU justification placed ahead of the existing text, which is kept under "[Original Description]":

SRU Justification
=================

[Impact]

* Severe performance hit on many maths-heavy workloads. For example, a user reports linpack performance of 13 Gflops on Trusty and Bionic but only 3.9 Gflops on Xenial.

* Because the impact is so large (>3x) and Xenial is supported until 2021, the fix should be backported.

* The fix avoids an AVX-SSE transition penalty. It stops _dl_runtime_resolve() from using AVX-256 instructions which touch the upper halves of various registers. This change means that the processor does not need to save and restore them.

[Test Case]

Firstly, you need a suitable Intel machine. Users report that Sandy Bridge, Ivy Bridge, Haswell, and Broadwell CPUs are affected, and I have been able to reproduce it on a Skylake CPU using a suitable Azure VM.

Create the following C file, exp.c:

#include <math.h>
#include <stdio.h>

int main ()
{
  double a, b;
  for (a = b = 0.0; b < 2.0; b += 0.00000005) a += exp(b);
  printf("%f\n", a);
  return 0;
}

$ gcc -O3 -march=x86-64 -o exp exp.c -lm

With the current version of glibc:

$ time ./exp
...
real 0m1.349s
user 0m1.349s

$ time LD_BIND_NOW=1 ./exp
...
real 0m0.625s
user 0m0.621s

Observe that LD_BIND_NOW makes a big difference, as it avoids the call to _dl_runtime_resolve.

With the proposed update:

$ time ./exp
...
real 0m0.625s
user 0m0.621s

$ time LD_BIND_NOW=1 ./exp
...
real 0m0.631s
user 0m0.631s

Observe that the normal case is faster, and LD_BIND_NOW makes a negligible difference.

[Regression Potential]

glibc is the nightmare case for regressions, as it could affect pretty much anything, and this patch touches a key part (dynamic libraries).

We can be fairly confident in the fix generally: it is in the glibc shipped in Bionic, Debian and some RPM-based distros. The backport is based on the patches in the release/2.23/master branch of the upstream glibc repository, and the backport was straightforward. Obviously that doesn't remove all risk. There is also a fair bit of Ubuntu-specific patching in glibc, so other distros are of limited value for ruling out bugs.

So I have done the following testing, and I'm happy to do more as required. All testing has been done:
 - on an Azure VM (affected by the change), with the proposed package
 - on a local VM (not affected by the change), with the proposed package

* Boot with the upgraded libc6.
* Watch a youtube video in Firefox over VNC.
* Build some C code (debuild of zlib).
* Test Java by installing and running Eclipse.

Autopkgtest also passes.

[Original Description]

(The description above, as set on 2017-02-09 and amended on 2017-02-11, is retained here unchanged, including references [0]-[5].)
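A short verification sketch to accompany the [Test Case] above (standard Ubuntu tooling; the exact fixed libc6 package version is not recorded in this log, so rely on the timing behaviour rather than on a version number):

$ apt-cache policy libc6            # show the installed and candidate libc6 versions
$ gcc -O3 -march=x86-64 -o exp exp.c -lm
$ time ./exp
$ time LD_BIND_NOW=1 ./exp
# On an affected glibc the first run is several times slower than the second;
# with the fixed package both runs take roughly the same time.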
2018-10-16 04:11:17 Launchpad Janitor merge proposal linked https://code.launchpad.net/~daxtens/ubuntu/+source/glibc/+git/glibc/+merge/356779
2018-10-16 12:48:50 Dan Streetman bug added subscriber Ubuntu Sponsors Team
2018-10-19 04:05:03 Daniel Axtens tags upgrade-software-version xenial yakkety zesty sts upgrade-software-version xenial yakkety zesty
2018-10-19 04:05:19 Daniel Axtens bug added subscriber STS Sponsors
2018-11-03 23:54:33 Mathew Hodson glibc (Ubuntu Xenial): importance Undecided Medium
2018-11-24 23:48:19 Mathew Hodson glibc (Ubuntu Xenial): status Confirmed In Progress
2018-12-14 18:38:15 Fabio Augusto Miranda Martins glibc (Ubuntu Xenial): importance Medium Low
2019-02-04 12:37:50 Dan Streetman glibc (Ubuntu Xenial): importance Low High
2019-02-05 18:48:53 Dan Streetman removed subscriber Ubuntu Sponsors Team
2019-02-05 18:51:49 Brian Murray glibc (Ubuntu Xenial): status In Progress Fix Committed
2019-02-05 18:51:51 Brian Murray bug added subscriber Ubuntu Stable Release Updates Team
2019-02-05 18:51:56 Brian Murray bug added subscriber SRU Verification
2019-02-05 18:52:03 Brian Murray tags sts upgrade-software-version xenial yakkety zesty sts upgrade-software-version verification-needed verification-needed-xenial xenial yakkety zesty
2019-02-13 16:41:11 Dan Streetman tags sts upgrade-software-version verification-needed verification-needed-xenial xenial yakkety zesty sts upgrade-software-version verification-done verification-done-xenial xenial
2019-02-20 08:16:47 Eric Desrochers removed subscriber STS Sponsors
2019-02-20 15:42:16 Łukasz Zemczak removed subscriber Ubuntu Stable Release Updates Team
2019-02-20 15:52:18 Launchpad Janitor glibc (Ubuntu Xenial): status Fix Committed Fix Released