Activity log for bug #1928508

Date	Who	What changed	Old value	New value	Message
2021-05-14 18:59:18	Heitor Alves de Siqueira	bug			added bug
2021-05-14 18:59:38	Heitor Alves de Siqueira	nominated for series		Ubuntu Groovy
2021-05-14 18:59:38	Heitor Alves de Siqueira	bug task added		glibc (Ubuntu Groovy)
2021-05-14 18:59:38	Heitor Alves de Siqueira	nominated for series		Ubuntu Focal
2021-05-14 18:59:38	Heitor Alves de Siqueira	bug task added		glibc (Ubuntu Focal)
2021-05-14 18:59:43	Heitor Alves de Siqueira	glibc (Ubuntu Focal): importance	Undecided	High
2021-05-14 18:59:44	Heitor Alves de Siqueira	glibc (Ubuntu Groovy): importance	Undecided	High
2021-05-14 18:59:46	Heitor Alves de Siqueira	glibc (Ubuntu Focal): status	New	Confirmed
2021-05-14 18:59:47	Heitor Alves de Siqueira	glibc (Ubuntu Groovy): status	New	Won't Fix
2021-05-14 18:59:49	Heitor Alves de Siqueira	glibc (Ubuntu Groovy): status	Won't Fix	Confirmed
2021-05-14 18:59:52	Heitor Alves de Siqueira	glibc (Ubuntu): status	New	Fix Released
2021-05-14 18:59:54	Heitor Alves de Siqueira	glibc (Ubuntu Focal): assignee		Heitor Alves de Siqueira (halves)
2021-05-14 18:59:56	Heitor Alves de Siqueira	glibc (Ubuntu Groovy): assignee		Heitor Alves de Siqueira (halves)
2021-05-14 19:01:51	Heitor Alves de Siqueira	description	[Impact] On AMD Zen systems, memcpy() calls see a heavy performance regression in Focal and Groovy, due to the way __x86_non_temporal_threshold is calculated. Before 'glibc-2.33~455', cache values were calculated taking into consideration the number of hardware threads in the CPU. On AMD Ryzen and EPYC systems, this can be counter-productive if the number of threads is high enough for the last-level caches to "overrun" each other and cause cache line flushes. The solution is to reduce the allocated size for these non_temporal stores, removing the number of threads from the equation. [Test Plan] Attached to this bug is a short C program that exercises memcpy() calls in buffers of variable length. This has been obtained from a similar bug report for Red Hat, and is publicly available at [0]. This test program was compiled with gcc 10.2.0, using the following flags: $ gcc -mtune=generic -march=x86_64 -g -03 test_memcpy.c -o test_memcpy64 Tests were performed with the following criteria: - use 32Mb buffers ("./test_memcpy64 32") - benchmark with the hyperfine tool [1], as it calculates relevant statistics automatically - benchmark with at least 10 runs in the same environment, to minimize variance - measure on AMD Zen (3700X) and on Intel Xeon (E5-2683), to ensure we don't penalize one x86 vendor in favor of the other Below is a comparison between two Focal containers, leveraging LXD to make use of different libc versions on the same host: $ hyperfine -n libc-2.31-0ubuntu9.2 'lxc exec focal ./test_memcpy64 32' -n libc-patched 'lxc exec focal-patched ./test_memcpy64 32' Benchmark #1: libc-2.31-0ubuntu9.2 Time (mean ± σ): 2.723 s ± 0.013 s [User: 4.7 ms, System: 5.1 ms] Range (min … max): 2.693 s … 2.735 s 10 runs Benchmark #2: libc-patched Time (mean ± σ): 1.522 s ± 0.004 s [User: 3.9 ms, System: 5.6 ms] Range (min … max): 1.515 s … 1.528 s 10 runs Summary 'libc-patched' ran 1.79 ± 0.01 times faster than 
'libc-2.31-0ubuntu9.2' [0] https://bugzilla.redhat.com/show_bug.cgi?id=1880670 [1] https://github.com/sharkdp/hyperfine/ [Where problems could occur] Since we're messing with the cacheinfo for x86 in general, we need to be careful not to introduce further performance regressions on memory-heavy workloads. Even though initial results might reveal improvement on AMD Ryzen and EPYC hardware, we should also validate different configurations (e.g. Intel, different buffer sizes, etc) to make sure we won't hurt performance in other non-AMD environments. [Other Info] This has been fixed by the following upstream commit: - d3c57027470b (Reversing calculation of __x86_shared_non_temporal_threshold) $ git describe --contains d3c57027470b glibc-2.33~455 Affected releases include Ubuntu Focal and Groovy. Bionic is not affected, and releases starting with Hirsute already ship the upstream patch to fix this regression.	[Impact] On AMD Zen systems, memcpy() calls see a heavy performance regression in Focal and Groovy, due to the way __x86_non_temporal_threshold is calculated. Before 'glibc-2.33~455', cache values were calculated taking into consideration the number of hardware threads in the CPU. On AMD Ryzen and EPYC systems, this can be counter-productive if the number of threads is high enough for the last-level caches to "overrun" each other and cause cache line flushes. The solution is to reduce the allocated size for these non_temporal stores, removing the number of threads from the equation. [Test Plan] Attached to this bug is a short C program that exercises memcpy() calls in buffers of variable length. This has been obtained from a similar bug report for Red Hat, and is publicly available at [0]. 
This test program was compiled with gcc 10.2.0, using the following flags: $ gcc -mtune=generic -march=x86_64 -g -03 test_memcpy.c -o test_memcpy64 Tests were performed with the following criteria: - use 32Mb buffers ("./test_memcpy64 32") - benchmark with the hyperfine tool [1], as it calculates relevant statistics automatically - benchmark with at least 10 runs in the same environment, to minimize variance - measure on AMD Zen (3700X) and on Intel Xeon (E5-2683), to ensure we don't penalize one x86 vendor in favor of the other Below is a comparison between two Focal containers, leveraging LXD to make use of different libc versions on the same host: $ hyperfine -n libc-2.31-0ubuntu9.2 'lxc exec focal ./test_memcpy64 32' -n libc-patched 'lxc exec focal-patched ./test_memcpy64 32' Benchmark #1: libc-2.31-0ubuntu9.2 Time (mean ± σ): 2.723 s ± 0.013 s [User: 4.7 ms, System: 5.1 ms] Range (min … max): 2.693 s … 2.735 s 10 runs Benchmark #2: libc-patched Time (mean ± σ): 1.522 s ± 0.004 s [User: 3.9 ms, System: 5.6 ms] Range (min … max): 1.515 s … 1.528 s 10 runs Summary 'libc-patched' ran 1.79 ± 0.01 times faster than 'libc-2.31-0ubuntu9.2' [0] https://bugzilla.redhat.com/show_bug.cgi?id=1880670 [1] https://github.com/sharkdp/hyperfine/ [Where problems could occur] Since we're messing with the cacheinfo for x86 in general, we need to be careful not to introduce further performance regressions on memory-heavy workloads. Even though initial results might reveal improvement on AMD Ryzen and EPYC hardware, we should also validate different configurations (e.g. Intel, different buffer sizes, etc) to make sure we won't hurt performance in other non-AMD environments. 
[Other Info] This has been fixed by the following upstream commit: - d3c57027470b (Reversing calculation of __x86_shared_non_temporal_threshold) $ git describe --contains d3c57027470b glibc-2.33~455 $ rmadison glibc -s focal,focal-updates,groovy,groovy-proposed,hirsute glibc | 2.31-0ubuntu9 | focal | source glibc | 2.31-0ubuntu9.2 | focal-updates | source glibc | 2.32-0ubuntu3 | groovy | source glibc | 2.32-0ubuntu3.2 | groovy-proposed | source glibc | 2.33-0ubuntu5 | hirsute | source Affected releases include Ubuntu Focal and Groovy. Bionic is not affected, and releases starting with Hirsute already ship the upstream patch to fix this regression.
2021-05-14 19:04:17	Heitor Alves de Siqueira	description	[Impact] On AMD Zen systems, memcpy() calls see a heavy performance regression in Focal and Groovy, due to the way __x86_non_temporal_threshold is calculated. Before 'glibc-2.33~455', cache values were calculated taking into consideration the number of hardware threads in the CPU. On AMD Ryzen and EPYC systems, this can be counter-productive if the number of threads is high enough for the last-level caches to "overrun" each other and cause cache line flushes. The solution is to reduce the allocated size for these non_temporal stores, removing the number of threads from the equation. [Test Plan] Attached to this bug is a short C program that exercises memcpy() calls in buffers of variable length. This has been obtained from a similar bug report for Red Hat, and is publicly available at [0]. This test program was compiled with gcc 10.2.0, using the following flags: $ gcc -mtune=generic -march=x86_64 -g -03 test_memcpy.c -o test_memcpy64 Tests were performed with the following criteria: - use 32Mb buffers ("./test_memcpy64 32") - benchmark with the hyperfine tool [1], as it calculates relevant statistics automatically - benchmark with at least 10 runs in the same environment, to minimize variance - measure on AMD Zen (3700X) and on Intel Xeon (E5-2683), to ensure we don't penalize one x86 vendor in favor of the other Below is a comparison between two Focal containers, leveraging LXD to make use of different libc versions on the same host: $ hyperfine -n libc-2.31-0ubuntu9.2 'lxc exec focal ./test_memcpy64 32' -n libc-patched 'lxc exec focal-patched ./test_memcpy64 32' Benchmark #1: libc-2.31-0ubuntu9.2 Time (mean ± σ): 2.723 s ± 0.013 s [User: 4.7 ms, System: 5.1 ms] Range (min … max): 2.693 s … 2.735 s 10 runs Benchmark #2: libc-patched Time (mean ± σ): 1.522 s ± 0.004 s [User: 3.9 ms, System: 5.6 ms] Range (min … max): 1.515 s … 1.528 s 10 runs Summary 'libc-patched' ran 1.79 ± 0.01 times faster than 
'libc-2.31-0ubuntu9.2' [0] https://bugzilla.redhat.com/show_bug.cgi?id=1880670 [1] https://github.com/sharkdp/hyperfine/ [Where problems could occur] Since we're messing with the cacheinfo for x86 in general, we need to be careful not to introduce further performance regressions on memory-heavy workloads. Even though initial results might reveal improvement on AMD Ryzen and EPYC hardware, we should also validate different configurations (e.g. Intel, different buffer sizes, etc) to make sure we won't hurt performance in other non-AMD environments. [Other Info] This has been fixed by the following upstream commit: - d3c57027470b (Reversing calculation of __x86_shared_non_temporal_threshold) $ git describe --contains d3c57027470b glibc-2.33~455 $ rmadison glibc -s focal,focal-updates,groovy,groovy-proposed,hirsute glibc | 2.31-0ubuntu9 | focal | source glibc | 2.31-0ubuntu9.2 | focal-updates | source glibc | 2.32-0ubuntu3 | groovy | source glibc | 2.32-0ubuntu3.2 | groovy-proposed | source glibc | 2.33-0ubuntu5 | hirsute | source Affected releases include Ubuntu Focal and Groovy. Bionic is not affected, and releases starting with Hirsute already ship the upstream patch to fix this regression.	[Impact] On AMD Zen systems, memcpy() calls see a heavy performance regression in Focal and Groovy, due to the way __x86_non_temporal_threshold is calculated. Before 'glibc-2.33~455', cache values were calculated taking into consideration the number of hardware threads in the CPU. On AMD Ryzen and EPYC systems, this can be counter-productive if the number of threads is high enough for the last-level caches to "overrun" each other and cause cache line flushes. The solution is to reduce the allocated size for these non_temporal stores, removing the number of threads from the equation. [Test Plan] Attached to this bug is a short C program that exercises memcpy() calls in buffers of variable length. 
This has been obtained from a similar bug report for Red Hat, and is publicly available at [0]. This test program was compiled with gcc 10.2.0, using the following flags: $ gcc -mtune=generic -march=x86_64 -g -03 test_memcpy.c -o test_memcpy64 Tests were performed with the following criteria: - use 32Mb buffers ("./test_memcpy64 32") - benchmark with the hyperfine tool [1], as it calculates relevant statistics automatically - benchmark with at least 10 runs in the same environment, to minimize variance - measure on AMD Zen (3700X) and on Intel Xeon (E5-2683), to ensure we don't penalize one x86 vendor in favor of the other Below is a comparison between two Focal containers, leveraging LXD to make use of different libc versions on the same host: $ hyperfine -n libc-2.31-0ubuntu9.2 'lxc exec focal ./test_memcpy64 32' -n libc-patched 'lxc exec focal-patched ./test_memcpy64 32' Benchmark #1: libc-2.31-0ubuntu9.2 Time (mean ± σ): 2.723 s ± 0.013 s [User: 4.7 ms, System: 5.1 ms] Range (min … max): 2.693 s … 2.735 s 10 runs Benchmark #2: libc-patched Time (mean ± σ): 1.522 s ± 0.004 s [User: 3.9 ms, System: 5.6 ms] Range (min … max): 1.515 s … 1.528 s 10 runs Summary 'libc-patched' ran 1.79 ± 0.01 times faster than 'libc-2.31-0ubuntu9.2' $ head -n5 /proc/cpuinfo processor : 0 vendor_id : AuthenticAMD cpu family : 23 model : 113 model name : AMD Ryzen 7 3700X 8-Core Processor [0] https://bugzilla.redhat.com/show_bug.cgi?id=1880670 [1] https://github.com/sharkdp/hyperfine/ [Where problems could occur] Since we're messing with the cacheinfo for x86 in general, we need to be careful not to introduce further performance regressions on memory-heavy workloads. Even though initial results might reveal improvement on AMD Ryzen and EPYC hardware, we should also validate different configurations (e.g. Intel, different buffer sizes, etc) to make sure we won't hurt performance in other non-AMD environments. 
[Other Info] This has been fixed by the following upstream commit: - d3c57027470b (Reversing calculation of __x86_shared_non_temporal_threshold) $ git describe --contains d3c57027470b glibc-2.33~455 $ rmadison glibc -s focal,focal-updates,groovy,groovy-proposed,hirsute glibc | 2.31-0ubuntu9 | focal | source glibc | 2.31-0ubuntu9.2 | focal-updates | source glibc | 2.32-0ubuntu3 | groovy | source glibc | 2.32-0ubuntu3.2 | groovy-proposed | source glibc | 2.33-0ubuntu5 | hirsute | source Affected releases include Ubuntu Focal and Groovy. Bionic is not affected, and releases starting with Hirsute already ship the upstream patch to fix this regression.
2021-05-14 20:36:34	Heitor Alves de Siqueira	attachment added		test_memcpy.c https://bugs.launchpad.net/ubuntu/+source/glibc/+bug/1928508/+attachment/5497631/+files/test_memcpy.c
2021-05-14 21:48:14	Pedro Principeza	bug			added subscriber Pedro Principeza
2021-05-18 15:03:45	Dan Streetman	bug			added subscriber Dan Streetman
2021-05-18 17:57:30	Heitor Alves de Siqueira	description	[Impact] On AMD Zen systems, memcpy() calls see a heavy performance regression in Focal and Groovy, due to the way __x86_non_temporal_threshold is calculated. Before 'glibc-2.33~455', cache values were calculated taking into consideration the number of hardware threads in the CPU. On AMD Ryzen and EPYC systems, this can be counter-productive if the number of threads is high enough for the last-level caches to "overrun" each other and cause cache line flushes. The solution is to reduce the allocated size for these non_temporal stores, removing the number of threads from the equation. [Test Plan] Attached to this bug is a short C program that exercises memcpy() calls in buffers of variable length. This has been obtained from a similar bug report for Red Hat, and is publicly available at [0]. This test program was compiled with gcc 10.2.0, using the following flags: $ gcc -mtune=generic -march=x86_64 -g -03 test_memcpy.c -o test_memcpy64 Tests were performed with the following criteria: - use 32Mb buffers ("./test_memcpy64 32") - benchmark with the hyperfine tool [1], as it calculates relevant statistics automatically - benchmark with at least 10 runs in the same environment, to minimize variance - measure on AMD Zen (3700X) and on Intel Xeon (E5-2683), to ensure we don't penalize one x86 vendor in favor of the other Below is a comparison between two Focal containers, leveraging LXD to make use of different libc versions on the same host: $ hyperfine -n libc-2.31-0ubuntu9.2 'lxc exec focal ./test_memcpy64 32' -n libc-patched 'lxc exec focal-patched ./test_memcpy64 32' Benchmark #1: libc-2.31-0ubuntu9.2 Time (mean ± σ): 2.723 s ± 0.013 s [User: 4.7 ms, System: 5.1 ms] Range (min … max): 2.693 s … 2.735 s 10 runs Benchmark #2: libc-patched Time (mean ± σ): 1.522 s ± 0.004 s [User: 3.9 ms, System: 5.6 ms] Range (min … max): 1.515 s … 1.528 s 10 runs Summary 'libc-patched' ran 1.79 ± 0.01 times faster than 
'libc-2.31-0ubuntu9.2' $ head -n5 /proc/cpuinfo processor : 0 vendor_id : AuthenticAMD cpu family : 23 model : 113 model name : AMD Ryzen 7 3700X 8-Core Processor [0] https://bugzilla.redhat.com/show_bug.cgi?id=1880670 [1] https://github.com/sharkdp/hyperfine/ [Where problems could occur] Since we're messing with the cacheinfo for x86 in general, we need to be careful not to introduce further performance regressions on memory-heavy workloads. Even though initial results might reveal improvement on AMD Ryzen and EPYC hardware, we should also validate different configurations (e.g. Intel, different buffer sizes, etc) to make sure we won't hurt performance in other non-AMD environments. [Other Info] This has been fixed by the following upstream commit: - d3c57027470b (Reversing calculation of __x86_shared_non_temporal_threshold) $ git describe --contains d3c57027470b glibc-2.33~455 $ rmadison glibc -s focal,focal-updates,groovy,groovy-proposed,hirsute glibc | 2.31-0ubuntu9 | focal | source glibc | 2.31-0ubuntu9.2 | focal-updates | source glibc | 2.32-0ubuntu3 | groovy | source glibc | 2.32-0ubuntu3.2 | groovy-proposed | source glibc | 2.33-0ubuntu5 | hirsute | source Affected releases include Ubuntu Focal and Groovy. Bionic is not affected, and releases starting with Hirsute already ship the upstream patch to fix this regression.	[Impact] On AMD Zen systems, memcpy() calls see a heavy performance regression in Focal and Groovy, due to the way __x86_non_temporal_threshold is calculated. Before 'glibc-2.33~455', cache values were calculated taking into consideration the number of hardware threads in the CPU. On AMD Ryzen and EPYC systems, this can be counter-productive if the number of threads is high enough for the last-level caches to "overrun" each other and cause cache line flushes. The solution is to reduce the allocated size for these non_temporal stores, removing the number of threads from the equation. 
[Test Plan] Attached to this bug is a short C program that exercises memcpy() calls in buffers of variable length. This has been obtained from a similar bug report for Red Hat, and is publicly available at [0]. This test program was compiled with gcc 10.2.0, using the following flags: $ gcc -mtune=generic -march=x86_64 -g -03 test_memcpy.c -o test_memcpy64 Tests were performed with the following criteria: - use 32Mb buffers ("./test_memcpy64 32") - benchmark with the hyperfine tool [1], as it calculates relevant statistics automatically - benchmark with at least 10 runs in the same environment, to minimize variance - measure on AMD Zen (3700X) and on Intel Xeon (E5-2683), to ensure we don't penalize one x86 vendor in favor of the other Below is a comparison between two Focal containers, leveraging LXD to make use of different libc versions on the same host: $ hyperfine -n libc-2.31-0ubuntu9.2 'lxc exec focal ./test_memcpy64 32' -n libc-patched 'lxc exec focal-patched ./test_memcpy64 32' Benchmark #1: libc-2.31-0ubuntu9.2 Time (mean ± σ): 2.723 s ± 0.013 s [User: 4.7 ms, System: 5.1 ms] Range (min … max): 2.693 s … 2.735 s 10 runs Benchmark #2: libc-patched Time (mean ± σ): 1.522 s ± 0.004 s [User: 3.9 ms, System: 5.6 ms] Range (min … max): 1.515 s … 1.528 s 10 runs Summary 'libc-patched' ran 1.79 ± 0.01 times faster than 'libc-2.31-0ubuntu9.2' $ head -n5 /proc/cpuinfo processor : 0 vendor_id : AuthenticAMD cpu family : 23 model : 113 model name : AMD Ryzen 7 3700X 8-Core Processor [0] https://bugzilla.redhat.com/show_bug.cgi?id=1880670 [1] https://github.com/sharkdp/hyperfine/ [Where problems could occur] Since we're messing with the cacheinfo for x86 in general, we need to be careful not to introduce further performance regressions on memory-heavy workloads. Even though initial results might reveal improvement on AMD Ryzen and EPYC hardware, we should also validate different configurations (e.g. 
Intel, different buffer sizes, etc) to make sure we won't hurt performance in other non-AMD environments. [Other Info] This has been fixed by the following upstream commit: - d3c57027470b (Reversing calculation of __x86_shared_non_temporal_threshold) $ git describe --contains d3c57027470b glibc-2.33~455 $ rmadison glibc -s focal,focal-updates,groovy,groovy-proposed,hirsute glibc | 2.31-0ubuntu9 | focal | source glibc | 2.31-0ubuntu9.2 | focal-updates | source glibc | 2.32-0ubuntu3 | groovy | source glibc | 2.32-0ubuntu3.2 | groovy-proposed | source glibc | 2.33-0ubuntu5 | hirsute | source Affected releases include Ubuntu Focal and Groovy. Bionic is not affected, and releases starting with Hirsute already ship the upstream patch to fix this regression.
2021-05-28 13:40:09	Heitor Alves de Siqueira	glibc (Ubuntu Focal): status	Confirmed	In Progress
2021-05-28 13:40:11	Heitor Alves de Siqueira	glibc (Ubuntu Groovy): status	Confirmed	In Progress
2021-06-07 17:49:37	Heitor Alves de Siqueira	attachment added		lp1928508-focal.debdiff https://bugs.launchpad.net/ubuntu/+source/glibc/+bug/1928508/+attachment/5502960/+files/lp1928508-focal.debdiff
2021-06-07 17:49:56	Heitor Alves de Siqueira	attachment added		lp1928508-groovy.debdiff https://bugs.launchpad.net/ubuntu/+source/glibc/+bug/1928508/+attachment/5502961/+files/lp1928508-groovy.debdiff
2021-06-07 17:50:20	Heitor Alves de Siqueira	bug			added subscriber STS Sponsors
2021-06-08 13:39:06	Heitor Alves de Siqueira	attachment removed	lp1928508-focal.debdiff https://bugs.launchpad.net/ubuntu/+source/glibc/+bug/1928508/+attachment/5502960/+files/lp1928508-focal.debdiff
2021-06-08 13:39:24	Heitor Alves de Siqueira	attachment removed	lp1928508-groovy.debdiff https://bugs.launchpad.net/ubuntu/+source/glibc/+bug/1928508/+attachment/5502961/+files/lp1928508-groovy.debdiff
2021-06-08 13:39:56	Heitor Alves de Siqueira	attachment added		lp1928508-focal-v2.debdiff https://bugs.launchpad.net/ubuntu/+source/glibc/+bug/1928508/+attachment/5503157/+files/lp1928508-focal-v2.debdiff
2021-06-08 13:40:08	Heitor Alves de Siqueira	attachment added		lp1928508-groovy-v2.debdiff https://bugs.launchpad.net/ubuntu/+source/glibc/+bug/1928508/+attachment/5503158/+files/lp1928508-groovy-v2.debdiff
2021-06-10 17:05:44	Heitor Alves de Siqueira	description	[Impact] On AMD Zen systems, memcpy() calls see a heavy performance regression in Focal and Groovy, due to the way __x86_non_temporal_threshold is calculated. Before 'glibc-2.33~455', cache values were calculated taking into consideration the number of hardware threads in the CPU. On AMD Ryzen and EPYC systems, this can be counter-productive if the number of threads is high enough for the last-level caches to "overrun" each other and cause cache line flushes. The solution is to reduce the allocated size for these non_temporal stores, removing the number of threads from the equation. [Test Plan] Attached to this bug is a short C program that exercises memcpy() calls in buffers of variable length. This has been obtained from a similar bug report for Red Hat, and is publicly available at [0]. This test program was compiled with gcc 10.2.0, using the following flags: $ gcc -mtune=generic -march=x86_64 -g -03 test_memcpy.c -o test_memcpy64 Tests were performed with the following criteria: - use 32Mb buffers ("./test_memcpy64 32") - benchmark with the hyperfine tool [1], as it calculates relevant statistics automatically - benchmark with at least 10 runs in the same environment, to minimize variance - measure on AMD Zen (3700X) and on Intel Xeon (E5-2683), to ensure we don't penalize one x86 vendor in favor of the other Below is a comparison between two Focal containers, leveraging LXD to make use of different libc versions on the same host: $ hyperfine -n libc-2.31-0ubuntu9.2 'lxc exec focal ./test_memcpy64 32' -n libc-patched 'lxc exec focal-patched ./test_memcpy64 32' Benchmark #1: libc-2.31-0ubuntu9.2 Time (mean ± σ): 2.723 s ± 0.013 s [User: 4.7 ms, System: 5.1 ms] Range (min … max): 2.693 s … 2.735 s 10 runs Benchmark #2: libc-patched Time (mean ± σ): 1.522 s ± 0.004 s [User: 3.9 ms, System: 5.6 ms] Range (min … max): 1.515 s … 1.528 s 10 runs Summary 'libc-patched' ran 1.79 ± 0.01 times faster than 
'libc-2.31-0ubuntu9.2' $ head -n5 /proc/cpuinfo processor : 0 vendor_id : AuthenticAMD cpu family : 23 model : 113 model name : AMD Ryzen 7 3700X 8-Core Processor [0] https://bugzilla.redhat.com/show_bug.cgi?id=1880670 [1] https://github.com/sharkdp/hyperfine/ [Where problems could occur] Since we're messing with the cacheinfo for x86 in general, we need to be careful not to introduce further performance regressions on memory-heavy workloads. Even though initial results might reveal improvement on AMD Ryzen and EPYC hardware, we should also validate different configurations (e.g. Intel, different buffer sizes, etc) to make sure we won't hurt performance in other non-AMD environments. [Other Info] This has been fixed by the following upstream commit: - d3c57027470b (Reversing calculation of __x86_shared_non_temporal_threshold) $ git describe --contains d3c57027470b glibc-2.33~455 $ rmadison glibc -s focal,focal-updates,groovy,groovy-proposed,hirsute glibc | 2.31-0ubuntu9 | focal | source glibc | 2.31-0ubuntu9.2 | focal-updates | source glibc | 2.32-0ubuntu3 | groovy | source glibc | 2.32-0ubuntu3.2 | groovy-proposed | source glibc | 2.33-0ubuntu5 | hirsute | source Affected releases include Ubuntu Focal and Groovy. Bionic is not affected, and releases starting with Hirsute already ship the upstream patch to fix this regression.	[Impact] On AMD Zen systems, memcpy() calls see a heavy performance regression in Focal and Groovy, due to the way __x86_non_temporal_threshold is calculated. Before 'glibc-2.33~455', cache values were calculated taking into consideration the number of hardware threads in the CPU. On AMD Ryzen and EPYC systems, this can be counter-productive if the number of threads is high enough for the last-level caches to "overrun" each other and cause cache line flushes. The solution is to reduce the allocated size for these non_temporal stores, removing the number of threads from the equation. 
[Test Plan] Attached to this bug is a short C program that exercises memcpy() calls in buffers of variable length. This has been obtained from a similar bug report for Red Hat, and is publicly available at [0]. This test program was compiled with gcc 10.2.0, using the following flags: $ gcc -mtune=generic -march=x86_64 -g -03 test_memcpy.c -o test_memcpy64 Tests were performed with the following criteria: - use 32Mb buffers ("./test_memcpy64 32") - benchmark with the hyperfine tool [1], as it calculates relevant statistics automatically - benchmark with at least 10 runs in the same environment, to minimize variance - measure on AMD Zen (3700X) and on Intel Xeon (E5-2683), to ensure we don't penalize one x86 vendor in favor of the other Below is a comparison between two Focal containers, leveraging LXD to make use of different libc versions on the same host: $ hyperfine -n libc-2.31-0ubuntu9.2 'lxc exec focal ./test_memcpy64 32' -n libc-patched 'lxc exec focal-patched ./test_memcpy64 32' Benchmark #1: libc-2.31-0ubuntu9.2 Time (mean ± σ): 2.723 s ± 0.013 s [User: 4.7 ms, System: 5.1 ms] Range (min … max): 2.693 s … 2.735 s 10 runs Benchmark #2: libc-patched Time (mean ± σ): 1.522 s ± 0.004 s [User: 3.9 ms, System: 5.6 ms] Range (min … max): 1.515 s … 1.528 s 10 runs Summary 'libc-patched' ran 1.79 ± 0.01 times faster than 'libc-2.31-0ubuntu9.2' $ head -n5 /proc/cpuinfo processor : 0 vendor_id : AuthenticAMD cpu family : 23 model : 113 model name : AMD Ryzen 7 3700X 8-Core Processor [0] https://bugzilla.redhat.com/show_bug.cgi?id=1880670 [1] https://github.com/sharkdp/hyperfine/ [Where problems could occur] Since we're messing with the cacheinfo for x86 in general, we need to be careful not to introduce further performance regressions on memory-heavy workloads. Even though initial results might reveal improvement on AMD Ryzen and EPYC hardware, we should also validate different configurations (e.g. 
Intel, different buffer sizes, etc) to make sure we won't hurt performance in other non-AMD environments. [Other Info] This issue has been fixed by the following upstream commit: - d3c57027470b (Reversing calculation of __x86_shared_non_temporal_threshold) $ git describe --contains d3c57027470b glibc-2.33~455 $ rmadison glibc -s focal,focal-updates,groovy,groovy-proposed,hirsute glibc | 2.31-0ubuntu9 | focal | source glibc | 2.31-0ubuntu9.2 | focal-updates | source glibc | 2.32-0ubuntu3 | groovy | source glibc | 2.32-0ubuntu3.2 | groovy-proposed | source glibc | 2.33-0ubuntu5 | hirsute | source Affected releases include Ubuntu Focal and Groovy. Bionic is not affected, and releases starting with Hirsute already ship the upstream patch to fix this regression. glibc exports this specific variable as a tunable, so we could also tweak it with the GLIBC_TUNABLES env var: $ hyperfine -n clean-env 'lxc exec focal env ./test_memcpy64 32' -n tunables 'lxc exec focal env GLIBC_TUNABLES=glibc.cpu.x86_non_temporal_threshold=102410243*4 ./test_memcpy64 32' Benchmark #1: clean-env Time (mean ± σ): 2.529 s ± 0.061 s [User: 6.0 ms, System: 4.7 ms] Range (min … max): 2.457 s … 2.615 s 10 runs Benchmark #2: tunables Time (mean ± σ): 1.427 s ± 0.030 s [User: 6.5 ms, System: 3.8 ms] Range (min … max): 1.402 s … 1.482 s 10 runs Summary 'tunables' ran 1.77 ± 0.06 times faster than 'clean-env' This solution is not ideal, but it offers a secondary way of fixing the performance issues. However, the speed gains for memcpy() are noticeable enough that we should strongly consider changing the defaults in the Focal LTS release, so that it performs similarly to Bionic and future Ubuntu releases starting with Hirsute.
2021-06-15 19:35:04	Heitor Alves de Siqueira	glibc (Ubuntu Groovy): status	In Progress	Won't Fix
2021-10-01 11:58:28	Dan Streetman	removed subscriber STS Sponsors
2021-11-29 02:27:50	Michael Hudson-Doyle	description	[Impact] On AMD Zen systems, memcpy() calls see a heavy performance regression in Focal and Groovy, due to the way __x86_non_temporal_threshold is calculated. Before 'glibc-2.33~455', cache values were calculated taking into consideration the number of hardware threads in the CPU. On AMD Ryzen and EPYC systems, this can be counter-productive if the number of threads is high enough for the last-level caches to "overrun" each other and cause cache line flushes. The solution is to reduce the allocated size for these non_temporal stores, removing the number of threads from the equation. [Test Plan] Attached to this bug is a short C program that exercises memcpy() calls in buffers of variable length. This has been obtained from a similar bug report for Red Hat, and is publicly available at [0]. This test program was compiled with gcc 10.2.0, using the following flags: $ gcc -mtune=generic -march=x86_64 -g -03 test_memcpy.c -o test_memcpy64 Tests were performed with the following criteria: - use 32Mb buffers ("./test_memcpy64 32") - benchmark with the hyperfine tool [1], as it calculates relevant statistics automatically - benchmark with at least 10 runs in the same environment, to minimize variance - measure on AMD Zen (3700X) and on Intel Xeon (E5-2683), to ensure we don't penalize one x86 vendor in favor of the other Below is a comparison between two Focal containers, leveraging LXD to make use of different libc versions on the same host: $ hyperfine -n libc-2.31-0ubuntu9.2 'lxc exec focal ./test_memcpy64 32' -n libc-patched 'lxc exec focal-patched ./test_memcpy64 32' Benchmark #1: libc-2.31-0ubuntu9.2 Time (mean ± σ): 2.723 s ± 0.013 s [User: 4.7 ms, System: 5.1 ms] Range (min … max): 2.693 s … 2.735 s 10 runs Benchmark #2: libc-patched Time (mean ± σ): 1.522 s ± 0.004 s [User: 3.9 ms, System: 5.6 ms] Range (min … max): 1.515 s … 1.528 s 10 runs Summary 'libc-patched' ran 1.79 ± 0.01 times faster than 
'libc-2.31-0ubuntu9.2' $ head -n5 /proc/cpuinfo processor : 0 vendor_id : AuthenticAMD cpu family : 23 model : 113 model name : AMD Ryzen 7 3700X 8-Core Processor [0] https://bugzilla.redhat.com/show_bug.cgi?id=1880670 [1] https://github.com/sharkdp/hyperfine/ [Where problems could occur] Since we're messing with the cacheinfo for x86 in general, we need to be careful not to introduce further performance regressions on memory-heavy workloads. Even though initial results might reveal improvement on AMD Ryzen and EPYC hardware, we should also validate different configurations (e.g. Intel, different buffer sizes, etc) to make sure we won't hurt performance in other non-AMD environments. [Other Info] This issue has been fixed by the following upstream commit: - d3c57027470b (Reversing calculation of __x86_shared_non_temporal_threshold) $ git describe --contains d3c57027470b glibc-2.33~455 $ rmadison glibc -s focal,focal-updates,groovy,groovy-proposed,hirsute glibc | 2.31-0ubuntu9 | focal | source glibc | 2.31-0ubuntu9.2 | focal-updates | source glibc | 2.32-0ubuntu3 | groovy | source glibc | 2.32-0ubuntu3.2 | groovy-proposed | source glibc | 2.33-0ubuntu5 | hirsute | source Affected releases include Ubuntu Focal and Groovy. Bionic is not affected, and releases starting with Hirsute already ship the upstream patch to fix this regression. 
glibc exports this specific variable as a tunable, so we could also tweak it with the GLIBC_TUNABLES env var: $ hyperfine -n clean-env 'lxc exec focal env ./test_memcpy64 32' -n tunables 'lxc exec focal env GLIBC_TUNABLES=glibc.cpu.x86_non_temporal_threshold=1024*1024*3*4 ./test_memcpy64 32' Benchmark #1: clean-env Time (mean ± σ): 2.529 s ± 0.061 s [User: 6.0 ms, System: 4.7 ms] Range (min … max): 2.457 s … 2.615 s 10 runs Benchmark #2: tunables Time (mean ± σ): 1.427 s ± 0.030 s [User: 6.5 ms, System: 3.8 ms] Range (min … max): 1.402 s … 1.482 s 10 runs Summary 'tunables' ran 1.77 ± 0.06 times faster than 'clean-env' This solution is not ideal, but it offers a secondary way of fixing the performance issues. However, the speed gains for memcpy() are noticeable enough that we should strongly consider changing the defaults in the Focal LTS release, so that it performs similarly to Bionic and future Ubuntu releases starting with Hirsute.	[Impact] On AMD Zen systems, memcpy() calls see a heavy performance regression in Focal and Groovy, due to the way __x86_non_temporal_threshold is calculated. Before 'glibc-2.33~455', cache values were calculated taking into consideration the number of hardware threads in the CPU. On AMD Ryzen and EPYC systems, this can be counter-productive if the number of threads is high enough for the last-level caches to "overrun" each other and cause cache line flushes. The solution is to reduce the allocated size for these non_temporal stores, removing the number of threads from the equation. [Test Plan] Compile the test_memcpy.c that is attached to this bug report: $ gcc -mtune=generic -march=x86-64 -g -O3 test_memcpy.c -o test_memcpy64 This should be run before and after installing the libc packages from proposed. On Ryzen and EPYC systems a substantial improvement should be seen and on other systems, no significant change should be seen. 
[Where problems could occur] Since we're messing with the cacheinfo for x86 in general, we need to be careful not to introduce further performance regressions on memory-heavy workloads. Even though initial results might reveal improvement on AMD Ryzen and EPYC hardware, we should also validate different configurations (e.g. Intel, different buffer sizes, etc) to make sure we won't hurt performance in other non-AMD environments. [Other Info] This issue has been fixed by the following upstream commit: - d3c57027470b (Reversing calculation of __x86_shared_non_temporal_threshold) $ git describe --contains d3c57027470b glibc-2.33~455 $ rmadison glibc -s focal,focal-updates,groovy,groovy-proposed,hirsute glibc | 2.31-0ubuntu9 | focal | source glibc | 2.31-0ubuntu9.2 | focal-updates | source glibc | 2.32-0ubuntu3 | groovy | source glibc | 2.32-0ubuntu3.2 | groovy-proposed | source glibc | 2.33-0ubuntu5 | hirsute | source Affected releases include Ubuntu Focal and Groovy. Bionic is not affected, and releases starting with Hirsute already ship the upstream patch to fix this regression. glibc exports this specific variable as a tunable, so we could also tweak it with the GLIBC_TUNABLES env var: $ hyperfine -n clean-env 'lxc exec focal env ./test_memcpy64 32' -n tunables 'lxc exec focal env GLIBC_TUNABLES=glibc.cpu.x86_non_temporal_threshold=1024*1024*3*4 ./test_memcpy64 32' Benchmark #1: clean-env Time (mean ± σ): 2.529 s ± 0.061 s [User: 6.0 ms, System: 4.7 ms] Range (min … max): 2.457 s … 2.615 s 10 runs Benchmark #2: tunables Time (mean ± σ): 1.427 s ± 0.030 s [User: 6.5 ms, System: 3.8 ms] Range (min … max): 1.402 s … 1.482 s 10 runs Summary 'tunables' ran 1.77 ± 0.06 times faster than 'clean-env' This solution is not ideal, but it offers a secondary way of fixing the performance issues. 
However, the speed gains for memcpy() are noticeable enough that we should strongly consider changing the defaults in the Focal LTS release, so that it performs similarly to Bionic and future Ubuntu releases starting with Hirsute. [old test case section] Attached to this bug is a short C program that exercises memcpy() calls in buffers of variable length. This has been obtained from a similar bug report for Red Hat, and is publicly available at [0]. This test program was compiled with gcc 10.2.0, using the following flags: $ gcc -mtune=generic -march=x86_64 -g -03 test_memcpy.c -o test_memcpy64 Tests were performed with the following criteria: - use 32Mb buffers ("./test_memcpy64 32") - benchmark with the hyperfine tool [1], as it calculates relevant statistics automatically - benchmark with at least 10 runs in the same environment, to minimize variance - measure on AMD Zen (3700X) and on Intel Xeon (E5-2683), to ensure we don't penalize one x86 vendor in favor of the other Below is a comparison between two Focal containers, leveraging LXD to make use of different libc versions on the same host: $ hyperfine -n libc-2.31-0ubuntu9.2 'lxc exec focal ./test_memcpy64 32' -n libc-patched 'lxc exec focal-patched ./test_memcpy64 32' Benchmark #1: libc-2.31-0ubuntu9.2 Time (mean ± σ): 2.723 s ± 0.013 s [User: 4.7 ms, System: 5.1 ms] Range (min … max): 2.693 s … 2.735 s 10 runs Benchmark #2: libc-patched Time (mean ± σ): 1.522 s ± 0.004 s [User: 3.9 ms, System: 5.6 ms] Range (min … max): 1.515 s … 1.528 s 10 runs Summary 'libc-patched' ran 1.79 ± 0.01 times faster than 'libc-2.31-0ubuntu9.2' $ head -n5 /proc/cpuinfo processor : 0 vendor_id : AuthenticAMD cpu family : 23 model : 113 model name : AMD Ryzen 7 3700X 8-Core Processor [0] https://bugzilla.redhat.com/show_bug.cgi?id=1880670 [1] https://github.com/sharkdp/hyperfine/
2021-12-13 06:03:27	Steve Langasek	glibc (Ubuntu Focal): status	In Progress	Fix Committed
2021-12-13 06:03:31	Steve Langasek	bug			added subscriber Ubuntu Stable Release Updates Team
2021-12-13 06:03:33	Steve Langasek	bug			added subscriber SRU Verification
2021-12-13 06:03:39	Steve Langasek	tags	sts	sts verification-needed verification-needed-focal
2021-12-15 17:25:24	Heitor Alves de Siqueira	tags	sts verification-needed verification-needed-focal	sts verification-done verification-done-focal
2022-02-10 21:49:06	Brian Murray	tags	sts verification-done verification-done-focal	sts verification-needed verification-needed-focal
2022-03-23 18:42:41	Simon Déziel	bug			added subscriber Simon Déziel
2022-04-26 13:56:20	Heitor Alves de Siqueira	tags	sts verification-needed verification-needed-focal	sts verification-done verification-done-focal
2022-05-11 01:44:50	Chris Halse Rogers	removed subscriber Ubuntu Stable Release Updates Team
2022-05-11 01:47:31	Launchpad Janitor	glibc (Ubuntu Focal): status	Fix Committed	Fix Released
