Bug #2030515 “Terrible memcpy performance on Zen 3 when using re...” : Bugs : glibc package : Ubuntu

Revision history for this message

Bruce Merry (bmerry) wrote on 2023-08-07:

#1

Dependencies.txt Edit (152 bytes, text/plain; charset="utf-8")
ProcCpuinfoMinimal.txt Edit (1.6 KiB, text/plain; charset="utf-8")
ProcEnviron.txt Edit (306 bytes, text/plain; charset="utf-8")

summary:

- Terribly memcpy performance on Zen 3 when using rep movsb
+ Terrible memcpy performance on Zen 3 when using rep movsb

Revision history for this message

Florian Weimer (fweimer) wrote on 2023-08-07:

#2

Does your glibc already have ld.so --list-diagnostics? It would be good to know the x86.cpu_features.shared_cache_size and x86.cpu_features.non_temporal_threshold values.

I suspect you are on a machine with a large core count, so the heuristic tunes things down to a low non-temporal threshold.

Revision history for this message

Bruce Merry (bmerry) wrote on 2023-08-07:

#3

Output of /lib64/ld-linux-x86-64.so.2 --list-diagnostics Edit (8.6 KiB, text/plain)

It's a 16-core processor. I've attached the diagnostics.

If I run

GLIBC_TUNABLES=glibc.cpu.x86_rep_movsb_threshold=2114 ./memcpy_loop -b 2113 -p 1000000 -t mmap -S 0 -D 1 0

then I hit the high performance again (>100GB/s).

Revision history for this message

Bruce Merry (bmerry) wrote on 2023-08-14:

#4

For what it's worth, I've also tried upgrading the microcode to the latest version from linux-firmware (0xa0011d1) and it made no difference.

Revision history for this message

Bruce Merry (bmerry) wrote on 2023-10-09:

#5

This also affects Zen 4 (Genoa): 4 GB/s when using 2113 byte copies (which uses REP MOVSB), 115 GB/s when using 2112 byte copies (which does not).

This is with default BIOS settings. It looks like the BIOS on this machine allows the ERMS and FSRM CPUID flags to be turned off.

Revision history for this message

In Sourceware.org Bugzilla #30994, Bruce Merry (bmerry-q) wrote on 2023-10-24:

#7

When (dst-src)&0xFFF is small (but non-zero), the REP MOVSB path in memcpy performs extremely poorly (as much as 25x slower than the alternative path). I'm observing this on Zen 4 (Epyc 9374F). I'm running Ubuntu 22.04 with a glibc hand-built from glibc-2.38.9000-185-g2aa0974d25.

To reproduce:
1. Download the microbench at https://github.com/ska-sa/katgpucbf/blob/6176ed2e1f5eccf7f2acc97e4779141ac794cc01/scratch/memcpy_loop.cpp
2. Compile it with the adjacent Makefile (tl;dr: g++ -std=c++17 -O3 -pthread -o memcpy_loop memcpy_loop.cpp)
3. Run ./memcpy_loop -t mmap -f memcpy -b 8192 -p 100000 -D 1 -r 5
4. Run GLIBC_TUNABLES=glibc.cpu.x86_rep_movsb_threshold=10000 ./memcpy_loop -t mmap -f memcpy -b 8192 -p 100000 -D 1 -r 5

Step 3 reports a rate of 4.2 GB/s, while step 4 (which disables the rep_movsb path) reports a rate of 111 GB/s. The test uses 8192-byte memory copies, where the source is page-aligned and the destination starts 1 byte into a page.

I'll also attach the bench-memcpy-large.out, which shows similar results.

I've previously filed this as an Ubuntu bug (https://bugs.launchpad.net/ubuntu/+source/glibc/+bug/2030515) but it doesn't seem to have received much attention.

Revision history for this message

In Sourceware.org Bugzilla #30994, Bruce Merry (bmerry-q) wrote on 2023-10-24:

#8

Created attachment 15193
Glibc's memcpy benchmark results

Revision history for this message

In Sourceware.org Bugzilla #30994, Bruce Merry (bmerry-q) wrote on 2023-10-24:

#9

Created attachment 15194
Output of ld-linux.so.2 --list-tunables

Revision history for this message

In Sourceware.org Bugzilla #30994, Bruce Merry (bmerry-q) wrote on 2023-10-24:

#10

Created attachment 15195
Output of ld-linux.so.2 --list-diagnostics

Revision history for this message

Bruce Merry (bmerry) wrote on 2023-10-24:

#6

I've raised this upstream as https://sourceware.org/bugzilla/show_bug.cgi?id=30994

Revision history for this message

In Sourceware.org Bugzilla #30994, Bruce Merry (bmerry-q) wrote on 2023-10-24:

#11

This issue also affects Zen 3. Zen 2 doesn't advertise ERMS so memcpy isn't affected.

Revision history for this message

In Sourceware.org Bugzilla #30994, Bruce Merry (bmerry-q) wrote on 2023-10-25:

#12

FWIW, backwards REP MOVSB (std; rep movsb; cld) is still horribly slow on Zen 4 (4 GB/s even when the data is nicely aligned and cached).

Revision history for this message

In Sourceware.org Bugzilla #30994, Adhemerval Zanella (adhemerval-zanella) wrote on 2023-10-27:

#13

Download full text (4.3 KiB)

I have access to a Zen3 code (5900X) and I can confirm that using REP MOVSB seems to be always worse than vector instructions. ERMS is used for sizes between 2112 (rep_movsb_threshold) and 524288 (rep_movsb_stop_threshold or the L2 size for Zen3) and the '-S 0 -D 1' performance really seems to be a microcode since I don't see similar performance difference with other alignments.

On Zen3 with REP MOVSB I see:

$ ./testrun.sh ./memcpy_loop -t mmap -f memcpy -b 2113 -p 100000 -D 0
84.2448 GB/s

$ ./testrun.sh ./memcpy_loop -t mmap -f memcpy -b 2113 -p 100000 -D 0 `seq -s' ' 0 2 23`
506.099 GB/s

$ ./testrun.sh ./memcpy_loop -t mmap -f memcpy -b 2113 -p 100000 -D 0 `seq -s' ' 0 23`
990.845 GB/s

$ ./testrun.sh ./memcpy_loop -t mmap -f memcpy -b 524287 -p 100000 -D 0
57.1122 GB/s

$ ./testrun.sh ./memcpy_loop -t mmap -f memcpy -b 524287 -p 100000 -D 0 `seq -s' ' 0 2 23`
325.409 GB/s

$ ./testrun.sh ./memcpy_loop -t mmap -f memcpy -b 524287 -p 100000 -D 0 `seq -s' ' 0 23`
510.87 GB/s

$ ./testrun.sh ./memcpy_loop -t mmap -f memcpy -b 2113 -p 100000 -D 15
4.43104 GB/s

$ ./testrun.sh ./memcpy_loop -t mmap -f memcpy -b 2113 -p 100000 -D 15 `seq -s' ' 0 2 23`
22.4551 GB/s

$ ./testrun.sh ./memcpy_loop -t mmap -f memcpy -b 2113 -p 100000 -D 15 `seq -s' ' 0 23`
40.4088 GB/s

$ ./testrun.sh ./memcpy_loop -t mmap -f memcpy -b 524287 -p 100000 -D 15
4.34671 GB/s

$ ./testrun.sh ./memcpy_loop -t mmap -f memcpy -b 524287 -p 100000 -D 15 `seq -s' ' 0 2 23`
22.0829 GB/s

$ ./testrun.sh ./memcpy_loop -t mmap -f memcpy -b 524287 -p 100000 -D 15 `seq -s' ' 0 23`

While with vectorized instructions I see:

$ GLIBC_TUNABLES=glibc.cpu.x86_rep_movsb_threshold=1000000 ./testrun.sh ./memcpy_loop -t mmap -f memcpy -b 2113 -p 100000 -D 0
124.183 GB/s

$ GLIBC_TUNABLES=glibc.cpu.x86_rep_movsb_threshold=1000000 ./testrun.sh ./memcpy_loop -t mmap -f memcpy -b 2113 -p 100000 -D 0 `seq -s' ' 0 2 23`
773.696 GB/s

$ GLIBC_TUNABLES=glibc.cpu.x86_rep_movsb_threshold=1000000 ./testrun.sh ./memcpy_loop -t mmap -f memcpy -b 2113 -p 100000 -D 0 `seq -s' ' 0 23`
1413.02 GB/s

$ GLIBC_TUNABLES=glibc.cpu.x86_rep_movsb_threshold=1000000 ./testrun.sh ./memcpy_loop -t mmap -f memcpy -b 524287 -p 100000 -D 0
58.3212 GB/s

$ GLIBC_TUNABLES=glibc.cpu.x86_rep_movsb_threshold=1000000 ./testrun.sh ./memcpy_loop -t mmap -f memcpy -b 524287 -p 100000 -D 0 `seq -s' ' 0 2 23`
322.583 GB/s

$ GLIBC_TUNABLES=glibc.cpu.x86_rep_movsb_threshold=1000000 ./testrun.sh ./memcpy_loop -t mmap -f memcpy -b 524287 -p 100000 -D 0 `seq -s' ' 0 23`
506.116 GB/s

$ GLIBC_TUNABLES=glibc.cpu.x86_rep_movsb_threshold=1000000 ./testrun.sh ./memcpy_loop -t mmap -f memcpy -b 2113 -p 100000 -D 15
121.872 GB/s

$ GLIBC_TUNABLES=glibc.cpu.x86_rep_movsb_threshold=1000000 ./testrun.sh ./memcpy_loop -t mmap -f memcpy -b 2113 -p 100000 -D 15 `seq -s' ' 0 2 23`
717.717 GB/s

$ GLIBC_TUNABLES=glibc.cpu.x86_rep_movsb_threshold=1000000 ./testrun.sh ./memcpy_loop -t mmap -f memcpy -b 2113 -p 100000 -D 15 `seq -s' ' 0 23`
1318.17 GB/s

$ GLIBC_TUNABLES=glibc.cpu.x86_rep_movsb_threshold=1000000 ./testrun.sh ./memcpy_loop -t mmap -f memcpy -b 524287 -p 100000 -D 15
58.5352 GB/s

$ GLIBC_TUNABLES=glibc.cpu.x86_rep_movsb_threshold...

I have access to a Zen3 code (5900X) and I can confirm that using REP MOVSB seems to be always worse than vector instructions.  ERMS is used for sizes between 2112 (rep_movsb_threshold) and 524288 (rep_movsb_stop_threshold or the L2 size for Zen3) and the '-S 0 -D 1' performance really seems to be a microcode since I don't see similar performance difference with other alignments.