Terrible memcpy performance on Zen 3 when using rep movsb

Bug #2030515 reported by Bruce Merry
Affects          Status   Importance   Assigned to   Milestone
GLibC            New      Low
glibc (Ubuntu)   New      Undecided    Unassigned

Bug Description

On CPUs that advertise FSRM (fast short rep movsb), glibc 2.35 uses REP MOVSB for memcpy for sizes above 2112 (up to some threshold that depends on the cache size). Unfortunately, it seems that Zen 3 (at least in the microcode we're running) is extremely slow at REP MOVSB when the data are not well-aligned.
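
For reference, the ERMS path that glibc selects in this size range boils down to a single REP MOVSB instruction. A minimal sketch of such a copy (an illustration using GCC/Clang inline asm, not the actual glibc implementation, and similar in spirit to the memcpy_rep_movsb function selectable in the benchmark linked below):

#include <cstddef>

// Copy n bytes with REP MOVSB: RDI = destination, RSI = source, RCX = count.
// x86-64 only; the constraints let the instruction update the registers in place.
static void *memcpy_erms(void *dst, const void *src, std::size_t n)
{
    void *ret = dst;
    asm volatile("rep movsb"
                 : "+D"(dst), "+S"(src), "+c"(n)
                 :
                 : "memory");
    return ret;
}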

I've found this using a memcpy benchmark at https://github.com/ska-sa/katgpucbf/blob/69752be58fb8ab0668ada806e0fd809e782cc58b/scratch/memcpy_loop.cpp (compiled with the adjacent Makefile). To demonstrate the issue, run

./memcpy_loop -b 2113 -p 1000000 -t mmap -S 0 -D 1 0

This runs:
- 2113-byte memory copies
- 1,000,000 times per timing measurement
- in memory allocated with mmap
- with the source 0 bytes from the start of the page
- with the destination 1 byte from the start of the page
- on core 0.

It reports about 3.2 GB/s. Change the -b argument to 2111 and it reports over 100 GB/s. So the REP MOVSB case is about 30× slower!

This will most likely need to be reported and fixed upstream, but I'm reporting it to Ubuntu first since I don't know if Ubuntu has modified glibc in any way that would be significant.

See also: https://xuanwo.io/2023/04-rust-std-fs-slower-than-python/

ProblemType: Bug
DistroRelease: Ubuntu 22.04
Package: libc6 2.35-0ubuntu3.1
ProcVersionSignature: Ubuntu 5.19.0-46.47~22.04.1-generic 5.19.17
Uname: Linux 5.19.0-46-generic x86_64
NonfreeKernelModules: nvidia_modeset nvidia
ApportVersion: 2.20.11-0ubuntu82.5
Architecture: amd64
CasperMD5CheckResult: unknown
Date: Mon Aug 7 14:02:28 2023
RebootRequiredPkgs: Error: path contained symlinks.
SourcePackage: glibc
UpgradeStatus: No upgrade log present (probably fresh install)

Revision history for this message
Bruce Merry (bmerry) wrote :
summary: - Terribly memcpy performance on Zen 3 when using rep movsb
+ Terrible memcpy performance on Zen 3 when using rep movsb
Revision history for this message
Florian Weimer (fweimer) wrote :

Does your glibc already have ld.so --list-diagnostics? It would be good to know the x86.cpu_features.shared_cache_size and x86.cpu_features.non_temporal_threshold values.

I suspect you are on a machine with a large core count, so the heuristic tunes things down to a low non-temporal threshold.
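
For anyone wanting to check this locally: on an Ubuntu amd64 install the dynamic loader is usually /usr/lib/x86_64-linux-gnu/ld-linux-x86-64.so.2 (the exact path is an assumption and may differ between installs), and the fields mentioned above can be filtered out with something like:

/usr/lib/x86_64-linux-gnu/ld-linux-x86-64.so.2 --list-diagnostics | grep -E 'shared_cache_size|non_temporal_threshold'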

Revision history for this message
Bruce Merry (bmerry) wrote :

It's a 16-core processor. I've attached the diagnostics.

If I run

GLIBC_TUNABLES=glibc.cpu.x86_rep_movsb_threshold=2114 ./memcpy_loop -b 2113 -p 1000000 -t mmap -S 0 -D 1 0

then I get the high performance again (>100 GB/s).

Revision history for this message
Bruce Merry (bmerry) wrote :

For what it's worth, I've also tried upgrading the microcode to the latest version from linux-firmware (0xa0011d1) and it made no difference.

Revision history for this message
Bruce Merry (bmerry) wrote :

This also affects Zen 4 (Genoa): 4 GB/s when using 2113 byte copies (which uses REP MOVSB), 115 GB/s when using 2112 byte copies (which does not).

This is with default BIOS settings. It looks like the BIOS on this machine allows the ERMS and FSRM CPUID flags to be turned off.
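
As an aside, a quick way to confirm whether the kernel sees those CPUID flags (assuming a Linux /proc/cpuinfo) is something like:

grep -wo -E 'erms|fsrm' /proc/cpuinfo | sort -u

which should print both "erms" and "fsrm" when the features are enabled.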

Revision history for this message
In , Bruce Merry (bmerry-q) wrote :

When (dst-src)&0xFFF is small (but non-zero), the REP MOVSB path in memcpy performs extremely poorly (as much as 25x slower than the alternative path). I'm observing this on Zen 4 (Epyc 9374F). I'm running Ubuntu 22.04 with a glibc hand-built from glibc-2.38.9000-185-g2aa0974d25.

To reproduce:
1. Download the microbench at https://github.com/ska-sa/katgpucbf/blob/6176ed2e1f5eccf7f2acc97e4779141ac794cc01/scratch/memcpy_loop.cpp
2. Compile it with the adjacent Makefile (tl;dr: g++ -std=c++17 -O3 -pthread -o memcpy_loop memcpy_loop.cpp)
3. Run ./memcpy_loop -t mmap -f memcpy -b 8192 -p 100000 -D 1 -r 5
4. Run GLIBC_TUNABLES=glibc.cpu.x86_rep_movsb_threshold=10000 ./memcpy_loop -t mmap -f memcpy -b 8192 -p 100000 -D 1 -r 5

Step 3 reports a rate of 4.2 GB/s, while step 4 (which disables the rep_movsb path) reports a rate of 111 GB/s. The test uses 8192-byte memory copies, where the source is page-aligned and the destination starts 1 byte into a page.
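
For readers without the benchmark to hand, here is a rough standalone approximation of the same scenario (page-aligned source, destination 1 byte into its page, 8192-byte copies). This is only a sketch, not the benchmark itself, and it reports an approximate rate:

#include <chrono>
#include <cstddef>
#include <cstdio>
#include <cstring>
#include <sys/mman.h>

int main()
{
    const std::size_t size = 8192, passes = 100000;
    // mmap returns page-aligned memory; map a little extra so dst + 1 stays in range.
    char *src = static_cast<char *>(mmap(nullptr, size + 4096, PROT_READ | PROT_WRITE,
                                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0));
    char *dst = static_cast<char *>(mmap(nullptr, size + 4096, PROT_READ | PROT_WRITE,
                                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0));
    if (src == static_cast<char *>(MAP_FAILED) || dst == static_cast<char *>(MAP_FAILED))
        return 1;
    std::memset(src, 1, size + 4096);   // fault the pages in before timing
    std::memset(dst, 1, size + 4096);
    auto start = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < passes; ++i) {
        std::memcpy(dst + 1, src, size);  // source page-aligned, destination off by one
        asm volatile("" ::: "memory");    // barrier so the copies are not optimised away
    }
    auto stop = std::chrono::steady_clock::now();
    double secs = std::chrono::duration<double>(stop - start).count();
    std::printf("%.2f GB/s\n", size * passes / secs / 1e9);
}

Compile with something like g++ -O2 -o repro repro.cpp and run it with and without the GLIBC_TUNABLES setting from step 4 to compare the two paths.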

I'll also attach the bench-memcpy-large.out, which shows similar results.

I've previously filed this as an Ubuntu bug (https://bugs.launchpad.net/ubuntu/+source/glibc/+bug/2030515) but it doesn't seem to have received much attention.

Revision history for this message
In , Bruce Merry (bmerry-q) wrote :

Created attachment 15193
Glibc's memcpy benchmark results

Revision history for this message
In , Bruce Merry (bmerry-q) wrote :

Created attachment 15194
Output of ld-linux.so.2 --list-tunables

Revision history for this message
In , Bruce Merry (bmerry-q) wrote :

Created attachment 15195
Output of ld-linux.so.2 --list-diagnostics

Revision history for this message
In , Bruce Merry (bmerry-q) wrote :

This issue also affects Zen 3. Zen 2 doesn't advertise ERMS so memcpy isn't affected.

Revision history for this message
In , Bruce Merry (bmerry-q) wrote :

FWIW, backwards REP MOVSB (std; rep movsb; cld) is still horribly slow on Zen 4 (4 GB/s even when the data is nicely aligned and cached).
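
For context, "backwards REP MOVSB" here means a copy done with the direction flag set, roughly like the following sketch (a hypothetical helper for illustration; glibc's ERMS memmove uses this form for overlapping backward copies):

#include <cstddef>

// With DF set, REP MOVSB copies RCX bytes while *decrementing* RSI/RDI,
// so the pointers must start at the last byte of each buffer.
static void copy_rep_movsb_backward(void *dst, const void *src, std::size_t n)
{
    if (n == 0)
        return;
    char *d = static_cast<char *>(dst) + n - 1;
    const char *s = static_cast<const char *>(src) + n - 1;
    asm volatile("std\n\t"
                 "rep movsb\n\t"
                 "cld"
                 : "+D"(d), "+S"(s), "+c"(n)
                 :
                 : "memory", "cc");
}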

Revision history for this message
In , Adhemerval Zanella (adhemerval-zanella) wrote :

I have access to a Zen3 machine (5900X) and I can confirm that using REP MOVSB seems to be always worse than vector instructions. ERMS is used for sizes between 2112 (rep_movsb_threshold) and 524288 (rep_movsb_stop_threshold, the L2 size for Zen3), and the '-S 0 -D 1' slowdown really does seem to be a microcode issue, since I don't see a similar performance difference with other alignments.

On Zen3 with REP MOVSB I see:

$ ./testrun.sh ./memcpy_loop -t mmap -f memcpy -b 2113 -p 100000 -D 0
84.2448 GB/s

$ ./testrun.sh ./memcpy_loop -t mmap -f memcpy -b 2113 -p 100000 -D 0 `seq -s' ' 0 2 23`
506.099 GB/s

$ ./testrun.sh ./memcpy_loop -t mmap -f memcpy -b 2113 -p 100000 -D 0 `seq -s' ' 0 23`
990.845 GB/s

$ ./testrun.sh ./memcpy_loop -t mmap -f memcpy -b 524287 -p 100000 -D 0
57.1122 GB/s

$ ./testrun.sh ./memcpy_loop -t mmap -f memcpy -b 524287 -p 100000 -D 0 `seq -s' ' 0 2 23`
325.409 GB/s

$ ./testrun.sh ./memcpy_loop -t mmap -f memcpy -b 524287 -p 100000 -D 0 `seq -s' ' 0 23`
510.87 GB/s

$ ./testrun.sh ./memcpy_loop -t mmap -f memcpy -b 2113 -p 100000 -D 15
4.43104 GB/s

$ ./testrun.sh ./memcpy_loop -t mmap -f memcpy -b 2113 -p 100000 -D 15 `seq -s' ' 0 2 23`
22.4551 GB/s

$ ./testrun.sh ./memcpy_loop -t mmap -f memcpy -b 2113 -p 100000 -D 15 `seq -s' ' 0 23`
40.4088 GB/s

$ ./testrun.sh ./memcpy_loop -t mmap -f memcpy -b 524287 -p 100000 -D 15
4.34671 GB/s

$ ./testrun.sh ./memcpy_loop -t mmap -f memcpy -b 524287 -p 100000 -D 15 `seq -s' ' 0 2 23`
22.0829 GB/s

$ ./testrun.sh ./memcpy_loop -t mmap -f memcpy -b 524287 -p 100000 -D 15 `seq -s' ' 0 23`

While with vectorized instructions I see:

$ GLIBC_TUNABLES=glibc.cpu.x86_rep_movsb_threshold=1000000 ./testrun.sh ./memcpy_loop -t mmap -f memcpy -b 2113 -p 100000 -D 0
124.183 GB/s

$ GLIBC_TUNABLES=glibc.cpu.x86_rep_movsb_threshold=1000000 ./testrun.sh ./memcpy_loop -t mmap -f memcpy -b 2113 -p 100000 -D 0 `seq -s' ' 0 2 23`
773.696 GB/s

$ GLIBC_TUNABLES=glibc.cpu.x86_rep_movsb_threshold=1000000 ./testrun.sh ./memcpy_loop -t mmap -f memcpy -b 2113 -p 100000 -D 0 `seq -s' ' 0 23`
1413.02 GB/s

$ GLIBC_TUNABLES=glibc.cpu.x86_rep_movsb_threshold=1000000 ./testrun.sh ./memcpy_loop -t mmap -f memcpy -b 524287 -p 100000 -D 0
58.3212 GB/s

$ GLIBC_TUNABLES=glibc.cpu.x86_rep_movsb_threshold=1000000 ./testrun.sh ./memcpy_loop -t mmap -f memcpy -b 524287 -p 100000 -D 0 `seq -s' ' 0 2 23`
322.583 GB/s

$ GLIBC_TUNABLES=glibc.cpu.x86_rep_movsb_threshold=1000000 ./testrun.sh ./memcpy_loop -t mmap -f memcpy -b 524287 -p 100000 -D 0 `seq -s' ' 0 23`
506.116 GB/s

$ GLIBC_TUNABLES=glibc.cpu.x86_rep_movsb_threshold=1000000 ./testrun.sh ./memcpy_loop -t mmap -f memcpy -b 2113 -p 100000 -D 15
121.872 GB/s

$ GLIBC_TUNABLES=glibc.cpu.x86_rep_movsb_threshold=1000000 ./testrun.sh ./memcpy_loop -t mmap -f memcpy -b 2113 -p 100000 -D 15 `seq -s' ' 0 2 23`
717.717 GB/s

$ GLIBC_TUNABLES=glibc.cpu.x86_rep_movsb_threshold=1000000 ./testrun.sh ./memcpy_loop -t mmap -f memcpy -b 2113 -p 100000 -D 15 `seq -s' ' 0 23`
1318.17 GB/s

$ GLIBC_TUNABLES=glibc.cpu.x86_rep_movsb_threshold=1000000 ./testrun.sh ./memcpy_loop -t mmap -f memcpy -b 524287 -p 100000 -D 15
58.5352 GB/s

$ GLIBC_TUNABLES=glibc.cpu.x86_rep_movsb_threshold...


Revision history for this message
In , Bruce Merry (bmerry-q) wrote :

Here's what I get on the Zen 4 system with the same parameters. I haven't had a chance to look at what it all means:

+ ./memcpy_loop -t mmap -f memcpy -b 2113 -p 100000 -D 0 -r5
80.6649 GB/s
+ ./memcpy_loop -t mmap -f memcpy -b 2113 -p 100000 -D 0 0 2 4 6 8 10 12 14 16 18 20 22 -r5
954.928 GB/s
+ ./memcpy_loop -t mmap -f memcpy -b 2113 -p 100000 -D 0 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 -r5
1883.1 GB/s
+ ./memcpy_loop -t mmap -f memcpy -b 524287 -p 100000 -D 0 -r5
48.7753 GB/s
+ ./memcpy_loop -t mmap -f memcpy -b 524287 -p 100000 -D 0 0 2 4 6 8 10 12 14 16 18 20 22 -r5
570.385 GB/s
+ ./memcpy_loop -t mmap -f memcpy -b 524287 -p 100000 -D 0 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 -r5
676.928 GB/s
+ ./memcpy_loop -t mmap -f memcpy -b 2113 -p 100000 -D 15 -r5
3.54696 GB/s
+ ./memcpy_loop -t mmap -f memcpy -b 2113 -p 100000 -D 15 0 2 4 6 8 10 12 14 16 18 20 22 -r5
42.5706 GB/s
+ ./memcpy_loop -t mmap -f memcpy -b 2113 -p 100000 -D 15 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 -r5
85.0753 GB/s
+ ./memcpy_loop -t mmap -f memcpy -b 524287 -p 100000 -D 15 -r5
3.50689 GB/s
+ ./memcpy_loop -t mmap -f memcpy -b 524287 -p 100000 -D 15 0 2 4 6 8 10 12 14 16 18 20 22 -r5
41.5237 GB/s
+ ./memcpy_loop -t mmap -f memcpy -b 524287 -p 100000 -D 15 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 -r5
81.8951 GB/s
+ ./memcpy_loop -t mmap -f memcpy -b 2113 -p 100000 -D 0 -r5
102.05 GB/s
+ ./memcpy_loop -t mmap -f memcpy -b 2113 -p 100000 -D 0 0 2 4 6 8 10 12 14 16 18 20 22 -r5
1206.81 GB/s
+ ./memcpy_loop -t mmap -f memcpy -b 2113 -p 100000 -D 0 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 -r5
2415.47 GB/s
+ ./memcpy_loop -t mmap -f memcpy -b 524287 -p 100000 -D 0 -r5
49.4859 GB/s
+ ./memcpy_loop -t mmap -f memcpy -b 524287 -p 100000 -D 0 0 2 4 6 8 10 12 14 16 18 20 22 -r5
583.279 GB/s
+ ./memcpy_loop -t mmap -f memcpy -b 524287 -p 100000 -D 0 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 -r5
1066.54 GB/s
+ ./memcpy_loop -t mmap -f memcpy -b 2113 -p 100000 -D 15 -r5
97.1753 GB/s
+ ./memcpy_loop -t mmap -f memcpy -b 2113 -p 100000 -D 15 0 2 4 6 8 10 12 14 16 18 20 22 -r5
991.128 GB/s
+ ./memcpy_loop -t mmap -f memcpy -b 2113 -p 100000 -D 15 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 -r5
2257.42 GB/s
+ ./memcpy_loop -t mmap -f memcpy -b 524287 -p 100000 -D 15 -r5
49.3362 GB/s
+ ./memcpy_loop -t mmap -f memcpy -b 524287 -p 100000 -D 15 0 2 4 6 8 10 12 14 16 18 20 22 -r5
571.026 GB/s
+ ./memcpy_loop -t mmap -f memcpy -b 524287 -p 100000 -D 15 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 -r5
1075.03 GB/s

Revision history for this message
In , Bruce Merry (bmerry-q) wrote :

Ah, it looks like the GLIBC_TUNABLES environment variable didn't appear in the output. Let me try again:

+ ./memcpy_loop -t mmap -f memcpy -b 2113 -p 100000 -D 0 -r5
80.6649 GB/s
+ ./memcpy_loop -t mmap -f memcpy -b 2113 -p 100000 -D 0 0 2 4 6 8 10 12 14 16 18 20 22 -r5
954.928 GB/s
+ ./memcpy_loop -t mmap -f memcpy -b 2113 -p 100000 -D 0 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 -r5
1883.1 GB/s
+ ./memcpy_loop -t mmap -f memcpy -b 524287 -p 100000 -D 0 -r5
48.7753 GB/s
+ ./memcpy_loop -t mmap -f memcpy -b 524287 -p 100000 -D 0 0 2 4 6 8 10 12 14 16 18 20 22 -r5
570.385 GB/s
+ ./memcpy_loop -t mmap -f memcpy -b 524287 -p 100000 -D 0 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 -r5
676.928 GB/s
+ ./memcpy_loop -t mmap -f memcpy -b 2113 -p 100000 -D 15 -r5
3.54696 GB/s
+ ./memcpy_loop -t mmap -f memcpy -b 2113 -p 100000 -D 15 0 2 4 6 8 10 12 14 16 18 20 22 -r5
42.5706 GB/s
+ ./memcpy_loop -t mmap -f memcpy -b 2113 -p 100000 -D 15 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 -r5
85.0753 GB/s
+ ./memcpy_loop -t mmap -f memcpy -b 524287 -p 100000 -D 15 -r5
3.50689 GB/s
+ ./memcpy_loop -t mmap -f memcpy -b 524287 -p 100000 -D 15 0 2 4 6 8 10 12 14 16 18 20 22 -r5
41.5237 GB/s
+ ./memcpy_loop -t mmap -f memcpy -b 524287 -p 100000 -D 15 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 -r5
81.8951 GB/s
+ GLIBC_TUNABLES=glibc.cpu.x86_rep_movsb_threshold=1000000
+ ./memcpy_loop -t mmap -f memcpy -b 2113 -p 100000 -D 0 -r5
102.05 GB/s
+ GLIBC_TUNABLES=glibc.cpu.x86_rep_movsb_threshold=1000000
+ ./memcpy_loop -t mmap -f memcpy -b 2113 -p 100000 -D 0 0 2 4 6 8 10 12 14 16 18 20 22 -r5
1206.81 GB/s
+ GLIBC_TUNABLES=glibc.cpu.x86_rep_movsb_threshold=1000000
+ ./memcpy_loop -t mmap -f memcpy -b 2113 -p 100000 -D 0 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 -r5
2415.47 GB/s
+ GLIBC_TUNABLES=glibc.cpu.x86_rep_movsb_threshold=1000000
+ ./memcpy_loop -t mmap -f memcpy -b 524287 -p 100000 -D 0 -r5
49.4859 GB/s
+ GLIBC_TUNABLES=glibc.cpu.x86_rep_movsb_threshold=1000000
+ ./memcpy_loop -t mmap -f memcpy -b 524287 -p 100000 -D 0 0 2 4 6 8 10 12 14 16 18 20 22 -r5
583.279 GB/s
+ GLIBC_TUNABLES=glibc.cpu.x86_rep_movsb_threshold=1000000
+ ./memcpy_loop -t mmap -f memcpy -b 524287 -p 100000 -D 0 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 -r5
1066.54 GB/s
+ GLIBC_TUNABLES=glibc.cpu.x86_rep_movsb_threshold=1000000
+ ./memcpy_loop -t mmap -f memcpy -b 2113 -p 100000 -D 15 -r5
97.1753 GB/s
+ GLIBC_TUNABLES=glibc.cpu.x86_rep_movsb_threshold=1000000
+ ./memcpy_loop -t mmap -f memcpy -b 2113 -p 100000 -D 15 0 2 4 6 8 10 12 14 16 18 20 22 -r5
991.128 GB/s
+ GLIBC_TUNABLES=glibc.cpu.x86_rep_movsb_threshold=1000000
+ ./memcpy_loop -t mmap -f memcpy -b 2113 -p 100000 -D 15 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 -r5
2257.42 GB/s
+ GLIBC_TUNABLES=glibc.cpu.x86_rep_movsb_threshold=1000000
+ ./memcpy_loop -t mmap -f memcpy -b 524287 -p 100000 -D 15 -r5
49.3362 GB/s
+ GLIBC_TUNABLES=glibc.cpu.x86_rep_movsb_threshold=1000000
+ ./memcpy_loop -t mmap -f memcpy -b 524287 -p 100000 -D 15 0 2 4 6 8 10 12 14 16 18 20 22 -r5
571.026 GB/s
+ GLIBC_TUNABLES=glibc.cpu.x86_rep...


Revision history for this message
In , Bruce Merry (bmerry-q) wrote :

So in those cases REP MOVSB seems to be a slowdown, but there also seem to be cases where REP MOVSB is much faster (this is on Zen 4), e.g.:

$ ./memcpy_loop -D 512 -b 4096 -t mmap_huge -f memcpy -p 10000000 -r 5 0
Using 1 threads, each with 4096 bytes of mmap_huge memory (10000000 passes)
Using function memcpy
94.5295 GB/s
94.3382 GB/s
94.474 GB/s
94.2385 GB/s
94.5105 GB/s

$ GLIBC_TUNABLES=glibc.cpu.x86_rep_movsb_threshold=1000000 ./memcpy_loop -D 512 -b 4096 -t mmap_huge -f memcpy -p 10000000 -r 5 0
Using 1 threads, each with 4096 bytes of mmap_huge memory (10000000 passes)
Using function memcpy
56.5062 GB/s
55.3669 GB/s
56.4723 GB/s
55.857 GB/s
56.5396 GB/s

When not using huge pages, the vectorised memcpy hits 115.5 GB/s. I'm seeing a lot of cases on Zen 4 where huge pages actually make things worse; maybe it's related to hardware prefetch reading past the end of the buffer?

Revision history for this message
In , Adhemerval Zanella (adhemerval-zanella) wrote :

On Zen3 I am not seeing such a slowdown using vectorized instructions. With a patched glibc that disables REP MOVSB I see:

$ ./testrun.sh ./memcpy_loop -D 512 -b 4096 -t mmap_huge -f memcpy -p 10000000
146.593 GB/s

# Force REP MOVSB
$ GLIBC_TUNABLES=glibc.cpu.x86_rep_movsb_stop_threshold=4097 ./testrun.sh ./memcpy_loop -D 512 -b 4096 -t mmap_huge -f memcpy -p 10000000
116.298 GB/s

And I don't see difference between mmap and mmap_huge.

Revision history for this message
In , Bruce Merry (bmerry-q) wrote :

> On Zen3 I am not seeing such slowdown using vectorized instructions.

Agreed, I'm also not seeing this huge-page slowdown on our Zen 3 servers (this is with Ubuntu 22.04's glibc 2.35; I haven't got a hand-built glibc handy on that server):

$ ./memcpy_loop -D 512 -b 4096 -t mmap_huge -f memcpy -p 10000000 -r 5 0
Using 1 threads, each with 4096 bytes of mmap_huge memory (10000000 passes)
Using function memcpy
90.065 GB/s
89.9096 GB/s
89.9131 GB/s
89.8207 GB/s
89.952 GB/s

$ GLIBC_TUNABLES=glibc.cpu.x86_rep_movsb_threshold=1000000 ./memcpy_loop -D 512 -b 4096 -t mmap_huge -f memcpy -p 10000000 -r 5 0
Using 1 threads, each with 4096 bytes of mmap_huge memory (10000000 passes)
Using function memcpy
116.997 GB/s
116.874 GB/s
116.937 GB/s
117.029 GB/s
117.007 GB/s

On the other hand, there seem to be other cases where REP MOVSB is faster on Zen 3:

$ ./memcpy_loop -D 512 -f memcpy_rep_movsb -r 5 -t mmap 0
Using 1 threads, each with 134217728 bytes of mmap memory (10 passes)
Using function memcpy_rep_movsb
22.045 GB/s
22.3135 GB/s
22.1144 GB/s
22.8571 GB/s
22.2688 GB/s

$ ./memcpy_loop -D 512 -f memcpy -r 5 -t mmap 0
Using 1 threads, each with 134217728 bytes of mmap memory (10 passes)
Using function memcpy
7.66155 GB/s
7.71314 GB/s
7.72952 GB/s
7.72505 GB/s
7.74309 GB/s

But overall it does seem like the vectorised copy performs better than REP MOVSB on Zen 3.

Revision history for this message
In , Adhemerval Zanella (adhemerval-zanella) wrote :

(In reply to Bruce Merry from comment #11)
> > On Zen3 I am not seeing such slowdown using vectorized instructions.
>
> Agreed, I'm also not seeing this huge-page slowdown on our Zen 3 servers
> (this is with Ubuntu 22.04's glibc 2.35; I haven't got a hand-built glibc
> handy on that server):
>
> $ ./memcpy_loop -D 512 -b 4096 -t mmap_huge -f memcpy -p 10000000 -r 5 0
> Using 1 threads, each with 4096 bytes of mmap_huge memory (10000000 passes)
> Using function memcpy
> 90.065 GB/s
> 89.9096 GB/s
> 89.9131 GB/s
> 89.8207 GB/s
> 89.952 GB/s
>
> $ GLIBC_TUNABLES=glibc.cpu.x86_rep_movsb_threshold=1000000 ./memcpy_loop -D
> 512 -b 4096 -t mmap_huge -f memcpy -p 10000000 -r 5 0
> Using 1 threads, each with 4096 bytes of mmap_huge memory (10000000 passes)
> Using function memcpy
> 116.997 GB/s
> 116.874 GB/s
> 116.937 GB/s
> 117.029 GB/s
> 117.007 GB/s
>
> On the other hand, there seem to be other cases where REP MOVSB is faster on
> Zen 3:
>
> $ ./memcpy_loop -D 512 -f memcpy_rep_movsb -r 5 -t mmap 0
> Using 1 threads, each with 134217728 bytes of mmap memory (10 passes)
> Using function memcpy_rep_movsb
> 22.045 GB/s
> 22.3135 GB/s
> 22.1144 GB/s
> 22.8571 GB/s
> 22.2688 GB/s
>
> $ ./memcpy_loop -D 512 -f memcpy -r 5 -t mmap 0
> Using 1 threads, each with 134217728 bytes of mmap memory (10 passes)
> Using function memcpy
> 7.66155 GB/s
> 7.71314 GB/s
> 7.72952 GB/s
> 7.72505 GB/s
> 7.74309 GB/s
>
> But overall it does seem like the vectorised copy performs better than REP
> MOVSB on Zen 3.

The main issue seems to be defining when ERMS is better than the vectorized path based on the arguments. Current glibc only takes the input size into consideration, whereas from the discussion it seems we also need to take the alignment of the arguments (both of them) into consideration.

Also, it seems that Zen3 ERMS is slightly better than non-temporal instructions, which is another tuning heuristic where, again, only the size is used to decide when it kicks in (currently x86_non_temporal_threshold).

In any case, I think at least for sizes where ERMS is currently being used it would be better to use the vectorized path. Most likely some more tuning to switch to ERMS at larger sizes would be profitable for Zen cores.

Does AMD provide any tuning manual describing such characteristics for instruction and memory operations?
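
To make the shape of that heuristic concrete, here is a rough sketch of the kind of selection logic being discussed. It is purely illustrative (the function name and the 64-byte cut-off are invented, and this is not what glibc does); for reference, the fix that eventually landed (see the commit below) takes a simpler route and stops preferring REP MOVSB in this size range on Zen3/Zen4 rather than adding an alignment check:

#include <cstddef>
#include <cstdint>

// Illustrative only: use REP MOVSB when the size falls inside the ERMS window
// *and* the relative alignment of the arguments avoids the pathological case
// reported here, where (dst - src) & 0xFFF is small but non-zero.
static bool use_rep_movsb(const void *dst, const void *src, std::size_t n,
                          std::size_t rep_movsb_threshold,
                          std::size_t rep_movsb_stop_threshold)
{
    if (n < rep_movsb_threshold || n >= rep_movsb_stop_threshold)
        return false;
    std::uintptr_t diff = (reinterpret_cast<std::uintptr_t>(dst)
                           - reinterpret_cast<std::uintptr_t>(src)) & 0xFFF;
    return diff == 0 || diff >= 64;   // 64 is an arbitrary illustrative cut-off
}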

Benjamin Drung (bdrung)
description: updated
Changed in glibc:
importance: Unknown → Low
status: Unknown → New
Revision history for this message
In , Cvs-commit (cvs-commit) wrote :

The master branch has been updated by H.J. Lu <email address hidden>:

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=0c0d39fe4aeb0f69b26e76337c5dfd5530d5d44e

commit 0c0d39fe4aeb0f69b26e76337c5dfd5530d5d44e
Author: Adhemerval Zanella <email address hidden>
Date: Thu Feb 8 10:08:38 2024 -0300

    x86: Fix Zen3/Zen4 ERMS selection (BZ 30994)

    The REP MOVSB usage on memcpy/memmove does not show much performance
    improvement on Zen3/Zen4 cores compared to the vectorized loops. Also,
    as from BZ 30994, if the source is aligned and the destination is not
    the performance can be 20x slower.

    The performance difference is noticeable with small buffer sizes, closer
    to the lower bounds limits when memcpy/memmove starts to use ERMS. The
    performance of REP MOVSB is similar to vectorized instruction on the
    size limit (the L2 cache). Also, there is no drawback to multiple cores
    sharing the cache.

    Checked on x86_64-linux-gnu on Zen3.
    Reviewed-by: H.J. Lu <email address hidden>

Revision history for this message
In , Cvs-commit (cvs-commit) wrote :

The release/2.39/master branch has been updated by Arjun Shankar <email address hidden>:

https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=aa4249266e9906c4bc833e4847f4d8feef59504f

commit aa4249266e9906c4bc833e4847f4d8feef59504f
Author: Adhemerval Zanella <email address hidden>
Date: Thu Feb 8 10:08:38 2024 -0300

    x86: Fix Zen3/Zen4 ERMS selection (BZ 30994)

    The REP MOVSB usage on memcpy/memmove does not show much performance
    improvement on Zen3/Zen4 cores compared to the vectorized loops. Also,
    as from BZ 30994, if the source is aligned and the destination is not
    the performance can be 20x slower.

    The performance difference is noticeable with small buffer sizes, closer
    to the lower bounds limits when memcpy/memmove starts to use ERMS. The
    performance of REP MOVSB is similar to vectorized instruction on the
    size limit (the L2 cache). Also, there is no drawback to multiple cores
    sharing the cache.

    Checked on x86_64-linux-gnu on Zen3.
    Reviewed-by: H.J. Lu <email address hidden>

    (cherry picked from commit 0c0d39fe4aeb0f69b26e76337c5dfd5530d5d44e)
