Terrible memcpy performance on Zen 3 when using rep movsb
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
GLibC | New | Low | |
glibc (Ubuntu) | New | Undecided | Unassigned |
Bug Description
On CPUs that advertise FSRM (fast short rep movsb), glibc 2.35 uses REP MOVSB for memcpy for sizes above 2112 bytes (up to some threshold that depends on the cache size). Unfortunately, it seems that Zen 3 (at least with the microcode we're running) is extremely slow at REP MOVSB when the data are not well-aligned.
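For anyone who wants to confirm that their CPU advertises the relevant feature bits, here is a minimal sketch. It assumes GCC or Clang on x86 (it uses `__get_cpuid_count` from `<cpuid.h>`). The struct and function names are made up for illustration and are not part of glibc. It only reads the CPUID bits (leaf 7, subleaf 0: EBX bit 9 for ERMS, EDX bit 4 for FSRM). It does not reproduce glibc's own selection logic.

```c
#include <stdbool.h>
#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>
#endif

/* Hypothetical container for the two REP MOVSB related CPUID bits. */
struct rep_movsb_features {
    bool erms;  /* Enhanced REP MOVSB/STOSB: CPUID.(EAX=7,ECX=0):EBX bit 9 */
    bool fsrm;  /* Fast Short REP MOVSB:     CPUID.(EAX=7,ECX=0):EDX bit 4 */
};

/* Hypothetical helper: query the bits directly from CPUID.
   On non-x86 builds, or if leaf 7 is unsupported, both stay false. */
static struct rep_movsb_features query_rep_movsb_features(void)
{
    struct rep_movsb_features f = { false, false };
#if defined(__x86_64__) || defined(__i386__)
    unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;
    if (__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx)) {
        f.erms = ((ebx >> 9) & 1u) != 0;
        f.fsrm = ((edx >> 4) & 1u) != 0;
    }
#endif
    return f;
}
```

Calling `query_rep_movsb_features()` and printing the `fsrm` field should show whether a machine falls into the code path described here. Whether glibc then actually picks REP MOVSB also depends on its thresholds, which this sketch does not inspect.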
I've found this using a memcpy benchmark at https:/
./memcpy_loop -b 2113 -p 1000000 -t mmap -S 0 -D 1 0
This runs:
- 2113-byte memory copies
- 1,000,000 times per timing measurement
- in memory allocated with mmap
- with the source 0 bytes from the start of the page
- with the destination 1 byte from the start of the page
- on core 0.
It reports about 3.2 GB/s. Change the -b argument to 2111 and it reports over 100 GB/s. So the REP MOVSB case is about 30× slower!
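For readers without the benchmark at hand, the measurement can be approximated with a rough sketch like the one below. It is not the memcpy_loop tool. The function name `copy_throughput_gbps`, the `PAGE` constant (an assumed 4096-byte page size) and the error convention are all made up for illustration, and pinning to a core is left out. The call goes through a volatile function pointer so the compiler cannot inline or drop the copy and the libc memcpy is the one being timed.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <time.h>

#define PAGE 4096u  /* assumption: 4 KiB pages */

/* Hypothetical helper: copy `size` bytes `reps` times between two
   mmap'd (page-aligned) buffers at the given offsets from the page start.
   Returns throughput in GB/s, or -1.0 on bad arguments or failure. */
static double copy_throughput_gbps(size_t size, size_t src_off,
                                   size_t dst_off, long reps)
{
    if (size == 0 || reps <= 0 || src_off >= PAGE || dst_off >= PAGE)
        return -1.0;

    size_t len = size + PAGE;  /* payload plus room for the offset */
    char *src = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (src == MAP_FAILED)
        return -1.0;
    char *dst = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (dst == MAP_FAILED) {
        munmap(src, len);
        return -1.0;
    }

    /* Touch both buffers so page faults are not part of the timing. */
    memset(src, 0x5a, len);
    memset(dst, 0, len);

    /* Volatile pointer forces a real call to the libc memcpy. */
    void *(*volatile copy)(void *, const void *, size_t) = memcpy;

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < reps; i++)
        copy(dst + dst_off, src + src_off, size);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (double)(t1.tv_sec - t0.tv_sec)
                + (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
    munmap(src, len);
    munmap(dst, len);
    if (secs <= 0.0)
        return -1.0;
    return (double)size * (double)reps / secs / 1e9;
}
```

Comparing `copy_throughput_gbps(2113, 0, 1, 1000000)` with `copy_throughput_gbps(2111, 0, 1, 1000000)` mirrors the two runs above. The absolute figures will differ from the real benchmark, but a large gap between the two sizes would point at the same threshold.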
This will most likely need to be reported and fixed upstream, but I'm reporting it to Ubuntu first since I don't know if Ubuntu has modified glibc in any way that would be significant.
See also: https:/
ProblemType: Bug
DistroRelease: Ubuntu 22.04
Package: libc6 2.35-0ubuntu3.1
ProcVersionSign
Uname: Linux 5.19.0-46-generic x86_64
NonfreeKernelMo
ApportVersion: 2.20.11-0ubuntu82.5
Architecture: amd64
CasperMD5CheckR
Date: Mon Aug 7 14:02:28 2023
RebootRequiredPkgs: Error: path contained symlinks.
SourcePackage: glibc
UpgradeStatus: No upgrade log present (probably fresh install)
description: updated
Changed in glibc:
importance: Unknown → Low
status: Unknown → New
Does your glibc already have ld.so --list-diagnostics? It would be good to know the x86.cpu_features.shared_cache_size and x86.cpu_features.non_temporal_threshold values.
I suspect you are on a machine with a large core count, so the heuristic tunes things down to a low non-temporal threshold.