The release/2.39/master branch has been updated by Arjun Shankar <email address hidden>:
https://sourceware.org/git/gitweb.cgi?p=glibc.git;h=aa4249266e9906c4bc833e4847f4d8feef59504f
commit aa4249266e9906c4bc833e4847f4d8feef59504f
Author: Adhemerval Zanella <email address hidden>
Date: Thu Feb 8 10:08:38 2024 -0300
x86: Fix Zen3/Zen4 ERMS selection (BZ 30994)
The REP MOVSB usage on memcpy/memmove does not show much performance
improvement on Zen3/Zen4 cores compared to the vectorized loops. Also,
as from BZ 30994, if the source is aligned and the destination is not
the performance can be 20x slower.
The performance difference is noticeable with small buffer sizes, closer
to the lower bounds limits when memcpy/memmove starts to use ERMS. The
performance of REP MOVSB is similar to vectorized instruction on the
size limit (the L2 cache). Also, there is no drawback to multiple cores
sharing the cache.
Checked on x86_64-linux-gnu on Zen3.
Reviewed-by: H.J. Lu <email address hidden>
(cherry picked from commit 0c0d39fe4aeb0f69b26e76337c5dfd5530d5d44e)
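
For anyone who wants to check the aligned-source / unaligned-destination case described in the commit message on their own machine, here is a minimal timing sketch. It is not taken from the glibc benchmark suite. The names time_copies and run_case, the 4096-byte copy size, the 64-byte alignment and the iteration count are all assumptions chosen for illustration; the size at which memcpy actually switches to REP MOVSB depends on the CPU and the glibc version.

```c
#define _POSIX_C_SOURCE 200809L
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Illustrative values only, not taken from glibc. */
enum { COPY_SIZE = 4096, ITERATIONS = 100000, ALIGNMENT = 64 };

/* Time ITERATIONS calls to memcpy and return the elapsed seconds. */
static double time_copies(unsigned char *dst, const unsigned char *src,
                          size_t n)
{
    struct timespec start, end;

    clock_gettime(CLOCK_MONOTONIC, &start);
    for (int i = 0; i < ITERATIONS; i++) {
        memcpy(dst, src, n);
        /* Compiler barrier (GCC/Clang syntax) so the copies are kept. */
        __asm__ volatile("" : : "r"(dst) : "memory");
    }
    clock_gettime(CLOCK_MONOTONIC, &end);

    return (double)(end.tv_sec - start.tv_sec)
           + (double)(end.tv_nsec - start.tv_nsec) / 1e9;
}

/* Copy from a 64-byte-aligned source to a destination shifted by
   dst_offset bytes.  Returns elapsed seconds, or -1.0 on failure. */
static double run_case(size_t dst_offset)
{
    void *src_mem = NULL;
    void *dst_mem = NULL;

    if (dst_offset >= ALIGNMENT)
        return -1.0;
    if (posix_memalign(&src_mem, ALIGNMENT, COPY_SIZE) != 0)
        return -1.0;
    if (posix_memalign(&dst_mem, ALIGNMENT, COPY_SIZE + ALIGNMENT) != 0) {
        free(src_mem);
        return -1.0;
    }

    unsigned char *src = src_mem;
    unsigned char *dst = (unsigned char *)dst_mem + dst_offset;

    memset(src, 0xA5, COPY_SIZE);
    double elapsed = time_copies(dst, src, COPY_SIZE);

    /* Sanity check: the last copy must have produced identical bytes. */
    if (memcmp(dst, src, COPY_SIZE) != 0)
        elapsed = -1.0;

    free(src_mem);
    free(dst_mem);
    return elapsed;
}
```

Calling run_case(0) and run_case(1) from main and printing both results gives a rough aligned versus misaligned comparison. Building with -fno-builtin-memcpy should keep GCC from expanding the copy inline, so that the libc memcpy is what gets measured. The numbers are only indicative; a single run like this is far noisier than the glibc benchtests.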