I have access to a Zen3 core (5900X) and I can confirm that using REP MOVSB seems to be always worse than vector instructions.  ERMS is used for sizes between 2112 (rep_movsb_threshold) and 524288 (rep_movsb_stop_threshold, or the L2 size for Zen3), and the '-S 0 -D 1' performance really seems to be a microcode issue, since I don't see a similar performance difference with other alignments.
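
To make the size window concrete, here is a minimal sketch of the selection as described in this comment. It is not the glibc implementation: would_use_rep_movsb is a hypothetical helper, the two constants are just the Zen3 values quoted above, and treating both bounds as exclusive (so 2113 to 524287 qualify) is an assumption; the real assembly also has overlap and alignment checks that are ignored here.

```c
#include <stdbool.h>
#include <stddef.h>

/* Values quoted in this comment for Zen3 (5900X); they are hard-coded
   here for illustration and not read from glibc at run time.  */
#define ZEN3_REP_MOVSB_THRESHOLD       ((size_t) 2112)
#define ZEN3_REP_MOVSB_STOP_THRESHOLD  ((size_t) 524288)

/* Hypothetical helper: returns true when a copy of SIZE bytes falls in
   the window where REP MOVSB would be chosen over the vector loop.
   Assumption: both bounds are exclusive, matching the "2113 to 524287"
   range mentioned in this comment.  */
bool
would_use_rep_movsb (size_t size, size_t threshold, size_t stop_threshold)
{
  return size > threshold && size < stop_threshold;
}
```

Under this model, raising the lower threshold to 1000000 (as the GLIBC_TUNABLES runs in this comment do) puts it above the stop threshold, so the window is empty and every size takes the vector path.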

On Zen3 with REP MOVSB I see:

$ ./testrun.sh ./memcpy_loop -t mmap -f memcpy -b 2113 -p 100000 -D 0
84.2448 GB/s

$ ./testrun.sh ./memcpy_loop -t mmap -f memcpy -b 2113 -p 100000 -D 0 `seq -s' ' 0 2 23`
506.099 GB/s

$ ./testrun.sh ./memcpy_loop -t mmap -f memcpy -b 2113 -p 100000 -D 0 `seq -s' ' 0 23`
990.845 GB/s


$ ./testrun.sh ./memcpy_loop -t mmap -f memcpy -b 524287 -p 100000 -D 0
57.1122 GB/s

$ ./testrun.sh ./memcpy_loop -t mmap -f memcpy -b 524287 -p 100000 -D 0 `seq -s' ' 0 2 23`
325.409 GB/s

$ ./testrun.sh ./memcpy_loop -t mmap -f memcpy -b 524287 -p 100000 -D 0 `seq -s' ' 0 23`
510.87 GB/s


$ ./testrun.sh ./memcpy_loop -t mmap -f memcpy -b 2113 -p 100000 -D 15
4.43104 GB/s

$ ./testrun.sh ./memcpy_loop -t mmap -f memcpy -b 2113 -p 100000 -D 15 `seq -s' ' 0 2 23`
22.4551 GB/s

$ ./testrun.sh ./memcpy_loop -t mmap -f memcpy -b 2113 -p 100000 -D 15 `seq -s' ' 0 23`
40.4088 GB/s


$ ./testrun.sh ./memcpy_loop -t mmap -f memcpy -b 524287 -p 100000 -D 15
4.34671 GB/s

$ ./testrun.sh ./memcpy_loop -t mmap -f memcpy -b 524287 -p 100000 -D 15 `seq -s' ' 0 2 23`
22.0829 GB/s

$ ./testrun.sh ./memcpy_loop -t mmap -f memcpy -b 524287 -p 100000 -D 15 `seq -s' ' 0 23`


While with vectorized instructions I see:


$ GLIBC_TUNABLES=glibc.cpu.x86_rep_movsb_threshold=1000000 ./testrun.sh ./memcpy_loop -t mmap -f memcpy -b 2113 -p 100000 -D 0
124.183 GB/s

$ GLIBC_TUNABLES=glibc.cpu.x86_rep_movsb_threshold=1000000 ./testrun.sh ./memcpy_loop -t mmap -f memcpy -b 2113 -p 100000 -D 0 `seq -s' ' 0 2 23`
773.696 GB/s

$ GLIBC_TUNABLES=glibc.cpu.x86_rep_movsb_threshold=1000000 ./testrun.sh ./memcpy_loop -t mmap -f memcpy -b 2113 -p 100000 -D 0 `seq -s' ' 0 23`
1413.02 GB/s


$ GLIBC_TUNABLES=glibc.cpu.x86_rep_movsb_threshold=1000000 ./testrun.sh ./memcpy_loop -t mmap -f memcpy -b 524287 -p 100000 -D 0
58.3212 GB/s

$ GLIBC_TUNABLES=glibc.cpu.x86_rep_movsb_threshold=1000000 ./testrun.sh ./memcpy_loop -t mmap -f memcpy -b 524287 -p 100000 -D 0 `seq -s' ' 0 2 23`
322.583 GB/s

$ GLIBC_TUNABLES=glibc.cpu.x86_rep_movsb_threshold=1000000 ./testrun.sh ./memcpy_loop -t mmap -f memcpy -b 524287 -p 100000 -D 0 `seq -s' ' 0 23`
506.116 GB/s

$ GLIBC_TUNABLES=glibc.cpu.x86_rep_movsb_threshold=1000000 ./testrun.sh ./memcpy_loop -t mmap -f memcpy -b 2113 -p 100000 -D 15
121.872 GB/s

$ GLIBC_TUNABLES=glibc.cpu.x86_rep_movsb_threshold=1000000 ./testrun.sh ./memcpy_loop -t mmap -f memcpy -b 2113 -p 100000 -D 15 `seq -s' ' 0 2 23`
717.717 GB/s

$ GLIBC_TUNABLES=glibc.cpu.x86_rep_movsb_threshold=1000000 ./testrun.sh ./memcpy_loop -t mmap -f memcpy -b 2113 -p 100000 -D 15 `seq -s' ' 0 23`
1318.17 GB/s

$ GLIBC_TUNABLES=glibc.cpu.x86_rep_movsb_threshold=1000000 ./testrun.sh ./memcpy_loop -t mmap -f memcpy -b 524287 -p 100000 -D 15
58.5352 GB/s

$ GLIBC_TUNABLES=glibc.cpu.x86_rep_movsb_threshold=1000000 ./testrun.sh ./memcpy_loop -t mmap -f memcpy -b 524287 -p 100000 -D 15 `seq -s' ' 0 2 23`
325.996 GB/s

$ GLIBC_TUNABLES=glibc.cpu.x86_rep_movsb_threshold=1000000 ./testrun.sh ./memcpy_loop -t mmap -f memcpy -b 524287 -p 100000 -D 15 `seq -s' ' 0 23`
498.552 GB/s

So it seems there is no gain in using REP MOVSB on Zen3/Zen4, especially at the sizes where it was supposed to be better. glibc 2.34 added a fix from AMD (6e02b3e9327b7dbb063958d2b124b64fcb4bbe3f), where the assumption is that ERMS performs poorly on data above the L2 cache size, so REP MOVSB is limited to the L2 cache size (from 2113 to 524287), but I think AMD engineers did not really evaluate whether ERMS is indeed better than vectorized instructions.

And I think BZ#30995 is the same issue, since __memcpy_avx512_unaligned_erms uses the same tunable to decide whether to use ERMS. I have created a patch that just disables ERMS usage on AMD cores [1]; can you check if it improves performance on Zen4 as well?

Also, I have noticed that memset is also showing subpar performance with ERMS, and I have disabled it on my branch as well.

[1] https://sourceware.org/git/?p=glibc.git;a=shortlog;h=refs/heads/azanella/bz30944-memcpy-zen