On Zen3 I am not seeing such a slowdown with vectorized instructions. With a glibc patched to disable REP MOVSB, I see:
$ ./testrun.sh ./memcpy_loop -D 512 -b 4096 -t mmap_huge -f memcpy -p 10000000
146.593 GB/s
# Force REP MOVSB
$ GLIBC_TUNABLES=glibc.cpu.x86_rep_movsb_stop_threshold=4097 ./testrun.sh ./memcpy_loop -D 512 -b 4096 -t mmap_huge -f memcpy -p 10000000
116.298 GB/s
And I don't see a difference between mmap and mmap_huge.
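For scale, the two throughput figures above can be turned into a relative drop. This is only a rough sketch: `slowdown_pct` is a hypothetical helper written for illustration, not part of glibc or of memcpy_loop, and it assumes both numbers are in the same unit (GB/s here).

```shell
# Hypothetical helper: percentage drop from a baseline throughput to a
# second throughput, both given in the same unit.
slowdown_pct() {
    awk -v base="$1" -v other="$2" \
        'BEGIN { printf "%.1f\n", (1 - other / base) * 100 }'
}

# Vectorized path (146.593 GB/s) vs. forced REP MOVSB (116.298 GB/s),
# using the figures reported above.
slowdown_pct 146.593 116.298
```

With the numbers reported above, forcing REP MOVSB comes out roughly a fifth slower than the vectorized path on this Zen3 machine.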