Comment 16 for bug 2030515

Bruce Merry (bmerry-q) wrote:

So in those cases REP MOVSB seems to be a slow-down, but there also seem to be cases where it is much faster (this is on Zen 4), e.g.:

$ ./memcpy_loop -D 512 -b 4096 -t mmap_huge -f memcpy -p 10000000 -r 5 0
Using 1 threads, each with 4096 bytes of mmap_huge memory (10000000 passes)
Using function memcpy
94.5295 GB/s
94.3382 GB/s
94.474 GB/s
94.2385 GB/s
94.5105 GB/s

$ GLIBC_TUNABLES=glibc.cpu.x86_rep_movsb_threshold=1000000 ./memcpy_loop -D 512 -b 4096 -t mmap_huge -f memcpy -p 10000000 -r 5 0
Using 1 threads, each with 4096 bytes of mmap_huge memory (10000000 passes)
Using function memcpy
56.5062 GB/s
55.3669 GB/s
56.4723 GB/s
55.857 GB/s
56.5396 GB/s

When not using huge pages, the vectorised memcpy hits 115.5 GB/s. I'm seeing a lot of cases on Zen 4 where huge pages actually make things worse; maybe it's related to hardware prefetch reading past the end of the buffer?