Comment 19 for bug 2030515

Revision history for this message
In , Adhemerval Zanella (adhemerval-zanella) wrote :

(In reply to Bruce Merry from comment #11)
> > On Zen3 I am not seeing such slowdown using vectorized instructions.
>
> Agreed, I'm also not seeing this huge-page slowdown on our Zen 3 servers
> (this is with Ubuntu 22.04's glibc 2.32; I haven't got a hand-built glibc
> handy on that server):
>
> $ ./memcpy_loop -D 512 -b 4096 -t mmap_huge -f memcpy -p 10000000 -r 5 0
> Using 1 threads, each with 4096 bytes of mmap_huge memory (10000000 passes)
> Using function memcpy
> 90.065 GB/s
> 89.9096 GB/s
> 89.9131 GB/s
> 89.8207 GB/s
> 89.952 GB/s
>
> $ GLIBC_TUNABLES=glibc.cpu.x86_rep_movsb_threshold=1000000 ./memcpy_loop -D
> 512 -b 4096 -t mmap_huge -f memcpy -p 10000000 -r 5 0
> Using 1 threads, each with 4096 bytes of mmap_huge memory (10000000 passes)
> Using function memcpy
> 116.997 GB/s
> 116.874 GB/s
> 116.937 GB/s
> 117.029 GB/s
> 117.007 GB/s
>
> On the other hand, there seem to be other cases where REP MOVSB is faster on
> Zen 3:
>
> $ ./memcpy_loop -D 512 -f memcpy_rep_movsb -r 5 -t mmap 0
> Using 1 threads, each with 134217728 bytes of mmap memory (10 passes)
> Using function memcpy_rep_movsb
> 22.045 GB/s
> 22.3135 GB/s
> 22.1144 GB/s
> 22.8571 GB/s
> 22.2688 GB/s
>
> $ ./memcpy_loop -D 512 -f memcpy -r 5 -t mmap 0
> Using 1 threads, each with 134217728 bytes of mmap memory (10 passes)
> Using function memcpy
> 7.66155 GB/s
> 7.71314 GB/s
> 7.72952 GB/s
> 7.72505 GB/s
> 7.74309 GB/s
>
> But overall it does seem like the vectorised copy performs better than REP
> MOVSB on Zen 3.

The main issue seems to be defining when ERMS is better than the vectorized path based on the arguments. Current glibc only takes the input size into consideration, whereas from the discussion it seems we also need to take the alignment of the arguments into consideration (of both source and destination).

Also, it seems that on Zen3 ERMS is slightly better than non-temporal instructions, which calls for another tuning heuristic, since again only the size is used to decide when to switch to them (currently x86_non_temporal_threshold).

In any case, I think that at least for the sizes where ERMS is currently used, the vectorized path would be the better choice. Most likely some further tuning to switch to ERMS at large sizes would be profitable for Zen cores.

Does AMD provide any tuning manual describing such characteristics of instruction and memory operations?