Performance regression with memcpy on Intel CPU
| Affects | Status | Importance | Assigned to | Milestone |
|---|---|---|---|---|
| glibc (Ubuntu) | Expired | Undecided | Unassigned | |
Bug Description
# lsb_release -rd
Description: Ubuntu 20.04.4 LTS
Release: 20.04
Reporting a performance regression in libc6-dev=
Regression was observed on Intel Xeon(R) Gold 6248 CPU @ 2.50GHz (Cascade Lake)
We're seeing a 3x slowdown on the following tiny program, for example, and similar slowdowns on important workloads:
```
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
int main(void) {
    size_t SIZE = (1 << 20);
    char *src = malloc(SIZE);
    char *dst = malloc(SIZE);
    if (!src || !dst)
        return 1;
    /* Fill both buffers so their pages are actually committed. */
    for (size_t i = 0; i < SIZE; ++i) {
        src[i] = rand() % 256;
        dst[i] = rand() % 256;
    }
    clock_t start = clock();
    for (int i = 0; i < 10000; ++i) {
        memcpy(dst, src, SIZE);
    }
    clock_t end = clock();
    /* Elapsed CPU time in seconds. */
    printf("%f\n", (double)(end - start) / CLOCKS_PER_SEC);
    free(src);
    free(dst);
    return 0;
}
```
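To reproduce (the file and binary names below are just placeholders):
```
$ gcc -O2 -o memcpy_bench memcpy_bench.c
$ ./memcpy_bench
```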
Probably due to changes resulting from https:/

Thanks for the report, Shantanu!
Have you confirmed whether this is indeed related to the changes from bug 1928508? I've looked into upstream changes to __x86_shared_non_temporal_threshold, and there were no fixes or regression reports after the ones we backported to Ubuntu Focal. When this change was introduced, no regressions on other platforms were reported upstream or in Ubuntu, so I wonder if we missed your test case.
Would you be able to double-check whether that patch is responsible? Have you seen different performance behavior in recent glibc versions, or in other distros with the same glibc version? You could also try different values for the __x86_shared_non_temporal_threshold tunable, like below:
$ GLIBC_TUNABLES=glibc.cpu.x86_non_temporal_threshold=$((1024*1024*3*4))
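For example, assuming the reproducer above was compiled to ./memcpy_bench (a placeholder name), you could compare the default against an explicitly raised threshold; the tunable expects a plain integer, hence the shell arithmetic:
```
$ ./memcpy_bench
$ GLIBC_TUNABLES=glibc.cpu.x86_non_temporal_threshold=$((1024*1024*3*4)) ./memcpy_bench
```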