lrand48() doesn't scale well on highly concurrent platforms
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
sysbench | Fix Released | Undecided | Unassigned |
Bug Description
When running benchmarks on the POWER8 platform, which offers as many as 160 concurrent threads (20 physical CPU cores * 8 threads per core), we've seen sysbench spending huge amounts of CPU time in lrand48(). The effect was visible as decreased performance in OLTP read/write benchmarks at high concurrency. This is partly an effect of sysbench itself eating CPU and partly an effect of cache line pollution by the shared global RNG state.
Attached is a patch that changes the sb_rnd() function to use a thread-local RNG. It is a straightforward implementation of a 32-bit LCG. More specifically, I used the first parameter set from https:/
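For readers without the attachment, here is a minimal sketch of what a thread-local 32-bit LCG can look like. It is not the patch itself. The names lcg_seed_local() and lcg_next_local() are hypothetical, and the multiplier and increment are the widely published Numerical Recipes pair, chosen only for illustration since the report does not spell out which constants the patch uses.

```c
#include <stdint.h>

/* Illustrative constants (Numerical Recipes pair, intended for modulus 2^32).
   Assumption: the actual patch may use different values. */
#define LCG_MULT 1664525u
#define LCG_INC  1013904223u

/* One state word per thread (C11 _Thread_local), so threads never
   contend on, or invalidate, a shared cache line. */
static _Thread_local uint32_t lcg_state = 1;

/* Hypothetical helper: seed the calling thread's generator. */
void lcg_seed_local(uint32_t seed)
{
    lcg_state = seed;
}

/* Hypothetical helper: advance the calling thread's generator.
   Unsigned 32-bit arithmetic wraps, which gives the "mod 2^32" for free. */
uint32_t lcg_next_local(void)
{
    lcg_state = lcg_state * LCG_MULT + LCG_INC;
    return lcg_state;
}
```

Each thread would call lcg_seed_local() once with a distinct seed and then draw values with lcg_next_local(); no locking is needed because no state is shared.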
With the patch applied the peak throughput went up by 20% from 20K tps to 24K tps. I'd like to see this change in sysbench trunk soon.
I'd like to keep the current RNG and make sb_rnd_local() usage optional and disabled by default.
Also, the patch uses SB_MAX_RND (0x3fffffff) as the modulus, while the chosen values of the multiplier and increment correspond to a modulus of 2^32. This means condition #2 from https://en.wikipedia.org/wiki/Linear_congruential_generator#Period_length does not hold, and thus the chosen constants may have a negative impact on RNG quality.
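The period-length concern can be checked mechanically. The sketch below tests the three full-period (Hull-Dobell) conditions for an LCG with a nonzero increment. lcg_has_full_period() is a hypothetical helper, not part of sysbench, and the constants in the usage note are the Numerical Recipes pair, assumed only for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Greatest common divisor by Euclid's algorithm. */
static uint64_t gcd64(uint64_t a, uint64_t b)
{
    while (b != 0) {
        uint64_t t = a % b;
        a = b;
        b = t;
    }
    return a;
}

/* Hypothetical helper: returns true when x -> (a*x + c) mod m has period m.
   Intended for m up to 2^32, so p * p below cannot overflow. */
bool lcg_has_full_period(uint64_t a, uint64_t c, uint64_t m)
{
    assert(m > 1 && a >= 1);

    /* c and m must be relatively prime. */
    if (gcd64(c, m) != 1)
        return false;

    /* a - 1 must be divisible by 4 if m is divisible by 4. */
    if (m % 4 == 0 && (a - 1) % 4 != 0)
        return false;

    /* a - 1 must be divisible by every prime factor of m
       (factors found by trial division). */
    uint64_t n = m;
    for (uint64_t p = 2; p * p <= n; p++) {
        if (n % p == 0) {
            if ((a - 1) % p != 0)
                return false;
            while (n % p == 0)
                n /= p;
        }
    }
    /* Whatever remains of n is a single prime factor larger than sqrt(m). */
    if (n > 1 && (a - 1) % n != 0)
        return false;

    return true;
}
```

Under the assumed constants, lcg_has_full_period(1664525, 1013904223, 1ULL << 32) returns true, while the same call with a modulus of 0x3fffffff returns false: 0x3fffffff is 2^30 - 1, which is divisible by 3, and 1664524 is not.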