lrand48() doesn't scale well on highly concurrent platforms

Bug #1412488 reported by Axel Schwenke on 2015-01-19
Affects: sysbench
Importance: Undecided
Assigned to: Unassigned

Bug Description

When benchmarking on the POWER8 platform, which offers as many as 160 concurrent hardware threads (20 physical CPU cores * 8 threads per core), we have seen sysbench spend huge amounts of CPU time in lrand48(). The effect was visible as decreased performance in OLTP read/write benchmarks at high concurrency. This is partly an effect of sysbench itself eating CPU, and partly an effect of cache line pollution caused by the shared global RNG state.

Attached is a patch that changes the sb_rnd() function to use a thread-local RNG. It is a straightforward implementation of an LCG with a 32-bit state. More specifically, I used the first parameter set from https://en.wikipedia.org/wiki/Linear_congruential_generator. The implementation uses autoconf to check for thread-local storage. If the build environment doesn't provide it, the code falls back to the old behavior.
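The per-thread LCG described above might look roughly like the following sketch. This is not the actual patch: the identifiers (SB_THREAD, sb_rnd_state, sb_rnd_local) and the HAVE_TLS macro are illustrative assumptions, and the constants are the Numerical Recipes parameter set (a = 1664525, c = 1013904223, m = 2^32) from the Wikipedia table referenced above.

```c
#include <stdint.h>

/* Assumed autoconf-provided macro; when thread-local storage is
 * unavailable, fall back to a single shared state (old behavior). */
#ifdef HAVE_TLS
# define SB_THREAD __thread
#else
# define SB_THREAD
#endif

static SB_THREAD uint32_t sb_rnd_state = 1;

static inline uint32_t sb_rnd_local(void)
{
  /* uint32_t arithmetic wraps around, which gives the reduction
   * modulo 2^32 for free. */
  sb_rnd_state = sb_rnd_state * 1664525u + 1013904223u;
  return sb_rnd_state;
}
```

Since each thread owns its own sb_rnd_state, no cache line is shared between CPUs on the hot path, which is the whole point of the change.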

With the patch applied, peak throughput went up by 20%, from 20K tps to 24K tps. I'd like to see this change in sysbench trunk soon.

Alexey Kopytov (akopytov) wrote :

I'd like to keep the current RNG and make sb_rnd_local() usage optional and disabled by default.

Also, the patch uses SB_MAX_RND (0x3fffffff) as the modulus, while the chosen values of the multiplier and increment correspond to a modulus of 2^32. This means condition #2 from https://en.wikipedia.org/wiki/Linear_congruential_generator#Period_length does not hold, and thus the chosen constants may have a negative impact on RNG quality.
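The period condition being referenced is the Hull–Dobell theorem: an LCG has full period m iff (1) c is coprime to m, (2) a-1 is divisible by every prime factor of m, and (3) a-1 is divisible by 4 when m is. A tiny standalone demonstration (illustrative only, not sysbench code) makes the effect of violating condition #2 concrete:

```c
/* Count how many steps an LCG with parameters (a, c, m) takes to
 * return to its starting state 0, or return 0 if it never does
 * (i.e. it fell into a shorter cycle not containing the seed). */
static unsigned lcg_period(unsigned a, unsigned c, unsigned m)
{
  unsigned x = 0, steps = 0;
  do {
    x = (a * x + c) % m;
    steps++;
  } while (x != 0 && steps < 2 * m);
  return (x == 0) ? steps : 0;
}
```

With m = 16, a = 5, c = 3 all three conditions hold and the period is the full 16. Changing only the multiplier to a = 6 (so a-1 = 5 shares no prime factor with 16, violating condition #2) makes the generator collapse into a fixed point, so it never returns to the seed at all.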

Axel Schwenke (ahel) wrote :

> the patch uses SB_MAX_RND (0x3fffffff) as the modulus, while
> the chosen values of multiplier and increment correspond to
> the modules value of 2^32. Which means condition number #2 from
> https://en.wikipedia.org/wiki/Linear_congruential_generator#Period_length
> does not hold, and thus the chosen constants may have a negative
> impact on RNG quality.

I think this is not a valid point for two reasons:

1. SB_MAX_RND is *not* the modulus of the LCG. The modulus is 2^32, but the output of the LCG is projected onto a smaller range by the (seed % SB_MAX_RND) operation. There is no way this projection can make the resulting period shorter than SB_MAX_RND itself.

2. The current implementation does the same thing. In sysbench.h there is

#define SB_MAX_RND 0x3fffffffu
...
#ifdef HAVE_LRAND48
# define sb_rnd() (lrand48() % SB_MAX_RND)
...

meaning the 31 random bits returned from lrand48() are projected to a smaller range.

Furthermore, I think the modulo operation above is wrong. If we want to project a random number from a larger range into the interval [0..SB_MAX_RND] (limits included), then we should use (lrand48() % (SB_MAX_RND+1)). That is also more efficient, because it can be implemented as (rand_input & SB_MAX_RND), thus avoiding a division.
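The mask equivalence claimed above is easy to check: SB_MAX_RND + 1 = 0x40000000 is a power of two, so reduction modulo it is exactly a bitwise AND with SB_MAX_RND. A minimal sketch (project() is a hypothetical helper name, not a sysbench function):

```c
#include <stdint.h>

#define SB_MAX_RND 0x3fffffffu

/* Map any 32-bit value into [0..SB_MAX_RND], limits included.
 * Because SB_MAX_RND + 1 is a power of two, the AND below computes
 * the same result as (x % (SB_MAX_RND + 1)) without a division. */
static inline uint32_t project(uint32_t x)
{
  return x & SB_MAX_RND;
}
```

Note the contrast with the current (lrand48() % SB_MAX_RND): the modulo by SB_MAX_RND itself can never produce the value SB_MAX_RND, so the upper limit of the interval is silently excluded.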

I didn't read all the code, but do you remember why there is a fixed value of SB_MAX_RND at all? The only place in the code where I see it used is when the result from sb_rnd() is transformed to a DOUBLE value in [0..1]. Wouldn't it be much better to choose SB_MAX_RND according to the RNG that is picked by autoconf? Then SB_MAX_RND would be 2^31-1 for lrand48() (which returns 31 bits) and 2^32-1 for sb_rnd_local() (which returns 32 bits).

Alexey Kopytov (akopytov) wrote :

> 1. SB_MAX_RND is *not* the modulus of the LCG. The modulus is 2^32, but
> output of the LCG is projected to a smaller range by the (seed %
> SB_MAX_RND) operation. There is no way that this could affect the
> resulting period length to become shorter than SB_MAX_RND itself.
>

Right, I missed that the result of (seed % SB_MAX_RND) is not assigned
back to seed.

Anyway, I don’t want to change the default RNG behavior. All changes
should be optional and off by default.

Sergey Vojtovich (svoj) wrote :

Just for the record, there is another option to attack this problem: reduce the number of lrand48() calls. 31 random bits are enough to generate 9 decimal digits using division (possibly slower), or 7 decimal digits using bitwise shifts (possibly faster).

Calling lrand48() 7 to 9 times less often per transaction should solve this problem, at least for the given core/thread count.
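The batching idea could be sketched as follows. This is hypothetical code (not the fix that was actually pushed): one lrand48() call is reduced to 9 decimal digits, which are then peeled off one by one. The simple % 10 peeling leaves a slight bias in the high digits because 2^31 is not a power of ten, which is assumed acceptable for a load generator.

```c
#include <stdlib.h>

/* Pool of decimal digits refilled from a single lrand48() call.
 * 2^31 > 10^9, so one call yields 9 digits. */
static long digit_pool  = 0;
static int  digits_left = 0;

static int next_digit(void)
{
  if (digits_left == 0) {
    digit_pool  = lrand48() % 1000000000L;  /* keep 9 decimal digits */
    digits_left = 9;
  }
  int d = (int)(digit_pool % 10);           /* take the lowest digit */
  digit_pool /= 10;
  digits_left--;
  return d;
}
```

The shift-based variant mentioned above would instead carve the 31 bits into fixed-width bit groups, trading one division per digit for a shift and a mask.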

Alexey Kopytov (akopytov) wrote :

A fix based on the same idea has been pushed to the 1.0 branch: https://github.com/akopytov/sysbench/commit/e24a028409da3dc892f026a20ac371ecf8e2ef42

Thank you!

Changed in sysbench:
status: New → Fix Released