Thread pool: new threads are created too slowly even with tuning

Bug #909774 reported by Elena Stepanova
Affects: MariaDB
Status: Fix Released
Importance: Undecided
Assigned to: Vladislav Vaintroub

Bug Description

If there is at least one active thread in a group, it takes a long time for the pool to create a new one.

In the provided test case (1), two threads are running a SELECT on a big table, while a third one sends a very fast query that does not depend on the table at all (e.g. SELECT NULL FROM DUAL). The timer in the test case starts right before that query and ends right after it, so the time reported by MTR relates to this query only.

In the traditional case, without the thread pool, the query takes a few milliseconds (1-3 ms in my tests).
With thread-pool-size=1 it takes 1.685 seconds.

I'm using thread-pool-size=1 to make the measurements more obvious; in fact the issue scales, it only requires more client threads to reproduce it with a bigger pool size. For example, with thread-pool-size=4 and 10 threads querying the big table, it again takes about 1.7 seconds. Even with the minimum value of thread-pool-stall-limit=60 it still takes 200+ ms, which is 100-200 times more than with one-thread-per-connection.

If you run the same test with count=10 (the number of threads querying the table), thread-pool-size=1 and the default thread-pool-stall-limit, you can see in the process list output that threads are created roughly one per second, and it takes ~10 seconds to process the SELECT that we are timing.
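For reference, a minimal sketch of the server options behind these measurements (assuming a Unix build of MariaDB where the pool is enabled via thread-handling; the values are just the ones discussed above):

mysqld --thread-handling=pool-of-threads \
       --thread-pool-size=1 \
       --thread-pool-stall-limit=60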

There are various ways such a delay could impact real-life scenarios. To name one, assume you have a job (or simply user inflow) that connects to the database every 0.5 seconds. Each connection runs a 5-second query and exits.

With the traditional thread handling (one thread per connection), even on a 1-core machine this is a tiny load which won't cause any problem. After an initial growth to 10-12 open connections or so, the number stabilizes and stays there as long as needed.
With the artificial delay on request processing, the connections start piling up, since requests arrive faster than they are processed. With default parameters on the same 1-core machine it is only a matter of a minute or so before the max connections limit is hit.

The second test case is an example of a script which can be used to observe the problem. It's not pretty, and it might stop working the same way and need adjustments after bug#909537 is fixed, but it reflects the idea, as it's difficult to implement nicely using standard tools.

# Test case 1
# At the moment, it could be implemented more easily with sleeps,
# but I want to keep it separate from bug#909537.

--enable_connect_log

SET @tp_size=@@thread_pool_size;
SET @tp_idle=@@thread_pool_idle_timeout;
SET GLOBAL thread_pool_size=1;
SET GLOBAL thread_pool_idle_timeout=1;

--disable_warnings
DROP TABLE IF EXISTS t1;
--enable_warnings
CREATE TABLE t1 ( i INT );
INSERT INTO t1 VALUES (1),(2),(3),(4);
INSERT INTO t1 SELECT a.i FROM
 t1 a, t1 b, t1 c, t1 d, t1 e,
 t1 f, t1 g, t1 h, t1 i, t1 j;

--let $count=2

--let $id=$count
while ($id)
{
  --connect(con$id,localhost,root,,)
  --dec $id
}

# Wait till old idle threads are gone
--sleep 2

--let $id=$count
while ($id)
{
   --connection con$id
   --send
     eval SELECT COUNT(*) FROM t1 WHERE i=$id;
   --dec $id
}

--connection default
--start_timer
SELECT NULL FROM DUAL;
--end_timer
SHOW GLOBAL STATUS LIKE 'threadpool%threads';
SHOW PROCESSLIST;

DROP TABLE t1;

SET GLOBAL thread_pool_size=@tp_size;
SET GLOBAL thread_pool_idle_timeout=@tp_idle;

--exit

# End of test case 1

# Test case 2
# It assumes you already have the server running
# Tune the sleep time depending on your threadpool parameters

debug/client/mysql -uroot -e "SET GLOBAL lock_wait_timeout=5; DROP TABLE IF EXISTS test.t; CREATE TABLE test.t ( i INT ); LOCK TABLE test.t WRITE; SELECT SLEEP(300)" &

while [ 1 ]
do
  debug/client/mysql -uroot -ss --force -e "SELECT COUNT(*) FROM information_schema.processlist; SELECT 1 FROM test.t" &
  sleep 0.5
done
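
# While the loop above is running, the backlog can be watched from a separate
# shell (a sketch; the Threadpool_* status variables assume the MariaDB
# thread pool is enabled):
#
#   debug/client/mysql -uroot -e "SHOW GLOBAL STATUS LIKE 'Threadpool%'; SHOW FULL PROCESSLIST"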

# End of test case 2

Elena Stepanova (elenst)
description: updated
Elena Stepanova (elenst)
summary: - Thread pool: new threads are created too slowly even with maximum tuning
+ Thread pool: new threads are created too slowly even with tuning
Revision history for this message
Vladislav Vaintroub (wlad-montyprogram) wrote :

Please take a look at the description of how thread scheduling is done in SQL Server. The best description is perhaps in Ken Henderson's "internals" book, but there is some information spread around on the internet too (google for SQL Server, UMS, scheduling).
For example, this one
http://www.vineetgupta.com/2005/11/sql-server-non-preemptive-scheduling-ums-clr-and-fiber-mode/
also describes how SQL Server schedules its tasks, though not in great detail.

It is important to understand that, unlike the OS scheduler, which is preemptive, every pooling scheduler is non-preemptive (more or less; we allow some level of preemption). The objective of a preemptive scheduler is to give every task a chance to execute. The objective of a non-preemptive scheduler is to maximize performance. Quoting from the article above, its "job in life is to ensure that there is one unblocked thread executing on that CPU at any given point of time, and all other threads are blocked".

If we followed this rule we would never create an additional thread if there is already an active thread in the group. We do create them, however, even if at a speed which you do not like. We can reduce the stall limit, and we can probably talk about an additional QoS feature, a "max_queuing_time" parameter (which is, I think, what you ultimately would like to have: limiting the queueing time of a request). But before that I would like to ensure the general objectives and techniques of thread pooling are understood and agreed upon.

Having said this, the example looks like a specifically crafted thread pooling anti-pattern. It assumes a never-ending avalanche of long, non-yielding, CPU-intensive queries, and it assumes a rather questionably designed scheduled job (there is a better job scheduler in MySQL for such things, btw:). For such things I guess thread-per-connection would work better (and even with thread pooling in MariaDB you can have a separate thread-per-connection scheduler too, so you may decide according to your needs).

Revision history for this message
Vladislav Vaintroub (wlad-montyprogram) wrote :

Ok, I reduced the thread_pool_stall_limit minimum to 10 milliseconds. I also introduced a thread_pool_oversubscribe parameter to fine-tune how many tasks a group can run at the same time (before the change, the maximum number of parallel tasks was hardcoded to 4).

So now one can reduce the stall limit and increase thread_pool_oversubscribe to get somewhat more "thread-per-connection"-ish behavior in the thread pool and more eager thread creation. But queuing of tasks will still happen, since that is what thread pools do.
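
For illustration, a minimal my.cnf sketch combining the two knobs (the variable names are the ones described above; the values are only examples to be tuned per workload, and thread_handling=pool-of-threads is assumed to be how the pool is enabled on this build):

[mysqld]
thread_handling=pool-of-threads
# a group is considered stalled, and gets an extra thread, after 10 ms (the new minimum)
thread_pool_stall_limit=10
# allow more tasks to run in a group at once (previously hardcoded to 4)
thread_pool_oversubscribe=10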

Changed in maria:
status: New → Fix Committed
Revision history for this message
Elena Stepanova (elenst) wrote :

>> the example looks like specifically crafted thread pooling anti-pattern

The initial description consisted of two parts. The first part was to show that even with a minimal deviation from the perfect flow, the impact of the new (presumably default) configuration can be noticed by a live user. The second case pushed it to an extreme to show that it can lead not only to performance problems, but to a loss of functionality. I agree that the second part might not be the best example; it was just a quick one. I think such patterns might exist in real life because they are easy to create, but they will probably be rare, so let's ignore it.

I find the initial scenario in itself worrisome.

The article above bases the "job in life is to ensure that there is one unblocked thread executing..." statement on a somewhat arguable assumption: "since all threads are SQL spawned, they are "well-behaved" and include code that prevents them from monopolizing the system". I don't know if it's true for the SQL server in question, but in our case long non-yielding queries do happen. It's quite normal to expect that these queries might suffer some performance loss; but the first example shows that in fact _other_ queries are affected. Even with only two long queries running at the same time, the delay for unrelated simple short queries is perceivable; and with 5-10 long queries, the delay for others might be seriously annoying.

If we let it be the default behavior, what we are likely to observe is that after an upgrade users will start complaining that "every now and then a simple query might hang for 5-10 seconds". Thinking about widespread real-life setups (web applications, virtual hosting, etc.), in many cases the schema owner might have no way whatsoever to avoid or even investigate that, since the "bad" long queries might be happening in a different schema on the same server. Since long queries don't necessarily cause general system overload, monitors will show nothing suspicious, so the hosting admins will have a hard time investigating it too, and the conclusion is likely to be "the server is just slow at times". That is a bad reputation which spreads fast and is hard to counterweigh with nice benchmark results.

I will try the fix to see how it works now, but in general my opinion is that it makes sense to disable the new behavior by default. People who really care about performance at the level of context switching don't run their servers with default parameters anyway -- they do fine-tuning. If they enable thread pooling in their configuration manually, they will at least know what they changed if something goes wrong, while the users who only care whether their queries run in 1 second or in 5 won't get a new problem.

Revision history for this message
Vladislav Vaintroub (wlad-montyprogram) wrote :

>> It's quite normal to expect that these queries might suffer some performance loss; but the first example shows that in fact
>> _other_ queries are affected. Even with only two long queries running at the same time, the delay for unrelated simple short
>> queries is perceivable; and with 5-10 long queries, the delay for others might be seriously annoying.

Well, I said "specially crafted anti-pattern". What I meant by it is that to create something like your example, one has to reduce the thread pool size to 1, one has to run queries that never yield and that run long, one has to start all long queries at the same time, and one has to measure the response time of only the very first "normally fast" query.

Reducing thread-pool-size to 1 (a computer with a single processor is now quite hard to find in the wild) increases the probability of several long non-yielding queries in the same group, and increases it by quite a large factor. Starting all such long queries at the very same time artificially increases the queue size. And putting a single dummy query at the very end of the queue is used as evidence that such queries would generally run longer in the presence of long non-yielding queries. What is actually measured by this workload is how long the ramp-up is when the environment changes from absolutely idle to a flood of "bad" long queries coming from multiple clients simultaneously.

So if I were to come up with an anti-threadpool-pattern workload, it would be this: an environment in which many different clients all suddenly start to issue long non-yielding queries simultaneously, and then disconnect/sleep for a long time (so that idle threads are removed again). To finish the picture, one needs to throw a couple of clients with short queries into the mix; they need to run at the same time, and the importance of the short queries completing in a short timeframe needs to be rather high.

This is pretty much your test case, though yours only had a single "burst". I would not classify that pattern as something common; I actually cannot come up with a non-artificial example of it.

If your concern is about whether or not the threadpool will be the default, it is not up to me to decide, even if I myself would prefer it as the default, at least in the alpha/beta product stage (if it is *not* the default, we will be unlikely to get a lot of feedback about it). Right now I'm using it in a feature tree, as a simple and effective way to find regressions, more effective than a single test in the test suite; I'd like to get a sufficient level of testing before we release it.

Revision history for this message
Elena Stepanova (elenst) wrote :

Fix released with 5.5.21.

Changed in maria:
status: Fix Committed → Fix Released