Quantum default settings will cause deadlocks due to overflow of sqlalchemy_pool

Bug #1184484 reported by Clint Byrum
This bug affects 10 people
Affects   Status         Importance   Assigned to         Milestone
neutron   Fix Released   High         Salvatore Orlando
tripleo   Fix Released   Critical     Joe Gordon

Bug Description

By default, quantum-server creates an sqlalchemy_pool of 5 connections. This starts to fail as more data and compute nodes are added. Raising it to 40 seems to stop the problem, though it may need to go higher. Raising it to 20 still results in:

TimeoutError: QueuePool limit of size 20 overflow 10 reached, connection timed out, timeout 30

Raising it to 40 seems to have restored responsiveness. I suspect it may be too low as the number of clients increases.

[DATABASE]
sqlalchemy_pool_size = 40
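
To make the numbers in the error message concrete, here is a toy model of a bounded pool in Python. It is not SQLAlchemy's QueuePool code; ToyPool, ToyPoolTimeout and count_timeouts are names invented for this sketch. The assumption is that at most size + overflow connections can be checked out at once, and that a further checkout waits for the timeout and then fails.

```python
import threading


class ToyPoolTimeout(Exception):
    """Raised when no connection becomes free within the timeout."""


class ToyPool:
    """Toy stand-in for a bounded connection pool (not SQLAlchemy code)."""

    def __init__(self, size, max_overflow, timeout):
        self.size = size
        self.max_overflow = max_overflow
        self.timeout = timeout
        # One slot per connection that may be checked out at the same time.
        self._slots = threading.BoundedSemaphore(size + max_overflow)

    def checkout(self):
        # Wait up to `timeout` seconds for a free slot, then give up.
        if not self._slots.acquire(timeout=self.timeout):
            raise ToyPoolTimeout(
                "limit of size %d overflow %d reached, timeout %s"
                % (self.size, self.max_overflow, self.timeout)
            )
        return object()  # placeholder for a real DB connection

    def checkin(self, conn):
        self._slots.release()


def count_timeouts(pool, requests):
    """Hold `requests` connections at once; count the checkouts that time out."""
    held, timeouts = [], 0
    for _ in range(requests):
        try:
            held.append(pool.checkout())
        except ToyPoolTimeout:
            timeouts += 1
    for conn in held:
        pool.checkin(conn)
    return timeouts
```

In this model a pool of size 5 with overflow 10 serves 15 simultaneous holders, and every holder beyond that times out. Raising the size only moves the ceiling; it does not remove it.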

Tags: db
Changed in tripleo:
importance: Undecided → Critical
status: New → Triaged
description: updated
Aaron Rosen (arosen)
tags: added: db
Robert Collins (lifeless) wrote :

I've taken it up to 60; I triggered it again.

Salvatore Orlando (salvatore-orlando) wrote :

This is a structural issue in Quantum.
Two factors contribute to it:
- Some transactions are very long, creating a larger interval when a connection is used before being returned to the pool
- Current db pooling code does not work well with nested transactions and causes a distinct connection to be taken from the pool for each subtransaction. We are moving to oslo DB pooling which will solve this part of the issue.

I reckon that migrating to oslo DB will make things a lot more resilient.
Sorting out usage of DB transactions in Quantum code might take longer.
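
The second factor can be illustrated with a small Python sketch. It is a toy model, not Quantum or oslo code: CountingPool, run_per_level, run_shared and peak_connections are hypothetical names. It compares taking one connection per nesting level with taking one connection for the outermost transaction and reusing it in the nested levels.

```python
class CountingPool:
    """Toy pool that only tracks how many connections are checked out."""

    def __init__(self):
        self.in_use = 0
        self.peak = 0

    def checkout(self):
        self.in_use += 1
        self.peak = max(self.peak, self.in_use)
        return object()  # placeholder for a real DB connection

    def checkin(self, conn):
        self.in_use -= 1


def run_per_level(pool, depth):
    """Each (sub)transaction level takes its own connection from the pool."""
    if depth == 0:
        return
    conn = pool.checkout()
    try:
        run_per_level(pool, depth - 1)
    finally:
        pool.checkin(conn)


def run_shared(pool, depth, conn=None):
    """The outermost level takes one connection; nested levels reuse it."""
    if depth == 0:
        return
    owner = conn is None
    if owner:
        conn = pool.checkout()
    try:
        run_shared(pool, depth - 1, conn)
    finally:
        if owner:
            pool.checkin(conn)


def peak_connections(run, depth):
    """Return the largest number of connections held at once by `run`."""
    pool = CountingPool()
    run(pool, depth)
    return pool.peak
```

With three nesting levels the per-level variant holds three connections for a single request, while the shared variant holds one. Under the first behaviour a small pool is exhausted by far fewer concurrent requests than its size suggests.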

Clint Byrum (clint-fewbar) wrote : Re: [Bug 1184484] Re: Quantum default settings will cause deadlocks due to overflow of sqlalchemy_pool

Excerpts from Salvatore Orlando's message of 2013-05-29 09:58:27 UTC:
> This is a structural issue in Quantum.
> Two factors contribute to it:
> - Some transactions are very long, creating a larger interval when a connection is used before being returned to the pool
> - Current db pooling code does not work well with nested transactions and causes a distinct connection to be taken from the pool for each subtransaction. We are moving to oslo DB pooling which will solve this part of the issue.
>
> I reckon that migrating to oslo DB will make things a lot more resilient.
> Sorting out usage of DB transactions in Quantum code might take longer.
>

Sounds like a legitimate bug that warrants an appropriate Status and
Importance. Might I suggest "Triaged" and "Critical" ?

Salvatore Orlando (salvatore-orlando) wrote :

I will set it to Triaged, but I will leave setting the priority to the db team lead.
High to Critical sounds reasonable to me.

I think the implementation of blueprint oslo-db-support and the fix for bug #1179745 will bring us to a stage where this problem should not occur anymore.

Changed in quantum:
status: New → Triaged
Robert Collins (lifeless) wrote :

Arosen says ...
07:05 < arosen> lifeless: are you guys running with: sql_dbpool_enable = True ?
07:06 < lifeless> let me see
07:07 < lifeless> arosen: https://github.com/stackforge/tripleo-image-elements/blob/master/elements/quantum/os-config-applier/etc/quantum/quantum.conf
07:07 < lifeless> arosen: doesn't look like it
07:07 < lifeless> arosen: is this another off-by-default-but-really-should-be-on-setting ?
07:08 < arosen> lifeless: It should be in the agent configfile in the database section
https://github.com/stackforge/tripleo-image-elements/blob/master/elements/quantum/os-config-applier/etc/quantum/plugins/openvswitch/ovs_quantum_plugin.ini#L3
07:09 < arosen> lifeless: i'm not sure why it's off by default. I know that we have it turned on in our
                production cloud. I'd give it a shot and see if it helps.

Aaron Rosen (arosen)
Changed in quantum:
importance: Undecided → High
Robert Collins (lifeless) wrote :

07:13 < lifeless> arosen: will do; so default value for the pool, but set sql_dbpool_enable=True on in the same
                  section
07:13 < arosen> yup in the [DATABASE] section

Robert Collins (lifeless) wrote :

Now hit...
    (self.size(), self.overflow(), self._timeout))
TimeoutError: QueuePool limit of size 60 overflow 10 reached, connection timed out, timeout 30

so this is a good time to try the sql_dbpool_enable instead : will report back on how it goes.

Robert Collins (lifeless) wrote :

Setting sql_dbpool_enable=True in the ovs plugin caused quantum net-list to hang, with or without the pool size setting we added to work around this bug.

Salvatore Orlando (salvatore-orlando) wrote :

Thanks for the further details Robert.

I still suspect that the following patches:
https://review.openstack.org/#/c/27265/
https://review.openstack.org/#/c/29513/

will improve, if not remove altogether, the scale issue that is being reported here.
I will verify whether this is correct, and if not, try and fix the issue.

Changed in quantum:
assignee: nobody → Salvatore Orlando (salvatore-orlando)
milestone: none → havana-2
Salvatore Orlando (salvatore-orlando) wrote :

It really seems the two patches linked in the previous comment solve this issue.
I have cherry picked them on top of current master.
The resulting branch is here: https://github.com/salv-orlando/quantum/tree/db_improved

Given my limited resources for testing I simulated what would normally happen with interactions with nova, the dhcp agent, and tenants. I've used several simple scripts to this aim, but this probably sums up all of them: https://gist.github.com/salv-orlando/5778033

The current trunk fails even with relatively small loads (30 concurrent scripts executing the 'read' part with the default 5 connections, and even fewer when db pools are enabled).
The code with these two patches appears to be more resilient. No error was returned even with 300 concurrent 'read' parts being executed, still with the default 5 connections in the pool.

Also, execution of the script was much faster when the two patches were applied.

I reckon we should work to merge them as soon as possible - they should allow us to mark this bug as fixed too.

Robert Collins (lifeless) wrote :

I tried pulling that branch into our current POC environment, but it resulted in quantum net-list showing nothing (ditto subnet-list and port-list), with no errors on the server side.

I can bring up a new test environment, or in a few weeks our current test environment will be freed up and I can tear it down and drop a new quantum in there. Sorry!

Salvatore Orlando (salvatore-orlando) wrote :

That branch is just master + 2 cherry-picked commits.
I had the same problem as you; it was due to a change in the configuration of the sql connection section, which was supposed to be backward compatible but apparently isn't. The sql connection setting is ignored and the server defaults to an empty sqlite db.

In ovs_quantum_plugin.ini, try changing the [DATABASE] section name to lowercase and renaming sql_connection to connection. This worked for me, something like the following:

[database]
connection = mysql://root:password@localhost/ovs_quantum?charset=utf8
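
The effect of the rename can be sketched with Python's stdlib configparser, used here only as a stand-in for Quantum's own configuration loader, which is not shown in this report. read_connection is a hypothetical helper, and the lookup order (new name first, legacy name second) is an assumption made for illustration.

```python
import configparser


def read_connection(ini_text):
    """Return the DB connection string from INI text, or None if absent."""
    # Interpolation is disabled so a '%' in a connection URL is kept literally.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(ini_text)
    # Section names are case-sensitive in configparser, so [DATABASE] and
    # [database] are distinct sections and must be checked separately.
    for section, option in (("database", "connection"),
                            ("DATABASE", "sql_connection")):
        if parser.has_option(section, option):
            return parser.get(section, option)
    return None  # the caller decides what to do when nothing is configured
```

A loader that checks only the first pair finds nothing in an old-style file, which matches the symptom described above: the setting is silently ignored and a default is used instead.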

Robert Collins (lifeless) wrote :

Ugh: That's backwards incompatible; is that problem in master, or in the new patches?

Robert Collins (lifeless) wrote :

Ok, I've applied these patches and removed our setting of
sqlalchemy_pool_size = 40
...
and the problem still occurs:
2013-06-17 22:26:22.279 25310 TRACE quantum.openstack.common.rpc.amqp TimeoutError: QueuePool limit of size 10 overflow 20 reached, connection timed out, timeout 10
2013-06-17 22:26:22.279 25310 TRACE quantum.openstack.common.rpc.amqp

Salvatore Orlando (salvatore-orlando) wrote :

Thanks for letting me know.
I will investigate further and root cause this issue.

Robert Collins (lifeless) wrote :

Thanks! I will be happy to gather any data or do tests as needed (though it should be trivially reproducible in devstack; if it's not, that would be an interesting datapoint).

Jack McCann (jack-mccann) wrote :

We were seeing the "TimeoutError: QueuePool limit" error. We tried bumping the pool size up, but eventually we'd hit the problem again as load increased. We wound up with a patch that gives the option for multiple api "worker" processes, as in glance, and this made quite a difference. We can put that patch up for consideration if folks are interested.

Salvatore Orlando (salvatore-orlando) wrote :

It's surely worth having a look - please push it to gerrit or attach it to this bug report!

As far as this particular issue is concerned, according to Robert's report it also happens at very small scale (10 concurrent requests is reasonably small imho), where multiple workers should not be needed.

I would therefore tend to pinpoint and fix this issue first, and then look at multiple workers for increasing scale, which is something definitely worth looking at.

PS: Sorry I was busy with something else in the last 3 days.

Salvatore Orlando (salvatore-orlando) wrote :

I have been stress testing Neutron server for about a week now - and I've been unable to reproduce this issue.

My setup is:
- 1 controller node with neutron-server, neutron-agent, neutron-dhcp-agent, neutron-l3-agent, neutron-metadata-agent, nova-cpu, nova-scheduler, nova-api, nova-conductor, glance-api, glance-registry
- 4 compute nodes with nova-cpu only

In order to try to exacerbate the problem I have set:
max_pool_size = 1
max_overflow_size = 1

Stress testing has been performed in the following way:
1) Keep executing nova list 5 times per second (also executes port-list on neutron)
2) Spawn concurrently instances with --num-instances parameter on nova boot (25 instances at a time)
3) Spawn concurrently instances with create server requests for single instances (25 instances at a time, requests executed in parallel)
4) Same experiment as #3 but creating a distinct network and a subnet for each instance

Instances are continuously created and destroyed, with an interval of 20 seconds between the execution of the nova boot and nova delete commands.

The test has been running for over 48 hours now, and no exception was reported on quantum; no error in VM spawn was reported either.
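
The parallel part of steps 2 to 4 could be driven by a harness along the following lines. This is a sketch only: run_batch is a hypothetical helper, the task passed to it is a placeholder for whatever issues the boot request, and no real nova or quantum call is made here.

```python
from concurrent.futures import ThreadPoolExecutor


def run_batch(task, parallelism):
    """Run task(i) for i in range(parallelism) in parallel threads.

    Returns the list of exceptions raised by the tasks (empty if all passed).
    """
    with ThreadPoolExecutor(max_workers=parallelism) as executor:
        futures = [executor.submit(task, i) for i in range(parallelism)]
    # Leaving the `with` block waits for every submitted task to finish.
    errors = []
    for future in futures:
        exc = future.exception()
        if exc is not None:
            errors.append(exc)
    return errors
```

Calling run_batch with a parallelism of 25 mirrors the batch size used in the test; an empty result means no request in the batch raised an error.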

It is possible that there are some configuration differences between my repro environment and the ones where the failures are occurring. Would it be possible to have a look at least at the [database] section for the system where the failure is manifesting?

Joe Gordon (jogo)
Changed in tripleo:
assignee: nobody → Joe Gordon (jogo)
Robert Collins (lifeless) wrote :

Joe Gordon is going to have a go at reproducing; I will copy the database section out for you later today.

Salvatore Orlando (salvatore-orlando) wrote :

Thanks guys.

Joe Gordon (jogo) wrote :
Jack McCann (jack-mccann) wrote :

A while back I mentioned we had a patch for multiple api worker processes that got us around this bug. Carl Baldwin has just put the code up as a WIP at https://review.openstack.org/#/c/37131. We've been beating the heck out of this for a couple of months now and it's been holding up pretty well. It's based on similar code in glance.

Changed in neutron:
milestone: havana-2 → havana-3
Salvatore Orlando (salvatore-orlando) wrote :

Just a reminder that I am waiting for further info which might help repro the issue.

Joe Gordon (jogo) wrote :

I was unable to reproduce this issue as well. I tried spinning up 200 VMs in a devstack environment and repeatedly ran nova list.

Salvatore Orlando (salvatore-orlando) wrote :

Thanks Joe.

Your tests are quite similar to mine. I just tried to boot nova instances in two different fashions:
#1 - using the --num-instances command line option
#2 - executing nova-boot concurrently n times

And in both cases, just like you, I was running nova list (once a second in my case)
I think we should hear from lifeless before closing this issue.

Salvatore Orlando (salvatore-orlando) wrote :

I've used the bug squash day to try and reproduce this issue, again.
The good news (probably) is that the issue was unreproducible, even with a fairly high level of parallelism (4 distinct clients each requesting a concurrent launch of 25 instances).

So I'm marking this bug as Fix Committed at the moment.
Feel free to reopen it if you see the issue again.

Marking as Fix Committed rather than Invalid because the bug was actually there, but improvements in the Havana release around db management solved it.

Changed in neutron:
status: Triaged → Fix Committed
Ravi Chunduru (ravivsn) wrote :

Any suggestion/fix for Grizzly release?

Changed in tripleo:
status: Triaged → Fix Released
Thierry Carrez (ttx)
Changed in neutron:
status: Fix Committed → Fix Released
kedar kulkarni (kedar-kulkarni) wrote :

This issue resulted in our VMs getting multiple IP addresses, which is weird.
Is the patch in https://review.openstack.org/#/c/37131 compatible with grizzly?
I mean, can we use multiple worker processes in grizzly, and is it backward compatible?

Thierry Carrez (ttx)
Changed in neutron:
milestone: havana-3 → 2013.2
tags: added: grizzly-backport-potential
Alan Pevec (apevec)
tags: removed: grizzly-backport-potential