DB errors for large/nested stacks with convergence enabled

Bug #1566845 reported by Steven Hardy
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
OpenStack Heat
In Progress
High
Thomas Herve

Bug Description

I re-ran this reproducer https://bugs.launchpad.net/heat/+bug/1489548 with and without convergence enabled:

heat event-list stress
| resource_name | id | resource_status_reason | resource_status | event_time |
+---------------+
| stress | f6cb0c4f-97ea-4c2f-b4d1-cf658c831a9d | Stack CREATE started | CREATE_IN_PROGRESS | 2016-04-06T12:44:58 |
| NovaComputes | 8828f0a4-04f7-49ff-9bb4-f9cd97d8371b | state changed | CREATE_IN_PROGRESS | 2016-04-06T12:44:58 |
| NovaComputes | feaa933f-0c7d-4cfa-a544-be7e215399c2 | state changed | CREATE_COMPLETE | 2016-04-06T12:47:09 |
| stress | 61a39dc3-e883-44d1-bb4c-18c30852a431 | Stack CREATE completed successfully | CREATE_COMPLETE | 2016-04-06T12:47:09 |

Without convergence, it works OK and completes in a little over 2 minutes.

With convergence enabled (convergence_engine=True in heat.conf), we see this:

$ heat event-list stress_conv
| resource_name | id | resource_status_reason | resource_status | event_time |
+---------------+
| stress_conv | dd04a4fa-2b91-48a8-b22a-75afdf8f7a9a | Stack CREATE started | CREATE_IN_PROGRESS | 2016-04-06T12:51:00 |
| NovaComputes | 4ff9141d-0fa8-4228-9be1-95e09320150b | state changed | CREATE_IN_PROGRESS | 2016-04-06T12:51:00 |
| NovaComputes | 2d37a864-f392-45f3-be51-411ca2ffef8b | OperationalError: resources.NovaComputes: (pymysql.err.OperationalError) (1040, u'Too many connections') | CREATE_FAILED | 2016-04-06T12:51:43 |
| stress_conv | 8a4d70de-7712-4c13-8863-f42e54994547 | Resource CREATE failed: OperationalError: resources.NovaComputes: (pymysql.err.OperationalError) (1040, u'Too many connections') | CREATE_FAILED | 2016-04-06T12:51:50 |

Also the failure appears to leak memory - this is the top output for one of the workers (4 workers on devstack) - we see the memory usage go from ~76M up to way over 200, and this isn't released after the stack goes CREATE_FAILED.

29890 stack 20 0 306036 78056 4244 S 1.7 0.5 0:00.49 heat-engine
29890 stack 20 0 306036 78056 4244 S 2.0 0.5 0:00.55 heat-engine
29890 stack 20 0 306036 78056 4244 S 2.0 0.5 0:00.61 heat-engine
29890 stack 20 0 306036 78056 4244 S 2.3 0.5 0:00.68 heat-engine
29890 stack 20 0 306036 78056 4244 S 1.7 0.5 0:00.73 heat-engine
29890 stack 20 0 327148 99660 4628 R 53.8 0.6 0:02.35 heat-engine
29890 stack 20 0 349072 121408 4628 R 87.7 0.8 0:04.99 heat-engine
29890 stack 20 0 377520 150060 4628 R 79.1 0.9 0:07.38 heat-engine
29890 stack 20 0 420844 192204 8436 R 82.4 1.2 0:09.86 heat-engine
29890 stack 20 0 448996 220336 8436 R 87.1 1.4 0:12.49 heat-engine
29890 stack 20 0 472304 243884 8436 R 83.7 1.5 0:15.01 heat-engine
29890 stack 20 0 490224 261632 8436 R 78.5 1.6 0:17.38 heat-engine
29890 stack 20 0 502576 273912 8436 R 59.9 1.7 0:19.19 heat-engine
29890 stack 20 0 506192 277552 8436 S 54.6 1.7 0:20.84 heat-engine
29890 stack 20 0 519808 291472 8436 S 49.2 1.8 0:22.32 heat-engine
29890 stack 20 0 519808 291472 8436 S 39.9 1.8 0:23.52 heat-engine
29890 stack 20 0 519808 291472 8436 R 59.3 1.8 0:25.31 heat-engine
29890 stack 20 0 519808 291472 8436 R 54.6 1.8 0:26.96 heat-engine
29890 stack 20 0 519808 291472 8436 R 75.8 1.8 0:29.25 heat-engine

There's a bunch of errors like this in the heat logs:

File "/usr/lib/python2.7/site-packages/pymysql/connections.py", line 890, in connect\n self._get_server_information()\n', u' File "/usr/lib/python2.7/site-packages/pymysql/connections.py", line 1190, in _get_server_information\n packet = self._read_packet()\n', u' File "/usr/lib/python2.7/site-packages/pymysql/connections.py", line 966, in _read_packet\n packet.check_error()\n', u' File "/usr/lib/python2.7/site-packages/pymysql/connections.py", line 394, in check_error\n err.raise_mysql_exception(self._data)\n', u' File "/usr/lib/python2.7/site-packages/pymysql/err.py", line 120, in raise_mysql_exception\n _check_mysql_exception(errinfo)\n', u' File "/usr/lib/python2.7/site-packages/pymysql/err.py", line 112, in _check_mysql_exception\n raise errorclass(errno, errorvalue)\n', u"OperationalError: (pymysql.err.OperationalError) (1040, u'Too many connections')\n"].^[[00m

So, I guess we have to bump the connection limit, but that doesn't resolve the memory usage/leak.

Contrasting this to the non-convergence case, I see memory usage roughly static at about 100MB per engine process (which still seems quite high).

Revision history for this message
Rabi Mishra (rabi) wrote :
tags: added: gate-failure
Changed in heat:
importance: Undecided → Critical
Thomas Herve (therve)
Changed in heat:
milestone: none → ocata-2
Changed in heat:
assignee: nobody → Crag Wolfe (cwolfe)
status: New → In Progress
Revision history for this message
Crag Wolfe (cwolfe) wrote :

With https://review.openstack.org/#/c/399811/ , I can create a stack with 1000 resources without an error (using templates in https://bugs.launchpad.net/heat/+bug/1489548 ). Otherwise, I hit the 'Too many connections' error with 200 resources. It's not a proper fix, just trying to get us past gate issues at the moment.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to heat (master)

Fix proposed to branch: master
Review: https://review.openstack.org/400155

Changed in heat:
assignee: Crag Wolfe (cwolfe) → Thomas Herve (therve)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: master
Review: https://review.openstack.org/400388

Changed in heat:
assignee: Thomas Herve (therve) → Crag Wolfe (cwolfe)
Revision history for this message
Crag Wolfe (cwolfe) wrote :

One thing I noticed in debugging this issue is that the 'Too many connections' error often happens when counting the number of events in a stack. We ought not to be hitting the db to count events per stack every time we want to create an event. My patch above is one way to deal with the issue.

Revision history for this message
Thomas Herve (therve) wrote :

The error happens when counting because we store an event in case of errors, and we count before storing the event. I don't think it's related to the number of connections itself, it just hides the real cause.

Revision history for this message
Crag Wolfe (cwolfe) wrote :

I agree it's not the real cause. I'll remove the Partial-Bug reference if there is another iteration of the patch set.

Changed in heat:
assignee: Crag Wolfe (cwolfe) → Thomas Herve (therve)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on heat (master)

Change abandoned by Crag Wolfe (<email address hidden>) on branch: master
Review: https://review.openstack.org/399811
Reason: No longer needed with https://review.openstack.org/#/c/361570 .

Rabi Mishra (rabi)
Changed in heat:
importance: Critical → High
milestone: ocata-2 → ocata-3
Rabi Mishra (rabi)
Changed in heat:
milestone: ocata-3 → ocata-rc1
Rabi Mishra (rabi)
Changed in heat:
milestone: ocata-rc1 → pike-1
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to heat (master)

Reviewed: https://review.openstack.org/400155
Committed: https://git.openstack.org/cgit/openstack/heat/commit/?id=4a500125b350b46dee0d3c9f01c3cac7223d9c80
Submitter: Jenkins
Branch: master

commit 4a500125b350b46dee0d3c9f01c3cac7223d9c80
Author: Thomas Herve <email address hidden>
Date: Mon Nov 21 10:43:56 2016 +0100

    Sleep between check_resource calls

    Instead of calling check_resource on all leaves in the resource graph at
    once, sleep a little bit between each call. As it's a tad slower,
    delegate it to a thread so that the stack_create RPC message doesn't
    timeout when you have lots of resources.

    Change-Id: I84d2b34d65b3ce7d8d858de106dac531aff509b7
    Partial-Bug: #1566845

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to heat (stable/ocata)

Fix proposed to branch: stable/ocata
Review: https://review.openstack.org/445355

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to heat (stable/ocata)

Reviewed: https://review.openstack.org/445355
Committed: https://git.openstack.org/cgit/openstack/heat/commit/?id=73f299fb495a5535017a867a10d55bb76fcc5d32
Submitter: Jenkins
Branch: stable/ocata

commit 73f299fb495a5535017a867a10d55bb76fcc5d32
Author: Thomas Herve <email address hidden>
Date: Mon Nov 21 10:43:56 2016 +0100

    Sleep between check_resource calls

    Instead of calling check_resource on all leaves in the resource graph at
    once, sleep a little bit between each call. As it's a tad slower,
    delegate it to a thread so that the stack_create RPC message doesn't
    timeout when you have lots of resources.

    Change-Id: I84d2b34d65b3ce7d8d858de106dac531aff509b7
    Partial-Bug: #1566845
    (cherry picked from commit 4a500125b350b46dee0d3c9f01c3cac7223d9c80)

tags: added: in-stable-ocata
Rico Lin (rico-lin)
Changed in heat:
milestone: pike-1 → pike-2
Rico Lin (rico-lin)
Changed in heat:
milestone: pike-2 → pike-3
Rico Lin (rico-lin)
Changed in heat:
milestone: pike-3 → pike-rc1
Rabi Mishra (rabi)
Changed in heat:
milestone: pike-rc1 → pike-rc2
Revision history for this message
Rabi Mishra (rabi) wrote :

Not sure if we want to do anything more for this bug. Moving this to ongoing.

Changed in heat:
milestone: pike-rc2 → ongoing
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.