Neutron server takes a long time to recover from VIP move

Bug #1288438 reported by Ryan Moe
This bug affects 2 people
Affects                  Status         Importance  Assigned to        Milestone
Fuel for OpenStack       Fix Committed  High        MOS Neutron
neutron                  Fix Released   High        Salvatore Orlando
oslo-incubator           Fix Released   High        Roman Podoliaka
neutron (Ubuntu)         Fix Released   Medium      Unassigned
neutron (Ubuntu Trusty)  In Progress    Medium      Unassigned

Bug Description

Neutron waits sequentially for read_timeout seconds for each connection in its connection pool. The default pool_size is 10, so it takes 10 minutes for the Neutron server to become available again after the VIP is moved.

This is log output from neutron-server after the VIP has been moved:
2014-03-05 17:48:23.844 9899 WARNING neutron.openstack.common.db.sqlalchemy.session [-] Got mysql server has gone away: (2013, 'Lost connection to MySQL server during query')
2014-03-05 17:49:23.887 9899 WARNING neutron.openstack.common.db.sqlalchemy.session [-] Got mysql server has gone away: (2013, 'Lost connection to MySQL server during query')
2014-03-05 17:50:24.055 9899 WARNING neutron.openstack.common.db.sqlalchemy.session [-] Got mysql server has gone away: (2013, 'Lost connection to MySQL server during query')
2014-03-05 17:51:24.067 9899 WARNING neutron.openstack.common.db.sqlalchemy.session [-] Got mysql server has gone away: (2013, 'Lost connection to MySQL server during query')
2014-03-05 17:52:24.079 9899 WARNING neutron.openstack.common.db.sqlalchemy.session [-] Got mysql server has gone away: (2013, 'Lost connection to MySQL server during query')
2014-03-05 17:53:24.115 9899 WARNING neutron.openstack.common.db.sqlalchemy.session [-] Got mysql server has gone away: (2013, 'Lost connection to MySQL server during query')
2014-03-05 17:54:24.123 9899 WARNING neutron.openstack.common.db.sqlalchemy.session [-] Got mysql server has gone away: (2013, 'Lost connection to MySQL server during query')
2014-03-05 17:55:24.131 9899 WARNING neutron.openstack.common.db.sqlalchemy.session [-] Got mysql server has gone away: (2013, 'Lost connection to MySQL server during query')
2014-03-05 17:56:24.143 9899 WARNING neutron.openstack.common.db.sqlalchemy.session [-] Got mysql server has gone away: (2013, 'Lost connection to MySQL server during query')
2014-03-05 17:57:24.163 9899 WARNING neutron.openstack.common.db.sqlalchemy.session [-] Got mysql server has gone away: (2013, 'Lost connection to MySQL server during query')

Here is the log output after pool_size was changed to 7 and read_timeout to 30:
2014-03-05 18:50:25.300 15731 WARNING neutron.openstack.common.db.sqlalchemy.session [-] Got mysql server has gone away: (2013, 'Lost connection to MySQL server during query')
2014-03-05 18:50:55.331 15731 WARNING neutron.openstack.common.db.sqlalchemy.session [-] Got mysql server has gone away: (2013, 'Lost connection to MySQL server during query')
2014-03-05 18:51:25.351 15731 WARNING neutron.openstack.common.db.sqlalchemy.session [-] Got mysql server has gone away: (2013, 'Lost connection to MySQL server during query')
2014-03-05 18:51:55.387 15731 WARNING neutron.openstack.common.db.sqlalchemy.session [-] Got mysql server has gone away: (2013, 'Lost connection to MySQL server during query')
2014-03-05 18:52:25.415 15731 WARNING neutron.openstack.common.db.sqlalchemy.session [-] Got mysql server has gone away: (2013, 'Lost connection to MySQL server during query')
2014-03-05 18:52:55.427 15731 WARNING neutron.openstack.common.db.sqlalchemy.session [-] Got mysql server has gone away: (2013, 'Lost connection to MySQL server during query')
2014-03-05 18:53:25.439 15731 WARNING neutron.openstack.common.db.sqlalchemy.session [-] Got mysql server has gone away: (2013, 'Lost connection to MySQL server during query')
2014-03-05 18:53:25.549 15731 INFO urllib3.connectionpool [-] Starting new HTTP connection (1): 192.168.0.2
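
The recovery window in both runs is roughly pool_size * read_timeout, because the stale connections are drained one at a time: the first log shows ten warnings 60 seconds apart (10 * 60 s, about 10 minutes), the second shows seven warnings 30 seconds apart (7 * 30 s, about 3.5 minutes). A minimal sketch of that arithmetic (the variable names are illustrative, not configuration option names):

    # Worst case: every stale connection in the pool is checked out in turn,
    # blocks on its ping query until the read timeout fires, and is recycled.
    def worst_case_recovery_seconds(pool_size, read_timeout):
        return pool_size * read_timeout

    print(worst_case_recovery_seconds(10, 60))  # first log: 600 s (10 minutes)
    print(worst_case_recovery_seconds(7, 30))   # second log: 210 s (~3.5 minutes)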

Tags: ha
Revision history for this message
Ryan Moe (rmoe) wrote :

The issue seems to stem from the checkout pool event handler in openstack/common/db/sqlalchemy/session.py. Each time a connection is requested from the pool, this event handler checks that the connection is alive by running a query. In our case this query times out after 'read_timeout' seconds and the connection is recycled in the pool. This happens for each connection in the pool, one at a time.

If we dispose of all connections in the pool when any connection fails, the problem goes away, but that does not seem like a good solution. So far in my testing I have only encountered this issue with neutron, even though all services share this same code; I'm not sure why that is.
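
Roughly, that handler follows the pattern in the sketch below (simplified, and not the actual oslo-incubator code): a SQLAlchemy "checkout" pool event pings the connection with a trivial query and, on failure, raises DisconnectionError, which discards and replaces only that one connection, so every stale connection costs a full read_timeout before the next one is tried.

    import sqlalchemy.exc
    from sqlalchemy import event

    def add_ping_listener(engine):
        """Ping each connection with 'SELECT 1' as it is checked out (sketch)."""

        @event.listens_for(engine, "checkout")
        def ping_connection(dbapi_conn, connection_rec, connection_proxy):
            cursor = dbapi_conn.cursor()
            try:
                # With the blocking python-mysqldb driver this call can hang for
                # up to read_timeout seconds once the old VIP stops answering.
                cursor.execute("SELECT 1")
                cursor.close()
            except Exception as exc:
                if engine.dialect.is_disconnect(exc, dbapi_conn, cursor):
                    # DisconnectionError makes SQLAlchemy discard and replace
                    # only *this* connection; the rest of the pool stays stale,
                    # so the next checkout repeats the wait on another dead one.
                    raise sqlalchemy.exc.DisconnectionError("connection is dead")
                raise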

Revision history for this message
Ryan Moe (rmoe) wrote :

On second thought, maybe dumping the whole connection pool isn't a terrible idea. It would only happen when it takes more than 60 seconds (or whatever the read timeout is) to return a connection from the pool. If it takes more than 60 seconds to run "SELECT 1", then either we are in the scenario where the VIP has moved and we want this to happen, or there are serious database problems and dumping the connections in the pool won't have a material effect on anything.

Changed in fuel:
milestone: none → 4.1.1
importance: Undecided → High
assignee: nobody → Sergey Vasilenko (xenolog)
Revision history for this message
Ryan Moe (rmoe) wrote :

After looking at this some more, it seems that the neutron agents report their state frequently. Each of those updates requires database queries, and these queries are what check out the stale connections from the pool. Each stale connection then has to wait until MySQL kills it. I think that if we had more compute nodes in our environment (generating more heartbeat messages) we would see the same issue with nova-conductor. Disposing of the connection pool when we encounter this error seems to be the right thing to do.

Evgeniy L (rustyrobot)
Changed in fuel:
status: New → Confirmed
Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

Ryan, you are definitely right! Thank you for pointing this out! This is something we must fix in oslo.db first; projects will then receive the fix 'for free' on the next sync from oslo-incubator.

Changed in oslo:
importance: Undecided → High
assignee: nobody → Roman Podoliaka (rpodolyaka)
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

Don't hesitate to put your research results here as well: https://etherpad.openstack.org/p/sqlalchemy-connections-tunings

Ben Nemec (bnemec)
Changed in oslo:
status: New → Triaged
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to oslo-incubator (master)

Fix proposed to branch: master
Review: https://review.openstack.org/81225

Changed in oslo:
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to oslo-incubator (master)

Reviewed: https://review.openstack.org/81225
Committed: https://git.openstack.org/cgit/openstack/oslo-incubator/commit/?id=5b7e61c82f95187691000b4749ed8307df0766da
Submitter: Jenkins
Branch: master

commit 5b7e61c82f95187691000b4749ed8307df0766da
Author: Roman Podoliaka <email address hidden>
Date: Mon Mar 17 18:38:56 2014 +0200

    Dispose db connections pool on disconnect

    When a disconnect happens (e.g. when VIP moves to another node), all
    existing connections in the pool become invalid. There is no sense to
    check every single connection in the pool, as this may take a long
    time (e.g. when python-mysqldb driver is used with eventlet green
    threads, there is no context switch on db IO, so the whole process is
    blocked until socket read times out). Dispose all connections in the
    pool when disconnect is detected, they will be recreated on demand
    then.

    Closes-Bug: #1288438

    Change-Id: Ia357da2b3092d306a86b6d1787bab374a335f28c
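
In outline, the merged change makes the ping handler treat a detected disconnect as fatal for the whole pool rather than for the single connection being checked out. A minimal sketch of that idea, following the shape of the handler sketched earlier (simplified, not the literal patch):

    import sqlalchemy.exc
    from sqlalchemy import event

    def add_ping_listener(engine):
        """On disconnect, drop every pooled connection, not just the current one."""

        @event.listens_for(engine, "checkout")
        def ping_connection(dbapi_conn, connection_rec, connection_proxy):
            cursor = dbapi_conn.cursor()
            try:
                cursor.execute("SELECT 1")
                cursor.close()
            except Exception as exc:
                if engine.dialect.is_disconnect(exc, dbapi_conn, cursor):
                    # Every connection in the pool points at the old VIP and is
                    # now invalid; dispose the whole pool and let fresh
                    # connections be created on demand against the new VIP.
                    engine.dispose()
                    raise sqlalchemy.exc.DisconnectionError(
                        "database server has gone away; pool disposed")
                raise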

Changed in oslo:
status: In Progress → Fix Committed
Changed in fuel:
milestone: 4.1.1 → 5.0
tags: added: backports-4.1.1
Revision history for this message
Vladimir Kuklin (vkuklin) wrote :

This requires backports of additional code and features, e.g. haproxy running in a network namespace; thus, removing the backports-4.1.1 tag.

tags: removed: backports-4.1.1
Thierry Carrez (ttx)
Changed in oslo:
milestone: none → icehouse-rc1
status: Fix Committed → Fix Released
Andrew Woodward (xarses)
tags: added: ha
Thierry Carrez (ttx)
Changed in oslo:
milestone: icehouse-rc1 → 2014.1
Revision history for this message
Sergey Vasilenko (xenolog) wrote :

I can't reproduce this bug on current master.

Changed in fuel:
status: Confirmed → Won't Fix
Revision history for this message
Sergey Vasilenko (xenolog) wrote :

But without ?read_timeout, connections are restored more quickly.

Revision history for this message
Mike Scherbakov (mihgen) wrote :

If you can't reproduce it, then the status should be Invalid rather than Won't Fix.

Changed in fuel:
status: Won't Fix → Invalid
Revision history for this message
Andrew Woodward (xarses) wrote :

The fix is not in stable/icehouse; please include it in the OSCI packages.

I have seen nova and neutron missing this, and maybe others.

Changed in fuel:
status: Invalid → Triaged
assignee: Sergey Vasilenko (xenolog) → Fuel OSCI Team (fuel-osci)
Revision history for this message
Vladimir Kuklin (vkuklin) wrote :

Guys, we need this patch backported to all the projects that use oslo.db in the 5.0 release.

Changed in fuel:
assignee: Fuel OSCI Team (fuel-osci) → Fuel Hardening Team (fuel-hardening)
Changed in fuel:
status: Triaged → In Progress
Changed in neutron:
importance: Undecided → High
status: New → Triaged
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.openstack.org/92792

Changed in neutron:
assignee: nobody → Salvatore Orlando (salvatore-orlando)
status: Triaged → In Progress
Changed in neutron:
milestone: none → juno-1
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.openstack.org/92792
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=b3257e09dbcbd87ac1c6f70a0ea255197d790835
Submitter: Jenkins
Branch: master

commit b3257e09dbcbd87ac1c6f70a0ea255197d790835
Author: Salvatore Orlando <email address hidden>
Date: Thu May 8 01:37:17 2014 -0700

    Start ping listener also for postgresql

    Make use of the automatic connection disposal for
    PostgreSQL as well.

    oslo-incubator commit: 4a591eaafe035fa3fd508c910f057bfa8ff51005

    Change-Id: I6e6adb612968639dbc8deae31e5409aac44e340e
    Closes-Bug: 1288438
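
The gist of that sync, assuming the add_ping_listener() helper sketched earlier in this thread (illustrative only, not the literal patch): the checkout ping listener had previously been registered only for the MySQL dialect, and the change wires it up for PostgreSQL as well, so PostgreSQL deployments also benefit from the pool disposal on disconnect.

    def register_ping_listener(engine):
        # add_ping_listener() is the hypothetical helper from the earlier sketch;
        # it pings connections on checkout and disposes the pool on disconnect.
        if engine.name in ("mysql", "postgresql"):
            add_ping_listener(engine)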

Changed in neutron:
status: In Progress → Fix Committed
Roman Vyalov (r0mikiam)
Changed in fuel:
status: In Progress → Fix Committed
Thierry Carrez (ttx)
Changed in neutron:
status: Fix Committed → Fix Released
Ilya Shakhat (shakhat)
Changed in fuel:
assignee: Fuel Hardening Team (fuel-hardening) → MOS Neutron (mos-neutron)
Thierry Carrez (ttx)
Changed in neutron:
milestone: juno-1 → 2014.2
Changed in neutron (Ubuntu Trusty):
assignee: nobody → Mario Splivalo (mariosplivalo)
Changed in neutron (Ubuntu):
importance: Undecided → Medium
status: New → Fix Released
Changed in neutron (Ubuntu Trusty):
status: New → In Progress
importance: Undecided → Medium
tags: added: sts-sru
tags: removed: sts-sru
Changed in neutron (Ubuntu Trusty):
assignee: Mario Splivalo (mariosplivalo) → nobody