Neutron server takes a long time to recover from VIP move

Bug #1288438 reported by Ryan Moe
This bug affects 2 people
Affects                  Status         Importance  Assigned to        Milestone
Fuel for OpenStack       Fix Committed  High        MOS Neutron
neutron                  Fix Released   High        Salvatore Orlando
oslo-incubator           Fix Released   High        Roman Podoliaka
neutron (Ubuntu)         Fix Released   Medium      Unassigned
neutron (Ubuntu Trusty)  In Progress    Medium      Unassigned

Bug Description

Neutron waits sequentially for read_timeout seconds for each connection in its connection pool. The default pool_size is 10, so it takes 10 minutes for the Neutron server to become available again after the VIP is moved.

This is log output from neutron-server after the VIP has been moved:
2014-03-05 17:48:23.844 9899 WARNING neutron.openstack.common.db.sqlalchemy.session [-] Got mysql server has gone away: (2013, 'Lost connection to MySQL server during query')
2014-03-05 17:49:23.887 9899 WARNING neutron.openstack.common.db.sqlalchemy.session [-] Got mysql server has gone away: (2013, 'Lost connection to MySQL server during query')
2014-03-05 17:50:24.055 9899 WARNING neutron.openstack.common.db.sqlalchemy.session [-] Got mysql server has gone away: (2013, 'Lost connection to MySQL server during query')
2014-03-05 17:51:24.067 9899 WARNING neutron.openstack.common.db.sqlalchemy.session [-] Got mysql server has gone away: (2013, 'Lost connection to MySQL server during query')
2014-03-05 17:52:24.079 9899 WARNING neutron.openstack.common.db.sqlalchemy.session [-] Got mysql server has gone away: (2013, 'Lost connection to MySQL server during query')
2014-03-05 17:53:24.115 9899 WARNING neutron.openstack.common.db.sqlalchemy.session [-] Got mysql server has gone away: (2013, 'Lost connection to MySQL server during query')
2014-03-05 17:54:24.123 9899 WARNING neutron.openstack.common.db.sqlalchemy.session [-] Got mysql server has gone away: (2013, 'Lost connection to MySQL server during query')
2014-03-05 17:55:24.131 9899 WARNING neutron.openstack.common.db.sqlalchemy.session [-] Got mysql server has gone away: (2013, 'Lost connection to MySQL server during query')
2014-03-05 17:56:24.143 9899 WARNING neutron.openstack.common.db.sqlalchemy.session [-] Got mysql server has gone away: (2013, 'Lost connection to MySQL server during query')
2014-03-05 17:57:24.163 9899 WARNING neutron.openstack.common.db.sqlalchemy.session [-] Got mysql server has gone away: (2013, 'Lost connection to MySQL server during query')

Here is the log output after pool_size was changed to 7 and read_timeout to 30:
2014-03-05 18:50:25.300 15731 WARNING neutron.openstack.common.db.sqlalchemy.session [-] Got mysql server has gone away: (2013, 'Lost connection to MySQL server during query')
2014-03-05 18:50:55.331 15731 WARNING neutron.openstack.common.db.sqlalchemy.session [-] Got mysql server has gone away: (2013, 'Lost connection to MySQL server during query')
2014-03-05 18:51:25.351 15731 WARNING neutron.openstack.common.db.sqlalchemy.session [-] Got mysql server has gone away: (2013, 'Lost connection to MySQL server during query')
2014-03-05 18:51:55.387 15731 WARNING neutron.openstack.common.db.sqlalchemy.session [-] Got mysql server has gone away: (2013, 'Lost connection to MySQL server during query')
2014-03-05 18:52:25.415 15731 WARNING neutron.openstack.common.db.sqlalchemy.session [-] Got mysql server has gone away: (2013, 'Lost connection to MySQL server during query')
2014-03-05 18:52:55.427 15731 WARNING neutron.openstack.common.db.sqlalchemy.session [-] Got mysql server has gone away: (2013, 'Lost connection to MySQL server during query')
2014-03-05 18:53:25.439 15731 WARNING neutron.openstack.common.db.sqlalchemy.session [-] Got mysql server has gone away: (2013, 'Lost connection to MySQL server during query')
2014-03-05 18:53:25.549 15731 INFO urllib3.connectionpool [-] Starting new HTTP connection (1): 192.168.0.2
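
The recovery window in both runs is roughly pool_size * read_timeout, because the stale connections are drained one at a time: the first log shows ten warnings 60 seconds apart (10 * 60 s, about 10 minutes), the second shows seven warnings 30 seconds apart (7 * 30 s, about 3.5 minutes). A minimal sketch of that arithmetic (the variable names are illustrative, not configuration option names):

    # Worst case: every stale connection in the pool is checked out in turn,
    # blocks on its ping query until the read timeout fires, and is recycled.
    def worst_case_recovery_seconds(pool_size, read_timeout):
        return pool_size * read_timeout

    print(worst_case_recovery_seconds(10, 60))  # first log: 600 s (10 minutes)
    print(worst_case_recovery_seconds(7, 30))   # second log: 210 s (~3.5 minutes)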

Tags: ha
Revision history for this message
Ryan Moe (rmoe) wrote :

The issue seems to stem from the checkout pool event handler in openstack/common/db/sqlalchemy/session.py. Each time a connection is requested from the pool, this event handler checks that the connection is alive by running a query. In our case this query times out after 'read_timeout' seconds and the connection is recycled in the pool. This happens for each connection in the pool, one at a time.

If we dispose of all connections in the pool when any connection fails, the problem goes away, but that does not seem like a good solution. So far in my testing I have only encountered this issue with neutron, even though all services share this same code; I'm not sure why that is.
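
Roughly, that handler follows the pattern in the sketch below (simplified, and not the actual oslo-incubator code): a SQLAlchemy "checkout" pool event pings the connection with a trivial query and, on failure, raises DisconnectionError, which discards and replaces only that one connection, so every stale connection costs a full read_timeout before the next one is tried.

    import sqlalchemy.exc
    from sqlalchemy import event

    def add_ping_listener(engine):
        """Ping each connection with 'SELECT 1' as it is checked out (sketch)."""

        @event.listens_for(engine, "checkout")
        def ping_connection(dbapi_conn, connection_rec, connection_proxy):
            cursor = dbapi_conn.cursor()
            try:
                # With the blocking python-mysqldb driver this call can hang for
                # up to read_timeout seconds once the old VIP stops answering.
                cursor.execute("SELECT 1")
                cursor.close()
            except Exception as exc:
                if engine.dialect.is_disconnect(exc, dbapi_conn, cursor):
                    # DisconnectionError makes SQLAlchemy discard and replace
                    # only *this* connection; the rest of the pool stays stale,
                    # so the next checkout repeats the wait on another dead one.
                    raise sqlalchemy.exc.DisconnectionError("connection is dead")
                raise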

Revision history for this message
Ryan Moe (rmoe) wrote :

On second thought, maybe dumping the whole connection pool isn't a terrible idea. It would only happen when it takes more than 60 seconds (or whatever the read timeout is) to return a connection from the pool. If it takes more than 60 seconds to run "SELECT 1", then either we are in the scenario where the VIP has moved and we want this to happen, or there are serious database problems and dumping the connections in the pool won't have a material effect on anything.

Changed in fuel:
milestone: none → 4.1.1
importance: Undecided → High
assignee: nobody → Sergey Vasilenko (xenolog)
Revision history for this message
Ryan Moe (rmoe) wrote :

After looking at this some more, it seems that the neutron agents report their state frequently. Each of those updates requires database queries, and these queries are what check out the stale connections from the pool. Each stale connection then has to wait until MySQL kills it. I think that if we had more compute nodes in our environment (generating more heartbeat messages) we would see the same issue with nova-conductor. Disposing of the connection pool when we encounter this error seems to be the right thing to do.

Evgeniy L (rustyrobot)
Changed in fuel:
status: New → Confirmed
Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

Ryan, you are definitely right! Thank you for pointing this out! This is something we must fix in oslo.db first; projects will then receive the fix 'for free' on the next sync from oslo-incubator.

Changed in oslo:
importance: Undecided → High
assignee: nobody → Roman Podoliaka (rpodolyaka)
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

Don't hesitate to put your research results here as well: https://etherpad.openstack.org/p/sqlalchemy-connections-tunings

Ben Nemec (bnemec)
Changed in oslo:
status: New → Triaged
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to oslo-incubator (master)

Fix proposed to branch: master
Review: https://review.openstack.org/81225

Changed in oslo:
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to oslo-incubator (master)

Reviewed: https://review.openstack.org/81225
Committed: https://git.openstack.org/cgit/openstack/oslo-incubator/commit/?id=5b7e61c82f95187691000b4749ed8307df0766da
Submitter: Jenkins
Branch: master

commit 5b7e61c82f95187691000b4749ed8307df0766da
Author: Roman Podoliaka <email address hidden>
Date: Mon Mar 17 18:38:56 2014 +0200

    Dispose db connections pool on disconnect

    When a disconnect happens (e.g. when VIP moves to another node), all
    existing connections in the pool become invalid. There is no sense to
    check every single connection in the pool, as this may take a long
    time (e.g. when python-mysqldb driver is used with eventlet green
    threads, there is no context switch on db IO, so the whole process is
    blocked until socket read times out). Dispose all connections in the
    pool when disconnect is detected, they will be recreated on demand
    then.

    Closes-Bug: #1288438

    Change-Id: Ia357da2b3092d306a86b6d1787bab374a335f28c
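
In outline, the merged change makes the ping handler treat a detected disconnect as fatal for the whole pool rather than for the single connection being checked out. A minimal sketch of that idea, following the shape of the handler sketched earlier (simplified, not the literal patch):

    import sqlalchemy.exc
    from sqlalchemy import event

    def add_ping_listener(engine):
        """On disconnect, drop every pooled connection, not just the current one."""

        @event.listens_for(engine, "checkout")
        def ping_connection(dbapi_conn, connection_rec, connection_proxy):
            cursor = dbapi_conn.cursor()
            try:
                cursor.execute("SELECT 1")
                cursor.close()
            except Exception as exc:
                if engine.dialect.is_disconnect(exc, dbapi_conn, cursor):
                    # Every connection in the pool points at the old VIP and is
                    # now invalid; dispose the whole pool and let fresh
                    # connections be created on demand against the new VIP.
                    engine.dispose()
                    raise sqlalchemy.exc.DisconnectionError(
                        "database server has gone away; pool disposed")
                raise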

Changed in oslo:
status: In Progress → Fix Committed
Changed in fuel:
milestone: 4.1.1 → 5.0
tags: added: backports-4.1.1
Revision history for this message
Vladimir Kuklin (vkuklin) wrote :

This requires backports of additional code and features, e.g. haproxy running in a network namespace; thus, removing the backports-4.1.1 tag.

tags: removed: backports-4.1.1
Thierry Carrez (ttx)
Changed in oslo:
milestone: none → icehouse-rc1
status: Fix Committed → Fix Released
Andrew Woodward (xarses)
tags: added: ha
Thierry Carrez (ttx)
Changed in oslo:
milestone: icehouse-rc1 → 2014.1
Revision history for this message
Sergey Vasilenko (xenolog) wrote :

I can't reproduce this bug on current master.

Changed in fuel:
status: Confirmed → Won't Fix
Revision history for this message
Sergey Vasilenko (xenolog) wrote :

But without ?read_timeout, connections are restored more quickly.

Revision history for this message
Mike Scherbakov (mihgen) wrote :

If you can't reproduce it, then the status should be Invalid rather than Won't Fix.

Changed in fuel:
status: Won't Fix → Invalid
Revision history for this message
Andrew Woodward (xarses) wrote :

The fix is not in stable/icehouse; please include it in the OSCI packages.

I have seen nova and neutron missing this, and maybe others.

Changed in fuel:
status: Invalid → Triaged
assignee: Sergey Vasilenko (xenolog) → Fuel OSCI Team (fuel-osci)
Revision history for this message
Vladimir Kuklin (vkuklin) wrote :

Guys, we need this patch backported to all the projects that use oslo.db in the 5.0 release.

Changed in fuel:
assignee: Fuel OSCI Team (fuel-osci) → Fuel Hardening Team (fuel-hardening)
Changed in fuel:
status: Triaged → In Progress
Changed in neutron:
importance: Undecided → High
status: New → Triaged
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.openstack.org/92792

Changed in neutron:
assignee: nobody → Salvatore Orlando (salvatore-orlando)
status: Triaged → In Progress
Changed in neutron:
milestone: none → juno-1
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.openstack.org/92792
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=b3257e09dbcbd87ac1c6f70a0ea255197d790835
Submitter: Jenkins
Branch: master

commit b3257e09dbcbd87ac1c6f70a0ea255197d790835
Author: Salvatore Orlando <email address hidden>
Date: Thu May 8 01:37:17 2014 -0700

    Start ping listener also for postgresql

    Make use of the automatic connection disposal for
    PostgreSQL as well.

    oslo-incubator commit: 4a591eaafe035fa3fd508c910f057bfa8ff51005

    Change-Id: I6e6adb612968639dbc8deae31e5409aac44e340e
    Closes-Bug: 1288438
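
The gist of that sync, assuming the add_ping_listener() helper sketched earlier in this thread (illustrative only, not the literal patch): the checkout ping listener had previously been registered only for the MySQL dialect, and the change wires it up for PostgreSQL as well, so PostgreSQL deployments also benefit from the pool disposal on disconnect.

    def register_ping_listener(engine):
        # add_ping_listener() is the hypothetical helper from the earlier sketch;
        # it pings connections on checkout and disposes the pool on disconnect.
        if engine.name in ("mysql", "postgresql"):
            add_ping_listener(engine)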

Changed in neutron:
status: In Progress → Fix Committed
Roman Vyalov (r0mikiam)
Changed in fuel:
status: In Progress → Fix Committed
Thierry Carrez (ttx)
Changed in neutron:
status: Fix Committed → Fix Released
Ilya Shakhat (shakhat)
Changed in fuel:
assignee: Fuel Hardening Team (fuel-hardening) → MOS Neutron (mos-neutron)
Thierry Carrez (ttx)
Changed in neutron:
milestone: juno-1 → 2014.2
Changed in neutron (Ubuntu Trusty):
assignee: nobody → Mario Splivalo (mariosplivalo)
Changed in neutron (Ubuntu):
importance: Undecided → Medium
status: New → Fix Released
Changed in neutron (Ubuntu Trusty):
status: New → In Progress
importance: Undecided → Medium
tags: added: sts-sru
tags: removed: sts-sru
Changed in neutron (Ubuntu Trusty):
assignee: Mario Splivalo (mariosplivalo) → nobody