Unresponsive services during failover

Bug #1896635 reported by Pierre Riteau
This bug affects 4 people
Affects        Status        Importance  Assigned to    Milestone
kolla-ansible  Fix Released  Medium      Pierre Riteau
Stein          Fix Released  Medium      Mark Goddard
Train          Fix Released  Medium      Mark Goddard
Ussuri         Fix Released  Medium      Mark Goddard
Victoria       Fix Released  Medium      Pierre Riteau

Bug Description

When the internal VIP is moved in the event of a failure of the active controller, OpenStack services can become unresponsive for a while. Logs show errors like this one:

oslo_db.sqlalchemy.engines oslo_db.exception.DBConnectionError: (pymysql.err.OperationalError) (2013, 'Lost connection to MySQL server during query')

This appears to be caused by services trying to use DB connections in the SQLAlchemy pool which have become invalid due to the move of the VIP.

Changed in kolla-ansible:
assignee: nobody → Pierre Riteau (priteau)
status: New → In Progress
Mark Goddard (mgoddard) wrote :
Changed in kolla-ansible:
importance: Undecided → Medium
OpenStack Infra (hudson-openstack) wrote : Fix merged to kolla-ansible (master)

Reviewed: https://review.opendev.org/749632
Committed: https://git.openstack.org/cgit/openstack/kolla-ansible/commit/?id=c81772024c70b84564cfddb29645390ae378498f
Submitter: Zuul
Branch: master

commit c81772024c70b84564cfddb29645390ae378498f
Author: Pierre Riteau <email address hidden>
Date: Tue Sep 22 17:52:36 2020 +0200

    Reduce the use of SQLAlchemy connection pooling

    When the internal VIP is moved in the event of a failure of the active
    controller, OpenStack services can become unresponsive as they try to
    talk with MariaDB using connections from the SQLAlchemy pool.

    It has been argued that OpenStack doesn't really need to use connection
    pooling with MariaDB [1]. This commit reduces the use of connection
    pooling via two configuration options:

    - max_pool_size is set to 1 to allow only a single connection in the
      pool (it is not possible to disable connection pooling entirely via
      oslo.db, and max_pool_size = 0 means unlimited pool size)
    - lower connection_recycle_time from the default of one hour to 10
      seconds, which means the single connection in the pool will be
      recreated regularly

    These settings have shown better reactivity of the system in the event
    of a failover.

    [1] http://lists.openstack.org/pipermail/openstack-dev/2015-April/061808.html

    Change-Id: Ib6a62d4428db9b95569314084090472870417f3d
    Closes-Bug: #1896635
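
The two options named in the commit message can be illustrated with a small sketch. This is not the kolla-ansible change itself and not SQLAlchemy's pool implementation. The configuration fragment is a hypothetical example of how the options might appear in a service's configuration file, assuming they live in the oslo.db [database] section. ToyPool, connect and load_pool_settings are made-up names that only model the recycle check performed when a connection is taken from the pool.

```python
import configparser

# Hypothetical fragment of a service configuration file.
# The option names come from the commit message; the [database]
# section is assumed from oslo.db convention.
CONF_TEXT = """
[database]
max_pool_size = 1
connection_recycle_time = 10
"""


class ToyPool:
    """Toy single-slot pool; NOT SQLAlchemy's implementation."""

    def __init__(self, recycle_seconds, connect):
        self.recycle_seconds = recycle_seconds
        self.connect = connect  # callable(now) -> connection object
        self.conn = None
        self.created_at = None

    def checkout(self, now):
        # The age check happens only at checkout: a connection older than
        # recycle_seconds is discarded and replaced by a fresh one.
        too_old = (
            self.conn is not None
            and now - self.created_at > self.recycle_seconds
        )
        if self.conn is None or too_old:
            self.conn = self.connect(now)
            self.created_at = now
        return self.conn


def load_pool_settings(text):
    """Read the two pool options from an INI-style string."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    db = parser["database"]
    return db.getint("max_pool_size"), db.getint("connection_recycle_time")


def connect(now):
    # Stand-in for opening a TCP connection to MariaDB through the VIP.
    return {"opened_at": now}


max_pool_size, recycle = load_pool_settings(CONF_TEXT)
pool = ToyPool(recycle, connect)

first = pool.checkout(now=0)
same = pool.checkout(now=5)    # younger than 10 s: reused
fresh = pool.checkout(now=20)  # older than 10 s: replaced
```

In this model a pooled connection is never reused once it is more than 10 seconds old, so a connection opened before the VIP moved is dropped soon after the failover. A stale connection can still be handed out within that window, which is consistent with the commit describing a reduction of pooling rather than its removal.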

Changed in kolla-ansible:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix proposed to kolla-ansible (stable/ussuri)

Fix proposed to branch: stable/ussuri
Review: https://review.opendev.org/754928

OpenStack Infra (hudson-openstack) wrote : Fix proposed to kolla-ansible (stable/train)

Fix proposed to branch: stable/train
Review: https://review.opendev.org/754929

OpenStack Infra (hudson-openstack) wrote : Fix proposed to kolla-ansible (stable/stein)

Fix proposed to branch: stable/stein
Review: https://review.opendev.org/754931

OpenStack Infra (hudson-openstack) wrote : Fix merged to kolla-ansible (stable/train)

Reviewed: https://review.opendev.org/754929
Committed: https://git.openstack.org/cgit/openstack/kolla-ansible/commit/?id=903efc5a9880eee960f131f7954ec568e6f60df1
Submitter: Zuul
Branch: stable/train

commit 903efc5a9880eee960f131f7954ec568e6f60df1
Author: Pierre Riteau <email address hidden>
Date: Tue Sep 22 17:52:36 2020 +0200

    Reduce the use of SQLAlchemy connection pooling

    (cherry picked from commit c81772024c70b84564cfddb29645390ae378498f)

OpenStack Infra (hudson-openstack) wrote : Fix merged to kolla-ansible (stable/ussuri)

Reviewed: https://review.opendev.org/754928
Committed: https://git.openstack.org/cgit/openstack/kolla-ansible/commit/?id=1d4fd52e4e77d5b93f21ccf9f382ae110459f3ff
Submitter: Zuul
Branch: stable/ussuri

commit 1d4fd52e4e77d5b93f21ccf9f382ae110459f3ff
Author: Pierre Riteau <email address hidden>
Date: Tue Sep 22 17:52:36 2020 +0200

    Reduce the use of SQLAlchemy connection pooling

    (cherry picked from commit c81772024c70b84564cfddb29645390ae378498f)

OpenStack Infra (hudson-openstack) wrote : Fix merged to kolla-ansible (stable/stein)

Reviewed: https://review.opendev.org/754931
Committed: https://git.openstack.org/cgit/openstack/kolla-ansible/commit/?id=9ee209080982e66acff3f15fd64c9db965084f8b
Submitter: Zuul
Branch: stable/stein

commit 9ee209080982e66acff3f15fd64c9db965084f8b
Author: Pierre Riteau <email address hidden>
Date: Tue Sep 22 17:52:36 2020 +0200

    Reduce the use of SQLAlchemy connection pooling

    (cherry picked from commit c81772024c70b84564cfddb29645390ae378498f)

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/kolla-ansible 8.3.0

This issue was fixed in the openstack/kolla-ansible 8.3.0 release.

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/kolla-ansible 10.2.0

This issue was fixed in the openstack/kolla-ansible 10.2.0 release.

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/kolla-ansible 9.3.0

This issue was fixed in the openstack/kolla-ansible 9.3.0 release.
