charm-octavia doesn't apply worker-multiplier thus takes up available MySQL connections

Bug #1889731 reported by Nobuto Murata
Affects: OpenStack Octavia Charm
Status: Fix Released
Importance: High
Assigned to: Frode Nordahl
Milestone: 20.10

Bug Description

In a deployed cloud, each octavia unit runs an unnecessarily large number of worker processes (163) and holds an unnecessarily large number of database connections (90) even when idle.

The service can eat up the available MySQL connections, whereas other services limit their connection count by honouring the worker-multiplier configuration.

$ juju config octavia worker-multiplier
0.25
-> 20 is the expected worker count, as this is an 80-CPU-thread system

# pgrep -af /usr/bin/octavia-health-manager | wc -l
163

# ss -tp | grep -c :mysql
90
(when idle)
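The expected worker count above follows the usual charm convention of multiplying the CPU thread count by worker-multiplier. A minimal sketch of that calculation (`expected_workers` is a hypothetical name; the real helper lives in charm-helpers/charms.openstack and also applies container-specific caps):

```python
import multiprocessing

def expected_workers(multiplier, cpu_count=None):
    # Sketch of how OpenStack charms derive a worker count from
    # worker-multiplier; always at least one worker.
    if cpu_count is None:
        cpu_count = multiprocessing.cpu_count()
    return max(1, int(cpu_count * multiplier))

print(expected_workers(0.25, cpu_count=80))  # -> 20
```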

Revision history for this message
Nobuto Murata (nobuto) wrote :

I think the section below needs to be applied to the template, especially health_update_threads and stats_update_threads.

openstack/octavia (master=)$ git grep -C 7 processors etc/octavia.conf
etc/octavia.conf-
etc/octavia.conf-[health_manager]
etc/octavia.conf-# bind_ip = 127.0.0.1
etc/octavia.conf-# bind_port = 5555
etc/octavia.conf-# controller_ip_port_list example: 127.0.0.1:5555, 127.0.0.1:5555
etc/octavia.conf-# controller_ip_port_list =
etc/octavia.conf-# failover_threads = 10
etc/octavia.conf:# health_update_threads will default to the number of processors on the host
etc/octavia.conf-# health_update_threads =
etc/octavia.conf:# stats_update_threads will default to the number of processors on the host
etc/octavia.conf-# stats_update_threads =
etc/octavia.conf-# heartbeat_interval = 10
etc/octavia.conf-# heartbeat_key =
etc/octavia.conf-# heartbeat_timeout = 60
etc/octavia.conf-# health_check_interval = 3
etc/octavia.conf-# sock_rlimit = 0
etc/octavia.conf-
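If the template is updated, the rendered [health_manager] section would set these two options explicitly instead of leaving the commented defaults. A sketch, with 20 standing in for whatever value the charm computes from worker-multiplier:

```ini
[health_manager]
# Derived from worker-multiplier rather than defaulting to the CPU count
health_update_threads = 20
stats_update_threads = 20
```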

Revision history for this message
Nobuto Murata (nobuto) wrote :

I'm on the stable release of the charm so I see the worker-multiplier option.
https://jaas.ai/octavia/21#charm-config-worker-multiplier

However, it has been removed on master without a clear explanation.
https://review.opendev.org/#/c/728769/

We definitely need that option in charm-octavia.

Revision history for this message
Nobuto Murata (nobuto) wrote :

Subscribing ~field-high.

The documented worker-multiplier functionality is not working, and we have no control over the number of processes spawned. 480 connections (80-thread system × 2 services × 3 units) from Octavia alone is enough to hit the MySQL connection limit in production.
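The 480 figure is the worst case implied by the defaults: one health_update thread plus one stats_update thread per CPU thread, on every unit.

```python
# Idle connection count implied by the defaults described above.
cpu_threads = 80   # CPU threads per machine
services = 2       # health_update_threads + stats_update_threads pools
units = 3          # octavia units in the deployment

connections = cpu_threads * services * units
print(connections)  # -> 480
```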

Frode Nordahl (fnordahl)
Changed in charm-octavia:
status: New → In Progress
importance: Undecided → Critical
assignee: nobody → Frode Nordahl (fnordahl)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-octavia (master)

Fix proposed to branch: master
Review: https://review.opendev.org/744099

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-octavia (master)

Reviewed: https://review.opendev.org/744099
Committed: https://git.openstack.org/cgit/openstack/charm-octavia/commit/?id=68536aee5aeedfb0c62ba2a115be5512bd46f37c
Submitter: Zuul
Branch: master

commit 68536aee5aeedfb0c62ba2a115be5512bd46f37c
Author: Frode Nordahl <email address hidden>
Date: Fri Jul 31 09:20:28 2020 +0200

    Add back `worker-multiplier`

    Change I4f3eb1c14caf35a4b670a27d3b599edab83b1378 erroneously
    removed the `worker-multiplier` configuration option.

    Change-Id: I3b5dd0f8fae91a8643b0c74d9054587706fadf1a
    Closes-Bug: #1889731

Changed in charm-octavia:
status: In Progress → Fix Committed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-octavia (master)

Fix proposed to branch: master
Review: https://review.opendev.org/744402

Frode Nordahl (fnordahl)
Changed in charm-octavia:
importance: Critical → High
status: Fix Committed → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-octavia (master)

Reviewed: https://review.opendev.org/744402
Committed: https://git.openstack.org/cgit/openstack/charm-octavia/commit/?id=4ed5caf77e675bbfc98cc5f1e45bcfa250bfab16
Submitter: Zuul
Branch: master

commit 4ed5caf77e675bbfc98cc5f1e45bcfa250bfab16
Author: Nobuto Murata <email address hidden>
Date: Mon Aug 3 15:29:07 2020 +0900

    Set up health_manager processes based on worker-multiplier

    Properly limit the number of processes with worker-multiplier instead of
    spawning as many workers as available CPU threads.

    Change-Id: I7f42e131d7de4a58a926b634950969e6f406bb10
    Closes-Bug: #1889731

Changed in charm-octavia:
status: In Progress → Fix Committed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-octavia (stable/20.08)

Fix proposed to branch: stable/20.08
Review: https://review.opendev.org/758038

Felipe Reyes (freyes)
tags: added: sts
tags: added: canonical-bootstack
Revision history for this message
Trent Lloyd (lathiat) wrote :

I also saw this issue manifest as an actual problem in octavia-health-manager: it was stuck, not working, and outputting the following errors:

2020-10-13 06:25:15.184 1728 ERROR octavia.db.api [-] Connection to database failed. Retrying in 10 seconds.: sqlalchemy.exc.TimeoutError: QueuePool limit of size 10 overflow 20 reached, connection timed out, timeout 10 (Background on this error at: http://sqlalche.me/e/3o7r)
[backtrace omitted]
2020-10-13 06:25:15.184 1728 ERROR octavia.db.api pymysql.err.OperationalError: (2013, 'Lost connection to MySQL server during query')
2020-10-13 06:25:15.184 1728 ERROR octavia.db.api The above exception was the direct cause of the following exception:
2020-10-13 06:25:15.184 1728 ERROR octavia.db.api oslo_db.exception.DBConnectionError: (pymysql.err.OperationalError) (2013, 'Lost connection to MySQL server during query')
2020-10-13 06:25:15.184 1728 ERROR octavia.db.api (Background on this error at: http://sqlalche.me/e/e3q8)
2020-10-13 06:25:15.184 1728 ERROR octavia.db.api
2020-10-13 06:25:15.184 1728 ERROR octavia.db.api During handling of the above exception, another exception occurred:
2020-10-13 06:25:15.184 1728 ERROR octavia.db.api sqlalchemy.exc.TimeoutError: QueuePool limit of size 10 overflow 20 reached, connection timed out, timeout 10 (Background on this error at: http://sqlalche.me/e/3o7r)

I found the following:
sqlalchemy/oslo.db keeps a pool of 10 connections active at a time to reduce DB connection latency. It also allows an overflow of 20 more (30 in total) if all the connections are simultaneously in use by a thread, but won't allow any more even if requested.

Octavia is multi-threaded with eventlet/greenlet. By default health_update_threads and stats_update_threads are both equal to the number of CPU threads. Most of the time they will only use the DB connection for a short period (so will multiplex their use of the smaller number of connections) but various factors may cause them to hold onto the connection for a while or all have work to do at the same time.

Most charms limit the number of worker threads to 0.25× CPU threads via the worker-multiplier config [and, I think, a maximum of 4 inside LXD]; charm-octavia had this bug where it didn't do that.

Octavia hit the error "Lost connection to MySQL server during query" [this error can also happen if the server connection was lost before the query was sent]. Sometimes this is caused by an aggressive wait-timeout=180 on the server, but here we have wait-timeout=3600, so I suspect there was a MySQL server failover, which left the old connections stale because the new VIP owner didn't have the old TCP connections.

sqlalchemy normally sees such an error and automatically reconnects. However in doing so it got the additional error: "QueuePool limit of size 10 overflow 20 reached, connection timed out, timeout 10 (Background on this error at: http://sqlalche.me/e/3o7r)" as a direct result of trying to handle the original "Lost connection" error.

I suspect this is actually a limitation in sqlalchemy in that if a connection has an error, and it wants to reconnect, rather than "replace" the existing connection it tries to get a new connection in the pool...
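The pool exhaustion Trent describes can be reproduced with plain SQLAlchemy. A sketch using SQLite as a stand-in for MySQL (pool behaviour is backend-independent); pool_size=10 and max_overflow=20 mirror the "QueuePool limit of size 10 overflow 20" figures in the traceback above:

```python
from sqlalchemy import create_engine, exc
from sqlalchemy.pool import QueuePool

engine = create_engine(
    "sqlite://",
    poolclass=QueuePool,
    pool_size=10,
    max_overflow=20,
    pool_timeout=1,  # fail fast instead of the default 30 s wait
)

# Holding pool_size + max_overflow connections exhausts the pool; the
# next checkout times out, like the stuck health-manager threads did.
held = [engine.connect() for _ in range(30)]
try:
    engine.connect()
except exc.TimeoutError as err:
    print("pool exhausted:", err)
finally:
    for conn in held:
        conn.close()
```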


Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-octavia (stable/20.08)

Reviewed: https://review.opendev.org/758038
Committed: https://git.openstack.org/cgit/openstack/charm-octavia/commit/?id=741f43238cf3f0be569571af49de78d87288746f
Submitter: Zuul
Branch: stable/20.08

commit 741f43238cf3f0be569571af49de78d87288746f
Author: Nobuto Murata <email address hidden>
Date: Mon Aug 3 15:29:07 2020 +0900

    Set up health_manager processes based on worker-multiplier

    Properly limit the number of processes with worker-multiplier instead of
    spawning as many workers as available CPU threads.

    Change-Id: I7f42e131d7de4a58a926b634950969e6f406bb10
    Closes-Bug: #1889731
    (cherry picked from commit 4ed5caf77e675bbfc98cc5f1e45bcfa250bfab16)

Changed in charm-octavia:
milestone: none → 20.10
Changed in charm-octavia:
status: Fix Committed → Fix Released