charm-octavia doesn't apply worker-multiplier thus takes up available MySQL connections

Bug #1889731 reported by Nobuto Murata
Affects: OpenStack Octavia Charm
Status: Fix Released
Importance: High
Assigned to: Frode Nordahl
Milestone: 20.10

Bug Description

In a deployed cloud, each octavia unit runs an unnecessarily large number of worker processes (163) and holds an unnecessarily large number of database connections (90) even when idle.

The service can eat up the available MySQL connections, whereas other services limit their connection count by honouring the worker-multiplier configuration.

$ juju config octavia worker-multiplier
0.25
-> 20 is the expected worker count, as this is an 80-CPU-thread system

# pgrep -af /usr/bin/octavia-health-manager | wc -l
163

# ss -tp | grep -c :mysql
90
(when idle)
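The expected worker count above follows the usual charm convention of multiplying the CPU thread count by worker-multiplier. A minimal sketch of that calculation (`expected_workers` is a hypothetical name; the real helper lives in charm-helpers/charms.openstack and also applies container-specific caps):

```python
import multiprocessing

def expected_workers(multiplier, cpu_count=None):
    # Sketch of how OpenStack charms derive a worker count from
    # worker-multiplier; always at least one worker.
    if cpu_count is None:
        cpu_count = multiprocessing.cpu_count()
    return max(1, int(cpu_count * multiplier))

print(expected_workers(0.25, cpu_count=80))  # -> 20
```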

Revision history for this message
Nobuto Murata (nobuto) wrote :

I think the section below needs to be applied to the template, especially health_update_threads and stats_update_threads.

openstack/octavia (master=)$ git grep -C 7 processors etc/octavia.conf
etc/octavia.conf-
etc/octavia.conf-[health_manager]
etc/octavia.conf-# bind_ip = 127.0.0.1
etc/octavia.conf-# bind_port = 5555
etc/octavia.conf-# controller_ip_port_list example: 127.0.0.1:5555, 127.0.0.1:5555
etc/octavia.conf-# controller_ip_port_list =
etc/octavia.conf-# failover_threads = 10
etc/octavia.conf:# health_update_threads will default to the number of processors on the host
etc/octavia.conf-# health_update_threads =
etc/octavia.conf:# stats_update_threads will default to the number of processors on the host
etc/octavia.conf-# stats_update_threads =
etc/octavia.conf-# heartbeat_interval = 10
etc/octavia.conf-# heartbeat_key =
etc/octavia.conf-# heartbeat_timeout = 60
etc/octavia.conf-# health_check_interval = 3
etc/octavia.conf-# sock_rlimit = 0
etc/octavia.conf-
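If the template is updated, the rendered [health_manager] section would set these two options explicitly instead of leaving the commented defaults. A sketch, with 20 standing in for whatever value the charm computes from worker-multiplier:

```ini
[health_manager]
# Derived from worker-multiplier rather than defaulting to the CPU count
health_update_threads = 20
stats_update_threads = 20
```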

Revision history for this message
Nobuto Murata (nobuto) wrote :

I'm on the stable release of the charm so I see the worker-multiplier option.
https://jaas.ai/octavia/21#charm-config-worker-multiplier

However, it has been removed on master without a clear explanation.
https://review.opendev.org/#/c/728769/

We definitely need that option in charm-octavia.

Revision history for this message
Nobuto Murata (nobuto) wrote :

Subscribing ~field-high.

The documented worker-multiplier functionality is not working, and we have no control over the number of processes spawned. 480 connections (80-thread system × 2 services × 3 units) from Octavia alone is enough to hit the MySQL connection limit in production.
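The 480 figure is the worst case implied by the defaults: one health_update thread plus one stats_update thread per CPU thread, on every unit.

```python
# Idle connection count implied by the defaults described above.
cpu_threads = 80   # CPU threads per machine
services = 2       # health_update_threads + stats_update_threads pools
units = 3          # octavia units in the deployment

connections = cpu_threads * services * units
print(connections)  # -> 480
```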

Frode Nordahl (fnordahl)
Changed in charm-octavia:
status: New → In Progress
importance: Undecided → Critical
assignee: nobody → Frode Nordahl (fnordahl)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-octavia (master)

Fix proposed to branch: master
Review: https://review.opendev.org/744099

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-octavia (master)

Reviewed: https://review.opendev.org/744099
Committed: https://git.openstack.org/cgit/openstack/charm-octavia/commit/?id=68536aee5aeedfb0c62ba2a115be5512bd46f37c
Submitter: Zuul
Branch: master

commit 68536aee5aeedfb0c62ba2a115be5512bd46f37c
Author: Frode Nordahl <email address hidden>
Date: Fri Jul 31 09:20:28 2020 +0200

    Add back `worker-multiplier`

    Change I4f3eb1c14caf35a4b670a27d3b599edab83b1378 erroneously
    removed the `worker-multiplier` configuration option.

    Change-Id: I3b5dd0f8fae91a8643b0c74d9054587706fadf1a
    Closes-Bug: #1889731

Changed in charm-octavia:
status: In Progress → Fix Committed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-octavia (master)

Fix proposed to branch: master
Review: https://review.opendev.org/744402

Frode Nordahl (fnordahl)
Changed in charm-octavia:
importance: Critical → High
status: Fix Committed → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-octavia (master)

Reviewed: https://review.opendev.org/744402
Committed: https://git.openstack.org/cgit/openstack/charm-octavia/commit/?id=4ed5caf77e675bbfc98cc5f1e45bcfa250bfab16
Submitter: Zuul
Branch: master

commit 4ed5caf77e675bbfc98cc5f1e45bcfa250bfab16
Author: Nobuto Murata <email address hidden>
Date: Mon Aug 3 15:29:07 2020 +0900

    Set up health_manager processes based on worker-multiplier

    Properly limit the number of processes with worker-multiplier instead of
    spawning as many workers as available CPU threads.

    Change-Id: I7f42e131d7de4a58a926b634950969e6f406bb10
    Closes-Bug: #1889731

Changed in charm-octavia:
status: In Progress → Fix Committed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-octavia (stable/20.08)

Fix proposed to branch: stable/20.08
Review: https://review.opendev.org/758038

Felipe Reyes (freyes)
tags: added: sts
tags: added: canonical-bootstack
Revision history for this message
Trent Lloyd (lathiat) wrote :

I also saw this issue manifest as an actual problem in octavia-health-manager: it was stuck, not working, and outputting the following errors:

2020-10-13 06:25:15.184 1728 ERROR octavia.db.api [-] Connection to database failed. Retrying in 10 seconds.: sqlalchemy.exc.TimeoutError: QueuePool limit of size 10 overflow 20 reached, connection timed out, timeout 10 (Background on this error at: http://sqlalche.me/e/3o7r)
[backtrace omitted]
2020-10-13 06:25:15.184 1728 ERROR octavia.db.api pymysql.err.OperationalError: (2013, 'Lost connection to MySQL server during query')
2020-10-13 06:25:15.184 1728 ERROR octavia.db.api The above exception was the direct cause of the following exception:
2020-10-13 06:25:15.184 1728 ERROR octavia.db.api oslo_db.exception.DBConnectionError: (pymysql.err.OperationalError) (2013, 'Lost connection to MySQL server during query')
2020-10-13 06:25:15.184 1728 ERROR octavia.db.api (Background on this error at: http://sqlalche.me/e/e3q8)
2020-10-13 06:25:15.184 1728 ERROR octavia.db.api
2020-10-13 06:25:15.184 1728 ERROR octavia.db.api During handling of the above exception, another exception occurred:
2020-10-13 06:25:15.184 1728 ERROR octavia.db.api sqlalchemy.exc.TimeoutError: QueuePool limit of size 10 overflow 20 reached, connection timed out, timeout 10 (Background on this error at: http://sqlalche.me/e/3o7r)

I found the following:
sqlalchemy/oslo.db keeps a pool of 10 connections active at a time to reduce DB connection latency. It also allows an overflow of 20 more (30 in total) if all the connections are simultaneously in use by a thread, but won't allow any more even if requested.

Octavia is multi-threaded with eventlet/greenlet. By default health_update_threads and stats_update_threads are both equal to the number of CPU threads. Most of the time they will only use the DB connection for a short period (so will multiplex their use of the smaller number of connections) but various factors may cause them to hold onto the connection for a while or all have work to do at the same time.

Most charms limit the number of worker threads to 0.25× CPU threads via the worker-multiplier config [and, I think, a maximum of 4 inside LXD]; charm-octavia had this bug where it didn't do that.

Octavia hit the error "Lost connection to MySQL server during query" [this error can also happen if the server connection was lost before the query was sent]. Sometimes this is caused by an aggressive wait-timeout=180 on the server, but here we have wait-timeout=3600, so I suspect there was a MySQL server failover, which left the old connections stale because the new VIP owner didn't have the old TCP connections.

sqlalchemy normally sees such an error and automatically reconnects. However in doing so it got the additional error: "QueuePool limit of size 10 overflow 20 reached, connection timed out, timeout 10 (Background on this error at: http://sqlalche.me/e/3o7r)" as a direct result of trying to handle the original "Lost connection" error.

I suspect this is actually a limitation in sqlalchemy in that if a connection has an error, and it wants to reconnect, rather than "replace" the existing connection it tries to get a new connection in the pool...
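The pool exhaustion Trent describes can be reproduced with plain SQLAlchemy. A sketch using SQLite as a stand-in for MySQL (pool behaviour is backend-independent); pool_size=10 and max_overflow=20 mirror the "QueuePool limit of size 10 overflow 20" figures in the traceback above:

```python
from sqlalchemy import create_engine, exc
from sqlalchemy.pool import QueuePool

engine = create_engine(
    "sqlite://",
    poolclass=QueuePool,
    pool_size=10,
    max_overflow=20,
    pool_timeout=1,  # fail fast instead of the default 30 s wait
)

# Holding pool_size + max_overflow connections exhausts the pool; the
# next checkout times out, like the stuck health-manager threads did.
held = [engine.connect() for _ in range(30)]
try:
    engine.connect()
except exc.TimeoutError as err:
    print("pool exhausted:", err)
finally:
    for conn in held:
        conn.close()
```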


Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-octavia (stable/20.08)

Reviewed: https://review.opendev.org/758038
Committed: https://git.openstack.org/cgit/openstack/charm-octavia/commit/?id=741f43238cf3f0be569571af49de78d87288746f
Submitter: Zuul
Branch: stable/20.08

commit 741f43238cf3f0be569571af49de78d87288746f
Author: Nobuto Murata <email address hidden>
Date: Mon Aug 3 15:29:07 2020 +0900

    Set up health_manager processes based on worker-multiplier

    Properly limit the number of processes with worker-multiplier instead of
    spawning as many workers as available CPU threads.

    Change-Id: I7f42e131d7de4a58a926b634950969e6f406bb10
    Closes-Bug: #1889731
    (cherry picked from commit 4ed5caf77e675bbfc98cc5f1e45bcfa250bfab16)

Changed in charm-octavia:
milestone: none → 20.10
Changed in charm-octavia:
status: Fix Committed → Fix Released