Error with QueuePool limit of size

Bug #2067345 reported by Khoi
This bug affects 5 people
Affects: Magnum
Status: New
Importance: Undecided
Assigned to: Unassigned

Bug Description

Hello.

I got this problem after creating and deleting a cluster about 15 times repeatedly:

Ignore error [QueuePool limit of size 1 overflow 50 reached, connection timed out, timeout 30.00 (Background on this error at: https://sqlalche.me/e/14/3o7r)] when syncing up cluster status.: sqlalchemy.exc.TimeoutError: QueuePool limit of size 1 overflow 50 reached, connection timed out, timeout 30.00

This is the full log:

https://paste.sh/gWg28T2s#zctJBJsW34bX9pG705VR5PMP

OpenStack: 2024.1
Ubuntu: 22.04
Deployment Tool: Kolla Ansible
Driver: Magnum-Cluster-API (0.17.0) from Vexxhost

Thank you very much.

Revision history for this message
Khoi (khoinh5) wrote :

Hello.
I believe this is a critical bug, because I created only one cluster and it still happened after one day without touching anything.

Revision history for this message
rasty94 (rasty94) wrote (last edit ):

I have the same problem with this version:

OpenStack: 2024.1
Rocky Linux 9
Deployment Tool: Kolla Ansible

We can get deployments with Magnum to succeed by increasing the limits on the connection pool:
[database]
connection_recycle_time = 10
max_pool_size = 50
max_overflow = 100
max_retries = -1
pool_timeout = 180

But this is not a solution...
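
For context, here is a minimal sketch (not Magnum code) of how these [database] options translate to the underlying SQLAlchemy engine arguments, assuming the usual oslo.db option-to-engine mapping; the connection URL and credentials below are placeholders:

# Sketch only: roughly what oslo.db ends up passing to SQLAlchemy for the
# [database] options quoted above. URL and credentials are placeholders.
from sqlalchemy import create_engine

engine = create_engine(
    "mysql+pymysql://magnum:secret@db-host/magnum",  # placeholder connection URL
    pool_size=50,       # [database] max_pool_size
    max_overflow=100,   # [database] max_overflow
    pool_timeout=180,   # [database] pool_timeout, seconds to wait for a free connection
    pool_recycle=10,    # [database] connection_recycle_time, seconds before a connection is recycled
)
# max_retries = -1 is handled by oslo.db itself (retry the initial connection
# forever) rather than being passed to the SQLAlchemy engine.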

Revision history for this message
Khoi (khoinh5) wrote :

Thank you for sharing.

I wonder why it is OK with 2023.1; I didn't need to tune these parameters there.

Revision history for this message
rasty94 (rasty94) wrote :

I was reviewing what has changed from 2023.1 to 2024.1 in periodic.py, but I can't see anything that could cause this problem.

The only addition is:
            # Clean up trusts and certificates, if they still exist.
            os_client = clients.OpenStackClients(self.ctx)
            LOG.debug("Calling delete_trustee_and_trusts from periodic "
                      "DELETE_COMPLETE")
            trust_manager.delete_trustee_and_trust(os_client, self.ctx,
                                                   self.cluster)
            cert_manager.delete_certificates_from_cluster(self.cluster,
                                                          context=self.ctx)

Revision history for this message
Khoi (khoinh5) wrote :

Oh.

I hope there will be some advice on this. Even after tuning these parameters, I still run into the QueuePool limit again after a period of time.

Revision history for this message
Khoi (khoinh5) wrote :

Which Magnum driver did you use?

Revision history for this message
Khoi (khoinh5) wrote :

I increased max_pool_size to 2 and it still happened again within two days.

Revision history for this message
Michel Jouvin (mijouvin) wrote :

Hi,

I am experiencing the same problem with Caracal Magnum (Kubernetes driver) and I have max_pool_size set to 5. I'm running AlmaLinux 9.4.

The problem happens on a test instance with a very limited number of clusters, whereas I have not seen it on the production Magnum instance running Antelope with ~20 active clusters.

Michel

Revision history for this message
Michel Jouvin (mijouvin) wrote :

In fact, setting max_pool_size to 0 (no limit) seems to work around the problem. Not sure what it means in the long run...

Michel

Revision history for this message
Khoi (khoinh5) wrote : Re: [Bug 2067345] Re: Error with QueuePool limit of size

Hi.
I think it will make the system crash. I hope that we will have a good solution for it.

Revision history for this message
Michel Jouvin (mijouvin) wrote :

FYI, setting `max_pool_size` to 0 just delays the problem. At some point there is a very high number of open connections from Magnum to the MariaDB server, the error appears again (maybe some resources are exhausted on the server side), and the only solution is to restart the magnum-conductor service. In the last occurrence of the problem, the number of open connections was ~385 on a test cloud with fewer than 10 clusters...

Would lowering conn_pool_ttl help to recycle connections?

Michel

Revision history for this message
Michel Jouvin (mijouvin) wrote (last edit ):

Additional information:

We have two clouds: the production one running Antelope, including Magnum, and a test one also running Antelope, except that Magnum/Heat/Barbican run the latest Caracal version (RDO RPMs). As far as the SQL/MariaDB connections from Magnum are concerned:

- Production cloud: 18 Magnum (K8s) clusters, 40 active SQL connections from Magnum (thus fitting within the default max_pool_size=1 plus its overflow of 50).
- Test cloud: 9 Magnum (K8s) clusters, 274 active SQL connections from Magnum. That is the count two hours after restarting the openstack-magnum-conductor service.

In both cases, all the connections are in the ESTABLISHED state.

For some reason, it seems that a lot of connections are not closed. Using lsof to get the pid/fd associated with each connection and stat to get the Change/Modify/Access dates of each one, I observed the following (see https://paste.sh/A1Z4gquN#KZX9dOwAXgJO3n_9yDj7wDPc):

- Production cloud: all the connections have the same Access time (despite different Change/Modify times), about two hours ago.
- Test cloud: the Access time varies from connection to connection. For many connections, the Access time is the same as (or very close to) the Change/Modify time (which tend to be identical). If I keep only those with a recent Access time (within the last half hour), I find just 45 connections, which seems a more reasonable number... See https://paste.sh/yJANHwL5#_OC9uX7fwh0S7_4g75AHQnzH
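
As a rough, hypothetical equivalent of this lsof/stat counting (assuming psutil is available; these are not the actual commands used above):

# Counts ESTABLISHED TCP connections per magnum-conductor process,
# roughly equivalent to the lsof-based counting described above.
import psutil

for proc in psutil.process_iter(['pid', 'name', 'cmdline']):
    cmdline = ' '.join(proc.info['cmdline'] or [])
    if 'magnum-conductor' in cmdline:
        established = [c for c in proc.connections(kind='tcp')
                       if c.status == psutil.CONN_ESTABLISHED]
        print(proc.info['pid'], len(established))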

This problem is blocking our upgrade to Caracal, which we would like to do as soon as possible to support recent K8s versions... As said previously, even setting max_pool_size=0 is not enough to prevent magnum-conductor from failing at some point with the exception mentioned in the first post.

Michel

Revision history for this message
Michel Jouvin (mijouvin) wrote :

I checked the magnum-conductor status this morning: it is failing with the exception mentioned in this issue, despite max_pool_size=0:

sqlalchemy.exc.TimeoutError: QueuePool limit of size 0 overflow 50 reached, connection timed out, timeout 30.00 (Background on this error at: https://sqlalche.me/e/14/3o7r)

which is a bit strange, as 0 should mean no limit... Looking at the MariaDB server log, there are a lot of:

2024-07-07 23:59:43 544875 [Warning] Aborted connection 544875 to db: 'magnum' user: 'magnum' host: 'os-77023-admin' (Got timeout reading communication packets)

followed by:

Status information:
Current dir: /var/lib/mysql/
Running threads: 607 Cached threads: 9 Stack size: 299008

The number of threads looks very high for a test cloud with very low activity...

On the Magnum server, counting the established connections the same way as before, there are only 135, which is quite low compared to what was observed previously. There are 33 magnum-conductor processes running.

Michel

Revision history for this message
Jake Yip (waipengyip) wrote :

Hi, I am still on Antelope / Bobcat, so I haven't encountered this. That being said, Caracal is in our plans, so I may come back with more information when I manage to upgrade to it.

Revision history for this message
Michel Jouvin (mijouvin) wrote :

Hi,

I found some additional information that may be useful, so I am adding it here.

- On the Antelope cloud, the number of SQL connections opened by the magnum-conductor processes is the number of active clusters + 1, and it remains stable. There is also one open SQL connection per uwsgi process running the API (8 processes).
- On the Caracal cloud, the number of SQL connections opened by the magnum-conductor processes is much higher than the number of clusters (typically 2x, but it may be more), and the uwsgi processes running the API (8 processes) also have a very high number of open SQL connections (currently 54).

Michel

Revision history for this message
rasty94 (rasty94) wrote :

Hello,

Internally we are doing some promising tests, as we have seen a basic "problem" in how database access is handled with SQLAlchemy: Magnum still uses SQLAlchemy 1.4.51, yet oslo_db in Caracal has been modified to remove all the autocommit handling (autocommit is not compatible with SQLAlchemy 2), and this causes many sessions to be left unclosed.
We have modified the following files internally to reverse the autocommit change, and we see that the number of sessions against the database (with several clusters deployed) is greatly reduced (from over 120 to 20-30):
/var/lib/kolla/venv/lib/python3.9/site-packages/oslo_db/sqlalchemy/enginefacade.py
/var/lib/kolla/venv/lib/python3.9/site-packages/magnum/db/sqlalchemy/api.py
(We modified these directly in the container, which ships version 15.0.0.0 of oslo.db.)

We are going to do more tests, and if we see that everything works correctly, we will propose the change.

If this is the case, it will affect multiple Caracal modules that are not yet on SQLAlchemy 2 but have picked up the updated oslo.db.
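
To make the failure mode concrete, here is a minimal, self-contained sketch in plain SQLAlchemy 1.4 (not oslo.db or Magnum code) showing how connections that are checked out and never returned exhaust pool_size + max_overflow and raise exactly this TimeoutError:

from sqlalchemy import create_engine, exc
from sqlalchemy.pool import QueuePool

# Throwaway SQLite engine just to drive the pool; the tiny pool_size and
# max_overflow make the failure appear after a handful of leaked connections.
engine = create_engine("sqlite://", poolclass=QueuePool,
                       pool_size=1, max_overflow=2, pool_timeout=1)

# Connections that are checked out and never closed stay out of the pool,
# like the unclosed sessions described above.
leaked = [engine.connect() for _ in range(3)]  # pool_size + max_overflow

try:
    engine.connect()  # nothing is free and nothing is ever returned
except exc.TimeoutError as err:
    print(err)  # "QueuePool limit of size 1 overflow 2 reached, connection timed out, ..."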

Revision history for this message
Michel Jouvin (mijouvin) wrote :

Hi,

Thanks for this encouraging news. I will be away for a few weeks but will be happy to test the proposed changes once I'm back!

Revision history for this message
Jake Yip (waipengyip) wrote :

@rasty94 that is encouraging news indeed! Awaiting your patches.

Revision history for this message
rasty94 (rasty94) wrote :

Attached is a small zipped file with the different modifications applied to the images I have generated for Magnum (API and driver).
Briefly, the three modifications made are:
1- Modification of the default kubecluster template, as it did not include the correct default versions for 2024.1 (Caracal).
2- Modification/reversal of the changes made to oslo.db (enginefacade.py) and the SQLAlchemy API in Magnum to add back the autocommit parameter (see the sketch after this list). This is because the installed SQLAlchemy version is 1.4.51 instead of 2.x.x, which means it is still advisable to parameterise things as they were before 2024.1 so that oslo.db stays compatible with lower versions of SQLAlchemy, as several considerations have been removed in this version of oslo.db (it may affect more OpenStack modules; we are checking).
3- Removal of the magnum-cluster-api package; in certain cases it has been causing container crashes, and for our particular use it is not necessary for the moment.
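
As a purely hypothetical illustration of point 2 (this is not the actual oslo.db or Magnum patch): SQLAlchemy 1.4 still accepts the legacy autocommit session mode that 2.0 removes, and in that mode the Session does not hold an open transaction (and therefore a pooled connection) between statements:

from sqlalchemy import create_engine, text
from sqlalchemy.orm import sessionmaker

engine = create_engine("sqlite://")  # placeholder engine, just for the sketch
Session = sessionmaker(bind=engine, autocommit=True)  # legacy 1.4-only mode, removed in 2.0

session = Session()
print(session.execute(text("SELECT 1")).scalar())  # no transaction is left open afterwards
session.close()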

Best Regards
