Comment 12 for bug 2067345

Michel Jouvin (mijouvin) wrote (last edit):

Additional information:

We have two clouds: the production one running Antelope, including Magnum, and a test one also running Antelope, except that Magnum/Heat/Barbican run the latest Caracal version (RDO RPMs). As far as the SQL/MariaDB connections from Magnum are concerned:

- Production cloud: 18 Magnum (K8s) clusters, 40 active SQL connections from Magnum (this fits within the default max_pool_size plus the default max_overflow of 50; see the config excerpt below).
- Test cloud: 9 Magnum (K8s) clusters, 274 active SQL connections from Magnum. This is the count 2 hours after restarting the openstack-magnum-conductor service.

In both cases, all the connections are in the ESTABLISHED state.
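
For reference, these pool limits are oslo.db options read from the [database] section of magnum.conf. A minimal excerpt, assuming the defaults of recent oslo.db releases (max_pool_size = 5, max_overflow = 50; please check against your version):

    [database]
    # Per-worker SQLAlchemy pool limits: connections beyond
    # max_pool_size + max_overflow trigger the QueuePool limit error.
    # Setting max_pool_size = 0 removes the pool-size cap.
    max_pool_size = 5
    max_overflow = 50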

For some reason, it seems that a lot of connections are never closed. Using lsof to get the pid/fd associated with each connection and stat to get the Change/Modify/Access dates of each one (see the command sketch after the list below), I observed the following (https://paste.sh/A1Z4gquN#KZX9dOwAXgJO3n_9yDj7wDPc):

- Production cloud: all the connections have the same Access time, about 2 hours ago, despite different Change/Modify times.
- Test cloud: the Access time differs from connection to connection. For many connections, the Access time is the same as (or very close to) the Change/Modify time (which tend to be identical). If I select only those with a recent Access time (about half an hour ago), I find only 45 connections, which seems a more reasonable number... See https://paste.sh/yJANHwL5#_OC9uX7fwh0S7_4g75AHQnzH
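
For anyone wanting to reproduce the inspection, here is a minimal sketch of the lsof/stat pipeline, assuming MariaDB listens on port 3306 and the conductor processes match the command prefix "magnum"; both are assumptions to adjust for your deployment:

    # pid/fd pairs of established MariaDB connections from magnum-* processes
    lsof -nP -a -c magnum -iTCP:3306 -sTCP:ESTABLISHED -F pf |
    awk '/^p/ {pid = substr($0, 2)} /^f/ {print pid, substr($0, 2)}' |
    while read pid fd; do
        echo "== pid $pid fd $fd"
        # Access/Modify/Change timestamps of the socket behind each fd
        stat -L "/proc/$pid/fd/$fd" | grep -E '^(Access|Modify|Change): [0-9]'
    done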

This problem is blocking our upgrade to Caracal, which we would like to do ASAP to support recent K8s versions... As said previously, even setting max_pool_size=0 is not enough to prevent magnum-conductor from failing at some point with the exception mentioned in the first post.

Michel