minor galera connection close/timeout/pooling issue at haproxy or oslo.db

Bug #1981463 reported by Alexander Binzxxxxxx
This bug affects 1 person
Affects: OpenStack-Ansible
Status: Incomplete
Importance: Undecided
Assigned to: Damian Dąbrowski
Milestone: —

Bug Description

setup:
using Xena with pretty much default settings,
so openstack_db_connection_recycle_time is 600 and galera_wait_timeout is 600 as well, while the haproxy timeout for the galera frontend/backend is 5000s.

symptom:
seeing galera connection aborts reported by haproxy in the ERSP column. In the mariadb log I get lines like:
"Aborted connection 594171 to db: 'placement' user: 'placement' host: 'hostA.mydomain.com' (Got timeout reading communication packets)"
The aborted-connections counter in mariadb is also rising.
Such errors cause retries on the OpenStack side, making things slow from time to time.
Of course this only happens after a connection has been idle for the wait_timeout period.

expectation:
not getting these kinds of errors

some analysis:
mariadb is actually dropping the connections at wait_timeout (= galera_wait_timeout = 600) because they have been idle for a long time.
The oslo.db configuration used in basically all OpenStack services does some connection pooling and is configured (e.g. in placement) with the following values (all defaults):
max_overflow = 50
max_pool_size = 5
pool_timeout = 30
connection_recycle_time = 600
So it should actually close connections and re-establish them before the timeout hits.
The haproxy timeouts of 5000s on frontend and backend should also not matter here.
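
For reference, a minimal sketch (hypothetical connection URL, plain SQLAlchemy rather than the actual oslo.db/enginefacade wiring) of how these options map onto SQLAlchemy's pool arguments:

    # Sketch only: the [database] options above roughly become these
    # SQLAlchemy engine/pool arguments.
    from sqlalchemy import create_engine

    engine = create_engine(
        "mysql+pymysql://placement:secret@haproxy-vip/placement",  # hypothetical DSN
        pool_size=5,        # max_pool_size
        max_overflow=50,    # max_overflow
        pool_timeout=30,    # pool_timeout: seconds to wait for a free pooled connection
        pool_recycle=600,   # connection_recycle_time
    )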

not a solution:
increasing the wait_timeout in mariadb to 1200 or 3600.

workaround (but maybe not a good one):
increasing the wait_timeout in mariadb to 7200.

I am not sure where the issue is actually coming from, but here are my best guesses:
* there is a bug on the OpenStack side that does not pass the config values down to the lower-layer library
* there is some bug in the SQL/DB-facing library code causing pooling and refresh not to work properly.
* the timeout in mariadb must be higher than the one in oslo.db
* haproxy may still cause some issue here and the 5000s timeouts may be part of that.

impact:
mostly annoying errors causing retries and slowing things down, without any big impact.
So I consider this a minor bug.

description: updated
Changed in openstack-ansible:
assignee: nobody → Damian Dąbrowski (damiandabrowski)
Revision history for this message
Damian Dąbrowski (damiandabrowski) wrote :

Hi Alexander, thanks for the report.

https://docs.sqlalchemy.org/en/14/core/pooling.html#setting-pool-recycle
https://docs.openstack.org/oslo.db/latest/reference/opts.html#database.connection_recycle_time

I'm not a database expert, but if I understand it correctly, connection_recycle_time=600 does not guarantee that a connection will be closed exactly after 10 minutes of inactivity.
It means that during the next checkout (i.e. when something tries to re-use that connection), the connection will be verified and optionally replaced (recycled).

So increasing wait_timeout may help to reduce the number of these warnings but can't guarantee they will completely disappear, because even if we set wait_timeout=7200 and connection_recycle_time=600, a connection may not be checked out by anything before wait_timeout expires.
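
A minimal sketch of that behaviour (hypothetical DSN, plain SQLAlchemy rather than oslo.db, assuming wait_timeout=600 on the server): the recycle check only runs at checkout, so a connection that sits idle in the pool gets aborted server-side first.

    import time
    from sqlalchemy import create_engine, text

    engine = create_engine(
        "mysql+pymysql://placement:secret@haproxy-vip/placement",  # hypothetical DSN
        pool_size=5, max_overflow=50, pool_timeout=30,
        pool_recycle=600,  # evaluated lazily, only when a connection is checked out
    )

    with engine.connect() as conn:
        conn.execute(text("SELECT 1"))  # connection returns to the pool afterwards

    # Nothing checks the connection out again, so it just sits in the pool.
    # After wait_timeout the server drops it and logs
    # "Aborted connection ... (Got timeout reading communication packets)".
    time.sleep(700)

    # Only now, at checkout, does the pool notice that pool_recycle has elapsed
    # and replace the connection before running the query; the server-side
    # warning was already written. pool_pre_ping=True would add a liveness
    # check on top of the age-based recycle.
    with engine.connect() as conn:
        conn.execute(text("SELECT 1"))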

Do You agree with me so far? :D (I'm trying to confirm I understand it correctly).

Additionally, why do You think these warnings are actually slowing things down?

Revision history for this message
Alexander Binzxxxxxx (devil000000) wrote (last edit ):

When reading those docs I had a similar understanding: connection_recycle_time does not guarantee that recycling happens exactly after that time. I did, however, expect a connection to get recycled before 2x the configured time.
I agree that this is a possible interpretation, but I do not know enough about the implementation to tell whether this understanding is correct. I also think this should not be the behaviour in such a case.

About the slowness:
When I used horizon to create a VM again after wait_timeout had basically killed all idle connections, operations appeared to be very slow for the first thing done by "each" OpenStack service.
My guess is that a killed connection gets fetched from each service's connection pool, tried, failed and reconnected (maybe after a backoff of about 1s), and only then is the operation retried.
I am not sure if this is actually what happens, but it appears likely to me.
If my understanding here is correct, I would have expected the connection fetched from the pool to be reconnected because of the expired connection_recycle_time, rather than used, failed, reconnected and retried.

Since this happens for every OpenStack service involved, the delay adds up.
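
Roughly the pattern I suspect, as an illustrative sketch only (this is not the actual oslo.db retry code, and the helper below is hypothetical):

    import time
    from sqlalchemy import exc, text

    def run_with_retry(engine, stmt, retries=1, backoff=1.0):
        # Hypothetical helper: use a (possibly stale) pooled connection, fail,
        # back off, reconnect and retry. oslo.db has its own retry machinery.
        for attempt in range(retries + 1):
            try:
                with engine.connect() as conn:
                    return conn.execute(text(stmt)).fetchall()
            except exc.OperationalError:
                if attempt == retries:
                    raise
                time.sleep(backoff)  # this per-service backoff is what adds up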

Revision history for this message
Damian Dąbrowski (damiandabrowski) wrote :

Hi again!

I'm so sorry for this huge delay, I was overwhelmed by other things.

I spent some time on this issue today and here are my findings:

1. These "Aborted connection .* (Got timeout reading communication packets)" occurrences are just warnings about connections aborted by mariadb due to the wait_timeout. I see no reason why they would be harmful.

2. As you mentioned, increasing wait_timeout is only a workaround which may prevent these warnings but may lead to other issues. For example, if haproxy fails over between galera nodes (due to network issues etc.), the galera nodes will keep a lot of "stale" connections around for a long time because of the high wait_timeout. In that case you may easily exceed the connection limit in haproxy. I believe that was the main reason why we decided to decrease wait_timeout.

3. I tried to reproduce your issue, so basically I set `openstack_db_connection_recycle_time=60` (it also affects wait_timeout). Then I prepared a script which concurrently executes the "list servers" API call 30 times, waits 65 seconds (to let mariadb abort the connections) and does the same again 4 more times.
I didn't notice any performance difference between the first attempt and the rest.
Script I used: https://paste.openstack.org/show/bMlSdIiAfVXGkiyyjvI1/
Output: https://paste.openstack.org/show/bOVA9iBfSscvLk1YzUz2/
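
(The linked paste is the authoritative version; the idea was roughly the following sketch, where the cloud name is a placeholder.)

    import concurrent.futures
    import time

    import openstack  # openstacksdk

    def list_servers():
        conn = openstack.connect(cloud='default')  # placeholder cloud name
        return sum(1 for _ in conn.compute.servers())

    for attempt in range(5):
        start = time.time()
        with concurrent.futures.ThreadPoolExecutor(max_workers=30) as pool:
            list(pool.map(lambda _: list_servers(), range(30)))
        print("attempt %d: %.1fs" % (attempt, time.time() - start))
        time.sleep(65)  # let mariadb abort idle connections (wait_timeout=60)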

So maybe your problem is somewhere else? memcached? Maybe you reached max_overflow or some other limit? Hard to say with the provided information :/

Changed in openstack-ansible:
status: New → Incomplete