minor galera connection close/timeout/pooling issue at haproxy or oslo.db
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
OpenStack-Ansible | Incomplete | Undecided | Damian Dąbrowski |
Bug Description
setup:
using xena and pretty much default settings.
so openstack_
symptom:
seeing galera connection aborts reported in haproxy in ERSP column. In the mariadb log I get lines like:
"Aborted connection 594171 to db: 'placement' user: 'placement' host: 'hostA.
Also, the aborted connections counter is rising in mariadb.
Such errors cause retries on the openstack side, which makes things slow from time to time.
Of course, this happens after a connection has sat idle for the wait_timeout period.
expectation:
not getting those kinds of errors
some analysis:
mariadb is actually dropping the connections at wait_timeout (=galera_
The oslo.db config used in basically all openstack services does some connection pooling and is configured (e.g. in placement) with the following values (all default):
max_overflow = 50
max_pool_size = 5
pool_timeout = 30
connection_
So it should actually close connections and re-establish them before the timeout.
Also, haproxy uses timeouts of 5000s in frontend and backend, which should not matter here.
not a solution:
increasing the wait_timeout in mariadb to 1200 or 3600.
(workaround) solution but may not be a good one:
increasing the wait_timeout in mariadb to 7200.
I am not sure where the issue is actually coming from, but here are my best guesses:
* there is a bug on the openstack end, which is not passing the config values down to the lower-layer library
* there is some bug in the sql db facing lib code causing pooling and refresh not to work properly.
* the timeout in mariadb must be higher than in oslo.db
* haproxy may still cause some issue here and the 5000s may be part of that.
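To make the last two guesses easier to compare, here is a small sketch that only does arithmetic on the numbers quoted in this report: for a connection that stays completely idle, which of the two timers would expire first? The helper name `first_idle_timer` is made up for illustration, and the sketch assumes the 5000s haproxy timeouts apply to idle connections. It does not show that haproxy is the cause, only how the tried wait_timeout values sit relative to 5000s.

```python
def first_idle_timer(timers):
    """Return the name(s) of the timer(s) with the smallest value, in seconds."""
    lowest = min(timers.values())
    return sorted(name for name, secs in timers.items() if secs == lowest)


# From the report: haproxy uses 5000s in frontend and backend.
HAPROXY_TIMEOUT = 5000

# wait_timeout values tried in the report (1200 and 3600 did not help, 7200 did).
for wait_timeout in (1200, 3600, 7200):
    timers = {
        "mariadb wait_timeout": wait_timeout,
        "haproxy timeout": HAPROXY_TIMEOUT,
    }
    print(wait_timeout, first_idle_timer(timers))
# 1200 ['mariadb wait_timeout']
# 3600 ['mariadb wait_timeout']
# 7200 ['haproxy timeout']
```

Of the values tried, 7200 is the only one where the haproxy timer would run out before mariadb's, which may or may not be related to why that value behaves differently.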
impact:
mostly annoying errors that cause retries and slow things down, without any big impact.
so I consider this a minor bug
description: updated
Changed in openstack-ansible:
assignee: nobody → Damian Dąbrowski (damiandabrowski)
Hi Alexander, thanks for the report.
https://docs.sqlalchemy.org/en/14/core/pooling.html#setting-pool-recycle
https://docs.openstack.org/oslo.db/latest/reference/opts.html#database.connection_recycle_time
I'm not a database expert, but if I understand it correctly, connection_recycle_time=600 does not guarantee that a connection will be closed exactly after 10 minutes of inactivity.
It means that during the next checkout (i.e. when something tries to re-use that connection), the connection will be verified and optionally replaced (recycled).
So increasing wait_timeout may help to reduce the number of these warnings but can't guarantee they will completely disappear. That's because even if we set wait_timeout=7200 and connection_recycle_time=600, a connection may not be checked out by anything before wait_timeout expires.
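To illustrate what I mean, here is a toy model in plain Python. It is not SQLAlchemy or oslo.db code: `ToyServer`, `ToyConnection`, `ToyPool` and `simulate` are made-up names, the pool holds a single connection, and I'm assuming the recycle check compares the connection's age since creation against the recycle time, and only at checkout. The server side, in contrast, drops a connection by itself once it has been idle for wait_timeout.

```python
from dataclasses import dataclass


@dataclass
class ToyConnection:
    created: float
    last_used: float
    open: bool = True


@dataclass
class ToyServer:
    """Stands in for mariadb: drops connections idle for wait_timeout or longer."""
    wait_timeout: float
    aborted: int = 0

    def reap_idle(self, conn, now):
        # The server closes the connection on its own; nobody asked it to.
        if conn.open and now - conn.last_used >= self.wait_timeout:
            conn.open = False
            self.aborted += 1


class ToyPool:
    """Single-connection pool that checks connection age only at checkout."""

    def __init__(self, recycle):
        self.recycle = recycle
        self.conn = None
        self.recycled = 0

    def checkout(self, now):
        if self.conn is not None and now - self.conn.created > self.recycle:
            # Recycling is lazy: it happens here, not on a background timer.
            self.conn.open = False
            self.conn = None
            self.recycled += 1
        if self.conn is None:
            self.conn = ToyConnection(created=now, last_used=now)
        self.conn.last_used = now
        return self.conn


def simulate(wait_timeout, recycle, checkout_times):
    """Return (server-side aborts, pool-side recycles) for the given checkout times."""
    server = ToyServer(wait_timeout)
    pool = ToyPool(recycle)
    for now in checkout_times:
        if pool.conn is not None:
            # Catch up on what the server did while the pool was idle.
            server.reap_idle(pool.conn, now)
        pool.checkout(now)
    return server.aborted, pool.recycled


# Busy service: checkouts every 700s, nothing idles long enough to be aborted.
print(simulate(7200, 600, [0, 700, 1400]))   # (0, 2)
# Quiet service: an 7900s gap, so the server aborts before the pool ever looks.
print(simulate(7200, 600, [0, 100, 8000]))   # (1, 1)
```

In the second run the pool still replaces the connection at the next checkout, so the client gets a fresh one, but the server has already counted an aborted connection by then.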
Do you agree with me so far? :D (I'm trying to confirm I understand it correctly).
Additionally, why do you think these warnings are actually slowing things down?