Bug #1901898 “Too many open files after rebooting each controlle...” : Bugs : kolla-ansible

Revision history for this message

Radosław Piliszek (yoctozepto) wrote on 2020-10-28:

#1

Try running deploy again indeed. It is largely idempotent. I have never seen such an issue.

Revision history for this message

Mark Goddard (mgoddard) wrote on 2020-10-29:

#2

Could you provide some logs showing the context of the error messages?

Revision history for this message

Mitchell Walls (mitchwalls23) wrote on 2020-11-09:

#3

Download full text (9.6 KiB)

This is what I see when I check if the issue is happening. I use `docker logs memcached`. I see some variation of this across all of the docker containers.

Just making sure, did you happen to deploy on CentOS 8? Just wondering if it is related to that.

Too many open connections
accept4(): Too many open files
Too many open connections
accept4(): Too many open files
Too many open connections
accept4(): Too many open files
Too many open connections
accept4(): Too many open files
Too many open connections
getpeername: Transport endpoint is not connected
Failed to write, and not due to blocking: Broken pipe
accept4(): Too many open files
Too many open connections
accept4(): Too many open files
Too many open connections
accept4(): Too many open files
Too many open connections

Here is a Keystone example as soon as that starts to happen as well.

keystone-apache-public-error.log:2020-11-05 11:25:52.261790 2020-11-05 11:25:52.259 40 ERROR oslo_db.sqlalchemy.engines CR.CR_SERVER_LOST, "Lost connection to MySQL server during query")
keystone-apache-public-error.log:2020-11-05 11:25:52.261793 2020-11-05 11:25:52.259 40 ERROR oslo_db.sqlalchemy.engines pymysql.err.OperationalError: (2013, 'Lost connection to MySQL server during query')
keystone-apache-public-error.log:2020-11-05 11:25:52.261794 2020-11-05 11:25:52.259 40 ERROR oslo_db.sqlalchemy.engines
keystone-apache-public-error.log:2020-11-05 11:25:52.261796 2020-11-05 11:25:52.259 40 ERROR oslo_db.sqlalchemy.engines The above exception was the direct cause of the following exception:
keystone-apache-public-error.log:2020-11-05 11:25:52.261798 2020-11-05 11:25:52.259 40 ERROR oslo_db.sqlalchemy.engines
keystone-apache-public-error.log:2020-11-05 11:25:52.261799 2020-11-05 11:25:52.259 40 ERROR oslo_db.sqlalchemy.engines Traceback (most recent call last):
keystone-apache-public-error.log:2020-11-05 11:25:52.261801 2020-11-05 11:25:52.259 40 ERROR oslo_db.sqlalchemy.engines File "/var/lib/kolla/venv/lib/python3.6/site-packages/oslo_db/sqlalchemy/engines.py", line 73, in _connect_ping_listener
keystone-apache-public-error.log:2020-11-05 11:25:52.261803 2020-11-05 11:25:52.259 40 ERROR oslo_db.sqlalchemy.engines connection.scalar(select([1]))
keystone-apache-public-error.log:2020-11-05 11:25:52.261805 2020-11-05 11:25:52.259 40 ERROR oslo_db.sqlalchemy.engines File "/var/lib/kolla/venv/lib64/python3.6/site-packages/sqlalchemy/engine/base.py", line 914, in scalar
keystone-apache-public-error.log:2020-11-05 11:25:52.261807 2020-11-05 11:25:52.259 40 ERROR oslo_db.sqlalchemy.engines return self.execute(object_, *multiparams, **params).scalar()
keystone-apache-public-error.log:2020-11-05 11:25:52.261809 2020-11-05 11:25:52.259 40 ERROR oslo_db.sqlalchemy.engines File "/var/lib/kolla/venv/lib64/python3.6/site-packages/sqlalchemy/engine/base.py", line 984, in execute
keystone-apache-public-error.log:2020-11-05 11:25:52.261813 2020-11-05 11:25:52.259 40 ERROR oslo_db.sqlalchemy.engines return meth(self, multiparams, params)
keystone-apache-public-error.log:2020-11-05 11:25:52.261814 2020-11-05 11:25:52.259 40 ERROR oslo_db.sqlalchemy.engines File "/var/lib/kolla/venv/lib64/python3.6/...

This is what I see when I check if the issue is happening. I use `docker logs memcached`. I see some variation of this across all of the docker containers.

Just making sure, did you happen to deploy on CentOS 8? Just wondering if it is related to that.

Too many open connections
accept4(): Too many open files
Too many open connections
accept4(): Too many open files
Too many open connections
accept4(): Too many open files
Too many open connections
accept4(): Too many open files
Too many open connections
getpeername: Transport endpoint is not connected
Failed to write, and not due to blocking: Broken pipe
accept4(): Too many open files
Too many open connections
accept4(): Too many open files
Too many open connections
accept4(): Too many open files
Too many open connections

Here is a Keystone example as soon as that starts to happen as well.

keystone-apache-public-error.log:2020-11-05 11:25:52.261790 2020-11-05 11:25:52.259 40 ERROR oslo_db.sqlalchemy.engines     CR.CR_SERVER_LOST, "Lost connection to MySQL server during query")
keystone-apache-public-error.log:2020-11-05 11:25:52.261793 2020-11-05 11:25:52.259 40 ERROR oslo_db.sqlalchemy.engines pymysql.err.OperationalError: (2013, 'Lost connection to MySQL server during query')
keystone-apache-public-error.log:2020-11-05 11:25:52.261794 2020-11-05 11:25:52.259 40 ERROR oslo_db.sqlalchemy.engines 
keystone-apache-public-error.log:2020-11-05 11:25:52.261796 2020-11-05 11:25:52.259 40 ERROR oslo_db.sqlalchemy.engines The above exception was the direct cause of the following exception:
keystone-apache-public-error.log:2020-11-05 11:25:52.261798 2020-11-05 11:25:52.259 40 ERROR oslo_db.sqlalchemy.engines 
keystone-apache-public-error.log:2020-11-05 11:25:52.261799 2020-11-05 11:25:52.259 40 ERROR oslo_db.sqlalchemy.engines Traceback (most recent call last):
keystone-apache-public-error.log:2020-11-05 11:25:52.261801 2020-11-05 11:25:52.259 40 ERROR oslo_db.sqlalchemy.engines   File "/var/lib/kolla/venv/lib/python3.6/site-packages/oslo_db/sqlalchemy/engines.py", line 73, in _connect_ping_listener
keystone-apache-public-error.log:2020-11-05 11:25:52.261803 2020-11-05 11:25:52.259 40 ERROR oslo_db.sqlalchemy.engines     connection.scalar(select([1]))
keystone-apache-public-error.log:2020-11-05 11:25:52.261805 2020-11-05 11:25:52.259 40 ERROR oslo_db.sqlalchemy.engines   File "/var/lib/kolla/venv/lib64/python3.6/site-packages/sqlalchemy/engine/base.py", line 914, in scalar
keystone-apache-public-error.log:2020-11-05 11:25:52.261807 2020-11-05 11:25:52.259 40 ERROR oslo_db.sqlalchemy.engines     return self.execute(object_, *multiparams, **params).scalar()
keystone-apache-public-error.log:2020-11-05 11:25:52.261809 2020-11-05 11:25:52.259 40 ERROR oslo_db.sqlalchemy.engines   File "/var/lib/kolla/venv/lib64/python3.6/site-packages/sqlalchemy/engine/base.py", line 984, in execute
keystone-apache-public-error.log:2020-11-05 11:25:52.261813 2020-11-05 11:25:52.259 40 ERROR oslo_db.sqlalchemy.engines     return meth(self, multiparams, params)
keystone-apache-public-error.log:2020-11-05 11:25:52.261814 2020-11-05 11:25:52.259 40 ERROR oslo_db.sqlalchemy.engines   File "/var/lib/kolla/venv/lib64/python3.6/site-packages/sqlalchemy/sql/elements.py", line 293, in _execute_on_connection
keystone-apache-public-error.log:2020-11-05 11:25:52.261816 2020-11-05 11:25:52.259 40 ERROR oslo_db.sqlalchemy.engines     return connection._execute_clauseelement(self, multiparams, params)
keystone-apache-public-error.log:2020-11-05 11:25:52.261818 2020-11-05 11:25:52.259 40 ERROR oslo_db.sqlalchemy.engines   File "/var/lib/kolla/venv/lib64/python3.6/site-packages/sqlalchemy/engine/base.py", line 1103, in _execute_clauseelement
keystone-apache-public-error.log:2020-11-05 11:25:52.261820 2020-11-05 11:25:52.259 40 ERROR oslo_db.sqlalchemy.engines     distilled_params,
keystone-apache-public-error.log:2020-11-05 11:25:52.261822 2020-11-05 11:25:52.259 40 ERROR oslo_db.sqlalchemy.engines   File "/var/lib/kolla/venv/lib64/python3.6/site-packages/sqlalchemy/engine/base.py", line 1288, in _execute_context
keystone-apache-public-error.log:2020-11-05 11:25:52.261824 2020-11-05 11:25:52.259 40 ERROR oslo_db.sqlalchemy.engines     e, statement, parameters, cursor, context
keystone-apache-public-error.log:2020-11-05 11:25:52.261826 2020-11-05 11:25:52.259 40 ERROR oslo_db.sqlalchemy.engines   File "/var/lib/kolla/venv/lib64/python3.6/site-packages/sqlalchemy/engine/base.py", line 1479, in _handle_dbapi_exception
keystone-apache-public-error.log:2020-11-05 11:25:52.261828 2020-11-05 11:25:52.259 40 ERROR oslo_db.sqlalchemy.engines     util.raise_(newraise, with_traceback=exc_info[2], from_=e)
keystone-apache-public-error.log:2020-11-05 11:25:52.261830 2020-11-05 11:25:52.259 40 ERROR oslo_db.sqlalchemy.engines   File "/var/lib/kolla/venv/lib64/python3.6/site-packages/sqlalchemy/util/compat.py", line 178, in raise_
keystone-apache-public-error.log:2020-11-05 11:25:52.261832 2020-11-05 11:25:52.259 40 ERROR oslo_db.sqlalchemy.engines     raise exception
keystone-apache-public-error.log:2020-11-05 11:25:52.261834 2020-11-05 11:25:52.259 40 ERROR oslo_db.sqlalchemy.engines   File "/var/lib/kolla/venv/lib64/python3.6/site-packages/sqlalchemy/engine/base.py", line 1248, in _execute_context
keystone-apache-public-error.log:2020-11-05 11:25:52.261836 2020-11-05 11:25:52.259 40 ERROR oslo_db.sqlalchemy.engines     cursor, statement, parameters, context
keystone-apache-public-error.log:2020-11-05 11:25:52.261837 2020-11-05 11:25:52.259 40 ERROR oslo_db.sqlalchemy.engines   File "/var/lib/kolla/venv/lib64/python3.6/site-packages/sqlalchemy/engine/default.py", line 590, in do_execute
keystone-apache-public-error.log:2020-11-05 11:25:52.261840 2020-11-05 11:25:52.259 40 ERROR oslo_db.sqlalchemy.engines     cursor.execute(statement, parameters)
keystone-apache-public-error.log:2020-11-05 11:25:52.261841 2020-11-05 11:25:52.259 40 ERROR oslo_db.sqlalchemy.engines   File "/var/lib/kolla/venv/lib/python3.6/site-packages/pymysql/cursors.py", line 170, in execute
keystone-apache-public-error.log:2020-11-05 11:25:52.261843 2020-11-05 11:25:52.259 40 ERROR oslo_db.sqlalchemy.engines     result = self._query(query)
keystone-apache-public-error.log:2020-11-05 11:25:52.261845 2020-11-05 11:25:52.259 40 ERROR oslo_db.sqlalchemy.engines   File "/var/lib/kolla/venv/lib/python3.6/site-packages/pymysql/cursors.py", line 328, in _query
keystone-apache-public-error.log:2020-11-05 11:25:52.261847 2020-11-05 11:25:52.259 40 ERROR oslo_db.sqlalchemy.engines     conn.query(q)
keystone-apache-public-error.log:2020-11-05 11:25:52.261849 2020-11-05 11:25:52.259 40 ERROR oslo_db.sqlalchemy.engines   File "/var/lib/kolla/venv/lib/python3.6/site-packages/pymysql/connections.py", line 517, in query
keystone-apache-public-error.log:2020-11-05 11:25:52.261851 2020-11-05 11:25:52.259 40 ERROR oslo_db.sqlalchemy.engines     self._affected_rows = self._read_query_result(unbuffered=unbuffered)
keystone-apache-public-error.log:2020-11-05 11:25:52.261853 2020-11-05 11:25:52.259 40 ERROR oslo_db.sqlalchemy.engines   File "/var/lib/kolla/venv/lib/python3.6/site-packages/pymysql/connections.py", line 732, in _read_query_result
keystone-apache-public-error.log:2020-11-05 11:25:52.261855 2020-11-05 11:25:52.259 40 ERROR oslo_db.sqlalchemy.engines     result.read()
keystone-apache-public-error.log:2020-11-05 11:25:52.261856 2020-11-05 11:25:52.259 40 ERROR oslo_db.sqlalchemy.engines   File "/var/lib/kolla/venv/lib/python3.6/site-packages/pymysql/connections.py", line 1075, in read
keystone-apache-public-error.log:2020-11-05 11:25:52.261860 2020-11-05 11:25:52.259 40 ERROR oslo_db.sqlalchemy.engines     first_packet = self.connection._read_packet()
keystone-apache-public-error.log:2020-11-05 11:25:52.261861 2020-11-05 11:25:52.259 40 ERROR oslo_db.sqlalchemy.engines   File "/var/lib/kolla/venv/lib/python3.6/site-packages/pymysql/connections.py", line 657, in _read_packet
keystone-apache-public-error.log:2020-11-05 11:25:52.261863 2020-11-05 11:25:52.259 40 ERROR oslo_db.sqlalchemy.engines     packet_header = self._read_bytes(4)
keystone-apache-public-error.log:2020-11-05 11:25:52.261865 2020-11-05 11:25:52.259 40 ERROR oslo_db.sqlalchemy.engines   File "/var/lib/kolla/venv/lib/python3.6/site-packages/pymysql/connections.py", line 707, in _read_bytes
keystone-apache-public-error.log:2020-11-05 11:25:52.261867 2020-11-05 11:25:52.259 40 ERROR oslo_db.sqlalchemy.engines     CR.CR_SERVER_LOST, "Lost connection to MySQL server during query")
keystone-apache-public-error.log:2020-11-05 11:25:52.261869 2020-11-05 11:25:52.259 40 ERROR oslo_db.sqlalchemy.engines oslo_db.exception.DBConnectionError: (pymysql.err.OperationalError) (2013, 'Lost connection to MySQL server during query')
keystone-apache-public-error.log:2020-11-05 11:25:52.261871 2020-11-05 11:25:52.259 40 ERROR oslo_db.sqlalchemy.engines [SQL: SELECT 1]
keystone-apache-public-error.log:2020-11-05 11:25:52.261873 2020-11-05 11:25:52.259 40 ERROR oslo_db.sqlalchemy.engines (Background on this error at: http://sqlalche.me/e/e3q8)
keystone-apache-public-error.log:2020-11-05 11:25:52.261877 2020-11-05 11:25:52.259 40 ERROR oslo_db.sqlalchemy.engines \x1b[00

I see this in mariadb logs

2020-11-05 23:59:55 102752 [Warning] Aborted connection 102752 to db: 'cinder' user: 'cinder' host: 'ctl-os2.hpc.siue.edu' (Got an error reading communication packets)
2020-11-05 23:59:57 102756 [Warning] Aborted connection 102756 to db: 'cinder' user: 'cinder' host: 'ctl-os2.hpc.siue.edu' (Got an error reading communication packets)
2020-11-05 23:59:59 102760 [Warning] Aborted connection 102760 to db: 'cinder' user: 'cinder' host: 'ctl-os2.hpc.siue.edu' (Got an error reading communication packets)

As I'm sure you can imagine once keystone hangs up due to the open file issue the cluster pretty much dies.

Revision history for this message

Mitchell Walls (mitchwalls23) wrote on 2020-11-09:

#4

Curious, do I need to rerun deploy each time I have to reboot or is that something you typically don't see a need for? FWIW, all of these are new and fresh CentOS 8 servers with little to no modification.

Revision history for this message

Radosław Piliszek (yoctozepto) wrote on 2020-11-09:

#5

No, single member reboots should not need any extra intervention.

Revision history for this message

Mitchell Walls (mitchwalls23) wrote on 2020-11-11:

#6

I have rerun deploy on all controllers. I will go without rebooting for 2 weeks then get back to you. As a side note, I have checked everything related to limits for everything system wide from, system processes to user to containers etc. They all seem to have the correct values set. I'm guessing this is some type of CentOS 8 bug. I'll do some searching and get back after the 2 weeks.

Revision history for this message

Mitchell Walls (mitchwalls23) wrote on 2020-11-12:

#7

Rerunning deploy did not fix the issue it popped back up today. I'm out of ideas this is a production system so I'm unsure what I can do to fix it.

Revision history for this message

Mitchell Walls (mitchwalls23) wrote on 2020-11-16:

#8

I am going to just wipe one controller at a time and start from a fresh install again. I cannot figure out any reason it is happening. I even ran an upgrade and it didn't help. I guess if it happens again, I will post back my globals.yml. I plan on not doing anything to the fresh centos 8 controller install outside of configuring network and prereqs for kolla. Hoping to fix this before the end of year because we have to go into completely stable production because of a hpc grant.

Revision history for this message

Mark Goddard (mgoddard) wrote on 2020-11-17:

#9

If you want to debug the issue, I would suggest using the ss and lsof tools in the memcache container.

Although it says too many open files, I would guess it might actually mean sockets:

Too many open connections
accept4(): Too many open files

Revision history for this message

Radosław Piliszek (yoctozepto) wrote on 2020-11-17:

#10

Well, sockets are files. I think all issues I've seen about 'too many open files' were about too many open sockets... as they are more likely to be opened for a longer period of time.

Revision history for this message

Mitchell Walls (mitchwalls23) wrote on 2020-11-17:

#11

This seems very high am I wrong?

(memcached)[root@ctl-os1 /]# lsof | grep memcached | wc -l
44546

Revision history for this message

Mitchell Walls (mitchwalls23) wrote on 2020-11-17:

#12

The one above is currently crashed due to that many open files. Here is a freshly rebooted one.

(memcached)[root@ctl-os2 /]# lsof | grep memcached | wc -l
2706

Revision history for this message

Mitchell Walls (mitchwalls23) wrote on 2020-11-17:

#13

I found this maybe it is related?
https://bugs.launchpad.net/kolla-ansible/+bug/1892210

Revision history for this message

Mitchell Walls (mitchwalls23) wrote on 2020-11-17:

#14

Restarting the neutron_server container drops the open files from tens of thousands to mid to high 2000. I do not see the advanced memcached pool in the neutron.conf yet. So maybe this isn't in the pip release yet?

Revision history for this message

Mitchell Walls (mitchwalls23) wrote on 2020-11-17:

#15

I added the advanced memcached pool true config value to /etc/kolla/config/neutron.conf then reconfigured the controllers neutron tag. I will report back once I check lsof | grep memcached | wc -l again in a couple days.

Revision history for this message

Mitchell Walls (mitchwalls23) wrote on 2020-11-18:

#16

This is definitely fixed by the advanced memcached pool setting in neutron server. For anyone new to kolla this is how to mitigate until Neutron is fixed.

1) Create/edit file /etc/kolla/config/neutron.conf
2) Add this to contents of the above file.
[keystone_authtoken]
memcache_use_advanced_pool = True
3) Run kolla-ansible -i inventory -t neutron --limit controller1,controller2,controller3 reconfigure

Revision history for this message

Radosław Piliszek (yoctozepto) wrote on 2020-11-18:

#17

Thank you for the feedback, Mitchell! Much appreciated. And I'm glad you got it fixed.

If you use the patched Kolla-Ansible, you will get the same effect without an override, but your solution will work for anyone.

kolla-ansible

Too many open files after rebooting each controller node

Bug Description

Other bug subscribers

Remote bug watches