Too many open files after rebooting each controller node

Bug #1901898 reported by Mitchell Walls
This bug affects 2 people
Affects: kolla-ansible
Status: New
Importance: Undecided
Assigned to: Unassigned

Bug Description

After deploying the controllers in VMware, I originally didn't give them enough RAM, so they ended up swapping. This was a month or so after deploying. I had to reboot each controller one at a time to increase the RAM and CPU. About 3-4 days after I did that, the controllers crashed again, but this time pretty much every container was complaining about too many open files. This hadn't happened in the month before the controllers started swapping, since OpenStack wasn't used heavily in the beginning. It has now happened 3 times since that initial reboot of the OpenStack controllers. The first time it happened I raised the ulimits and the sysctl fs.file-max, then rebooted each controller one at a time. As stated previously, even after the rebooting it popped back up 2 more times. As of right now I'm still within the initial 3 days since the last reboot.

Reproduce (haven't tested, since I'm limited on time)
In my environment: deploy 3 controllers in VMware but keep the RAM too low. Once the controllers start swapping, power off one controller at a time and increase RAM/CPU to the needed amount. After doing this, once OpenStack has run with fairly small usage for 3-7(?) days, you might see "Too many open files" pretty much everywhere (I use the memcached container to check whether it has happened again). Changing ulimits and fs.file-max doesn't help. Rebooting one controller at a time fixes it, and then it should happen again.

My theory is that something is set during deployment/bootstrap that isn't persisting through reboots, probably an OS or Docker daemon setting. I would rerun deploy or redo the controllers, but importantly this cluster is being used to teach cybersecurity and scientific computing labs.
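
For reference, this is roughly how I sanity-check the effective limits on a running container process (a sketch from my setup; the container name and paths may differ elsewhere):

[root@ctl-os1 ~]# docker exec memcached sh -c 'ulimit -n'
[root@ctl-os1 ~]# grep 'open files' /proc/$(pidof memcached)/limits
[root@ctl-os1 ~]# sysctl fs.file-max

The first shows the per-process soft limit inside the container, the second the limits of the running memcached process as seen from the host, and the third the system-wide file handle ceiling.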

Thanks for the help!

[root@ctl-os1 ~]# cat /etc/os-release
NAME="CentOS Linux"
VERSION="8 (Core)"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="8"
PLATFORM_ID="platform:el8"
PRETTY_NAME="CentOS Linux 8 (Core)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:centos:centos:8"
HOME_URL="https://www.centos.org/"
BUG_REPORT_URL="https://bugs.centos.org/"

CENTOS_MANTISBT_PROJECT="CentOS-8"
CENTOS_MANTISBT_PROJECT_VERSION="8"
REDHAT_SUPPORT_PRODUCT="centos"
REDHAT_SUPPORT_PRODUCT_VERSION="8"

[root@ctl-os1 ~]# uname -a
Linux ctl-os1 4.18.0-193.14.2.el8_2.x86_64 #1 SMP Sun Jul 26 03:54:29 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

[root@ctl-os1 ~]# docker version
Client: Docker Engine - Community
 Version: 19.03.12
 API version: 1.40
 Go version: go1.13.10
 Git commit: 48a66213fe
 Built: Mon Jun 22 15:46:54 2020
 OS/Arch: linux/amd64
 Experimental: false

Server: Docker Engine - Community
 Engine:
  Version: 19.03.12
  API version: 1.40 (minimum version 1.12)
  Go version: go1.13.10
  Git commit: 48a66213fe
  Built: Mon Jun 22 15:45:28 2020
  OS/Arch: linux/amd64
  Experimental: false
 containerd:
  Version: 1.2.13
  GitCommit: 7ad184331fa3e55e52b890ea95e65ba581ae3429
 runc:
  Version: 1.0.0-rc10
  GitCommit: dc9208a3303feef5b3839f4323d9beb36df0a9dd
 docker-init:
  Version: 0.18.0
  GitCommit: fec3683

Kolla-Ansible version (from pip freeze):
kolla-ansible==10.1.0

Docker install type: source
Docker distribution: centos
Official Images

I don't think inventory or globals.yml are relevant.

Revision history for this message
Radosław Piliszek (yoctozepto) wrote :

Try running deploy again indeed. It is largely idempotent. I have never seen such an issue.

Revision history for this message
Mark Goddard (mgoddard) wrote :

Could you provide some logs showing the context of the error messages?

Revision history for this message
Mitchell Walls (mitchwalls23) wrote :

This is what I see when I check whether the issue is happening; I use `docker logs memcached`. I see some variation of this across all of the Docker containers.

Just making sure, did you happen to deploy on CentOS 8? Just wondering if it is related to that.

Too many open connections
accept4(): Too many open files
Too many open connections
accept4(): Too many open files
Too many open connections
accept4(): Too many open files
Too many open connections
accept4(): Too many open files
Too many open connections
getpeername: Transport endpoint is not connected
Failed to write, and not due to blocking: Broken pipe
accept4(): Too many open files
Too many open connections
accept4(): Too many open files
Too many open connections
accept4(): Too many open files
Too many open connections

Here is a Keystone example from when that starts to happen as well.

keystone-apache-public-error.log:2020-11-05 11:25:52.261790 2020-11-05 11:25:52.259 40 ERROR oslo_db.sqlalchemy.engines CR.CR_SERVER_LOST, "Lost connection to MySQL server during query")
keystone-apache-public-error.log:2020-11-05 11:25:52.261793 2020-11-05 11:25:52.259 40 ERROR oslo_db.sqlalchemy.engines pymysql.err.OperationalError: (2013, 'Lost connection to MySQL server during query')
keystone-apache-public-error.log:2020-11-05 11:25:52.261794 2020-11-05 11:25:52.259 40 ERROR oslo_db.sqlalchemy.engines
keystone-apache-public-error.log:2020-11-05 11:25:52.261796 2020-11-05 11:25:52.259 40 ERROR oslo_db.sqlalchemy.engines The above exception was the direct cause of the following exception:
keystone-apache-public-error.log:2020-11-05 11:25:52.261798 2020-11-05 11:25:52.259 40 ERROR oslo_db.sqlalchemy.engines
keystone-apache-public-error.log:2020-11-05 11:25:52.261799 2020-11-05 11:25:52.259 40 ERROR oslo_db.sqlalchemy.engines Traceback (most recent call last):
keystone-apache-public-error.log:2020-11-05 11:25:52.261801 2020-11-05 11:25:52.259 40 ERROR oslo_db.sqlalchemy.engines File "/var/lib/kolla/venv/lib/python3.6/site-packages/oslo_db/sqlalchemy/engines.py", line 73, in _connect_ping_listener
keystone-apache-public-error.log:2020-11-05 11:25:52.261803 2020-11-05 11:25:52.259 40 ERROR oslo_db.sqlalchemy.engines connection.scalar(select([1]))
keystone-apache-public-error.log:2020-11-05 11:25:52.261805 2020-11-05 11:25:52.259 40 ERROR oslo_db.sqlalchemy.engines File "/var/lib/kolla/venv/lib64/python3.6/site-packages/sqlalchemy/engine/base.py", line 914, in scalar
keystone-apache-public-error.log:2020-11-05 11:25:52.261807 2020-11-05 11:25:52.259 40 ERROR oslo_db.sqlalchemy.engines return self.execute(object_, *multiparams, **params).scalar()
keystone-apache-public-error.log:2020-11-05 11:25:52.261809 2020-11-05 11:25:52.259 40 ERROR oslo_db.sqlalchemy.engines File "/var/lib/kolla/venv/lib64/python3.6/site-packages/sqlalchemy/engine/base.py", line 984, in execute
keystone-apache-public-error.log:2020-11-05 11:25:52.261813 2020-11-05 11:25:52.259 40 ERROR oslo_db.sqlalchemy.engines return meth(self, multiparams, params)
keystone-apache-public-error.log:2020-11-05 11:25:52.261814 2020-11-05 11:25:52.259 40 ERROR oslo_db.sqlalchemy.engines File "/var/lib/kolla/venv/lib64/python3.6/...


Revision history for this message
Mitchell Walls (mitchwalls23) wrote :

Curious, do I need to rerun deploy each time I have to reboot, or is that something you typically don't see a need for? FWIW, all of these are new and fresh CentOS 8 servers with little to no modification.

Revision history for this message
Radosław Piliszek (yoctozepto) wrote :

No, single member reboots should not need any extra intervention.

Revision history for this message
Mitchell Walls (mitchwalls23) wrote :

I have rerun deploy on all controllers. I will go without rebooting for 2 weeks and then get back to you. As a side note, I have checked everything related to limits system-wide, from system processes to users to containers, and they all seem to have the correct values set. I'm guessing this is some type of CentOS 8 bug. I'll do some searching and get back after the 2 weeks.

Revision history for this message
Mitchell Walls (mitchwalls23) wrote :

Rerunning deploy did not fix the issue; it popped back up today. I'm out of ideas. This is a production system, so I'm unsure what I can do to fix it.

Revision history for this message
Mitchell Walls (mitchwalls23) wrote :

I am going to just wipe one controller at a time and start from a fresh install again. I cannot figure out any reason this is happening; I even ran an upgrade and it didn't help. If it happens again, I will post back my globals.yml. I plan on not doing anything to the fresh CentOS 8 controller installs beyond configuring the network and the prerequisites for Kolla. I'm hoping to fix this before the end of the year because we have to be in completely stable production for an HPC grant.

Revision history for this message
Mark Goddard (mgoddard) wrote :

If you want to debug the issue, I would suggest using the ss and lsof tools in the memcached container.

Although it says too many open files, I would guess it might actually mean sockets:

Too many open connections
accept4(): Too many open files
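
Something along these lines, run from inside the container (an untested sketch; adjust to whatever the image provides):

(memcached)[root@ctl-os1 /]# ss -s
(memcached)[root@ctl-os1 /]# ss -tn state established | wc -l
(memcached)[root@ctl-os1 /]# lsof -nP -p $(pidof memcached) | grep -c TCP

ss -s gives a per-protocol socket summary, the second command counts established TCP connections, and the lsof line counts the TCP socket fds held by the memcached process.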

Revision history for this message
Radosław Piliszek (yoctozepto) wrote :

Well, sockets are files. I think all the issues I've seen about 'too many open files' were really about too many open sockets, as sockets are more likely to be kept open for long periods of time.
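
You can see that directly in /proc, where socket fds show up as symlinks like socket:[inode]. Counting them from inside the container (a sketch):

(memcached)[root@ctl-os1 /]# ls -l /proc/$(pidof memcached)/fd | grep -c socket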

Revision history for this message
Mitchell Walls (mitchwalls23) wrote :

This seems very high; am I wrong?

(memcached)[root@ctl-os1 /]# lsof | grep memcached | wc -l
44546

Revision history for this message
Mitchell Walls (mitchwalls23) wrote :

The one above is currently crashed due to that many open files. Here is a freshly rebooted one.

(memcached)[root@ctl-os2 /]# lsof | grep memcached | wc -l
2706
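
As a cross-check, counting the fd entries directly under /proc gives the exact per-process number, without grep also matching unrelated lines that merely mention memcached:

(memcached)[root@ctl-os2 /]# ls /proc/$(pidof memcached)/fd | wc -l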

Revision history for this message
Mitchell Walls (mitchwalls23) wrote :

Restarting the neutron_server container drops the open file count from tens of thousands to the mid-to-high 2000s. I do not see the advanced memcached pool setting in neutron.conf yet, so maybe the fix isn't in the pip release yet?
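
This is how I'm checking for the setting, assuming the standard path for the kolla-ansible generated config on the controller:

[root@ctl-os1 ~]# grep memcache_use_advanced_pool /etc/kolla/neutron-server/neutron.conf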

Revision history for this message
Mitchell Walls (mitchwalls23) wrote :

I added the advanced memcached pool config value (set to true) to /etc/kolla/config/neutron.conf, then reconfigured the controllers with the neutron tag. I will report back once I check `lsof | grep memcached | wc -l` again in a couple of days.

Revision history for this message
Mitchell Walls (mitchwalls23) wrote :

This is definitely fixed by the advanced memcached pool setting in neutron-server. For anyone new to Kolla, this is how to mitigate it until Neutron is fixed:

1) Create/edit the file /etc/kolla/config/neutron.conf
2) Add the following to that file:
[keystone_authtoken]
memcache_use_advanced_pool = True
3) Run kolla-ansible -i inventory -t neutron --limit controller1,controller2,controller3 reconfigure
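
To verify it held, I recheck the count every few days; it should stay near the post-reboot baseline (mid-to-high 2000s in my environment) instead of climbing toward tens of thousands:

(memcached)[root@ctl-os1 /]# lsof | grep memcached | wc -l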

Revision history for this message
Radosław Piliszek (yoctozepto) wrote :

Thank you for the feedback, Mitchell! Much appreciated. And I'm glad you got it fixed.

If you use the patched Kolla-Ansible, you will get the same effect without an override, but your solution will work for anyone.
