Too many open files when large number of routers on a host

Bug #1737866 reported by Xav Paice on 2017-12-13
This bug affects 5 people
Affects                                Importance   Assigned to
OpenStack neutron-openvswitch charm    Undecided    Unassigned
Ubuntu Cloud Archive                   Undecided    Unassigned
  Mitaka                               Medium       James Page
  Ocata                                Medium       James Page
  Pike                                 Medium       James Page
openvswitch (Ubuntu)                   Medium       Unassigned
  Xenial                               Medium       James Page
  Artful                               Medium       Unassigned
  Bionic                               Medium       Unassigned

Bug Description

[Impact]
OpenStack environments running large numbers of routers and dhcp agents on a single host can hit the NOFILES limit in OVS, resulting in broken operation of virtual networking.

[Test Case]
Deploy openstack environment; create large number of virtual networks and routers.
OVS will start to error with 'Too many open files'

[Regression Potential]
Minimal - we're just increasing the NOFILE limit via the systemd service definition.

[Original Bug Report]
When there are a large number of routers and dhcp agents on a host, we see a syslog error repeated:

"hostname ovs-vswitchd: ovs|1762125|netlink_socket|ERR|fcntl: Too many open files"

If I check the number of filehandles owned by the pid for "ovs-vswitchd unix:/var/run/openvswitch/db.sock" I see close to/at 65535 files.

If I then run the following to double the limit, the count (in our case) rises to >80000:

prlimit -p $pid --nofile=131070

We need to be able to:
- monitor via nrpe whether the process is running short on filehandles
- configure the limit so we have the option to not run out.

Currently, if I restart the process, we'll lose this setting.

Needless to say, openvswitch running out of filehandles causes all manner of problems for services which use it.
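
The monitoring request above could be approached with a small NRPE-style check along the following lines. This is only a sketch under stated assumptions: the function name check_fd_usage and the 80%/90% thresholds are hypothetical and not part of any shipped charm or package. It counts the entries in /proc/<pid>/fd and compares them with the soft "Max open files" value in /proc/<pid>/limits, returning the usual Nagios plugin exit codes.

```shell
#!/bin/sh
# Sketch of an NRPE-style check (hypothetical, not a shipped plugin):
# compare a process's open file descriptors with its soft NOFILE limit.

check_fd_usage() {
    pid=$1
    warn=${2:-80}   # assumed default: warn at 80% of the soft limit
    crit=${3:-90}   # assumed default: critical at 90% of the soft limit

    # Reading another user's /proc/<pid>/fd normally requires root.
    if [ ! -r "/proc/$pid/fd" ] || [ ! -r "/proc/$pid/limits" ]; then
        echo "UNKNOWN: cannot read /proc/$pid (no such process, or not permitted)"
        return 3
    fi

    # One directory entry per open file descriptor.
    used=$(ls "/proc/$pid/fd" | wc -l)
    # In the "Max open files" line, the fourth field is the soft limit.
    limit=$(awk '/^Max open files/ {print $4}' "/proc/$pid/limits")

    # Reject an empty, non-numeric, or zero limit before dividing.
    case $limit in
        ''|*[!0-9]*|0)
            echo "UNKNOWN: could not parse soft limit for pid $pid"
            return 3 ;;
    esac

    pct=$((used * 100 / limit))
    msg="$used of $limit file descriptors in use (${pct}%) for pid $pid"

    if [ "$pct" -ge "$crit" ]; then
        echo "CRITICAL: $msg"; return 2
    elif [ "$pct" -ge "$warn" ]; then
        echo "WARNING: $msg"; return 1
    fi
    echo "OK: $msg"
    return 0
}

# Example invocation on an OVS host (as root):
#   check_fd_usage "$(pidof ovs-vswitchd)"
```

Reading /proc directly keeps the check free of extra dependencies; the percentage thresholds would need tuning for the environment.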

James Page (james-page) wrote :

This needs fixing in the systemd units for ovs; raising packaging tasks.

Changed in openvswitch (Ubuntu):
status: New → Triaged
milestone: none → ubuntu-18.04
importance: Undecided → Medium
Changed in charm-neutron-openvswitch:
status: New → Invalid
Changed in openvswitch (Ubuntu Artful):
status: New → Triaged
Changed in openvswitch (Ubuntu Xenial):
status: New → Triaged
Changed in openvswitch (Ubuntu Artful):
importance: Undecided → Medium
Changed in openvswitch (Ubuntu Xenial):
importance: Undecided → Medium
Alvaro Uría (aluria) wrote :

Hi, this is affecting a production environment running xenial+ocata. Could we have an ETA for the systemd config to land in the Xenial packages?

Thanks much!

James Page (james-page) wrote :

I've uploaded fixes to bionic; will do SRUs for Xenial->Artful once that's in the release pocket and the current 2.8.1 ovs stable release is through.

Changed in openvswitch (Ubuntu Bionic):
status: Triaged → Fix Committed
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package openvswitch - 2.8.1-0ubuntu3

---------------
openvswitch (2.8.1-0ubuntu3) bionic; urgency=medium

  * Updates to systemd configuration:
    - Move to distinct units for ovsdb-server and ovs-vswitchd.
  * Drop obsolete upstart configuration file.
  * Bump nofiles to 1048576 for ovs daemons (LP: #1737866).
  * d/control: Bump minimum debhelper version to 10, drop BD on
    dh-systemd.
  * d/p/dpif-kernel-gre-mtu-workaround.patch,
    d/p/dpif-netlink-rtnl-Use-65000-instead-of-65535-as-tunnel-MTU.patch:
    Cherry pick in-flight fixes for workaround to correctly set MTU
    of GRE devices via netlink (LP: #1742505).

 -- James Page <email address hidden> Thu, 18 Jan 2018 15:26:41 +0200

Changed in openvswitch (Ubuntu Bionic):
status: Fix Committed → Fix Released
Xav Paice (xavpaice) wrote :

Any update on when we might land an SRU for Xenial?

Xav Paice (xavpaice) wrote :

Subscribed field-high because we have an active environment (more?) that is affected by this using Xenial/Ocata, and we really need that SRU released.

James Page (james-page) on 2018-08-17
Changed in openvswitch (Ubuntu Artful):
status: Triaged → Won't Fix
James Page (james-page) wrote :

As restarts of OVS are service impacting, I was holding off on uploading this fix until we had point releases for OVS - which we now do.

Changed in openvswitch (Ubuntu Xenial):
status: Triaged → In Progress
Changed in cloud-archive:
status: New → Fix Released
description: updated
Changed in openvswitch (Ubuntu Xenial):
assignee: nobody → James Page (james-page)

Hello Xav, or anyone else affected,

Accepted openvswitch into ocata-proposed. The package will build now and be available in the Ubuntu Cloud Archive in a few hours, and then in the -proposed repository.

Please help us by testing this new package. To enable the -proposed repository:

  sudo add-apt-repository cloud-archive:ocata-proposed
  sudo apt-get update

Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-ocata-needed to verification-ocata-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-ocata-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

tags: added: verification-ocata-needed
James Page (james-page) wrote :

uploaded to xenial for SRU team review

Robie Basak (racb) wrote :

> Subscribed field-high because we have an active environment (more?) that are are affected by this using Xenial/Ocata, and we really need that SRU released.

Are you aware that systemd supports drop-in overrides for individual configuration items? So until this SRU is released, you could work around on a production system by dropping in the right file in /etc that will make exactly the same functional change the SRU will. See systemd.unit(5) for details.
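
As a rough illustration of such a drop-in, the sketch below writes a [Service] override that sets LimitNOFILE. The unit name openvswitch-nonetwork.service and the value 1048576 are taken from the Xenial verification output later in this report; the helper name write_nofile_dropin and the file name nofile.conf are hypothetical choices, and the unit to override may differ on other releases.

```shell
#!/bin/sh
# Sketch: create a systemd drop-in that raises LimitNOFILE for one unit.
# Helper name and drop-in file name are illustrative assumptions.

write_nofile_dropin() {
    unit=$1                          # e.g. openvswitch-nonetwork.service
    limit=$2                         # e.g. 1048576
    root=${3:-/etc/systemd/system}   # overridable, to try it in a scratch dir

    dir="$root/$unit.d"
    mkdir -p "$dir" || return 1
    # A drop-in only needs the section and the setting being overridden.
    printf '[Service]\nLimitNOFILE=%s\n' "$limit" > "$dir/nofile.conf"
}

# On a real host (as root), followed by a reload. The new limit only
# applies once the daemons are restarted, which is service impacting:
#   write_nofile_dropin openvswitch-nonetwork.service 1048576
#   systemctl daemon-reload
```

After the restart, the effective limit can be confirmed in /proc/<pid>/limits for ovs-vswitchd and ovsdb-server.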

Robie Basak (racb) wrote :

Hello Xav, or anyone else affected,

Accepted openvswitch into xenial-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/openvswitch/2.5.5-0ubuntu0.16.04.1 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested and change the tag from verification-needed-xenial to verification-done-xenial. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-xenial. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

Changed in openvswitch (Ubuntu Xenial):
status: In Progress → Fix Committed
tags: added: verification-needed verification-needed-xenial
tags: added: sts-sru-needed
Hua Zhang (zhhuabj) wrote :

I have verified openvswitch-switch=2.5.5-0ubuntu0.16.04.1; it looks good to me.

root@16.04:/tmp/ovs/openvswitch-2.5.5$ grep -r '1048576' ./debian/
./debian/changelog: * Bump nofiles to 1048576 for ovs daemons (LP: #1737866).
./debian/openvswitch-switch.openvswitch-nonetwork.service:LimitNOFILE=1048576

root@16.04:~$ grep -r '1048576' /lib/systemd/system/openvswitch*
/lib/systemd/system/openvswitch-nonetwork.service:LimitNOFILE=1048576

root@node1:~# grep -r '1048576' /proc/`pidof ovsdb-server`/limits
Max open files 1048576 1048576 files

root@node1:~# grep -r '1048576' /proc/`pidof ovs-vswitchd`/limits
Max open files 1048576 1048576 files

Hua Zhang (zhhuabj) on 2018-09-19
tags: added: verification-done-xenial
removed: verification-needed-xenial