Cannot scale to more than 116 VMs and subnets

Bug #1718266 reported by Sai Sindhur Malleni
This bug affects 1 person
Affects: tripleo
Status: Fix Released
Importance: High
Assigned to: Sai Sindhur Malleni
Milestone: queens-1

Bug Description

Description:
During scale testing, we are unable to create more than 116 subnets and VMs due to a kernel parameter limit that affects the dnsmasq processes.

We are using an OpenStack setup with 1 controller and 11 compute nodes. We are executing the following use case:

1. Create a network
2. Create a subnet
3. Boot an instance on this subnet

We do the above sequence of operations 500 times at a concurrency of 8.
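A minimal sketch of that loop using the openstack CLI is below; the resource names, image (cirros) and flavor (m1.tiny) are illustrative placeholders, not the actual test harness we used:

#!/bin/bash
# Sketch: create a network and a subnet, then boot one VM on that subnet.
create_one() {
  i=$1
  openstack network create "scale-net-${i}"
  openstack subnet create "scale-subnet-${i}" \
    --network "scale-net-${i}" \
    --subnet-range "10.$(( i / 256 )).$(( i % 256 )).0/24"
  openstack server create "scale-vm-${i}" \
    --image cirros --flavor m1.tiny \
    --network "scale-net-${i}" --wait
}
export -f create_one
# 500 iterations at a concurrency of 8, matching the test parameters above.
seq 1 500 | xargs -P 8 -I{} bash -c 'create_one {}'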

Even after several attempts we are unable to scale past 116 VMs (each VM is on its own subnet); 116 seems to be a hard limit. The port never transitions to ACTIVE: even though VIF plugging completes, the DHCP provisioning block is never cleared. Since ML2/ODL relies on the neutron DHCP agent for DHCP, we looked at the DHCP agent logs and see:

2017-09-13 21:45:38.124 91663 ERROR neutron.agent.dhcp.agent [req-288477aa-318a-40c3-954e-dd6fc98c6c1b - - - - -] Unable to enable dhcp for bb6cdb16-72c0-4cc4-a316-69ebcd7633b2.: ProcessExecutionError: Exit code: 5; Stdin: ; Stdout: ; Stderr:
dnsmasq: failed to create inotify: Too many open files
2017-09-13 21:45:38.124 91663 ERROR neutron.agent.dhcp.agent Traceback (most recent call last):
2017-09-13 21:45:38.124 91663 ERROR neutron.agent.dhcp.agent File "/usr/lib/python2.7/site-packages/neutron/agent/dhcp/agent.py", line 142, in call_driver
2017-09-13 21:45:38.124 91663 ERROR neutron.agent.dhcp.agent getattr(driver, action)(**action_kwargs)
2017-09-13 21:45:38.124 91663 ERROR neutron.agent.dhcp.agent File "/usr/lib/python2.7/site-packages/neutron/agent/linux/dhcp.py", line 218, in enable
2017-09-13 21:45:38.124 91663 ERROR neutron.agent.dhcp.agent self.spawn_process()
2017-09-13 21:45:38.124 91663 ERROR neutron.agent.dhcp.agent File "/usr/lib/python2.7/site-packages/neutron/agent/linux/dhcp.py", line 439, in spawn_process
2017-09-13 21:45:38.124 91663 ERROR neutron.agent.dhcp.agent self._spawn_or_reload_process(reload_with_HUP=False)
2017-09-13 21:45:38.124 91663 ERROR neutron.agent.dhcp.agent File "/usr/lib/python2.7/site-packages/neutron/agent/linux/dhcp.py", line 453, in _spawn_or_reload_process
2017-09-13 21:45:38.124 91663 ERROR neutron.agent.dhcp.agent pm.enable(reload_cfg=reload_with_HUP)
2017-09-13 21:45:38.124 91663 ERROR neutron.agent.dhcp.agent File "/usr/lib/python2.7/site-packages/neutron/agent/linux/external_process.py", line 96, in enable
2017-09-13 21:45:38.124 91663 ERROR neutron.agent.dhcp.agent run_as_root=self.run_as_root)
2017-09-13 21:45:38.124 91663 ERROR neutron.agent.dhcp.agent File "/usr/lib/python2.7/site-packages/neutron/agent/linux/ip_lib.py", line 903, in execute
2017-09-13 21:45:38.124 91663 ERROR neutron.agent.dhcp.agent log_fail_as_error=log_fail_as_error, **kwargs)
2017-09-13 21:45:38.124 91663 ERROR neutron.agent.dhcp.agent File "/usr/lib/python2.7/site-packages/neutron/agent/linux/utils.py", line 151, in execute
2017-09-13 21:45:38.124 91663 ERROR neutron.agent.dhcp.agent raise ProcessExecutionError(msg, returncode=returncode)
2017-09-13 21:45:38.124 91663 ERROR neutron.agent.dhcp.agent ProcessExecutionError: Exit code: 5; Stdin: ; Stdout: ; Stderr:
2017-09-13 21:45:38.124 91663 ERROR neutron.agent.dhcp.agent dnsmasq: failed to create inotify: Too many open files
2017-09-13 21:45:38.124 91663 ERROR neutron.agent.dhcp.agent
2017-09-13 21:45:38.124 91663 ERROR neutron.agent.dhcp.agent
2017-09-13 21:45:38.260 91663 ERROR neutron.agent.linux.utils [req-d0ade748-22ea-4a45-ba10-277d45f20981 - - - - -] Exit code: 5; Stdin: ; Stdout: ; Stderr:
dnsmasq: failed to create inotify: Too many open files

Based on https://bugzilla.redhat.com/show_bug.cgi?id=1474515#c8 we are hitting the limit on fs.inotify.max_user_instances, which defaults to 128. After raising this limit (sysctl -w fs.inotify.max_user_instances=256, and persisting the setting in /etc/sysctl.conf) we are able to scale past the 116 subnet/VM limit.
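For reference, one way to confirm the limit is being hit and to raise it persistently is shown below (run as root on the controller; the /proc-based count is only an approximation of how many inotify instances are open system-wide):

# Current limit and number of inotify instances currently in use.
sysctl fs.inotify.max_user_instances
find /proc/*/fd -lname 'anon_inode:inotify' 2>/dev/null | wc -l

# Raise the limit at runtime, then persist it across reboots.
sysctl -w fs.inotify.max_user_instances=256
echo 'fs.inotify.max_user_instances = 256' >> /etc/sysctl.conf
sysctl -p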

Environment:
Pike
RHEL 7.4

Additional info:

Additional details can be found in:
https://bugzilla.redhat.com/show_bug.cgi?id=1474515#
https://bugzilla.redhat.com/show_bug.cgi?id=1491505

Changed in tripleo:
milestone: none → queens-1
importance: Undecided → High
status: New → Triaged
tags: added: networking pike-backport-potential
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-heat-templates (master)

Fix proposed to branch: master
Review: https://review.openstack.org/505381

Changed in tripleo:
assignee: nobody → Sai Sindhur Malleni (smalleni)
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-heat-templates (master)

Reviewed: https://review.openstack.org/505381
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=d2d0c3ff00de9b62382193d942239d543aa9499f
Submitter: Jenkins
Branch: master

commit d2d0c3ff00de9b62382193d942239d543aa9499f
Author: Sai Sindhur Malleni <email address hidden>
Date: Tue Sep 19 15:12:35 2017 -0400

    Bump fs.inotify.max_user_instances for scale

    Since each dnsmasq process consumes one inotify socket, the default
    value of fs.inotify.max_user_instances, which is 128, lets us scale to
    only around 116 neutron subnets (a few other sockets are used by other
    processes on the system). Since we need to provide better defaults,
    this patch proposes to bump this value to 1024 by default, while giving
    the user a way to change it. Based on
    https://unix.stackexchange.com/a/13757 each inotify watch takes 1KB of
    memory and we have fs.inotify.max_user_watches set to 8192 by default.
    This means that even in the worst case we won't be using more than 8MB
    of memory. Bumping the fs.inotify.max_user_instances value to 1024 is
    safe because there is fs.inotify.max_user_watches which caps the total
    number of files that can be watched by all the inotify instances a user
    has.

    Related Bugs:
    https://bugzilla.redhat.com/show_bug.cgi?id=1474515
    https://bugzilla.redhat.com/show_bug.cgi?id=1491505

    Change-Id: I39664312bf6cf06f1e1ca2e86ffd86fb9a4582ad
    Closes-Bug: 1718266

Changed in tripleo:
status: In Progress → Fix Released
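With the fix merged the deployed default is 1024, and deployments that need even more headroom can still override it. A sketch of one way to do that, assuming the ExtraSysctlSettings parameter in tripleo-heat-templates (the exact parameter exposed by this patch may differ; the file name and value here are illustrative):

# Hypothetical environment file overriding the sysctl via ExtraSysctlSettings.
cat > inotify-scale.yaml <<'EOF'
parameter_defaults:
  ExtraSysctlSettings:
    fs.inotify.max_user_instances:
      value: 2048
EOF
# Include it alongside the usual arguments when deploying the overcloud.
openstack overcloud deploy --templates -e inotify-scale.yaml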
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-heat-templates (stable/pike)

Fix proposed to branch: stable/pike
Review: https://review.openstack.org/509433

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-heat-templates (stable/ocata)

Fix proposed to branch: stable/ocata
Review: https://review.openstack.org/509514

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-heat-templates (stable/newton)

Fix proposed to branch: stable/newton
Review: https://review.openstack.org/509521

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-heat-templates (stable/pike)

Reviewed: https://review.openstack.org/509433
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=6a7428898f7fa7a0d46e1de60ca57e7ec6ce14c9
Submitter: Jenkins
Branch: stable/pike

commit 6a7428898f7fa7a0d46e1de60ca57e7ec6ce14c9
Author: Sai Sindhur Malleni <email address hidden>
Date: Tue Sep 19 15:12:35 2017 -0400

    Bump fs.inotify.max_user_instances for scale

    Since each dnsmasq process consumes one inotify socket, the default
    value of fs.inotify.max_user_instances, which is 128, lets us scale to
    only around 116 neutron subnets (a few other sockets are used by other
    processes on the system). Since we need to provide better defaults,
    this patch proposes to bump this value to 1024 by default, while giving
    the user a way to change it. Based on
    https://unix.stackexchange.com/a/13757 each inotify watch takes 1KB of
    memory and we have fs.inotify.max_user_watches set to 8192 by default.
    This means that even in the worst case we won't be using more than 8MB
    of memory. Bumping the fs.inotify.max_user_instances value to 1024 is
    safe because there is fs.inotify.max_user_watches which caps the total
    number of files that can be watched by all the inotify instances a user
    has.

    Related Bugs:
    https://bugzilla.redhat.com/show_bug.cgi?id=1474515
    https://bugzilla.redhat.com/show_bug.cgi?id=1491505

    Change-Id: I39664312bf6cf06f1e1ca2e86ffd86fb9a4582ad
    Closes-Bug: 1718266
    (cherry picked from commit d2d0c3ff00de9b62382193d942239d543aa9499f)

tags: added: in-stable-pike
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-heat-templates (stable/ocata)

Reviewed: https://review.openstack.org/509514
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=1db252b13cfc0889120f870adb64e86a89bbc069
Submitter: Jenkins
Branch: stable/ocata

commit 1db252b13cfc0889120f870adb64e86a89bbc069
Author: Sai Sindhur Malleni <email address hidden>
Date: Tue Sep 19 15:12:35 2017 -0400

    Bump fs.inotify.max_user_instances for scale

    Since each dnsmasq process consumes one inotify socket, the default
    value of fs.inotify.max_user_instances, which is 128, lets us scale to
    only around 116 neutron subnets (a few other sockets are used by other
    processes on the system). Since we need to provide better defaults,
    this patch proposes to bump this value to 1024 by default, while giving
    the user a way to change it. Based on
    https://unix.stackexchange.com/a/13757 each inotify watch takes 1KB of
    memory and we have fs.inotify.max_user_watches set to 8192 by default.
    This means that even in the worst case we won't be using more than 8MB
    of memory. Bumping the fs.inotify.max_user_instances value to 1024 is
    safe because there is fs.inotify.max_user_watches which caps the total
    number of files that can be watched by all the inotify instances a user
    has.

    Related Bugs:
    https://bugzilla.redhat.com/show_bug.cgi?id=1474515
    https://bugzilla.redhat.com/show_bug.cgi?id=1491505

    Change-Id: I39664312bf6cf06f1e1ca2e86ffd86fb9a4582ad
    Closes-Bug: 1718266
    (cherry picked from commit d2d0c3ff00de9b62382193d942239d543aa9499f)

tags: added: in-stable-ocata
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/tripleo-heat-templates 6.2.3

This issue was fixed in the openstack/tripleo-heat-templates 6.2.3 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/tripleo-heat-templates 7.0.2

This issue was fixed in the openstack/tripleo-heat-templates 7.0.2 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/tripleo-heat-templates 8.0.0.0b1

This issue was fixed in the openstack/tripleo-heat-templates 8.0.0.0b1 development milestone.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-heat-templates (stable/newton)

Reviewed: https://review.openstack.org/509521
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=9114178c337d1f515991946d2baae0e1f413c1e7
Submitter: Zuul
Branch: stable/newton

commit 9114178c337d1f515991946d2baae0e1f413c1e7
Author: Sai Sindhur Malleni <email address hidden>
Date: Tue Sep 19 15:12:35 2017 -0400

    Bump fs.inotify.max_user_instances for scale

    Since each dnsmasq process consumes one inotify socket, the default
    value of fs.inotify.max_user_instances, which is 128, lets us scale to
    only around 116 neutron subnets (a few other sockets are used by other
    processes on the system). Since we need to provide better defaults,
    this patch proposes to bump this value to 1024 by default, while giving
    the user a way to change it. Based on
    https://unix.stackexchange.com/a/13757 each inotify watch takes 1KB of
    memory and we have fs.inotify.max_user_watches set to 8192 by default.
    This means that even in the worst case we won't be using more than 8MB
    of memory. Bumping the fs.inotify.max_user_instances value to 1024 is
    safe because there is fs.inotify.max_user_watches which caps the total
    number of files that can be watched by all the inotify instances a user
    has.

    Related Bugs:
    https://bugzilla.redhat.com/show_bug.cgi?id=1474515
    https://bugzilla.redhat.com/show_bug.cgi?id=1491505

    Change-Id: I39664312bf6cf06f1e1ca2e86ffd86fb9a4582ad
    Closes-Bug: 1718266
    (cherry picked from commit d2d0c3ff00de9b62382193d942239d543aa9499f)

tags: added: in-stable-newton
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/tripleo-heat-templates 5.3.4

This issue was fixed in the openstack/tripleo-heat-templates 5.3.4 release.
