neutron

applying iptables rules takes too long when large scale deployment

Bug #1352826 reported by Chen Ya Qin on 2014-08-05

This bug affects 6 people

Affects		Status	Importance	Assigned to	Milestone
	neutron	Expired	Medium	Unassigned

Bug Description

Applying iptables rules (in neutron/agent/linux/iptables_manager.py, _apply_synchronized and _modify_rules) takes more than half an hour (36 minutes in my environment) when the cloud has more than 880 active VMs.
As a result, bringing a newly created port up while booting an instance takes very long, and if vif_plugging_is_fatal is true and vif_plugging_timeout is not large enough, booting fails.
The optimization of _modify_rules in patch https://review.openstack.org/#/c/77549/ did shorten the time, but it is still too long (17 minutes with more than 880 active VMs in my environment).
Further optimization of _modify_rules is needed for large-scale deployments.

Revision history for this message

Maru Newby (maru) wrote on 2014-08-05:

Can you please provide more details about how to reproduce the problem? The number of active VMs alone is not sufficient. The deployment configuration (including hardware), the security groups that are configured, etc, are all relevant.

Changed in neutron:
status:	New → Incomplete

Chen Ya Qin (yqchen) wrote on 2014-08-06:

We have 11 hosts: one controller and 10 compute nodes. Each host has 32 CPUs.
Each CPU is:
processor : 31
vendor_id : GenuineIntel
cpu family : 6
model : 45
model name : Intel(R) Xeon(R) CPU E5-2650 0 @ 2.00GHz
stepping : 7
cpu MHz : 1200.000
cache size : 20480 KB

meminfo:
MemTotal: 297597496 kB
MemFree: 277073576 kB
Buffers: 172792 kB
Cached: 2384112 kB
SwapCached: 0 kB
Active: 16322372 kB
Inactive: 1370652 kB
Active(anon): 15143856 kB
Inactive(anon): 352 kB
Active(file): 1178516 kB
Inactive(file): 1370300 kB
Unevictable: 329208 kB
Mlocked: 63520 kB
SwapTotal: 4194296 kB
SwapFree: 4194296 kB

The config of security_group in nova.conf:
security_group_api=neutron
# number of security groups per project (default: 10)
quota_security_groups=-1
# number of security rules per security group (default: 20)
quota_security_group_rules=-1

The config of security_group in neutron.conf:
# being used in conjunction with nova security groups and/or metadata service.
# number of security groups allowed per tenant, and minus means unlimited
quota_security_group = -1
# number of security group rules allowed per tenant, and minus means unlimited
quota_security_group_rule = -1

We ran a script that spawns 5 threads; each thread boots 20 instances, deletes 10 instances, pauses for 15 seconds, and then repeats the same operations until the total number of VMs reaches 1250. The result I posted in the bug description was measured after we had already removed 5 hosts from the cloud.
With vif_plugging_is_fatal=true and vif_plugging_timeout=300, booting started to fail consistently partway through the run, so we set vif_plugging_is_fatal=false and continued booting. The exact number of active VMs is now 885 (about 980 VMs in total, including shut-down ones), and there are 978 ports. When I then tried to boot an instance with vif_plugging_is_fatal=true and vif_plugging_timeout=300, booting failed. Stepping into _modify_rules during the boot, I found more than 50000 iptables rules; _modify_rules took more than 30 minutes, so the boot failed.

To reproduce the problem (_modify_rules taking too long), I think you can create thousands of iptables rules in the cloud and then boot an instance on a host with limited capacity, with vif_plugging_is_fatal=true and vif_plugging_timeout=300. In theory the boot will fail with VirtualInterfaceCreateException() because _modify_rules exceeds the timeout.
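
For anyone who wants a feel for the scale without a full cloud, here is a rough Python sketch. It does not touch iptables or Neutron: it only builds synthetic iptables-save-style lines and times a naive "scan the whole list for each rule" lookup. The chain names, the address pattern, and the 50-rules-per-port figure are assumptions, chosen only so that about 1000 ports give roughly the 50000 rules mentioned above.

```python
import time


def make_synthetic_rules(num_ports, rules_per_port=50):
    """Build iptables-save-style lines. Chain names and addresses are made up."""
    lines = []
    for port in range(num_ports):
        chain = "neutron-fake-i%05d" % port  # hypothetical chain name
        for n in range(rules_per_port):
            lines.append(
                "-A %s -s 10.%d.%d.%d/32 -j RETURN"
                % (chain, (port >> 8) & 255, port & 255, n)
            )
    return lines


def time_naive_lookup(lines, sample):
    """Time one full-list scan per sampled rule (the quadratic pattern)."""
    start = time.perf_counter()
    hits = sum(1 for rule in sample if rule in lines)
    return hits, time.perf_counter() - start


if __name__ == "__main__":
    rules = make_synthetic_rules(1000)  # about 50000 lines
    hits, elapsed = time_naive_lookup(rules, rules[::50])
    print("%d lookups over %d lines took %.3f s" % (hits, len(rules), elapsed))
```

Looking up only a sample keeps the run short; looking up every rule multiplies the cost by the number of rules, which is the growth pattern this report complains about.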

description:	updated

Chen Ya Qin (yqchen) on 2014-08-06

Changed in neutron:
status:	Incomplete → New

Eugene Nikanorov (enikanorov) on 2014-08-06

tags:	added: loadimpact

Chen Ya Qin (yqchen) on 2014-08-07

Changed in neutron:
assignee:	nobody → Chen Ya Qin (yqchen)

Chen Ya Qin (yqchen) on 2014-08-07

Changed in neutron:
assignee:	Chen Ya Qin (yqchen) → nobody

Akihiro Motoki (amotoki) wrote on 2014-08-07:

Similar to bug 1314189.

tags:	added: sg-fw
Changed in neutron:
importance:	Undecided → Medium

Qin Zhao (zhaoqin) wrote on 2014-08-20:

I changed the search algorithm on my machine to reduce the complexity of _modify_rules() from O(n^2) to O(n log n), and the loop now completes in several seconds. Before proposing the new code, I hope to commit a unit test for _modify_rules() first in https://bugs.launchpad.net/neutron/+bug/1359072, so that _modify_rules() does not regress in the future.
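
To illustrate the kind of change described here, the following is a toy Python sketch, not the code from the proposed review. It assumes the expensive step is finding where each wanted rule sits in the current list of iptables-save lines; the function names are hypothetical. The first version scans the list once per rule, and the second sorts the lines once and then binary-searches.

```python
import bisect


def find_positions_quadratic(current_lines, wanted_rules):
    """For each wanted rule, scan the whole list: O(n * m)."""
    positions = {}
    for rule in wanted_rules:
        try:
            positions[rule] = current_lines.index(rule)
        except ValueError:
            positions[rule] = -1  # rule is not present
    return positions


def find_positions_sorted(current_lines, wanted_rules):
    """Sort (line, index) pairs once, then binary-search: O((n + m) log n)."""
    indexed = sorted((line, i) for i, line in enumerate(current_lines))
    keys = [line for line, _ in indexed]
    positions = {}
    for rule in wanted_rules:
        k = bisect.bisect_left(keys, rule)
        if k < len(keys) and keys[k] == rule:
            # Ties sort by index, so this is the first occurrence.
            positions[rule] = indexed[k][1]
        else:
            positions[rule] = -1
    return positions
```

Both functions return the same mapping, including for duplicated lines, so the faster one can be checked against the slower one in a unit test before it replaces it.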

OpenStack Infra (hudson-openstack) wrote on 2014-08-20: Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.openstack.org/115719

Changed in neutron:
assignee:	nobody → Qin Zhao (zhaoqin)
status:	New → In Progress

Qin Zhao (zhaoqin) wrote on 2014-08-25:

Hi Chen Ya Qin, I proposed a new implementation of _modify_rules() to fix your problem. Please help validate this code change and provide your feedback. Although I do not think the change will be accepted in the short term, your testing and feedback will be very important input.

Hua Zhang (zhhuabj) wrote on 2014-08-27:

A quick and dirty test by my colleague @nobuto shows that the time to apply iptables rules drops from 900 seconds to 11 seconds with patch set 5 of https://review.openstack.org/#/c/115719/ applied, with 3,400 instances running and 80,000 lines in iptables-save.

Nobuto Murata (nobuto) wrote on 2014-08-27:

Correction: I used patch set 2 for the quick test. I haven't had a chance to try patch set 5 yet.

Qin Zhao (zhaoqin) wrote on 2014-08-27:

Zhang Hua and Nobuto, thanks for your testing! Patch set 2 is enough to improve the performance, and it is safe. The complexity of patch set 5 is almost the same as that of patch set 2; in fact, I have not carefully measured which one is faster. Patch set 5 is not fully tested yet. I hope more stackers can try it and give me feedback.
Thanks again!

Miguel Angel Ajo (mangelajo) wrote on 2014-10-17:

#10

Are you using ipset?

That may help reduce the size of the iptables rule set.
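
A small Python sketch of why ipset can shrink the rule set, under stated assumptions: it only builds rule strings and does not call iptables or ipset. The chain name, set name, and port are hypothetical, and the rule text Neutron actually generates may differ. The idea is that one rule per member address can be replaced by a single rule that matches against a named set.

```python
def rules_without_ipset(chain, member_ips, dport):
    """One iptables rule per member address of the remote group."""
    return [
        "-A %s -s %s/32 -p tcp -m tcp --dport %d -j RETURN" % (chain, ip, dport)
        for ip in member_ips
    ]


def rules_with_ipset(chain, set_name, dport):
    """A single rule that matches any source address in the named set."""
    return [
        "-A %s -p tcp -m tcp --dport %d -m set --match-set %s src -j RETURN"
        % (chain, dport, set_name)
    ]


if __name__ == "__main__":
    members = ["10.0.0.%d" % i for i in range(1, 201)]
    print(len(rules_without_ipset("fake-chain", members, 22)))  # one per member
    print(len(rules_with_ipset("fake-chain", "fake-set", 22)))  # a single rule
```

Membership changes then update the set rather than the iptables rule list, so the number of lines that _modify_rules has to process grows much more slowly.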

OpenStack Infra (hudson-openstack) on 2014-12-03

Changed in neutron:
assignee:	Qin Zhao (zhaoqin) → Eugene Nikanorov (enikanorov)

OpenStack Infra (hudson-openstack) wrote on 2014-12-03:

#11

Fix proposed to branch: master
Review: https://review.openstack.org/138793

OpenStack Infra (hudson-openstack) wrote on 2014-12-03: Change abandoned on neutron (master)

#12

Change abandoned by enikanorov (<email address hidden>) on branch: master
Review: https://review.openstack.org/138793

OpenStack Infra (hudson-openstack) wrote on 2015-01-26:

#13

Change abandoned by Kyle Mestery (<email address hidden>) on branch: master
Review: https://review.openstack.org/115719
Reason: This review is > 4 weeks without comment and currently blocked by a core reviewer with a -2. We are abandoning this for now. Feel free to reactivate the review by pressing the restore button and contacting the reviewer with the -2 on this review to ensure you address their concerns.

OpenStack Infra (hudson-openstack) wrote on 2015-09-29:

#14

Change abandoned by Assaf Muller (<email address hidden>) on branch: master
Review: https://review.openstack.org/115719

Sheena Conant (sheena-conant) on 2016-05-11

tags:	added: scale

Armando Migliaccio (armando-migliaccio) wrote on 2017-02-01:

#15

This bug is > 180 days without activity. We are unsetting assignee and milestone and setting status to Incomplete in order to allow its expiry in 60 days.

If the bug is still valid, then update the bug status.

Changed in neutron:
assignee:	Eugene Nikanorov (enikanorov) → nobody
status:	In Progress → Incomplete

Launchpad Janitor (janitor) wrote on 2017-04-02:

#16

[Expired for neutron because there has been no activity for 60 days.]

Changed in neutron:
status:	Incomplete → Expired
