applying iptables rules takes too long when large scale deployment

Bug #1352826 reported by Chen Ya Qin on 2014-08-05
This bug affects 6 people

Bug Description

I found that applying iptables rules (in neutron/agent/linux/, _apply_synchronized and _modify_rules) takes more than half an hour (36 minutes in my environment) when the number of active VMs in the cloud is more than 880.
As a result, bringing up a newly created port when booting an instance takes very long, and if vif_plugging_is_fatal is true and vif_plugging_timeout is not large enough, booting fails.
Although the optimization of _modify_rules in patch did shorten the time, it is still not short enough (17 minutes with more than 880 active VMs in my environment).
Further optimization of _modify_rules is needed to handle large-scale deployments.

Maru Newby (maru) wrote :

Can you please provide more details about how to reproduce the problem? The number of active VMs alone is not sufficient. The deployment configuration (including hardware), the security groups that are configured, etc, are all relevant.

Changed in neutron:
status: New → Incomplete
Chen Ya Qin (yqchen) wrote :

We have 11 hosts: one is the controller and the other 10 are compute nodes. Each host has 32 CPUs.
Each CPU is:
processor : 31
vendor_id : GenuineIntel
cpu family : 6
model : 45
model name : Intel(R) Xeon(R) CPU E5-2650 0 @ 2.00GHz
stepping : 7
cpu MHz : 1200.000
cache size : 20480 KB

MemTotal: 297597496 kB
MemFree: 277073576 kB
Buffers: 172792 kB
Cached: 2384112 kB
SwapCached: 0 kB
Active: 16322372 kB
Inactive: 1370652 kB
Active(anon): 15143856 kB
Inactive(anon): 352 kB
Active(file): 1178516 kB
Inactive(file): 1370300 kB
Unevictable: 329208 kB
Mlocked: 63520 kB
SwapTotal: 4194296 kB
SwapFree: 4194296 kB

The config of security_group in nova.conf:
# number of security groups per project (default: 10)
# number of security rules per security group (default: 20)

The config of security_group in neutron.conf:
# being used in conjunction with nova security groups and/or metadata service.
# number of security groups allowed per tenant, and minus means unlimited
quota_security_group = -1
# number of security group rules allowed per tenant, and minus means unlimited
quota_security_group_rule = -1

We ran a script that spawns 5 threads. Each thread boots 20 instances, deletes 10 instances, pauses for 15 seconds, and then repeats the same operations until the total number of VMs is 1250. The result I posted in the bug description was obtained after we had already deleted 5 hosts from the cloud.
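For anyone who wants to generate a similar load, here is a minimal sketch of the loop described above. It is not the script we ran. `FakeCloud` is a hypothetical in-memory stand-in for a real compute API client, and the sketch assumes the target counts instances that still exist after the deletes.

```python
import threading
import time

BOOT_BATCH = 20      # instances booted per iteration (from the description)
DELETE_BATCH = 10    # instances deleted per iteration (from the description)


class FakeCloud:
    """Hypothetical stand-in for a compute API client; replace with real calls."""

    def __init__(self):
        self._lock = threading.Lock()
        self._next_id = 0
        self.instances = set()

    def boot(self):
        with self._lock:
            self._next_id += 1
            self.instances.add(self._next_id)
            return self._next_id

    def delete(self, instance_id):
        with self._lock:
            self.instances.discard(instance_id)

    def count(self):
        with self._lock:
            return len(self.instances)


def worker(cloud, target_total, pause):
    # Boot a batch, delete half of it, pause, and repeat until the target is hit.
    while cloud.count() < target_total:
        booted = [cloud.boot() for _ in range(BOOT_BATCH)]
        for instance_id in booted[:DELETE_BATCH]:
            cloud.delete(instance_id)
        time.sleep(pause)


def run_load(cloud, target_total=1250, pause=15, threads=5):
    pool = [
        threading.Thread(target=worker, args=(cloud, target_total, pause))
        for _ in range(threads)
    ]
    for t in pool:
        t.start()
    for t in pool:
        t.join()
    return cloud.count()
```

Because several threads can pass the check at the same time, the final count may overshoot the target by a few batches.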
When we set vif_plugging_is_fatal=true and vif_plugging_timeout=300, we found that partway through the run booting would always fail, so we set vif_plugging_is_fatal=false and continued booting. The exact number of active VMs is now 885 (980 is the approximate number of all VMs, including shut-down ones). The number of ports is 978. When I then tried to boot an instance with vif_plugging_is_fatal=true and vif_plugging_timeout=300, booting failed. When I looked into _modify_rules while booting an instance, I found that the number of iptables rules was more than 50000, so _modify_rules took more than 30 minutes and booting failed.

To reproduce the problem (_modify_rules taking too much time), I think you can create thousands of iptables rules in the cloud and boot an instance on a host with limited capacity, with vif_plugging_is_fatal=true and vif_plugging_timeout=300. In theory the boot will fail with the exception VirtualInterfaceCreateException() because _modify_rules runs past the timeout.

description: updated
Chen Ya Qin (yqchen) on 2014-08-06
Changed in neutron:
status: Incomplete → New
tags: added: loadimpact
Chen Ya Qin (yqchen) on 2014-08-07
Changed in neutron:
assignee: nobody → Chen Ya Qin (yqchen)
Chen Ya Qin (yqchen) on 2014-08-07
Changed in neutron:
assignee: Chen Ya Qin (yqchen) → nobody
Akihiro Motoki (amotoki) wrote :

Similar to bug 1314189.

tags: added: sg-fw
Changed in neutron:
importance: Undecided → Medium
Qin Zhao (zhaoqin) wrote :

I changed the search algorithm on my machine to reduce the complexity of _modify_rules() from O(n^2) to O(n log n). With that change, the loop completes in several seconds. Before proposing the new code, I hope to commit a unit test for _modify_rules() first, so that we do not break _modify_rules() in the future.
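To illustrate the kind of change meant here, a minimal sketch follows. It is not the proposed patch and does not reproduce the real _modify_rules() logic. The function names are hypothetical, and it assumes rules can be compared as plain strings. The first version searches a list for every rule, which is O(n^2) overall. The second sorts once and uses binary search, which is O(n log n).

```python
import bisect


def diff_rules_quadratic(current, desired):
    # Each "in" test scans a whole list, so the total cost grows as O(n^2).
    to_add = [rule for rule in desired if rule not in current]
    to_remove = [rule for rule in current if rule not in desired]
    return to_add, to_remove


def _contains(sorted_rules, rule):
    # Binary search in a sorted list: O(log n) per lookup.
    index = bisect.bisect_left(sorted_rules, rule)
    return index < len(sorted_rules) and sorted_rules[index] == rule


def diff_rules_sorted(current, desired):
    # Sort once (O(n log n)), then do n lookups of O(log n) each.
    sorted_current = sorted(current)
    sorted_desired = sorted(desired)
    to_add = [rule for rule in desired if not _contains(sorted_current, rule)]
    to_remove = [rule for rule in current if not _contains(sorted_desired, rule)]
    return to_add, to_remove
```

Both functions return the same result and keep the input order of the rules; only the lookup cost differs.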

Fix proposed to branch: master

Changed in neutron:
assignee: nobody → Qin Zhao (zhaoqin)
status: New → In Progress
Qin Zhao (zhaoqin) wrote :

Hi Chen Ya Qin, I proposed a new implementation of _modify_rules() to fix your problem. Please help validate this code change and provide your feedback. Although I do not think the code change can be accepted in the short term, your testing and feedback will be very important input.

Hua Zhang (zhhuabj) wrote :

One quick and dirty test from my colleague @nobuto shows that the time to apply iptables rules improved from 900 seconds to 11 seconds with patch set 5 of this patch applied, with 3,400 instances running and 80,000 lines in iptables-save.

Nobuto Murata (nobuto) wrote :

Correction: I used patch set 2 for the quick test. I haven't had a chance to try patch set 5 yet.

Qin Zhao (zhaoqin) wrote :

Zhang Hua and Nobuto, thanks for your testing! Patch set 2 is enough to improve the performance, and it is safe. The complexity of patch set 5 is almost the same as that of patch set 2; in fact, I have not carefully measured which one is faster. Patch set 5 is not fully tested yet. I hope more stackers can give it a try and give me feedback.
Thanks again!

Miguel Angel Ajo (mangelajo) wrote :

Are you using ipset?

That may help reduce the iptables size.
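A rough sketch of why the rule count can drop, assuming a rule that allows traffic from every member of a security group. The helper names and the set name are hypothetical, and the strings are simplified rule fragments, not the exact rules neutron generates. Without ipset there is one source-address match per member; with ipset a single rule uses the iptables `set` match against a named set.

```python
def rules_without_ipset(member_ips):
    # One rule fragment per member address of the remote group.
    return ["-s %s/32 -j RETURN" % ip for ip in member_ips]


def rules_with_ipset(set_name):
    # One rule fragment that matches any source address stored in the named set.
    return ["-m set --match-set %s src -j RETURN" % set_name]


members = ["10.0.0.%d" % i for i in range(1, 201)]
print(len(rules_without_ipset(members)))  # grows with the number of members
print(len(rules_with_ipset("sg-members")))  # stays at one
```

The member addresses still have to be stored in the set itself, but they no longer appear as separate lines in iptables-save.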

Changed in neutron:
assignee: Qin Zhao (zhaoqin) → Eugene Nikanorov (enikanorov)

Fix proposed to branch: master

Change abandoned by enikanorov (<email address hidden>) on branch: master

Change abandoned by Kyle Mestery (<email address hidden>) on branch: master
Reason: This review is > 4 weeks without comment and currently blocked by a core reviewer with a -2. We are abandoning this for now. Feel free to reactivate the review by pressing the restore button and contacting the reviewer with the -2 on this review to ensure you address their concerns.

Change abandoned by Assaf Muller (<email address hidden>) on branch: master

tags: added: scale

This bug is > 180 days without activity. We are unsetting assignee and milestone and setting status to Incomplete in order to allow its expiry in 60 days.

If the bug is still valid, then update the bug status.

Changed in neutron:
assignee: Eugene Nikanorov (enikanorov) → nobody
status: In Progress → Incomplete
Launchpad Janitor (janitor) wrote :

[Expired for neutron because there has been no activity for 60 days.]

Changed in neutron:
status: Incomplete → Expired