do_refresh_security_group_rules in nova.virt.firewall is very slow

Bug #1062314 reported by Michael Still on 2012-10-05
This bug affects 4 people

Affects / Importance / Assigned to:
- OpenStack Compute (nova): High, assigned to Michael Still
- OpenStack Compute (nova) Folsom series: High, assigned to Pádraig Brady
- nova (Ubuntu): Undecided, Unassigned (nominated for Precise by Yolanda Robla)
- nova (Ubuntu) Quantal series: Undecided, Unassigned

Bug Description

This is a bug against stable essex. I have made no attempt to determine whether this is still a problem in Folsom at this stage.

During a sprint this week we took a nova region which was previously relatively idle and started turning up large numbers of instances using juju. We started to experience very slow instance starts, which I dug into. I should note that juju seems to trigger this behaviour by refreshing security groups when ports are exposed, but other OpenStack users will probably experience similar problems if they do non-trivial things with security groups.

It appears that do_refresh_security_group_rules can sometimes take a very long time to run, and it holds the "iptables" lock while doing so. This is a problem because launching a new instance also needs to take the iptables lock, and can end up being blocked. An example slow instance start (nova logs edited for clarity):

(logs from scheduler node)
2012-10-05 08:06:28 run_instance
2012-10-05 08:06:29 cast to <<compute node>>

(logs from compute node)
2012-10-05 08:07:21 Starting instance...
2012-10-05 08:07:34 Starting toXML method
2012-10-05 08:07:43 Finished toXML method
2012-10-05 08:07:43 Called setup_basic_filtering in nwfilter
2012-10-05 08:07:43 Ensuring static filters

2012-10-05 08:08:48 Attempting to grab semaphore "iptables" for method "_do_refresh_provider_fw_rules"...
2012-10-05 08:24:00 Got semaphore "iptables" for method "_do_refresh_provider_fw_rules"...

2012-10-05 08:24:01 Creating image
2012-10-05 08:24:06 Instance is running
2012-10-05 08:25:28 Checking state
2012-10-05 08:25:30 Instance spawned successfully.
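The contention above can be reproduced in miniature. This is illustrative Python, not nova code: a slow "refresh" holding a shared lock forces an unrelated "launch" to wait for the whole duration of the refresh.

```python
import threading
import time

iptables_lock = threading.Lock()

def do_refresh_security_group_rules():
    with iptables_lock:              # held for the whole (slow) refresh
        time.sleep(0.5)              # stands in for minutes of rule rebuilding

def launch_instance():
    started = time.time()
    with iptables_lock:              # instance start needs the same lock...
        pass                         # ...so it blocks until the refresh ends
    return time.time() - started

refresher = threading.Thread(target=do_refresh_security_group_rules)
refresher.start()
time.sleep(0.05)                     # let the refresh grab the lock first
waited = launch_instance()
refresher.join()
print('launch waited %.2f seconds for the iptables lock' % waited)
```

Scale the sleeps up to the multi-minute refresh times in the logs above and the 16-minute gap between "Attempting to grab semaphore" and "Got semaphore" follows directly.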

I instrumented utils.synchronized to include lock wait and hold times, like this (patch against essex):

diff --git a/nova/utils.py b/nova/utils.py
index 6535b06..2e01a15 100644
--- a/nova/utils.py
+++ b/nova/utils.py
@@ -926,10 +926,16 @@ def synchronized(name, external=False):
             LOG.debug(_('Attempting to grab semaphore "%(lock)s" for method '
                         '"%(method)s"...') % {'lock': name,
                                               'method': f.__name__})
+            started_waiting = time.time()
+
             with sem:
                 LOG.debug(_('Got semaphore "%(lock)s" for method '
-                            '"%(method)s"...') % {'lock': name,
-                                                  'method': f.__name__})
+                            '"%(method)s" after %(wait)f second wait...'),
+                          {'lock': name,
+                           'method': f.__name__,
+                           'wait': time.time() - started_waiting})
+                started_working = time.time()
+
                 if external and not FLAGS.disable_process_locking:
                     LOG.debug(_('Attempting to grab file lock "%(lock)s" for '
                                 'method "%(method)s"...') %
@@ -945,6 +951,12 @@ def synchronized(name, external=False):
                 else:
                     retval = f(*args, **kwargs)
 
+                LOG.debug(_('Released semaphore "%(lock)s" for method '
+                            '"%(method)s" after %(wait)f seconds of use...'),
+                          {'lock': name,
+                           'method': f.__name__,
+                           'wait': time.time() - started_working})
+
             # If no-one else is waiting for it, delete it.
             # See note about possible raciness above.
             if not sem.balance < 1:
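Outside a diff, the same idea can be packaged as a standalone decorator. This is a sketch, not nova's utils.synchronized: it assumes a plain threading.Lock and the standard logging module rather than nova's semaphore registry and process-lock flags.

```python
import functools
import logging
import threading
import time

logging.basicConfig(level=logging.DEBUG, format='%(message)s')
LOG = logging.getLogger(__name__)

def timed_lock(name, lock=None):
    """Serialize calls on a lock, logging wait and hold times."""
    lock = lock or threading.Lock()

    def wrap(f):
        @functools.wraps(f)
        def inner(*args, **kwargs):
            started_waiting = time.time()
            with lock:
                LOG.debug('Got lock "%s" for "%s" after %f second wait',
                          name, f.__name__, time.time() - started_waiting)
                started_working = time.time()
                try:
                    return f(*args, **kwargs)
                finally:
                    LOG.debug('Released lock "%s" for "%s" after %f '
                              'seconds of use', name, f.__name__,
                              time.time() - started_working)
        return inner
    return wrap

@timed_lock('iptables')
def do_refresh():
    time.sleep(0.1)          # stands in for the slow rule rebuild
    return 'rebuilt'

result = do_refresh()
```

Logging the release as well as the acquisition is what makes the per-method hold times greppable below.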

Taking a look at the five longest lock holds in my logs after this patch is applied, I get:

# grep "Released semaphore" /var/log/nova/nova-compute.log | grep iptables | awk '{print$15, $13}' | sort -n | tail -5
192.134270 "do_refresh_security_group_rules"
194.140478 "do_refresh_security_group_rules"
194.153729 "do_refresh_security_group_rules"
201.135854 "do_refresh_security_group_rules"
297.725837 "do_refresh_security_group_rules"
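The same extraction can be done without counting awk fields. This is a sketch: the regex assumes the "Released semaphore" message format produced by the patch above, and the sample lines here are stand-ins for a real log file.

```python
import re

# Two fabricated log lines in the format the instrumentation patch emits.
sample = '''\
... Released semaphore "iptables" for method "do_refresh_security_group_rules" after 192.134270 seconds of use...
... Released semaphore "iptables" for method "_do_refresh_provider_fw_rules" after 1.500000 seconds of use...
'''

pattern = re.compile(r'Released semaphore "iptables" for method '
                     r'"(?P<method>[^"]+)" after (?P<secs>[\d.]+) seconds')
holds = sorted((float(m.group('secs')), m.group('method'))
               for m in pattern.finditer(sample))
for secs, method in holds[-5:]:          # five longest holds
    print('%f "%s"' % (secs, method))
```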

So I then instrumented do_refresh_security_group_rules to see what was slow, using this patch (which I know is horrible):

diff --git a/nova/virt/firewall.py b/nova/virt/firewall.py
index f0f1594..99f580a 100644
--- a/nova/virt/firewall.py
+++ b/nova/virt/firewall.py
@@ -17,6 +17,8 @@
 # License for the specific language governing permissions and limitations
 # under the License.
 
+import time
+
 from nova import context
 from nova import db
 from nova import flags
@@ -167,16 +169,35 @@ class IptablesFirewallDriver(FirewallDriver):
                 self.iptables.ipv6['filter'].add_rule(chain_name, rule)
 
     def add_filters_for_instance(self, instance):
+        start_time = time.time()
         network_info = self.network_infos[instance['id']]
+        LOG.debug(_('Get network info took %f seconds'),
+                  time.time() - start_time)
+
+        start_time = time.time()
         chain_name = self._instance_chain_name(instance)
+        LOG.debug(_('Get chain name took %f seconds'),
+                  time.time() - start_time)
+
+        start_time = time.time()
         if FLAGS.use_ipv6:
             self.iptables.ipv6['filter'].add_chain(chain_name)
         self.iptables.ipv4['filter'].add_chain(chain_name)
+        LOG.debug(_('Add chain took %f seconds'),
+                  time.time() - start_time)
+
+        start_time = time.time()
         ipv4_rules, ipv6_rules = self._filters_for_instance(chain_name,
                                                             network_info)
+        LOG.debug(_('Filters for instance took %f seconds'),
+                  time.time() - start_time)
+
+        start_time = time.time()
         self._add_filters('local', ipv4_rules, ipv6_rules)
         ipv4_rules, ipv6_rules = self.instance_rules(instance, network_info)
         self._add_filters(chain_name, ipv4_rules, ipv6_rules)
+        LOG.debug(_('Add filters took %f seconds'),
+                  time.time() - start_time)
 
     def remove_filters_for_instance(self, instance):
         chain_name = self._instance_chain_name(instance)
@@ -362,9 +383,17 @@ class IptablesFirewallDriver(FirewallDriver):
 
     @utils.synchronized('iptables', external=True)
     def do_refresh_security_group_rules(self, security_group):
+        # TODO(mikal): why is security group passed in if its not used?
         for instance in self.instances.values():
+            remove_start = time.time()
             self.remove_filters_for_instance(instance)
+            add_start = time.time()
             self.add_filters_for_instance(instance)
+            LOG.debug(_('Refreshing security groups for instance (removal '
+                        'took %(remove)f seconds, add took %(add)f seconds)'),
+                      {'remove': add_start - remove_start,
+                       'add': time.time() - add_start},
+                      instance=instance)
 
     def refresh_provider_fw_rules(self):
         """See :class:`FirewallDriver` docs."""

This gives me log entries like this (not all are this slow):

2012-10-05 13:27:43 DEBUG nova.virt.firewall [req-051d71e4-1a71-43f9-87a4-36686b5445c6 None None] Add filters took 46.641673 seconds from (pid=8052) add_filters_for_instance /usr/lib/python2.7/dist-packages/nova/virt/firewall.py:200
2012-10-05 13:27:43 DEBUG nova.virt.firewall [req-051d71e4-1a71-43f9-87a4-36686b5445c6 None None] [instance: 06ea0120-ed5e-4379-90b1-322e1de68950] Refreshing security groups for instance (removal took 0.000958 seconds, add took 46.643727 seconds) from (pid=8052) do_refresh_security_group_rules /usr/lib/python2.7/dist-packages/nova/virt/firewall.py:396

I'm still looking at this, but it looks like IptablesFirewallDriver.instance_rules() in nova/virt/firewall.py is the problem. This method makes both database and RPC calls while holding the iptables lock.
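The eventual fix restructures the code along these lines: do the slow database/RPC lookups first, then take the lock only to apply the results. This is a rough sketch of the pattern, not the committed patch; `fetch_rules_via_rpc` and `apply_rules` are hypothetical stand-ins for nova's real helpers.

```python
import threading
import time

iptables_lock = threading.Lock()
applied = {}

def fetch_rules_via_rpc(instance):
    time.sleep(0.05)                 # stands in for db and rpc round trips
    return ['-A inst-%s -j ACCEPT' % instance]

def apply_rules(instance, rules):
    applied[instance] = rules        # stands in for the iptables rewrite

def do_refresh_security_group_rules(instances):
    # Slow part first, with no lock held.
    gathered = dict((i, fetch_rules_via_rpc(i)) for i in instances)
    # The lock now protects only the fast, local update.
    with iptables_lock:
        for instance, rules in gathered.items():
            apply_rules(instance, rules)

do_refresh_security_group_rules(['a', 'b'])
```

The lock hold time is then bounded by the local rule rewrite, not by RPC round trips.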

Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in nova (Ubuntu):
status: New → Confirmed
Michael Still (mikal) wrote :

I think the issue here is that IptablesFirewallDriver.instance_rules() in nova/virt/firewall.py is calling get_instance_nw_info(), which causes RPCs to be fired off _while_still_holding_the_iptables_lock. I suspect that the RPCs need to happen outside the lock.

From yet more instrumented code:

A synchronous RPC call is being made while a lock is held. This is probably a bug. Please report it. Include lines following this that start with ** please.
** multicall
** call
** call
** call
** get_instance_nw_info
** instance_rules
** add_filters_for_instance
** do_refresh_security_group_rules
** inner_while_holding_lock
** refresh_security_group_members
** refresh_security_group_members
** refresh_security_group_members
** wrapped
** _process_data
** wrapped
** _spawn_n_impl
** end of stack trace
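Instrumentation like the above can be built by tracking, per thread, how many locks are held, and capturing a stack trace whenever the RPC path is entered under a lock. This is a sketch of the idea, not the actual nova debug patch; `TrackedLock` and `rpc_call` are hypothetical.

```python
import threading
import traceback

_state = threading.local()
violations = []          # stacks captured when an RPC happens under a lock

class TrackedLock(object):
    """A lock that records, per thread, how many times it is held."""
    def __init__(self):
        self._lock = threading.Lock()

    def __enter__(self):
        self._lock.acquire()
        _state.depth = getattr(_state, 'depth', 0) + 1

    def __exit__(self, *exc):
        _state.depth -= 1
        self._lock.release()

def rpc_call(topic, msg):
    # Complain, with a stack trace, if any tracked lock is currently held.
    if getattr(_state, 'depth', 0) > 0:
        violations.append(traceback.format_stack())
    return None          # a real implementation would perform the call here

iptables_lock = TrackedLock()
with iptables_lock:
    rpc_call('network', {'method': 'get_instance_nw_info'})   # flagged
rpc_call('network', {'method': 'get_instance_nw_info'})       # not flagged
print('violations recorded: %d' % len(violations))
```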

Michael Still (mikal) on 2012-10-10
Changed in nova:
assignee: nobody → Michael Still (mikalstill)

Fix proposed to branch: master
Review: https://review.openstack.org/14326

Changed in nova:
status: New → In Progress

Reviewed: https://review.openstack.org/14326
Committed: http://github.com/openstack/nova/commit/ba585524e32965697c1a44c8fd743dea060bb1af
Submitter: Jenkins
Branch: master

commit ba585524e32965697c1a44c8fd743dea060bb1af
Author: Michael Still <email address hidden>
Date: Thu Oct 11 15:46:11 2012 +1100

    Avoid RPC calls while holding iptables lock.

    This exhibitied itself as very slow instance starts on a Canonical
    test cluster. This was because do_referesh_security_group_rules()
    was making rpc calls while holding the iptables lock. This refactor
    avoids that while making no functional changes (I hope).

    This should resolve bug 1062314.

    Change-Id: I36f805bd72f7bd06082cfe96c58d637203bcffb7

Changed in nova:
status: In Progress → Fix Committed
Chuck Short (zulcss) on 2012-10-12
Changed in nova (Ubuntu):
status: Confirmed → In Progress

Reviewed: https://review.openstack.org/14367
Committed: http://github.com/openstack/nova/commit/eee4dbb07e88e19ca891275f5ba1d381bd98b4e5
Submitter: Jenkins
Branch: stable/folsom

commit eee4dbb07e88e19ca891275f5ba1d381bd98b4e5
Author: Michael Still <email address hidden>
Date: Thu Oct 11 15:46:11 2012 +1100

    Avoid RPC calls while holding iptables lock.

    This exhibitied itself as very slow instance starts on a Canonical
    test cluster. This was because do_referesh_security_group_rules()
    was making rpc calls while holding the iptables lock. This refactor
    avoids that while making no functional changes (I hope).

    Fixes bug: 1062314
    Change-Id: I36f805bd72f7bd06082cfe96c58d637203bcffb7
    Cherry-picked: ba585524e32965697c1a44c8fd743dea060bb1af

Michael Still (mikal) wrote :

Upstream has chosen not to backport this fix to essex. Can we please consider carrying this patch ourselves?

Pádraig Brady (p-draigbrady) wrote :

To clarify, it's not in 2012.1.3 but will probably be added to the 2012.1.4 branch

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package nova - 2012.2-0ubuntu5

---------------
nova (2012.2-0ubuntu5) quantal-proposed; urgency=low

  [ Adam Gandelman ]
  * Move management of /var/lib/nova/volumes from nova-common to
    nova-volume. Ensure it has proper permissions. (LP: #1065320)
  * debian/patches/avoid_setuptools_git_dependency.patch: Remove
    setuptools_git from tools/pip-requires to avoid it being automatically
    added to python-nova's runtime dependencies. (LP: #1059907)

  [ Chuck Short ]
  * debian/patches/rbd-security.patch: Support override of ceph rbd
    user and secret in nova-compute. (LP: #1065883)
  * debian/patches/ubuntu/fix-libvirt-firewall-slowdown.patch: Fix
    refreshing of security groups in libvirt not to block on RPC calls.
    (LP: #1062314)
  * debian/patches/ubuntu/fix-ec2-volume-id-mappings.patch: Read deleted
    snapshot and volume id mappings. (LP: #1065785)
 -- Chuck Short <email address hidden> Fri, 12 Oct 2012 12:35:01 -0500

Changed in nova (Ubuntu):
status: In Progress → Fix Released
Michael Still (mikal) on 2012-10-16
Changed in nova:
importance: Undecided → High

Reviewed: https://review.openstack.org/14368
Committed: http://github.com/openstack/nova/commit/86a59376c6d4d2ec5a05c8b63a419d0374e9d9c9
Submitter: Jenkins
Branch: stable/essex

commit 86a59376c6d4d2ec5a05c8b63a419d0374e9d9c9
Author: Michael Still <email address hidden>
Date: Thu Oct 11 15:46:11 2012 +1100

    Avoid RPC calls while holding iptables lock.

    This exhibitied itself as very slow instance starts on a Canonical
    test cluster. This was because do_referesh_security_group_rules()
    was making rpc calls while holding the iptables lock. This refactor
    avoids that while making no functional changes (I hope).

    Fixes bug: 1062314
    Change-Id: I36f805bd72f7bd06082cfe96c58d637203bcffb7
    Cherry-picked: ba585524e32965697c1a44c8fd743dea060bb1af
    Conflicts:
     nova/virt/firewall.py

Thierry Carrez (ttx) on 2012-11-21
Changed in nova:
milestone: none → grizzly-1
status: Fix Committed → Fix Released
Changed in nova (Ubuntu Quantal):
status: New → Confirmed

Hello Michael, or anyone else affected,

Accepted nova into quantal-proposed. The package will build now and be available at http://launchpad.net/ubuntu/+source/nova/2012.2.1+stable-20121212-a99a802e-0ubuntu1 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-needed to verification-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

Changed in nova (Ubuntu Quantal):
status: Confirmed → Fix Committed
tags: added: verification-needed
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package nova - 2012.2.1+stable-20121212-a99a802e-0ubuntu1

---------------
nova (2012.2.1+stable-20121212-a99a802e-0ubuntu1) quantal-proposed; urgency=low

  * Ubuntu updates:
    - debian/control: Ensure novaclient is upgraded with nova,
      require python-keystoneclient >= 1:2.9.0. (LP: #1073289)
    - d/p/avoid_setuptools_git_dependency.patch: Refresh.
  * Dropped patches, applied upstream:
    - debian/patches/CVE-2012-5625.patch: [a99a802]
  * Resynchronize with stable/folsom (b55014ca) (LP: #1085255):
    - [a99a802] create_lvm_image allocates dirty blocks (LP: #1070539)
    - [670b388] RPC exchange name defaults to 'openstack' (LP: #1083944)
    - [3ede373] disassociate_floating_ip with multi_host=True fails
      (LP: #1074437)
    - [22d7c3b] libvirt imagecache should handle shared image storage
      (LP: #1075018)
    - [e787786] Detached and deleted RBD volumes remain associated with insance
      (LP: #1083818)
    - [9265eb0] live_migration missing migrate_data parameter in Hyper-V driver
      (LP: #1066513)
    - [3d99848] use_single_default_gateway does not function correctly
      (LP: #1075859)
    - [65a2d0a] resize does not migrate DHCP host information (LP: #1065440)
    - [102c76b] Nova backup image fails (LP: #1065053)
    - [48a3521] Fix config-file overrides for nova-dhcpbridge
    - [69663ee] Cloudpipe in Folsom: no such option: cnt_vpn_clients
      (LP: #1069573)
    - [6e47cc8] DisassociateAddress can cause Internal Server Error
      (LP: #1080406)
    - [22c3d7b] API calls to dis-associate an auto-assigned floating IP should
      return proper warning (LP: #1061499)
    - [bd11d15] libvirt: if exception raised during volume_detach, volume state
      is inconsistent (LP: #1057756)
    - [dcb59c3] admin can't describe all images in ec2 api (LP: #1070138)
    - [78de622] Incorrect Exception raised during Create server when metadata
      over 255 characters (LP: #1004007)
    - [c313de4] Fixed IP isn't released before updating DHCP host file
      (LP: #1078718)
    - [f4ab42d] Enabling Return Reservation ID with XML create server request
      returns no body (LP: #1061124)
    - [3db2a38] 'BackupCreate' should accept rotation parameter greater than or
      equal to zero (LP: #1071168)
    - [f7e5dde] libvirt reboot sometimes fails to reattach volumes
      (LP: #1073720)
    - [ff776d4] libvirt: detaching volume may fail while terminating other
      instances on the same host concurrently (LP: #1060836)
    - [85a8bc2] Used instance uuid rather than id in remove-fixed-ip
    - [42a85c0] Fix error on invalid delete_on_termination value
    - [6a17579] xenapi migrations fail w/ swap (LP: #1064083)
    - [97649b8] attach-time field for volumes is not updated for detach volume
      (LP: #1056122)
    - [8f6a718] libvirt: rebuild is not using kernel and ramdisk associated with
      the new image (LP: #1060925)
    - [fbe835f] live-migration and volume host assignement (LP: #1066887)
    - [c2a9150] typo prevents volume_tmp_dir flag from working (LP: #1071536)
    - [93efa21] Instances deleted during spawn leak network allocations
      (LP: #1068716)
    - [ebabd02] After restartin...


Changed in nova (Ubuntu Quantal):
status: Fix Committed → Fix Released
Thierry Carrez (ttx) on 2013-04-04
Changed in nova:
milestone: grizzly-1 → 2013.1
Brian Murray (brian-murray) wrote :

Hello Michael, or anyone else affected,

Accepted nova into precise-proposed. The package will build now and be available at http://launchpad.net/ubuntu/+source/nova/2012.1.3+stable-20130423-e52e6912-0ubuntu1 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-needed to verification-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

Please find the attached test log from the Ubuntu Server Team's CI infrastructure. As part of the verification process for this bug, Nova has been deployed and configured across multiple nodes using precise-proposed as an installation source. After successful bring-up and configuration of the cluster, a number of exercises and smoke tests have been invoked to ensure the updated package did not introduce any regressions. A number of test iterations were carried out to catch any possible transient errors.

Please Note the list of installed packages at the top and bottom of the report.

For records of upstream test coverage of this update, please see the Jenkins links in the comments of the relevant upstream code-review(s):

Trunk review: https://review.openstack.org/14326
Stable review: https://review.openstack.org/14368

As per the provisional Micro Release Exception granted to this package by the Technical Board, we hope this contributes toward verification of this update.

Yolanda Robla (yolanda.robla) wrote :

Test coverage log.


tags: added: verification-done
removed: verification-needed

The verification of this Stable Release Update has completed successfully and the package has now been released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates, please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

Sean Dague (sdague) on 2014-09-19
no longer affects: nova/essex