Bad performance problem with nova.virt.firewall

Bug #1016633 reported by David Kranz
This bug affects 2 people
Affects: OpenStack Compute (nova)
Status: Fix Released
Importance: Medium
Assigned to: Hans Lindgren
Milestone: (none)

Bug Description

I was trying to figure out why creating 1, 2, or 4 servers in parallel on an 8-core machine did not show any speedup. I found a problem, shown in the log snippet below for 4 servers. The two calls producing the debug messages are separated by only a single call:

    def prepare_instance_filter(self, instance, network_info):
        # make sure this is legacy nw_info
        network_info = self._handle_network_info_model(network_info)

        self.instances[instance['id']] = instance
        self.network_infos[instance['id']] = network_info
        self.add_filters_for_instance(instance)
        LOG.debug(_('Filters added to instance'), instance=instance)
        self.refresh_provider_fw_rules()
        LOG.debug(_('Provider Firewall Rules refreshed'), instance=instance)
        self.iptables.apply()

Note the interleaving of the last two calls in this log snippet and how long they take:

Jun 22 10:52:09 xg06eth0 2012-06-22 10:52:09 DEBUG nova.virt.firewall [req-14689766-cc17-4d8d-85bb-c4c19a2fc88d demo demo] [instance: 4c5a43af-04fd-4aa0-818e-8e0c5384b279] Filters added to instance from (pid=15704) prepare_instance_filter /opt/stack/nova/nova/virt/firewall.py:151

Jun 22 10:52:10 xg06eth0 2012-06-22 10:52:10 DEBUG nova.virt.firewall [req-14689766-cc17-4d8d-85bb-c4c19a2fc88d demo demo] [instance: 4c5a43af-04fd-4aa0-818e-8e0c5384b279] Provider Firewall Rules refreshed from (pid=15704) prepare_instance_filter /opt/stack/nova/nova/virt/firewall.py:153

Jun 22 10:52:18 xg06eth0 2012-06-22 10:52:18 DEBUG nova.virt.firewall [req-c9ed42e0-1eed-418a-ba37-132bcc26735c demo demo] [instance: df15e7d6-657e-4fd7-a4eb-6aab1bd63d5b] Filters added to instance from (pid=15704) prepare_instance_filter /opt/stack/nova/nova/virt/firewall.py:151

Jun 22 10:52:19 xg06eth0 2012-06-22 10:52:19 DEBUG nova.virt.firewall [req-c9ed42e0-1eed-418a-ba37-132bcc26735c demo demo] [instance: df15e7d6-657e-4fd7-a4eb-6aab1bd63d5b] Provider Firewall Rules refreshed from (pid=15704) prepare_instance_filter /opt/stack/nova/nova/virt/firewall.py:153

Jun 22 10:52:19 xg06eth0 2012-06-22 10:52:19 DEBUG nova.virt.firewall [req-2daf4cb8-73c5-487a-9bf6-bea08125b461 demo demo] [instance: 765212a6-cc23-4d5a-b252-5fa6b5f8331e] Filters added to instance from (pid=15704) prepare_instance_filter /opt/stack/nova/nova/virt/firewall.py:151

Jun 22 10:52:25 xg06eth0 2012-06-22 10:52:25 DEBUG nova.virt.firewall [req-5618e93e-3af1-4c65-b826-9d38850a215d demo demo] [instance: fa6423ac-82b8-419b-a077-f2d44d081771] Filters added to instance from (pid=15704) prepare_instance_filter /opt/stack/nova/nova/virt/firewall.py:151

Jun 22 10:52:38 xg06eth0 2012-06-22 10:52:38 DEBUG nova.virt.firewall [req-2daf4cb8-73c5-487a-9bf6-bea08125b461 demo demo] [instance: 765212a6-cc23-4d5a-b252-5fa6b5f8331e] Provider Firewall Rules refreshed from (pid=15704) prepare_instance_filter /opt/stack/nova/nova/virt/firewall.py:153

Jun 22 10:52:52 xg06eth0 2012-06-22 10:52:52 DEBUG nova.virt.firewall [req-5618e93e-3af1-4c65-b826-9d38850a215d demo demo] [instance: fa6423ac-82b8-419b-a077-f2d44d081771] Provider Firewall Rules refreshed from (pid=15704) prepare_instance_filter /opt/stack/nova/nova/virt/firewall.py:153
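
[Editorial note, not part of the original report] The growing gaps line up with the later finding (see the merged commit further down) that the work between the two debug messages performs an RPC round trip inside an externally synchronized block. The following is a toy sketch of why such a lock serializes "parallel" boots; it is not nova source, and the 1-second sleep and function names are placeholders for the conductor RPC plus the iptables reload:

    # Toy sketch, not nova source: if the work between the two debug messages
    # runs under a shared external (file-based) lock, concurrent boots queue
    # up behind each other.
    import threading
    import time

    from oslo_concurrency import lockutils

    lockutils.set_defaults(lock_path='/tmp')  # external locks need a lock dir

    @lockutils.synchronized('iptables', external=True)
    def refresh_and_apply(instance_id):
        # Placeholder for refresh_provider_fw_rules() + iptables.apply().
        time.sleep(1.0)

    def prepare_instance_filter(instance_id):
        print(time.strftime('%X'), instance_id, 'Filters added to instance')
        refresh_and_apply(instance_id)
        print(time.strftime('%X'), instance_id, 'Provider Firewall Rules refreshed')

    threads = [threading.Thread(target=prepare_instance_filter, args=(i,))
               for i in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # With four concurrent boots, the last instance sees its second message
    # trail the first by roughly four seconds, mirroring the gaps in the log.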

Tags: performance
Revision history for this message
David Kranz (david-kranz) wrote :

I tried Vish's suggested fix (http://paste.openstack.org/show/18764/), but the result was that the instances never made it out of the BUILD state.

Revision history for this message
Thierry Carrez (ttx) wrote :

Probably linked to the fact that your instances are created on the same machine and the calls are synchronized.

Changed in nova:
importance: Undecided → Wishlist
status: New → Confirmed
tags: added: performance
Revision history for this message
David Kranz (david-kranz) wrote :

OK, so I grepped out the "instance:" lines from a 4-instance run that took more than a minute and from a 1-instance run that took 10 seconds, producing five files that can be compared. I noticed several strange things in the 4-instance logs.

First, there was a 14-second hiccup in all four between "Starting instance..." and "Instance network info", due to a call to heal_instance_info_cache. Since this fires every minute, I assume it is not supposed to take 14 seconds.

Only 2 of the 4 (and not the single-instance run) had an "iptables firewall:" line, and one of them took 17 seconds to get to the next step.

I am attaching the files.

Revision history for this message
David Kranz (david-kranz) wrote :

I tried some runs with 8 instances on a host with 8 cores. There were several large gaps indicating serialization. The biggest was in the update_available_resource method in nova/virt/libvirt/connection.py. It was taking as much as 30 seconds. Most of that time seemed to be in getting vcpu_used, cpu_info, and disk_available_least.

For the multiple-CPU case, there is also always a 5-second gap between

IPTablesManager.apply completed with success

and

 Running cmd (subprocess): sudo /usr/local/bin/nova-rootwrap /etc/nova/rootwrap.conf tee /sys/class/net/vnet7/brport/hairpin_mode

All in all, the main problem seems to be that the periodic manager tasks (green threads?) are doing slow operations and not yielding often enough to the "real" work of launching instances, which could otherwise truly run in parallel as native processes. But I don't understand the nova threading/RPC architecture well enough to say for sure.
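
[Editorial note, not part of the original comment] To make the starvation hypothesis concrete, here is a toy illustration (not nova code; the sleep durations are arbitrary placeholders): eventlet greenthreads cooperate on a single OS thread, so a periodic task that blocks without yielding stalls every other greenthread, including the ones doing the boot work.

    # Toy illustration of greenthread starvation, not nova code.
    import time
    import eventlet

    def periodic_task():
        # Blocking sleep: never yields to the eventlet hub (monkey patching
        # or offloading to eventlet.tpool would avoid this).
        time.sleep(3)
        print('periodic task done at', time.strftime('%X'))

    def boot_instance(n):
        eventlet.sleep(0.1)  # cooperative wait; should finish almost instantly
        print('instance %d booted at %s' % (n, time.strftime('%X')))

    eventlet.spawn(periodic_task)
    boots = [eventlet.spawn(boot_instance, n) for n in range(4)]
    for gt in boots:
        gt.wait()
    # All four "boots" complete only after the 3-second blocking periodic task
    # releases the thread, even though each needed just 0.1 s of its own.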

Revision history for this message
David Kranz (david-kranz) wrote :

Thierry, I don't agree that this is a wishlist item. At this point the bug title is not really right. The situation is that launching n instances in an overlapping way takes at least as long as, and perhaps longer than, launching them serially. We may decide it can't be fixed for Folsom, but it is a real bug. I understand that this may not seem important for the use case where large numbers of public cloud users randomly launch VMs on a large cluster, but many of us have more targeted applications that view a cluster as a large set of cores and may take advantage of the fact that certain groups of cores have certain locality properties.

Revision history for this message
Thierry Carrez (ttx) wrote :

Set to Medium instead, ping me if you disagree.

Changed in nova:
importance: Wishlist → Medium
Revision history for this message
Jaroslav Pulchart (jaroslav-pulchart-4) wrote :

Hello

I hit this issue when deploying instances in parallel on one compute node with allow_same_net_traffic disabled.

It causes instances to end up in the error state due to OpenStack RPC call timeouts... and that is with the RPC timeout set to 5 minutes :(

Jaroslav

Revision history for this message
Vish Ishaya (vishvananda) wrote :

This needs to be verified with current trunk. A fix went in that should have minimized this:

https://review.openstack.org/#/c/14326/

Changed in nova:
status: Confirmed → Incomplete
Revision history for this message
Jaroslav Pulchart (jaroslav-pulchart-4) wrote :

I will use the attached patch until the fix from #13 is verified.

information type: Public → Public Security
information type: Public Security → Private Security
information type: Private Security → Public
Revision history for this message
Thierry Carrez (ttx) wrote :

Can anyone confirm that the issue is gone with the patch mentioned in comment #13?

Revision history for this message
Jaroslav Pulchart (jaroslav-pulchart-4) wrote :

Still too slow for me without the patch from #14.

Chuck Short (zulcss)
Changed in nova:
status: Incomplete → Confirmed
Revision history for this message
David Kranz (david-kranz) wrote :

The current situation on trunk is that creating n instances in parallel gets about half the speedup that would be hoped for. That is, starting up 4 instances takes twice as long as starting 1. That is a lot better than when I reported this bug. I don't know how much serialization is minimally required, so I don't know how good this is.

Revision history for this message
Joe Gordon (jogo) wrote :

Is this still valid? This bug is fairly old, and a lot has changed since it was last updated.

Changed in nova:
status: Confirmed → Incomplete
Revision history for this message
Jaroslav Pulchart (jaroslav-pulchart-4) wrote :

Hello,

This issue is valid for all nova-network based deployments. I didn't check it with Neutron, but I expect the same behavior if collecting the IPs for each security group is still based on three nested for loops with an RPC call in each iteration (I didn't check it directly).
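
[Editorial note, not part of the original comment] As a sketch of why that pattern hurts (toy code, not actual nova-network or Neutron source; rpc_get_instance_ips and rpc_get_instance_ips_bulk are hypothetical stand-ins): issuing one round trip per member inside three nested loops multiplies the latency, whereas gathering the ids first allows a single bulk call.

    # Toy sketch of the per-iteration-RPC pattern versus a batched lookup.
    # Both functions are illustrative; the callables passed in are hypothetical.

    def collect_ips_per_member(security_groups, rpc_get_instance_ips):
        ips = []
        for group in security_groups:                              # loop 1
            for rule in group['rules']:                            # loop 2
                for member in rule['grantee_group']['members']:    # loop 3
                    # One synchronous round trip per member:
                    # O(groups * rules * members) RPC calls.
                    ips.extend(rpc_get_instance_ips(member['instance_id']))
        return ips

    def collect_ips_batched(security_groups, rpc_get_instance_ips_bulk):
        # Same result with a single round trip: gather ids, then one bulk call.
        instance_ids = {
            member['instance_id']
            for group in security_groups
            for rule in group['rules']
            for member in rule['grantee_group']['members']
        }
        return rpc_get_instance_ips_bulk(sorted(instance_ids))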

Sean Dague (sdague)
Changed in nova:
status: Incomplete → Confirmed
Changed in nova:
assignee: nobody → Hans Lindgren (hanlind)
status: Confirmed → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/184027
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=c5e2d4b6b4dfa64f898249d6ed871cf6d41c58a2
Submitter: Jenkins
Branch: master

commit c5e2d4b6b4dfa64f898249d6ed871cf6d41c58a2
Author: Hans Lindgren <email address hidden>
Date: Thu Apr 23 22:43:24 2015 +0200

    Remove unused provider firewall rules functionality in nova

    Provider firewall rules functionality is not in use and hasn't been
    for a very long time. The api for this was removed in [1] and db api
    methods for adding/removing rows in the associated db table have not
    been used since.

    Stop refreshing those rules as it is essentially a no-op and indeed a
    costly one that includes an RPC round trip to the conductor to get
    back an always empty db result. This should have a positive impact on
    instance boot performance since the conductor call happens to live
    inside an externally synchronized block of code.

    Removes related compute rpcapi/manager code that was missed in a
    recent cleanup [2]. Since this functionality hasn't been in use since
    the Havana timeframe(!), it should be fairly safe to remove without
    first deprecating it.

    Also removes the now unused virtapi method provider_fw_rule_get_all()
    and the virtapi itself from virt firewall driver initialization.

    [1] Commit: 62d5fae8d11b6403f9a63a709270ffafebb7ef09
    [2] Commit: e6f7d8041783a0b3c740559d97b8d40b3568f214

    Change-Id: Ifbb2514b9bc1445eaa07dcfe172c7405fd1a58f7
    Partial-Bug: #1016633

Revision history for this message
Markus Zoeller (markus_z) (mzoeller) wrote :

@Hans Lindgren: What's missing to fix this bug completely?

Revision history for this message
Hans Lindgren (hanlind) wrote :

Not sure. My change removed one bottleneck involving a synchronized RPC call, but I don't know if there are others. Maybe someone who reported having this problem can verify that things now work as expected, and then we can close this bug.

Revision history for this message
Markus Zoeller (markus_z) (mzoeller) wrote :

@Hans Lindgren: Thanks for the feedback. This bug report is really old and I doubt that the current "Medium" importance is still valid. I am closing this as Fix Released; thanks for your patch.

@David Kranz (+ other stakeholders): Please double-check whether the issue is fixed from your perspective, using the current master (Newton) code. If it is not fixed, please reopen and provide some information on how you tested this.

Changed in nova:
status: In Progress → Fix Released