instance creation fails with "Failed to allocate the network(s), not rescheduling." because neutron-ovs-agent rpc_loop took too long

Bug #1763442 reported by Junien F
This bug affects 5 people
Affects                Status      Importance   Assigned to   Milestone
Ubuntu Cloud Archive   New         Undecided    Unassigned
neutron (Ubuntu)       Confirmed   Undecided    Unassigned

Bug Description

Hi,

I'm running a cloud with 18.02 charms, and origin cloud:xenial-ocata. We use openvswitch and GRE tunnels for networking. We use l2pop.

Today I investigated an instance creation failure. The "fault" message in "nova show" was the following:

Build of instance 59ea33ee-b167-48e9-ba61-e041aa3c0059 aborted: Failed to allocate the network(s), not rescheduling.", "code": 500, "details": " File \"/usr/lib/python2.7/dist-packages/nova/compute/manager.py\", line 1788, in _do_build_and_run_instance

The nova-compute log revealed the following:

2018-04-12 05:58:18.639 1852 ERROR nova.compute.manager [instance: 59ea33ee-b167-48e9-ba61-e041aa3c0059] VirtualInterfaceCreateException: Virtual Interface creation failed

because the vif-plugged event never came in (below is the "preparing for event" message):

2018-04-12 05:53:14.539 1852 DEBUG nova.compute.manager [req-9fd7bef1-8c17-4659-afdd-ac490a7d548e eacfab4914674b3d9225255198b10fcd 169e085b0bc7408490e8165dc1d1b19c - - -] [instance: 59ea33ee-b167-48e9-ba61-e041aa3c0059] Preparing to wait for external event network-vif-plugged-16389f31-d9b0-4fba-94cc-a28c1f504b19 prepare_for_instance_event /usr/lib/python2.7/dist-packages/nova/compute/manager.py:328

I was a bit surprised, because everything appeared to have worked well. Looking at the neutron-ovs-agent logs, I found the reason the event never got sent: the rpc_loop iteration that was running when the instance got created lasted more than 5 minutes!

Beginning of rpc_loop:
2018-04-12 05:52:59.287 3796 DEBUG neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [req-3eba7d9e-9533-4b87-a24f-d88d52a24876 - - - - -] Agent rpc_loop - iteration:549380 started rpc_loop /usr/lib/python2.7/dist-packages/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py:1951

End of rpc_loop:
2018-04-12 05:58:37.857 3796 DEBUG neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [req-3eba7d9e-9533-4b87-a24f-d88d52a24876 - - - - -] Agent rpc_loop - iteration:549380 completed. Processed ports statistics: {'regular': {'updated': 7, 'added': 7, 'removed': 7}}. Elapsed:338.570 loop_count_and_wait /usr/lib/python2.7/dist-packages/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py:1771

As you can see, it lasted 338.570 seconds. Most of this time was spent running "conntrack -D" commands, such as:
2018-04-12 05:55:44.984 3796 DEBUG neutron.agent.linux.utils [req-3eba7d9e-9533-4b87-a24f-d88d52a24876 - - - - -] Running command: ['sudo', 'neutron-rootwrap', '/etc/neutron/rootwrap.conf', 'conntrack', '-D', '-f', 'ipv4', '-d', '10.X.Y.Z', '-w', '10014', '-s', '10.A.B.C'] create_process /usr/lib/python2.7/dist-packages/neutron/agent/linux/utils.py:97

There were two batches of those: a first one with 176 commands, and a second one with 194 commands. With each command taking ~0.90 seconds (the machines are pretty heavily loaded; it's the call to neutron-rootwrap that takes most of the time, not the conntrack deletion itself), it does add up.
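For illustration, here is a rough back-of-the-envelope check (the batch sizes and per-call cost are the figures quoted above; the result is an estimate, not a measurement):

# Rough estimate only: two batches of per-IP "conntrack -D" calls, each going
# through sudo + neutron-rootwrap, at roughly 0.9 s per call on this loaded host.
batch_sizes = [176, 194]      # observed number of conntrack commands per batch
seconds_per_call = 0.90       # approximate cost of one rootwrap-wrapped call

estimated = sum(batch_sizes) * seconds_per_call
print("estimated time in conntrack deletions: %.1f s" % estimated)
# ~333 s, i.e. almost all of the 338.570 s the rpc_loop iteration took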

These conntrack commands originate from /usr/lib/python2.7/dist-packages/neutron/agent/linux/iptables_firewall.py: update_security_group_members() calls _update_ipset_members(), which does the following:

for ip_version, current_ips in sg_members.items():
    add_ips, del_ips = self.ipset.set_members(
        sg_id, ip_version, current_ips)
    if devices and del_ips:
        # remove prefix from del_ips
        ips = [str(netaddr.IPNetwork(del_ip).ip) for del_ip in del_ips]
        self.ipconntrack.delete_conntrack_state_by_remote_ips(
            devices, ip_version, ips)
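To make the cost model concrete, here is a simplified sketch of the pattern this leads to. This is not neutron's actual implementation of delete_conntrack_state_by_remote_ips, and the device field names are hypothetical; the point is only that every (device, removed IP) pair turns into a separate rootwrap-wrapped conntrack process, run synchronously inside the rpc_loop iteration:

# Illustrative sketch only -- not neutron's real code; field names are made up.
def delete_conntrack_state_by_remote_ips_sketch(execute, devices, ip_version, ips):
    family = 'ipv4' if ip_version == 4 else 'ipv6'
    for device in devices:
        for ip in ips:
            # one "sudo neutron-rootwrap ... conntrack -D ..." process per pair,
            # ~0.9 s each on this host
            execute(['conntrack', '-D', '-f', family,
                     '-d', device['fixed_ip'],      # hypothetical field
                     '-w', str(device['zone_id']),  # hypothetical field
                     '-s', ip],
                    run_as_root=True)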

I believe it's trying to remove any existing conntrack entries when an IP is removed from a security group.
And it looks like there's no safeguard in place to bound how long this takes, or to spread the work across multiple rpc_loop iterations.

So nova-compute times out waiting for the vif-plugged event, and instance creation fails.

It should be easy to reproduce if you create a bunch of instances in the same secgroup (we were at 28), delete one, and immediately create another one (on the same compute node, I guess?). You can also add a small sleep of 0.5s to neutron-rootwrap to simulate the load (see the sketch below).
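For that sleep, one way to do it (the wrapper below and the oslo_rootwrap entry-point import are assumptions about a cloud:xenial-ocata deployment, not something shipped by neutron) is to replace /usr/bin/neutron-rootwrap with a thin wrapper:

#!/usr/bin/python2.7
# Throwaway test hack: add ~0.5 s of latency before handing off to the real
# rootwrap entry point, to mimic the per-call overhead of a loaded host.
import sys
import time

time.sleep(0.5)

from oslo_rootwrap.cmd import main  # assumed entry point for this release
sys.exit(main())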

You can find the full log of the rpc_loop iteration at https://pastebin.canonical.com/p/hqKb3Rg72s/ (sorry, Canonical-only link)

Thanks !

Tags: sts
Revision history for this message
Launchpad Janitor (janitor) wrote:

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in neutron (Ubuntu):
status: New → Confirmed
Changed in neutron (Ubuntu):
status: Confirmed → Fix Committed
status: Fix Committed → Confirmed