Network interface allocation corrupts instance info cache

Bug #1501735 reported by Mark Goddard
26
This bug affects 4 people
Affects Status Importance Assigned to Milestone
OpenStack Compute (nova)
Fix Released
Medium
Sahid Orentino
Liberty
Fix Released
Medium
Mark Goddard

Bug Description

Allocation of network interfaces for an instance can result in corruption of the instance info cache in Nova. The result is that the cache may contain duplicate entries for network interfaces. This can cause failure to boot nodes, as seen with the Libvirt driver.

Seen on Ubuntu / devstack / commit b0013d93ffeaed53bc28d9558def26bdb7041ed7.

The issue can be reproduced using an instance with a large number of interfaces, for example using the heat stack in the attached YAML file heat-stack-many-interfaces.yaml. For improved reproducibility, add a short sleep in nova.network.neutronv2.api.API.allocate_for_instance, just before the call to self.get_instance_nw_info.

This issue was found by SecurityFun23 when testing the fix for bug #1467581.

The problem appears to be that in nova.network.neutronv2.api.API.allocate_for_instance, after the Neutron API calls to create/update ports, but before the instance info cache is updated in get_instance_nw_info, it is possible for another request to refresh the instance info cache. This will cause the new/updated ports to be added to the cache as they are discovered in Neutron. Then, the original request resumes, and unconditionally adds the new interfaces to the cache. This results in duplicate entries. The most likely candidate for another request is probably Neutron network-change notifications, which are triggered by the port update/create operation. The allocation of multiple interfaces is more likely to make the problem to occur, as Neutron API requests are made serially for each of the ports, allowing time for the notifications to arrive.

The perceived problem in a more visual form:

Request:
- Allocate interfaces for an instance (nova.network.neutronv2.api.API.allocate_for_instance)
- n x Neutron API port create/updates
------------------------------
Notification:
- External event notification from Neutron - network-changed (nova.compute.manager.ComputeManager.external_instance_event)
- Refresh instance network cache (network_api.get_instance_nw_info)
- Query ports for device in Neutron
- Add new ports to instance info cache
-------------------------------
Request:
- Refresh instance network cache with new interfaces (get_instance_nw_info)
- Unconditionally add duplicate interfaces to cache.

Revision history for this message
Mark Goddard (mgoddard) wrote :
Changed in nova:
assignee: nobody → Mark Goddard (mgoddard)
Revision history for this message
Mark Goddard (mgoddard) wrote :

Possibly caused by the fix for bug #1407664, which allows an empty instance info cache to be updated based on Neutron ports. The unit test 'test_get_instance_nw_info_ignores_neutron_ports' seems to suggest that during a cache refresh, ports seen in Neutron with a matching instance/device ID should be ignored. However, the test only covers the case where there is already a network in the cache. Adding a similar test case for when there is no network in the cache currently fails. Without having looked into it too much, I think that the fix for bug #1467581 may also fix #1407664, in which case that fix could potentially be reverted.

Revision history for this message
Mark Goddard (mgoddard) wrote :

Admittedly, the fix for bug #1407664 went in before that for #1467581.

Changed in nova:
importance: Undecided → Medium
status: New → Confirmed
tags: added: network
Revision history for this message
Mark Goddard (mgoddard) wrote :
Download full text (3.8 KiB)

On Ubuntu 14.04 / devstack / stable/kilo:

I wrote a script to boot an instance with 10 network interfaces using the heat stack attached, then check that all interfaces are present in the instance info cache without duplicates. This covers this bug and bug #1407664.

#!/bin/bash

(source openrc admin ; neutron quota-update --network -1 --subnet -1)

source openrc demo

EMPTY=0
FULL=0
for i in $(seq 1 20) ; do
        echo "Attempt $i"
        heat stack-create -f stack.yaml stack
        if [ $? -ne 0 ] ; then
                echo "heat stack-create failed"
                exit 1
        fi
        while heat stack-show stack | grep 'CREATE_IN_PROGRESS' ; do
                echo "in progress - sleep"
                sleep 5
        done
        nova list
        INTS=$(sudo virsh dumpxml $(sudo virsh list --name) | grep '<interface' | wc -l)
        if [ $INTS -ne 10 ] ; then
                echo "unexpected # of interfaces $INTS"
                exit 1
        fi
        if ! nova show instance | grep testnet ; then
                echo "empty cache"
                EMPTY=$((EMPTY + 1))
        else
                FULL=$((FULL + 1))
        fi
        heat stack-delete stack
        while heat stack-show stack > /dev/null ; do
                echo Sleeping
                sleep 10
        done
        echo empty $EMPTY full $FULL
        echo ""
done

I also increased the likelihood of hitting the race condition with the following patch:

diff --git a/nova/compute/manager.py b/nova/compute/manager.py
index 99c5b04..907ffd7 100644
--- a/nova/compute/manager.py
+++ b/nova/compute/manager.py
@@ -6540,6 +6540,7 @@ class ComputeManager(manager.Manager):
                       {'event': event.key},
                       instance=instance)
             if event.name == 'network-changed':
+ self.network_api.net_changed.wait()
                 self.network_api.get_instance_nw_info(context, instance)
             else:
                 self._process_instance_event(instance, event)
diff --git a/nova/network/neutronv2/api.py b/nova/network/neutronv2/api.py
index f80a999..8c1fc06 100644
--- a/nova/network/neutronv2/api.py
+++ b/nova/network/neutronv2/api.py
@@ -228,6 +228,8 @@ class API(base_api.NetworkAPI):
         super(API, self).__init__(skip_policy_check=skip_policy_check)
         self.last_neutron_extension_sync = None
         self.extensions = {}
+ import threading
+ self.net_changed = threading.Event()

     def setup_networks_on_host(self, context, instance, host=None,
                                teardown=False):
@@ -523,6 +525,7 @@ class API(base_api.NetworkAPI):
         created_port_ids = []
         ports_in_requested_order = []
         nets_in_requested_order = []
+ self.net_changed.clear()
         for request in ordered_networks:
             # Network lookup for available network_id
             network = None
@@ -580,6 +583,7 @@ class API(base_api.NetworkAPI):
             port_ids=ports_in_requested_order,
             admin_client=admin_client,
             preexisting_port_ids=preexisting_port_ids)
+ self.net_changed.set()
         # NOTE(danms): Only return info about ports we cr...

Read more...

Revision history for this message
SecurityFun23 (securityfun23) wrote :

Mark,

      Just to provide more info around these related issues, in our setup we applied the fix for #1407664 to fix problems that we were seeing on VM Creation. That definitely helped, but we still had issues (especially when many interfaces were used). So then we looked into the fix for #1467581. That resulted in the issue that you have documented here. I had tried an alternate modification to fix this bug (which seemed to work better), but the removal of the code for #1407664 may be more appropriate. I will try that in our setup soon.

Thanks!

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/230919

Changed in nova:
status: Confirmed → In Progress
Changed in nova:
assignee: Mark Goddard (mgoddard) → Roman Podoliaka (rpodolyaka)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: master
Review: https://review.openstack.org/260462

Changed in nova:
assignee: Roman Podoliaka (rpodolyaka) → sahid (sahid-ferdjaoui)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: master
Review: https://review.openstack.org/265363

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (master)

Change abandoned by sahid (<email address hidden>) on branch: master
Review: https://review.openstack.org/265363

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Change abandoned by sahid (<email address hidden>) on branch: master
Review: https://review.openstack.org/260462

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Change abandoned by sahid (<email address hidden>) on branch: master
Review: https://review.openstack.org/265363

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (master)

Reviewed: https://review.openstack.org/252565
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=3031adb857993d8196b4c9febca51ac82cf35fd6
Submitter: Jenkins
Branch: master

commit 3031adb857993d8196b4c9febca51ac82cf35fd6
Author: David Edery <email address hidden>
Date: Wed Dec 2 19:56:54 2015 +0100

    ports & networks gather should validate existance

    _gather_port_ids_and_networks assumes that the input networks variable
    doesn't contain the networks in ifaces. This is a wrong assumption
    ever since the introduction of the "refresh_cache-<instance id>" locks to
    the process in the Liberty cycle (see related bugs) and the
    "Refactor network API 'get_instance_nw_info'" patchset
    (https://review.openstack.org/#/c/146036/).

    The fix validates that the networks stated in ifaces doen't exist in the
    gotten networks list.

    Duplicate networks were observed at the following closed/related bugs.

    Change-Id: I8c2c9e3c89babbe5e48c5129b9854013690b38f6
    Closes-Bug: #1522112
    Related-Bug: #1467581
    Related-Bug: #1501735

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (stable/liberty)

Related fix proposed to branch: stable/liberty
Review: https://review.openstack.org/270686

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/liberty)

Fix proposed to branch: stable/liberty
Review: https://review.openstack.org/270814

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/230919
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=8694c1619d774bb8a6c23ed4c0f33df2084849bc
Submitter: Jenkins
Branch: master

commit 8694c1619d774bb8a6c23ed4c0f33df2084849bc
Author: Mark Goddard <email address hidden>
Date: Thu Oct 1 17:37:45 2015 +0100

    network: Don't repopulate instance info cache from Neutron ports

    Allocation of network interfaces for an instance can result in
    corruption of the instance info cache. The result is that the cache
    may contain duplicate entries for network interfaces. This can cause
    instance boot failure. This bug appears to be attributable to the
    combined effects of the fixes for bugs #1467581 and #1407664.

    This change reverts the fix for bug #1407664, whilst keeping a
    modified version of the unit test that was added with it. It also
    adds a second unit test.

    Change-Id: I53d5284907d44ae8b5546993f8fd461b385c39e6
    Closes-bug: #1501735
    Related-bug: #1467581
    Related-bug: #1407664

Changed in nova:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (stable/liberty)

Reviewed: https://review.openstack.org/270686
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=adc21bb0c62db822c3d656c4721f7b7c3ad41b7a
Submitter: Jenkins
Branch: stable/liberty

commit adc21bb0c62db822c3d656c4721f7b7c3ad41b7a
Author: David Edery <email address hidden>
Date: Wed Dec 2 19:56:54 2015 +0100

    ports & networks gather should validate existance

    _gather_port_ids_and_networks assumes that the input networks variable
    doesn't contain the networks in ifaces. This is a wrong assumption
    ever since the introduction of the "refresh_cache-<instance id>" locks to
    the process in the Liberty cycle (see related bugs) and the
    "Refactor network API 'get_instance_nw_info'" patchset
    (https://review.openstack.org/#/c/146036/).

    The fix validates that the networks stated in ifaces doen't exist in the
    gotten networks list.

    Duplicate networks were observed at the following closed/related bugs.

    Change-Id: I8c2c9e3c89babbe5e48c5129b9854013690b38f6
    Closes-Bug: #1522112
    Related-Bug: #1467581
    Related-Bug: #1501735
    (cherry picked from commit 3031adb857993d8196b4c9febca51ac82cf35fd6)

tags: added: in-stable-liberty
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/liberty)

Reviewed: https://review.openstack.org/270814
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=ce392292353c2b289aa3247b3c401f1cf7e1506c
Submitter: Jenkins
Branch: stable/liberty

commit ce392292353c2b289aa3247b3c401f1cf7e1506c
Author: Mark Goddard <email address hidden>
Date: Thu Oct 1 17:37:45 2015 +0100

    network: Don't repopulate instance info cache from Neutron ports

    Allocation of network interfaces for an instance can result in
    corruption of the instance info cache. The result is that the cache
    may contain duplicate entries for network interfaces. This can cause
    instance boot failure. This bug appears to be attributable to the
    combined effects of the fixes for bugs #1467581 and #1407664.

    This change reverts the fix for bug #1407664, whilst keeping a
    modified version of the unit test that was added with it. It also
    adds a second unit test.

    Conflicts:
     nova/tests/unit/network/test_neutronv2.py

    This is because If0a1f3850d92d011aae32ae34e8c8664f2ee4313 isn't in
    stable/liberty

    Change-Id: I53d5284907d44ae8b5546993f8fd461b385c39e6
    Closes-bug: #1501735
    Related-bug: #1467581
    Related-bug: #1407664
    (cherry picked from commit 8694c1619d774bb8a6c23ed4c0f33df2084849bc)

Revision history for this message
Doug Hellmann (doug-hellmann) wrote : Fix included in openstack/nova 12.0.4

This issue was fixed in the openstack/nova 12.0.4 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (master)

Change abandoned by Matt Riedemann (<email address hidden>) on branch: master
Review: https://review.opendev.org/640516

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.