auto_allocated_topology network creation isn't atomic per tenant

Bug #1591766 reported by Matt Riedemann
Affects: neutron
Status: Fix Released
Importance: Low
Assigned to: Armando Migliaccio

Bug Description

I've written a Tempest test that creates two servers in nova while the tenant has no networking available at the time the server create requests are made. This invokes the auto_allocated_topology API in neutron to create the networking resources for the tenant.

The test failed because it expects that, at the end, only one private network has been created for the tenant, but in this case there were two:

http://logs.openstack.org/01/327901/1/check/gate-tempest-dsvm-neutron-full/9972b81/console.html#_2016-06-12_15_19_25_240

This shows that the network create wasn't atomic per tenant under concurrent requests from nova's compute manager.

Revision history for this message
Kevin Benton (kevinbenton) wrote :

Currently neutron does catch this and recover from it [1], but there is a small window during which the duplicate network is visible in the network listing. Network creation can require coordination with an external backend, so it's hard to do this atomically.

We can definitely figure out a way to hide it if we have to, but is it possible for Nova to just call the auto allocate API to get the network that it needs to use? During the development of auto allocate we discussed this specific issue and assumed nova would just get the auto allocated topology via the auto allocate API to avoid the problem.

1. http://logs.openstack.org/01/327901/1/check/gate-tempest-dsvm-neutron-full/9972b81/logs/screen-q-svc.txt.gz#_2016-06-12_14_46_57_349

Revision history for this message
Matt Riedemann (mriedem) wrote :

As discussed with Kevin in IRC, we still have a potential race in nova. When no networks are provided in the boot request, nova checks for available networks, building the list from:

1. private tenant-owned networks
2. public shared=True networks

If that's empty, and auto was requested, nova calls the auto-allocated-topology API.
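The lookup order described above can be sketched roughly as follows. This is only an illustration, not nova's actual code: `pick_networks`, the dict layout of a network, and the `auto_allocate` callable (standing in for the auto-allocated-topology API call) are all hypothetical names.

```python
def pick_networks(tenant_id, networks, auto_requested, auto_allocate):
    """Return the networks a boot request would consider.

    `networks` is a list of dicts with 'tenant_id' and 'shared' keys
    (an assumed, simplified shape). `auto_allocate` is a hypothetical
    callable standing in for the auto-allocated-topology API.
    """
    # 1. private tenant-owned networks
    available = [n for n in networks
                 if n["tenant_id"] == tenant_id and not n["shared"]]
    # 2. public shared=True networks
    available += [n for n in networks if n["shared"]]
    # Only fall back to auto allocation when nothing was found.
    if not available and auto_requested:
        return [auto_allocate(tenant_id)]
    return available
```

Because the list is built first and the allocation call happens afterwards, two requests can both observe an empty list and both reach the fallback.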

But let's say there are 3 concurrent server create requests for the tenant.

1. first gets in and creates the 'real' network
2. second gets in and calls auto-allocated-topology because it didn't see the network from #1 yet, but neutron sees the network from #1 and logs that error and returns the network from #1.
3. the third request is building the list of available networks, and sees the first two networks created already - this will result in an error on the nova side:

https://review.openstack.org/#/c/316275/12/nova/network/neutronv2/api.py@574

The error is the ambiguous-network error, raised because more than one network is available.

If there were some way to filter the networks up front on the nova side, so that it doesn't pick up both networks 1 and 2 from the scenario above (for example, if the network's status were ALLOCATING or something similar), then nova could filter the bad one out up front.
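A minimal sketch of that filtering idea, under stated assumptions: the `ALLOCATING` status value is the hypothetical one floated above (neutron does not necessarily expose it), and `filter_ready` and `select_single` are illustrative names rather than nova functions.

```python
def filter_ready(networks, transient_status="ALLOCATING"):
    """Drop networks still marked with the (hypothetical) transient status."""
    return [n for n in networks if n.get("status") != transient_status]


def select_single(networks):
    """Pick the one usable network, or fail if the choice is ambiguous."""
    ready = filter_ready(networks)
    if len(ready) > 1:
        # Stand-in for nova's ambiguous-network error.
        raise RuntimeError("ambiguous: more than one network available")
    return ready[0] if ready else None
```

With such a marker, the third request in the scenario would see one usable network instead of two and would not hit the ambiguity check.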

Realistically this is an edge case, and it would be a one time failure for a tenant that doesn't already have a network and is doing multiple boot requests on that first try - but for a new user it would be a pretty crappy experience having your first server create request fail.

Revision history for this message
Armando Migliaccio (armando-migliaccio) wrote :

The issue from the bug description should be transient, as Kevin pointed out, therefore a simple wait_until_true should solve the issue.
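For illustration, a polling helper in the spirit of wait_until_true might look like the sketch below. It is a self-contained stand-in written for this example, not the helper from the neutron or tempest trees; the signature and defaults are assumptions.

```python
import time


def wait_until_true(predicate, timeout=5.0, sleep=0.1):
    """Poll `predicate` until it returns True or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while not predicate():
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %.1fs" % timeout)
        time.sleep(sleep)
```

In the test, the predicate would list the tenant's networks and return True once exactly one remains, which tolerates the transient duplicate.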

Having said that, the chance of unwillingly 'picking up' networks is a behavior that predates the introduction of the auto allocated extension. Even without the auto allocation logic, a vm boot request that interleaves with a network create request may lead to a 'multiple network exception' depending on the state of the deployment. Obviously the use of the extension makes the issue more likely, because Nova is now the one creating and getting networks during the boot process.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.openstack.org/329006

Changed in neutron:
assignee: nobody → Armando Migliaccio (armando-migliaccio)
status: Confirmed → In Progress
Revision history for this message
Armando Migliaccio (armando-migliaccio) wrote :

To be honest, I am not overly excited about the proposed fix, as it seems tailor-made to keep [1] from going into error. But it is one of the possible strategies that could be pursued. The other would be to make Nova filter out the auto allocated networks in whatever piece of code triggers the transient creation.

[1] https://review.openstack.org/#/c/316275/12/nova/network/neutronv2/api.py@574

Changed in neutron:
importance: High → Low
Matt Riedemann (mriedem)
tags: added: mitaka-backport-potential
Changed in neutron:
assignee: Armando Migliaccio (armando-migliaccio) → Kevin Benton (kevinbenton)
Revision history for this message
Armando Migliaccio (armando-migliaccio) wrote :

I'll change the approach according to the last empirical experiments.

Changed in neutron:
assignee: Kevin Benton (kevinbenton) → Armando Migliaccio (armando-migliaccio)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (master)

Reviewed: https://review.openstack.org/331424
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=877778ee4c7f0e83b54f54b2d7bafec98f89626c
Submitter: Jenkins
Branch: master

commit 877778ee4c7f0e83b54f54b2d7bafec98f89626c
Author: Armando Migliaccio <email address hidden>
Date: Sat Jun 18 08:19:03 2016 -0700

    Move DHCP notification logic out of API controller

    Bug 1591766 unveiled an issue where calling the plugin API does not trigger
    DHCP notifications. This is required by the auto-allocated-topology service
    plugin that calls core_plugin.update_network(), and expect notifications
    to be sent out on state changes. To accomplish this, the logic has been
    encapsulated in the DHCP module, and leveraged via callback mechanisms.

    For this reason, new events have been introduced, AFTER_REQUEST, and
    BEFORE_RESPONSE. The latter in particular is the one needed to hook up
    dhcp notifications in order to preserve backward compatibility.

    More precisely, core plugins that use DHCP as is or implement their own,
    (with or without an agent) should already instantiate their own notifier,
    and if they do not, this should be rectified.

    A search on codesearch.openstack.org reveals that out-of-tree plugins
    already specify their own notifiers, and the default initialization is
    clearly redundant now.

    Related-bug: #1591766

    Change-Id: I7440becb6d30af7159ecaeba09d7a28eceb71bea

Changed in neutron:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.openstack.org/329006
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=d91a4e1930bd8f05f7b2055a2278f5bd788cd6b4
Submitter: Jenkins
Branch: master

commit d91a4e1930bd8f05f7b2055a2278f5bd788cd6b4
Author: Armando Migliaccio <email address hidden>
Date: Mon Jun 13 05:43:37 2016 -0700

    Create auto allocated networks in disabled state

    Under particular circumstances, multiple requests to the
    auto-allocated-topology extension may lead to the transient
    creation of duplicated resources. This is dealt with by the
    service plugin code, which cleans them up once the condition
    is detected. However the client may accidentally be impacted
    and potentially left in error (recoverable on retry).

    In order to address this error condition, the logic to
    provision the network for any given tenant is tweaked
    slightly so that the network is created in disabled state
    and re-enabled when it is safe to do so. A Neutron client
    should check the network status to see if the network is
    ready for use before getting its hands on it.

    Closes-bug: #1591766

    Change-Id: Ia6ff5ad975673875216eb470080dfc0dcf6b9ab2
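
The approach in the commit message above can be illustrated with a small in-memory sketch. Everything here is hypothetical: `FakeNeutron`, `provision`, `usable`, and the `topology` dict are stand-ins, and modeling "disabled" as `admin_state_up=False` is an assumption of this example rather than something the commit message spells out.

```python
import itertools


class FakeNeutron:
    """In-memory stand-in for a network API (not the real client)."""

    def __init__(self):
        self.networks = {}
        self._ids = itertools.count(1)

    def create_network(self, tenant_id, admin_state_up):
        net_id = "net-%d" % next(self._ids)
        self.networks[net_id] = {"id": net_id, "tenant_id": tenant_id,
                                 "admin_state_up": admin_state_up}
        return self.networks[net_id]

    def update_network(self, net_id, **fields):
        self.networks[net_id].update(fields)
        return self.networks[net_id]

    def delete_network(self, net_id):
        del self.networks[net_id]


def provision(client, topology, tenant_id):
    """Create the network disabled; enable it only if this request wins."""
    net = client.create_network(tenant_id, admin_state_up=False)
    # `topology` maps tenant -> network id; the first writer wins.
    winner = topology.setdefault(tenant_id, net["id"])
    if winner != net["id"]:
        client.delete_network(net["id"])  # clean up the duplicate
        return client.networks[winner]
    return client.update_network(net["id"], admin_state_up=True)


def usable(client, tenant_id):
    """What a careful client would consider: enabled networks only."""
    return [n for n in client.networks.values()
            if n["tenant_id"] == tenant_id and n["admin_state_up"]]
```

A duplicate created by a losing request is never enabled, so a client that filters the way `usable` does should not see it during the cleanup window.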

Revision history for this message
Armando Migliaccio (armando-migliaccio) wrote :

I am not 100% convinced that the cherry pick is necessary here. The fix relies on some refactoring, which may render other cherry picks more difficult.

Matt Riedemann (mriedem)
tags: removed: mitaka-backport-potential
Revision history for this message
Doug Hellmann (doug-hellmann) wrote : Fix included in openstack/neutron 9.0.0.0b2

This issue was fixed in the openstack/neutron 9.0.0.0b2 development milestone.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (stable/mitaka)

Related fix proposed to branch: stable/mitaka
Review: https://review.openstack.org/361634

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/mitaka)

Fix proposed to branch: stable/mitaka
Review: https://review.openstack.org/361636

tags: added: mitaka-backport-potential
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (stable/mitaka)

Change abandoned by Armando Migliaccio (<email address hidden>) on branch: stable/mitaka
Review: https://review.openstack.org/361634

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Change abandoned by Armando Migliaccio (<email address hidden>) on branch: stable/mitaka
Review: https://review.openstack.org/361636

tags: removed: mitaka-backport-potential