Race condition when hard rebooting instance

Bug #1328546 reported by mouadino
This bug affects 18 people
Affects    Status         Importance   Assigned to   Milestone
neutron    Fix Released   High         Li Ma
Juno       Fix Released   Undecided    Unassigned

Bug Description

Conditions for this to happen:
======================

1. Agent: neutron-linuxbridge-agent.
2. Only one instance on the hypervisor belongs to this network.
3. Timing; it's a race condition after all ;-)

Observed behavior:
================

After a hard reboot, the instance ends up in ERROR state and nova-compute logs the following error:

    Cannot get interface MTU on 'brqf9d0e8cf-bd': No such device

What happens:
===========

When nova does a hard reboot, the instance is first destroyed, which means the tap device is deleted from the Linux bridge (leaving the bridge empty because of condition 2 above), and is then re-created. In between, neutron-linuxbridge-agent may clean up this empty bridge as part of its remove_empty_bridges()[1]. For this error to occur, neutron-linuxbridge-agent has to do that after plug_vifs()[2] and before domain.createWithFlags() finishes.

[1]: https://github.com/openstack/neutron/blob/stable/icehouse/neutron/plugins/linuxbridge/agent/linuxbridge_neutron_agent.py#L449
[2]: https://github.com/openstack/nova/blob/stable/icehouse/nova/virt/libvirt/driver.py#L3648-3656
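
The ordering described above can be illustrated with a toy model. This is only a sketch: FakeHost and hard_reboot are hypothetical names, no nova or neutron code is called, and a bridge counts as "empty" here simply when no tap interface is attached to it.

```python
class FakeHost:
    """Toy model of a hypervisor: bridge name -> set of attached tap devices."""

    def __init__(self):
        self.bridges = {}

    def ensure_bridge(self, bridge):
        # Stands in for the bridge creation done while plugging VIFs.
        self.bridges.setdefault(bridge, set())

    def remove_empty_bridges(self):
        # Stands in for the agent's cleanup of bridges with no taps.
        for name in [b for b, taps in self.bridges.items() if not taps]:
            del self.bridges[name]

    def destroy_instance(self, bridge, tap):
        # Destroying the instance removes its tap from the bridge.
        self.bridges[bridge].discard(tap)

    def create_domain(self, bridge, tap):
        # Starting the domain needs the bridge to still exist.
        if bridge not in self.bridges:
            raise RuntimeError(
                "Cannot get interface MTU on '%s': No such device" % bridge)
        self.bridges[bridge].add(tap)


def hard_reboot(host, bridge, tap, agent_runs_in_window):
    """Replay the hard-reboot steps, optionally with the agent in the gap."""
    host.destroy_instance(bridge, tap)
    host.ensure_bridge(bridge)
    if agent_runs_in_window:
        # The race: cleanup lands after plugging, before domain creation.
        host.remove_empty_bridges()
    host.create_domain(bridge, tap)
```

With agent_runs_in_window=False the reboot succeeds; with True, create_domain fails with the "No such device" error, which matches the window between plug_vifs() and domain.createWithFlags() described above.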

Changed in neutron:
status: New → Incomplete
status: Incomplete → Confirmed
Cristian Tomoiaga (ctomoiaga) wrote :

The same issue (related to bridge cleanup):
https://bugs.launchpad.net/neutron/+bug/1293540

Changed in neutron:
importance: Undecided → Medium
tags: added: lb
Waqas Riaz (waqasriaz)
Changed in neutron:
assignee: nobody → Waqas Riaz (waqasriaz)
Li Ma (nick-ma-z) wrote :

Is it possible not to remove the empty bridge in the lb-agent? Would that have any side effects?

Darragh O'Reilly (darragh-oreilly) wrote :

@Li Ma - yes that would be possible. In fact that is the way it worked before vxlan support was introduced in https://github.com/openstack/neutron/commit/7e79e6973e879bc14ca02977fe98e8382b507ea2 and that patch caused this bug and the cold snapshot bug (even if you are not using vxlan). Leaving unused empty bridges around is not so nice, but it is better than these bugs.

Li Ma (nick-ma-z) wrote :

I suggest providing a 'remove_empty_bridge' option and doing the cleanup offline via a script.

@Waqas Hello, are you working on this bug? What do you think of it?

Li Ma (nick-ma-z)
Changed in neutron:
assignee: Waqas Riaz (waqasriaz) → Li Ma (nick-ma-z)
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.openstack.org/173207

Changed in neutron:
status: Confirmed → In Progress
Darragh O'Reilly (darragh-oreilly) wrote :

I'm not able to reproduce this any more.

Steps:
 - create a network and subnet with dhcp disabled
 - start instance on it
    $ brctl show
    bridge name      bridge id           STP enabled   interfaces
    brq86bd57c8-c3   8000.080027f0370f   no            eth2.100
                                                       tap0ca1273d-01
 - nova reboot --hard vm1

Nova now calls ensure_bridge after restarting the instance:

DEBUG nova.network.linux_net [-] Starting Bridge brq86bd57c8-c3 ensure_bridge /opt/stack/nova/nova/network/linux_net.py:1601

OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (master)

Change abandoned by Li Ma (<email address hidden>) on branch: master
Review: https://review.openstack.org/173207
Reason: the bug is fixed in nova.

Li Ma (nick-ma-z)
Changed in neutron:
status: In Progress → Invalid
Ahmed Rahal (arahal) wrote :

Hi,

Any hint on which commit in nova fixes this specific issue?

Thanks,

Mathieu Gagné (mgagne) wrote :

The race condition still exists.

Neutron happens to delete the bridge [1] between plug_vifs (ensure_bridge) [2] and _create_domain [3] if your instance is the only one using the bridge.

I fixed it locally by disabling remove_empty_bridges and using a cron job to delete empty bridges instead.

[1] https://github.com/openstack/neutron/blob/6cf92011143eb55adda180ffac91886566fc7826/neutron/plugins/linuxbridge/agent/linuxbridge_neutron_agent.py#L926
[2] https://github.com/openstack/nova/blob/f9e664f1b6521dd7d5c02cd803e376e8abdf9c30/nova/virt/libvirt/driver.py#L4383
[3] https://github.com/openstack/nova/blob/f9e664f1b6521dd7d5c02cd803e376e8abdf9c30/nova/virt/libvirt/driver.py#L4390
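
For illustration, the detection half of such a cron job could look like the sketch below. It is not the script used above: find_empty_bridges and its parameters are hypothetical, it assumes the "brq" and "tap" name prefixes seen in this report, and it relies on Linux exposing a bridge's ports under /sys/class/net/<bridge>/brif. It only lists candidates and deletes nothing.

```python
import os


def find_empty_bridges(sysfs_root="/sys/class/net",
                       bridge_prefix="brq", tap_prefix="tap"):
    """Return names of brq* bridges that have no tap* port attached."""
    empty = []
    for name in sorted(os.listdir(sysfs_root)):
        brif = os.path.join(sysfs_root, name, "brif")
        # Only devices with a brif directory are bridges.
        if not name.startswith(bridge_prefix) or not os.path.isdir(brif):
            continue
        # A bridge holding only a physical/VLAN port still counts as empty.
        if not any(port.startswith(tap_prefix) for port in os.listdir(brif)):
            empty.append(name)
    return empty


if __name__ == "__main__":
    for bridge in find_empty_bridges():
        print(bridge)
```

Deleting a bridge found this way can still hit the same race if an instance on that network is being rebooted at that moment, so a job like this is best scheduled when no instance operations are expected.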

Changed in neutron:
status: Invalid → New
Sean M. Collins (scollins) wrote :
Changed in neutron:
status: New → Confirmed
tags: added: linuxbridge linuxbridge-gate-parity
removed: lb
Kyle Mestery (mestery)
Changed in neutron:
importance: Medium → High
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (feature/qos)

Related fix proposed to branch: feature/qos
Review: https://review.openstack.org/197990

Sean M. Collins (scollins) wrote :

A partial fix was committed in
https://review.openstack.org/197162

Changed in neutron:
status: Confirmed → Fix Committed
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (feature/qos)

Reviewed: https://review.openstack.org/197990
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=8a0e11143cadfc925e6f986fb39d73e6879a8fae
Submitter: Jenkins
Branch: feature/qos

commit f1771131a85a2fe633126f354364205554ef71d1
Author: Kevin Benton <email address hidden>
Date: Wed Jul 1 13:06:38 2015 -0700

    Change the half of the bridge name used for ports

    The code to generate the names of the patch ports
    was based on a chunk of the bridge name starting from
    the beginning. With the long suffix, this ended up
    excluding all of the random characters in the name.
    (e.g. br-int374623235 would create an interface br-in-patch-tun).

    This meant that if two tests using patch interfaces ran together,
    they would have a name collision and one would fail.

    This patch updates the patch port name generation to use the
    randomized back portion of the name.

    Change-Id: I172e0b2c0b53e8c7151bd92f0915773ea62c0c6a
    Closes-Bug: #1470637

commit 49569327c20d8a10ba3d426833ff28d68b1b7a27
Author: armando-migliaccio <email address hidden>
Date: Wed Jul 1 12:00:14 2015 -0700

    Fix log traces induced by retry decorator

    Patch 4e77442d5 added a retry decorator to the API layer
    to catch DB deadlock errors. However, when they occur, the
    retried operation ends up being ineffective because the original
    body has been altered, which leads the notification and validation
    layers to barf exceptions due to unrecognized/unserializable elements.

    This ultimately results to an error reported to the user.

    To address this, let's make a deep copy of the request body, before
    we pass it down to the lower layers. This allows the decorator to
    work on a pristine copy of the body on every attempt. The performance
    impact for this should be negligible.

    Closes-bug: #1470615

    Change-Id: I82a2a002612d28fa8f97b0afbd4f7ba1e8830377

commit cf8c9e40c8720036bd0c06bd8370f88a472e3e6f
Author: Fawad Khaliq <email address hidden>
Date: Tue Jun 30 02:17:19 2015 -0700

    Update PLUMgrid plugin information

    README was quite oudated and created confusion
    among users.

    Updated the information after decomposition.

    Change-Id: I78bf8dec20ba2ceb644d4565035d29bbf53cb3b5

commit 8dd8a7d93564168b98fa2350eedf56acede42b0f
Author: Sean M. Collins <email address hidden>
Date: Tue Jun 30 12:06:07 2015 -0400

    Remove bridge cleanup call

    Remove the bridge cleanup call to delete bridges, since we are seeing
    race conditions where bridges are deleted, then new interfaces are
    created and are attempting to plug into the bridge before it is
    recreated.

    Change-Id: I4ccc96566a5770384eacbbdc492bf09a514f5b31
    Related-Bug: #1328546

commit 4dc68ea88bf4f07b13253bf9eeedffe22b1f8013
Author: Kevin Benton <email address hidden>
Date: Thu May 28 23:13:19 2015 -0700

    Read vif port information in bulk

    During startup, the agent was making many calls per port
    to read information about the current VLAN, external ID, etc.
    This resulted in hundreds of calls just to read information about
    a relatively small num...


tags: added: in-feature-qos
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (feature/pecan)

Related fix proposed to branch: feature/pecan
Review: https://review.openstack.org/200163

OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (feature/pecan)

Reviewed: https://review.openstack.org/200163
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=ec799c458976d5bdc03f36fa4bf56c8ca0160614
Submitter: Jenkins
Branch: feature/pecan

commit a0a022373b90835059b8949a57b097030bcbc37e
Author: John Davidge <email address hidden>
Date: Tue Jul 7 17:00:01 2015 +0100

    Fix issues with allocation pool generation for ::/64 cidr

    Passing a ::/64 cidr to certain netaddr functions without specifying
    the ip_version causes errors. Fix this by specifying ip_version.

    Change-Id: I31aaf9f5dabe4dd0845507f245387cd4186c410c
    Closes-Bug: 1472304

commit c28b6b0ef8606abea00eeea4fde96a4f646da952
Author: Brian Haley <email address hidden>
Date: Tue Jul 7 17:03:04 2015 -0400

    Remove lingering traces of q_

    The rename from Quantum to Neutron left a few q_ strings
    around, let's go ahead and clean them up.

    Change-Id: I06e6bdbd0c2f3a25bb90b5fa291009b9ec2d471d

commit 5b6ca5ce898a2e9a810ec49a1712337a41822788
Author: armando-migliaccio <email address hidden>
Date: Tue Jul 7 11:13:41 2015 -0700

    Make sure path_prefix is set during unit tests

    Change 18bc67d5 broke *-aas unit tests.

    This change ensures that mocking is done correctly, the same way
    it is done for the other plugin attributes

    Change-Id: I4167f18560e3a3aad652aae1ea9d3c6bc34dc796
    Closes-bug: #1472361

commit 13b0f6f8e2fd1e84ff3580cd75bb879e18064da6
Author: Carl Baldwin <email address hidden>
Date: Tue Jul 7 16:41:03 2015 +0000

    Add IP_ANY dict to ease choosing between IPv4 and IPv6 "any" address

    I'm working on a new patch that will add one more case where we need
    to choose between 0.0.0.0/0 and ::/0 based on the ip version. I
    thought I'd add a new constant and simplify a couple of existing uses.

    Change-Id: I376d60c7de4bafcaf2387685ddcc1d98978ce446

commit a863342caf7da9a1c0430549c1ea1e53408b34af
Author: Cyril Roelandt <email address hidden>
Date: Tue Jul 7 14:25:06 2015 +0000

    Python3: cast the result of zip() to list

    The result of get_sorts was a 'zip object' in Python 3, and it was later used
    as a list, which fails. Just cast the result to a list to fix this issue.

    Change-Id: I12017f79cad92b1da4fe5f9939b38436db7219eb
    Blueprint: neutron-python3

commit 8b13609edac2c136e1a0acbc05ad93059bb59fc1
Author: Pavel Bondar <email address hidden>
Date: Thu Jul 2 11:35:18 2015 +0300

    Track allocation_pools in SubnetRequest

    To keep pluggable and non-pluggable ipam implementation consistent
    non-pluggable one has to be switched to track allocation_pools and
    gateway_ip using SubnetRequests.
    SubnetRequest requires allocation_pools to be list of IPRanges.
    Previously allocation_pools were tracked as list of dicts.
    So allocation_pools generating and validating was moved before
    SubnetRequest is created.

    Partially-Implements: blueprint neutron-ipam

    Change-Id: I8d2fec3013b302db202121f946b53a0610ae8321

commit 04197bc4bbf2bc611371060db839028c2686f87a
Author: Kevin Benton <email address hidden>
Date: Mon Jun 29 21:05:08 2015 -0700

    Add ARP spoofing protection for ...

tags: added: in-feature-pecan
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (stable/kilo)

Related fix proposed to branch: stable/kilo
Review: https://review.openstack.org/202845

Changed in neutron:
milestone: none → liberty-2
status: Fix Committed → Fix Released
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (stable/kilo)

Reviewed: https://review.openstack.org/202845
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=58276644841ddee5300a64a50e500e908554d4c5
Submitter: Jenkins
Branch: stable/kilo

commit 58276644841ddee5300a64a50e500e908554d4c5
Author: Sean M. Collins <email address hidden>
Date: Tue Jun 30 12:06:07 2015 -0400

    Remove bridge cleanup call

    Remove the bridge cleanup call to delete bridges, since we are seeing
    race conditions where bridges are deleted, then new interfaces are
    created and are attempting to plug into the bridge before it is
    recreated.

    Change-Id: I4ccc96566a5770384eacbbdc492bf09a514f5b31
    Related-Bug: #1328546
    (cherry picked from commit 8dd8a7d93564168b98fa2350eedf56acede42b0f)

tags: added: in-stable-kilo
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (stable/juno)

Related fix proposed to branch: stable/juno
Review: https://review.openstack.org/218692

OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (stable/juno)

Reviewed: https://review.openstack.org/218692
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=6410e017707538629b78313da033a601866d33b5
Submitter: Jenkins
Branch: stable/juno

commit 6410e017707538629b78313da033a601866d33b5
Author: Jordan Callicoat <email address hidden>
Date: Sun Aug 30 19:27:28 2015 -0500

    Remove bridge cleanup call

    Remove the bridge cleanup call to delete bridges, since we are seeing
    race conditions where bridges are deleted, then new interfaces are
    created and are attempting to plug into the bridge before it is
    recreated.

    Change-Id: I4ccc96566a5770384eacbbdc492bf09a514f5b31
    Related-Bug: #1328546
    (cherry picked from commit 8dd8a7d)

tags: added: in-stable-juno
Matt Riedemann (mriedem) wrote :

Related to comment 9, this is mgagne's patch script to remove empty linux bridges:

https://gist.github.com/mgagne/cbc9762cec9fa24294be

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/221508

OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (master)

Reviewed: https://review.openstack.org/221508
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=27f60c314bc9de5d81571de1437f93ca232f1382
Submitter: Jenkins
Branch: master

commit 27f60c314bc9de5d81571de1437f93ca232f1382
Author: Mathieu Gagné <email address hidden>
Date: Tue Sep 8 17:07:07 2015 -0400

    Add neutron-linuxbridge-cleanup util

    Removal of empty bridges have been disabled [1] to fix a race condition
    between Nova and Neutron where a bridge would be removed if
    the only instance using it is rebooted. This means empty bridges
    will pile up over time.

    This script can be used to periodically remove empty bridges by running it
    on compute nodes.

    Note: Usage of this script can still trigger the original race condition.
    It should be used when you don't expect anyone do be doing operations
    on their instances.

    [1] Commit 8dd8a7d93564168b98fa2350eedf56acede42b0f

    DocImpact: Add neutron-linuxbridge-cleanup util
    Related-bug: #1328546
    Closes-bug: #1497027
    Co-Authored-By: Cedric Brandily <email address hidden>
    Change-Id: Ieb2796381579ad295abf361ce483d979a53d2bd6

OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/juno)

Fix proposed to branch: stable/juno
Review: https://review.openstack.org/233702

Bjoern (bjoern-t) wrote :

Hi folks,

It seems that the review https://review.openstack.org/#/c/218692 didn't make it into the branch, and I have not found a conflicting commit. As of the stable/juno branch the code looks like this, where the critical change would be around line 944:

            else:
                LOG.debug(_("Device %s not defined on plugin"), device)
            self.br_mgr.remove_empty_bridges()
        if self.prevent_arp_spoofing:
            arp_protect.delete_arp_spoofing_protection(devices)
        return resync

Resubmitting it via https://review.openstack.org/233702

Bjoern (bjoern-t) wrote :

It seems that the review https://review.openstack.org/#/c/209708/ was not rebased when it was committed into the branch

Thierry Carrez (ttx)
Changed in neutron:
milestone: liberty-2 → 7.0.0
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/juno)

Reviewed: https://review.openstack.org/233702
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=b1263ff1b3cb5700136e2de136c9330e296c0f2c
Submitter: Jenkins
Branch: stable/juno

commit b1263ff1b3cb5700136e2de136c9330e296c0f2c
Author: Bjoern Teipel <email address hidden>
Date: Mon Oct 12 11:07:11 2015 -0500

    Resubmit of "Remove bridge cleanup call" fix

    This commit reintroduces the "Remove bridge cleanup call" fix which
    was accidentally reverted in a later change:

    https://review.openstack.org/#/c/209708/

    The original description for this fix is:

    Remove the bridge cleanup call to delete bridges, since we are seeing
    race conditions where bridges are deleted, then new interfaces are
    created and are attempting to plug into the bridge before it is
    recreated.

    Change-Id: I10ffb8fc295bbcc0f9c1a9597ae0b9272ed69c18
    Closes-Bug: #1328546
    Co-Authored-By: Sean M. Collins <email address hidden>
    Co-Authored-By: Jordan Callicoat <email address hidden>

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (stable/liberty)

Related fix proposed to branch: stable/liberty
Review: https://review.openstack.org/251752

OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (stable/liberty)

Change abandoned by John Schwarz (<email address hidden>) on branch: stable/liberty
Review: https://review.openstack.org/251752
