New race condition exposed when cleaning up floating ips on router delete

Bug #1373100 reported by Carl Baldwin
This bug affects 1 person
Affects: neutron
Status: Fix Released
Importance: Critical
Assigned to: Armando Migliaccio

Bug Description

The patch that cleans up floating ips on router deletion [1] has triggered a race condition that causes spurious failures in the dvr job in the check queue. Reverting this patch [2] has been shown to stabilize the job.

[1] https://review.openstack.org/#/c/120885/
[2] https://review.openstack.org/#/c/121729/

Changed in neutron:
milestone: none → juno-rc1
description: updated
Changed in neutron:
importance: Undecided → Critical
tags: added: l3-dvr-backlog
Revision history for this message
Carl Baldwin (carl-baldwin) wrote :

Some notes that I got from Rajeev:

Going through a couple of logs, it appears that the patch exposes a race condition. It looks like a delete router (A) message to the L3 agent clears up the fip namespace and the rtr2fip links while the L3 agent is in the middle of configuring a floating IP for another router (B).

When (B) first checks for the presence of agent_gateway_port, all appears good, but by the time (B) comes around to configure the rtr2fip or fip2rtr links, (A) has cleared up the agent_gateway_port and associated links, causing a failure.

...
Thanks,
-Rajeev.
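
The interleaving described in these notes is a check-then-act race. A minimal sketch of it in Python follows, replaying the three steps in a fixed order so the failure is reproducible. The class `FipNamespace` and its methods are hypothetical stand-ins for illustration only, not the L3 agent's actual code.

```python
class FipNamespace:
    """Hypothetical stand-in for the shared FIP state on one L3 agent."""

    def __init__(self):
        self.agent_gateway_port = {"id": "gw-port"}
        self.links = set()

    def gateway_present(self):
        return self.agent_gateway_port is not None

    def delete_router(self, router_id):
        # Router delete tears down the shared gateway port and all links,
        # even though other routers may still depend on them.
        self.agent_gateway_port = None
        self.links.clear()

    def add_link(self, router_id):
        if self.agent_gateway_port is None:
            raise RuntimeError("agent_gateway_port is gone")
        self.links.add(router_id)


def run_interleaving():
    ns = FipNamespace()
    # Step 1: router B checks for the gateway port; all appears good.
    check_ok = ns.gateway_present()
    # Step 2: router A's delete runs before B acts on its check.
    ns.delete_router("A")
    # Step 3: B acts on the now-stale check result and fails.
    try:
        if check_ok:
            ns.add_link("B")
        return "ok"
    except RuntimeError as exc:
        return "failed: %s" % exc


print(run_interleaving())
```

The point of the sketch is that the check in step 1 and the action in step 3 are not atomic, so any teardown that runs between them invalidates the check.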

Changed in neutron:
assignee: nobody → Rajeev Grover (rajeev-grover)
status: New → Confirmed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.openstack.org/123634

Changed in neutron:
status: Confirmed → In Progress
Revision history for this message
Salvatore Orlando (salvatore-orlando) wrote :

I think taking the revert is the best we can do at the moment.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: master
Review: https://review.openstack.org/123881

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (master)

Change abandoned by Rajeev Grover (<email address hidden>) on branch: master
Review: https://review.openstack.org/123881
Reason: Apologies. Did not intend this to be a new patch, wrong Change-Id slipped in.

Changed in neutron:
assignee: Rajeev Grover (rajeev-grover) → Carl Baldwin (carl-baldwin)
Revision history for this message
Carl Baldwin (carl-baldwin) wrote :

Don't be confused. This patch [1] has not been abandoned and is currently the best thing we have to address this issue.

[1] https://review.openstack.org/123634

Changed in neutron:
assignee: Carl Baldwin (carl-baldwin) → Armando Migliaccio (armando-migliaccio)
Revision history for this message
Armando Migliaccio (armando-migliaccio) wrote :

I spent the past few hours looking into this.

Fix [1] was a partial fix for bug [2]. The way I understand the commit message for [1] is that the patch was to ensure that FIP namespaces were cleared upon router delete operations. However, running master + revert [4] shows that under no circumstances do the FIP namespaces lie around after having deleted a VM (with or without disassociating the FIP first).

Therefore, purely on the basis that I believe that there is not enough clarity about what [1] is for, I propose we actually revert [1], and hence take [4] as a fix for this bug, as Salvatore also suggested.

We may still need to look into the circumstances behind the race condition, but I believe that keeping [1] as the baseline will skew the results of the investigation.

[1] - https://review.openstack.org/#/c/120885/
[2] - https://bugs.launchpad.net/neutron/+bug/1367588
[3] - https://review.openstack.org/#/c/120917/
[4] - https://review.openstack.org/#/c/121729/

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.openstack.org/121729
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=45a523681f2136f8fefb6c3da44540decd6a0fda
Submitter: Jenkins
Branch: master

commit 45a523681f2136f8fefb6c3da44540decd6a0fda
Author: armando-migliaccio <email address hidden>
Date: Mon Sep 15 18:40:08 2014 -0700

    Revert "Cleanup floatingips also on router delete"

    This reverts commit c3326996e38cb67f8d4ba3dabd829dc6f327b666.

    The patch being reverted here addresses an issue that can no longer be
    reproduced, in that under no circumstances, I can make the FIP lie around
    before deleting a router (which can only be done after all FIP have been
    disassociated or released).

    Unless we have more clarity as to what the initial commit was really meant
    to fix, there is a strong case for reverting this patch at this point.

    Closes-bug: #1373100

    Change-Id: I7e0f80e456ff4d9eb57a1d31c6ffc7cdfca5a163

Changed in neutron:
status: In Progress → Fix Committed
Thierry Carrez (ttx)
Changed in neutron:
status: Fix Committed → Fix Released
Thierry Carrez (ttx)
Changed in neutron:
milestone: juno-rc1 → 2014.2
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (master)

Change abandoned by Carl Baldwin (<email address hidden>) on branch: master
Review: https://review.openstack.org/123634

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (feature/lbaasv2)

Fix proposed to branch: feature/lbaasv2
Review: https://review.openstack.org/130864

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (feature/lbaasv2)

Reviewed: https://review.openstack.org/130864
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=c089154a94e5872efc95eab33d3d0c9de8619fe4
Submitter: Jenkins
Branch: feature/lbaasv2

commit 62588957fbeccfb4f80eaa72bef2b86b6f08dcf8
Author: Kevin Benton <email address hidden>
Date: Wed Oct 22 13:04:03 2014 -0700

    Big Switch: Switch to TLSv1 in server manager

    Switch to TLSv1 for the connections to the backend
    controllers. The default SSLv3 is no longer considered
    secure.

    TLSv1 was chosen over .1 or .2 because the .1 and .2 weren't
    added until python 2.7.9 so TLSv1 is the only compatible option
    for py26.

    Closes-Bug: #1384487
    Change-Id: I68bd72fc4d90a102003d9ce48c47a4a6a3dd6e03

commit 17204e8f02fdad046dabdb8b31397289d72c877b
Author: OpenStack Proposal Bot <email address hidden>
Date: Wed Oct 22 06:20:15 2014 +0000

    Imported Translations from Transifex

    For more information about this automatic import see:
    https://wiki.openstack.org/wiki/Translations/Infrastructure

    Change-Id: I58db0476c810aa901463b07c42182eef0adb5114

commit d712663b99520e6d26269b0ca193527603178742
Author: Carl Baldwin <email address hidden>
Date: Mon Oct 20 21:48:42 2014 +0000

    Move disabling of metadata and ipv6_ra to _destroy_router_namespace

    I noticed that disable_ipv6_ra is called from the wrong place and that
    in some cases it was called with a bogus router_id because the code
    made an incorrect assumption about the context. In other case, it was
    never called because _destroy_router_namespace was being called
    directly. This patch moves the disabling of metadata and ipv6_ra in
    to _destroy_router_namespace to ensure they get called correctly and
    avoid duplication.

    Change-Id: Ia76a5ff4200df072b60481f2ee49286b78ece6c4
    Closes-Bug: #1383495

commit f82a5117f6f484a649eadff4b0e6be9a5a4d18bb
Author: OpenStack Proposal Bot <email address hidden>
Date: Tue Oct 21 12:11:19 2014 +0000

    Updated from global requirements

    Change-Id: Idcbd730f5c781d21ea75e7bfb15959c8f517980f

commit be6bd82d43fbcb8d1512d8eb5b7a106332364c31
Author: Angus Lees <email address hidden>
Date: Mon Aug 25 12:14:29 2014 +1000

    Remove duplicate import of constants module

    .. and enable corresponding pylint check now the only offending instance
    is fixed.

    Change-Id: I35a12ace46c872446b8c87d0aacce45e94d71bae

commit 9902400039018d77aa3034147cfb24ca4b2353f6
Author: rajeev <email address hidden>
Date: Mon Oct 13 16:25:36 2014 -0400

    Fix race condition on processing DVR floating IPs

    Fip namespace and agent gateway port can be shared by multiple dvr routers.
    This change uses a set as the control variable for these shared resources
    and ensures that Test and Set operation on the control variable are
    performed atomically so that race conditions do not occur among
    multiple threads processing floating IPs.
    Limitation: The scope of this change is limited to addressing the race
    condition described in the bug report. It may not address other issues
    such as pre-existing issue wit...
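
The truncated commit message above describes using a set as the control variable for the shared fip namespace and agent gateway port, with the test and the set performed atomically. A minimal sketch of that idea follows. The class and method names are hypothetical, and `threading.Lock` is an assumption made here for illustration; the mechanism used by the actual patch is not shown in this report.

```python
import threading


class SharedFipState:
    """Hypothetical tracker of routers using the shared FIP resources."""

    def __init__(self):
        self._lock = threading.Lock()
        self._subscribers = set()

    def subscribe(self, router_id):
        """Add router_id; return True if the set was empty beforehand."""
        with self._lock:
            # Test and set happen under one lock, so only one caller
            # can ever observe the "first subscriber" condition.
            is_first = not self._subscribers
            self._subscribers.add(router_id)
            return is_first

    def unsubscribe(self, router_id):
        """Remove router_id; return True if no subscribers remain."""
        with self._lock:
            self._subscribers.discard(router_id)
            return not self._subscribers


state = SharedFipState()
if state.subscribe("router-B"):
    pass  # first user: this is where shared resources would be created
if state.unsubscribe("router-B"):
    pass  # no users left: this is where teardown would be safe
```

Under this scheme a router delete would only tear down the shared resources when `unsubscribe` reports that no other router still depends on them.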
