delete port fails with RouterNotHostedByL3Agent exception

Bug #1367892 reported by Ed Bak
This bug affects 2 people
Affects    Status          Importance    Assigned to    Milestone
neutron    Fix Released    Medium        Ed Bak

Bug Description

When deleting a VM, port_delete sometimes fails with a RouterNotHostedByL3Agent exception. The error is produced by a script that boots a VM, associates a floating IP, tests that the VM is pingable, disassociates the floating IP, and then deletes the VM. The following stack trace has been seen multiple times.

2014-09-09 11:55:59 7648 DEBUG neutronclient.v2_0.client [req-16883a09-7ec6-4159-9580-9cfa1880f786 73ae929bd62c4eddbe2f38a709265f2b 3d4668d03b5e4ac7b316aac9ff88e2db] Error message: {"NeutronError": {"message": "The router 0ffc5634-d7ff-4bc7-8dca-cbdb10414924 is not hosted by L3 agent 35f71627-3c41-4226-96dd-15faa6ec44c3.", "type": "RouterNotHostedByL3Agent", "detail": ""}} _handle_fault_response /opt/stack/venvs/nova/local/lib/python2.7/site-packages/neutronclient/v2_0/client.py:1202
2014-09-09 11:55:59 7648 ERROR nova.network.neutronv2.api [req-16883a09-7ec6-4159-9580-9cfa1880f786 73ae929bd62c4eddbe2f38a709265f2b 3d4668d03b5e4ac7b316aac9ff88e2db] Failed to delete neutron port 41b8e31b-f459-4159-9311-d8701885f43a
2014-09-09 11:55:59.153 7648 TRACE nova.network.neutronv2.api Traceback (most recent call last):
2014-09-09 11:55:59.153 7648 TRACE nova.network.neutronv2.api File "/opt/stack/venvs/nova/local/lib/python2.7/site-packages/nova/network/neutronv2/api.py", line 448, in deallocate_for_instance
2014-09-09 11:55:59.153 7648 TRACE nova.network.neutronv2.api neutron.delete_port(port)
2014-09-09 11:55:59.153 7648 TRACE nova.network.neutronv2.api File "/opt/stack/venvs/nova/local/lib/python2.7/site-packages/neutronclient/v2_0/client.py", line 101, in with_params
2014-09-09 11:55:59.153 7648 TRACE nova.network.neutronv2.api ret = self.function(instance, *args, **kwargs)
2014-09-09 11:55:59.153 7648 TRACE nova.network.neutronv2.api File "/opt/stack/venvs/nova/local/lib/python2.7/site-packages/neutronclient/v2_0/client.py", line 328, in delete_port
2014-09-09 11:55:59.153 7648 TRACE nova.network.neutronv2.api return self.delete(self.port_path % (port))
2014-09-09 11:55:59.153 7648 TRACE nova.network.neutronv2.api File "/opt/stack/venvs/nova/local/lib/python2.7/site-packages/neutronclient/v2_0/client.py", line 1311, in delete
2014-09-09 11:55:59.153 7648 TRACE nova.network.neutronv2.api headers=headers, params=params)
2014-09-09 11:55:59.153 7648 TRACE nova.network.neutronv2.api File "/opt/stack/venvs/nova/local/lib/python2.7/site-packages/neutronclient/v2_0/client.py", line 1300, in retry_request
2014-09-09 11:55:59.153 7648 TRACE nova.network.neutronv2.api headers=headers, params=params)
2014-09-09 11:55:59.153 7648 TRACE nova.network.neutronv2.api File "/opt/stack/venvs/nova/local/lib/python2.7/site-packages/neutronclient/v2_0/client.py", line 1243, in do_request
2014-09-09 11:55:59.153 7648 TRACE nova.network.neutronv2.api self._handle_fault_response(status_code, replybody)
2014-09-09 11:55:59.153 7648 TRACE nova.network.neutronv2.api File "/opt/stack/venvs/nova/local/lib/python2.7/site-packages/neutronclient/v2_0/client.py", line 1211, in _handle_fault_response
2014-09-09 11:55:59.153 7648 TRACE nova.network.neutronv2.api exception_handler_v20(status_code, des_error_body)
2014-09-09 11:55:59.153 7648 TRACE nova.network.neutronv2.api File "/opt/stack/venvs/nova/local/lib/python2.7/site-packages/neutronclient/v2_0/client.py", line 68, in exception_handler_v20
2014-09-09 11:55:59.153 7648 TRACE nova.network.neutronv2.api status_code=status_code)
2014-09-09 11:55:59.153 7648 TRACE nova.network.neutronv2.api Conflict: The router 0ffc5634-d7ff-4bc7-8dca-cbdb10414924 is not hosted by L3 agent 35f71627-3c41-4226-96dd-15faa6ec44c3.

Revision history for this message
Armando Migliaccio (armando-migliaccio) wrote :

This is very weird.

Can you provide more details about this (e.g. the script, Neutron's version, neutron.conf, l3_agent.ini, etc.)? I cannot think of any reason why this error would be raised during a VM delete action.

Changed in neutron:
status: New → Incomplete
Revision history for this message
Ed Bak (ed-bak2) wrote :

This bug is caused by a race condition between multiple neutron-server processes. If a number of VMs are deleted in sequence, dvr_deletens_if_no_vms can return an erroneous response across the multiple processes. One neutron-server process thinks that its port, which is being deleted, is the last port on the host, and it then deletes the router namespace. Another process can also think that its port is the last port on the host and attempts to delete the namespace, but the namespace is already gone, having been deleted by the first process.

You can also create the opposite problem, where multiple neutron-server processes each determine that their port is not the last port on the host, and as a result the router namespace never gets deleted.
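
Both outcomes are instances of a check-then-act race. The sketch below is a toy model, not Neutron code: `Host`, `others_remain`, `unbind_router`, `double_unbind` and `never_unbound` are hypothetical names, and the exception class is only a stand-in for the one in the trace. The two workers are interleaved by hand so the result is deterministic.

```python
class RouterNotHostedByL3Agent(Exception):
    """Stand-in for the Neutron exception named in the trace."""


class Host:
    """Toy shared state: the ports on one host plus a single router binding."""

    def __init__(self, ports):
        self.ports = set(ports)
        self.router_bound = True

    def others_remain(self, port):
        # Check step: is any port other than `port` still on the host?
        return bool(self.ports - {port})

    def unbind_router(self):
        # Act step: strict unbind that fails if the binding is already gone.
        if not self.router_bound:
            raise RouterNotHostedByL3Agent("router is not hosted by this agent")
        self.router_bound = False


def double_unbind():
    """Both workers remove their port first, then both see an empty host."""
    host = Host({"port-a", "port-b"})
    host.ports.discard("port-a")                    # worker A removes its port
    host.ports.discard("port-b")                    # worker B removes its port
    a_is_last = not host.others_remain("port-a")    # A believes it is last
    b_is_last = not host.others_remain("port-b")    # B believes it is last too
    if a_is_last:
        host.unbind_router()                        # first unbind succeeds
    if b_is_last:
        host.unbind_router()                        # second unbind raises


def never_unbound():
    """Both workers check before either one removes its port."""
    host = Host({"port-a", "port-b"})
    a_is_last = not host.others_remain("port-a")    # port-b is still present
    b_is_last = not host.others_remain("port-b")    # port-a is still present
    host.ports.discard("port-a")
    host.ports.discard("port-b")
    if a_is_last or b_is_last:
        host.unbind_router()
    return host.router_bound                        # binding is left behind
```

In this model, `double_unbind()` raises on the second unbind, which corresponds to the failure in the trace, while `never_unbound()` leaves the binding in place even though no ports remain.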

Ed Bak (ed-bak2)
Changed in neutron:
assignee: nobody → Ed Bak (ed-bak2)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.openstack.org/124865

Changed in neutron:
status: Incomplete → In Progress
Revision history for this message
Armando Migliaccio (armando-migliaccio) wrote :

This is clearly a DVR-only problem.

Just a note: there are other places in the DVR code where we clean up resources when the last one is being cleared (for instance when we disassociate/delete FIPs). These are also prone to concurrency issues like this.

tags: added: l3-dvr-backlog
Changed in neutron:
importance: Undecided → Medium
tags: added: juno-rc-potential
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.openstack.org/124865
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=239b2d94339574f63afe0c6df120bfea2974ef7f
Submitter: Jenkins
Branch: master

commit 239b2d94339574f63afe0c6df120bfea2974ef7f
Author: Ed Bak <email address hidden>
Date: Mon Sep 29 14:15:52 2014 -0600

    Don't fail when trying to unbind a router

    If a router is already unbound from an l3 agent, don't fail. Log
    the condition and go on. This is harmless since it can happen
    due to a delete race condition between multiple neutron-server
    processes. One delete request can determine that it needs to
    unbind the router. A second process may also determine that it
    needs to unbind the router. The exception thrown will result
    in a port delete failure and cause nova to mark a deleted instance
    as ERROR.

    Change-Id: Ia667ea77a0a483deff8acfdcf90ca84cd3adf44f
    Closes-Bug: 1367892
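
The approach described in the commit message (treat "already unbound" as a harmless outcome, log it, and continue) could look roughly like the following. This is a minimal sketch under stated assumptions, not the merged patch: `unbind_router`, `unbind_router_tolerant` and the `bindings` set are hypothetical, and the exception class is a local stand-in rather than an import from Neutron.

```python
import logging

LOG = logging.getLogger(__name__)


class RouterNotHostedByL3Agent(Exception):
    """Local stand-in for the Neutron exception; not imported from Neutron."""


def unbind_router(bindings, router_id, agent_id):
    """Strict unbind (toy model): raise if the binding is already gone."""
    key = (router_id, agent_id)
    if key not in bindings:
        raise RouterNotHostedByL3Agent(
            "The router %s is not hosted by L3 agent %s." % key)
    bindings.remove(key)


def unbind_router_tolerant(bindings, router_id, agent_id):
    """Unbind, treating 'already unbound' as success.

    Returns True if this call removed the binding, and False if another
    caller had already removed it.
    """
    try:
        unbind_router(bindings, router_id, agent_id)
        return True
    except RouterNotHostedByL3Agent:
        # A concurrent delete got there first; log the condition and go on.
        LOG.debug("Router %s already unbound from agent %s; ignoring",
                  router_id, agent_id)
        return False
```

Making the unbind idempotent means the caller that loses the race no longer surfaces an error on port delete. It does not address the opposite case, in which no process unbinds the router at all.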

Changed in neutron:
status: In Progress → Fix Committed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (proposed/juno)

Fix proposed to branch: proposed/juno
Review: https://review.openstack.org/126565

Thierry Carrez (ttx)
Changed in neutron:
milestone: none → juno-rc2
tags: removed: juno-rc-potential
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (proposed/juno)

Reviewed: https://review.openstack.org/126565
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=75f34fbbd930a143ed2c4b868f33c117e467e98e
Submitter: Jenkins
Branch: proposed/juno

commit 75f34fbbd930a143ed2c4b868f33c117e467e98e
Author: Ed Bak <email address hidden>
Date: Mon Sep 29 14:15:52 2014 -0600

    Don't fail when trying to unbind a router

    If a router is already unbound from an l3 agent, don't fail. Log
    the condition and go on. This is harmless since it can happen
    due to a delete race condition between multiple neutron-server
    processes. One delete request can determine that it needs to
    unbind the router. A second process may also determine that it
    needs to unbind the router. The exception thrown will result
    in a port delete failure and cause nova to mark a deleted instance
    as ERROR.

    Change-Id: Ia667ea77a0a483deff8acfdcf90ca84cd3adf44f
    Closes-Bug: 1367892

Changed in neutron:
status: Fix Committed → Fix Released
Thierry Carrez (ttx)
Changed in neutron:
milestone: juno-rc2 → 2014.2
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.openstack.org/128913

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (feature/lbaasv2)

Fix proposed to branch: feature/lbaasv2
Review: https://review.openstack.org/130864

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (feature/lbaasv2)

Reviewed: https://review.openstack.org/130864
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=c089154a94e5872efc95eab33d3d0c9de8619fe4
Submitter: Jenkins
Branch: feature/lbaasv2

commit 62588957fbeccfb4f80eaa72bef2b86b6f08dcf8
Author: Kevin Benton <email address hidden>
Date: Wed Oct 22 13:04:03 2014 -0700

    Big Switch: Switch to TLSv1 in server manager

    Switch to TLSv1 for the connections to the backend
    controllers. The default SSLv3 is no longer considered
    secure.

    TLSv1 was chosen over .1 or .2 because the .1 and .2 weren't
    added until python 2.7.9 so TLSv1 is the only compatible option
    for py26.

    Closes-Bug: #1384487
    Change-Id: I68bd72fc4d90a102003d9ce48c47a4a6a3dd6e03

commit 17204e8f02fdad046dabdb8b31397289d72c877b
Author: OpenStack Proposal Bot <email address hidden>
Date: Wed Oct 22 06:20:15 2014 +0000

    Imported Translations from Transifex

    For more information about this automatic import see:
    https://wiki.openstack.org/wiki/Translations/Infrastructure

    Change-Id: I58db0476c810aa901463b07c42182eef0adb5114

commit d712663b99520e6d26269b0ca193527603178742
Author: Carl Baldwin <email address hidden>
Date: Mon Oct 20 21:48:42 2014 +0000

    Move disabling of metadata and ipv6_ra to _destroy_router_namespace

    I noticed that disable_ipv6_ra is called from the wrong place and that
    in some cases it was called with a bogus router_id because the code
    made an incorrect assumption about the context. In other case, it was
    never called because _destroy_router_namespace was being called
    directly. This patch moves the disabling of metadata and ipv6_ra in
    to _destroy_router_namespace to ensure they get called correctly and
    avoid duplication.

    Change-Id: Ia76a5ff4200df072b60481f2ee49286b78ece6c4
    Closes-Bug: #1383495

commit f82a5117f6f484a649eadff4b0e6be9a5a4d18bb
Author: OpenStack Proposal Bot <email address hidden>
Date: Tue Oct 21 12:11:19 2014 +0000

    Updated from global requirements

    Change-Id: Idcbd730f5c781d21ea75e7bfb15959c8f517980f

commit be6bd82d43fbcb8d1512d8eb5b7a106332364c31
Author: Angus Lees <email address hidden>
Date: Mon Aug 25 12:14:29 2014 +1000

    Remove duplicate import of constants module

    .. and enable corresponding pylint check now the only offending instance
    is fixed.

    Change-Id: I35a12ace46c872446b8c87d0aacce45e94d71bae

commit 9902400039018d77aa3034147cfb24ca4b2353f6
Author: rajeev <email address hidden>
Date: Mon Oct 13 16:25:36 2014 -0400

    Fix race condition on processing DVR floating IPs

    Fip namespace and agent gateway port can be shared by multiple dvr routers.
    This change uses a set as the control variable for these shared resources
    and ensures that Test and Set operation on the control variable are
    performed atomically so that race conditions do not occur among
    multiple threads processing floating IPs.
    Limitation: The scope of this change is limited to addressing the race
    condition described in the bug report. It may not address other issues
    such as pre-existing issue wit...

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.openstack.org/128913
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=71df7c80b9efa84f2ef87a2299600066816870b4
Submitter: Jenkins
Branch: master

commit b28eda57223e492924edb731e24c2e4f64cc0de5
Author: Carl Baldwin <email address hidden>
Date: Wed Oct 8 03:22:49 2014 +0000

    Remove two sets that are not referenced

    The code no longer references the updated_routers and removed_routers
    sets. This should have been cleaned up before but was missed.

    Closes-bug: #1232525

    Change-Id: I0396e13d2f7c3789928e0c6a4c0a071b02d5ff17
    (cherry picked from commit edb26bfcddf9d9a0e95955a6590d11fa7245ea2b)

commit 9cce0bfdb713c2b975b289d90de6d57b68ca3854
Author: Mark McClain <email address hidden>
Date: Thu Oct 9 13:29:48 2014 +0000

    Add Juno release milestone

    Change-Id: Iea584b00329d9474c14847db958f8743d4058525
    Closes-Bug: #1378855
    (cherry picked from commit 4e8a5b7de71ba6f8c050c424613c025310498940)

commit 8e76cccb1ed9a248439b1188d1d805649169e46b
Author: Mark McClain <email address hidden>
Date: Wed Oct 8 18:49:20 2014 +0000

    Add database relationship between router and ports

    Add an explicit schema relationship between a router and its ports. This
    change ensures referential integrity among the entities and prevents orphaned
    ports.

    Change-Id: I09e8a694cdff7f64a642a39b45cbd12422132806
    Closes-Bug: #1378866
    (cherry picked from commit 93012915a3445a8ac8a0b30b702df30febbbb728)

commit 5610343d5aab876480cbe15c8d77631e67d6142f
Author: Henry Gessau <email address hidden>
Date: Tue Oct 7 20:38:38 2014 -0400

    Disable PUT for IPv6 subnet attributes

    In Juno we are not ready for allowing the IPv6 attributes on a subnet
    to be updated after the subnet is created, because:
    - The implementation for supporting updates is incomplete.
    - Perceived lack of usefulness, no good use cases known yet.
    - Allowing updates causes more complexity in the code.
    - Have not tested that radvd, dhcp, etc. behave OK after update.

    Therefore, for now, we set 'allow_put' to False for the two IPv6
    attributes, ipv6_ra_mode and ipv6_address_mode. This prevents the
    modes from being updated via the PUT:subnets API.

    Closes-bug: #1378952

    Change-Id: Id6ce894d223c91421b62f82d266cfc15fa63ed0e
    (cherry picked from commit 8a08a3cb47d0dd69d4aa2e8fa661d04054fe95ae)

commit 54be5a9e977ea344cc53addb87635ddba0cfd815
Author: Sean M. Collins <email address hidden>
Date: Mon Oct 6 15:47:24 2014 -0400

    Skip IPv6 Tests in the OpenContrail plugin

    Similar to the way we are skipping tests in the OneConvergence plugin,
    introduced by Kevin Benton in 9294de441e684a81f6e802ba0564083f1ad319d6.

    Partial-Bug: #1378952

    Change-Id: I1650b0708af73ce63e92c55bc842607bb69efe60
    (cherry picked from commit 67962943969bc737a3f680a0defc2fc9df03c429)

commit aefc12ec552afe32f0d1d6f7c8c588afac956988
Author: Ihar Hrachyshka <email address hidden>
Date: Thu Aug 7 22:27:23 2014 +0200

    Removed kombu from requirements

    Since we've replaced oslo-incubator RPC layer with...
