Bad router request: Router already has a port on subnet

Bug #1494114 reported by Armando Migliaccio
18
This bug affects 1 person
Affects Status Importance Assigned to Milestone
neutron
Fix Released
Critical
Armando Migliaccio
Revision history for this message
Armando Migliaccio (armando-migliaccio) wrote :

@Carl: can someone in the L3 team triage this please?

description: updated
tags: added: l3-dvr-backlog
Changed in neutron:
status: New → Confirmed
importance: Undecided → Critical
summary: - test_cross_tenant_traffic fails with badrequest
+ Bad router request: Router already has a port on subnet
description: updated
Changed in neutron:
assignee: nobody → Carl Baldwin (carl-baldwin)
description: updated
Revision history for this message
Carl Baldwin (carl-baldwin) wrote :

I spent just a little time looking at some logs. It seems it is a different test each time. Not test_cross_tenant_traffic each time. Just looking at the four most recent logs, I see these tests failing.

TestGettingAddress.test_dualnet_multi_prefix_dhcpv6_stateless
TestGettingAddress.test_dualnet_dhcp6_stateless_from_os
TestSecurityGroupsBasicOps.test_cross_tenant_traffic
TestGettingAddress.test_dualnet_multi_prefix_slaac

I also noticed that most of the logs (but not all) also have this stack trace [1] accompanying the error. The ones that did had a stack trace similar to this [2] going through _prepare_and_test. The few that didn't seemed to have a different stack trace [3] but the same error message.

[1] http://paste.openstack.org/show/455392/
[2] http://paste.openstack.org/show/455397/
[3] http://paste.openstack.org/show/455399/

Revision history for this message
Miguel Lavalle (minsel) wrote :

I analyzed one of the failures and was able to cobble together the information here http://paste.openstack.org/show/455851/

So it seems we have a deadlock when trying the foloowing update to an allocation pool:

SQL: u'UPDATE ipavailabilityranges SET first_ip=%s WHERE ipavailabilityranges.allocation_pool_id = %s AND ipavailabilityranges.first_ip = %s AND ipavailabilityranges.last_ip = %s'] [parameters: ('10.100.0.4', u'e2b7dccc-4ecb-4dd2-a994-b9700f19bf22', u'10.100.0.3', u'10.100.0.14')]

Revision history for this message
Armando Migliaccio (armando-migliaccio) wrote :

I'll have plenty of time to kill at the airport later today...I'll see if I can provide any more input, but the fact that this happens during multiple tests makes me wonder whether this has to do with setup logic.

Revision history for this message
Miguel Lavalle (minsel) wrote :

Another failure that I analyzed. Different test method, same tempest script. Same blow up in neutron server:

http://paste.openstack.org/show/457057/

Revision history for this message
Carl Baldwin (carl-baldwin) wrote :

@Armando Looking through the stack traces I linked in comment #1 ([2] and [3]) seem to support your suspicion. It looks like prepare_network is shared among a lot of different tests and is run near the beginning of the test.

I wonder if the two failure modes, [2] and [3], can be examined independently. I think we should start focusing on failures containing a stack trace that looks like [2]. These seem to always be accompanied by a failure in the delete_wrapper which appears to be a problem with cleaning up resources, maybe from the previous test? The failure mode in [2] also seems to be much more common than the one in [3].

[1] http://paste.openstack.org/show/455392/
[2] http://paste.openstack.org/show/455397/
[3] http://paste.openstack.org/show/455399/

Revision history for this message
Armando Migliaccio (armando-migliaccio) wrote :

I think I got it. This is what happens:

a) the test wants to add the router interface
b) a db deadlock happens during _add_csnat_router_interface_port in l3_dvr_db [1]
c) the API layer issues a retry on db deadlock
d) the operation 'add_router_interface' fails on [2], because we have not rolled back the operation that initially succeeded in the first pass (the method body is not within a single subtransaction).

So in a nutshell the Bad Request is masking the deadlock: the retry is happening on inconsistent state. Vive l'atomicité!!!

[1] https://github.com/openstack/neutron/blob/master/neutron/db/l3_dvr_db.py#L325
[2] https://github.com/openstack/neutron/blob/master/neutron/db/l3_dvr_db.py#L312

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.openstack.org/222873

Changed in neutron:
assignee: Carl Baldwin (carl-baldwin) → Armando Migliaccio (armando-migliaccio)
status: Confirmed → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.openstack.org/222873
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=dafa61bd46b7eacbc708d17a3fa492de971d6dd2
Submitter: Jenkins
Branch: master

commit dafa61bd46b7eacbc708d17a3fa492de971d6dd2
Author: armando-migliaccio <email address hidden>
Date: Sat Sep 12 12:07:35 2015 -0700

    Fix BadRequest error on add_router_interface for DVR

    This operation for DVR is made of multiple steps, some of
    which are not within the same DB transaction. For this
    reason, if a failure occurs, the rollback will be partial.

    This inconsistent state leads the retry logic to fail with
    BadRequest, because the router is believed to be already
    connected to the subnet.

    To fix this condition, it is necessary to delete the port
    should the DB deadlock occur.

    Closes-bug: #1494114

    Change-Id: Ia2a73d6f9d1e4746e761ad072d954e64267a3ad1

Changed in neutron:
status: In Progress → Fix Committed
Revision history for this message
Miguel Lavalle (minsel) wrote :

As of 9/16, 0 events after fix was merged

Revision history for this message
Armando Migliaccio (armando-migliaccio) wrote :

rock on!

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (feature/pecan)

Fix proposed to branch: feature/pecan
Review: https://review.openstack.org/224334

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: feature/pecan
Review: https://review.openstack.org/224357

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (feature/pecan)
Download full text (73.6 KiB)

Reviewed: https://review.openstack.org/224357
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=fdc3431ccd219accf6a795079d9b67b8656eed8e
Submitter: Jenkins
Branch: feature/pecan

commit fe236bdaadb949661a0bfb9b62ddbe432b4cf5f1
Author: Miguel Angel Ajo <email address hidden>
Date: Thu Sep 3 15:40:12 2015 +0200

    No network devices on network attached qos policies

    Network devices, like internal router legs, or dhcp ports
    should not be affected by bandwidth limiting rules.

    This patch disables application of network attached policies
    to network/neutron owned ports.

    Closes-bug: #1486039
    DocImpact

    Change-Id: I75d80227f1e6c4b3f5fa7762b8dc3b0c0f1abd46

commit db4a06f7caa20a4c7879b58b20e95b223ed8eeaf
Author: Ken'ichi Ohmichi <email address hidden>
Date: Wed Sep 16 10:04:32 2015 +0000

    Use tempest-lib's token_client

    Now tempest-lib provides token_client modules as library and the
    interface is stable. So neutron repogitory doesn't need to contain
    these modules.
    This patch makes neutron use tempest-lib's token_client and removes
    the own modules for the maintenance.

    Change-Id: Ieff7eb003f6e8257d83368dbc80e332aa66a156c

commit 78aed58edbe6eb8a71339c7add491fe9de9a0546
Author: Jakub Libosvar <email address hidden>
Date: Thu Aug 13 09:08:20 2015 +0000

    Fix establishing UDP connection

    Previously, in establish_connection() for UDP protocol data were sent
    but never read on peer socket. That lead to successful read on peer side
    if this connection was filtered. Having constant testing string masked
    this issue as we can't distinguish to which test of connectivity data
    belong.

    This patch makes unique data string per test_connectivity() and
    also makes establish_connection() to create an ASSURED entry in
    conntrack table. Finally, in last test after firewall filter was
    removed, connection is re-established in order to avoid troubles with
    terminated processes or TCP continuing sending packets which weren't
    successfully delivered.

    Closes-Bug: 1478847
    Change-Id: I2920d587d8df8d96dc1c752c28f48ba495f3cf0f

commit e6292fcdd6262434a7b713ad8802db6bc8a6d3dc
Author: YAMAMOTO Takashi <email address hidden>
Date: Wed Sep 16 13:20:51 2015 +0900

    ovsdb: Fix a few docstring

    Change-Id: I53e1e21655b28fe5da60e58aeeb7cbbd103ae014

commit c22949a4449d96a67caa616290cf76b67b182917
Author: fumihiko kakuma <email address hidden>
Date: Wed Sep 16 11:52:59 2015 +0900

    Remove requirements.txt for the ofagent mechanism driver

    It is no longer used.

    Related-Blueprint: core-vendor-decomposition
    https://blueprints.launchpad.net/neutron/+spec/core-vendor-decomposition

    Change-Id: Ib31fb3febf8968e50d86dd66e1e6e1ea2313f8ac

commit d1d4de19d85f961d388c91e70f31b3bafec418c5
Author: Kevin Benton <email address hidden>
Date: Thu Sep 3 20:25:57 2015 -0700

    Always return iterables in L3 get_candidates

    The caller of this function expects iterables.

    Closes-Bug: #1494996
    Change-Id: I3d103e63f4e127a77268502415c0ddb0d804b54a

commit 1ad6ac448067306...

tags: added: in-feature-pecan
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (feature/pecan)

Change abandoned by Doug Wiegley (<email address hidden>) on branch: feature/pecan
Review: https://review.openstack.org/224334

Thierry Carrez (ttx)
Changed in neutron:
milestone: none → liberty-rc1
status: Fix Committed → Fix Released
Thierry Carrez (ttx)
Changed in neutron:
milestone: liberty-rc1 → 7.0.0
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Duplicates of this bug

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.