L3 HA: Unable to complete operation on subnet

Bug #1562878 reported by Ann Taraday on 2016-03-28
This bug affects 3 people
Affects   Importance   Assigned to
Rally     Undecided    John Schwarz
neutron   Medium       Ann Taraday

Bug Description

Environment: 3 controllers, 46 computes, Liberty, L3 HA.

While running NeutronNetworks.create_and_delete_routers, the test failed several times with "Unable to complete operation on subnet <id>. One or more ports have an IP allocation from this subnet."

Trace in the neutron-server logs: http://paste.openstack.org/show/491557/
Rally report attached.

The current problem is with the HA subnet. A side effect of this problem is bug https://bugs.launchpad.net/neutron/+bug/1562892

Ann Taraday (akamyshnikova) wrote :

This issue is part of the problem with deleting HA networks after deleting routers, so it can be considered part of https://bugs.launchpad.net/neutron/+bug/1562892 and https://bugs.launchpad.net/neutron/+bug/1540271

description: updated
Changed in neutron:
importance: Undecided → Medium
description: updated
John Schwarz (jschwarz) wrote :

Is this still reproducible?

John Schwarz (jschwarz) wrote :

I've managed to reproduce this quite easily locally.

Changed in neutron:
status: New → Confirmed
Henry Gessau (gessau) wrote :

Six occurrences in the gate in the last three days.

message:"One or more ports have an IP allocation from this subnet" && filename:"console.html" && build_queue:"gate"

tags: added: gate-failure
Henry Gessau (gessau) wrote :

This started showing up after https://review.openstack.org/346288 merged.

Changed in neutron:
importance: Medium → High
Changed in neutron:
importance: High → Critical
tags: added: l3-ipam-dhcp
John Schwarz (jschwarz) wrote :

I've had a look at the logs that match the criteria Henry set in comment #5, and I believe these are 2 different problems:

* the original problem reported by Ann concerns HA routers and specifically (afaiu) the HA subnet: one process is adding interfaces to it while another process tries to delete it because it thinks the subnet is no longer needed. In this case, a solution such as an ALLOCATING state for networks is suitable (though way overkill; we might as well make this a silent error and just catch the exception).

* in the failures matching Henry's criteria, the subnet appears to have nothing to do with HA (the router in question isn't an HA router at all). I've looked at [1], specifically screen-q-svc.txt.gz, for the subnet that failed in that run [2] (bb64edc3-2b1e-45ef-a199-5395918c72d7).

As such, I think it's best to treat them as 2 different issues. Since it's a critical error and I'm not so sharp on DVR, it's best if someone else takes it instead.

[1]: http://logs.openstack.org/86/335786/20/gate/gate-tempest-dsvm-neutron-dvr/36c90cd/logs/console-q-svc.txt.gz
[2]: http://logs.openstack.org/86/335786/20/gate/gate-tempest-dsvm-neutron-dvr/36c90cd/console.html

Henry Gessau (gessau) wrote :

Thanks John. Is the non-HA problem specific to DVR jobs? Did you file a bug for it?

Just hit it in grenade dvr multinode job: http://logs.openstack.org/64/353664/1/check/gate-grenade-dsvm-neutron-dvr-multinode/9822751/logs/

Note that it hit on the old side of the cloud, so it's not Newton. Since the patch Henry mentioned in comment #6 is in Newton only, it's probably not related.

John Schwarz (jschwarz) wrote :

Sorry for the delay, I wasn't getting notifications on this bug.

AFAIK the non-HA problem is specific to DVR jobs. I just filed a bug for it: https://bugs.launchpad.net/neutron/+bug/1612192, so let's use this bug to track the HA part of it.

This can be dropped from Critical as it's not affecting the gate as of yet.

tags: removed: gate-failure l3-ipam-dhcp
Changed in neutron:
importance: Critical → High
Changed in neutron:
importance: High → Medium
John Schwarz (jschwarz) wrote :

This bug has resurfaced: https://bugs.launchpad.net/tripleo/+bug/1638690

We have a reproduced environment, looking at it now.

John Schwarz (jschwarz) wrote :

I found the bug, and it's in rally. Patch Ieab53624dc34dc687a0e8eebd84778f7fc95dd77 added a new value for a router interface port's "device_owner" field, "network:ha_router_replicated_interface". However, rally was not made aware of it, so it treats this interface as a normal port and tries to delete it with a plain 'neutron port-delete' (instead of 'neutron router-interface-delete').

I'll adjust the bug report and will submit a fix for rally.
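
For illustration, the cleanup decision described above can be sketched roughly as follows. This is not rally's actual code: the function name, the return strings, and the RecordingClient stand-in are hypothetical, and only the device_owner values and the two neutron operations are taken from neutron itself.

```python
# device_owner values that mark a port as a router interface.
# The last entry is the value added for L3 HA by the neutron patch
# mentioned above; a cleanup routine that does not know it falls
# through to a plain port delete.
ROUTER_INTERFACE_OWNERS = (
    "network:router_interface",
    "network:router_interface_distributed",
    "network:ha_router_replicated_interface",
)


def delete_port(client, port):
    """Delete a port through the API that matches its device_owner.

    Hypothetical helper; `client` is assumed to expose
    remove_router_interface() and delete_port() the way a neutron
    client does.
    """
    if port["device_owner"] in ROUTER_INTERFACE_OWNERS:
        # For router interface ports, device_id holds the router's ID,
        # and the port has to be detached from that router.
        client.remove_router_interface(port["device_id"],
                                       {"port_id": port["id"]})
        return "router-interface-delete"
    client.delete_port(port["id"])
    return "port-delete"


class RecordingClient:
    """Hypothetical stand-in for a neutron client; it only records calls."""

    def __init__(self):
        self.calls = []

    def remove_router_interface(self, router_id, body):
        self.calls.append(
            ("remove_router_interface", router_id, body["port_id"]))

    def delete_port(self, port_id):
        self.calls.append(("delete_port", port_id))
```

With the HA value missing from the tuple, an HA router interface would be sent to delete_port(), which matches the failure mode reported here.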

Changed in neutron:
status: Confirmed → Invalid
Changed in rally:
assignee: nobody → John Schwarz (jschwarz)
status: New → Confirmed

Fix proposed to branch: master
Review: https://review.openstack.org/394354

Changed in rally:
status: Confirmed → In Progress

Reviewed: https://review.openstack.org/394354
Committed: https://git.openstack.org/cgit/openstack/rally/commit/?id=41010685c40ce765777200018faff184e515602b
Submitter: Jenkins
Branch: master

commit 41010685c40ce765777200018faff184e515602b
Author: John Schwarz <email address hidden>
Date: Mon Nov 7 12:36:51 2016 +0200

    Add missing device_owner for L3 HA's case

    Patch Ieab53624dc34dc687a0e8eebd8477 added a new possible value for the
    device_owner field of a port, that signifies a router's interface for a
    subnet. The rally code was not changed accordingly, which causes rally
    to try to delete the port using the wrong API ('neutron port-delete'
    instead of 'neutron router-interface-delete'), causing cleanup errors
    and leftover resources.

    Closes-Bug: #1562878
    Change-Id: I65933650d613e1800541dfdb33714a50f5c13db7

Changed in rally:
status: In Progress → Fix Released