GBP: Deleting groups leads to subnet-delete in infinite loop

Bug #1510327 reported by vks1
This bug affects 1 person
Affects: Group Based Policy
Importance: High
Assigned to: Robert Kukura

Bug Description

Using gbpservice (stable/juno). Group deletion leads to subnet-delete in an infinite loop.

LOG:
2015-10-26 17:31:39.010 27256 INFO neutron.plugins.ml2.plugin [-] Subnet 9e1bb43f-33be-460a-8e28-ba6e93133345 was deleted concurrently

Sumit Naiksatam (snaiksat) wrote :

Can you give a little more information on the steps that led you to this? Did you delete multiple PTGs concurrently? Were you using the CLI or UI?

Changed in group-based-policy:
status: New → Incomplete

vks1 (vikash-kumar) wrote :

This is happening on every group-delete. As a side effect, the subnet 'a4619bfb-aa80-4850-af02-54f47a7b1b8e' named in the error message never gets deleted.

2015-10-27 16:12:10.060 6825 INFO neutron.plugins.ml2.plugin [-] Subnet a4619bfb-aa80-4850-af02-54f47a7b1b8e was deleted concurrently

Mandeep Dhami (dhami)
Changed in group-based-policy:
importance: Undecided → Critical
assignee: nobody → Sumit Naiksatam (snaiksat)
Changed in group-based-policy:
assignee: Sumit Naiksatam (snaiksat) → Robert Kukura (rkukura)
milestone: none → liberty-1

Sumit Naiksatam (snaiksat) wrote :

This is still not enough information to triage this issue. We don't see this issue in the gate. Can you please provide the stack trace? Also, are you saying that this is happening on every PTG delete?

Magesh GV (magesh-gv) wrote :

Sumit, this was observed in the gate twice last week while running UTs. One such log is below:

http://logs.openstack.org/94/239194/1/gate/gate-group-based-policy-python27/5940f47/console.html

Sumit Naiksatam (snaiksat) wrote :

The above log will be useful if you can tell us which test failed (I can't tell from that log) and how the issue can be reliably reproduced. It will be much easier if you just provide the stack trace from your setup.

Can we also do the following:
* When the infinite loop happens, immediately stop the neutron server,
and then restart it. This will stop overwhelming the log and the
neutron server process.
* Now check which subnet was leading to this error, and find what are
the associated GBP and Neutron resources with this subnet (L3P, L2P,
PTG, neutron network and neutron ports).

And as before, it will help to get the exact sequence of steps that is
leading to this (is the L3P, L2P being implicitly created, is this
happening after a provide/consume and if so is there a redirect to a
contract involved, etc.)
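For the second step above (finding which subnet is behind the error), a minimal sketch of how the neutron-server log could be scanned is shown here. It assumes only the INFO line format quoted in this report; the function name `count_concurrent_deletes` and the pattern name are illustrative, not part of neutron or GBP.

```python
import re
from collections import Counter

# Matches the ML2 INFO line quoted in this report, e.g.
# "... INFO neutron.plugins.ml2.plugin [-] Subnet <uuid> was deleted concurrently"
CONCURRENT_DELETE = re.compile(
    r"Subnet (?P<subnet_id>[0-9a-f-]{36}) was deleted concurrently"
)


def count_concurrent_deletes(lines):
    """Return a Counter mapping subnet ID -> number of matching log lines."""
    counts = Counter()
    for line in lines:
        match = CONCURRENT_DELETE.search(line)
        if match:
            counts[match.group("subnet_id")] += 1
    return counts
```

Passing an open log file (any iterable of lines) and calling `.most_common()` on the result lists the subnet IDs that repeat most often, which is where a looping delete would be expected to show up.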

vks1 (vikash-kumar) wrote :

One more piece of information: this happens only when deleting the proxy subnet created for chaining.

As asked, I am attaching the log of the entire delete sequence.

Sumit Naiksatam (snaiksat) wrote :

The beginning of the log says:
ERROR gbpservice.neutron.services.servicechain.plugins.ncp.node_drivers.oc_service_manager_client [-] Service LOADBALANCER went to ERROR state

What is in the logs prior to this? I would like to know what you were trying to create. To repeat earlier questions: "it will help to get the exact sequence of steps that is
leading to this (is the L3P, L2P being implicitly created, is this
happening after a provide/consume and if so is there a redirect to a
contract involved, etc.)"

On the second point, are you deleting the proxy subnet manually, or are you referring to the implicit flow?

Please note my earlier request -
"
Can we also do the following:
* When the infinite loop happens, immediately stop the neutron server,
and then restart it. This will stop overwhelming the log and the
neutron server process.
* Now check which subnet was leading to this error, and find what are
the associated GBP and Neutron resources with this subnet (L3P, L2P,
PTG, neutron network and neutron ports).
"

vks1 (vikash-kumar) wrote :

Sumit,

This patch:

https://review.openstack.org/#/c/239788/1/gbpservice/neutron/services/grouppolicy/drivers/cisco/apic/apic_mapping.py

though posted for a different bug, looks like it fixed the issue.

Sumit Naiksatam (snaiksat) wrote :

Ah interesting! :-) Let's merge it then so that we can further confirm that it fixes the issue.

Sumit Naiksatam (snaiksat) wrote :

Regardless, it will be good to get away from the infinite loop code as a defensive fix.

vks1 (vikash-kumar) wrote :

The patch doesn't fix this completely. It still occurs when the neutron resource gets cleaned up by invoking GBP commands.

Changed in group-based-policy:
status: Incomplete → Confirmed

Robert Kukura (rkukura) wrote :

Does this also occur with the resource_mapping policy driver, or only with apic_mapping?

Can you summarize the steps to reproduce the issue, or tell me which UT(s) it occurs during? I can't seem to figure out which UT was running in the log in comment #4.

Is it intermittent? It seems to be, at least in the gate.

Do you happen to have DEBUG-level neutron-server logs showing the issue?

OpenStack Infra (hudson-openstack) wrote : Fix proposed to group-based-policy (master)

Fix proposed to branch: master
Review: https://review.openstack.org/243334

Changed in group-based-policy:
status: Confirmed → In Progress

OpenStack Infra (hudson-openstack) wrote : Fix merged to group-based-policy (master)

Reviewed: https://review.openstack.org/243334
Committed: https://git.openstack.org/cgit/openstack/group-based-policy/commit/?id=86b4c6d42828ab5d4bc6d8b14d0e915d613fb2c7
Submitter: Jenkins
Branch: master

commit 86b4c6d42828ab5d4bc6d8b14d0e915d613fb2c7
Author: Robert Kukura <email address hidden>
Date: Mon Nov 9 17:16:37 2015 -0500

    Limit ML2 delete_network/subnet retries

    Monkey-patch ML2's delete_network() and delete_subnet() methods to
    limit the number of times they retry to avoid potential infinite
    loops. Also add some logging to help determine when/why the
    delete_network() loops occur. This does not resolve the actual bug -
    it just mitigates the damage when it occurs.

    Partial-bug: 1510327
    Related-bug: 1470646

    Change-Id: I193d56b0ed16bcc69f434a87d11a355e9177eb1e
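
The idea described in the commit message (replace an unbounded retry loop with a bounded one by monkey-patching the method) can be illustrated with a minimal, self-contained sketch. This is not the contents of the merged patch: `FakePlugin`, `RetryLimitExceeded`, `limited_delete_subnet` and the `max_retries` default are all hypothetical stand-ins, and the real ML2 `delete_subnet()` does considerably more work per attempt.

```python
class RetryLimitExceeded(Exception):
    """Hypothetical exception raised when the retry budget is exhausted."""


class FakePlugin:
    """Stand-in for a plugin whose delete loop retries until it succeeds."""

    def __init__(self, failures_before_success):
        self.remaining_failures = failures_before_success

    def _try_delete_subnet(self, subnet_id):
        # Return True once the delete goes through, False to request a retry.
        if self.remaining_failures > 0:
            self.remaining_failures -= 1
            return False
        return True

    def delete_subnet(self, subnet_id):
        # Unbounded loop: never returns if _try_delete_subnet keeps failing.
        while not self._try_delete_subnet(subnet_id):
            pass


def limited_delete_subnet(self, subnet_id, max_retries=10):
    """Bounded replacement: return the attempt number that succeeded,
    or raise once max_retries attempts have all failed."""
    for attempt in range(1, max_retries + 1):
        if self._try_delete_subnet(subnet_id):
            return attempt
    raise RetryLimitExceeded(
        "delete_subnet(%s) gave up after %d attempts" % (subnet_id, max_retries)
    )


# Monkey-patch: swap the method on the class so every instance is bounded.
FakePlugin.delete_subnet = limited_delete_subnet
```

As the commit message notes for the real change, a cap like this only mitigates the damage: the delete still fails, but it surfaces as an exception that can be logged instead of a loop that floods the log.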

OpenStack Infra (hudson-openstack) wrote : Fix proposed to group-based-policy (stable/kilo)

Fix proposed to branch: stable/kilo
Review: https://review.openstack.org/245350

OpenStack Infra (hudson-openstack) wrote : Fix merged to group-based-policy (stable/kilo)

Reviewed: https://review.openstack.org/245350
Committed: https://git.openstack.org/cgit/openstack/group-based-policy/commit/?id=095216e11d1c8b92694f50d7ed8bf11df5a68744
Submitter: Jenkins
Branch: stable/kilo

commit 095216e11d1c8b92694f50d7ed8bf11df5a68744
Author: Robert Kukura <email address hidden>
Date: Mon Nov 9 17:16:37 2015 -0500

    Limit ML2 delete_network/subnet retries

    Monkey-patch ML2's delete_network() and delete_subnet() methods to
    limit the number of times they retry to avoid potential infinite
    loops. Also add some logging to help determine when/why the
    delete_network() loops occur. This does not resolve the actual bug -
    it just mitigates the damage when it occurs.

    Partial-bug: 1510327
    Related-bug: 1470646

    Change-Id: I193d56b0ed16bcc69f434a87d11a355e9177eb1e
    (cherry picked from commit 86b4c6d42828ab5d4bc6d8b14d0e915d613fb2c7)

tags: added: in-stable-kilo

OpenStack Infra (hudson-openstack) wrote : Fix proposed to group-based-policy (stable/juno)

Fix proposed to branch: stable/juno
Review: https://review.openstack.org/245435

OpenStack Infra (hudson-openstack) wrote : Fix merged to group-based-policy (stable/juno)

Reviewed: https://review.openstack.org/245435
Committed: https://git.openstack.org/cgit/openstack/group-based-policy/commit/?id=bea20393eacde996245c7f07f76d13fb585b96c8
Submitter: Jenkins
Branch: stable/juno

commit bea20393eacde996245c7f07f76d13fb585b96c8
Author: Robert Kukura <email address hidden>
Date: Mon Nov 9 17:16:37 2015 -0500

    Limit ML2 delete_network/subnet retries

    Monkey-patch ML2's delete_network() and delete_subnet() methods to
    limit the number of times they retry to avoid potential infinite
    loops. Also add some logging to help determine when/why the
    delete_network() loops occur. This does not resolve the actual bug -
    it just mitigates the damage when it occurs.

    Partial-bug: 1510327
    Related-bug: 1470646

    Conflicts:
     gbpservice/neutron/extensions/patch_ml2.py

    Change-Id: I193d56b0ed16bcc69f434a87d11a355e9177eb1e
    (cherry picked from commit 86b4c6d42828ab5d4bc6d8b14d0e915d613fb2c7)
    (cherry picked from commit 095216e11d1c8b92694f50d7ed8bf11df5a68744)

tags: added: in-stable-juno

Robert Kukura (rkukura) wrote :

Now that fixes to prevent infinite looping have merged to juno, kilo and master, I've reduced the importance from critical to high. We now need to capture neutron-server logs when this issue occurs and the new exception is raised, so we can determine why this looping occurs. Debug-level logs would be most useful, but non-debug logs may now also provide useful info, so please attach either to this bug report.

Changed in group-based-policy:
importance: Critical → High
Changed in group-based-policy:
milestone: liberty-1 → next