Bug #1510327 “GBP: Deleting groups leads to subnet-delete in inf...” : Bugs : Group Based Policy

Revision history for this message

Sumit Naiksatam (snaiksat) wrote on 2015-10-27:

#1

Can you give a little more information on the steps that led you to this? Did you delete multiple PTGs concurrently? Were you using the CLI or UI?

Changed in group-based-policy:
status:	New → Incomplete

vks1 (vikash-kumar) wrote on 2015-10-27:

#2

This is happening on every group-delete. As a side effect, the subnet 'a4619bfb-aa80-4850-af02-54f47a7b1b8e', for which the error message appears, never gets deleted.

2015-10-27 16:12:10.060 6825 INFO neutron.plugins.ml2.plugin [-] Subnet a4619bfb-aa80-4850-af02-54f47a7b1b8e was deleted concurrently

Mandeep Dhami (dhami) on 2015-10-28

Changed in group-based-policy:
importance:	Undecided → Critical
assignee:	nobody → Sumit Naiksatam (snaiksat)

Sumit Naiksatam (snaiksat) on 2015-10-28

Changed in group-based-policy:
assignee:	Sumit Naiksatam (snaiksat) → Robert Kukura (rkukura)
milestone:	none → liberty-1

Sumit Naiksatam (snaiksat) wrote on 2015-10-28:

#3

This is still not enough information to triage this issue. We don't see this issue in the gate. Can you please provide the stack trace? Also, are you saying that this is happening on every PTG delete?

Magesh GV (magesh-gv) wrote on 2015-10-28:

#4

Sumit, This was observed on gate twice last week while running UTs. One such log is below:

http://logs.openstack.org/94/239194/1/gate/gate-group-based-policy-python27/5940f47/console.html

Sumit Naiksatam (snaiksat) wrote on 2015-10-28:

#5

The above log will be useful if you can tell us which test failed (I can't tell from that log) and how the issue can be reliably reproduced. It will be much easier if you just provide the stack trace from your setup.

Can we also do the following:
* When the infinite loop happens, immediately stop the neutron server,
and then restart it. This will stop overwhelming the log and the
neutron server process.
* Now check which subnet was leading to this error, and find which GBP
and Neutron resources are associated with this subnet (L3P, L2P,
PTG, neutron network and neutron ports).

And as before, it will help to get the exact sequence of steps that is
leading to this (is the L3P, L2P being implicitly created, is this
happening after a provide/consume and if so is there a redirect to a
contract involved, etc.)
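
To identify which subnet is looping, as requested in the steps above, a small log scan may help. This is a minimal sketch, not part of GBP or Neutron: it assumes the log line has the same wording as the INFO message quoted in comment #2, and the function names `count_concurrent_deletes` and `scan_file` are made up for illustration.

```python
import re
from collections import Counter

# Matches the INFO line quoted in comment #2. The wording is taken from
# that single sample and may differ between Neutron releases (assumption).
PATTERN = re.compile(
    r"Subnet (?P<subnet_id>[0-9a-f-]{36}) was deleted concurrently"
)


def count_concurrent_deletes(lines):
    """Return a Counter mapping subnet ID -> number of matching log lines."""
    counts = Counter()
    for line in lines:
        match = PATTERN.search(line)
        if match:
            counts[match.group("subnet_id")] += 1
    return counts


def scan_file(path):
    """Scan a neutron-server log file (path supplied by the caller)."""
    with open(path, errors="replace") as log_file:
        return count_concurrent_deletes(log_file)
```

A subnet ID with a very high count is a likely candidate for the loop; its associated L3P, L2P, PTG, network and ports can then be looked up by hand.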

vks1 (vikash-kumar) wrote on 2015-10-28:

#6

neutron-server.log (226.9 KiB, text/plain)

One more piece of information: this happens only when deleting the proxy subnet created for chaining.

As asked, I am attaching the log of the entire delete sequence.

Sumit Naiksatam (snaiksat) wrote on 2015-10-28:

#7

The beginning of the log says:
ERROR gbpservice.neutron.services.servicechain.plugins.ncp.node_drivers.oc_service_manager_client [-] Service LOADBALANCER went to ERROR state

What is in the logs prior to this? I would like to know what you were trying to create. To repeat earlier questions: "it will help to get the exact sequence of steps that is
leading to this (is the L3P, L2P being implicitly created, is this
happening after a provide/consume and if so is there a redirect to a
contract involved, etc.)"

On the second point, are you deleting the proxy subnet manually, or are you referring to the implicit flow?

Please note my earlier request -
"
Can we also do the following:
* When the infinite loop happens, immediately stop the neutron server,
and then restart it. This will stop overwhelming the log and the
neutron server process.
* Now check which subnet was leading to this error, and find which GBP
and Neutron resources are associated with this subnet (L3P, L2P,
PTG, neutron network and neutron ports).
"

vks1 (vikash-kumar) wrote on 2015-10-28:

#8

Sumit,

This patch:

https://review.openstack.org/#/c/239788/1/gbpservice/neutron/services/grouppolicy/drivers/cisco/apic/apic_mapping.py

though posted for a different bug, looks like it fixed the issue.

Sumit Naiksatam (snaiksat) wrote on 2015-10-28:

#9

Ah interesting! :-) Let's merge it then so that we can further confirm that it fixes the issue.

Sumit Naiksatam (snaiksat) wrote on 2015-10-28:

#10

Regardless, it will be good to get away from the infinite loop code as a defensive fix.

vks1 (vikash-kumar) wrote on 2015-10-30:

#11

The patch doesn't fix this completely. It still occurs in the condition where the neutron resource gets cleaned up with invoking GBP commands.

Sumit Naiksatam (snaiksat) on 2015-11-03

Changed in group-based-policy:
status:	Incomplete → Confirmed

Robert Kukura (rkukura) wrote on 2015-11-04:

#12

Does this also occur with the resource_mapping policy driver, or only with apic_mapping?

Can you summarize the steps to reproduce the issue, or tell me which UT(s) it occurs during? I can't seem to figure out which UT was running in the log in comment #4.

Is it intermittent? It seems to be, at least in the gate.

Do you happen to have DEBUG-level neutron-server logs showing the issue?

OpenStack Infra (hudson-openstack) wrote on 2015-11-09: Fix proposed to group-based-policy (master)

#13

Fix proposed to branch: master
Review: https://review.openstack.org/243334

Changed in group-based-policy:
status:	Confirmed → In Progress

OpenStack Infra (hudson-openstack) wrote on 2015-11-13: Fix merged to group-based-policy (master)

#14

Reviewed: https://review.openstack.org/243334
Committed: https://git.openstack.org/cgit/openstack/group-based-policy/commit/?id=86b4c6d42828ab5d4bc6d8b14d0e915d613fb2c7
Submitter: Jenkins
Branch: master

commit 86b4c6d42828ab5d4bc6d8b14d0e915d613fb2c7
Author: Robert Kukura <email address hidden>
Date: Mon Nov 9 17:16:37 2015 -0500

    Limit ML2 delete_network/subnet retries

    Monkey-patch ML2's delete_network() and delete_subnet() methods to
    limit the number of times they retry to avoid potential infinite
    loops. Also add some logging to help determine when/why the
    delete_network() loops occur. This does not resolve the actual bug -
    it just mitigates the damage when it occurs.

    Partial-bug: 1510327
    Related-bug: 1470646

    Change-Id: I193d56b0ed16bcc69f434a87d11a355e9177eb1e
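
For readers unfamiliar with the approach this commit message describes, the idea of capping retries through a monkey-patch can be sketched as follows. This is only an illustration and does not reproduce the merged change or ML2's real code: `FakePlugin`, `_try_delete_subnet`, `MAX_RETRIES`, and `RetryLimitExceeded` are hypothetical names, and the actual retry limit is not stated in this thread.

```python
MAX_RETRIES = 5  # hypothetical cap; the real limit is not given in this thread


class RetryLimitExceeded(Exception):
    """Raised instead of looping forever (hypothetical exception name)."""


class FakePlugin:
    """Stand-in for a plugin whose delete loops until one attempt succeeds."""

    def __init__(self, failures_before_success):
        self.remaining_failures = failures_before_success
        self.attempts = 0

    def _try_delete_subnet(self, subnet_id):
        # One delete attempt; returns False when the caller should retry.
        self.attempts += 1
        if self.remaining_failures > 0:
            self.remaining_failures -= 1
            return False
        return True

    def delete_subnet(self, subnet_id):
        # Unbounded retry: never returns if attempts keep failing.
        while not self._try_delete_subnet(subnet_id):
            pass


def bounded_delete_subnet(self, subnet_id):
    """Replacement method that gives up after MAX_RETRIES attempts."""
    for _ in range(MAX_RETRIES):
        if self._try_delete_subnet(subnet_id):
            return
    raise RetryLimitExceeded(
        "gave up deleting subnet %s after %d attempts" % (subnet_id, MAX_RETRIES)
    )


# The monkey-patch: replace the method on the class itself.
FakePlugin.delete_subnet = bounded_delete_subnet
```

Raising an exception once the cap is hit turns a silent hang into a visible failure in the log, which matches the stated goal of mitigating the damage without fixing the underlying cause.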

OpenStack Infra (hudson-openstack) wrote on 2015-11-13: Fix proposed to group-based-policy (stable/kilo)

#15

Fix proposed to branch: stable/kilo
Review: https://review.openstack.org/245350

OpenStack Infra (hudson-openstack) wrote on 2015-11-14: Fix merged to group-based-policy (stable/kilo)

#16

Reviewed: https://review.openstack.org/245350
Committed: https://git.openstack.org/cgit/openstack/group-based-policy/commit/?id=095216e11d1c8b92694f50d7ed8bf11df5a68744
Submitter: Jenkins
Branch: stable/kilo

commit 095216e11d1c8b92694f50d7ed8bf11df5a68744
Author: Robert Kukura <email address hidden>
Date: Mon Nov 9 17:16:37 2015 -0500

    Limit ML2 delete_network/subnet retries

    Monkey-patch ML2's delete_network() and delete_subnet() methods to
    limit the number of times they retry to avoid potential infinite
    loops. Also add some logging to help determine when/why the
    delete_network() loops occur. This does not resolve the actual bug -
    it just mitigates the damage when it occurs.

    Partial-bug: 1510327
    Related-bug: 1470646

    Change-Id: I193d56b0ed16bcc69f434a87d11a355e9177eb1e
    (cherry picked from commit 86b4c6d42828ab5d4bc6d8b14d0e915d613fb2c7)

tags:

added: in-stable-kilo

OpenStack Infra (hudson-openstack) wrote on 2015-11-14: Fix proposed to group-based-policy (stable/juno)

#17

Fix proposed to branch: stable/juno
Review: https://review.openstack.org/245435

OpenStack Infra (hudson-openstack) wrote on 2015-11-15: Fix merged to group-based-policy (stable/juno)

#18

Reviewed: https://review.openstack.org/245435
Committed: https://git.openstack.org/cgit/openstack/group-based-policy/commit/?id=bea20393eacde996245c7f07f76d13fb585b96c8
Submitter: Jenkins
Branch: stable/juno

commit bea20393eacde996245c7f07f76d13fb585b96c8
Author: Robert Kukura <email address hidden>
Date: Mon Nov 9 17:16:37 2015 -0500

    Limit ML2 delete_network/subnet retries

    Monkey-patch ML2's delete_network() and delete_subnet() methods to
    limit the number of times they retry to avoid potential infinite
    loops. Also add some logging to help determine when/why the
    delete_network() loops occur. This does not resolve the actual bug -
    it just mitigates the damage when it occurs.

    Partial-bug: 1510327
    Related-bug: 1470646

    Conflicts:
        gbpservice/neutron/extensions/patch_ml2.py

    Change-Id: I193d56b0ed16bcc69f434a87d11a355e9177eb1e
    (cherry picked from commit 86b4c6d42828ab5d4bc6d8b14d0e915d613fb2c7)
    (cherry picked from commit 095216e11d1c8b92694f50d7ed8bf11df5a68744)

tags:

added: in-stable-juno

Robert Kukura (rkukura) wrote on 2015-11-16:

#19

Now that fixes to prevent infinite looping have merged to juno, kilo and master, I've reduced the importance from critical to high. We now need to capture neutron-server logs when this issue occurs and the new exception is raised, so we can determine why this looping occurs. Debug-level logs would be most useful, but non-debug logs may now also provide useful info, so please attach either to this bug report.

Changed in group-based-policy:
importance:	Critical → High

Sumit Naiksatam (snaiksat) on 2016-01-10

Changed in group-based-policy:
milestone:	liberty-1 → next

GBP: Deleting groups leads to subnet-delete in infinite loop