FWaaS - firewall deleted unexpectedly by agent

Bug #1483875 reported by Sean M. Collins
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
neutron
Fix Released
Critical
Armando Migliaccio

Bug Description

q-svc log entries

2015-08-11 14:49:11.453 WARNING neutron_fwaas.services.firewall.fwaas_plugin [req-bc4d0a13-2dc2-4dd0-9e70-ee8b8f658418 FWaaSExtensionTestJSON-142252539 FWaaSExtensionTestJSON-1602220697] Firewall ea4d307d-b3cc-408a-a491-1c12d3d7c4a2 unexpectedly deleted by agent, status was ACTIVE
2015-08-11 14:49:11.455 WARNING neutron_fwaas.services.firewall.fwaas_plugin [req-bc4d0a13-2dc2-4dd0-9e70-ee8b8f658418 FWaaSExtensionTestJSON-142252539 FWaaSExtensionTestJSON-1602220697] Firewall ea4d307d-b3cc-408a-a491-1c12d3d7c4a2 unexpectedly deleted by agent, status was ACTIVE

<snip>

2015-08-11 14:54:12.642 INFO neutron.api.v2.resource [req-b3dc443e-cbf2-4f08-88e8-7dc27166c69b FWaaSExtensionTestJSON-142252539 FWaaSExtensionTestJSON-1602220697] delete failed (client error): Firewall Policy da881afc-c237-40e5-a215-c7120dfcca9c is being used.

Kibana query:

http://logstash.openstack.org/#eyJzZWFyY2giOiJtZXNzYWdlOlwidGVhckRvd25DbGFzcyAobmV1dHJvbi50ZXN0cy5hcGkudGVzdF9md2Fhc19leHRlbnNpb25zLkZXYWFTRXh0ZW5zaW9uVGVzdEpTT04pXCIiLCJmaWVsZHMiOltdLCJvZmZzZXQiOjAsInRpbWVmcmFtZSI6IjQzMjAwIiwiZ3JhcGhtb2RlIjoiY291bnQiLCJ0aW1lIjp7InVzZXJfaW50ZXJ2YWwiOjB9LCJzdGFtcCI6MTQzOTMyMjE4Nzg0M30=

tags: added: fwaas
description: updated
description: updated
description: updated
Revision history for this message
Ihar Hrachyshka (ihar-hrachyshka) wrote :

This started to occur once devstack-gate patch switched neutron to new db driver and workers: https://review.openstack.org/#q,08daf5dc1413b12871cf6adda9fa3a36aab35deb,n,z

The revert for that is at: https://review.openstack.org/#/c/211853/

We may need unstable jobs for *aas to make sure it's tested before moving jobs to the new driver.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/211947

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Related fix proposed to branch: master
Review: https://review.openstack.org/211979

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (master)

Reviewed: https://review.openstack.org/211979
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=764e716cd658dfa9b4cbb07be6bbbeff99e21eaf
Submitter: Jenkins
Branch: master

commit 764e716cd658dfa9b4cbb07be6bbbeff99e21eaf
Author: Sean M. Collins <email address hidden>
Date: Wed Aug 12 07:42:04 2015 -0400

    Skip FwaaS test that is failing due to race condition

    Change-Id: I8a7669b6aff0b407b192af803fbfb636c8f118bb
    Related-Bug: #1483875

Revision history for this message
Ihar Hrachyshka (ihar-hrachyshka) wrote :

So here are some ideas about how it could fail.

The test creates a firewall, then deletes it and waits until the db object is gone. API delete != db delete for fwaas firewalls. Instead, the service plugin sets the status of the firewall to PENDING_DELETE, then sends delete requests to the agent serving the firewall, and proceeds. Then the agent shuts it down and replies with firewall_deleted() RPC. It checks the status is still PENDING_DELETE and deletes the db object. But if the status is some other value, it logs the warning and does nothing. That's what happens, and that's why API test times out waiting on the firewall to be deleted from the db.

Now, the status of the firewall at the time of firewall_deleted() warning is ACTIVE. It may be the case that we have a race between PENDING_DELETE and set_firewall_status which can set ACTIVE for the firewall.

I see that delete_firewall entrypoint of the service plugin uses multiple transactions to access the same firewall. I think the race can be hidden there. Now that db driver yields, it could yield to a thread that handles status updates and update it to ACTIVE somewhere in the middle of delete processing.

The test does nothing with the firewall, so delete request hits neutron-server right after create, probably at the moment when the agent was initializing it and was about to reply with ACTIVE status.

Revision history for this message
Armando Migliaccio (armando-migliaccio) wrote :

From a look at the code, my understanding is that eventually the firewall should go in error state, but it doesn't look like it does; there may be another issue at play here.

Changed in neutron:
status: New → Confirmed
importance: Undecided → High
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron-fwaas (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/212143

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/212145

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Related fix proposed to branch: master
Review: https://review.openstack.org/212226

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (master)

Change abandoned by Armando Migliaccio (<email address hidden>) on branch: master
Review: https://review.openstack.org/212226

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron-fwaas (master)

Reviewed: https://review.openstack.org/212143
Committed: https://git.openstack.org/cgit/openstack/neutron-fwaas/commit/?id=6df507e00ff1cfac39e3a24c55190ce58774bc31
Submitter: Jenkins
Branch: master

commit 6df507e00ff1cfac39e3a24c55190ce58774bc31
Author: armando-migliaccio <email address hidden>
Date: Wed Aug 12 10:47:58 2015 -0700

    Do not make delete_db_firewall conditional

    The delete_db_firewall should not be made conditional, as the
    ability to delete the firewall is already determined by the caller.

    This patch improves debug statements to ease troubleshooting.

    Related-bug: #1483875

    Change-Id: I35b171e5827af9b6217605c0654386704f4fab83

Changed in neutron:
assignee: nobody → Armando Migliaccio (armando-migliaccio)
Changed in neutron:
status: Confirmed → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron-fwaas (master)

Fix proposed to branch: master
Review: https://review.openstack.org/213936

Changed in neutron:
assignee: Armando Migliaccio (armando-migliaccio) → Ihar Hrachyshka (ihar-hrachyshka)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/213937

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron-fwaas (master)

Change abandoned by Ihar Hrachyshka (<email address hidden>) on branch: master
Review: https://review.openstack.org/213936
Reason: Apparently a non-solution.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (master)

Change abandoned by Ihar Hrachyshka (<email address hidden>) on branch: master
Review: https://review.openstack.org/213937
Reason: The alternative approach does not work.

Changed in neutron:
assignee: Ihar Hrachyshka (ihar-hrachyshka) → Armando Migliaccio (armando-migliaccio)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron-fwaas (master)

Reviewed: https://review.openstack.org/212811
Committed: https://git.openstack.org/cgit/openstack/neutron-fwaas/commit/?id=1ca1fc300d6bd9b22f875f593c461b8b2ca547cf
Submitter: Jenkins
Branch: master

commit 1ca1fc300d6bd9b22f875f593c461b8b2ca547cf
Author: armando-migliaccio <email address hidden>
Date: Thu Aug 13 15:00:07 2015 -0700

    Reduce contention between update and delete

    When a delete comes right after an update, the firewall state
    can change to ACTIVE right before the RPC message is sent over
    the wire, but after the state was transitioned to PENDING_DELETE.

    This ends up confusing the agent which sees a firewall being
    deleted but with an ACTIVE state.

    Deferring the state transition closer to the RPC message minimizes
    the chances of a race, and forcing the right state in the payload
    ensures that the agent gets what it needs.

    Closes-bug: #1483875

    Change-Id: I455bbe00395377509beca5ec1d54cf4f7a57152d

Changed in neutron:
status: In Progress → Fix Committed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (master)

Reviewed: https://review.openstack.org/212145
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=7e0a4dc9797101833e183c3d1b95fb947af49eac
Submitter: Jenkins
Branch: master

commit 7e0a4dc9797101833e183c3d1b95fb947af49eac
Author: armando-migliaccio <email address hidden>
Date: Wed Aug 12 11:49:46 2015 -0700

    Unskip firewall test

    Dependent patch addresses the issue, so we no longer
    need to skip the test

    Related-bug: #1483875

    Depends-on: I455bbe00395377509beca5ec1d54cf4f7a57152d
    Depends-on: I00d99c05328268b33c6f20b1d24e7507afccb92c

    Change-Id: I41b51f36c1344dce20b1b9437c175469acfafb57

Revision history for this message
Ihar Hrachyshka (ihar-hrachyshka) wrote :

I see it occurs again in gate after the test was re-enabled.

Revision history for this message
Armando Migliaccio (armando-migliaccio) wrote :
Changed in neutron:
milestone: none → liberty-3
Revision history for this message
Armando Migliaccio (armando-migliaccio) wrote :

I'll re-open it and look into it.

Changed in neutron:
status: Fix Committed → In Progress
importance: High → Critical
Revision history for this message
Armando Migliaccio (armando-migliaccio) wrote :

It has become less frequent but it still bites us.

Revision history for this message
Armando Migliaccio (armando-migliaccio) wrote :

42 hits in the last 48 hours apparently.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron-fwaas (master)

Fix proposed to branch: master
Review: https://review.openstack.org/218034

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/218035

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron-fwaas (master)

Reviewed: https://review.openstack.org/218034
Committed: https://git.openstack.org/cgit/openstack/neutron-fwaas/commit/?id=7c723e4b9a140d472dd272f4ebf05df37942d696
Submitter: Jenkins
Branch: master

commit 7c723e4b9a140d472dd272f4ebf05df37942d696
Author: armando-migliaccio <email address hidden>
Date: Thu Aug 27 18:06:06 2015 -0700

    Kill contention between update and delete

    Patch I455bbe00395377509beca5ec1d54cf4f7a57152d didn't go enough and we knew it. This patch
    overhauls the status update logic to kill the race contention once and for all by updating in
    place (or at least that's the wishful thinking).

    The logic of preventing updates for firewall in progress of being deleted is preserved. This
    is done by introducing a new update method that encapsulates DB logic where it belongs.

    Code being affected is covered already so no more coverage is deemed necessary.

    Closes-bug: 1483875

    Change-Id: Ie6fc6d2545b778f8264d57201827dd82618259f9

Changed in neutron:
status: In Progress → Fix Committed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (master)

Change abandoned by Armando Migliaccio (<email address hidden>) on branch: master
Review: https://review.openstack.org/218035

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (feature/pecan)

Related fix proposed to branch: feature/pecan
Review: https://review.openstack.org/218710

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (feature/pecan)
Download full text (155.6 KiB)

Reviewed: https://review.openstack.org/218710
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=2c5f44e1b3bd4ed8a0b7232fd293b576cc8c1c87
Submitter: Jenkins
Branch: feature/pecan

commit f35d1c5c50dccbef1a2e079f967b82f0df0e22e9
Author: Adelina Tuvenie <email address hidden>
Date: Thu Aug 27 02:27:28 2015 -0700

    Fixes wrong neutron Hyper-V Agent name in constants

    Change Id03fb147e11541be309c1cd22ce27e70fadc28b5 moved the
    AGENT_TYPE_HYPERV constant from common.constants to
    plugins.ml2.drivers.hyperv.constants but change the value of the
    constant from 'HyperV agent' to 'hyperv'. This patch changes
    the name back to 'HyperV agent'

    Change-Id: If74b4b2a84811e266c8b12e70bf6bfe74ed4ea21
    Partial-Bug: #1487598

commit de604de334854e2eb6b4312ff57920564cbd4459
Author: OpenStack Proposal Bot <email address hidden>
Date: Sun Aug 30 01:39:06 2015 +0000

    Updated from global requirements

    Change-Id: Ie52aa3b59784722806726e4046bd07f4a4d97328

commit f0415ac20eaf5ab4abb9bd4839bf6d04ceee85d0
Author: armando-migliaccio <email address hidden>
Date: Fri Aug 28 13:53:04 2015 -0700

    Revert "Add support for unaddressed port"

    This implementation may expose a vulnerability where a malicious
    user can sieze the opportunity of a time window where a port
    may land unaddressed on a shared network, thus allowing him/her
    to suck up all the tenant traffic he/she wants....oh the shivers.

    This reverts commit d4c52b7f5a36a103a92bf9dcda7f371959112292.

    Change-Id: I7ebdaa8d3defa80eab90e460fde541a5bdd8864c

commit 013fdcd2a6d45dbe4de5d6e7077e5e9b60985ef9
Author: Assaf Muller <email address hidden>
Date: Fri Aug 28 16:41:07 2015 -0400

    Improve logging upon failure in iptables functional tests

    This will help us nail down a more accurate and efficient logstash
    query.

    Change-Id: Iee4238e358f7b056e373c7be8d6aa3202117a680
    Related-Bug: #1478847

commit 622dea818d851224a43d5276a81d5ce8a6eebb76
Author: Ivar Lazzaro <email address hidden>
Date: Mon Aug 17 17:17:42 2015 -0700

    handle gw_info outside of the db transaction on router creation

    Move the gateway interface creation outside the DB transaction
    to avoid lock timeout.

    Change-Id: I5a78d7f32e8ca912016978105221d5f34618af19
    Closes-bug: 1485809

commit 5b27d290a0a95f6247fc5a0fe6da1e7d905e6b2d
Author: Assaf Muller <email address hidden>
Date: Wed Aug 26 10:07:03 2015 -0400

    Remove ml2 resource extension success logging

    This is the cause of a tremendous amount of logs, for no
    perceivable gain. A normal dvr run in the gate shows this debug
    message around 120K times, which is way too much.

    Closes-Bug: #1489952

    Change-Id: I26fca8515d866a7cc1638d07fa33bc04479ae221

commit 8d3faf549cba2f58c872ef4121b2481e73464010
Author: huangpengtao <email address hidden>
Date: Fri Aug 28 23:20:46 2015 +0800

    Replace "prt" variable by "port"

    the local variable prt is meaningless,
    and port is used popular.

    Change-Id: I20849102cf5b4d84433c46791b4b1e2a22dc4739

commit ee374e7a5f4dea538fcd942f5...

tags: added: in-feature-pecan
Revision history for this message
Armando Migliaccio (armando-migliaccio) wrote :

Looks like https://review.openstack.org/218034 did it. I can't seem to see any occurrences of this bug in the last 72 hours...yay!

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (master)

Change abandoned by Sean M. Collins (<email address hidden>) on branch: master
Review: https://review.openstack.org/211947

Thierry Carrez (ttx)
Changed in neutron:
status: Fix Committed → Fix Released
Thierry Carrez (ttx)
Changed in neutron:
milestone: liberty-3 → 7.0.0
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.