Firewall fails to become active within 300 seconds

Bug #1314313 reported by Salvatore Orlando
30
This bug affects 6 people
Affects Status Importance Assigned to Milestone
neutron
Invalid
Low
Unassigned

Bug Description

Upstream jobs are revealing that a FWaaS extension test is failing because the firewall does not become active within tempest's timeout (300 seconds).

It seems the l3 agent receives the firewall create notification from the neutron server but then either the firewall is not processed or the notification not sent back to the neutron server.

logstash: http://logstash.openstack.org/#eyJzZWFyY2giOiJtZXNzYWdlOlwiVGltZWQgb3V0IHdhaXRpbmcgZm9yIGZpcmV3YWxsXCIgQU5EIG1lc3NhZ2U6XCJiZWNvbWUgQUNUSVZFXCIgQU5EIHRhZ3M6XCJjb25zb2xlXCIiLCJmaWVsZHMiOltdLCJvZmZzZXQiOjAsInRpbWVmcmFtZSI6IjE3MjgwMCIsImdyYXBobW9kZSI6ImNvdW50IiwidGltZSI6eyJ1c2VyX2ludGVydmFsIjowfSwic3RhbXAiOjEzOTg3OTM2MTYxOTd9

This bug is being marked as critical because of 9 hits in the gate queue in the past 48 hours.

tags: added: icehouse-backport-potential
Revision history for this message
Salvatore Orlando (salvatore-orlando) wrote :

taking the bug for triaging

Changed in neutron:
assignee: nobody → Salvatore Orlando (salvatore-orlando)
Revision history for this message
Salvatore Orlando (salvatore-orlando) wrote :

The firewall is not being created because according to the l3 agent there is no router for tenant where the firewall could be attached:
http://logs.openstack.org/84/89484/4/gate/gate-tempest-dsvm-neutron/e22289e/logs/screen-q-vpn.txt.gz#_2014-04-30_15_44_39_238

This is happening because the router has not yet been processed by the agent. The router create notification was indeed received but not yet processed. This means the router is not yet in the agent's router_info structure, and then discarded when considering routers where the firewall can be implemented: https://github.com/openstack/neutron/blob/master/neutron/services/firewall/agents/l3reference/firewall_l3_agent.py#L102

This is apparently not an easy fix, as the ability of successfully creating a firewall depends on the configuration state of a router. If the create_firewall message can't be processed because the router is not yet configured on the l3 agent the firewall will never be configured.

Also, an interesting message has been noted:
http://logs.openstack.org/84/89484/4/gate/gate-tempest-dsvm-neutron/e22289e/logs/screen-q-vpn.txt.gz#_2014-04-30_15_44_38_852

This message complains that the system can't delete the router, but nobody asked it to delete it. There is a likely explanation with this, and a different bug report should be filed.

Changed in neutron:
status: New → Triaged
Revision history for this message
Sumit Naiksatam (snaiksat) wrote :

Thanks Salvatore for triaging this. Your analysis is spot on, the firewall depends on the router, and if the router is not there, the firewall cannot be enabled. This firewall should eventually be enabled though once the router is created/processed. It might just not be happening in the time window given to it by the test. We can disable the particular test for now, and/or try adjusting the wait.

Revision history for this message
Sridar Kandaswamy (skandasw) wrote :

Thanks Sumit, Salavatore.

To add to Sumit, even in case there is a race, the firewall will start off being in a PENDING_CREATE state and when the router and interface add happens - this will be processed and which will push the firewall to an ACTIVE state.
https://github.com/openstack/neutron/blob/master/neutron/services/firewall/agents/l3reference/firewall_l3_agent.py#L220

I retested this manually to ensure that there is no regression here. But since this seems to happen intermittently - seems like some timing window - but it is not very obvious to me how we land here.

Revision history for this message
Sridar Kandaswamy (skandasw) wrote :

Salvatore, Sumit - some more data from traversing the logs - i see some light at the end of the tunnel (but i think it is an oncoming locomotive)

1) In this run the tests are run twice,
     tempest.api.network.test_fwaas_extensions.FWaaSExtensionTestJSON (All tests pass)
     tempest.api.network.test_fwaas_extensions.FWaaSExtensionTestXML (one test fails)

2) In the good run, behavior is as expected (it so happens that the router_added event happens after the firewall create and that is handled by the agent going back to the plugin to pick up the config, applying it and sending a msg back to the plugin (verified on the server side logs) which moves the state from PENDING_CREATE to ACTIVE). Snippets of the relevant logs here:

http://paste.openstack.org/show/79121/

3) In the bad run, the router in question, 29373f85-a738-4844-83b9-88368f989cb0, seems to have an issue just as the test is being run:

2014-04-30 15:44:38.852 19165 WARNING neutron.agent.l3_agent [-] Info for router 29373f85-a738-4844-83b9-88368f989cb0 were not found. Skipping router removal

Salvatore, u have pointed this out in ur notes as well and this seems relevant to what we observe.

As a result, process_router_add which gets kicked on a new router with an i/f add which picks up the config does not seem to happen in the expected time. Snippets here:

http://paste.openstack.org/show/79126/

Revision history for this message
Salvatore Orlando (salvatore-orlando) wrote :

Patch 92018 (and 92147 as an alternative) deal with the issue that causes the warning message you saw.
That is a nasty issue which potentially causes a router to not be processed.

It might the cause of the firewall test failure. I just want to confirm with you that the current firewall workflow does not require the router to be ACTIVE. If a router is still DOWN when the firewall is created, then the firewall will be enabled and configured as soon as the router goes up. This is rather important.

Revision history for this message
Sridar Kandaswamy (skandasw) wrote :

Thanks Salvatore. From the current analysis it does seem that this is the same issue as u point out.

On the Firewall workflow, that is correct, the Router can come up after the firewall create. The FW will be in PENDING_CREATE state and when the router comes up and on interface add the Agent will query the plugin for the config, apply and send the status back to the plugin to drive the state to ACTIVE.

Pls let me know if u need further information on the FW side of things.

Revision history for this message
Salvatore Orlando (salvatore-orlando) wrote :

Decreasing to high because there have not been hits in the gate for about a week.
The patch around races in router creation improved the situation for this bug, which however is not yet fixed.

Changed in neutron:
assignee: Salvatore Orlando (salvatore-orlando) → nobody
milestone: none → juno-1
importance: Critical → High
Changed in neutron:
assignee: nobody → Sridar Kandaswamy (skandasw)
Changed in neutron:
status: Triaged → Confirmed
Revision history for this message
Eugene Nikanorov (enikanorov) wrote :

Adding tempest.
Adding short delay between operations in fwaas api test may help to avoid race

Changed in tempest:
status: New → Confirmed
Changed in tempest:
assignee: nobody → Eugene Nikanorov (enikanorov)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/97218

Revision history for this message
Sridar Kandaswamy (skandasw) wrote :

Adding a delay between the operations will certainly make things work so worthwhile to do for gate if this is a problem but i am trying to get to the root cause to flush out the timing issue.

Revision history for this message
Sridar Kandaswamy (skandasw) wrote :

My initial assessment was that we were seeing is a race condition btwn creating a firewall and adding the router. But in looking at the associated logs - at least in the sampling i have done - what i see is that the L3Agent(q-vpn) logs show that it failed to start. The q-svc log is peppered with msg such as:

2014-06-04 01:00:42.977 14963 WARNING neutron.scheduler.l3_agent_scheduler [req-f9ec25f6-2fab-46cc-ba7c-911e68c4c5d7 None] No active L3 agents

Now if L3Agent is not running - the firewall will be stuck in PENDING_CREATE as the fwaas Plugin waits on the msg from the Agent acknowledging that the FW rules were applied before going to ACTIVE state.

http://paste.openstack.org/show/82806/

 I will sample more logs to see if a fwaas issue is seen that causes this behavior.

Revision history for this message
Eugene Nikanorov (enikanorov) wrote :

It appears to be the issue of one particular patch on review, not the issue of master.
At least I see only review #97505 and #97506 in logstash

Changed in neutron:
importance: High → Low
no longer affects: tempest
Revision history for this message
Sridar Kandaswamy (skandasw) wrote :

Thanks Eugene for taking a look right away - i think i also saw this on 97623 - will take a look if this is seen on master.

Revision history for this message
Eugene Nikanorov (enikanorov) wrote :

Yep, there is whole set of patches that depend on 97505, i've missed them at first.
97505 has issue that leads l3 agent to fail.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (master)

Reviewed: https://review.openstack.org/97218
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=85c7dc391f110ab916ee902dec80ce44995f002d
Submitter: Jenkins
Branch: master

commit 85c7dc391f110ab916ee902dec80ce44995f002d
Author: Eugene Nikanorov <email address hidden>
Date: Mon Jun 2 14:26:09 2014 +0400

    Log firewall status on delete in case of status inconsistency

    This should help to troubleshoot related gate failres.

    Change-Id: I92d4d886cafed5faeb2a6d0b224f4b5dc8522100
    Related-Bug: #1314313

Kyle Mestery (mestery)
Changed in neutron:
milestone: juno-1 → juno-2
Kyle Mestery (mestery)
Changed in neutron:
milestone: juno-2 → juno-3
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.openstack.org/115377

Changed in neutron:
assignee: Sridar Kandaswamy (skandasw) → Armando Migliaccio (armando-migliaccio)
status: Confirmed → In Progress
Changed in neutron:
assignee: Armando Migliaccio (armando-migliaccio) → nobody
Changed in neutron:
assignee: nobody → Armando Migliaccio (armando-migliaccio)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (master)

Change abandoned by Armando Migliaccio (<email address hidden>) on branch: master
Review: https://review.openstack.org/115377
Reason: this test didn't pan out

Changed in neutron:
assignee: Armando Migliaccio (armando-migliaccio) → nobody
Changed in neutron:
assignee: nobody → Eugene Nikanorov (enikanorov)
importance: Low → Medium
Revision history for this message
Sumit Naiksatam (snaiksat) wrote :

Whats the reason for raising the priority of this? The last time I checked this was only seen in the dvr experimental gate jobs.

Thierry Carrez (ttx)
Changed in neutron:
milestone: juno-3 → juno-rc1
Changed in neutron:
importance: Medium → Low
Revision history for this message
Kyle Mestery (mestery) wrote :

Moving low priority bug out of Juno-RC1.

Changed in neutron:
milestone: juno-rc1 → kilo-1
Revision history for this message
Matt Riedemann (mriedem) wrote :

We don't have hits on this in elastic-recheck anymore and it doesn't appear that anyone is working on this, so marking it incomplete.

Changed in neutron:
assignee: Eugene Nikanorov (enikanorov) → nobody
status: In Progress → Confirmed
status: Confirmed → Incomplete
Revision history for this message
Joe Gordon (jogo) wrote :

Since this bug hasn't been seen marking as closed. If this is seen again feel free to re-open.

Changed in neutron:
status: Incomplete → Invalid
Thierry Carrez (ttx)
Changed in neutron:
milestone: kilo-1 → none
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Duplicates of this bug

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.