[SRU] Fix Race between L3 agent and neutron-ns-cleanup

Bug #1623664 reported by Gautam Kulkarni on 2016-09-14
20
This bug affects 3 people
Affects Status Importance Assigned to Milestone
Ubuntu Cloud Archive
Undecided
Unassigned
neutron
Undecided
Unassigned
neutron (Ubuntu)
Undecided
Unassigned
Xenial
Undecided
Unassigned
Yakkety
Undecided
Unassigned
Zesty
Undecided
Unassigned
neutron-lbaas (Ubuntu)
Undecided
Unassigned
Xenial
Undecided
Unassigned
Yakkety
Undecided
Unassigned

Bug Description

I suspect a race between the neutron L3 agent and the neutron-netns-cleanup script, which runs as a CRON job in Ubuntu. Here's a stack trace in the router delete code path:

2016-08-03 03:30:03.392 2595 ERROR neutron.callbacks.manager [-] Error during notification for neutron.agent.metadata.driver.before_router_removed router, before_delete
2016-08-03 03:30:03.392 2595 ERROR neutron.callbacks.manager Traceback (most recent call last):
2016-08-03 03:30:03.392 2595 ERROR neutron.callbacks.manager File "/usr/lib/python2.7/dist-packages/neutron/callbacks/manager.py", line 141, in _notify_loop
2016-08-03 03:30:03.392 2595 ERROR neutron.callbacks.manager callback(resource, event, trigger, **kwargs)
2016-08-03 03:30:03.392 2595 ERROR neutron.callbacks.manager File "/usr/lib/python2.7/dist-packages/neutron/agent/metadata/driver.py", line 176, in before_router_removed
2016-08-03 03:30:03.392 2595 ERROR neutron.callbacks.manager router.iptables_manager.apply()
2016-08-03 03:30:03.392 2595 ERROR neutron.callbacks.manager File "/usr/lib/python2.7/dist-packages/neutron/agent/linux/iptables_manager.py", line 423, in apply
2016-08-03 03:30:03.392 2595 ERROR neutron.callbacks.manager return self._apply()
2016-08-03 03:30:03.392 2595 ERROR neutron.callbacks.manager File "/usr/lib/python2.7/dist-packages/neutron/agent/linux/iptables_manager.py", line 431, in _apply
2016-08-03 03:30:03.392 2595 ERROR neutron.callbacks.manager return self._apply_synchronized()
2016-08-03 03:30:03.392 2595 ERROR neutron.callbacks.manager File "/usr/lib/python2.7/dist-packages/neutron/agent/linux/iptables_manager.py", line 457, in _apply_synchronized
2016-08-03 03:30:03.392 2595 ERROR neutron.callbacks.manager save_output = self.execute(args, run_as_root=True)
2016-08-03 03:30:03.392 2595 ERROR neutron.callbacks.manager File "/usr/lib/python2.7/dist-packages/neutron/agent/linux/utils.py", line 159, in execute
2016-08-03 03:30:03.392 2595 ERROR neutron.callbacks.manager raise RuntimeError(m)
2016-08-03 03:30:03.392 2595 ERROR neutron.callbacks.manager RuntimeError:
2016-08-03 03:30:03.392 2595 ERROR neutron.callbacks.manager Command: ['sudo', '/usr/bin/neutron-rootwrap', '/etc/neutron/rootwrap.conf', 'ip', 'netns', 'exec', 'qrouter-69ef3d5c-1ad1-42fb-8a1e-8d949837bbf8', 'iptables-save']
2016-08-03 03:30:03.392 2595 ERROR neutron.callbacks.manager Exit code: 1
2016-08-03 03:30:03.392 2595 ERROR neutron.callbacks.manager Stdin:
2016-08-03 03:30:03.392 2595 ERROR neutron.callbacks.manager Stdout:
2016-08-03 03:30:03.392 2595 ERROR neutron.callbacks.manager Stderr: Cannot open network namespace "qrouter-69ef3d5c-1ad1-42fb-8a1e-8d949837bbf8": No such file or directory
2016-08-03 03:30:03.392 2595 ERROR neutron.callbacks.manager
2016-08-03 03:30:03.392 2595 ERROR neutron.callbacks.manager
2016-08-03 03:30:03.393 2595 ERROR neutron.agent.l3.agent [-] Error while deleting router 69ef3d5c-1ad1-42fb-8a1e-8d949837bbf8
2016-08-03 03:30:03.393 2595 ERROR neutron.agent.l3.agent Traceback (most recent call last):
2016-08-03 03:30:03.393 2595 ERROR neutron.agent.l3.agent File "/usr/lib/python2.7/dist-packages/neutron/agent/l3/agent.py", line 344, in _safe_router_removed
2016-08-03 03:30:03.393 2595 ERROR neutron.agent.l3.agent self._router_removed(router_id)
2016-08-03 03:30:03.393 2595 ERROR neutron.agent.l3.agent File "/usr/lib/python2.7/dist-packages/neutron/agent/l3/agent.py", line 360, in _router_removed
2016-08-03 03:30:03.393 2595 ERROR neutron.agent.l3.agent self, router=ri)
2016-08-03 03:30:03.393 2595 ERROR neutron.agent.l3.agent File "/usr/lib/python2.7/dist-packages/neutron/callbacks/registry.py", line 44, in notify
2016-08-03 03:30:03.393 2595 ERROR neutron.agent.l3.agent _get_callback_manager().notify(resource, event, trigger, **kwargs)
2016-08-03 03:30:03.393 2595 ERROR neutron.agent.l3.agent File "/usr/lib/python2.7/dist-packages/neutron/callbacks/manager.py", line 123, in notify
2016-08-03 03:30:03.393 2595 ERROR neutron.agent.l3.agent raise exceptions.CallbackFailure(errors=errors)
2016-08-03 03:30:03.393 2595 ERROR neutron.agent.l3.agent CallbackFailure: Callback neutron.agent.metadata.driver.before_router_removed failed with "
2016-08-03 03:30:03.393 2595 ERROR neutron.agent.l3.agent Command: ['sudo', '/usr/bin/neutron-rootwrap', '/etc/neutron/rootwrap.conf', 'ip', 'netns', 'exec', 'qrouter-69ef3d5c-1ad1-42fb-8a1e-8d949837bbf8', 'iptables-save']
2016-08-03 03:30:03.393 2595 ERROR neutron.agent.l3.agent Exit code: 1
2016-08-03 03:30:03.393 2595 ERROR neutron.agent.l3.agent Stdin:
2016-08-03 03:30:03.393 2595 ERROR neutron.agent.l3.agent Stdout:
2016-08-03 03:30:03.393 2595 ERROR neutron.agent.l3.agent Stderr: Cannot open network namespace "qrouter-69ef3d5c-1ad1-42fb-8a1e-8d949837bbf8": No such file or directory
2016-08-03 03:30:03.393 2595 ERROR neutron.agent.l3.agent "
2016-08-03 03:30:03.393 2595 ERROR neutron.agent.l3.agent

In this case, the cleanup first deleted the qrouter namespace it found to be empty (not containing any netdevs other than lo). The router delete flow attempts to delete iptables rules within the namespace before deleting the namespace itself. However, if the namespace is deleted first, the iptables-save command on a non-existent namespace fails. The resulting exception prevents the router delete flow from succeeding and the L3 agent gets stuck in a failure loop.

Can somebody confirm if this is a known issue or if I've misunderstood the problem. Assuming my analysis is correct, would the following fix work?

diff --git a/neutron/agent/l3/agent.py b/neutron/agent/l3/agent.py
index b096091..8d3e8ae 100644
--- a/neutron/agent/l3/agent.py
+++ b/neutron/agent/l3/agent.py
@@ -358,8 +358,16 @@ class L3NATAgent(firewall_l3_agent.FWaaSL3AgentRpcCallback,
             self.namespaces_manager.ensure_router_cleanup(router_id)
             return

- registry.notify(resources.ROUTER, events.BEFORE_DELETE,
- self, router=ri)
+ try:
+ registry.notify(resources.ROUTER, events.BEFORE_DELETE,
+ self, router=ri)
+ except Exception as e:
+ ns_err = "Cannot open network namespace qrouter-" + router_id
+ if ns_err not in e:
+ raise
+ else:
+ LOG.warn(_LW("Namespace for router %s already deleted"),
+ router_id)

Assaf Muller (amuller) wrote :

I don't think running the netns-cleanup script in a CRON job is desired behavior.

Gautam Kulkarni (kulkarni) wrote :

Thanks, Assaf, for your quick response. Do you think that this is an Ubuntu package bug? We are not explicitly creating the CRON job. Would the fix be to remove the CRON job? Thanks.

I agree Assaf. Why do we need to run neutron-netns-cleanup with l3-agent? Although I don't know ubuntu package behavior.

Changed in neutron:
status: New → Invalid
Akihiro Motoki (amotoki) wrote :

The recommended solution is to disable or delete Ubuntu cron job of neutron-netns-cleanup.
neutron-netns-cleanup command searches network namespaces only with a loopback interface and deletes such namespaces. This potentially creates a situation where undesired netns deletion occurs.

I think the cron job was added when the neutron default behavior was not to delete namespaces. it was long ago.

In addition, Ubuntu package registers three neutron-netns-cleanup cron jobs (for dhcp-agent, l3-agent, lbaas-agent), but all of three do the same job. This also looks tricky.

James Dennis (jtmes) wrote :

Anyone from the cloud archive project have an opinion on a fix for this?

Corey Bryant (corey.bryant) wrote :

These cron jobs are quite old (introduced in grizzly) so if we can drop them now I'd like to do that.

Corey Bryant (corey.bryant) wrote :

I need to confirm for sure that these cron jobs are no longer needed before dropping them. If neutron is handling namespace cleanup now then it sounds like we can, however I need to dig in and confirm. I'll do that soon.

Chuck Short (zulcss) on 2017-03-27
summary: - Race between L3 agent and neutron-ns-cleanup
+ [SRU] Race between L3 agent and neutron-ns-cleanup

[Description]
A race condition on L3 agents can occur when the cron jobs run when the routing tables are empty. This results in the traceback above.

[Test Case]
Run tempest and then check the namespaces after running neutron-ns-cleanup

[Regression Potential]
None

Cron job for -cleanup? You should be brave to do that... Please don't.

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package neutron-lbaas - 2:10.0.0-0ubuntu2

---------------
neutron-lbaas (2:10.0.0-0ubuntu2) zesty; urgency=medium

  * debian/neutron-lbaas-common.install: Remove cron jobs since
    it will cause a race with L3 agents and neutron. (LP: #1623664)
  * d/control: Remove duplicate depenendency.

 -- Chuck Short <email address hidden> Mon, 27 Mar 2017 13:04:16 -0400

Changed in neutron-lbaas (Ubuntu):
status: New → Fix Released
Łukasz Zemczak (sil2100) wrote :

Can we please have the bug description updated to include SRU information? Thank you!
https://wiki.ubuntu.com/StableReleaseUpdates#SRU_Bug_Template

Chuck Short (zulcss) on 2017-04-06
summary: - [SRU] Race between L3 agent and neutron-ns-cleanup
+ [SRU] Fix Race between L3 agent and neutron-ns-cleanup
Łukasz Zemczak (sil2100) wrote :

Another thing besides the bug description needing to follow the SRU template - the same change that has been released to zesty (version 2:10.0.0-0ubuntu4) fails to build from source. Build failure:

https://launchpad.net/ubuntu/+source/neutron/2:10.0.0-0ubuntu4/+build/12366961

Please make sure that the devel version is buildable and migrates to the release pocket.

Reviewed: https://review.openstack.org/450271
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=3889b0f214fbb1d32531395cff232cf57daf7b17
Submitter: Jenkins
Branch: master

commit 3889b0f214fbb1d32531395cff232cf57daf7b17
Author: Saverio Proto <email address hidden>
Date: Mon Mar 27 16:39:16 2017 +0200

    Dont try to apply iptables rules in a endless loop

    If the namespace does not exist the current behavior
    is to try to apply the iptables rules forever in a
    endless loop. This fills up the logs on the network
    node and leads to outage.

    Change-Id: I628b18a66f9478d7349fa1817431aae2f62ee105
    Related-bug: #1623664
    Related-bug: #1573073

James Page (james-page) wrote :

@Lukasz

I working to resolve the autopkgtest failure caused by 10.0.0-0ubuntu5

James Page (james-page) wrote :

@Lukasz

I've uploaded a new version of neutron 10.0.0-0ubuntu5.1 to unapproved queue (it will show as a sync from biletto).

This resolves the autopkgtest failures and does actually build.

Reviewed: https://review.openstack.org/457468
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=44c3ee8cb2bc9adb6f4f95ba3f6be4c3abe344df
Submitter: Jenkins
Branch: stable/newton

commit 44c3ee8cb2bc9adb6f4f95ba3f6be4c3abe344df
Author: Saverio Proto <email address hidden>
Date: Mon Mar 27 16:39:16 2017 +0200

    Dont try to apply iptables rules in a endless loop

    If the namespace does not exist the current behavior
    is to try to apply the iptables rules forever in a
    endless loop. This fills up the logs on the network
    node and leads to outage.

    Change-Id: I628b18a66f9478d7349fa1817431aae2f62ee105
    Related-bug: #1623664
    Related-bug: #1573073
    (cherry picked from commit 3889b0f214fbb1d32531395cff232cf57daf7b17)

tags: added: in-stable-newton

Hello Gautam, or anyone else affected,

Accepted neutron into zesty-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/neutron/2:10.0.0-0ubuntu5.1 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed.Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-needed to verification-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

Changed in neutron (Ubuntu Zesty):
status: New → Fix Committed
tags: added: verification-needed
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package neutron - 2:10.0.0-0ubuntu5.1

---------------
neutron (2:10.0.0-0ubuntu5.1) zesty; urgency=medium

  * d/rules: Revert the majority of changes made under 2:10.0.0-0ubuntu5
    as these break default configuration and are the root cause of
    the autopkgtest failures.

neutron (2:10.0.0-0ubuntu5) zesty; urgency=medium

  * d/rules: Fix failing unit tests.

neutron (2:10.0.0-0ubuntu4) zesty; urgency=medium

  * d/neutron-common.install, d/neutron-dhcp-agent.install:
    Remove cron jobs since they will cause a race when
    using an L3 agent. (LP: #1623664)

 -- James Page <email address hidden> Wed, 19 Apr 2017 11:49:10 +0100

Changed in neutron (Ubuntu):
status: New → Fix Released

Reviewed: https://review.openstack.org/457467
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=8fed89b12b3eb6d3c73a4f53dcf66f85f5bd4e8f
Submitter: Jenkins
Branch: stable/ocata

commit 8fed89b12b3eb6d3c73a4f53dcf66f85f5bd4e8f
Author: Saverio Proto <email address hidden>
Date: Mon Mar 27 16:39:16 2017 +0200

    Dont try to apply iptables rules in a endless loop

    If the namespace does not exist the current behavior
    is to try to apply the iptables rules forever in a
    endless loop. This fills up the logs on the network
    node and leads to outage.

    Change-Id: I628b18a66f9478d7349fa1817431aae2f62ee105
    Related-bug: #1623664
    Related-bug: #1573073
    (cherry picked from commit 3889b0f214fbb1d32531395cff232cf57daf7b17)

tags: added: in-stable-ocata
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers