Kernel panic or ovs-vswitch crash when running DVR functional test

Bug #1418097 reported by Assaf Muller on 2015-02-04
18
This bug affects 1 person
Affects Status Importance Assigned to Milestone
neutron
High
Assaf Muller

Bug Description

Still early in triage process. Running:
tox -e dsvm-functional neutron.tests.functional.agent.test_l3_agent.TestDvrRouter.test_dvr_router_lifecycle_without_ha_with_snat_with_fips

Results in a kernel panic or ovs-vswitchd crash, with this patch:
https://review.openstack.org/#/c/151882/

Rebasing one patch earlier results in smooth test runs.

If I don't get a kernel panic, then ovs-vswitchd crashes, at which point most ovs-vsctl commands fail. Restarting the openvswitch service doesn't seem to solve it, only restarting the machine.

Note: It's not necessarily that specific test that is causing issues, but it is one that reproduces the issue reliably. Running the non-DVR test case (L3AgentTestCase) works fine. So does running the OVS lib functional tests.

My devstack VM is running on:
* Fedora 20
* openvswitch-2.3.1-1.fc20.x86_64
* kernel-3.17.2-200.fc20.x86_64

I checked with team mates, the same test on the same code passes with:
* CentOS 7
* openvswitch-2.1.2-2.el7.centos.1.x86_64
* kernel-3.10.0-123.6.3.el7.x86_64

This may happen as a result of a Neutron code change in the patch above, possibly coupled with a regression in a recent OVS version. I think it's clear to assume that this is not only a test-only issue, but could happen in production as well.

Assaf Muller (amuller) on 2015-02-04
description: updated
Assaf Muller (amuller) wrote :

I'd appreciate other people running that test once or twice and posting their OS, kernel and OVS version.

description: updated
Kyle Mestery (mestery) wrote :
Download full text (57.8 KiB)

I can't even get this functional test to run (see below):

[vagrant@localhost neutron]$ sudo tox -e dsvm-functional neutron.tests.functional.agent.test_l3_agent.TestDvrRouter.test_dvr_router_lifecycle_without_ha_with_snat_with_fips
dsvm-functional develop-inst-nodeps: /opt/stack/neutron
dsvm-functional runtests: PYTHONHASHSEED='3075550708'
dsvm-functional runtests: commands[0] | sh tools/pretty_tox.sh neutron.tests.functional.agent.test_l3_agent.TestDvrRouter.test_dvr_router_lifecycle_without_ha_with_snat_with_fips
{0} neutron.tests.functional.agent.test_l3_agent.TestDvrRouter.test_dvr_router_lifecycle_without_ha_with_snat_with_fips [3.104170s] ... FAILED

Captured traceback:
~~~~~~~~~~~~~~~~~~~
    Traceback (most recent call last):
      File "neutron/tests/functional/agent/test_l3_agent.py", line 520, in test_dvr_router_lifecycle_without_ha_with_snat_with_fips
        self._dvr_router_lifecycle(enable_ha=False, enable_snat=True)
      File "neutron/tests/functional/agent/test_l3_agent.py", line 558, in _dvr_router_lifecycle
        router = self.manage_router(self.agent, router_info)
      File "neutron/tests/functional/agent/test_l3_agent.py", line 108, in manage_router
        ri = self._create_router(agent, router)
      File "neutron/tests/functional/agent/test_l3_agent.py", line 112, in _create_router
        agent._process_added_router(router)
      File "neutron/agent/l3/agent.py", line 1091, in _process_added_router
        self.process_router(ri)
      File "neutron/common/utils.py", line 345, in call
        self.logger(e)
      File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 82, in __exit__
        six.reraise(self.type_, self.value, self.tb)
      File "neutron/common/utils.py", line 342, in call
        return func(*args, **kwargs)
      File "neutron/agent/l3/agent.py", line 598, in process_router
        self._process_internal_ports(ri)
      File "neutron/agent/l3/agent.py", line 442, in _process_internal_ports
        self.internal_network_added(ri, p)
      File "neutron/agent/l3/agent.py", line 958, in internal_network_added
        ri.is_ha)
      File "neutron/agent/l3/agent.py", line 935, in _internal_network_added
        prefix=prefix)
      File "neutron/agent/linux/interface.py", line 229, in plug
        self.check_bridge_exists(bridge)
      File "neutron/agent/linux/interface.py", line 168, in check_bridge_exists
        raise exceptions.BridgeDoesNotExist(bridge=bridge)
    BridgeDoesNotExist: Bridge test-br21005152 does not exist.
    Traceback (most recent call last):
    _StringException: Empty attachments:
      pythonlogging:''
      pythonlogging:'neutron.api.extensions'
      stderr

    stdout: {{{
    2015-02-04 16:51:01.104 1818 INFO neutron.common.config [-] Logging enabled!
    2015-02-04 16:51:01.106 1818 INFO neutron.common.config [-] /usr/lib/python2.7/site-packages/subunit/run.py version 2015.1.dev546
    2015-02-04 16:51:01.200 1818 ERROR neutron.agent.linux.ovsdb_vsctl [-] Unable to execute ['ovs-vsctl', '--timeout=10', '--oneline', '--format=json', '--', '--may-exist', 'add-br', 'test-br21005152'].
    2015-02-04 16:51:01.200 1818 TRACE neutron.agent.linux.ovsdb_vsct...

Hi Assaf, are you saying that reverting patch https://review.openstack.org/#/c/151882/ leads you to a successful run? I can't really imagine what harm that patch could do. I wonder if git bisect would lead to a difference culprit.

Assaf Muller (amuller) wrote :

Armando: What I did was rebase back to https://review.openstack.org/#/c/145565/. Everything works. Then I rebase forward to https://review.openstack.org/#/c/151882/ and the mess begins. I agree that that patch looks innocent.

Adolfo Duarte (adolfo-duarte) wrote :

Don't know if I did this right, but here is what I executed
git checkout master
it fetch https://review.openstack.org/openstack/neutron refs/changes/65/145565/27 && git checkout FETCH_HEAD
git checkout -b patch145565
sudo tox -e dsvm-functional neutron.tests.functional.agent.test_l3_agent.TestDvrRouter.test_dvr_router_lifecycle_without_ha_with_snat_with_fips
(PASS)
git fetch https://review.openstack.org/openstack/neutron refs/changes/82/151882/1 && git checkout FETCH_HEAD
git checkout -b patch-151882
git rebase patch145565
sudo tox -e dsvm-functional neutron.tests.functional.agent.test_l3_agent.TestDvrRouter.test_dvr_router_lifecycle_without_ha_with_snat_with_fips
PASS

my vm is:
ubuntu
3.13.0-39-generic
sdn@ubuntu:/opt/dev/neutron$ ovs-vsctl -V
ovs-vsctl (Open vSwitch) 2.0.2
Compiled Aug 15 2014 14:31:02
sdn@ubuntu:/opt/dev/neutron$

Carl Baldwin (carl-baldwin) wrote :

@Assaf So, when https://review.openstack.org/#/c/151882/ is where you rebase to, you mean that f97a9cb3 is the direct parent of your commit? And, when you rebase back to https://review.openstack.org/#/c/145565/ you mean that 59795ab6 is your direct parent? If that is the case, then there is nothing between these two commits and 151882 must be the culprit (though I still cannot imagine why).

Is it possible that you are you rebasing to the merge commit that merges each of these patches? Those would be c8b156fb for 145565 and 88d6d615 for 151882. Hmmm, I guess they merged essentially at the same time and so there is nothing that merged between them. Darn!

Could you be more specific about exactly what commits you were based to just in case there is some room for something else to slip in?

Carl Baldwin (carl-baldwin) wrote :

Just a quick follow-up. No matter if @Assaf based off of the commits or the merge commits or some combination of them, there is no room for any other changes to sneak in:

$ git show-branch f97a9cb3 59795ab6 c8b156fb 88d6d615
! [f97a9cb3] Don't pass the port down to the floating ip processing
 ! [59795ab6] Encapsulate DVR Fip namespace
  ! [c8b156fb] Merge "Encapsulate DVR Fip namespace"
   ! [88d6d615] Merge "Don't pass the port down to the floating ip processing"
----
   - [88d6d615] Merge "Don't pass the port down to the floating ip processing"
+ + [f97a9cb3] Don't pass the port down to the floating ip processing
  -- [c8b156fb] Merge "Encapsulate DVR Fip namespace"
++++ [59795ab6] Encapsulate DVR Fip namespace

So, there goes my theory on that.

OK, so I also experience it. I run neutron.tests.functional.agent.test_l3_agent.L3AgentTestCase.test_conntrack_disassociate_fip for multiple times, it works; then I execute neutron.tests.functional.agent.test_l3_agent.TestDvrRouter.test_dvr_router_lifecycle_without_ha_with_snat_with_fips once, it generates kernel stack trace [1] but still succeeds. Then on next invocation any test using OVS fails.

[1]: https://retrace.fedoraproject.org/faf/reports/566500/

My environment:
- Fedora 21;
- kernel 3.18.5-201.fc21.x86_64;
- openvswitch 2.3.1 2.git20150113.fc21.

Assaf Muller (amuller) on 2015-02-12
Changed in neutron:
status: New → Confirmed
Assaf Muller (amuller) wrote :

I narrowed it down to this crashing script:

ovs-vsctl --timeout=10 --oneline --format=json -- --may-exist add-br test-br
ip netns add test-ns
ovs-vsctl --timeout=10 --oneline --format=json -- --if-exists del-port test-port -- add-port test-br test-port -- set Interface test-port type=internal external_ids:iface-id=42 external_ids:iface-status=active external_ids:attached-mac=ca:fe:de:ad:be:ee
ip link set test-port netns test-ns
ip netns delete test-ns
ovs-vsctl --timeout=10 --oneline --format=json -- --if-exists del-br test-br

Opening a bug on OVS...

Assaf Muller (amuller) wrote :

I opened a bug on OVS. A few thoughts, but first let's translate that script to English:

1) Create a bridge
2) Create a namespace
3) Create a port on the bridge
4) Move the port to the namespace
5) Delete the namespace
6) Delete the bridge (Internally, this deletes the port, and that's where the crash happens)

If we reverse steps 5 and 6, the issue is avoided. This is not to say that we're not facing an OVS bug, but still, we could work around it by the time the potential fix hits OVS and is (Maybe) backported to relevant versions.

Explicitly deleting the interface before deleting the namespace also avoids the issue.

This is to suggest that there's some DVR interface (qg, fpr, frp?) that is not being deleted explicitly during the offending DVR functional test, and thus when a DVR router is deleted outside of a testing context.

To conclude, while we're definitely facing an OVS bug, I think that we have a Neutron DVR bug here as well that should be fixed regardless of any OVS fixes.

Assaf,

Great analysis.

Can you check that the logstash query below would be enough to capture this in the gate?

http://logstash.openstack.org/#eyJzZWFyY2giOiJtZXNzYWdlOlwiQnJpZGdlRG9lc05vdEV4aXN0OiBCcmlkZ2VcIiIsImZpZWxkcyI6W10sIm9mZnNldCI6MCwidGltZWZyYW1lIjoiNjA0ODAwIiwiZ3JhcGhtb2RlIjoiY291bnQiLCJ0aW1lIjp7InVzZXJfaW50ZXJ2YWwiOjB9LCJzdGFtcCI6MTQyMzc4MjcyMDM5M30=

Do you think that this is issue is specifically tied to the distro/kernel/ovs version?

I agree that DVR can be tuned to work around the issue, but I am trying to understand its urgency

Changed in neutron:
importance: Undecided → Medium
tags: added: l3-dvr-backlog

Assaf, could you please link to OVS bug?

Assaf, your script does not seem to trigger the crash on my system though. I've copy pasted it into a shell file and executed with sudo, nothing crashed. Though the DVR test still triggers it.

BTW I've upgraded kernel today to 3.18.6-200.fc21.x86_64.

Kyle Mestery (mestery) wrote :

Moving this to high since it's a kernel crash.

Changed in neutron:
importance: Medium → High
tags: added: ovs
Assaf Muller (amuller) wrote :

@Ihar: The OVS-discuss list archives haven't updated yet, but the OVS guys received the bug report and are on it. I'll link it here when the link is available.

@Armando: A logstash query would not be effective because it seems like this crash doesn't happen on OVS versions that are on Ubuntu right now (I'm assuming this because this issue is not manifesting at the gate, and the only guys that are hitting this issue are running Fedora).

Anyhow, I submitted a patch. It seems like the qg device was never deleted for DVR SNAT routers. While the kernel panic should not happen in production (Because the device is actually not deleted, and the external bridge should not be deleted either, unlike functional tests that clean up after themselves). However, since routers tend to come and go over a long period of time, I guess we don't really want lingering qg devices from long gone routers.

Assaf Muller (amuller) wrote :

Looking back at the patch that I claimed that started it all, I now understand why it did. Looking at:
https://review.openstack.org/#/c/151882/1/neutron/agent/l3/agent.py, line 891, it overrides the local variable 'interface_name' resulting in the wrong device to be unplugged later at line 911.

Assaf Muller (amuller) on 2015-02-16
Changed in neutron:
assignee: nobody → Assaf Muller (amuller)
Changed in neutron:
status: Confirmed → In Progress
Kyle Mestery (mestery) on 2015-02-17
Changed in neutron:
milestone: none → kilo-3

Reviewed: https://review.openstack.org/155813
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=a172e200685567a71844f65ae09d79635073a2f8
Submitter: Jenkins
Branch: master

commit a172e200685567a71844f65ae09d79635073a2f8
Author: Assaf Muller <email address hidden>
Date: Fri Feb 13 11:35:02 2015 -0500

    Delete qg device during DVR-SNAT router deletion

    In the DVR SNAT case, the 'qg' device was not deleted because
    of patch:
    https://review.openstack.org/#/c/151882/

    During functional testing, the device is deleted
    during the external bridge deletion. Because that happens after
    the SNAT namespace is already deleted, it can cause a kernel
    panic or ovs-vswitchd crash for certain OVS versions.

    Also added assertions that all router interfaces were properly
    cleaned up during functional testing, and enabled the unit tests
    to catch this type of error.

    Change-Id: Ica86a030860aa967d5b685b5bfb5546a85b2a809
    Closes-Bug: #1418097

Changed in neutron:
status: In Progress → Fix Committed
Assaf Muller (amuller) wrote :

Some good news, the issue has been fixed on the OVS side of things.

Here's the original bug report:
http://openvswitch.org/pipermail/discuss/2015-February/016561.html

And the fix to the kernel:
http://marc.info/?l=linux-netdev&m=142446436200628&w=2

Thierry Carrez (ttx) on 2015-03-19
Changed in neutron:
status: Fix Committed → Fix Released
Thierry Carrez (ttx) on 2015-04-30
Changed in neutron:
milestone: kilo-3 → 2015.1.0
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers