KeyError prevents openvswitch agent from starting

Bug #1526974 reported by Edgar Cantu
This bug affects 4 people
Affects: neutron
Status: Fix Released
Importance: High
Assigned to: Brian Haley

Bug Description

On Liberty I ran into a situation where the openvswitch agent won't start and fails with the following stack trace:

2015-12-16 16:01:42.852 10772 CRITICAL neutron [req-afb4e123-1940-48df-befc-9319516152b5 - - - - -] KeyError: 8
2015-12-16 16:01:42.852 10772 ERROR neutron Traceback (most recent call last):
2015-12-16 16:01:42.852 10772 ERROR neutron File "/opt/neutron/bin/neutron-openvswitch-agent", line 11, in <module>
2015-12-16 16:01:42.852 10772 ERROR neutron sys.exit(main())
2015-12-16 16:01:42.852 10772 ERROR neutron File "/opt/neutron/lib/python2.7/site-packages/neutron/cmd/eventlet/plugins/ovs_neutron_agent.py", line 20, in main
2015-12-16 16:01:42.852 10772 ERROR neutron agent_main.main()
2015-12-16 16:01:42.852 10772 ERROR neutron File "/opt/neutron/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/openvswitch/agent/main.py", line 49, in main
2015-12-16 16:01:42.852 10772 ERROR neutron mod.main()
2015-12-16 16:01:42.852 10772 ERROR neutron File "/opt/neutron/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/openvswitch/agent/openflow/ovs_ofctl/main.py", line 36, in main
2015-12-16 16:01:42.852 10772 ERROR neutron ovs_neutron_agent.main(bridge_classes)
2015-12-16 16:01:42.852 10772 ERROR neutron File "/opt/neutron/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py", line 1913, in main
2015-12-16 16:01:42.852 10772 ERROR neutron agent = OVSNeutronAgent(bridge_classes, **agent_config)
2015-12-16 16:01:42.852 10772 ERROR neutron File "/opt/neutron/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py", line 302, in __init__
2015-12-16 16:01:42.852 10772 ERROR neutron self._restore_local_vlan_map()
2015-12-16 16:01:42.852 10772 ERROR neutron File "/opt/neutron/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py", line 358, in _restore_local_vlan_map
2015-12-16 16:01:42.852 10772 ERROR neutron self.available_local_vlans.remove(local_vlan)
2015-12-16 16:01:42.852 10772 ERROR neutron KeyError: 8
2015-12-16 16:01:42.852 10772 ERROR neutron

Somehow the OVS Port table ended up with 2 ports that have the same local VLAN tag, even though they belong to different networks:

# ovs-vsctl -- --columns=name,tag,other_config list Port | grep -E 'qvob7ba561c-e5|qvod3e1f984-0e' -A 2

name : "qvob7ba561c-e5"
tag : 8
other_config : {net_uuid="fb33e234-714d-44f8-8728-1a466ef5aca0", network_type=vxlan, physical_network=None, segmentation_id="5969"}
--
name : "qvod3e1f984-0e"
tag : 8
other_config : {net_uuid="47e0f11c-7aa4-4eb4-97dc-0ef4e064680c", network_type=vxlan, physical_network=None, segmentation_id="5836"}
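The conflict above can also be detected programmatically. A minimal sketch, assuming the two Port records have already been parsed into dictionaries; the `ports` list and the `find_conflicting_tags` helper are hypothetical names used for illustration, not part of neutron or ovs-vsctl:

```python
from collections import defaultdict

# Records copied from the ovs-vsctl output above, reduced to the relevant fields.
ports = [
    {"name": "qvob7ba561c-e5", "tag": 8,
     "net_uuid": "fb33e234-714d-44f8-8728-1a466ef5aca0"},
    {"name": "qvod3e1f984-0e", "tag": 8,
     "net_uuid": "47e0f11c-7aa4-4eb4-97dc-0ef4e064680c"},
]


def find_conflicting_tags(ports):
    """Return {tag: sorted net_uuids} for tags used by more than one network."""
    nets_by_tag = defaultdict(set)
    for port in ports:
        nets_by_tag[port["tag"]].add(port["net_uuid"])
    # A tag shared by ports of the same network is fine; only flag
    # tags that span different networks.
    return {tag: sorted(nets) for tag, nets in nets_by_tag.items()
            if len(nets) > 1}


conflicts = find_conflicting_tags(ports)
print(conflicts)
```

Grouping by tag rather than by port keeps the check cheap and makes it clear that two ports on the same network sharing a tag is the normal case.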

Additionally, I noticed the ofport for one of them was -1.

# ovs-vsctl -- --columns=name,ofport,external_ids list Interface | grep -E 'qvob7ba561c-e5|qvod3e1f984-0e' -A 2

name : "qvod3e1f984-0e"
ofport : 20
external_ids : {attached-mac="fa:16:3e:d7:eb:05", iface-id="d3e1f984-0e4f-4d39-a074-1c0809ad864c", iface-status=active, vm-uuid="a00032c8-f516-42e3-865e-1988768bab84"}
--
name : "qvob7ba561c-e5"
ofport : -1
external_ids : {attached-mac="fa:16:3e:a9:c3:69", iface-id="b7ba561c-e5a2-4128-b36c-9484a763f4de", iface-status=active, vm-uuid="71873533-a4ab-4af6-8ace-e75c60b828f9"}

I'm not sure if this is relevant, but the VM whose port has ofport -1 is in the following state:

+--------------------------------------+-----------------+----------------------------------+-----------+------------+-------------+----------+
| ID                                   | Name            | Tenant ID                        | Status    | Task State | Power State | Networks |
+--------------------------------------+-----------------+----------------------------------+-----------+------------+-------------+----------+
| 71873533-a4ab-4af6-8ace-e75c60b828f9 | test-instance-1 | 99e641ee27434c36b4f83fbee0599e67 | SHUTOFF   | -          | Shutdown    |          |
+--------------------------------------+-----------------+----------------------------------+-----------+------------+-------------+----------+

------------------------------------------------------------------------------------------------------------------------------------
Neutron Version: 69d531565dcd180f6f1141bad143b3ea4dcd7ade

Operating System: CentOS Linux 7 (Core)
Kernel: Linux 3.10.0-229.11.1.el7.x86_64
Architecture: x86_64

ovs-vsctl (Open vSwitch) 2.3.1
Compiled Dec 26 2014 15:35:14
DB Schema 7.6.2

Revision history for this message
Carl Baldwin (carl-baldwin) wrote :

Over to Kevin to triage

tags: added: ovs
Changed in neutron:
assignee: nobody → Kevin Benton (kevinbenton)
Revision history for this message
Brian Haley (brian-haley) wrote :

We also saw this issue recently, and found duplicate entries in the ovs table:

name : "tap4ef9b61a-4f"
tag : 35
other_config : {net_uuid="e1bac7c4-446f-4b1b-a24a-1a65307c6402", network_type=vxlan, physical_network=None, segmentation_id="1116", tag="35"}

name : "sg-3e4a8a9e-3b"
tag : 35
other_config : {net_uuid="e1bac7c4-446f-4b1b-a24a-1a65307c6402", network_type=vxlan, physical_network=None, segmentation_id="1116", tag="35"}

Not sure if putting a try/except around the .remove() is the right thing as that might just mask the problem.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.openstack.org/325370

Changed in neutron:
assignee: Kevin Benton (kevinbenton) → Brian Haley (brian-haley)
status: New → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.openstack.org/325370
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=d2508163cfcede20df641239be081fe63c79150b
Submitter: Jenkins
Branch: master

commit d2508163cfcede20df641239be081fe63c79150b
Author: Brian Haley <email address hidden>
Date: Fri Jun 3 11:34:35 2016 -0400

    OVS: don't throw KeyError when duplicate VLAN tags exist

    In _restore_local_vlan_map() we can have two ports with the
    same VLAN tag, but trying to remove the second will throw
    a KeyError, causing the agent to not start. Use discard()
    instead so we only remove an entry if it's there.

    Closes-bug: #1526974
    Change-Id: I479c693f490c704c5b6c1462e9ab236684e9c259
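The difference between remove() and discard() described in this commit can be reproduced with a plain Python set. This is a minimal sketch, not the agent's code: `available_local_vlans` here is a hypothetical stand-in for the agent's pool of free local VLAN IDs, and `restored_tags` models two ports that both carry tag 8.

```python
# Hypothetical stand-in for the agent's pool of free local VLAN IDs.
available_local_vlans = set(range(1, 4095))

# Tags as they would be read back from two ports that share tag 8.
restored_tags = [8, 8]


def restore_with_remove(pool, tags):
    for tag in tags:
        pool.remove(tag)   # raises KeyError if the tag was already taken


def restore_with_discard(pool, tags):
    for tag in tags:
        pool.discard(tag)  # silently does nothing if the tag is already gone


try:
    restore_with_remove(set(available_local_vlans), restored_tags)
    remove_failed = False
    failed_key = None
except KeyError as exc:
    remove_failed = True
    failed_key = exc.args[0]

pool = set(available_local_vlans)
restore_with_discard(pool, restored_tags)

print(remove_failed, failed_key, 8 in pool)
```

Note that discard() only avoids the exception. Both ports still carry the same tag afterwards, so the underlying conflict is not resolved by this change alone.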

Changed in neutron:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/mitaka)

Fix proposed to branch: stable/mitaka
Review: https://review.openstack.org/327973

Assaf Muller (amuller)
tags: added: mitaka-backport-potential
Revision history for this message
YAMAMOTO Takashi (yamamoto) wrote :

Brian,

In your case (comment #2) the ports have the same net_uuid.
Didn't the "net_uuid not in self._local_vlan_hints" check work?

Revision history for this message
Brian Haley (brian-haley) wrote :

Yamamoto-san,

Yes, my paste does show the UUIDs are the same. I might have just cut and pasted the wrong entry from the log file, but we definitely saw a KeyError triggered in that code, like the original reporter did.

Without much more to go on I went for the simple fix to keep the agent running.

Revision history for this message
YAMAMOTO Takashi (yamamoto) wrote :

Brian,

OK, thank you for the clarification.

I doubt it's a good thing to keep the agent running with conflicting VLAN tags, i.e. broken isolation.

Changed in neutron:
importance: Undecided → High
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.openstack.org/333642

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (stable/mitaka)

Change abandoned by Brian Haley (<email address hidden>) on branch: stable/mitaka
Review: https://review.openstack.org/327973

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.openstack.org/333642
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=db817fd5434fb87afe1f8e8e05b282d68ff9dc31
Submitter: Jenkins
Branch: master

commit db817fd5434fb87afe1f8e8e05b282d68ff9dc31
Author: Kevin Benton <email address hidden>
Date: Wed Jun 22 14:57:39 2016 -0700

    Skip INVALID and UNASSIGNED ofport in vlan restore

    get_vif_ports returns ports with INVALID and UNASSIGNED
    ofports and get_vif_port_set does not. The main scan_ports
    loop uses the latter so any INVALID ofports (i.e. ofport == -1)
    will be treated as removed and have their local VLANs reclaimed.
    So an INVALID ofport could have the same local VLAN as a new
    port that was added after it had been reclaimed.

    This was causing an error in the _restore_local_vlan_map function
    since it was using get_vif_ports which would cause it to process
    INVALID ports as well so it could get two network UUIDs using the
    same VLAN.

    This fixes it by skipping INVALID and UNASSIGNED ofports in the
    vlan restoration so it matches the behavior of scan_ports
    (which is responsible for deciding which ports are added/removed
    for VLAN allocation).

    Closes-Bug: #1526974
    Change-Id: I9d722fa4fabd467ded44d9cd291a3fa4d1af90f6
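The effect of this change can be illustrated with a toy model of the restore step, using the two ports from the bug description. This is a sketch under stated assumptions, not the neutron implementation: `restore_local_vlan_map` and its `skip_unusable_ofports` flag are hypothetical, the value -1 for INVALID comes from the commit message, and representing UNASSIGNED as an empty list is an assumption made for this example.

```python
INVALID_OFPORT = -1      # value quoted in the commit message
UNASSIGNED_OFPORT = []   # assumption: how an unset ofport is represented here

# Ports from the bug description: name, ofport, local VLAN tag, network UUID.
ports = [
    {"name": "qvod3e1f984-0e", "ofport": 20, "tag": 8,
     "net_uuid": "47e0f11c-7aa4-4eb4-97dc-0ef4e064680c"},
    {"name": "qvob7ba561c-e5", "ofport": -1, "tag": 8,
     "net_uuid": "fb33e234-714d-44f8-8728-1a466ef5aca0"},
]


def restore_local_vlan_map(ports, skip_unusable_ofports):
    """Toy model: map each network UUID to the local VLAN found on its ports."""
    available = set(range(1, 4095))  # hypothetical pool of free local VLANs
    hints = {}
    for port in ports:
        unusable = port["ofport"] in (INVALID_OFPORT, UNASSIGNED_OFPORT)
        if skip_unusable_ofports and unusable:
            continue  # treat the port as removed, as the main scan loop does
        if port["net_uuid"] not in hints:
            # KeyError here means another network already claimed this tag.
            available.remove(port["tag"])
            hints[port["net_uuid"]] = port["tag"]
    return hints


try:
    restore_local_vlan_map(ports, skip_unusable_ofports=False)
    old_behaviour_raised = False
except KeyError:
    old_behaviour_raised = True

fixed_hints = restore_local_vlan_map(ports, skip_unusable_ofports=True)
print(old_behaviour_raised, fixed_hints)
```

With the filter enabled, the stale port with ofport -1 never reaches the VLAN bookkeeping, so only the live port's network keeps tag 8 and remove() can stay strict about genuine conflicts.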

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/mitaka)

Fix proposed to branch: stable/mitaka
Review: https://review.openstack.org/337806

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/liberty)

Fix proposed to branch: stable/liberty
Review: https://review.openstack.org/337807

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/mitaka)

Reviewed: https://review.openstack.org/337806
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=b49841e126a7fbed6f6750073d831451ba374ffa
Submitter: Jenkins
Branch: stable/mitaka

commit b49841e126a7fbed6f6750073d831451ba374ffa
Author: Kevin Benton <email address hidden>
Date: Wed Jun 22 14:57:39 2016 -0700

    Skip INVALID and UNASSIGNED ofport in vlan restore

    Closes-Bug: #1526974
    Change-Id: I9d722fa4fabd467ded44d9cd291a3fa4d1af90f6
    (cherry picked from commit db817fd5434fb87afe1f8e8e05b282d68ff9dc31)

tags: added: in-stable-mitaka
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/liberty)

Reviewed: https://review.openstack.org/337807
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=17fd1eaeb182eff94266266188ae9a45be2e23a0
Submitter: Jenkins
Branch: stable/liberty

commit 17fd1eaeb182eff94266266188ae9a45be2e23a0
Author: Kevin Benton <email address hidden>
Date: Wed Jun 22 14:57:39 2016 -0700

    Skip INVALID and UNASSIGNED ofport in vlan restore

    Closes-Bug: #1526974
    Change-Id: I9d722fa4fabd467ded44d9cd291a3fa4d1af90f6
    (cherry picked from commit db817fd5434fb87afe1f8e8e05b282d68ff9dc31)

tags: added: in-stable-liberty
Revision history for this message
Doug Hellmann (doug-hellmann) wrote : Fix included in openstack/neutron 9.0.0.0b2

This issue was fixed in the openstack/neutron 9.0.0.0b2 development milestone.

tags: added: neutron-proactive-backport-potential
tags: removed: neutron-proactive-backport-potential
tags: removed: mitaka-backport-potential
Revision history for this message
Doug Hellmann (doug-hellmann) wrote : Fix included in openstack/neutron 7.1.2

This issue was fixed in the openstack/neutron 7.1.2 release.

Revision history for this message
Doug Hellmann (doug-hellmann) wrote : Fix included in openstack/neutron 8.2.0

This issue was fixed in the openstack/neutron 8.2.0 release.

Revision history for this message
Esha Seth (eshaseth) wrote :

I am still seeing the issue and opened https://bugs.launchpad.net/neutron/+bug/1625305 for the same.
