neutron-openvswitch-agent is crashing due to KeyError in _restore_local_vlan_map()

Bug #1625305 reported by Esha Seth
This bug affects 7 people
Affects: neutron
Status: Incomplete
Importance: High
Assigned to: sarvani konda
Milestone: (none)

Bug Description

The Neutron openvswitch agent is unable to restart because VMs on untagged/flat networks (tagged 3999) cause an issue in _restore_local_vlan_map.

Loaded agent extensions: []
2016-09-06 07:57:39.682 70085 CRITICAL neutron [req-ef8eea4f-c1ed-47a0-8318-eb5473b7c667 - - - - -] KeyError: 3999
2016-09-06 07:57:39.682 70085 ERROR neutron Traceback (most recent call last):
2016-09-06 07:57:39.682 70085 ERROR neutron File "/usr/bin/neutron-openvswitch-agent", line 28, in <module>
2016-09-06 07:57:39.682 70085 ERROR neutron sys.exit(main())
2016-09-06 07:57:39.682 70085 ERROR neutron File "/usr/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py", line 235, in __init__
2016-09-06 07:57:39.682 70085 ERROR neutron self._restore_local_vlan_map()
2016-09-06 07:57:39.682 70085 ERROR neutron File "/usr/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py", line 356, in _restore_local_vlan_map
2016-09-06 07:57:39.682 70085 ERROR neutron self.available_local_vlans.remove(local_vlan)
2016-09-06 07:57:39.682 70085 ERROR neutron KeyError: 3999
2016-09-06 07:57:39.682 70085 ERROR neutron
2016-09-06 07:57:39.684 70085 INFO oslo_rootwrap.client [req-ef8eea4f-c1ed-47a0-8318-eb5473b7c667 - - - - -] Stopping rootwrap daemon process with pid=70197
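
The failing call is a plain Python set removal: a KeyError from .remove() means the same local VLAN id was handed back twice. A minimal standalone sketch of that behaviour (illustrative only, not the agent code itself):

available_local_vlans = set(range(1, 4095))  # illustrative VLAN range
available_local_vlans.remove(3999)  # first port restores local VLAN 3999: OK
available_local_vlans.remove(3999)  # same VLAN handed back again: KeyError: 3999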

Esha Seth (eshaseth)
summary: neutron-openvswitch-agent is crashing due to KeyError in
- ._restore_local_vlan_map()
+ _restore_local_vlan_map()
description: updated
Revision history for this message
Brian Haley (brian-haley) wrote :

Looks like a duplicate of https://bugs.launchpad.net/neutron/+bug/1526974 - please see that bug for reference and the fixes sent to all the releases.

Revision history for this message
Jakub Libosvar (libosvar) wrote :

Could it be a regression caused by https://review.openstack.org/#/c/349990/ ?

Revision history for this message
Esha Seth (eshaseth) wrote :

I am still seeing the issue in neutron 9.0.0.0 (newton) and also in stable mitaka. So either 1526974 should be reopened or the fix pursued here.

Revision history for this message
Brian Haley (brian-haley) wrote :

I removed the duplicate since it could be a new issue after the change Ihar mentioned.

Revision history for this message
Armando Migliaccio (armando-migliaccio) wrote :

@Esha: can you provide more detailed steps to repro?

Changed in neutron:
importance: Undecided → High
tags: added: newton-rc-potential
Revision history for this message
Armando Migliaccio (armando-migliaccio) wrote :

I can't see how it could be the culprit though.

_restore_local_vlan_map(), which is called during init, was not even touched by [1].

[1] https://review.openstack.org/#/c/349990/

Revision history for this message
Esha Seth (eshaseth) wrote :

These were the 2 vms I had running on the host.

[root@ip9-114-248-165 ibm]# virsh list --all
 Id Name State
----------------------------------------------------
 5 RHEL73-Jag-1a1cb620-0000000a running
 8 rhel_jag1-0f67d94d-00000002 running

 [root@ip9-114-248-165 ~]# ovs-vsctl show
6fd4b576-698d-41bc-b226-241145207524
    Bridge br-int
        fail_mode: secure
        Port "tap85564a42-ae"
            tag: 3999
            Interface "tap85564a42-ae"
        Port "tap562a7922-11"
            tag: 3999
            Interface "tap562a7922-11"
        Port br-int
            Interface br-int
                type: internal
        Port "int-default0"
            Interface "int-default0"
                type: patch
                options: {peer="phy-default0"}
    Bridge "default0"
        fail_mode: secure
        Port "phy-default0"
            Interface "phy-default0"
                type: patch
                options: {peer="int-default0"}
        Port "enP3p9s0f0"
            Interface "enP3p9s0f0"
        Port "default0"
            Interface "default0"
                type: internal
    ovs_version: "2.3.1-git3282e51"

These 2 were based on a flat network.

Then the KVM host was to be used again to add more networks.

The Neutron openvswitch agent is unable to restart because VMs on untagged/flat networks (tagged 3999) cause an issue in _restore_local_vlan_map.

Loaded agent extensions: []
2016-09-06 07:57:39.682 70085 CRITICAL neutron [req-ef8eea4f-c1ed-47a0-8318-eb5473b7c667 - - - - -] KeyError: 3999
2016-09-06 07:57:39.682 70085 ERROR neutron Traceback (most recent call last):
2016-09-06 07:57:39.682 70085 ERROR neutron File "/usr/bin/neutron-openvswitch-agent", line 28, in <module>
2016-09-06 07:57:39.682 70085 ERROR neutron sys.exit(main())
2016-09-06 07:57:39.682 70085 ERROR neutron File "/usr/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py", line 235, in __init__
2016-09-06 07:57:39.682 70085 ERROR neutron self._restore_local_vlan_map()
2016-09-06 07:57:39.682 70085 ERROR neutron File "/usr/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py", line 356, in _restore_local_vlan_map
2016-09-06 07:57:39.682 70085 ERROR neutron self.available_local_vlans.remove(local_vlan)
2016-09-06 07:57:39.682 70085 ERROR neutron KeyError: 3999
2016-09-06 07:57:39.682 70085 ERROR neutron
2016-09-06 07:57:39.684 70085 INFO oslo_rootwrap.client [req-ef8eea4f-c1ed-47a0-8318-eb5473b7c667 - - - - -] Stopping rootwrap daemon process with pid=70197

This is happening with both stable mitaka and newton neutron.

Revision history for this message
Esha Seth (eshaseth) wrote :

I am observing the above issue in both stable mitaka and newton neutron.

Revision history for this message
Kevin Benton (kevinbenton) wrote :

Where is tag 3999 coming from? Was that assigned by the agent?

Revision history for this message
Kevin Benton (kevinbenton) wrote :

Are those ports actually on the same network? I can see in the restore logic that this KeyError could occur if the net_uuid in OVS for each port was different even though they have the same local VLAN tag. This could be the result of another bug that resulted in two different networks getting the same local VLAN.
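
A simplified sketch of the restore logic described above, with illustrative net_uuid values and variable names (not the actual agent code or data from this environment), showing why two different networks carrying the same local VLAN tag make the second remove() raise KeyError:

ports_from_ovs = [
    {"net_uuid": "net-uuid-A", "tag": 3999},  # first port
    {"net_uuid": "net-uuid-B", "tag": 3999},  # second port, different network, same tag
]

available_local_vlans = set(range(1, 4095))  # illustrative VLAN range
local_vlan_map = {}  # net_uuid -> local VLAN claimed for that network

for port in ports_from_ovs:
    net_uuid, local_vlan = port["net_uuid"], port["tag"]
    if net_uuid not in local_vlan_map:
        # Each distinct network claims its local VLAN once. With two different
        # net_uuids carrying the same tag, the second call raises KeyError.
        available_local_vlans.remove(local_vlan)
        local_vlan_map[net_uuid] = local_vlan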

Revision history for this message
Esha Seth (eshaseth) wrote :

The tag is coming from the agent (3999 is assigned for untagged/flat networks).
The ports are for 2 different VMs deployed on the same network. The network is a flat network.

Revision history for this message
Kevin Benton (kevinbenton) wrote :

Can you please provide the output of "sudo ovs-vsctl list Port" so I can see what each port shows for its net-uuid?

Revision history for this message
Esha Seth (eshaseth) wrote :

_uuid : 53116c80-a7ad-46db-8bdd-4f2257b95a55
bond_active_slave : []
bond_downdelay : 0
bond_fake_iface : false
bond_mode : []
bond_updelay : 0
external_ids : {}
fake_bridge : false
interfaces : [b2d53a32-5e39-44b3-a52a-f4a526c0f81e]
lacp : []
mac : []
name : "enP3p9s0f0"
other_config : {}
qos : []
statistics : {}
status : {}
tag : []
trunks : []
vlan_mode : []

_uuid : 290eaaec-e884-4d5b-ac41-8b882559b523
bond_active_slave : []
bond_downdelay : 0
bond_fake_iface : false
bond_mode : []
bond_updelay : 0
external_ids : {}
fake_bridge : false
interfaces : [2efa4ac6-c4f8-4e52-a63a-8c14b8dd9d14]
lacp : []
mac : []
name : "tap85564a42-ae"
other_config : {net_uuid="2a7ffdd1-f456-4bde-b339-3ff4c6c1d71f", network_ type=flat, physical_network="default0", tag="3999"}
qos : []
statistics : {}
status : {}
tag : 3999
trunks : []
vlan_mode : []

_uuid : 32b69084-ed4a-4333-a36d-6598fd598afd
bond_active_slave : []
bond_downdelay : 0
bond_fake_iface : false
bond_mode : []
bond_updelay : 0
external_ids : {}
fake_bridge : false
interfaces : [c527424e-c5c0-4794-83f0-1fff183a92c9]
lacp : []
mac : []
name : "tap562a7922-11"
other_config : {net_uuid="cc6f4783-a495-40cc-a6e9-5548b667f989", network_ type=flat, physical_network="default0", tag="1"}
qos : []
statistics : {}
status : {}
tag : 3999
trunks : []
vlan_mode : []

_uuid : 2afcf0e1-5fce-43cb-8ae0-b5e990ede0e4
bond_active_slave : []
bond_downdelay : 0
bond_fake_iface : false
bond_mode : []
bond_updelay : 0
external_ids : {}
fake_bridge : false
interfaces : [e626a593-788c-43d3-8478-1309c6632998]
lacp : []
mac : []
name : "phy-default0"
other_config : {}
qos : []
statistics : {}
status : {}
tag : []
trunks : []
vlan_mode : []

_uuid : 75730552-20fc-4f43-81bf-2040a84f6489
bond_active_slave : []
bond_downdelay : 0
bond_fake_iface : false
bond_mode : []
bond_updelay : 0
external_ids : {}
fake_bridge : false
interfaces : [866c1021-aefa-4b43-9a0b-228774d53fc3]
lacp : []
mac : []
name : "default0"
other_config : {}
qos : []
statistics : {}
status : {}
tag : []
trunks ...


Revision history for this message
Kevin Benton (kevinbenton) wrote :

Based on that output:

tap85564a42-ae is on network 2a7ffdd1-f456-4bde-b339-3ff4c6c1d71f
tap562a7922-11 is on network cc6f4783-a495-40cc-a6e9-5548b667f989

Did you update the mysql database or something to change which network either was associated with?

Revision history for this message
Esha Seth (eshaseth) wrote :

No, I did not.
The tag is coming from the agent (3999 is assigned for untagged/flat networks).
The ports are for 2 different VMs deployed on 2 different flat networks at different times.

tags: added: ovs
Revision history for this message
Esha Seth (eshaseth) wrote :

Kevin Benton, I agree we are hitting the same bug you mentioned in comment 10. My comment 11 incorrectly mentioned that the ports were on the same network. Comments 13 and 15 corroborate that the ports on the 2 VMs were from 2 different flat networks but had the same VLAN tag (3999). That is why we could be hitting the KeyError during agent startup.

Revision history for this message
Kevin Benton (kevinbenton) wrote :

@Esha,

Do you have a way to reproduce this? Unfortunately the issue occurred at some other point that corrupted the state (by giving the same vlan to two different networks). Were you running two copies of the L2 agent?

Revision history for this message
Kevin Benton (kevinbenton) wrote :

You can of course fix this one-off instance by removing the local VLAN tag on tap562a7922-11 (the screwed-up one).

sudo ovs-vsctl remove port tap562a7922-11 tag 3999

Revision history for this message
Esha Seth (eshaseth) wrote :

One network was created using one instance of the neutron OVS agent and the other was created using another instance at a later stage. Yes, that is correct. Since both networks were flat networks, the tags are the same. This is the issue.

Revision history for this message
Kevin Benton (kevinbenton) wrote :

What were the versions of the Neutron agents (earlier instance and later)?

Revision history for this message
Esha Seth (eshaseth) wrote :

The neutron agent versions were different. One used stable mitaka and the other used newton.

Revision history for this message
Kevin Benton (kevinbenton) wrote :

I wasn't able to reproduce this. Here are the steps I used:

1. start stable/mitaka OVS agent
2. boot a VM to flat1 (which is a network of type flat with physical_network = 'flat1')
3. stop the agent
4. start the stable/newton OVS agent
5. boot another VM to flat2 (which is a network of type flat with physical_network = 'flat2')

They both end up with different tags. I can reboot the agent as either stable/newton or stable/mitaka and it starts up fine with both VMs.

What was the order of events for you?

Revision history for this message
Kevin Benton (kevinbenton) wrote :

Also, I see that the prefix of the ports is 'tap' for the VMs instead of 'qvo' which means you aren't using the iptables firewall. Which firewall driver are you using?

Revision history for this message
Esha Seth (eshaseth) wrote :

I tried one more scenario in which I used one (flat) network and created 2 ports off it (2 VMs) using the mitaka OVS agent. The ports created had the same tag '2' and the same net_uuid. Then I used the newton agent to create another one and it worked fine. So I am not seeing the issue now with the same tag.

Changed in neutron:
status: New → Invalid
Revision history for this message
George Shuklin (george-shuklin) wrote :

We've got the same issue after upgrading from liberty. It was really painful, and we were forced to manually patch the agent on the hosts.

This is a real issue, please fix it.

Changed in neutron:
status: Invalid → New
sarvani konda (sarvani)
Changed in neutron:
assignee: nobody → sarvani konda (sarvani)
Revision history for this message
George Shuklin (george-shuklin) wrote :

JFYI: it happened on neutron 8.3, the latest mitaka release.

Revision history for this message
sarvani konda (sarvani) wrote :

Could you tell me the steps to upgrade the OVS agent from mitaka to newton so that I can reproduce the issue?

Revision history for this message
George Shuklin (george-shuklin) wrote :

Hello.

It was Juno, which was upgraded to Kilo, which was upgraded to Liberty, which was upgraded to Mitaka.

During this process the openvswitch-switch service was never restarted.

Revision history for this message
sarvani konda (sarvani) wrote :

I have tried upgrading OpenStack from mitaka to newton by following these links:

http://docs.openstack.org/ops-guide/ops-upgrades.html
https://wiki.ubuntu.com/OpenStack/CloudArchive
http://docs.openstack.org/mitaka/install-guide-ubuntu/environment-packages.html

but I could not upgrade it successfully. Could someone provide detailed steps to upgrade OpenStack (mitaka to newton) on Ubuntu?

Revision history for this message
George Shuklin (george-shuklin) wrote :

I don't know if I can help you with devstack upgrades. We performed them one by one (J->K->L->M) in a production environment under Chef, and on each upgrade we applied a different version of the configuration files (there is no 'compatible config' for all versions J through M).

Revision history for this message
sarvani konda (sarvani) wrote :

I tried a scenario where I created two different flat networks and launched VMs associated with those networks. I observed that the VMs get different VLAN tags.
I placed PDB inside the code to check how the VLAN tags are assigned: for instance1 the associated VLAN tag is '5' and for instance2 it is '6', which is the expected behavior.

I restarted the OVS agent and checked; it still gets different VLAN tags. Also, I didn't find any snippet in the OVS neutron agent file where two different networks get associated with the same VLAN tag.
We tried upgrading from mitaka to newton but didn't find any proper upgrade steps to follow, so the recreation we tried is based only on code understanding and manual verification of the scenario with the agent restarted, without validating the upgrade process.

Revision history for this message
Inessa Vasilevskaya (ivasilevskaya) wrote :

I was able to reproduce the issue on master devstack (d7e7dd451aca77a19f5c101f8367c18cdbcac904, multinode environment with DVR, 1 all-in-1 controller (dvr_snat), 2 computes (dvr)).

Steps to reproduce:
1. create a network/subnet, boot a vm.
2. try to restart openvswitch-agent

Upon adding some printf logging, it seems that the sg- port, the qr- port of the tenant network, and the VM port all have the local_vlan tag set to the same value, so the second attempt to remove it fails with KeyError.

Here are the logs - http://paste.openstack.org/show/616060/
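
One possible way to make the restore path tolerant of such a state (a sketch only, assuming the duplicate tags themselves stem from an earlier bug; this is not the upstream fix, and claim_local_vlan is a hypothetical helper, not the agent API):

import logging

LOG = logging.getLogger(__name__)

def claim_local_vlan(available_local_vlans, local_vlan_map, net_uuid, local_vlan):
    # Hypothetical helper: claim local_vlan for net_uuid without aborting
    # startup when several ports (e.g. the sg-/qr-/VM ports above) report
    # the same tag.
    if local_vlan in available_local_vlans:
        available_local_vlans.remove(local_vlan)
    else:
        LOG.warning("Local VLAN %s already claimed; skipping restore for "
                    "network %s", local_vlan, net_uuid)
    local_vlan_map.setdefault(net_uuid, local_vlan)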

Revision history for this message
yangjianfeng (yangjianfeng) wrote :

The bug is also reproduced in our environment.

2018-12-18 18:43:33.807 101283 DEBUG ovsdbapp.backend.ovs_idl.transaction [-] Transaction caused no change do_commit /usr/lib/python2.7/site-packages/ovsdbapp/backend/ovs_idl/transaction.py:121
2018-12-18 18:43:33.807 101283 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.ovs_ryuapp [req-f4683b4e-3437-4d13-b0ea-63b7a065ef5d - - - - -] Agent main thread died of an exception: KeyError: 1
2018-12-18 18:43:33.807 101283 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.ovs_ryuapp Traceback (most recent call last):
2018-12-18 18:43:33.807 101283 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.ovs_ryuapp File "/usr/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/openvswitch/agent/openflow/native/ovs_ryuapp.py", line 40, in agent_main_wrapper
2018-12-18 18:43:33.807 101283 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.ovs_ryuapp ovs_agent.main(bridge_classes)
2018-12-18 18:43:33.807 101283 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.ovs_ryuapp File "/usr/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py", line 2272, in main
2018-12-18 18:43:33.807 101283 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.ovs_ryuapp agent = OVSNeutronAgent(bridge_classes, ext_mgr, cfg.CONF)
2018-12-18 18:43:33.807 101283 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.ovs_ryuapp File "/usr/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py", line 241, in __init__
2018-12-18 18:43:33.807 101283 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.ovs_ryuapp self._restore_local_vlan_map()
2018-12-18 18:43:33.807 101283 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.ovs_ryuapp File "/usr/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py", line 358, in _restore_local_vlan_map
2018-12-18 18:43:33.807 101283 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.ovs_ryuapp self.available_local_vlans.remove(local_vlan)
2018-12-18 18:43:33.807 101283 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.ovs_ryuapp KeyError: 1
2018-12-18 18:43:33.807 101283 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.ovs_ryuapp
2018-12-18 18:43:33.810 101283 CRITICAL neutron [-] Unhandled error: KeyError: 1
2018-12-18 18:43:33.810 101283 ERROR neutron Traceback (most recent call last):
2018-12-18 18:43:33.810 101283 ERROR neutron File "/usr/bin/neutron-openvswitch-agent", line 10, in <module>
2018-12-18 18:43:33.810 101283 ERROR neutron sys.exit(main())
2018-12-18 18:43:33.810 101283 ERROR neutron File "/usr/lib/python2.7/site-packages/neutron/cmd/eventlet/plugins/ovs_neutron_agent.py", line 20, in main
2018-12-18 18:43:33.810 101283 ERROR neutron agent_main.main()
2018-12-18 18:43:33.810 101283 ERROR neutron File "/usr/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/openvswitch/agent/main.py", line 47, in main
2018-12-18 18:43:33.810 10...


Revision history for this message
Slawek Kaplonski (slaweq) wrote :

@yangjianfeng: can you tell us on what version this happens for you?

Revision history for this message
yangjianfeng (yangjianfeng) wrote :

@Slawek Kaplonski: queens

Revision history for this message
weisongf (songwei-8) wrote :

We also ran into this issue on the Rocky release (neutron 13.0.6).

Revision history for this message
weisongf (songwei-8) wrote (last edit):

/var/lib/openstack/lib/python3.6/site-packages/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py", line 386, in _restore_local_vlan_map
    self.available_local_vlans.remove(local_vlan)
KeyError: 107
2021-12-10 13:37:54.155 1 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.ovs_ryuapp [req-1b44fda8-7f39-4d4f-9b55-b3b3-20ca6f39] Agent main thread died of an exception: KeyError: 107

Revision history for this message
Brian Haley (brian-haley) wrote :

Rocky is EOL, can you reproduce this on a recent release? Something like Xena?

Changed in neutron:
status: New → Incomplete