neutron-openvswitch-agent is crashing due to KeyError in _restore_local_vlan_map()

Bug #1625305 reported by Esha Seth on 2016-09-19
This bug affects 5 people
Affects: neutron | Importance: High | Assigned to: sarvani konda

Bug Description

Neutron openvswitch agent is unable to restart because VMs on untagged/flat networks (tagged 3999) cause an issue in _restore_local_vlan_map()

Loaded agent extensions: []
2016-09-06 07:57:39.682 70085 CRITICAL neutron [req-ef8eea4f-c1ed-47a0-8318-eb5473b7c667 - - - - -] KeyError: 3999
2016-09-06 07:57:39.682 70085 ERROR neutron Traceback (most recent call last):
2016-09-06 07:57:39.682 70085 ERROR neutron File "/usr/bin/neutron-openvswitch-agent", line 28, in <module>
2016-09-06 07:57:39.682 70085 ERROR neutron sys.exit(main())
2016-09-06 07:57:39.682 70085 ERROR neutron File "/usr/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py", line 235, in __init__
2016-09-06 07:57:39.682 70085 ERROR neutron self._restore_local_vlan_map()
2016-09-06 07:57:39.682 70085 ERROR neutron File "/usr/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py", line 356, in _restore_local_vlan_map
2016-09-06 07:57:39.682 70085 ERROR neutron self.available_local_vlans.remove(local_vlan)
2016-09-06 07:57:39.682 70085 ERROR neutron KeyError: 3999
2016-09-06 07:57:39.682 70085 ERROR neutron
2016-09-06 07:57:39.684 70085 INFO oslo_rootwrap.client [req-ef8eea4f-c1ed-47a0-8318-eb5473b7c667 - - - - -] Stopping rootwrap daemon process with pid=70197
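
For context, available_local_vlans in the agent appears (from the traceback) to be a plain Python set of free tags, and set.remove() raises KeyError when the element is already gone. The crash is therefore a second removal of the same tag, which a trivial snippet reproduces:

    # Removing the same element from a set twice raises KeyError,
    # matching the traceback above.
    available_local_vlans = set(range(1, 4095))  # rough stand-in for the agent's pool
    available_local_vlans.remove(3999)  # first port claims tag 3999: fine
    available_local_vlans.remove(3999)  # second port with the same tag: KeyError: 3999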

Esha Seth (eshaseth) on 2016-09-19
summary: neutron-openvswitch-agent is crashing due to KeyError in
- ._restore_local_vlan_map()
+ _restore_local_vlan_map()
description: updated
Brian Haley (brian-haley) wrote :

Looks like a duplicate of https://bugs.launchpad.net/neutron/+bug/1526974 - please see that bug for reference and the fixes sent to all the releases.

Jakub Libosvar (libosvar) wrote :

Could this be a regression caused by https://review.openstack.org/#/c/349990/ ?

Esha Seth (eshaseth) wrote :

I am still seeing the issue in neutron 9.0.0.0 (newton) and also in stable mitaka, so either bug 1526974 should be reopened or a fix pursued here.

Brian Haley (brian-haley) wrote :

I removed the duplicate since it could be a new issue after the change Ihar mentioned.

@Esha: can you provide more detailed steps to repro?

Changed in neutron:
importance: Undecided → High
tags: added: newton-rc-potential

I can't see how it could be the culprit, though.

_restore_local_vlan_map(), which is called during init, was not even touched by [1].

[1] https://review.openstack.org/#/c/349990/

Esha Seth (eshaseth) wrote :

These were the 2 VMs I had running on the host.

[root@ip9-114-248-165 ibm]# virsh list --all
 Id Name State
----------------------------------------------------
 5 RHEL73-Jag-1a1cb620-0000000a running
 8 rhel_jag1-0f67d94d-00000002 running

 [root@ip9-114-248-165 ~]# ovs-vsctl show
6fd4b576-698d-41bc-b226-241145207524
    Bridge br-int
        fail_mode: secure
        Port "tap85564a42-ae"
            tag: 3999
            Interface "tap85564a42-ae"
        Port "tap562a7922-11"
            tag: 3999
            Interface "tap562a7922-11"
        Port br-int
            Interface br-int
                type: internal
        Port "int-default0"
            Interface "int-default0"
                type: patch
                options: {peer="phy-default0"}
    Bridge "default0"
        fail_mode: secure
        Port "phy-default0"
            Interface "phy-default0"
                type: patch
                options: {peer="int-default0"}
        Port "enP3p9s0f0"
            Interface "enP3p9s0f0"
        Port "default0"
            Interface "default0"
                type: internal
    ovs_version: "2.3.1-git3282e51"

Both VMs were based on a flat network.

Then the KVM host was to be reused to add more networks.

The neutron openvswitch agent was then unable to restart, because the VMs on the untagged/flat networks (tagged 3999) hit the same issue in _restore_local_vlan_map, with the identical "KeyError: 3999" traceback quoted in the bug description.

This is happening with both stable mitaka and newton.

Esha Seth (eshaseth) wrote :

I am observing the above issue in neutron on both stable mitaka and newton.

Kevin Benton (kevinbenton) wrote :

Where is tag 3999 coming from? Was that assigned by the agent?

Kevin Benton (kevinbenton) wrote :

Are those ports actually on the same network? I can see in the restore logic that this KeyError could occur if the net_uuid in OVS for each port was different even though they have the same local VLAN tag. This could be the result of another bug that resulted in two different networks getting the same local VLAN.
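
For illustration, a minimal paraphrased sketch of that restore logic (not verbatim from ovs_neutron_agent.py; attribute access and helper names are approximate) shows why a dedup check keyed on net_uuid lets a shared tag reach remove() twice:

    # Sketch: the "already seen" check is keyed on net_uuid, not on the
    # tag, so two ports with different net_uuids but the same tag both
    # reach remove() -- and the second call raises KeyError.
    for port in cur_ports:
        net_uuid = port.other_config.get('net_uuid')
        local_vlan = port.tag
        if net_uuid and net_uuid not in self._local_vlan_hints:
            self.available_local_vlans.remove(local_vlan)  # KeyError on a duplicate tag
            self._local_vlan_hints[net_uuid] = local_vlan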

Esha Seth (eshaseth) wrote :

The tag is coming from the agent (3999 is assigned for untagged/flat networks).
The ports are for 2 different VMs deployed on the same network. The network is a flat network.

Kevin Benton (kevinbenton) wrote :

Can you please provide the output of "sudo ovs-vsctl list Port" so I can see what each port shows for its net-uuid?

Esha Seth (eshaseth) wrote :

_uuid : 53116c80-a7ad-46db-8bdd-4f2257b95a55
bond_active_slave : []
bond_downdelay : 0
bond_fake_iface : false
bond_mode : []
bond_updelay : 0
external_ids : {}
fake_bridge : false
interfaces : [b2d53a32-5e39-44b3-a52a-f4a526c0f81e]
lacp : []
mac : []
name : "enP3p9s0f0"
other_config : {}
qos : []
statistics : {}
status : {}
tag : []
trunks : []
vlan_mode : []

_uuid : 290eaaec-e884-4d5b-ac41-8b882559b523
bond_active_slave : []
bond_downdelay : 0
bond_fake_iface : false
bond_mode : []
bond_updelay : 0
external_ids : {}
fake_bridge : false
interfaces : [2efa4ac6-c4f8-4e52-a63a-8c14b8dd9d14]
lacp : []
mac : []
name : "tap85564a42-ae"
other_config : {net_uuid="2a7ffdd1-f456-4bde-b339-3ff4c6c1d71f", network_ type=flat, physical_network="default0", tag="3999"}
qos : []
statistics : {}
status : {}
tag : 3999
trunks : []
vlan_mode : []

_uuid : 32b69084-ed4a-4333-a36d-6598fd598afd
bond_active_slave : []
bond_downdelay : 0
bond_fake_iface : false
bond_mode : []
bond_updelay : 0
external_ids : {}
fake_bridge : false
interfaces : [c527424e-c5c0-4794-83f0-1fff183a92c9]
lacp : []
mac : []
name : "tap562a7922-11"
other_config : {net_uuid="cc6f4783-a495-40cc-a6e9-5548b667f989", network_ type=flat, physical_network="default0", tag="1"}
qos : []
statistics : {}
status : {}
tag : 3999
trunks : []
vlan_mode : []

_uuid : 2afcf0e1-5fce-43cb-8ae0-b5e990ede0e4
bond_active_slave : []
bond_downdelay : 0
bond_fake_iface : false
bond_mode : []
bond_updelay : 0
external_ids : {}
fake_bridge : false
interfaces : [e626a593-788c-43d3-8478-1309c6632998]
lacp : []
mac : []
name : "phy-default0"
other_config : {}
qos : []
statistics : {}
status : {}
tag : []
trunks : []
vlan_mode : []

_uuid : 75730552-20fc-4f43-81bf-2040a84f6489
bond_active_slave : []
bond_downdelay : 0
bond_fake_iface : false
bond_mode : []
bond_updelay : 0
external_ids : {}
fake_bridge : false
interfaces : [866c1021-aefa-4b43-9a0b-228774d53fc3]
lacp : []
mac : []
name : "default0"
other_config : {}
qos : []
statistics : {}
status : {}
tag : []
trunks ...

(remaining output truncated)

Kevin Benton (kevinbenton) wrote :

Based on that output:

tap85564a42-ae is on network 2a7ffdd1-f456-4bde-b339-3ff4c6c1d71f
tap562a7922-11 is on network cc6f4783-a495-40cc-a6e9-5548b667f989

Did you update the mysql database or something to change which network either was associated with?

Esha Seth (eshaseth) wrote :

No, I did not.
The tag is coming from the agent (3999 is assigned for untagged/flat networks).
The ports are for 2 different VMs deployed on 2 different flat networks at different times.

tags: added: ovs
Esha Seth (eshaseth) wrote :

Kevin Benton, I agree we are hitting the same bug you mentioned in comment 10. My comment 11 incorrectly said that the ports were on the same network. Comments 13 and 15 corroborate that the ports on the 2 VMs were from 2 different flat networks, but they had the same VLAN tag (3999). That is why we could be hitting the KeyError during agent startup.

Kevin Benton (kevinbenton) wrote :

@Esha,

Do you have a way to reproduce this? Unfortunately, the issue occurred at some earlier point that corrupted the state (by giving the same VLAN to two different networks). Were you running two copies of the L2 agent?

Kevin Benton (kevinbenton) wrote :

You can, of course, fix this one-off instance by removing the local VLAN tag on tap562a7922-11 (the screwed-up one).

sudo ovs-vsctl remove port tap562a7922-11 tag 3999
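
(For reference: ovs-vsctl remove takes table, record, column, and value, so this strips the stale tag from that single port; you can confirm the column is cleared with "ovs-vsctl list Port tap562a7922-11". On the next agent restart the port should get a fresh local VLAN assigned from its actual network.)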

Esha Seth (eshaseth) wrote :

One network was created using one instance of the neutron OVS agent, and the other was created using another instance at a later stage. Yes, that is correct. Since both networks were flat networks, the tags are the same. This is the issue.

Kevin Benton (kevinbenton) wrote :

What were the versions of the Neutron agents (earlier instance and later)?

Esha Seth (eshaseth) wrote :

The neutron agent versions were different. One used stable mitaka and the other used newton.

Kevin Benton (kevinbenton) wrote :

I wasn't able to reproduce this. Here are the steps I used:

1. start stable/mitaka OVS agent
2. boot a VM to flat1 (which is a network of type flat with physical_network = 'flat1')
3. stop the agent
4. start the stable/newton OVS agent
5. boot another VM to flat2 (which is a network of type flat with physical_network = 'flat2')

They both end up with different tags. I can reboot the agent as either stable/newton or stable/mitaka and it starts up fine with both VMs.

What was the order of events for you?

Kevin Benton (kevinbenton) wrote :

Also, I see that the prefix of the ports is 'tap' for the VMs instead of 'qvo', which means you aren't using the iptables firewall. Which firewall driver are you using?

Esha Seth (eshaseth) wrote :

I tried one more scenario in which I used one flat network and created 2 ports off it (2 VMs) using the mitaka OVS agent. The ports created had the same tag '2' and the same net_uuid. Then I used the newton agent to create another one, and it worked fine. So I am not seeing the issue now with the same tag.

Changed in neutron:
status: New → Invalid

We've got the same issue after upgrading from liberty. It was really painful, and we've been forced to manually patch the agent on hosts.

This is a real issue, please fix it.

Changed in neutron:
status: Invalid → New
sarvani konda (sarvani) on 2017-01-23
Changed in neutron:
assignee: nobody → sarvani konda (sarvani)

JFYI: it happened in neutron 8.3, the latest mitaka release.

sarvani konda (sarvani) wrote :

Could you tell me the steps to upgrade the OVS agent from mitaka to newton so that I can reproduce the issue?

Hello.

It was Juno, which was upgraded to Kilo, which was upgraded to Liberty, which was upgraded to Mitaka.

During this process the openvswitch-switch service was never restarted.

sarvani konda (sarvani) wrote :

I have tried upgrading OpenStack from mitaka to newton by following these links:

http://docs.openstack.org/ops-guide/ops-upgrades.html
https://wiki.ubuntu.com/OpenStack/CloudArchive
http://docs.openstack.org/mitaka/install-guide-ubuntu/environment-packages.html

but I could not upgrade it successfully. Could someone provide detailed steps to upgrade OpenStack (mitaka to newton) on Ubuntu?

I don't know if I can help you with devstack upgrades. We performed them one by one (J->K->L->M) in a production environment under Chef, and on each upgrade we applied a different version of the configuration files (there is no 'compatible config' for all versions J through M).

sarvani konda (sarvani) wrote :

I tried a scenario where I created two different flat networks and launched VMs associated with those networks. I observed that the VMs get different VLAN tags.
I placed pdb breakpoints in the code to check how VLAN tags are associated: for instance1 the associated VLAN tag is '5' and for instance2 it is '6', which is the expected behavior.

I restarted the OVS agent and checked; it still gets different VLAN tags. Also, I didn't find any snippet in the OVS neutron agent file where two different networks get associated with the same VLAN tag.
We tried upgrading from mitaka to newton but couldn't find proper upgrade steps to follow, so the recreation we attempted is based only on understanding the code and manually verifying the scenario with an agent restart, without validating against the upgrade process.

I was able to reproduce the issue on master devstack (commit d7e7dd451aca77a19f5c101f8367c18cdbcac904; multinode environment with DVR: 1 all-in-one controller (dvr_snat), 2 computes (dvr)).

Steps to reproduce:
1. create a network/subnet, boot a vm.
2. try to restart openvswitch-agent

Upon adding some printf logging, it seems that the sg- port and the qr- port of the tenant network and the VM port all have the local_vlan tag set to the same value, so the second attempt to remove it fails with KeyError.

Here are the logs - http://paste.openstack.org/show/616060/
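
For what it's worth, a hypothetical defensive sketch of the direction a fix could take (not the actual patch): claim each local VLAN from the pool at most once during restore, e.g. via set.discard(), which is a no-op when the element is already gone:

    # Hypothetical guard: tolerate several ports (sg-, qr-, VM taps)
    # carrying the same local VLAN during the restore scan.
    claimed = {}  # local_vlan -> net_uuid
    for port in cur_ports:
        net_uuid = port.other_config.get('net_uuid')
        local_vlan = port.tag
        if not net_uuid or local_vlan in claimed:
            continue
        self.available_local_vlans.discard(local_vlan)  # never raises
        claimed[local_vlan] = net_uuid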
