neutron-openvswitch-agent is crashing with "invalid literal for int() with base 10" error

Bug #1494281 reported by bharath on 2015-09-10
This bug affects 13 people
Affects: neutron; Importance: High; Assigned to: Thomas Herve
Affects: Liberty; Importance: High; Assigned to: Armando Migliaccio

Bug Description

neutron-openvswitch-agent is crashing with the error below:

2015-09-10 04:39:36.675 DEBUG neutron.agent.linux.utils [req-a6c70c4e-aa40-44e4-bd09-493e82bfe43c None None]
Command: ['ovs-vsctl', '--timeout=10', '--oneline', '--format=json', '--', '--columns=name,other_config,tag', 'list', 'Port', u'tap8e259da4-e8']
Exit code: 0
 from (pid=26026) execute /opt/stack/neutron/neutron/agent/linux/utils.py:157
2015-09-10 04:39:36.675 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [req-a6c70c4e-aa40-44e4-bd09-493e82bfe43c None None] invalid literal for int() with base 10: 'None' Agent terminated!
2015-09-10 04:39:36.677 INFO oslo_rootwrap.client [req-a6c70c4e-aa40-44e4-bd09-493e82bfe43c None None] Stopping rootwrap daemon process with pid=26080

I suspect that the commit "Implement external physical bridge mapping in linuxbridge" is causing the breakage. [commit-id: bd734811753a99d61e30998c734e465a8d507b8f]

When I reset the branch to commit b6d780a83cd9a811e8a91db77eb24bb65fa0b075, the issue is not seen.

Tags: ovs
Assaf Muller (amuller) wrote :

1) Can you use pastebin to paste a full trace?
2) Since Tempest runs on every patch and successfully starts the OVS agent, I assume this is a configuration issue or some other local issue.
3) The commit you linked to doesn't change anything in the OVS agent, or code that the OVS agent uses. Can you check if it's actually another commit that caused the issue?

Changed in neutron:
status: New → Incomplete
Naohiro Tamura (naohirot) wrote :

When I created a devstack environment to test Ironic, I got exactly the same error.

2015-09-14 14:58:59.044 DEBUG neutron.agent.linux.utils [req-59679578-c564-488f-8331-c8f61e2e5c0b None None] Running command (rootwrap daemon): ['ovs-vsctl', '--timeout=10', '--oneline', '--format=json', '--', '--columns=name,other_config,tag', 'list', 'Port', 'tapea890f2b-7c'] from (pid=29175) execute_rootwrap_daemon /opt/stack/neutron/neutron/agent/linux/utils.py:102
2015-09-14 14:58:59.091 DEBUG neutron.agent.linux.utils [req-59679578-c564-488f-8331-c8f61e2e5c0b None None]
Command: ['ovs-vsctl', '--timeout=10', '--oneline', '--format=json', '--', '--columns=name,other_config,tag', 'list', 'Port', u'tapea890f2b-7c']
Exit code: 0
 from (pid=29175) execute /opt/stack/neutron/neutron/agent/linux/utils.py:157
2015-09-14 14:58:59.092 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [req-59679578-c564-488f-8331-c8f61e2e5c0b None None] invalid literal for int() with base 10: 'None' Agent terminated!
2015-09-14 14:58:59.109 INFO oslo_rootwrap.client [req-59679578-c564-488f-8331-c8f61e2e5c0b None None] Stopping rootwrap daemon process with pid=29343

bharath (bharath-7) wrote :

I found the root cause and tested the patch locally. It works fine.

Root cause:

In class OVSNeutronAgent() (File: /neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py)

                 self.provision_local_vlan(local_vlan_map['net_uuid'],
                                           local_vlan_map['network_type'],
                                           local_vlan_map['physical_network'],
                                           int(local_vlan_map[
                                               'segmentation_id']),
                                           local_vlan)

The integer conversion of segmentation_id is causing the problem.
In some cases segmentation_id is None, and calling int() on it raises the exception.

I am submitting a patchset today. Please provide comments.
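
The failure described above can be reproduced in isolation. The following is a minimal Python sketch, assuming the restored map carries the string 'None' for segmentation_id as reported in this bug. The helper name parse_segmentation_id and the sample data are hypothetical and are not part of neutron.

```python
def parse_segmentation_id(value):
    """Convert a stored segmentation_id to int.

    Hypothetical helper: treats both None and the string 'None' as
    "no segmentation id" instead of passing them to int().
    """
    if value is None or value == 'None':
        return None
    return int(value)


# Illustrative data shaped like the map restored from the port's other_config.
local_vlan_map = {'network_type': 'flat', 'segmentation_id': 'None'}

# The unguarded conversion raises ValueError, which is what kills the agent.
try:
    int(local_vlan_map['segmentation_id'])
    error_message = None
except ValueError as exc:
    error_message = str(exc)

# The guarded conversion returns None for the same input.
safe_value = parse_segmentation_id(local_vlan_map['segmentation_id'])
```

Whether None is the right value to pass on to provision_local_vlan is a design decision for the actual patch; the sketch only shows where the exception comes from.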

Edgar Magana (emagana) on 2015-09-16
Changed in neutron:
importance: Undecided → Medium

bharath, if you are planning on submitting a patch, please assign this bug to yourself.

Edgar Magana (emagana) wrote :

We need to get this fixed ASAP. I am submitting a quick fix; feel free to overwrite my commit.

Fix proposed to branch: master
Review: https://review.openstack.org/224297

Changed in neutron:
assignee: nobody → Edgar Magana (emagana)
status: Incomplete → In Progress
Changed in neutron:
status: In Progress → Incomplete
Edgar Magana (emagana) wrote :

Adding steps to reproduce. Basically:

1. git clone http://github.com/openstack-dev/devstack.git
2. edit localrc to enable neutron
3. ./stack.sh
4. screen -x and switch to q-agt screen
5. stop the process
6. Restart the process and error will be:

 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [req-436ba1a2-0a75-4e40-bfc0-a821912aa882 None None] invalid literal for int() with base 10: 'None' Agent terminated!

I can't reproduce the issue with these steps. The agent doesn't crash and it doesn't log any error. What does your ml2 config look like?

Edgar Magana (emagana) wrote :

[ml2]
tenant_network_type = vxlan
extension_drivers = port_security
type_drivers = local,flat,vlan,gre,vxlan
mechanism_drivers = openvswitch,linuxbridge

[ml2_type_gre]
tunnel_id_ranges = 50:100

[ml2_type_vxlan]
vni_ranges = 1001:2000

[securitygroup]
firewall_driver = neutron.agent.linux.iptables_firewall.OVSHybridIptablesFirewallDriver

[agent]
tunnel_types = vxlan
root_helper_daemon = sudo /usr/local/bin/neutron-rootwrap-daemon /etc/neutron/rootwrap.conf
root_helper = sudo /usr/local/bin/neutron-rootwrap /etc/neutron/rootwrap.conf

[ovs]
tunnel_bridge = br-tun
local_ip = 172.16.175.128

Edgar Magana (emagana) wrote :

Adding my localrc file:

enable_plugin rally https://github.com/openstack/rally master
DATABASE_PASSWORD=nova
RABBIT_PASSWORD=nova
SERVICE_TOKEN=nova
SERVICE_PASSWORD=nova
ADMIN_PASSWORD=nova

HOST_IP=<VM IP>
MULTI_HOST=1
disable_service n-net
enable_service n-cpu
enable_service q-svc
enable_service q-agt
enable_service q-dhcp
enable_service q-l3
enable_service q-meta
enable_service neutron
Q_PLUGIN=ml2
TENANT_TUNNEL_RANGES=50:100
ENABLE_TENANT_TUNNELS=True
ENABLE_TENANT_TUNNELS=True
Q_AGENT_EXTRA_AGENT_OPTS=(tunnel_type=vxlan)
Q_AGENT_EXTRA_OVS_OPTS=(tenant_network_type=vxlan)
Q_SRV_EXTRA_OPTS=(tenant_network_type=vxlan)
Q_USE_NAMESPACE=True
Q_USE_SECGROUP=True
Q_SERVICE_PLUGIN_CLASSES=qos
LIBS_FROM_GIT=python-ceilometerclient,python-cinderclient,python-glanceclient,python-heatclient,python-ironicclient,python-keystoneclient,python-neutronclient,python-novaclient,python-saharaclient,python-swiftclient,python-troveclient,python-openstackclient,oslo.config

Kasey Alusi (kasey-alusi) wrote :

I reproduced the same error:
2015-09-17 14:33:32.223 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [req-c073fc3e-a13d-4f00-9a4d-6836f7b23dca None None] invalid literal for int() with base 10: 'None' Agent terminated!

My localrc:
enable_plugin rally https://github.com/openstack/rally master
DATABASE_PASSWORD=nova
RABBIT_PASSWORD=nova
SERVICE_TOKEN=nova
SERVICE_PASSWORD=nova
ADMIN_PASSWORD=nova

HOST_IP=10.0.2.15
MULTI_HOST=1
disable_service n-net
enable_service n-cpu
enable_service q-svc
enable_service q-agt
enable_service q-dhcp
enable_service q-l3
enable_service q-meta
enable_service neutron
Q_PLUGIN=ml2
TENANT_TUNNEL_RANGES=50:100
ENABLE_TENANT_TUNNELS=True
ENABLE_TENANT_TUNNELS=True
Q_AGENT_EXTRA_AGENT_OPTS=(tunnel_type=vxlan)
Q_AGENT_EXTRA_OVS_OPTS=(tenant_network_type=vxlan)
Q_SRV_EXTRA_OPTS=(tenant_network_type=vxlan)
Q_USE_NAMESPACE=True
Q_USE_SECGROUP=True
Q_SERVICE_PLUGIN_CLASSES=qos
LIBS_FROM_GIT=python-cinderclient,python-glanceclient,python-heatclient,python-keystoneclient,python-neutronclient,python-novaclient,python-swiftclient,python-openstackclient,oslo.config

Edgar Magana (emagana) wrote :

It seems that this issue is based on a typo in the localrc file that I have been using:

Q_AGENT_EXTRA_OVS_OPTS=(tenant_network_type=vxlan)
Q_SRV_EXTRA_OPTS=(tenant_network_type=vxlan)

should be:

Q_AGENT_EXTRA_OVS_OPTS=(tenant_network_types=vxlan)
Q_SRV_EXTRA_OPTS=(tenant_network_types=vxlan)

Change abandoned by Edgar Magana (<email address hidden>) on branch: master
Review: https://review.openstack.org/224297
Reason: The original bug was reproducible based on a typo in the localrc for devstack. After fixing that, the bug is not reproducible anymore.

Fix proposed to branch: master
Review: https://review.openstack.org/225001

Changed in neutron:
assignee: Edgar Magana (emagana) → bharath (bharath-7)
status: Incomplete → In Progress
bharath (bharath-7) wrote :

Sorry for the delayed response; I have been out of station.

I am still facing the issue. I don't have the typo in my localrc file.

stack@ci-jslave-base:/opt/stack/devstack$ rgrep "tenant_network_type"
lib/neutron_plugins/ml2: Q_SRV_EXTRA_OPTS+=(tenant_network_types=$Q_ML2_TENANT_NETWORK_TYPE)
lib/neutron_plugins/ml2: Q_SRV_EXTRA_OPTS+=(tenant_network_types=gre)
lib/neutron_plugins/ml2: Q_SRV_EXTRA_OPTS+=(tenant_network_types=vlan)
lib/neutron_plugins/openvswitch: iniset /$Q_PLUGIN_CONF_FILE ovs tenant_network_type gre
lib/neutron_plugins/openvswitch: iniset /$Q_PLUGIN_CONF_FILE ovs tenant_network_type vlan
lib/neutron_thirdparty/vyatta: iniset "/$Q_PLUGIN_CONF_FILE" ml2 tenant_network_types "local,flat,vlan,gre,vxlan"

stack@ci-jslave-base:/etc/neutron$ rgrep "tenant_network_type"
plugins/ml2/ml2_conf.ini:tenant_network_types = local,flat,vlan,gre,vxlan
plugins/ml2/ml2_conf.ini:# tenant_network_types = local
plugins/ml2/ml2_conf.ini:# Example: tenant_network_types = vlan,gre,vxlan,geneve
neutron.conf:# By default or if empty, the first 'tenant_network_types'


The changeset I tried is the same as the one Edgar proposed. Can we push the changeset?
https://review.openstack.org/225001

bharath (bharath-7) wrote :

I can see some references to tenant_network_type under devstack, as shown below:

devstack/lib/neutron_plugins/openvswitch: iniset /$Q_PLUGIN_CONF_FILE ovs tenant_network_type gre
devstack/lib/neutron_plugins/openvswitch: iniset /$Q_PLUGIN_CONF_FILE ovs tenant_network_type vlan

Are the above lines valid?

bharath (bharath-7) wrote :

Can we go ahead with the safe fix?

Miguel Angel Ajo (mangelajo) wrote :

ping @edgar, What was the localrc type triggering the issue?

Miguel Angel Ajo (mangelajo) wrote :

type->typo

Edgar Magana (emagana) wrote :

@bharath - Please paste your entire localrc file.

Edgar Magana (emagana) on 2015-09-27
Changed in neutron:
status: In Progress → Incomplete
bharath (bharath-7) wrote :

This is the local_vlan_map during the error condition:

local_vlan_map:
{u'segmentation_id': u'None', u'physical_network': u'mng', u'net_uuid': u'6896f53b-4098-4f5e-8353-7fcdce775d96', u'network_type': u'flat'}

Changed in neutron:
status: Incomplete → In Progress
Boris Derzhavets (bderzhavets) wrote :

I attempted to test provider external networks with the following ml2_conf.ini:

[ml2]
type_drivers = local,flat,gre,vxlan
tenant_network_types = vxlan
mechanism_drivers =openvswitch
path_mtu = 0
[ml2_type_flat]
flat_networks = *
[ml2_type_vlan]
[ml2_type_gre]
[ml2_type_vxlan]
vni_ranges =1001:2000
vxlan_group =239.1.1.2
[ml2_type_geneve]
[securitygroup]
enable_security_group = True

This results in the same error. The patch https://review.openstack.org/225001 eliminates the problem after restarting neutron-openvswitch-agent.service.

Dan Radez (dradez) wrote :

I'm also hitting this bug, installed through the RDO package openstack-neutron-openvswitch-7.0.0.0-rc2.dev26.el7.centos.noarch.

In my code the segmentation_id defaulted to None and not 'None'.

I tried updating them to default to 0 and this seems to have worked around the issue. Is there something I could test to help find where the issue is coming from?

Thomas Herve (therve) wrote :

I'm hitting this too. The OVS DB has this information stored:

$ sudo ovs-vsctl --timeout=10 --oneline --format=json -- --columns=name,other_config,tag list Port tap054bfd3c-ec
{"data":[["tap054bfd3c-ec",["map",[["net_uuid","a71fc5f5-686b-42a6-90c6-5c0d6825da7a"],["network_type","flat"],["physical_network","ctlplane"],["segmentation_id","None"]]],1]],"headings":["name","other_config","tag"]}

Any workaround for now?
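
For anyone inspecting the same output, here is a small Python sketch that decodes the `--format=json` result shown above into plain dictionaries. The input string is copied from this comment; the function name rows_from_ovs_json is hypothetical, and the sketch only handles the ["map", ...] encoding that appears in this output.

```python
import json

# Output of the ovs-vsctl command quoted above, copied verbatim.
raw = (
    '{"data":[["tap054bfd3c-ec",["map",[["net_uuid",'
    '"a71fc5f5-686b-42a6-90c6-5c0d6825da7a"],["network_type","flat"],'
    '["physical_network","ctlplane"],["segmentation_id","None"]]],1]],'
    '"headings":["name","other_config","tag"]}'
)


def rows_from_ovs_json(text):
    """Turn ovs-vsctl JSON output into a list of {heading: value} dicts."""
    doc = json.loads(text)
    rows = []
    for values in doc['data']:
        row = {}
        for heading, value in zip(doc['headings'], values):
            # Map columns arrive as ["map", [[key, value], ...]].
            if isinstance(value, list) and len(value) == 2 and value[0] == 'map':
                value = dict(value[1])
            row[heading] = value
        rows.append(row)
    return rows


port = rows_from_ovs_json(raw)[0]
stored_segmentation_id = port['other_config']['segmentation_id']
```

The decoded value is the four-character string 'None', not a JSON null, which is why a later int() call on it fails.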

Thomas Herve (therve) wrote :

And from the database:

select * from ml2_network_segments;
+--------------------------------------+--------------------------------------+--------------+------------------+-----------------+------------+---------------+
| id | network_id | network_type | physical_network | segmentation_id | is_dynamic | segment_index |
+--------------------------------------+--------------------------------------+--------------+------------------+-----------------+------------+---------------+
| a3aadbf4-92cf-4fd8-ac50-66e27cc6f956 | a71fc5f5-686b-42a6-90c6-5c0d6825da7a | flat | ctlplane | NULL | 0 | 0 |
+--------------------------------------+--------------------------------------+--------------+------------------+-----------------+------------+---------------+

Thomas Herve (therve) wrote :

Sorry it's unreadable, but it says segmentation_id is NULL in ml2_network_segments.

Thomas Herve (therve) wrote :

I believe the problem has been surfaced here: https://review.openstack.org/#/c/153946/ where the int() call has been added. It certainly doesn't take into account the fact that segmentation_id can be None.

But it seems segmentation_id has been stored as the string 'None' in the OVS DB since forever. AFAIK there is no way to express None when we talk to the OVS DB, so we should omit the key in port_bound when setting the other_config attribute. That means handling the case where it is not present when retrieving it in _restore_local_vlan_map.

Thoughts?
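
A rough Python sketch of the approach proposed above: omit the key when writing, and tolerate both a missing key and the legacy 'None' string when reading. The function names build_other_config and restore_segmentation_id are hypothetical, and this is not the code from the merged patch.

```python
def build_other_config(net_uuid, network_type, physical_network,
                       segmentation_id):
    """Build the other_config map for a port (hypothetical helper).

    segmentation_id is left out entirely when it is None, so the string
    'None' never reaches the OVS database.
    """
    config = {
        'net_uuid': net_uuid,
        'network_type': network_type,
        'physical_network': physical_network,
    }
    if segmentation_id is not None:
        config['segmentation_id'] = str(segmentation_id)
    return config


def restore_segmentation_id(other_config):
    """Read segmentation_id back (hypothetical helper).

    Accepts a missing key (new behaviour) and the legacy 'None' string
    (ports written before the change), returning None for both.
    """
    value = other_config.get('segmentation_id')
    if value is None or value == 'None':
        return None
    return int(value)


flat = build_other_config('a71fc5f5-686b-42a6-90c6-5c0d6825da7a',
                          'flat', 'ctlplane', None)
vxlan = build_other_config('ecb6c0e8-e2e9-4c38-8a2e-a51269dff5ee',
                           'vxlan', None, 1001)
legacy = {'network_type': 'flat', 'segmentation_id': 'None'}
```

Handling the legacy string on the read side matters because ports already written with segmentation_id=None stay in the OVS database across an agent upgrade.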

I suspect it breaks flat network setups. Raising priority.

tags: added: liberty-backport-potential ovs
Changed in neutron:
importance: Medium → High

When I create a flat network and bind an instance to it, I get the following entry in ovs-vsctl ports table:

_uuid : 26edb57f-5879-4424-a064-444d5e2d123e
bond_active_slave : []
bond_downdelay : 0
bond_fake_iface : false
bond_mode : []
bond_updelay : 0
external_ids : {}
fake_bridge : false
interfaces : [7685f978-6c33-47c4-b777-f86f8d6735d6]
lacp : []
mac : []
name : "qvoded7af94-bb"
other_config : {net_uuid="ecb6c0e8-e2e9-4c38-8a2e-a51269dff5ee", network_type=flat, physical_network="eth0", segmentation_id=None}
qos : []
statistics : {}
status : {}
tag : 2
trunks : []
vlan_mode : []

Note segmentation_id=='None'.

Hence the failure.

Fix proposed to branch: master
Review: https://review.openstack.org/237586

Changed in neutron:
assignee: bharath (bharath-7) → Thomas Herve (therve)

Change abandoned by Armando Migliaccio (<email address hidden>) on branch: master
Review: https://review.openstack.org/225001
Reason: superseded by:

https://review.openstack.org/#/c/237586/

The new one has a test too and seems to have more momentum?

Reviewed: https://review.openstack.org/237586
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=51f6b2e1c9c2f5f5106b9ae8316e57750f09d7c9
Submitter: Jenkins
Branch: master

commit 51f6b2e1c9c2f5f5106b9ae8316e57750f09d7c9
Author: Thomas Herve <email address hidden>
Date: Tue Oct 20 15:42:59 2015 +0200

    Properly handle segmentation_id in OVS agent

    The segmentation_id of a OVS VLAN can be None, but a recent change
    assumed that it was always an integer. It highlighted the fact that we
    try to store None in the OVS database, which got stored as a string.
    This fixes the storage, and handles loading the value while keeping
    compatibility.

    Change-Id: I6e7df1406c90ddde254467bb87ff1507a4caaadd
    Closes-Bug: #1494281

Changed in neutron:
status: In Progress → Fix Committed

Reviewed: https://review.openstack.org/238485
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=10e07503524cc244d5c8f1f285db4a4f06dd12e7
Submitter: Jenkins
Branch: stable/liberty

commit 10e07503524cc244d5c8f1f285db4a4f06dd12e7
Author: Thomas Herve <email address hidden>
Date: Tue Oct 20 15:42:59 2015 +0200

    Properly handle segmentation_id in OVS agent

    The segmentation_id of a OVS VLAN can be None, but a recent change
    assumed that it was always an integer. It highlighted the fact that we
    try to store None in the OVS database, which got stored as a string.
    This fixes the storage, and handles loading the value while keeping
    compatibility.

    Change-Id: I6e7df1406c90ddde254467bb87ff1507a4caaadd
    Closes-Bug: #1494281
    (cherry picked from commit 51f6b2e1c9c2f5f5106b9ae8316e57750f09d7c9)

tags: removed: liberty-backport-potential
Morteza Parsa (mpa360) wrote :

A workaround for emergencies :

Fix DB :
mysql --user=root
use neutron;
select * from ml2_network_segments;
update ml2_network_segments set segmentation_id = 0 where id = 'edaff05e-2e1d-47bb-99e4-08d11e702121';

Fix OVS :
ovs-vsctl show
ovs-vsctl --columns=other_config list Port tap35e898e4-ac
ovs-vsctl set Port tap35e898e4-ac other_config:segmentation_id=0

It works for me.

Please release the related bug fix as soon as possible. I think the importance is critical, because the flat network setup is not working.
Thanks

This issue was fixed in the openstack/neutron 8.0.0.0b1 development milestone.

Changed in neutron:
status: Fix Committed → Fix Released

This issue was fixed in the openstack/neutron 7.0.1 release.
