OVS agent may die if a (retriable) exception is raised in its __init__

Bug #1534110 reported by Miguel Angel Ajo on 2016-01-14
This bug affects 2 people
Affects: neutron
Importance: High
Assigned to: YAMAMOTO Takashi

Bug Description

Probably we should provide a reconnection mechanism when something on the OpenFlow connection goes wrong.

2016-01-06 08:23:45.031 11755 DEBUG OfctlService [-] dpid 231386065181514 -> datapath None _handle_get_datapath /opt/stack/new/neutron/.tox/dsvm-fullstack-constraints/local/lib/python2.7/site-packages/ryu/app/ofctl/service.py:106
2016-01-06 08:23:45.032 11755 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.ofswitch [-] Switch connection timeout
2016-01-06 08:23:45.033 11755 DEBUG neutron.agent.linux.utils [-] Running command: ['sudo', 'ovs-vsctl', '--timeout=10', '--oneline', '--format=json', '--', '--columns=datapath_id', 'list', 'Bridge', 'br-int261889006'] create_process /opt/stack/new/neutron/neutron/agent/linux/utils.py:84
2016-01-06 08:23:45.057 11755 DEBUG neutron.agent.linux.utils [-] Exit code: 0 execute /opt/stack/new/neutron/neutron/agent/linux/utils.py:142
2016-01-06 08:23:45.058 11755 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [-] Switch connection timeout Agent terminated!
2016-01-06 08:23:45.060 11755 ERROR ryu.lib.hub [-] hub: uncaught exception: Traceback (most recent call last):
  File "/opt/stack/new/neutron/.tox/dsvm-fullstack-constraints/local/lib/python2.7/site-packages/ryu/lib/hub.py", line 52, in _launch
    func(*args, **kwargs)
  File "/opt/stack/new/neutron/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py", line 1991, in main
    sys.exit(1)
SystemExit: 1

http://logs.openstack.org/77/240577/6/check/gate-neutron-dsvm-fullstack/ec699c7/logs/TestConnectivitySameNetwork.test_connectivity_VLANs,Native_/neutron-openvswitch-agent--2016-01-06--08-23-13-672140.log.txt.gz#_2016-01-06_08_23_45_032

http://logs.openstack.org/77/240577/6/check/gate-neutron-dsvm-fullstack/ec699c7/testr_results.html.gz

Tags: ovs
tags: added: fullstack
John Trowbridge (trown) wrote:

We have observed a similar issue when testing TripleO in a resource-constrained environment (all VMs on a 32G RAM virthost). In this environment we often see commands timing out and needing to be retried, and it is not unusual for RPC connections to time out as well.

When using the OF native connection, if RPC times out, we lose open connections as well. This sometimes happens on the undercloud during the stage where Ironic is dd'ing the image to the overcloud nodes, which results in a deploy failure.

https://ci.centos.org/artifacts/rdo/jenkins-tripleo-quickstart-promote-master-delorean-minimal-411/undercloud/var/log/neutron/openvswitch-agent.log.gz

2016-07-26 10:49:34.475 17287 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [-] Failed reporting state!
2016-07-26 10:49:34.475 17287 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent Traceback (most recent call last):
2016-07-26 10:49:34.475 17287 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent File "/usr/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py", line 322, in _report_state
2016-07-26 10:49:34.475 17287 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent True)
2016-07-26 10:49:34.475 17287 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent File "/usr/lib/python2.7/site-packages/neutron/agent/rpc.py", line 87, in report_state
2016-07-26 10:49:34.475 17287 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent return method(context, 'report_state', **kwargs)
2016-07-26 10:49:34.475 17287 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent File "/usr/lib/python2.7/site-packages/neutron/common/rpc.py", line 157, in call
2016-07-26 10:49:34.475 17287 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent time.sleep(wait)
2016-07-26 10:49:34.475 17287 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 220, in __exit__
2016-07-26 10:49:34.475 17287 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent self.force_reraise()
2016-07-26 10:49:34.475 17287 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 196, in force_reraise
2016-07-26 10:49:34.475 17287 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent six.reraise(self.type_, self.value, self.tb)
2016-07-26 10:49:34.475 17287 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent File "/usr/lib/python2.7/site-packages/neutron/common/rpc.py", line 138, in call
2016-07-26 10:49:34.475 17287 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent return self._original_context.call(ctxt, method, **kwargs)
2016-07-26 10:49:34.475 17287 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/client.py", line 169, in call
2016-07-26 10:49:34.475 17287 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neu...

Miguel Angel Ajo (mangelajo) wrote:

Moved this to high after talking to @iharchys, since we're not using the native implementation as the default.

Changed in neutron:
importance: Medium → High
status: New → Triaged
Alan Pevec (apevec) wrote:

" we're not using " -> we are NOW using ?

I don't think it affects fullstack specifically, removing the tag.

tags: removed: fullstack

Otherwise, the bug probably still affects us: if OVSNeutronAgent.__init__ raises a RuntimeError (or any other exception), the agent terminates instead of retrying. Once in the daemon_loop, we correctly catch all Exceptions, so the agent does not die; instead we resync, which is correct.

To fix this bug, we would probably need to make sure that if __init__ raises a specific retriable exception, we retry initialization. Note that we probably don't want to retry on every exception, because some of them are not recoverable, and we would then end up in an infinite loop.

Also, I don't think only the native implementation is affected: anything in __init__ that raises an exception can bring the agent down.
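The approach above could be sketched roughly as follows. This is a minimal, hypothetical illustration, not Neutron's actual code: the `SwitchConnectionTimeout` class and the `launch_agent` helper are assumptions. The point is that construction is retried only for exceptions explicitly marked as retriable, while everything else still propagates and terminates the agent, avoiding an infinite retry loop on unrecoverable errors.

```python
import time


class SwitchConnectionTimeout(RuntimeError):
    """Hypothetical marker for failures worth retrying,
    e.g. an OpenFlow switch connection timeout."""


def launch_agent(agent_cls, retry_interval=5, max_retries=3):
    """Construct the agent, retrying only on retriable exceptions.

    Non-retriable exceptions propagate immediately, so genuinely
    unrecoverable errors still terminate the agent as before.
    """
    for attempt in range(1, max_retries + 1):
        try:
            return agent_cls()
        except SwitchConnectionTimeout:
            if attempt == max_retries:
                raise  # give up after the last attempt
            time.sleep(retry_interval)
```

A variant of this could also bound retries by elapsed time rather than attempt count; the essential design choice is whitelisting the retriable exception types instead of catching everything.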

summary: - OF native connection sometimes goes away and agent exits
+ OVS agent may die if a (retriable) exception is raised in its __init__