openvswitch sometimes restarts or times out, and this causes neutron-openvswitch-agent to crash

Bug #1296202 reported by Clint Byrum
This bug affects 3 people
Affects: neutron
Status: Fix Released
Importance: High
Assigned to: Robin Wang
Milestone: 2015.1.0

Bug Description

This was observed recently during an Ironic CI run:

2014-03-22 02:38:53.436 7916 ERROR neutron.agent.linux.ovsdb_monitor [-] Error received from ovsdb monitor: ovsdb-client: unix:/var/run/openvswitch/db.sock: receive failed (End of file)
2014-03-22 02:38:53.690 7916 ERROR neutron.agent.linux.ovs_lib [-] Unable to execute ['ovs-vsctl', '--timeout=10', 'list-ports', 'br-int']. Exception:
Command: ['sudo', '/usr/bin/neutron-rootwrap', '/etc/neutron/rootwrap.conf', 'ovs-vsctl', '--timeout=10', 'list-ports', 'br-int']
Exit code: 1
Stdout: ''
Stderr: '2014-03-22T02:38:53Z|00001|reconnect|WARN|unix:/var/run/openvswitch/db.sock: connection attempt failed (No such file or directory)\novs-vsctl: unix:/var/run/openvswitch/db.sock: database connection failed (No such file or directory)\n'
2014-03-22 02:38:53.693 7916 ERROR neutron.plugins.openvswitch.agent.ovs_neutron_agent [-] Error while processing VIF ports
2014-03-22 02:38:53.693 7916 TRACE neutron.plugins.openvswitch.agent.ovs_neutron_agent Traceback (most recent call last):
2014-03-22 02:38:53.693 7916 TRACE neutron.plugins.openvswitch.agent.ovs_neutron_agent File "/opt/stack/venvs/neutron/local/lib/python2.7/site-packages/neutron/plugins/openvswitch/agent/ovs_neutron_agent.py", line 1212, in rpc_loop
2014-03-22 02:38:53.693 7916 TRACE neutron.plugins.openvswitch.agent.ovs_neutron_agent port_info = self.scan_ports(ports, updated_ports_copy)
2014-03-22 02:38:53.693 7916 TRACE neutron.plugins.openvswitch.agent.ovs_neutron_agent File "/opt/stack/venvs/neutron/local/lib/python2.7/site-packages/neutron/plugins/openvswitch/agent/ovs_neutron_agent.py", line 815, in scan_ports
2014-03-22 02:38:53.693 7916 TRACE neutron.plugins.openvswitch.agent.ovs_neutron_agent cur_ports = self.int_br.get_vif_port_set()
2014-03-22 02:38:53.693 7916 TRACE neutron.plugins.openvswitch.agent.ovs_neutron_agent File "/opt/stack/venvs/neutron/local/lib/python2.7/site-packages/neutron/agent/linux/ovs_lib.py", line 311, in get_vif_port_set
2014-03-22 02:38:53.693 7916 TRACE neutron.plugins.openvswitch.agent.ovs_neutron_agent port_names = self.get_port_name_list()
2014-03-22 02:38:53.693 7916 TRACE neutron.plugins.openvswitch.agent.ovs_neutron_agent File "/opt/stack/venvs/neutron/local/lib/python2.7/site-packages/neutron/agent/linux/ovs_lib.py", line 267, in get_port_name_list
2014-03-22 02:38:53.693 7916 TRACE neutron.plugins.openvswitch.agent.ovs_neutron_agent res = self.run_vsctl(["list-ports", self.br_name], check_error=True)
2014-03-22 02:38:53.693 7916 TRACE neutron.plugins.openvswitch.agent.ovs_neutron_agent File "/opt/stack/venvs/neutron/local/lib/python2.7/site-packages/neutron/agent/linux/ovs_lib.py", line 74, in run_vsctl
2014-03-22 02:38:53.693 7916 TRACE neutron.plugins.openvswitch.agent.ovs_neutron_agent ctxt.reraise = False
2014-03-22 02:38:53.693 7916 TRACE neutron.plugins.openvswitch.agent.ovs_neutron_agent File "/opt/stack/venvs/neutron/local/lib/python2.7/site-packages/neutron/openstack/common/excutils.py", line 82, in __exit__
2014-03-22 02:38:53.693 7916 TRACE neutron.plugins.openvswitch.agent.ovs_neutron_agent six.reraise(self.type_, self.value, self.tb)
2014-03-22 02:38:53.693 7916 TRACE neutron.plugins.openvswitch.agent.ovs_neutron_agent File "/opt/stack/venvs/neutron/local/lib/python2.7/site-packages/neutron/agent/linux/ovs_lib.py", line 67, in run_vsctl
2014-03-22 02:38:53.693 7916 TRACE neutron.plugins.openvswitch.agent.ovs_neutron_agent return utils.execute(full_args, root_helper=self.root_helper)
2014-03-22 02:38:53.693 7916 TRACE neutron.plugins.openvswitch.agent.ovs_neutron_agent File "/opt/stack/venvs/neutron/local/lib/python2.7/site-packages/neutron/agent/linux/utils.py", line 76, in execute
2014-03-22 02:38:53.693 7916 TRACE neutron.plugins.openvswitch.agent.ovs_neutron_agent raise RuntimeError(m)
2014-03-22 02:38:53.693 7916 TRACE neutron.plugins.openvswitch.agent.ovs_neutron_agent RuntimeError:
2014-03-22 02:38:53.693 7916 TRACE neutron.plugins.openvswitch.agent.ovs_neutron_agent Command: ['sudo', '/usr/bin/neutron-rootwrap', '/etc/neutron/rootwrap.conf', 'ovs-vsctl', '--timeout=10', 'list-ports', 'br-int']
2014-03-22 02:38:53.693 7916 TRACE neutron.plugins.openvswitch.agent.ovs_neutron_agent Exit code: 1
2014-03-22 02:38:53.693 7916 TRACE neutron.plugins.openvswitch.agent.ovs_neutron_agent Stdout: ''
2014-03-22 02:38:53.693 7916 TRACE neutron.plugins.openvswitch.agent.ovs_neutron_agent Stderr: '2014-03-22T02:38:53Z|00001|reconnect|WARN|unix:/var/run/openvswitch/db.sock: connection attempt failed (No such file or directory)\novs-vsctl: unix:/var/run/openvswitch/db.sock: database connection failed (No such file or directory)\n'

This should result in a retry/sleep loop, not death of neutron-openvswitch-agent.
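
A retry/sleep wrapper of the kind suggested here could look roughly like the following minimal sketch (the helper name, retry count, and sleep interval are illustrative assumptions, not the actual neutron code):
=====================================
import subprocess
import time


def run_vsctl_with_retry(args, retries=5, delay=2):
    """Run an ovs-vsctl command, retrying while ovsdb-server is unreachable.

    Illustrative sketch only: instead of letting the caller die on the
    first failure, retry a few times with a short sleep in between.
    """
    cmd = ['ovs-vsctl', '--timeout=10'] + list(args)
    for attempt in range(1, retries + 1):
        try:
            return subprocess.check_output(cmd, stderr=subprocess.STDOUT)
        except subprocess.CalledProcessError:
            # ovsdb-server may be restarting; back off and try again.
            if attempt == retries:
                raise
            time.sleep(delay)


# Example: list ports on the integration bridge.
# ports = run_vsctl_with_retry(['list-ports', 'br-int']).splitlines()
=====================================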

Tags: ovs
Revision history for this message
Sridhar Gaddam (sridhargaddam) wrote :

Hello Clint,

I was trying to reproduce this issue, but was not successful so far.
In order to simulate the error condition I manually terminated the ovsdb-server process.
With this, I could see a similar trace in neutron-openvswitch-agent logs.

2014-03-28 12:18:23.896 ERROR neutron.plugins.openvswitch.agent.ovs_neutron_agent [-] Error while processing VIF ports
2014-03-28 12:18:23.896 TRACE neutron.plugins.openvswitch.agent.ovs_neutron_agent Traceback (most recent call last):
2014-03-28 12:18:23.896 TRACE neutron.plugins.openvswitch.agent.ovs_neutron_agent File "/opt/stack/neutron/neutron/plugins/openvswitch/agent/ovs_neutron_agent.py", line 1213, in rpc_loop
2014-03-28 12:18:23.896 TRACE neutron.plugins.openvswitch.agent.ovs_neutron_agent port_info = self.scan_ports(ports, updated_ports_copy)
2014-03-28 12:18:23.896 TRACE neutron.plugins.openvswitch.agent.ovs_neutron_agent File "/opt/stack/neutron/neutron/plugins/openvswitch/agent/ovs_neutron_agent.py", line 816, in scan_ports
2014-03-28 12:18:23.896 TRACE neutron.plugins.openvswitch.agent.ovs_neutron_agent cur_ports = self.int_br.get_vif_port_set()
2014-03-28 12:18:23.896 TRACE neutron.plugins.openvswitch.agent.ovs_neutron_agent File "/opt/stack/neutron/neutron/agent/linux/ovs_lib.py", line 311, in get_vif_port_set
2014-03-28 12:18:23.896 TRACE neutron.plugins.openvswitch.agent.ovs_neutron_agent port_names = self.get_port_name_list()
2014-03-28 12:18:23.896 TRACE neutron.plugins.openvswitch.agent.ovs_neutron_agent File "/opt/stack/neutron/neutron/agent/linux/ovs_lib.py", line 267, in get_port_name_list
2014-03-28 12:18:23.896 TRACE neutron.plugins.openvswitch.agent.ovs_neutron_agent res = self.run_vsctl(["list-ports", self.br_name], check_error=True)
2014-03-28 12:18:23.896 TRACE neutron.plugins.openvswitch.agent.ovs_neutron_agent File "/opt/stack/neutron/neutron/agent/linux/ovs_lib.py", line 74, in run_vsctl
2014-03-28 12:18:23.896 TRACE neutron.plugins.openvswitch.agent.ovs_neutron_agent ctxt.reraise = False
2014-03-28 12:18:23.896 TRACE neutron.plugins.openvswitch.agent.ovs_neutron_agent File "/opt/stack/neutron/neutron/openstack/common/excutils.py", line 82, in __exit__
2014-03-28 12:18:23.896 TRACE neutron.plugins.openvswitch.agent.ovs_neutron_agent six.reraise(self.type_, self.value, self.tb)
2014-03-28 12:18:23.896 TRACE neutron.plugins.openvswitch.agent.ovs_neutron_agent File "/opt/stack/neutron/neutron/agent/linux/ovs_lib.py", line 67, in run_vsctl
2014-03-28 12:18:23.896 TRACE neutron.plugins.openvswitch.agent.ovs_neutron_agent return utils.execute(full_args, root_helper=self.root_helper)
2014-03-28 12:18:23.896 TRACE neutron.plugins.openvswitch.agent.ovs_neutron_agent File "/opt/stack/neutron/neutron/agent/linux/utils.py", line 76, in execute
2014-03-28 12:18:23.896 TRACE neutron.plugins.openvswitch.agent.ovs_neutron_agent raise RuntimeError(m)
2014-03-28 12:18:23.896 TRACE neutron.plugins.openvswitch.agent.ovs_neutron_agent RuntimeError:
2014-03-28 12:18:23.896 TRACE neutron.plugins.openvswitch.agent.ovs_neutron_agent Command: ['sudo', '/usr/local/bin/neutron-rootwrap', '/etc/neutron/rootwrap.conf', 'ovs-vsctl', '--timeout=10',...


Revision history for this message
Sridhar Gaddam (sridhargaddam) wrote :

From the code I could see that we are handling a generic Exception, so the process should ideally not be terminated.

/opt/stack/neutron/neutron/plugins/openvswitch/agent/ovs_neutron_agent.py:rpc_loop()

1263 polling_manager.polling_completed()
1264 except Exception as e:
1265 LOG.exception(_("Error while processing VIF ports"))
1266 # Put the ports back in self.updated_port
1267 self.updated_ports |= updated_ports_copy
1268 sync = True

Revision history for this message
Salvatore Orlando (salvatore-orlando) wrote :

Assigned for triaging

Changed in neutron:
importance: Undecided → High
assignee: nobody → Salvatore Orlando (salvatore-orlando)
status: New → Confirmed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.openstack.org/110538

Revision history for this message
Robin Wang (robinwang) wrote :

Patch is submitted for review.

Bug 1350179 should be a duplicate of this bug.
https://bugs.launchpad.net/neutron/+bug/1350179

OVS status is not fully detected, represented, or processed. Exceptions occur when operations are applied to the ovs bridge while ovs is dead.

An easy way to reproduce:
This is a compute node running Ubuntu 14.04, with up-to-date Neutron code.

1. Make sure both ovs and neutron-ovs-agent are running:
=====================================
$ service openvswitch-switch status
openvswitch-switch start/running

bengo@Ctl-Nwk-Ubt-Dk-1404:/opt/stack/logs$ ps -e | grep ovs
  932 ? 00:00:02 ovsdb-server
  942 ? 00:02:04 ovs-vswitchd
29210 pts/21 00:00:00 ovsdb-client

$ ps -eLf | grep neutron-openvswitch-agent
bengo 25701 3271 25701 0 1 12:12 pts/13 00:00:00 grep --color=auto neutron-openvswitch-agent
bengo 28927 28926 28927 0 1 03:32 pts/21 00:02:35 python /usr/local/bin/neutron-openvswitch-agent --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/plugins/ml2/ml2_conf.ini
=====================================

2. Stop the ovs service (ovsdb-server/ovs-vswitchd/...)
=====================================
$ service openvswitch-switch stop
openvswitch-switch stop/waiting
=====================================

3. Neutron-ovs-agent crashes.
=====================================
$ ps -eLf | grep neutron-openvswitch-agent
bengo 21803 9492 21803 0 1 03:42 pts/12 00:00:00 grep --color=auto neutron-openvswitch-agent
=====================================

Revision history for this message
Robin Wang (robinwang) wrote :

Tested on a local box with the patch: ovs-agent no longer crashes when ovs is dead or restarted.

After stopping the ovs service, ovs-agent keeps running. After restarting the ovs service, ovs-agent continues to process bridges/ports as expected.

Changed in neutron:
assignee: Salvatore Orlando (salvatore-orlando) → Robin Wang (robinwang)
Revision history for this message
Robin Wang (robinwang) wrote :

Patch in review, proposed in comment #4.
https://review.openstack.org/#/c/110538/

Changed in neutron:
status: Confirmed → In Progress
Changed in neutron:
assignee: Robin Wang (robinwang) → Eugene Nikanorov (enikanorov)
Changed in neutron:
assignee: Eugene Nikanorov (enikanorov) → Robin Wang (robinwang)
Revision history for this message
Robin Wang (robinwang) wrote :

[1] https://bugs.launchpad.net/neutron/+bug/1296202
[2] https://bugs.launchpad.net/neutron/+bug/1350179
[3] https://bugs.launchpad.net/neutron/+bug/1351135

[Root cause]
1. Currently only a 'restarted' flag is used in rpc_loop() to identify ovs status.
 ovs_restarted = self.check_ovs_restart()

True: ovs is running, but a restart happened before this loop; rpc_loop() resets bridges and re-processes ports.
False: ovs has been running since the last loop; rpc_loop() continues processing normally.

But if ovs is dead, or is not up yet during a restart, check_ovs_restart() will incorrectly return "True". rpc_loop() then continues to reset bridges and apply other ovs operations, eventually causing exceptions and a crash. Related bugs: [1] [2]

2. Also, during agent boot-up, ovs status is not checked at all. The agent crashes without any useful log info when ovs is dead. Related bug: [3]

[Proposal]
1. Add const {NORMAL, DEAD, RESTARTED} to represent ovs status.
NORMAL - ovs has been running since the last loop; rpc_loop() continues processing normally.
RESTARTED - ovs is running, but a restart happened before this loop; rpc_loop() resets bridges and re-processes ports.
DEAD - keep the agent running, but rpc_loop() applies no ovs operations, preventing unnecessary exceptions/crashes. When ovs comes back up, the agent enters RESTARTED mode.

2. Check ovs status during agent boot-up; if it is DEAD, exit gracefully (since subsequent operations would cause a crash) and write a log message noting that the dead ovs caused the agent to terminate. A rough sketch of this proposal is shown below.
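
A rough, self-contained sketch of this proposal (the constant names, the socket probe, and the loop structure are illustrative assumptions, not the actual neutron code):
=====================================
import os
import time

# Hypothetical constants representing ovs status, per the proposal above.
OVS_NORMAL = 'normal'
OVS_RESTARTED = 'restarted'
OVS_DEAD = 'dead'

OVSDB_SOCK = '/var/run/openvswitch/db.sock'  # assumed default socket path


def check_ovs_status(previous_status):
    """Return OVS_DEAD, OVS_RESTARTED or OVS_NORMAL.

    Illustrative probe only: here we just look for the ovsdb socket;
    the real agent uses its own liveness/restart detection.
    """
    if not os.path.exists(OVSDB_SOCK):
        return OVS_DEAD
    if previous_status == OVS_DEAD:
        # ovs came back after being down: treat it as a restart.
        return OVS_RESTARTED
    return OVS_NORMAL


def rpc_loop(process_ports, reset_bridges, polling_interval=2):
    """Keep the agent alive across ovs restarts instead of crashing."""
    status = check_ovs_status(OVS_NORMAL)
    if status == OVS_DEAD:
        # Proposal item 2: refuse to start against a dead ovs.
        raise SystemExit("ovs is dead; terminating agent")
    while True:
        status = check_ovs_status(status)
        if status == OVS_DEAD:
            # Skip ovs operations this cycle; stay alive and poll again.
            time.sleep(polling_interval)
            continue
        if status == OVS_RESTARTED:
            reset_bridges()  # reset bridges and re-process all ports
        process_ports()
        time.sleep(polling_interval)
=====================================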

[Code Review] https://review.openstack.org/#/c/110538/

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: master
Review: https://review.openstack.org/113450

Revision history for this message
Robin Wang (robinwang) wrote :

So far I've posted two alternative patches, both of them work according to test.

[1] https://review.openstack.org/#/c/110538/
Introduces new constants to fully represent ovs status. The status is also visible outside the "check_ovs_restart" method, so besides solving this bug it can be reused for other purposes in the future.

[2] https://review.openstack.org/#/c/113450/
A simpler resolution: the specific status is kept inside the "check_ovs_restart" method. If ovs is dead, the method loops, scanning until ovs is up again, and then returns True. The pro is that all code/logic outside can stay the same; the con is that not all status is visible outside. A minimal sketch follows.
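
A minimal sketch of this second approach, assuming a socket-existence probe and a fixed poll interval (both are simplifications, not the patch's actual logic):
=====================================
import os
import time

OVSDB_SOCK = '/var/run/openvswitch/db.sock'  # assumed default socket path


def check_ovs_restart(poll_interval=2):
    """Return True if ovs restarted, blocking internally while ovs is dead.

    Sketch of approach [2]: callers never see a 'dead' state; they only
    learn that a restart happened once ovs is reachable again, so the
    code outside this method stays unchanged.
    """
    restarted = False
    while not os.path.exists(OVSDB_SOCK):
        restarted = True
        time.sleep(poll_interval)
    return restarted
=====================================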

Your review and advice are appreciated. Thanks.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (master)

Change abandoned by Salvatore Orlando (<email address hidden>) on branch: master
Review: https://review.openstack.org/110538
Reason: This patch has been inactive long enough that I think it's safe to abandon.
The author can resurrect it if needed.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Change abandoned by Salvatore Orlando (<email address hidden>) on branch: master
Review: https://review.openstack.org/113450
Reason: This patch has been inactive long enough that I think it's safe to abandon.
The author can resurrect it if needed.

tags: added: ovs
Changed in neutron:
status: In Progress → Confirmed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.openstack.org/136991

Changed in neutron:
status: Confirmed → In Progress
Revision history for this message
Robin Wang (robinwang) wrote :

The latest patch (comment #13) resurrects patch [2] from comment #10; refer to comment #10 for details of the two alternative patches.

The latest patch should resolve this bug. It also addresses review comments from Eugene and Maru on the logging code.

Revision history for this message
Robin Wang (robinwang) wrote :

As this is a high-priority stability bug, should we mark it with a Kilo milestone? Thanks.

PS: the patch is ready and in review, and passes Jenkins.

Kyle Mestery (mestery)
Changed in neutron:
milestone: none → kilo-1
Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: master
Review: https://review.openstack.org/140342

Revision history for this message
Robin Wang (robinwang) wrote :

Per Russell and Kyle's comments in patch https://review.openstack.org/#/c/136991/

Resurrecting this alternative approach. Thanks for the review, Russell and Kyle!

Both patches solve the ovs-crash issue, but there are some differences in this one:
https://review.openstack.org/140342
* adds constants to fully represent ovs status, which can also be used outside the check_ovs_status() method
* wraps the loop count and wait code into a method (see the sketch below)
* if check_ovs_status() returns OVS.DEAD, the sleep happens outside it: rpc_loop waits and then continues to the next iteration
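
The second bullet ("wrap the loop count and wait code into a method") might look something like this small helper; the name, arguments, and limits are hypothetical, not the merged code:
=====================================
import time


def wait_for_ovs(is_ovs_alive, max_checks=None, interval=2):
    """Poll until ovs is reachable again, optionally giving up.

    Hypothetical helper: `is_ovs_alive` is any callable returning a bool;
    the caller (e.g. rpc_loop) decides what to do if ovs stays dead.
    """
    checks = 0
    while not is_ovs_alive():
        checks += 1
        if max_checks is not None and checks >= max_checks:
            return False
        time.sleep(interval)
    return True
=====================================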

Comments and advice on the two alternative patches are welcome. Thanks.

Kyle Mestery (mestery)
Changed in neutron:
milestone: kilo-1 → kilo-2
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.openstack.org/140342
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=91b7fc7f162751936f7cb15d4add932a4aebd55b
Submitter: Jenkins
Branch: master

commit 91b7fc7f162751936f7cb15d4add932a4aebd55b
Author: RobinWang <email address hidden>
Date: Tue Dec 9 11:52:14 2014 +0800

    Add OVS status and fix OVS crash

    OVS crash/restart is unpredictable, so neutron-ovs-agent should be
    robust enough under that situation. But currently ovs-agent doesn't
    figure out this error status(only check ovs restart/normal status)
    and still continue to apply subsequent operations(set br/add patch
    port/...) till causing exceptions/crash. Add flag to fully represent
    ovs status. Base on that, we can add proper fail-over code in method
    rpc_loop, to treat ovs dead/restart gracefully to prevent agent
    crashes while it is running.
    Closes-bug: #1296202

    Change-Id: Id058b7ddef2ed337627dc692d0418786ad14f4b4

Changed in neutron:
status: In Progress → Fix Committed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (master)

Change abandoned by Kyle Mestery (<email address hidden>) on branch: master
Review: https://review.openstack.org/136991
Reason: This review is > 4 weeks without comment, and failed Jenkins the last time it was checked. We are abandoning this for now. Feel free to reactivate the review by pressing the restore button and leaving a 'recheck' comment to get fresh test results.

Thierry Carrez (ttx)
Changed in neutron:
status: Fix Committed → Fix Released
Thierry Carrez (ttx)
Changed in neutron:
milestone: kilo-2 → 2015.1.0
Revision history for this message
Jian Wen (wenjianhn) wrote :

Let it crash!

By doing so, the controller knows that the agent is not reliable enough to handle any further work.

Revision history for this message
Jian Wen (wenjianhn) wrote :

"When using the reference implementation (ovs + neutron-l2-agent), neutron-l2-agent will continue reporting to the Neutron server its state as alive (there's heartbeat; service's up ), although there's an internal error caused by unreachability to the virtual bridge (br-int). "

See bug 1513144
