Race condition between neutron-plugin-openvswitch-agent and ovsdb-server

Bug #1552017 reported by Raul Flores
Affects: Fuel for OpenStack
Status: Won't Fix
Importance: High
Assigned to: Rodion Tikunov
Milestone: 6.1-updates

Bug Description

There appears to be a race condition between neutron-plugin-openvswitch-agent and ovsdb-server in MOS 6.1 with Ubuntu as the host OS, observed at scale.

[Problem]

When a compute node is rebooted, neutron-plugin-openvswitch-agent needs access to /var/run/openvswitch/db.sock, which is created by ovsdb-server. The socket is sometimes not available until 8-10 seconds after system boot; when that happens, the neutron-plugin-openvswitch-agent service fails to start with the following errors:

=========
2016-02-26 22:35:47.069 6350 TRACE neutron Traceback (most recent call last):
2016-02-26 22:35:47.069 6350 TRACE neutron File "/usr/bin/neutron-openvswitch-agent", line 10, in <module>
2016-02-26 22:35:47.069 6350 TRACE neutron sys.exit(main())
2016-02-26 22:35:47.069 6350 TRACE neutron File "/usr/lib/python2.7/dist-packages/neutron/plugins/openvswitch/agent/ovs_neutron_agent.py", line 1869, in main
2016-02-26 22:35:47.069 6350 TRACE neutron agent = OVSNeutronAgent(**agent_config)
2016-02-26 22:35:47.069 6350 TRACE neutron File "/usr/lib/python2.7/dist-packages/neutron/plugins/openvswitch/agent/ovs_neutron_agent.py", line 333, in __init__
2016-02-26 22:35:47.069 6350 TRACE neutron self.setup_integration_br()
2016-02-26 22:35:47.069 6350 TRACE neutron File "/usr/lib/python2.7/dist-packages/neutron/plugins/openvswitch/agent/ovs_neutron_agent.py", line 872, in setup_integration_br
2016-02-26 22:35:47.069 6350 TRACE neutron self.int_br.set_secure_mode()
2016-02-26 22:35:47.069 6350 TRACE neutron File "/usr/lib/python2.7/dist-packages/neutron/agent/linux/ovs_lib.py", line 136, in set_secure_mode
2016-02-26 22:35:47.069 6350 TRACE neutron check_error=True)
2016-02-26 22:35:47.069 6350 TRACE neutron File "/usr/lib/python2.7/dist-packages/neutron/agent/linux/ovs_lib.py", line 79, in run_vsctl
2016-02-26 22:35:47.069 6350 TRACE neutron ctxt.reraise = False
2016-02-26 22:35:47.069 6350 TRACE neutron File "/usr/lib/python2.7/dist-packages/neutron/openstack/common/excutils.py", line 82, in __exit__
2016-02-26 22:35:47.069 6350 TRACE neutron six.reraise(self.type_, self.value, self.tb)
2016-02-26 22:35:47.069 6350 TRACE neutron File "/usr/lib/python2.7/dist-packages/neutron/agent/linux/ovs_lib.py", line 72, in run_vsctl
2016-02-26 22:35:47.069 6350 TRACE neutron return utils.execute(full_args, root_helper=self.root_helper)
2016-02-26 22:35:47.069 6350 TRACE neutron File "/usr/lib/python2.7/dist-packages/neutron/agent/linux/utils.py", line 84, in execute
2016-02-26 22:35:47.069 6350 TRACE neutron raise RuntimeError(m)
2016-02-26 22:35:47.069 6350 TRACE neutron RuntimeError:
2016-02-26 22:35:47.069 6350 TRACE neutron Command: ['sudo', 'neutron-rootwrap', '/etc/neutron/rootwrap.conf', 'ovs-vsctl', '--timeout=10', '--', 'set-fail-mode', 'br-int', 'secure']
2016-02-26 22:35:47.069 6350 TRACE neutron Exit code: 1
2016-02-26 22:35:47.069 6350 TRACE neutron Stdout: ''
2016-02-26 22:35:47.069 6350 TRACE neutron Stderr: 'ovs-vsctl: unix:/var/run/openvswitch/db.sock: database connection failed (No such file or directory)\n'
=========

The service does not recover on its own and must be started manually to return the compute node to a healthy state.
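For reference, a minimal manual-recovery sketch, assuming an Ubuntu/upstart compute node as described above (the commands and service name should be checked against the actual deployment):

=========
# Confirm ovsdb-server has created its socket and is answering
ls -l /var/run/openvswitch/db.sock
ovs-vsctl --timeout=10 show

# Start the agent by hand and check that it stays running
service neutron-plugin-openvswitch-agent start
service neutron-plugin-openvswitch-agent status
=========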

[Workaround]
As a workaround we modified /etc/init/neutron-plugin-openvswitch-agent.conf by adding a 15-second sleep to the pre-start script. Since adding it we have been able to reboot the compute nodes in this environment without the service failing to start.
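A minimal sketch of that change, assuming the packaged upstart job already contains a pre-start stanza; the original pre-start contents are deployment-specific and shown here only as a placeholder:

=========
# /etc/init/neutron-plugin-openvswitch-agent.conf (excerpt)
pre-start script
    # Workaround: give ovsdb-server time to create
    # /var/run/openvswitch/db.sock before the agent starts.
    sleep 15
    # ... original pre-start contents follow unchanged ...
end script
=========

A fixed delay is coarse; polling for the socket with a bounded timeout would be more precise, but the 15-second sleep is what was validated in this environment.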

Changed in fuel:
milestone: none → 6.1-updates
assignee: nobody → MOS Maintenance (mos-maintenance)
importance: Undecided → Medium
status: New → Confirmed
Alexey Stupnikov (astupnikov) wrote :

Setting bug importance to High, since it could lead to cloud malfunction.

Changed in fuel:
importance: Medium → High
Alexey Stupnikov (astupnikov) wrote :

Raul, is this a customer-found bug, or did you find it in a lab environment?

Raul Flores (raul-flores11) wrote :

This bug was discovered in a customer's production cloud.

tags: added: customer-found
Changed in fuel:
assignee: MOS Maintenance (mos-maintenance) → Rodion Tikunov (rtikunov)
Fuel Devops McRobotson (fuel-devops-robot) wrote : Related fix proposed to mos/mos-docs (stable/6.1)

Related fix proposed to branch: stable/6.1
Change author: Rodion Tikunov <email address hidden>
Review: https://review.fuel-infra.org/22423

Changed in fuel:
status: Confirmed → In Progress
Fuel Devops McRobotson (fuel-devops-robot) wrote : Related fix merged to mos/mos-docs (stable/6.1)

Reviewed: https://review.fuel-infra.org/22423
Submitter: Svetlana Karslioglu <email address hidden>
Branch: stable/6.1

Commit: 98294ce288339d62cae5baf544614f8290b3f684
Author: Rodion Tikunov <email address hidden>
Date: Thu Jun 23 09:15:56 2016

[6.1] Add known issue with WA

There is an issue with nova-compute and neutron-plugin-openvswitch-agent,
which sometimes do not start because of a race condition in the service
start order.

Change-Id: I616a7087d281e1ea08190d76b60904a69bcb4f9d
Related-bug: #1540648
Related-bug: #1552017

Rodion Tikunov (rtikunov) wrote :
Changed in fuel:
status: In Progress → Won't Fix