neutron-ovs-cleanup sometimes failed at reboot (Ubuntu 14.04)

Bug #1485422 reported by Christoph Fiehe
12
This bug affects 2 people
Affects Status Importance Assigned to Milestone
neutron
Invalid
Undecided
Unassigned

Bug Description

I am running OpenStack Kilo on Ubuntu 14.04 (Trusty). I have a separate network node running Neutron. Sometimes after reboot the neutron-plugin-openvswitch-agent is not started, whereas the neutron-dhcp-agent and the neutron-l3-agent get started. The problem is that sometimes (not always) the neutron-ovs-cleanup fails to start with the following exceptions:

2015-08-17 06:05:07.091 1470 ERROR neutron.agent.ovsdb.impl_vsctl [-] Unable to execute ['ovs-vsctl', '--timeout=10', '--oneline', '--format=json', '--', 'list-br'].
2015-08-17 06:05:07.091 1470 TRACE neutron.agent.ovsdb.impl_vsctl Traceback (most recent call last):
2015-08-17 06:05:07.091 1470 TRACE neutron.agent.ovsdb.impl_vsctl File "/usr/lib/python2.7/dist-packages/neutron/agent/ovsdb/impl_vsctl.py", line 63, in run_vsctl
2015-08-17 06:05:07.091 1470 TRACE neutron.agent.ovsdb.impl_vsctl log_fail_as_error=False).rstrip()
2015-08-17 06:05:07.091 1470 TRACE neutron.agent.ovsdb.impl_vsctl File "/usr/lib/python2.7/dist-packages/neutron/agent/linux/utils.py", line 137, in execute
2015-08-17 06:05:07.091 1470 TRACE neutron.agent.ovsdb.impl_vsctl raise RuntimeError(m)
2015-08-17 06:05:07.091 1470 TRACE neutron.agent.ovsdb.impl_vsctl RuntimeError:
2015-08-17 06:05:07.091 1470 TRACE neutron.agent.ovsdb.impl_vsctl Command: ['sudo', '/usr/bin/neutron-rootwrap', '/etc/neutron/rootwrap.conf', 'ovs-vsctl', '--timeout=10', '--oneline', '--format=json', '--', 'list-br']
2015-08-17 06:05:07.091 1470 TRACE neutron.agent.ovsdb.impl_vsctl Exit code: 1
2015-08-17 06:05:07.091 1470 TRACE neutron.agent.ovsdb.impl_vsctl Stdin:
2015-08-17 06:05:07.091 1470 TRACE neutron.agent.ovsdb.impl_vsctl Stdout:
2015-08-17 06:05:07.091 1470 TRACE neutron.agent.ovsdb.impl_vsctl Stderr: ovs-vsctl: unix:/var/run/openvswitch/db.sock: database connection failed (No such file or directory)
2015-08-17 06:05:07.091 1470 TRACE neutron.agent.ovsdb.impl_vsctl
2015-08-17 06:05:07.091 1470 TRACE neutron.agent.ovsdb.impl_vsctl
2015-08-17 06:05:07.098 1470 CRITICAL neutron [-] RuntimeError:
Command: ['sudo', '/usr/bin/neutron-rootwrap', '/etc/neutron/rootwrap.conf', 'ovs-vsctl', '--timeout=10', '--oneline', '--format=json', '--', 'list-br']
Exit code: 1
Stdin:
Stdout:
Stderr: ovs-vsctl: unix:/var/run/openvswitch/db.sock: database connection failed (No such file or directory)

or

2015-08-16 12:08:57.315 1820 TRACE neutron Traceback (most recent call last):
2015-08-16 12:08:57.315 1820 TRACE neutron File "/usr/bin/neutron-ovs-cleanup", line 10, in <module>
2015-08-16 12:08:57.315 1820 TRACE neutron sys.exit(main())
2015-08-16 12:08:57.315 1820 TRACE neutron File "/usr/lib/python2.7/dist-packages/neutron/cmd/ovs_cleanup.py", line 108, in main
2015-08-16 12:08:57.315 1820 TRACE neutron delete_neutron_ports(ports)
2015-08-16 12:08:57.315 1820 TRACE neutron File "/usr/lib/python2.7/dist-packages/neutron/cmd/ovs_cleanup.py", line 72, in delete_neutron_ports
2015-08-16 12:08:57.315 1820 TRACE neutron device.link.delete()
2015-08-16 12:08:57.315 1820 TRACE neutron File "/usr/lib/python2.7/dist-packages/neutron/agent/linux/ip_lib.py", line 293, in delete
2015-08-16 12:08:57.315 1820 TRACE neutron self._as_root([], ('delete', self.name))
2015-08-16 12:08:57.315 1820 TRACE neutron File "/usr/lib/python2.7/dist-packages/neutron/agent/linux/ip_lib.py", line 222, in _as_root
2015-08-16 12:08:57.315 1820 TRACE neutron use_root_namespace=use_root_namespace)
2015-08-16 12:08:57.315 1820 TRACE neutron File "/usr/lib/python2.7/dist-packages/neutron/agent/linux/ip_lib.py", line 69, in _as_root
2015-08-16 12:08:57.315 1820 TRACE neutron log_fail_as_error=self.log_fail_as_error)
2015-08-16 12:08:57.315 1820 TRACE neutron File "/usr/lib/python2.7/dist-packages/neutron/agent/linux/ip_lib.py", line 78, in _execute
2015-08-16 12:08:57.315 1820 TRACE neutron log_fail_as_error=log_fail_as_error)
2015-08-16 12:08:57.315 1820 TRACE neutron File "/usr/lib/python2.7/dist-packages/neutron/agent/linux/utils.py", line 137, in execute
2015-08-16 12:08:57.315 1820 TRACE neutron raise RuntimeError(m)
2015-08-16 12:08:57.315 1820 TRACE neutron RuntimeError:
2015-08-16 12:08:57.315 1820 TRACE neutron Command: ['sudo', '/usr/bin/neutron-rootwrap', '/etc/neutron/rootwrap.conf', 'ip', 'link', 'delete', 'qg-b731d447-d1']
2015-08-16 12:08:57.315 1820 TRACE neutron Exit code: 2
2015-08-16 12:08:57.315 1820 TRACE neutron Stdin:
2015-08-16 12:08:57.315 1820 TRACE neutron Stdout:
2015-08-16 12:08:57.315 1820 TRACE neutron Stderr: RTNETLINK answers: Operation not supported

From my point of view there is a race condition at startup. I have added a "sleep 5" command to my upstart-script neutron-ovs-cleanup.conf and the problem is gone. I don't think this a proper solution.

Has anybody an idea?

Revision history for this message
Kevin Benton (kevinbenton) wrote :

It sounds like this may be an issue with the service definition for the agent. If it is at the same priority of the OVS service, this may happen.

Revision history for this message
Kevin Benton (kevinbenton) wrote :

Can you check to see if the services are configured to start at the same time?

Changed in neutron:
status: New → Incomplete
Revision history for this message
Christoph Fiehe (fiehe) wrote :

Actually neutron-ovs-cleanup should be started after the startup of the OVS service:

neutron-ovs-cleanup.conf
##############################################################################
start on started openvswitch-switch
stop on runlevel [!2345]

pre-start script
  # sleep 5
  [ ! -x /usr/bin/neutron-ovs-cleanup ] && exit 0
  start-stop-daemon --start --chuid neutron --exec /usr/bin/neutron-ovs-cleanup -- \
    --log-file /var/log/neutron/ovs-cleanup.log \
    --config-file /etc/neutron/neutron.conf --verbose
end script

I don't know why the neutron-l3-agent and the neutron-dhcp-agent upstart scripts use a different pattern and why they get started even when neutron-ovs-cleanup has failed.

neutron-l3-agent.conf:
##############################################################################
start on runlevel [2345]
stop on runlevel [!2345]

respawn

chdir /var/run

pre-start script
  mkdir -p /var/run/neutron
  chown neutron:root /var/run/neutron
  # Check to see if openvswitch plugin in use by checking
  # status of cleanup upstart configuration
  if status neutron-ovs-cleanup; then
    start wait-for-state WAIT_FOR=neutron-ovs-cleanup WAIT_STATE=running WAITER=neutron-l3-agent
  fi
end script

exec start-stop-daemon --start --chuid neutron --exec /usr/bin/neutron-l3-agent -- \
  --config-file=/etc/neutron/neutron.conf --config-file=/etc/neutron/l3_agent.ini \
  --config-file=/etc/neutron/fwaas_driver.ini --log-file=/var/log/neutron/l3-agent.log

neutron-dhcp-agent.conf:
##############################################################################
description "Neutron DHCP Agent"
author "Chuck Short <email address hidden>"

start on runlevel [2345]
stop on runlevel [!2345]

respawn

chdir /var/run

pre-start script
  mkdir -p /var/run/neutron
  chown neutron:root /var/run/neutron
  # Check to see if openvswitch plugin in use by checking
  # status of cleanup upstart configuration
  if status neutron-ovs-cleanup; then
    start wait-for-state WAIT_FOR=neutron-ovs-cleanup WAIT_STATE=running WAITER=neutron-dhcp-agent
  fi
end script

exec start-stop-daemon --start --chuid neutron --exec /usr/bin/neutron-dhcp-agent -- --config-file=/etc/neutron/neutron.conf --config-file=/etc/neutron/dhcp_agent.ini --log-file=/var/log/neutron/dhcp-agent.log

Revision history for this message
Miguel Angel Ajo (mangelajo) wrote :

This bug is incorrect, closing as Invalid.

Changed in neutron:
status: Incomplete → Invalid
status: Invalid → Won't Fix
status: Won't Fix → Invalid
Revision history for this message
Alexander Bozhenko (alexbozhenko) wrote :

@Migel, why you marked bug as invalid without any comments? Please reopen.

I faced exactly the same issue, and despite the line
"start on started openvswitch-switch"
in neutron-ovs-cleanup.conf it looks like /var/run/openvswitch/db.sock sometimes created 3 seconds later, that neutron-ovs-cleanup job trying ti access it.

Revision history for this message
Alexander Bozhenko (alexbozhenko) wrote :

May it be possible that openvswitch-switch service reports that it started before /var/run/openvswitch/db.sock is created.

See 1 sec difference:

root@node-1:/var/log/openvswitch# cat ovs-ctl.log
Fri Oct 16 20:15:00 UTC 2015:load-kmod
Fri Oct 16 20:15:00 UTC 2015:start --system-id=random
 * Starting ovsdb-server
 * Configuring Open vSwitch system IDs
 * Starting ovs-vswitchd
 * Enabling remote OVSDB managers

root@node-1:/var/log/openvswitch# stat /var/run/openvswitch/db.sock
  File: '/var/run/openvswitch/db.sock'
  Size: 0 Blocks: 0 IO Block: 4096 socket
Device: 10h/16d Inode: 1480509 Links: 1
Access: (0700/srwx------) Uid: ( 0/ root) Gid: ( 0/ root)
Access: 2015-10-16 20:15:01.107794496 +0000
Modify: 2015-10-16 20:15:01.095794320 +0000
Change: 2015-10-16 20:15:01.095794320 +0000
 Birth: -

Revision history for this message
Alexander Bozhenko (alexbozhenko) wrote :

Sorry, *Miguel.

Looks like we are trying to point neutron-ovs-cleanup to start after starting SystemV-init script /etc/init.d/openvswitch-switch, and it shouldn't work, according to my understanding.
root@node-5:/etc/init# grep 'start on' /etc/init/neutron-ovs-cleanup.conf
start on started openvswitch-switch

root@node-5:/etc/init# initctl check-config neutron-ovs-cleanup
neutron-ovs-cleanup
  start on: unknown job openvswitch-switch

Should we change it to
start on started openvswitch-switch-running

and add this line to the very end of start) section of /etc/init.d/openvswitch-switch, right before closing curly brace:
    initctl emit --no-wait openvswitch-switch-running

After this change it started to work for me.

See:
http://upstart.ubuntu.com/cookbook/#creating-a-systemv-service-that-communicates-with-upstart-ubuntu-specific

Revision history for this message
Alexander Bozhenko (alexbozhenko) wrote :
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.