OVS agent will leave compute host in an unsafe state when rpc_setup() fails

Bug #1217980 reported by Stephen Gran
This bug affects 2 people
Affects           Status        Importance  Assigned to   Milestone
neutron           Fix Released  Medium      Stephen Gran
quantum (Ubuntu)  Fix Released  Undecided   Unassigned

Bug Description

Recently we saw the quantum OVS agent (not yet neutron in our install, although this part of the code hasn't changed) fail on startup on compute hosts: the rabbit_host parameter held an unresolvable hostname, and the agent exited during setup_rpc(). Unfortunately, the agent reinitializes the OVS flows on startup before it sets up RPC, so when it exited before making any RPC calls, it left the compute host in a state where it would not pass traffic to instances.

My first inclination is to submit a patch moving RPC initialization higher up in __init__, so that the agent fails fast, before it has made any changes to the host system. However, I don't know whether this will have knock-on effects or be unworkable for some reason I can't see.
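The ordering proposed above can be illustrated with a minimal sketch. This is not the agent's actual code: `SketchAgent`, `RpcSetupError`, `connect_rpc`, `reset_flows`, and `demo` are hypothetical names used only to show why setting up RPC first leaves the host untouched when RPC setup fails.

```python
class RpcSetupError(Exception):
    """Hypothetical error raised when the RPC connection cannot be set up."""


class SketchAgent:
    """Hypothetical stand-in for the agent; not the real neutron class."""

    def __init__(self, connect_rpc, reset_flows):
        # Fail fast: set up RPC before making any change to the host.
        self.rpc = connect_rpc()
        # Only reached when RPC setup succeeded.
        reset_flows()


def demo(rpc_ok):
    """Record which steps run, depending on whether RPC setup succeeds."""
    events = []

    def connect_rpc():
        if not rpc_ok:
            raise RpcSetupError("cannot resolve rabbit_host")
        events.append("rpc connected")
        return object()

    def reset_flows():
        events.append("flows reset")

    try:
        SketchAgent(connect_rpc, reset_flows)
    except RpcSetupError:
        events.append("agent exited")
    return events
```

With this ordering, `demo(False)` never records a flow reset, which is the property the proposal is after. Whether the real agent has other dependencies between these steps is exactly the open question raised above.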

Tags: ovs
Revision history for this message
ZhiQiang Fan (aji-zqfan) wrote : Re: [Bug 1217980] [NEW] OVS agent will leave compute host in an unsafe state when rpc_setup() fails

If this problem exists in the master branch too, you can fix it in master first,
then consider fixing grizzly as well.

Read these links to learn how to contribute:
- https://wiki.openstack.org/wiki/How_To_Contribute
- https://wiki.openstack.org/wiki/Gerrit_Workflow

Revision history for this message
Stephen Gran (sgran) wrote :

Yes, it exists in master, and yes, I know how to post a patch.

I'm asking if you can see any side effects of the proposal.

Thank you,

Revision history for this message
Stephen Gran (sgran) wrote :

presumably 'start on started rc' is too coarse for this? I don't actually know.

It seems a bit of a regression to have to give up dependencies between init scripts for upstart. Perhaps some coordination with the openvswitch maintainers might be in order? I can't believe this is that hard to fix.

Revision history for this message
Stephen Gran (sgran) wrote :

arg, sorry - that was for a different bug.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.openstack.org/44229

Changed in neutron:
assignee: nobody → Stephen Gran (sgran)
status: New → In Progress
Revision history for this message
Sean McCully (sean-mccully) wrote :

I think this is a good start. It just makes me wonder: are the changes made to your system on neutron server start left as is on neutron server stop (i.e. killing the process)?

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.openstack.org/44229
Committed: http://github.com/openstack/neutron/commit/d72848c9afba40d21235a4e95cf8e69549290dca
Submitter: Jenkins
Branch: master

commit d72848c9afba40d21235a4e95cf8e69549290dca
Author: Stephen Gran <email address hidden>
Date: Thu Aug 29 07:11:44 2013 +0100

    Create RPC connection before modifying OVS bridges

    On startup, the agent removes and readds flows to the OVS bridges. If
    an RPC setup error exits the process prematurely, this can leave the
    bridges in an unsafe state. It is better to set the RPC communication
    up before making changes to the host system.

    Closes-Bug: 1217980
    Change-Id: Ib9bbb864b9129bb7b1376a150a37a0c07908d74b
    Signed-off-by: Stephen Gran <email address hidden>

Changed in neutron:
status: In Progress → Fix Committed
Revision history for this message
Stephen Gran (sgran) wrote :

We use puppet for configuration management. A typo was accidentally introduced into the config file, so that the rabbit_host entry was no longer a resolvable host. Puppet replaced the working config file with the bad one, and then restarted the service.
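A failure of this kind could in principle be caught before the service is restarted. The following is only a sketch under assumptions: `host_resolves` is a hypothetical helper, not part of neutron or puppet, and it only checks that a hostname resolves, not that a broker is reachable there.

```python
import socket


def host_resolves(hostname, resolver=socket.getaddrinfo):
    """Return True if `hostname` resolves to at least one address.

    `resolver` is injectable so the check can be exercised without DNS.
    """
    try:
        return bool(resolver(hostname, None))
    except (socket.gaierror, UnicodeError):
        # gaierror: name resolution failed; UnicodeError: malformed hostname.
        return False
```

A deployment step could call `host_resolves()` on the configured rabbit_host value and refuse to restart the agent when it returns False.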

Stopping the openvswitch agent has always been fine for us. I've left the service down for periods of a few hours while, e.g., upgrading controllers, and kept connectivity to the instances.

Revision history for this message
Stephen Gran (sgran) wrote :

I would quite like to see this backported to grizzly. How do I go about doing that, or encouraging that?

Changed in quantum (Ubuntu):
status: New → Fix Committed
Changed in neutron:
importance: Undecided → Medium
milestone: none → havana-3
tags: added: ovs
Thierry Carrez (ttx)
Changed in neutron:
status: Fix Committed → Fix Released
Thierry Carrez (ttx)
Changed in neutron:
milestone: havana-3 → 2013.2
Chuck Short (zulcss)
Changed in quantum (Ubuntu):
status: Fix Committed → Fix Released