Openvswitch/OVS Agent/DHCP Agent startup timing

Bug #1274671 reported by Ken Schroeder
This bug affects 3 people
Affects: Cisco Openstack
Status: Expired
Importance: Medium
Assigned to: Unassigned
Milestone: none

Bug Description

We've come across a startup timing issue with neutron-openvswitch-agent and the dhcp-agent on Havana with OVS underneath. Using COI.H1 to deploy our environment, the default provider for service startup ordering appears to fall back to System V rc-scripts: all of the neutron and openvswitch services get startup scripts created as /etc/rc2.d/S20*. With a manual install of Havana the rc2.d scripts don't exist and startup is handled solely by upstart. What we're seeing is that, most of the time, the dhcp-agent and openvswitch agent come up before openvswitch itself and are not able to attach to the bridge properly on startup, so we have problems with flows getting programmed and with instance connectivity working properly. We've captured the dhcp-agent error below from dhcp.log. The problem goes away if we remove the rc2.d/S20* startup scripts; likewise, if we restart neutron-openvswitch-agent and neutron-dhcp-agent after hitting the problem, connectivity is established.

2014-01-28 18:02:15.456 13310 ERROR neutron.common.legacy [-] Skipping unknown group key: firewall_driver
2014-01-28 18:02:16.447 13310 ERROR neutron.agent.dhcp_agent [-] Unable to enable dhcp.
2014-01-28 18:02:16.447 13310 TRACE neutron.agent.dhcp_agent Traceback (most recent call last):
2014-01-28 18:02:16.447 13310 TRACE neutron.agent.dhcp_agent File "/usr/lib/python2.7/dist-packages/neutron/agent/dhcp_agent.py", line 126, in call_driver
2014-01-28 18:02:16.447 13310 TRACE neutron.agent.dhcp_agent getattr(driver, action)(**action_kwargs)
2014-01-28 18:02:16.447 13310 TRACE neutron.agent.dhcp_agent File "/usr/lib/python2.7/dist-packages/neutron/agent/linux/dhcp.py", line 167, in enable
2014-01-28 18:02:16.447 13310 TRACE neutron.agent.dhcp_agent reuse_existing=True)
2014-01-28 18:02:16.447 13310 TRACE neutron.agent.dhcp_agent File "/usr/lib/python2.7/dist-packages/neutron/agent/linux/dhcp.py", line 702, in setup
2014-01-28 18:02:16.447 13310 TRACE neutron.agent.dhcp_agent namespace=network.namespace)
2014-01-28 18:02:16.447 13310 TRACE neutron.agent.dhcp_agent File "/usr/lib/python2.7/dist-packages/neutron/agent/linux/interface.py", line 161, in plug
2014-01-28 18:02:16.447 13310 TRACE neutron.agent.dhcp_agent self.check_bridge_exists(bridge)
2014-01-28 18:02:16.447 13310 TRACE neutron.agent.dhcp_agent File "/usr/lib/python2.7/dist-packages/neutron/agent/linux/interface.py", line 102, in check_bridge_exists
2014-01-28 18:02:16.447 13310 TRACE neutron.agent.dhcp_agent raise exceptions.BridgeDoesNotExist(bridge=bridge)
2014-01-28 18:02:16.447 13310 TRACE neutron.agent.dhcp_agent BridgeDoesNotExist: Bridge br-int does not exist.
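
For context, the "remove the rc2.d/S20* startup scripts" workaround amounts to deleting the System V links so that only the upstart jobs manage these services. On Ubuntu that might look roughly like the following sketch (the exact service names depend on the packaging and are assumptions here, not taken from the report):

    # hypothetical -- verify the installed service names before running
    update-rc.d -f neutron-plugin-openvswitch-agent remove
    update-rc.d -f neutron-dhcp-agent remove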

Changed in openstack-cisco:
status: New → Triaged
milestone: none → h.2
importance: Undecided → Medium
Revision history for this message
Mark T. Voelker (mvoelker) wrote :

Could you confirm:
  * which scenario you're using for this deployment
  * whether you're using Cisco or UCA packaging

I wasn't able to replicate this in a deployment this afternoon, but it smells like a race condition of sorts per your analysis above. The manifests we're using here don't set the provider for the service ensures specifically, which means it defaults to whatever the system default is (Puppet's docs are confusing on this point and imply the default is two different things, so I'll have to do a bit more digging). If that's indeed the case, then this might be resolvable by either disabling the ensure completely or changing the provider to upstart for these packages. Neither is super clean, so we may want to confirm that's what the problem is and/or investigate other options as well.
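
If forcing the upstart provider turns out to be the right fix, a minimal sketch of the corresponding Puppet resource might look like this (the service names are assumptions based on the Ubuntu/Havana packaging, not confirmed from the manifests in question):

    # hypothetical manifest snippet -- service names assumed
    service { ['neutron-plugin-openvswitch-agent', 'neutron-dhcp-agent']:
      ensure   => running,
      enable   => true,
      provider => 'upstart',
    }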

Revision history for this message
Ken Schroeder (kschroed) wrote :

We're using Cisco packaging. The default provider for openvswitch and the neutron agents via the puppet modules seems to be System V, with everything at S20*. Changing to upstart has made little improvement in the problem. We have been working through some additional test cases, including L3 agent configurations, and it seems plausible this may be a side effect of a misconfiguration that caused a routing loop. We should leave this open for a week or so to run through some additional validation and check the behavior of a parallel new environment.

Revision history for this message
Louis Watta (lwatta) wrote :

I will note that we also tried using "emit" and pre-start script options inside the upstart jobs to force the DHCP agent and the plugin agent to wait until openvswitch had started.

This did not work. It appears that openvswitch spawns several processes at once: upstart attempts to follow the first process that is launched, but that process immediately spawns another job and then returns. Our only other option seems to be putting sleep commands in the pre-start section of the dhcp agent and the plugin agent.

We have not seen the problem since fixing the L3 agent loop issue, but there should still be a way to force the other agents to wait until openvswitch is truly up before starting.
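
One alternative to a fixed sleep would be a pre-start script that polls for the integration bridge before letting the agent start. A hypothetical upstart override along these lines (the file path, bridge name, and 60-second timeout are assumptions, and this sketch has not been tested against the deployment above):

    # /etc/init/neutron-dhcp-agent.override -- hypothetical, untested
    # poll for the integration bridge instead of sleeping a fixed time
    pre-start script
        # wait up to 60 seconds for openvswitch to create br-int
        for i in $(seq 1 60); do
            ovs-vsctl br-exists br-int && exit 0
            sleep 1
        done
        echo "br-int still missing after 60s, starting anyway" >&2
    end script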

Revision history for this message
Louis Watta (lwatta) wrote :

On startup we seem to be ok now. We believe the problem on startup was related to an improper OVS configuration. We've fixed the config and restarts work ok.

We do, however, still have a problem: any time openvswitch is restarted, the flows in OVS disappear and do not come back until we restart the neutron plugin agent.

We either need OVS to trigger a restart of neutron-plugin when it restarts, or the plugin agent needs to periodically check OVS.
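
Pending a proper fix, a crude watchdog run from cron could approximate the "plugin periodically checks OVS" option. A hypothetical sketch (the bridge name, the zero-flow check, and the service name are all assumptions):

    #!/bin/sh
    # hypothetical watchdog -- restart the OVS agent if br-int has lost all its flows
    BRIDGE=br-int
    FLOWS=$(ovs-ofctl dump-flows "$BRIDGE" 2>/dev/null | grep -c "cookie=")
    if [ "$FLOWS" -eq 0 ]; then
        logger -t ovs-agent-watchdog "no flows on $BRIDGE, restarting the OVS agent"
        service neutron-plugin-openvswitch-agent restart
    fi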

Revision history for this message
Chris Ricker (chris-ricker) wrote :

When you say restart, are you restarting via puppet or via the init scripts?

Changed in openstack-cisco:
status: Triaged → Incomplete
Revision history for this message
Louis Watta (lwatta) wrote :

Both puppet and init scripts cause the problem

Revision history for this message
Chris Ricker (chris-ricker) wrote :

Can you attach logs from a puppet run with the issue?

Changed in openstack-cisco:
milestone: h.2 → i.0
Changed in openstack-cisco:
milestone: i.0 → i.1
Changed in openstack-cisco:
milestone: i.1 → none
Revision history for this message
Launchpad Janitor (janitor) wrote :

[Expired for Cisco Openstack because there has been no activity for 60 days.]

Changed in openstack-cisco:
status: Incomplete → Expired