OpenVSwitch with LACP sometimes stops accepting ARPs

Bug #1272842 reported by Matthew Mosesohn
This bug affects 1 person
Affects: Fuel for OpenStack
Status: Invalid
Importance: High
Assigned to: Andrey Danin

Bug Description

OpenVSwitch with LACP sometimes stops accepting ARPs.
Fuel version 4.0 with some customizations, Ubuntu cluster.
This was reproduced only on a particular customer deployment with dual 10G NICs. The behavior is as follows:
1 - Bond is up, operational, and behaving normally
2 - Several hours (or even days) pass
3 - Connectivity on all bridges attached to this bond fails. ARP requests go out, but the replies are filtered by OVS.
Example error in log:
2014-01-21T23:00:39Z|04956|ofproto_dpif|WARN|in_port(3),eth(src=2e:f3:36:37:29:40,dst=ff:ff:ff:ff:ff:ff),eth_type(0x0806),arp(sip=10.119.238.65,tip=10.119.238.34,op=1,sha=2e:f3:36:37:29:40,tha=00:00:00:00:00:00): inconsistency in subfacet (actions were: push_vlan(vid=30,pcp=0),1,13,16,5) (correct actions: push_vlan(vid=30,pcp=0),1,12,16,5)
4 - The issue persists indefinitely
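
To make warnings like the one above easier to compare across occurrences, here is a minimal Python sketch that pulls out the two action lists and reports which actions differ. It assumes the message always has the "actions were" / "correct actions" layout shown in this report. The function names are made up for illustration. The bare numbers appear to be datapath output ports, but that is an interpretation, not something this report confirms.

```python
import re

# The warning quoted in the bug description (message part only).
SAMPLE = (
    "inconsistency in subfacet "
    "(actions were: push_vlan(vid=30,pcp=0),1,13,16,5) "
    "(correct actions: push_vlan(vid=30,pcp=0),1,12,16,5)"
)

# Assumption: both action lists always appear in this order at the end of the line.
SUBFACET_RE = re.compile(
    r"inconsistency in subfacet "
    r"\(actions were: (?P<actual>.*?)\) "
    r"\(correct actions: (?P<correct>.*?)\)\s*$"
)


def split_actions(actions):
    """Split on commas that are not nested inside parentheses."""
    parts, current, depth = [], [], 0
    for ch in actions:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        if ch == "," and depth == 0:
            parts.append("".join(current))
            current = []
        else:
            current.append(ch)
    if current:
        parts.append("".join(current))
    return parts


def diff_subfacet_actions(line):
    """Return (stale, missing) actions, or None if the line does not match."""
    match = SUBFACET_RE.search(line)
    if match is None:
        return None
    actual = split_actions(match.group("actual"))
    correct = split_actions(match.group("correct"))
    stale = [a for a in actual if a not in correct]
    missing = [c for c in correct if c not in actual]
    return stale, missing


if __name__ == "__main__":
    print(diff_subfacet_actions(SAMPLE))
```

For the warning in this report, the only difference is one number in the action list: 13 in the installed actions where 12 is expected.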

Workaround: ifconfig (phys interface) down, service openvswitch restart, then ifconfig (phys interface) up
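
The workaround order can be scripted. Below is a rough Python sketch, not a tested tool. The interface names (eth2, eth3) are placeholders, and the service name is an assumption: the report says "openvswitch", while Ubuntu packages normally name the init script "openvswitch-switch". It only prints the commands unless dry_run is turned off.

```python
import subprocess


def workaround_commands(ifaces, service="openvswitch-switch"):
    """Build the command sequence: interfaces down, OVS restart, interfaces up.

    `service` is an assumption; adjust it to whatever the node actually uses.
    """
    commands = [["ifconfig", iface, "down"] for iface in ifaces]
    commands.append(["service", service, "restart"])
    commands.extend(["ifconfig", iface, "up"] for iface in ifaces)
    return commands


def apply_workaround(ifaces, service="openvswitch-switch", dry_run=True):
    """Print the commands, or run them (as root) when dry_run is False."""
    for cmd in workaround_commands(ifaces, service):
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Placeholder interface names for the two bonded 10G NICs.
    apply_workaround(["eth2", "eth3"])
```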

Revision history for this message
Matthew Mosesohn (raytrac3r) wrote :

One proposed workaround is to try deploying a newer version of OpenVSwitch to see if it fixes the issue.

Dmitry Pyzhov (dpyzhov)
Changed in fuel:
importance: Undecided → Medium
status: New → Confirmed
Revision history for this message
Andrey Danin (gcon-monolake) wrote :

Another workaround is to use Linux kernel bonding instead of OVS.

Changed in fuel:
status: Confirmed → Triaged
Revision history for this message
Dmitry Borodaenko (angdraug) wrote :

If Linux kernel bonding is chosen as a workaround for this problem, please make sure the solution implementation doesn't prevent this bug from being fixed:
https://bugs.launchpad.net/fuel/+bug/1271274

Revision history for this message
Matthew Mosesohn (raytrac3r) wrote :

Dmitry B, we didn't find anything conclusive about Linux vs OVS bonds. I agree that expanding the bond limitations in 1271274 should be done regardless of the outcome of this bug.

Revision history for this message
Vladimir Kuklin (vkuklin) wrote :

Do we have any hardware lab where we can reproduce this issue?

Changed in fuel:
status: Triaged → Confirmed
Revision history for this message
Dmitry Borodaenko (angdraug) wrote :

No, we can't replicate the same network configuration in our internal lab.

tags: added: customer-found
Changed in fuel:
milestone: 4.1 → 5.0
status: Confirmed → Incomplete
Revision history for this message
Andrey Danin (gcon-monolake) wrote :

We currently have two versions of OVS:
UBUNTU 1.10.1 (fetched from the git repo on 23/08/2013)
CENTOS 1.10.2 (I don't know where we get it from)
But the official site http://openvswitch.org/download/ lists an LTS version, 1.9.3, and two official releases of 1.10.x: 1.10.0 and 1.10.2.
Perhaps we are using the wrong OVS version for Ubuntu.

Changed in fuel:
milestone: 5.0 → 4.1
importance: Medium → High
status: Incomplete → Confirmed
assignee: Fuel Library Team (fuel-library) → Andrey Danin (gcon-monolake)
Revision history for this message
Vladimir Kuklin (vkuklin) wrote : Re: OpenVSwitch with bonding sometimes stops accepting ARPs

We have the 1.9.0 kernel module and the 1.10 userspace on Ubuntu.
In any case, we should find a way to build 1.9.3 for both distributions.

Changed in fuel:
importance: High → Critical
summary: - OpenVSwitch with LACP bonding sometimes stops accepting ARPs
+ OpenVSwitch with bonding sometimes stops accepting ARPs
description: updated
description: updated
summary: - OpenVSwitch with bonding sometimes stops accepting ARPs
+ OpenVSwitch with LACP sometimes stops accepting ARPs
description: updated
Revision history for this message
Andrey Danin (gcon-monolake) wrote :

We are back where we started. We still cannot reproduce the bug in our lab. The next time it occurs, we should dump the OpenFlow rules and try to renegotiate LACP.
The next version of Fuel will include new versions of OVS.
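
For the next occurrence, state could be captured with something like the Python sketch below before any restart. The bridge and bond names (br0, bond0) are placeholders, not the names used in the affected deployment. The commands are standard OVS tools (ovs-vsctl show, ovs-ofctl dump-flows, ovs-appctl bond/show and lacp/show). The sketch does not attempt to renegotiate LACP.

```python
import subprocess


def diagnostic_commands(bridge, bond):
    """Commands that capture OVS configuration, OpenFlow rules, bond and LACP state."""
    return [
        ["ovs-vsctl", "show"],
        ["ovs-ofctl", "dump-flows", bridge],
        ["ovs-appctl", "bond/show", bond],
        ["ovs-appctl", "lacp/show", bond],
    ]


def collect_diagnostics(bridge, bond):
    """Run each command and return {command line: combined output}."""
    results = {}
    for cmd in diagnostic_commands(bridge, bond):
        proc = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            universal_newlines=True,
        )
        results[" ".join(cmd)] = proc.stdout
    return results


if __name__ == "__main__":
    # Placeholder names; substitute the real bridge and bond port.
    for name, output in collect_diagnostics("br0", "bond0").items():
        print("### " + name)
        print(output)
```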

Changed in fuel:
importance: Critical → High
status: Confirmed → Incomplete
milestone: 4.1 → 5.0
Revision history for this message
Nastya Urlapova (aurlapova) wrote :

Are there any updates on this problem, or is it still incomplete?

Revision history for this message
Mike Scherbakov (mihgen) wrote :

There was no activity on this bug for a while. If you reproduce it again, please reopen.

Changed in fuel:
status: Incomplete → Invalid
Revision history for this message
Vasilios Tzanoudakis (vtzanoudakis) wrote :

Dear Team,

I am using Fuel 5.1.1
Ubuntu HA Cluster with GRE
Ceph for all

Switch:
Dell S4810P with lacp for all nodes (separate port-channels)

Controllers:
LACP 2 x 10Gbit interfaces per node (3 controllers)

I see a lot of log entries like this: http://paste.openstack.org/show/198488/

I am investigating connectivity issues between nodes, like the ones you describe, that occur every X days.

ovs-appctl lacp/show
http://paste.openstack.org/show/198491/

packages installed:
-------------------------------
neutron-plugin-openvswitch 1:2014.1.3-fuel5.1.2~mira4
neutron-plugin-openvswitch-agent 1:2014.1.3-fuel5.1.2~mira4
openvswitch-common 1.10.1+git20130823-0ubuntu3~cloud0
openvswitch-datapath-lts-saucy-dkms 1.10.2-0ubuntu2~ubuntu12.04.1
openvswitch-switch 1.10.1+git20130823-0ubuntu3~cloud0
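
An earlier comment raised mismatched kernel and userspace versions, so a listing like the one above can be checked mechanically. The Python sketch below extracts the upstream version from each Debian version string and compares the userspace package with the datapath package. The helper names are made up for illustration. A mismatch is not confirmed anywhere in this report as the cause of the bug.

```python
import re

PACKAGES = """\
openvswitch-common 1.10.1+git20130823-0ubuntu3~cloud0
openvswitch-datapath-lts-saucy-dkms 1.10.2-0ubuntu2~ubuntu12.04.1
openvswitch-switch 1.10.1+git20130823-0ubuntu3~cloud0
"""

# Optional epoch ("1:"), then the leading dotted-number part of the version.
UPSTREAM_RE = re.compile(r"^(?:\d+:)?(\d+(?:\.\d+)*)")


def upstream_version(deb_version):
    """Return the leading numeric version as a tuple, e.g. (1, 10, 1)."""
    match = UPSTREAM_RE.match(deb_version)
    if match is None:
        raise ValueError("unrecognised version: " + deb_version)
    return tuple(int(part) for part in match.group(1).split("."))


def ovs_versions(listing):
    """Map each openvswitch-* package in a 'name version' listing to its version tuple."""
    versions = {}
    for line in listing.splitlines():
        fields = line.split()
        if len(fields) == 2 and fields[0].startswith("openvswitch-"):
            versions[fields[0]] = upstream_version(fields[1])
    return versions


def datapath_mismatches(versions):
    """Return datapath package versions that differ from openvswitch-switch."""
    userspace = versions.get("openvswitch-switch")
    return [
        version
        for name, version in versions.items()
        if name.startswith("openvswitch-datapath-") and version != userspace
    ]


if __name__ == "__main__":
    print(datapath_mismatches(ovs_versions(PACKAGES)))
```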

thank you

Revision history for this message
Vasilios Tzanoudakis (vtzanoudakis) wrote :

I also saw some disengaging info on the bonding interface:

http://paste.openstack.org/show/198496/

Revision history for this message
Vasilios Tzanoudakis (vtzanoudakis) wrote :

I think I have hit this LACP bug twice this month, and it brought the cloud down (because Galera lost sync).

Does anyone know whether openvswitch-switch 1.10.1+git20130823-0ubuntu3~cloud0 is affected?

Thank you
