Missing secure fail mode on physical bridges

Bug #1607787 reported by Hynek Mlnarik
This bug affects 2 people
Affects   Status         Importance   Assigned to     Milestone
neutron   Fix Released   Critical     Hynek Mlnarik

Bug Description

Restarting the current OVS neutron agent under heavy load with Ryu (ofctl=native) intermittently disrupts traffic, as shown by occasional failures in a not-yet-committed fullstack test [1]. More specifically, the cause seems to be a slow restart of the Ryu controller combined with OVS vswitchd timeouts. The traffic disruption always occurs after the following log entry is recorded in ovs-vswitchd.log:

  fail_open|WARN|Could not connect to controller (or switch failed controller's post-connection admission control policy) for 15 seconds, failing open
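
To check whether a node has hit this condition, the log can be scanned for the warning above. The following is a minimal sketch, not part of the report: the function name and regular expression are illustrative, and the pattern matches only the part of the message quoted here, not the timestamp prefix of a real log line.

```python
import re

# Matches the fail_open warning quoted in the report. The timestamp and
# sequence-number prefix of a real ovs-vswitchd.log line is ignored.
FAIL_OPEN_RE = re.compile(
    r"fail_open\|WARN\|Could not connect to controller .*"
    r" for (\d+) seconds, failing open"
)


def find_fail_open_events(log_lines):
    """Return (line_number, seconds) for each fail-open warning found."""
    events = []
    for number, line in enumerate(log_lines, start=1):
        match = FAIL_OPEN_RE.search(line)
        if match:
            events.append((number, int(match.group(1))))
    return events
```

Called with the lines of ovs-vswitchd.log (for example `find_fail_open_events(open(path))`, where `path` is wherever the log lives on the host), it returns one entry per fail-open event.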

This issue manifests regardless of network type (VLAN, flat) and vsctl interface (cli, native). It has not occurred with ofctl=cli, though.

The issue occurs for physical bridges because they are in the default fail mode (standalone), meaning that once the controller connection is lost, OVS takes over management of the flows and clears them. This conclusion is based on flows dumped just before the traffic was blocked and again after the event, when the flows had been cleared. Before the event:

  ====== br-eth23164fdb4 =======
  Fri Jul 29 10:44:04 UTC 2016
  OFPST_FLOW reply (OF1.3) (xid=0x2):
   cookie=0x921fc02d0b4f49e1, duration=16.137s, table=0, n_packets=16, n_bytes=1400, priority=4,in_port=2,dl_vlan=1 actions=set_field:5330->vlan_vid,NORMAL
   cookie=0x921fc02d0b4f49e1, duration=25.647s, table=0, n_packets=6, n_bytes=508, priority=2,in_port=2 actions=drop
   cookie=0x921fc02d0b4f49e1, duration=26.250s, table=0, n_packets=16, n_bytes=1400, priority=0 actions=NORMAL

After the disruption:

  ====== br-eth23164fdb4 =======
  Fri Jul 29 10:44:05 UTC 2016
  OFPST_FLOW reply (OF1.3) (xid=0x2):

The same bug appears under this condition (courtesy of Jakub Libosvar): set up a physical bridge, use iptables to block traffic to the respective bridge controller until the OVS timeout occurs, and then check the OpenFlow rules of the affected bridge.
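
The final step, checking the OpenFlow rules, can be automated by comparing two flow dumps. This is a rough sketch under the assumption that the dumps look like the ones quoted above (flow entries starting with `cookie=`); the helper names are hypothetical and not taken from neutron.

```python
def parse_flows(dump_output):
    """Return the flow entries from `ovs-ofctl dump-flows` style output."""
    flows = []
    for line in dump_output.splitlines():
        stripped = line.strip()
        # Flow entries in the dumps above start with "cookie="; the bridge
        # banner, the timestamp and the "OFPST_FLOW reply" header are skipped.
        if stripped.startswith("cookie="):
            flows.append(stripped)
    return flows


def flows_were_cleared(before, after):
    """True if the bridge had flows in `before` and has none in `after`."""
    return bool(parse_flows(before)) and not parse_flows(after)
```

Fed the "before" and "after" dumps from this report, `flows_were_cleared` returns True, which is the symptom described here.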

[1] https://review.openstack.org/#/c/334926/

Revision history for this message
Jakub Libosvar (libosvar) wrote :

I'd add that this happens not only on restarts: a machine under heavy load can miss probes from ovsdb, which leads to a controller disconnect. This also drops the flows, so a compute node with a VLAN or flat network type is left with a broken data plane until the OVS agent is restarted.

Changed in neutron:
status: New → Confirmed
importance: Undecided → High
Revision history for this message
Assaf Muller (amuller) wrote :

I would argue this bug is 'critical'. OVS agent restart without data plane downtime has been a feature since Liberty and is crucial for upgrades. As Jakub points out, this bug is relevant not only to restarts but also to high load.

Changed in neutron:
importance: High → Critical
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.openstack.org/348889

Changed in neutron:
assignee: nobody → Hynek Mlnarik (hmlnarik-s)
status: Confirmed → In Progress
tags: added: ovs
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/353373

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.openstack.org/348889
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=9429c2da01fa29cedcb2a65a26c1c29d0a713670
Submitter: Jenkins
Branch: master

commit 9429c2da01fa29cedcb2a65a26c1c29d0a713670
Author: Hynek Mlnarik <email address hidden>
Date: Wed Aug 10 10:05:57 2016 +0200

    Set secure fail mode for physical bridges

    Physical bridges can cause network disruption when ofctl controller becomes
    inaccessible due to heavy load or when the traffic to controller is blocked.
    By setting secure fail mode, the openflow rules remain untouched on such
    an event, while with the default setting, the flows are cleared.

    Co-Authored-By: Jakub Libosvar <email address hidden>
    Closes-Bug: 1607787
    Change-Id: I1dffe0a248664d2a675fd1ca58530c233e335d2d
    UpgradeImpact
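
For operators who want to apply or verify the same setting by hand, the fail mode can be set with the `ovs-vsctl set-fail-mode` and `get-fail-mode` commands. The sketch below is illustrative only and is not the code from the commit: the function names and the injectable `run` parameter are assumptions made here so the logic can be exercised without Open vSwitch installed.

```python
import subprocess


def fail_mode_commands(bridge):
    """Return ovs-vsctl invocations that set, then read back, the fail mode."""
    return [
        ["ovs-vsctl", "set-fail-mode", bridge, "secure"],
        ["ovs-vsctl", "get-fail-mode", bridge],
    ]


def set_secure_fail_mode(bridge, run=subprocess.check_output):
    """Set the bridge's fail mode to secure and return the mode reported back.

    `run` defaults to subprocess.check_output, which needs ovs-vsctl on the
    PATH and sufficient privileges; pass a stub to dry-run the logic.
    """
    set_cmd, get_cmd = fail_mode_commands(bridge)
    run(set_cmd)
    return run(get_cmd).decode().strip()
```

On a host with Open vSwitch, `set_secure_fail_mode("br-eth0")` (the bridge name is an example) should return `secure` once the setting has been applied.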

Changed in neutron:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/mitaka)

Fix proposed to branch: stable/mitaka
Review: https://review.openstack.org/355315

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/liberty)

Fix proposed to branch: stable/liberty
Review: https://review.openstack.org/355322

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/liberty)

Reviewed: https://review.openstack.org/355322
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=cdbb26e87d14c1ba13dbef98111f7e578903bd98
Submitter: Jenkins
Branch: stable/liberty

commit cdbb26e87d14c1ba13dbef98111f7e578903bd98
Author: Hynek Mlnarik <email address hidden>
Date: Wed Aug 10 10:05:57 2016 +0200

    Set secure fail mode for physical bridges

    Physical bridges can cause network disruption when ofctl controller becomes
    inaccessible due to heavy load or when the traffic to controller is blocked.
    By setting secure fail mode, the openflow rules remain untouched on such
    an event, while with the default setting, the flows are cleared.

    Conflicts:
     neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py
     neutron/tests/unit/plugins/ml2/drivers/openvswitch/agent/test_ovs_neutron_agent.py
     neutron/tests/unit/plugins/ml2/drivers/openvswitch/agent/test_ovs_tunnel.py

    Co-Authored-By: Jakub Libosvar <email address hidden>
    Closes-Bug: 1607787
    Change-Id: I1dffe0a248664d2a675fd1ca58530c233e335d2d
    (cherry picked from commit 9429c2da01fa29cedcb2a65a26c1c29d0a713670)

tags: added: in-stable-liberty
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/mitaka)

Reviewed: https://review.openstack.org/355315
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=a08fecc98ecdf28f804590765a9aea7c7a6719d4
Submitter: Jenkins
Branch: stable/mitaka

commit a08fecc98ecdf28f804590765a9aea7c7a6719d4
Author: Hynek Mlnarik <email address hidden>
Date: Wed Aug 10 10:05:57 2016 +0200

    Set secure fail mode for physical bridges

    Physical bridges can cause network disruption when ofctl controller becomes
    inaccessible due to heavy load or when the traffic to controller is blocked.
    By setting secure fail mode, the openflow rules remain untouched on such
    an event, while with the default setting, the flows are cleared.

    Co-Authored-By: Jakub Libosvar <email address hidden>
    Closes-Bug: 1607787
    Change-Id: I1dffe0a248664d2a675fd1ca58530c233e335d2d
    (cherry picked from commit 9429c2da01fa29cedcb2a65a26c1c29d0a713670)

tags: added: in-stable-mitaka
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 9.0.0.0b3

This issue was fixed in the openstack/neutron 9.0.0.0b3 development milestone.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 7.2.0

This issue was fixed in the openstack/neutron 7.2.0 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 8.3.0

This issue was fixed in the openstack/neutron 8.3.0 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (master)

Reviewed: https://review.openstack.org/353373
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=27aee4a9c53ad9e8916e074f5cb7675871964448
Submitter: Jenkins
Branch: master

commit 27aee4a9c53ad9e8916e074f5cb7675871964448
Author: Hynek Mlnarik <email address hidden>
Date: Wed Aug 10 10:07:42 2016 +0200

    Connectivity tests for OVS agent failures/restarts

    Adding two tests:

    * A test that for native ovs-ofctl interface verifies that stopping the
      ovs-neutron-agent does not disrupt network traffic. Stopping the agent
      means also stopping the OVS bridge controller, hence OVS can decide to
      take over management of OpenFlow rules, clear them up, and this way
      cause network traffic disruption.

    * A test that creates two ports in a single network, then starts
      pinging one from the other while restarting OVS agents. The test verifies
      that no packet is lost during OVS agent restarts.

    Change-Id: I2cd1195fc0622c8c8d614f00e9dd6884ad388d69
    Related-Bug: 1514056
    Related-Bug: 1607787

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (stable/ocata)

Related fix proposed to branch: stable/ocata
Review: https://review.openstack.org/462998

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (stable/ocata)

Reviewed: https://review.openstack.org/462998
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=81bb6aa348c05b9cff39ce2acda2bd77e01167c7
Submitter: Jenkins
Branch: stable/ocata

commit 81bb6aa348c05b9cff39ce2acda2bd77e01167c7
Author: Hynek Mlnarik <email address hidden>
Date: Wed Aug 10 10:07:42 2016 +0200

    Connectivity tests for OVS agent failures/restarts

    Adding two tests:

    * A test that for native ovs-ofctl interface verifies that stopping the
      ovs-neutron-agent does not disrupt network traffic. Stopping the agent
      means also stopping the OVS bridge controller, hence OVS can decide to
      take over management of OpenFlow rules, clear them up, and this way
      cause network traffic disruption.

    * A test that creates two ports in a single network, then starts
      pinging one from the other while restarting OVS agents. The test verifies
      that no packet is lost during OVS agent restarts.

    Change-Id: I2cd1195fc0622c8c8d614f00e9dd6884ad388d69
    Related-Bug: 1514056
    Related-Bug: 1607787
    (cherry picked from commit 27aee4a9c53ad9e8916e074f5cb7675871964448)

tags: added: in-stable-ocata