[L2] dataplane down during ovs-agent restart

Bug #1803919 reported by LIU Yulong
This bug affects 2 people
Affects | Status | Importance | Assigned to | Milestone
neutron | Fix Released | Undecided | LIU Yulong |

Bug Description

ENV:
neutron: stable/queens
tenant network type: vlan
provider network type: vlan
kernel: 3.10.0-862.3.2.el7.x86_64

Problem description:
This is an extreme case that occurs when the neutron ovs-agent restarts.
(1) Condition 1: the tenant network and the provider network share the same physical NIC, that is, both send their traffic through the same physical NIC, so the bridge mapping is br-provider:bond1. There are no other mappings.
(2) Condition 2: all neutron-servers are down, or the message queue is down.
If the L2 ovs-agent is restarted under these conditions, the dataplane goes down.

This issue was seen during the upgrade of a large deployment: when neutron-server and the ovs-agents were restarted at the same time, some ovs-agents hit message timeouts and VM traffic went down.

Code digging:
The stable/queens and master branches follow basically the same procedure for this issue.
The ovs-agent init procedure calls `setup_physical_bridges`, which installs two drop flows:
https://github.com/openstack/neutron/blob/master/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py#L1225-L1226
Once these two drop flows are installed, VM traffic goes down.
If the MQ or the neutron server is not up, the VMs stay unreachable. Even after the MQ and the neutron server are both up again, the ovs-agent has to be restarted manually once more to recover the traffic.

LIU Yulong (dragon889)
description: updated
Changed in neutron:
assignee: nobody → LIU Yulong (dragon889)
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.openstack.org/618720

Changed in neutron:
status: New → In Progress
YAMAMOTO Takashi (yamamoto) wrote :

can you explain how the drop flows in setup_physical_bridges make VMs unreachable?
i expected that the old flows would remain until cleanup_stale_flows is called.

Changed in neutron:
status: In Progress → Incomplete
LIU Yulong (dragon889) wrote :

@YAMAMOTO Takashi (yamamoto),
It is really simple to reproduce the issue: stop the neutron server, set a bridge mapping for the ovs-agent, and then restart the ovs-agent.

Here is my test.
the br-int flows:
 cookie=0x6067905158bfb7c1, duration=3.361s, table=0, n_packets=0, n_bytes=0, idle_age=65534, priority=2,in_port=124 actions=drop
 cookie=0x6067905158bfb7c1, duration=3.318s, table=0, n_packets=2987, n_bytes=892106, idle_age=51, priority=2,in_port=1 actions=drop
 cookie=0x159f3b920557a968, duration=246304.113s, table=1, n_packets=0, n_bytes=0, idle_age=65534, hard_age=65534, priority=1 actions=drop
 cookie=0x159f3b920557a968, duration=246304.109s, table=2, n_packets=0, n_bytes=0, idle_age=65534, hard_age=65534, priority=1 actions=drop
 cookie=0x6067905158bfb7c1, duration=3.600s, table=23, n_packets=0, n_bytes=0, idle_age=65534, priority=0 actions=drop
 cookie=0x6067905158bfb7c1, duration=3.595s, table=24, n_packets=0, n_bytes=0, idle_age=65534, priority=0 actions=drop

br-int ports:
 1(int-br-provider): addr:c2:6d:a1:e7:0e:51
     config: 0
     state: 0
     speed: 0 Mbps now, 0 Mbps max
 124(int-br-vlan): addr:ba:08:e1:81:82:60
     config: 0
     state: 0
     speed: 0 Mbps now, 0 Mbps max

The flows with cookie 0x6067905158bfb7c1 are the drop flows installed during the restart.
Note that this one is already dropping packets:
cookie=0x6067905158bfb7c1, duration=3.318s, table=0, n_packets=2987, n_bytes=892106, idle_age=51, priority=2,in_port=1 actions=drop
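
To spot this condition quickly, the dump can be filtered for drop flows that were installed recently and are already counting packets. The following is a minimal sketch, not part of neutron: the function names and the 60-second threshold are assumptions chosen for illustration, and the regular expression only covers the `ovs-ofctl dump-flows` line layout shown above.

```python
import re

# Matches the flow-line layout shown in the dump above (illustrative only).
FLOW_RE = re.compile(
    r"cookie=(?P<cookie>0x[0-9a-f]+),\s*duration=(?P<duration>[\d.]+)s,"
    r"\s*table=(?P<table>\d+),\s*n_packets=(?P<n_packets>\d+),.*?"
    r"priority=(?P<priority>\d+)(?:,(?P<match>\S+))?\s+actions=(?P<actions>\S+)"
)


def parse_flows(dump):
    """Parse dump-flows text into a list of dicts, skipping other lines."""
    flows = []
    for line in dump.splitlines():
        m = FLOW_RE.search(line)
        if m is None:
            continue
        flows.append({
            "cookie": m.group("cookie"),
            "duration": float(m.group("duration")),
            "table": int(m.group("table")),
            "n_packets": int(m.group("n_packets")),
            "priority": int(m.group("priority")),
            "match": m.group("match") or "any",
            "actions": m.group("actions"),
        })
    return flows


def recent_drops(flows, max_age_s=60.0):
    """Drop flows younger than max_age_s (threshold is an assumption)."""
    return [f for f in flows
            if f["actions"] == "drop" and f["duration"] <= max_age_s]


SAMPLE = """
 cookie=0x6067905158bfb7c1, duration=3.361s, table=0, n_packets=0, n_bytes=0, idle_age=65534, priority=2,in_port=124 actions=drop
 cookie=0x6067905158bfb7c1, duration=3.318s, table=0, n_packets=2987, n_bytes=892106, idle_age=51, priority=2,in_port=1 actions=drop
 cookie=0x159f3b920557a968, duration=246304.113s, table=1, n_packets=0, n_bytes=0, idle_age=65534, hard_age=65534, priority=1 actions=drop
 cookie=0x159f3b920557a968, duration=246304.109s, table=2, n_packets=0, n_bytes=0, idle_age=65534, hard_age=65534, priority=1 actions=drop
 cookie=0x6067905158bfb7c1, duration=3.600s, table=23, n_packets=0, n_bytes=0, idle_age=65534, priority=0 actions=drop
 cookie=0x6067905158bfb7c1, duration=3.595s, table=24, n_packets=0, n_bytes=0, idle_age=65534, priority=0 actions=drop
"""

recent = recent_drops(parse_flows(SAMPLE))
dropping = [f for f in recent if f["n_packets"] > 0]
for f in dropping:
    print(f["cookie"], f["table"], f["match"], f["n_packets"])
```

On the sample above this reports only the `in_port=1` flow, which is the `int-br-provider` port in the port listing.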

Changed in neutron:
status: Incomplete → In Progress
YAMAMOTO Takashi (yamamoto) wrote :

LIU,

my understanding is:
- those drop flows were there even before a restart
- the problem is that cleanup_stale_flows is called before new flows are ready

is it right?

LIU Yulong (dragon889) wrote :

YAMAMOTO,

Yes, the br-int drop flows are there before the restart. As I said in the `Bug Description`, the code installs two drop flows, one on each side of the bridge mapping. I will explain more below.
``cleanup_stale_flows`` will not run, because OVSNeutronAgent does not finish init, so rpc_loop never starts.

Furthermore, in comment #3 I only showed the br-int drop flows. The bridge on the other side, for instance br-ex, has a drop flow as well. That flow can also drop traffic from a VM to the outside world, or VM-to-VM traffic, in this shared-NIC scenario. Traffic from br-int to br-ex is dropped by this:
https://github.com/openstack/neutron/blob/master/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py#L1226

Here is an example of the br-ex flows:
[yulong@compute2 ~]$ sudo ovs-ofctl show br-ex
OFPT_FEATURES_REPLY (xid=0x2): dpid:0000080027466da9
n_tables:254, n_buffers:256
capabilities: FLOW_STATS TABLE_STATS PORT_STATS QUEUE_STATS ARP_MATCH_IP
actions: output enqueue set_vlan_vid set_vlan_pcp strip_vlan mod_dl_src mod_dl_dst mod_nw_src mod_nw_dst mod_nw_tos mod_tp_src mod_tp_dst
 1(enp0s8): addr:08:00:27:46:6d:a9
     config: 0
     state: 0
     current: 1GB-FD COPPER AUTO_NEG
     advertised: 10MB-HD 10MB-FD 100MB-HD 100MB-FD 1GB-FD COPPER AUTO_NEG
     supported: 10MB-HD 10MB-FD 100MB-HD 100MB-FD 1GB-FD COPPER AUTO_NEG
     speed: 1000 Mbps now, 1000 Mbps max
 2(phy-br-ex): addr:62:1e:46:b3:e6:82
     config: 0
     state: 0
     speed: 0 Mbps now, 0 Mbps max
 LOCAL(br-ex): addr:08:00:27:46:6d:a9
     config: 0
     state: 0
     speed: 0 Mbps now, 0 Mbps max
OFPT_GET_CONFIG_REPLY (xid=0x4): frags=normal miss_send_len=0
[yulong@compute2 ~]$
[yulong@compute2 ~]$
[yulong@compute2 ~]$ sudo ovs-ofctl dump-flows br-ex
NXST_FLOW reply (xid=0x4):
 cookie=0xd77d26e720225238, duration=7.258s, table=0, n_packets=819, n_bytes=36570, idle_age=0, priority=2,in_port=2 actions=drop
 cookie=0x3a011e5d76fd5e16, duration=584.888s, table=0, n_packets=64939, n_bytes=13902213, idle_age=58, priority=1 actions=resubmit(,3)
 cookie=0xd77d26e720225238, duration=7.432s, table=0, n_packets=2, n_bytes=112, idle_age=65534, priority=0 actions=NORMAL
 cookie=0x3a011e5d76fd5e16, duration=584.836s, table=1, n_packets=489, n_bytes=22710, idle_age=331, priority=0 actions=resubmit(,2)
 cookie=0x3a011e5d76fd5e16, duration=573.436s, table=2, n_packets=480, n_bytes=21246, idle_age=331, priority=4,in_port=2,dl_vlan=3 actions=mod_vlan_vid:2001,NORMAL
 cookie=0x3a011e5d76fd5e16, duration=584.779s, table=2, n_packets=9, n_bytes=1464, idle_age=46594, priority=2,in_port=2 actions=drop
 cookie=0x3a011e5d76fd5e16, duration=584.209s, table=3, n_packets=0, n_bytes=0, idle_age=65534, priority=2,dl_src=fa:16:3f:08:8f:35 actions=output:2
 cookie=0x3a011e5d76fd5e16, duration=583.959s, table=3, n_packets=0, n_bytes=0, idle_age=65534, priority=2,dl_src=fa:16:3f:d1:f9:ac actions=output:2
 cookie=0x3a011e5d76fd5e16, duration=583.566s, table=3, n_packets=0, n_bytes=0, idle_age=65534, priority=2,dl_src=fa:16:3f:dd:0e:2d actions=output:2
 cookie=0x3a011e5d76fd5e16, duration=584.707s, table=3, n_packets=64939, n_bytes=13902213, idle_age=58, priority=1 actions=NOR...
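
Why the new flow wins in table 0 can be shown with a toy model of OpenFlow priority matching: among the flows that match a packet, the one with the highest priority is applied. The sketch below is not OVS or neutron code; the `Flow` class and `lookup` function are hypothetical helpers, and only the three table 0 entries are taken from the dump above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Flow:
    priority: int
    in_port: Optional[int]  # None means the flow matches any ingress port
    actions: str


def lookup(table: List[Flow], in_port: int) -> str:
    """Return the actions of the highest-priority flow matching in_port."""
    candidates = [f for f in table if f.in_port in (None, in_port)]
    if not candidates:
        return "table-miss"
    return max(candidates, key=lambda f: f.priority).actions


# Table 0 of br-ex, copied from the dump above.
BR_EX_TABLE_0 = [
    Flow(priority=2, in_port=2, actions="drop"),
    Flow(priority=1, in_port=None, actions="resubmit(,3)"),
    Flow(priority=0, in_port=None, actions="NORMAL"),
]

PHY_BR_EX = 2  # port 2 in the `ovs-ofctl show` output, the port facing br-int
ENP0S8 = 1     # port 1, the physical NIC

print(lookup(BR_EX_TABLE_0, PHY_BR_EX))  # drop
print(lookup(BR_EX_TABLE_0, ENP0S8))     # resubmit(,3)
```

Everything that enters br-ex through `phy-br-ex` hits the priority 2 drop before the older priority 1 flow is considered, which is consistent with the packet counter on that flow.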


OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.openstack.org/618720
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=0385868848f8c18c8a37fd4c661d1b1a5078e044
Submitter: Zuul
Branch: master

commit 0385868848f8c18c8a37fd4c661d1b1a5078e044
Author: LIU Yulong <email address hidden>
Date: Thu Nov 15 17:49:12 2018 +0800

    Check if agent can reach neutron server

    The ovs agent will install some basic drop flows first for the
    physical bridge mappings during the init procedure. If message
    queue is not connected, or neutron-servers are all down, real
    traffic flows will not be refreshed anymore. This will cause
    the data plane down if tenant network and provider network are
    sharing the physical NICs.

    This patch adds a RPC check during init L2 agent. When restart
    the ovs-agent, if the MQ is OK and we have available neutron-server,
    go next step. Otherwise, a rpc timeout will be raised. L2 agent
    will start fail, physical bridge mapping drop flows will not be
    installed. The original flows will not be replaced, so the traffic
    can still work properly.

    Closes-Bug: #1803919
    Change-Id: Ie15cf625b3710eaf290d6aafecb3f65df664b9df
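
The ordering change that the commit message describes can be illustrated with a toy model. This is a simplified sketch and not the merged patch: `ServerUnreachable`, `server_down`, `start_agent`, and the flow strings are all hypothetical names used only to show the effect of checking server reachability before installing the drop flows.

```python
class ServerUnreachable(Exception):
    """Stands in for an RPC timeout (hypothetical name)."""


def server_down():
    # Simulates neutron-server or the message queue being unavailable.
    raise ServerUnreachable("RPC timed out")


def start_agent(flows, check_server, check_first):
    """Toy agent init; mutates `flows` the way bridge setup would."""
    if check_first:
        check_server()  # patched order: fail before touching any flows
    flows.insert(0, "priority=2,in_port=2 actions=drop")
    if not check_first:
        check_server()  # old order: the failure comes after the drop flow


def flows_after_failed_start(check_first):
    flows = ["priority=1 actions=resubmit(,3)"]
    try:
        start_agent(flows, server_down, check_first)
    except ServerUnreachable:
        pass  # the agent fails to start in both cases
    return flows


print(flows_after_failed_start(check_first=False))
print(flows_after_failed_start(check_first=True))
```

In both runs the toy agent fails to start, but only the old ordering leaves the drop flow behind. With the check first, the original flows are untouched and traffic keeps working.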

Changed in neutron:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/rocky)

Fix proposed to branch: stable/rocky
Review: https://review.openstack.org/624850

OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/queens)

Fix proposed to branch: stable/queens
Review: https://review.openstack.org/624851

OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/pike)

Fix proposed to branch: stable/pike
Review: https://review.openstack.org/625132

OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (stable/rocky)

Change abandoned by LIU Yulong (<email address hidden>) on branch: stable/rocky
Review: https://review.openstack.org/624850

OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (stable/queens)

Change abandoned by LIU Yulong (<email address hidden>) on branch: stable/queens
Review: https://review.openstack.org/624851

OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (stable/pike)

Change abandoned by Bernard Cafarelli (<email address hidden>) on branch: stable/pike
Review: https://review.openstack.org/625132
Reason: Includes a RPC version change

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 14.0.0.0b1

This issue was fixed in the openstack/neutron 14.0.0.0b1 development milestone.

tags: added: neutron-proactive-backport-potential
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/741444

OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (master)

Change abandoned by Rodolfo Alonso Hernandez (<email address hidden>) on branch: master
Review: https://review.opendev.org/741444
Reason: Superseded by https://review.opendev.org/#/c/740724/. Nice to see a better option.
