ovs-fw does not reinstate GRE conntrack entry

Bug #1708731 reported by Aju Francis
This bug affects 10 people
Affects: neutron
Status: Fix Released
Importance: High
Assigned to: Jakub Libosvar

Bug Description

 *High level description:*

We have VMs running GRE tunnels between them, with OVSFW and security groups (SG) in use and the GRE conntrack helper loaded on the hypervisor. GRE works as expected, but the tunnel breaks whenever a neutron OVS agent event raises an exception, such as the AMQP timeout or OVSFWPortNotFound errors below:

AMQP timeout:

2017-04-07 19:07:03.001 5275 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent MessagingTimeout: Timed out waiting for a reply to message ID 4035644808d24ce9aae65a6ee567021c
2017-04-07 19:07:03.001 5275 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent
2017-04-07 19:07:03.003 5275 WARNING oslo.service.loopingcall [-] Function 'neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent.OVSNeutronAgent._report_state' run outlasted interval by 120.01 sec
2017-04-07 19:07:03.041 5275 INFO neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [-] Agent has just been revived. Doing a full sync.
2017-04-07 19:07:06.747 5275 INFO neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [req-521c07b4-f53d-4665-b728-fc5f00191294 - - - - -] rpc_loop doing a full sync.
2017-04-07 19:07:06.841 5275 INFO neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [req-521c07b4-f53d-4665-b728-fc5f00191294 - - - - -] Agent out of sync with plugin!

OVSFWPortNotFound:

2017-03-30 18:31:05.048 5160 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent self.firewall.prepare_port_filter(device)
2017-03-30 18:31:05.048 5160 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent File "/openstack/venvs/neutron-14.0.5/lib/python2.7/site-packages/neutron/agent/linux/openvswitch_firewall/firewall.py", line 272, in prepare_port_filter
2017-03-30 18:31:05.048 5160 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent of_port = self.get_or_create_ofport(port)
2017-03-30 18:31:05.048 5160 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent File "/openstack/venvs/neutron-14.0.5/lib/python2.7/site-packages/neutron/agent/linux/openvswitch_firewall/firewall.py", line 246, in get_or_create_ofport
2017-03-30 18:31:05.048 5160 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent raise OVSFWPortNotFound(port_id=port_id)
2017-03-30 18:31:05.048 5160 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent OVSFWPortNotFound: Port 01f7c714-1828-4768-9810-a0ec25dd2b92 is not managed by this agent.
2017-03-30 18:31:05.048 5160 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent
2017-03-30 18:31:05.072 5160 INFO neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [req-db74f32b-5370-4a5f-86bf-935eba1490d0 - - - - -] Agent out of sync with plugin!

The agent logs out-of-sync messages and starts to initialize the neutron ports once again, along with fresh SG rules.

2017-04-07 19:07:07.110 5275 INFO neutron.agent.securitygroups_rpc [req-521c07b4-f53d-4665-b728-fc5f00191294 - - - - -] Preparing filters for devices set([u'4b14619f-3b9e-4103-b9d7-9c7e52c797d8'])
2017-04-07 19:07:07.215 5275 ERROR neutron.agent.linux.openvswitch_firewall.firewall [req-521c07b4-f53d-4665-b728-fc5f00191294 - - - - -] Initializing port 4b14619f-3b9e-4103-b9d7-9c7e52c797d8 that was already initialized.
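
To correlate tunnel outages with agent resyncs, the log lines above can be scanned for the two messages quoted in this report. The following is only a rough sketch: the regular expression assumes the line layout shown in the excerpts, and the names LOG_LINE, RESYNC_MARKERS and find_resync_events are hypothetical helpers, not part of neutron.

```python
import re

# Assumed layout, taken from the excerpts in this report:
# "<date> <time> <pid> <LEVEL> <logger> [<request context>] <message>"
LOG_LINE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+) "
    r"(?P<pid>\d+) (?P<level>[A-Z]+) (?P<logger>\S+) "
    r"(?:\[(?P<ctx>[^\]]*)\] )?(?P<msg>.*)$"
)

# Message fragments quoted in the bug description.
RESYNC_MARKERS = (
    "Agent out of sync with plugin!",
    "that was already initialized",
)


def find_resync_events(lines):
    """Return (timestamp, level, message) for lines that hint at a resync."""
    events = []
    for line in lines:
        m = LOG_LINE.match(line.strip())
        if m and any(marker in m.group("msg") for marker in RESYNC_MARKERS):
            events.append((m.group("ts"), m.group("level"), m.group("msg")))
    return events


sample = [
    "2017-04-07 19:07:03.041 5275 INFO neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [-] Agent has just been revived. Doing a full sync.",
    "2017-04-07 19:07:06.841 5275 INFO neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [req-521c07b4-f53d-4665-b728-fc5f00191294 - - - - -] Agent out of sync with plugin!",
    "2017-04-07 19:07:07.215 5275 ERROR neutron.agent.linux.openvswitch_firewall.firewall [req-521c07b4-f53d-4665-b728-fc5f00191294 - - - - -] Initializing port 4b14619f-3b9e-4103-b9d7-9c7e52c797d8 that was already initialized.",
]
events = find_resync_events(sample)
for ts, level, msg in events:
    print(ts, level, msg)
```

The timestamps it returns can then be compared with the time at which a tunnel stopped passing traffic.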

During this process, while it prepares new filters for all ports, it marks the conntrack entry for certain high-traffic GRE connections as invalid (mark=1 in the first entry below).

root@server:/var/log# conntrack -L -o extended -p gre -f ipv4
ipv4 2 gre 47 178 src=1.1.1.203 dst=2.2.2.66 srckey=0x0 dstkey=0x0 src=2.2.2.66 dst=1.1.1.203 srckey=0x0 dstkey=0x0 [ASSURED] mark=1 zone=5 use=1
ipv4 2 gre 47 179 src=5.5.5.104 dst=4.4.4.187 srckey=0x0 dstkey=0x0 src=4.4.4.187 dst=5.5.5.104 srckey=0x0 dstkey=0x0 [ASSURED] mark=0 zone=5 use=1

That connection state remains invalid until someone reboots the VM or flushes the connection, either directly in conntrack or through OVS.
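
A small helper can pick the stuck entries out of the conntrack listing above. This is a minimal sketch that assumes the `-o extended` token layout shown in this report; parse_conntrack_line, stuck_entries and delete_command are hypothetical names. The delete command is only built, never executed, and its flags (-D, -f, -p, -s, -d, -w for the zone) should be checked against the conntrack(8) version installed on the hypervisor.

```python
def parse_conntrack_line(line):
    """Parse one line of `conntrack -L -o extended` output into a dict."""
    tokens = line.split()
    entry = {"family": tokens[0], "proto": tokens[2],
             "flags": [], "orig": {}, "reply": {}}
    for tok in tokens[3:]:
        if tok.startswith("[") and tok.endswith("]"):
            entry["flags"].append(tok.strip("[]"))
        elif "=" in tok:
            key, value = tok.split("=", 1)
            if key in ("mark", "zone", "use"):
                entry[key] = int(value)
            elif key in entry["orig"]:
                # Second occurrence of a key belongs to the reply direction.
                entry["reply"][key] = value
            else:
                entry["orig"][key] = value
    return entry


def stuck_entries(output):
    """Return the entries whose conntrack mark is 1 (treated as invalid)."""
    lines = (l.strip() for l in output.splitlines())
    return [e for e in map(parse_conntrack_line, filter(None, lines))
            if e.get("mark") == 1]


def delete_command(entry):
    """Build, but do not run, a conntrack delete command for one entry."""
    return ["conntrack", "-D", "-f", entry["family"], "-p", entry["proto"],
            "-s", entry["orig"]["src"], "-d", entry["orig"]["dst"],
            "-w", str(entry["zone"])]


sample = """\
ipv4 2 gre 47 178 src=1.1.1.203 dst=2.2.2.66 srckey=0x0 dstkey=0x0 src=2.2.2.66 dst=1.1.1.203 srckey=0x0 dstkey=0x0 [ASSURED] mark=1 zone=5 use=1
ipv4 2 gre 47 179 src=5.5.5.104 dst=4.4.4.187 srckey=0x0 dstkey=0x0 src=4.4.4.187 dst=5.5.5.104 srckey=0x0 dstkey=0x0 [ASSURED] mark=0 zone=5 use=1
"""
stuck = stuck_entries(sample)
for entry in stuck:
    print(" ".join(delete_command(entry)))
```

Building the command as a list keeps the sketch side-effect free; whether to run it is left to the operator.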

We had a blanket SG rule (any protocol, any port, any IP) in place during this scenario, and we also tried adding specific rules to allow IP protocol 47 for GRE, but nothing fixed the problem.

While checking for bugs specific to the OVS conntrack helper, I came across https://patchwork.ozlabs.org/patch/755615/. Is that bug being triggered in the above scenario? Is this a bug in the ovs-fw code, or is it something in the OVS conntrack implementation?

OpenStack Version : Newton.
Hypervisor OS : Ubuntu 16.04.2
Kernel version : 4.4.0-70-generic
OVS version : 2.6.1

William Grant (wgrant)
affects: neutron → null-and-void
information type: Public → Private
Changed in null-and-void:
status: New → Invalid
Colin Watson (cjwatson)
affects: null-and-void → neutron
Changed in neutron:
status: Invalid → New
Jacek Nykis (jacekn)
information type: Private → Public
Kevin Benton (kevinbenton) wrote :

Assigning to Jakub for further investigation.

Changed in neutron:
status: New → Triaged
assignee: nobody → Jakub Libosvar (libosvar)
Vil Surkin (vill-srk) wrote :

This also affects us. We found that the problem happens with every keep-alive connection, such as tunnels (not only GRE).

After some investigation we found the following: in OVSFirewallDriver.update_port_filter() in neutron/agent/linux/openvswitch_firewall/firewall.py, there is a time window between "delete port rules" and "add port rules". If a packet for an already established connection arrives between the delete and add events, the connection is marked as invalid (conntrack_mark=1) and subsequent packets are dropped by table 82 in OVS.

Any rule update on a port causes such connections to stop working.
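
The window described above can be illustrated with a toy model. It is a deliberately simplified sketch, not the neutron or OVS implementation: ToyFirewall and its methods are hypothetical, and the "sticky" drop only mimics the behaviour reported here (an entry with mark=1 keeps being dropped until it is flushed).

```python
class ToyFirewall:
    """Per-port rules plus one conntrack mark per connection."""

    def __init__(self):
        self.rules = {}   # port -> set of allowed protocols
        self.marks = {}   # connection id -> 0 (valid) or 1 (invalid)

    def handle_packet(self, port, proto, conn):
        if self.marks.get(conn) == 1:
            return "drop"            # already marked invalid: keep dropping
        if proto in self.rules.get(port, set()):
            self.marks[conn] = 0
            return "accept"
        self.marks[conn] = 1         # no matching rule: mark as invalid
        return "drop"

    def flush(self, conn):
        """Forget the connection, as a conntrack flush would."""
        self.marks.pop(conn, None)


fw = ToyFirewall()
results = []

fw.rules["port-1"] = {"gre"}
results.append(fw.handle_packet("port-1", "gre", "tunnel"))  # rules present

del fw.rules["port-1"]                                       # "delete port rules"
results.append(fw.handle_packet("port-1", "gre", "tunnel"))  # packet in the window

fw.rules["port-1"] = {"gre"}                                 # "add port rules"
results.append(fw.handle_packet("port-1", "gre", "tunnel"))  # still dropped

fw.flush("tunnel")
results.append(fw.handle_packet("port-1", "gre", "tunnel"))  # works again

print(results)
```

In this model, low-traffic connections usually survive simply because no packet happens to arrive inside the window, which matches the observation that busy tunnels are hit first.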

Aju Francis (aju) wrote :

Agreed, Vil Surkin. We just saw this happen on our VXLAN-GPE tunnel. Below is the state of the tunnel when the connection disruption occurred.

ipv4 2 udp 17 178 src=1.1.1.1 dst=2.2.2.2 sport=4790 dport=4790 src=2.2.2.2 dst=1.1.1.1 sport=4790 dport=4790 [ASSURED] mark=1 zone=36 use=1

We had to flush the state in zone 36, after which a fresh connection was created.

ipv4 2 udp 17 178 src=1.1.1.1 dst=2.2.2.2 sport=4790 dport=4790 src=2.2.2.2 dst=1.1.1.1 sport=4790 dport=4790 [ASSURED] mark=0 zone=36 use=1

Changed in neutron:
importance: Undecided → High
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.openstack.org/540943

Changed in neutron:
status: Triaged → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.openstack.org/540943
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=6f7ba76075dd0d645ad6cee6854f87cc41cba1fa
Submitter: Zuul
Branch: master

commit 6f7ba76075dd0d645ad6cee6854f87cc41cba1fa
Author: Jakub Libosvar <email address hidden>
Date: Mon Feb 5 17:20:09 2018 +0000

    ovs-fw: Fix firewall blink

    Previously, when security group was updated for given port, the firewall
    removed all flows related to the port and added new rules. That
    introduced a time window where there were no rules for the port.

    This patch adds a new mechanism using cookie that can be described in
    three states:

    1) Create new openflow rules with non-default cookie that is considered
    an updated cookie. All newly generated flows will be added with the next
    cookie and all existing rules with default cookie are rewritten with the
    default cookie.
    2) Delete all rules for given port with the old default cookie. This
    will leave the newly added rules in place.
    3) Update the newly added flows with update cookie back to the default
    cookie in order to avoid such flows being cleaned on the next restart of
    ovs agent, as it fetches for stale flows.

    Change-Id: I85d9e49c24ee7c91229b43cd329c42149637f254
    Closes-bug: #1708731
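
The three-step cookie update from the commit message can be sketched as a toy flow table. This is an illustration under stated assumptions, not the code from the patch: ToyFlowTable, update_port and the two cookie values are hypothetical, and the table models only one property of OpenFlow, namely that adding a flow with an identical match replaces the existing one.

```python
DEFAULT_COOKIE = 0x1   # hypothetical values, chosen only for this sketch
UPDATE_COOKIE = 0x2


class ToyFlowTable:
    """Flows keyed by (port, match); the value is the flow's cookie."""

    def __init__(self):
        self.flows = {}

    def add(self, cookie, port, match):
        # An identical match overwrites the old flow, so an unchanged
        # rule simply picks up the new cookie.
        self.flows[(port, match)] = cookie

    def delete(self, cookie, port):
        self.flows = {k: c for k, c in self.flows.items()
                      if not (k[0] == port and c == cookie)}

    def recookie(self, old, new, port):
        for key, cookie in list(self.flows.items()):
            if key[0] == port and cookie == old:
                self.flows[key] = new

    def matches(self, port):
        return {m for (p, m) in self.flows if p == port}


def update_port(table, port, new_matches):
    """Apply the three steps and record how many flows exist after each."""
    sizes = []
    for match in new_matches:                    # 1) new rules, update cookie
        table.add(UPDATE_COOKIE, port, match)
    sizes.append(len(table.matches(port)))
    table.delete(DEFAULT_COOKIE, port)           # 2) drop stale rules
    sizes.append(len(table.matches(port)))
    table.recookie(UPDATE_COOKIE, DEFAULT_COOKIE, port)  # 3) restore cookie
    sizes.append(len(table.matches(port)))
    return sizes


table = ToyFlowTable()
for m in ("gre", "icmp"):
    table.add(DEFAULT_COOKIE, "port-1", m)

sizes = update_port(table, "port-1", ["gre", "tcp"])
print(sizes, sorted(table.matches("port-1")))
```

The point of the ordering is that the port never has zero flows: the count after each step stays above zero, unlike the delete-then-add sequence that caused the blink.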

Changed in neutron:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/queens)

Fix proposed to branch: stable/queens
Review: https://review.openstack.org/555769

OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/queens)

Reviewed: https://review.openstack.org/555769
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=2c88fcb5a819432697ac57173a4c2023e86b9b6f
Submitter: Zuul
Branch: stable/queens

commit 2c88fcb5a819432697ac57173a4c2023e86b9b6f
Author: Jakub Libosvar <email address hidden>
Date: Mon Feb 5 17:20:09 2018 +0000

    ovs-fw: Fix firewall blink

    Previously, when security group was updated for given port, the firewall
    removed all flows related to the port and added new rules. That
    introduced a time window where there were no rules for the port.

    This patch adds a new mechanism using cookie that can be described in
    three states:

    1) Create new openflow rules with non-default cookie that is considered
    an updated cookie. All newly generated flows will be added with the next
    cookie and all existing rules with default cookie are rewritten with the
    default cookie.
    2) Delete all rules for given port with the old default cookie. This
    will leave the newly added rules in place.
    3) Update the newly added flows with update cookie back to the default
    cookie in order to avoid such flows being cleaned on the next restart of
    ovs agent, as it fetches for stale flows.

    Conflicts:
     neutron/tests/unit/agent/linux/openvswitch_firewall/test_firewall.py

    Change-Id: I85d9e49c24ee7c91229b43cd329c42149637f254
    Closes-bug: #1708731
    (cherry picked from commit 6f7ba76075dd0d645ad6cee6854f87cc41cba1fa)

tags: added: in-stable-queens
Jakub Libosvar (libosvar) wrote :
Changed in neutron:
status: Fix Released → Confirmed
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.openstack.org/562220

Changed in neutron:
status: Confirmed → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 13.0.0.0b1

This issue was fixed in the openstack/neutron 13.0.0.0b1 development milestone.

OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/queens)

Fix proposed to branch: stable/queens
Review: https://review.openstack.org/563990

OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.openstack.org/562220
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=8b2c40366b3b65876e5465efae05b171be1bc473
Submitter: Zuul
Branch: master

commit 8b2c40366b3b65876e5465efae05b171be1bc473
Author: Jakub Libosvar <email address hidden>
Date: Wed Apr 18 10:25:01 2018 +0000

    ovs-fw: Apply openflow rules immediately during update

    Because update operation updates openflow rules three times:
     1) New rules with new cookie
     2) Delete old rules with old cookie
     3) Change new cookie back to old cookie

    and the step 2) uses --strict parameter, it's needed to apply rules
    before deleting the old rules because --strict parameter cannot be
    combined with non-strict. This patch applies openflow rules after
    step 1), then --strict rules in step 2 are applied right away and then
    rest of delete part from 2) and all new rules from 3) are applied
    together.

    This patch adds optional interval parameter to Pinger class which sends
    more ICMP packets per second in the firewall blink tests to increase a
    chance of sending a packet while firewall is in inconsistent state.

    Change-Id: I25d9c87225feda1b5ddd442dd01529424186e05b
    Closes-bug: #1708731
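
The ordering described in this commit message can be sketched with a toy deferred queue. This is only a model of the batching idea: ToyDeferredBridge, its methods and the operation labels are hypothetical and do not reflect the actual neutron classes or ovs-ofctl invocations.

```python
class ToyDeferredBridge:
    """Queues flow operations and records each batch that gets applied."""

    def __init__(self):
        self.pending = []
        self.batches = []

    def queue(self, op, flow):
        self.pending.append((op, flow))

    def apply(self):
        """Send everything queued so far as one batch."""
        if self.pending:
            self.batches.append(self.pending)
            self.pending = []

    def strict_delete(self, flow):
        # A strict delete is sent on its own, right away, because it
        # cannot share a batch with non-strict operations.
        self.batches.append([("del --strict", flow)])


def update(bridge, new_flows, strict_stale, other_stale):
    for flow in new_flows:            # step 1: new rules with the new cookie
        bridge.queue("add", flow)
    bridge.apply()                    # applied before any delete happens
    for flow in strict_stale:         # step 2, strict part: applied right away
        bridge.strict_delete(flow)
    for flow in other_stale:          # step 2, remaining deletes
        bridge.queue("del", flow)
    for flow in new_flows:            # step 3: change cookie back
        bridge.queue("mod cookie", flow)
    bridge.apply()                    # rest of step 2 and step 3 together
    return bridge.batches


bridge = ToyDeferredBridge()
batches = update(bridge, ["gre", "tcp"], ["icmp"], ["udp"])
for batch in batches:
    print(batch)
```

Applying the first batch before the deletes is what guarantees that the new rules are already in place when the old ones disappear.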

Changed in neutron:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/queens)

Reviewed: https://review.openstack.org/563990
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=f68e7822f46ec9577e5a866a67cc54b777240486
Submitter: Zuul
Branch: stable/queens

commit f68e7822f46ec9577e5a866a67cc54b777240486
Author: Jakub Libosvar <email address hidden>
Date: Wed Apr 18 10:25:01 2018 +0000

    ovs-fw: Apply openflow rules immediately during update

    Because update operation updates openflow rules three times:
     1) New rules with new cookie
     2) Delete old rules with old cookie
     3) Change new cookie back to old cookie

    and the step 2) uses --strict parameter, it's needed to apply rules
    before deleting the old rules because --strict parameter cannot be
    combined with non-strict. This patch applies openflow rules after
    step 1), then --strict rules in step 2 are applied right away and then
    rest of delete part from 2) and all new rules from 3) are applied
    together.

    This patch adds optional interval parameter to Pinger class which sends
    more ICMP packets per second in the firewall blink tests to increase a
    chance of sending a packet while firewall is in inconsistent state.

    Change-Id: I25d9c87225feda1b5ddd442dd01529424186e05b
    Closes-bug: #1708731

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 12.0.2

This issue was fixed in the openstack/neutron 12.0.2 release.

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 13.0.0.0b2

This issue was fixed in the openstack/neutron 13.0.0.0b2 development milestone.

Yang Li (yang-li) wrote :

I found another problem where refreshing the firewall makes TCP connections unstable.
Environment: tenant network in VLAN mode.
These are my reproduction steps:
1. Create a security group named test with 6 rules allowing icmp/tcp/udp ingress and egress for CIDR 0.0.0.0/0, plus a rule allowing icmp ingress from remote security group test.
2. Create a network and a subnet, both named test.
3. Create 4 VMs (vm1, vm2, vm3, vm4) on compute node-2 with network test and SG test.
4. Create 2 VMs (vm5, vm6) on compute node-3 with network test and SG test.
5. Create a large file in vm5: dd if=/dev/zero of=/mnt/test.img bs=1G count=15
6. Copy the large file from vm5 into vm1-4: scp <vm5-ip>:/mnt/test.img /mnt/
7. Edit vm6's security groups and remove the test SG.
8. Tail openvswitch-agent.log on node-1; you will see "Refresh firewall rules" printed.
9. Log in to vm1-4; you will find that the scp process has become stalled.

You can repeat steps 6-7 many times to reproduce the problem.

It seems that refreshing the OpenFlow rules makes TCP connections unstable.

OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/pike)

Fix proposed to branch: stable/pike
Review: https://review.openstack.org/651223

OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (stable/pike)

Change abandoned by Oleg Bondarev (<email address hidden>) on branch: stable/pike
Review: https://review.openstack.org/651223
Reason: Requires more patches to be back-ported, including commit 7bd8b37e3863aca2d6cb0195e1df5068b8bfe497 which is hardly back-portable
