Occasional network interruption with mark=1 in conntrack

Bug #1719769 reported by Jesse
This bug affects 1 person
Affects    Status    Importance    Assigned to    Milestone
neutron    New       Undecided     Unassigned

Bug Description

If a VM port's security group rules are updated frequently while network traffic is heavy, the OvS security group flows can wrongly set the conntrack mark to 1 and block the VM's network connectivity.

Suppose there are two VMs, VM A (192.168.111.234) and VM B (192.168.111.233), and B allows ping from A.
We ping B from A continuously.
There will be one conntrack entry on VM B's compute host:
icmp 1 29 src=192.168.111.234 dst=192.168.111.233 type=8 code=0 id=29697 src=192.168.111.233 dst=192.168.111.234 type=0 code=0 id=29697 mark=0 zone=1 use=2

The issue is hard to reproduce in the normal way, so I simulate it.
There is one precondition to note:
If the SG rules change on a port, the SG flows for this port are recreated.
Although all SG flows for the port are submitted to OvS in one shot with the
command 'ovs-ofctl add-flows', the flows are actually installed in OvS
one by one.

It is hard to reproduce this issue without modifying the code,
so I disabled security group defer in the code to simulate it (changed here: https://github.com/openstack/neutron/blob/master/neutron/agent/securitygroups_rpc.py#L132).

Then I start neutron-openvswitch-agent with a breakpoint at https://github.com/openstack/neutron/blob/master/neutron/agent/linux/openvswitch_firewall/firewall.py#L1004

Now we get a mark=1 conntrack entry on VM B's compute host:
icmp 1 29 src=192.168.111.234 dst=192.168.111.233 type=8 code=0 id=29697 src=192.168.111.233 dst=192.168.111.234 type=0 code=0 id=29697 mark=1 zone=1 use=1
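
To spot such entries in a dump, a small parser can help. This is only a sketch: it assumes the line layout shown in the two entries above, and the function names parse_conntrack_line and is_marked_invalid are made up for illustration and are not part of neutron.

```python
import re

# Matches key=value tokens such as "src=192.168.111.234" or "mark=1".
_KV = re.compile(r"(\w+)=(\S+)")


def parse_conntrack_line(line):
    """Return the protocol plus the first-seen value of each key=value field."""
    tokens = line.split()
    entry = {"proto": tokens[0]}
    for key, value in _KV.findall(line):
        # src/dst/type/code/id appear twice (original and reply direction);
        # keep the original direction, which is listed first.
        entry.setdefault(key, value)
    return entry


def is_marked_invalid(entry, zone):
    """True if the entry belongs to `zone` and carries mark=1."""
    return entry.get("zone") == str(zone) and entry.get("mark") == "1"
```

Feeding the mark=1 line above through parse_conntrack_line and then is_marked_invalid(entry, 1) flags it, while the earlier mark=0 line is not flagged.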

Even after the port's security group flows are added later, this mark=1 conntrack entry is not deleted; it only goes away when it times out.
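
For anyone who wants to look for (or manually clear) such entries instead of waiting for the timeout, here is a hedged sketch that only builds the command line and does not run it. It assumes the conntrack(8) options -L/-D, -f, -w (zone), -m (mark), -s and -d behave as documented in the man page on your host; build_conntrack_cmd is a hypothetical helper, not something from neutron.

```python
def build_conntrack_cmd(action, zone, mark, src=None, dst=None):
    """Build an argv list for conntrack(8); the caller decides whether to run it."""
    if action not in ("-L", "-D"):
        raise ValueError("action must be -L (list) or -D (delete)")
    # Restrict to IPv4 entries in the given zone that carry the given mark.
    cmd = ["conntrack", action, "-f", "ipv4", "-w", str(zone), "-m", str(mark)]
    if src:
        cmd += ["-s", src]  # original-direction source address
    if dst:
        cmd += ["-d", dst]  # original-direction destination address
    return cmd
```

For the case in this report, build_conntrack_cmd("-L", zone=1, mark=1, dst="192.168.111.233") gives a listing command for VM B; switching the action to "-D" would delete the matching entries, so check the listing first.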

In our OpenStack production environment we hit this issue, and a vital system lost network connectivity.
The cause is that the VM port's security group rules change frequently and the VM's network traffic is heavy.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.openstack.org/507725

Changed in neutron:
assignee: nobody → Jesse (jesse-5)
status: New → In Progress
Revision history for this message
Jesse (jesse-5) wrote :

When a port's security group flows are recreated, there is a brief interruption (https://github.com/openstack/neutron/blob/master/neutron/agent/linux/openvswitch_firewall/firewall.py#L506):

        self.delete_all_port_flows(old_of_port)
        self.initialize_port_flows(of_port)
        self.add_flows_from_rules(of_port)

1. The VM network is not disconnected by delete_all_port_flows(old_of_port).
2. In self.initialize_port_flows(of_port), the first two flows, the identify egress flow and the identify ingress flow, send the VM's packets to table 71 and table 81. At that point table 71 and table 81 contain only the drop flow, so the VM network is disconnected.
Once add_flows_from_rules(of_port) has added all flows for this port, connectivity recovers.
Although flow recreation is fast, there is still a window in which the VM network is disconnected, for example if OvS is slow to add flows.
So I suggest installing the identify egress flow and the identify ingress flow after all other flows have been created.
This leaves the VM without security group protection for a short time, but I think that is better than disconnecting the VM.

And this suggestion will also fix this bug.
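
To illustrate the ordering argument, here is a toy model in Python. It is not the OvS firewall driver: the three flow names and the verdict function are simplifications I made up, assuming only what is described above (identify flows steer packets into the firewall tables, where only the drop flow matches until the rule flows exist).

```python
IDENTIFY = "identify"    # steers the port's packets into the firewall tables
DROP = "default-drop"    # fallback flow in the firewall tables
ALLOW = "allow-icmp"     # stands in for the port's security group rule flows


def verdict(installed):
    """Classify a packet given the set of flows installed so far."""
    if IDENTIFY not in installed:
        return "unfiltered"  # packet never enters the firewall tables
    if ALLOW in installed:
        return "accepted"
    return "dropped"         # only the default drop flow can match


def replay(order):
    """Install flows one at a time and record the verdict after each step."""
    installed, seen = set(), []
    for flow in order:
        installed.add(flow)
        seen.append(verdict(installed))
    return seen
```

In this model, replay([IDENTIFY, DROP, ALLOW]) passes through "dropped" states before reaching "accepted", while replay([DROP, ALLOW, IDENTIFY]) passes through "unfiltered" states instead, which is the trade-off described above.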

Revision history for this message
Brian Haley (brian-haley) wrote :

This looks like a duplicate of https://bugs.launchpad.net/neutron/+bug/1708731, for which a fix has been released. Can you confirm that in your environment?

Revision history for this message
Slawek Kaplonski (slaweq) wrote : auto-abandon-script

This bug has had a related patch abandoned and has been automatically un-assigned due to inactivity. Please re-assign yourself if you are continuing work or adjust the state as appropriate if it is no longer valid.

Changed in neutron:
assignee: Jesse (jesse-5) → nobody
status: In Progress → New
tags: added: timeout-abandon
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (master)

Change abandoned by Slawek Kaplonski (<email address hidden>) on branch: master
Review: https://review.openstack.org/507725
Reason: This review is > 4 weeks without comment, and failed Jenkins the last time it was checked. We are abandoning this for now. Feel free to reactivate the review by pressing the restore button and leaving a 'recheck' comment to get fresh test results.
