DVR Connection to external network lost when associating a floating IP

Bug #1456624 reported by Itzik Brown on 2015-05-19
This bug affects 1 person
Affects: neutron
Status: Won't Fix
Importance: Medium
Assigned to: Unassigned

Bug Description

In DVR, when a floating IP is associated with a port, existing connections (ssh or ping) from that port to the external network hang and become unresponsive.

The affected connections may be any TCP, UDP, or ICMP connections that are tracked in conntrack.

Setup: a distributed router with interfaces on an internal network and an external network.

When launching an instance, pinging an external address from it, and then associating a floating IP with the instance, the connection is lost, i.e. the ping fails.
When the ping command is run again, it succeeds.

Version
======
RHEL 7.1
python-nova-2015.1.0-3.el7ost.noarch
python-neutron-2015.1.0-1.el7ost.noarch

How to reproduce
==============
1. Create a distributed router and attach an internal and an external network to it.
    # neutron router-create --distributed True router1
    # neutron router-interface-add router1 <subnet1 id>
    # neutron router-gateway-set router1 <external network id>

2. Launch an instance on the internal network.
    # nova boot --flavor m1.small --image fedora --nic net-id=<internal network id> vm1

3. Go to the console of the instance and ping an external address, leaving the ping running:
     # ping 8.8.8.8

4. Associate a floating IP with the instance:
     # nova floating-ip-associate vm1 <floatingip-address>

5. Verify that the ping fails.

Changed in neutron:
assignee: nobody → venkata anil (anil-venkata)
tags: added: l3-dvr-backlog
Changed in neutron:
importance: Undecided → Medium
Changed in neutron:
status: New → In Progress
description: updated
venkata anil (anil-venkata) wrote :

Change https://review.openstack.org/#/c/199196/ submitted to fix this bug.

Change abandoned by venkata anil (<email address hidden>) on branch: master
Review: https://review.openstack.org/196054
Reason: Submitted alternate solution https://review.openstack.org/#/c/199196/ to avoid connection loss.

Ryan Moats (rmoats) wrote :

Unable to reproduce this issue with devstack on Trusty Tahr, so I'm unconvinced this is a real bug.

venkata anil (anil-venkata) wrote :

This issue is always reproducible with devstack.
I followed steps 1 to 4 in the bug description, in the same order.
Ping the external network from the VM first; the ping will succeed. Don't kill this ping; let it run continuously.
Then associate a floating IP with this VM.
After this operation the running ping hangs.

So the existing connection is lost.

Ryan Moats (rmoats) wrote :

Changed the summary because it is a bit misleading, and I'm still not sure it is a real defect.

The connection set up is an SNAT connection going through the network node, while the FIP gets assigned on the compute node the instance is running on. I'm not sure how you expect that to continue working after the FIP is assigned.

Side note: disassociating the FIP leads the SNAT connection to pick up again.

venkata anil (anil-venkata) wrote :

I am not sure the user really cares whether a VM reaches the external network through the SNAT or FIP namespace.
But with the existing code, the user will see "existing" connections from the VM hang when a floating IP is associated.

The issue affects only "existing" connections, i.e. the VM has already initiated a connection to the external network and it has a successful conntrack entry.

To avoid existing connections hanging when a floating IP is associated, we can consider two solutions:
1) Break all existing connections and let the user manually reconnect to the external world.
     I have proposed a change for this: https://review.openstack.org/#/c/196054/ .
     But Oleg Bondarev, Ryan Moats, Armando Migliaccio, and Kevin Benton were not OK with this approach and strongly favoured continuing the existing connections.

2) Continue the existing connections.
    As many reviewers favoured continuing the existing connections, I proposed https://review.openstack.org/#/c/199196

venkata anil (anil-venkata) wrote :

Problem:-

In DVR, when a floating IP is associated with a port, existing connections to the external network hang and become unresponsive.
The affected connections may be any TCP, UDP, or ICMP connections that are tracked in conntrack.

Packet routing to external network before assigning floating ip:-
Before a floating IP is assigned, existing connections use the SNAT gateway to reach the external network because of ip rule [2].

Packet routing to external network after assigning floating ip:-
When a floating IP is associated, packets to the external network are routed through the fip namespace because of ip rule [3].
Rule [3] has higher priority than [2]. The SNAT iptables rule [4] is then applied to the packets.
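
The rule ordering described above can be illustrated with a small Python model. This is only a sketch of how Linux policy routing picks a table (rules are checked in ascending priority number, and the first match wins); the rule values are copied from [2] and [3], and `select_table` is a hypothetical helper, not a Neutron or iproute2 API.

```python
import ipaddress

# (priority, source selector, routing table), copied from [2] and [3].
SNAT_RULE = (167772161, "10.0.0.1/24", "167772161")
FIP_RULE = (32777, "10.0.0.8", "16")


def select_table(rules, src):
    """Return the table of the first matching rule; lower priority numbers are checked first."""
    addr = ipaddress.ip_address(src)
    for _priority, selector, table in sorted(rules):
        # strict=False accepts selectors with host bits set, such as 10.0.0.1/24.
        if addr in ipaddress.ip_network(selector, strict=False):
            return table
    return "main"  # simplification: no rule matched


print(select_table([SNAT_RULE], "10.0.0.8"))            # before association: 167772161
print(select_table([SNAT_RULE, FIP_RULE], "10.0.0.8"))  # after association: 16
```

Once the floating IP rule exists, every packet from 10.0.0.8 is sent to table 16, including packets that belong to connections opened earlier through the SNAT gateway.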

Root cause of the issue:-
For existing connections to the external network, however, only ip rule [3] is applied;
SNAT [4] is not applied because of the existing connection tracking entry [5].

So these packets enter the fip namespace and go out to the external network with the local IP as source and never get reply packets; hence the
connections remain hung.

[2] 167772161: from 10.0.0.1/24 lookup 167772161
    ip route show table 167772161
    default via 10.0.0.7 dev qr-8804a198-ea
[3] 32777: from 10.0.0.8 lookup 16
[4] SNAT all -- * * 10.0.0.8 0.0.0.0/0 to:172.168.1.16
[5] icmp 1 29 src=10.0.0.8 dst=8.8.8.8 [UNREPLIED] src=8.8.8.8 dst=10.0.0.8 mark=0 use=1
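
The reason SNAT [4] is skipped can be sketched as follows. Netfilter evaluates NAT rules only for the first packet of a connection and reuses the mapping stored in the conntrack entry for later packets. The toy model below assumes that behaviour; the `Conntrack` class and the flow tuple (source, destination, ICMP id) are hypothetical simplifications, not real kernel or Neutron structures.

```python
class Conntrack:
    """Toy model of one namespace's connection tracking table."""

    def __init__(self):
        self.entries = {}     # flow -> source address used on the wire
        self.snat_rules = {}  # original source -> translated source

    def send(self, flow):
        """Return the source address a packet of `flow` leaves with."""
        if flow not in self.entries:
            # First packet of a flow: evaluate the SNAT rules and record the result.
            src = flow[0]
            self.entries[flow] = self.snat_rules.get(src, src)
        # Later packets reuse the recorded mapping; the rules are not re-evaluated.
        return self.entries[flow]


router_ns = Conntrack()
running_ping = ("10.0.0.8", "8.8.8.8", 1)  # started before the floating IP
router_ns.send(running_ping)               # creates an untranslated entry, as in [5]

router_ns.snat_rules["10.0.0.8"] = "172.168.1.16"  # rule [4] appears on association

print(router_ns.send(running_ping))                # still 10.0.0.8
print(router_ns.send(("10.0.0.8", "8.8.8.8", 2)))  # 172.168.1.16
```

In this model the running ping keeps its untranslated entry, which matches the hang in the report, while a restarted ping creates a new entry and is translated, which matches the observation that running the ping command again succeeds.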

Change abandoned by Armando Migliaccio (<email address hidden>) on branch: master
Review: https://review.openstack.org/199196
Reason: This review is > 4 weeks without comment, and failed Jenkins the last time it was checked. We are abandoning this for now. Feel free to reactivate the review by pressing the restore button and leaving a 'recheck' comment to get fresh test results.

Needs a new owner

Changed in neutron:
assignee: venkata anil (anil-venkata) → nobody
status: In Progress → Incomplete
venkata anil (anil-venkata) wrote :

I have to only resolve merge conflicts. I will do that.

Changed in neutron:
assignee: nobody → venkata anil (anil-venkata)
status: Incomplete → In Progress

Change abandoned by Armando Migliaccio (<email address hidden>) on branch: master
Review: https://review.openstack.org/199196
Reason: This review is > 4 weeks without comment, and failed Jenkins the last time it was checked. We are abandoning this for now. Feel free to reactivate the review by pressing the restore button and leaving a 'recheck' comment to get fresh test results.

Changed in neutron:
status: In Progress → Incomplete
assignee: venkata anil (anil-venkata) → nobody
status: Incomplete → Confirmed
Carl Baldwin (carl-baldwin) wrote :

In my review of the patch, I stated that I think the cure is much worse than the problem. I don't think anyone has chimed in to change my mind and so I'm marking this as won't fix. Ping me if you think it should be fixed.

Changed in neutron:
status: Confirmed → Won't Fix