neutron-l3-agent virtual router SNAT translation doesn't work for traffic happening during iptable rules setup (race condition)

Bug #1267931 reported by Miguel Angel Ajo
This bug affects 2 people
Affects   Status    Importance   Assigned to   Milestone
neutron   Expired   Medium       Unassigned

Bug Description

I found a race condition that happens in the following situation:

 1) A network node running neutron-l3-agent and carrying live traffic is rebooted.
 2) While it starts up again, a VM keeps sending traffic (ping is a simple case) to the external network.
 3) As the agent starts, it creates the virtual router namespace qrouter-<ID>, brings up the interfaces (ext+int),
     and sets up the iptables rules.

 4) If traffic hits the router before the SNAT rule is set, the Linux
    connection tracker records the flow untranslated and never runs those
    packets through the SNAT/DNAT rules again (even if the rules are set afterwards). As a result, the internal IP is forwarded "as is", untranslated, into the external network.

 5) If you restart the ping in the VM (ping seq restarts at 0), it starts working.

 6) If you start a different ping while the first one is running, the new ping works, while the old one
     stays in that "limbo state" where it is untranslated.
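
The behaviour in steps 4 to 6 can be illustrated with a toy model. This is only a sketch, not kernel or neutron code: it assumes the NAT rules are consulted for the first packet of a flow only, and that a ping flow is identified by its ICMP echo identifier (which changes when ping is restarted). The class name, field names and all addresses are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple

FlowKey = Tuple[str, str, int]  # (source IP, destination IP, ICMP echo id)


@dataclass
class ToyRouter:
    """Toy model: NAT rules are evaluated only for the first packet of a flow."""
    external_ip: str
    snat_installed: bool = False
    # flow -> translated source IP, or None if the flow was tracked without NAT
    conntrack: Dict[FlowKey, Optional[str]] = field(default_factory=dict)

    def forward(self, src: str, dst: str, echo_id: int) -> str:
        """Return the source IP the packet carries on the external network."""
        key = (src, dst, echo_id)
        if key not in self.conntrack:
            # First packet of the flow: the decision is made now and remembered.
            self.conntrack[key] = self.external_ip if self.snat_installed else None
        translated = self.conntrack[key]
        return translated if translated is not None else src

    def flush_conntrack(self) -> None:
        """Forget all tracked flows so the rules are evaluated again."""
        self.conntrack.clear()


router = ToyRouter(external_ip="203.0.113.10")
before = router.forward("10.0.0.5", "198.51.100.7", echo_id=1)  # rule not set yet
router.snat_installed = True                                    # rule arrives late
old_ping = router.forward("10.0.0.5", "198.51.100.7", echo_id=1)
new_ping = router.forward("10.0.0.5", "198.51.100.7", echo_id=2)
print(before, old_ping, new_ping)
```

In this model the old ping keeps leaving with the internal address, while a ping started after the rule is set is translated, which matches what the tcpdumps below show.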

 Additional information:

  This is the normal condition, where a race condition didn't happen: http://fpaste.org/67388/89372153/
  This is the abnormal condition, where the race condition happened: http://fpaste.org/67389/38937224/ (note the last tcpdump source IP)

  This is the abnormal condition, where we started a new ping to a different host: http://fpaste.org/67393/93725511/ (there are two tcpdumps in parallel)

Tags: l3-ipam-dhcp
Miguel Angel Ajo (mangelajo) wrote :

I believe we could mitigate this race condition in different ways:

1) Invert the order during qrouter setup:
     a) first, set the iptables rules
     b) then, bring up the interfaces
    This way, the iptables rules are all in place before the interfaces can pass any packets.

2) Set a DROP rule for traffic first, then set the actual rules, then remove the DROP barrier
    (not sure if it really mitigates the situation).

3) Use conntrack to clear the kernel connection tracking tables after the rules are set up
     (this could reset any NAT'd connection established between the rules being set and the conntrack clear).
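
As a rough sketch of option 3, the helper below only builds the command lines that would clear connection tracking inside the router namespace; it does not run them. `conntrack -F` (flush the table) and `conntrack -D -s <address>` (delete entries by original source) are options of the conntrack-tools CLI. The function name, its parameters and the idea of limiting the cleanup to a list of internal addresses are assumptions for illustration, not what neutron-l3-agent does.

```python
from typing import List, Optional


def conntrack_cleanup_commands(router_id: str,
                               internal_ips: Optional[List[str]] = None) -> List[List[str]]:
    """Build (but do not run) conntrack cleanup commands for a qrouter namespace."""
    prefix = ["ip", "netns", "exec", f"qrouter-{router_id}", "conntrack"]
    if not internal_ips:
        # Flush the whole table: simple, but resets every tracked connection.
        return [prefix + ["-F"]]
    # Delete only the entries whose original source is one of the given addresses.
    return [prefix + ["-D", "-s", ip] for ip in internal_ips]


# Hypothetical router ID and VM address, for illustration only.
for command in conntrack_cleanup_commands("ROUTER-ID", ["10.0.0.5"]):
    print(" ".join(command))
```

Each returned list could be passed to `subprocess.run` as root once the iptables rules are in place. Deleting by source address narrows the window described above, but connections NAT'd between the rule setup and the cleanup are still reset.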

description: updated
Miguel Angel Ajo (mangelajo) wrote :

rkukura says (in my own words): we must test what happens with DNAT.

   We could have situations where this traffic is not NATed out properly (going
out as the actual floating IP), or is not sent back to the running VM.

description: updated
Miguel Angel Ajo (mangelajo) wrote :

With DNAT:

   before network node restart:
     http://www.fpaste.org/67477/38543513/

   after network node restart:
      http://www.fpaste.org/67475/89385263/

tags: added: l3-ipam-dhcp
removed: condition ha iptables race
Changed in neutron:
importance: Undecided → Medium
status: New → Confirmed
Cedric Brandily (cbrandily) wrote :

This bug has had no activity for more than 365 days. We are unsetting the assignee and milestone and setting the status to Incomplete so that it can expire in 60 days.

If the bug is still valid, please update the bug status.

Changed in neutron:
status: Confirmed → Incomplete
Launchpad Janitor (janitor) wrote :

[Expired for neutron because there has been no activity for 60 days.]

Changed in neutron:
status: Incomplete → Expired