conntrack race can blackhole flows to Floating IP

Bug #1689952 reported by Simon Leinen
This bug affects 2 people
Affects: neutron
Importance: Undecided
Assigned to: Unassigned

Bug Description

We have some users who want to receive continuous unidirectional flows of UDP-over-IPv4 datagrams on their instances (sent by some sort of sensors) via Floating IP. After we migrate or restart the Neutron routers serving those instances, the users complain that their instances stop receiving those packets.

After debugging this for a long time, we have observed that there are incorrect conntrack entries for those flows in the router's namespace. Apparently these conntrack entries don't NAT the Floating IP to the instance's Fixed IP. When we delete the conntrack entries, they are quickly replaced with the correct entries, and the instance starts receiving traffic again.

  $ sudo ip netns exec qrouter-fe77a8ff-769b-4469-8490-1d37873a5671 conntrack -L -d 192.0.2.67
  ...
  udp 17 29 src=192.0.2.7 dst=192.0.2.67 sport=58254 dport=12345 [UNREPLIED] src=192.0.2.67 dst=192.0.2.7 sport=12345 dport=58254 mark=0 use=1
  ...

Note that the reply "src" is identical to the original "dst" (192.0.2.67), i.e. the entry records no translation of the Floating IP to the Fixed IP.

After deleting the entries (sudo ip netns exec ... conntrack -D -d 192.0.2.67), the (new) entries look like this:

  $ sudo ip netns exec qrouter-fe77a8ff-769b-4469-8490-1d37873a5671 conntrack -L -d 192.0.2.67
  ...
  udp 17 29 src=192.0.2.7 dst=192.0.2.67 sport=58254 dport=12345 [UNREPLIED] src=10.0.0.107 dst=192.0.2.7 sport=12345 dport=58254 mark=0 use=1
  ...

These entries are much better, because the reply "src" is now the Fixed IP of the instance (10.0.0.107), which means that packets sent to the Floating IP are translated to the Fixed IP.
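
The difference between the two kinds of entries can be checked mechanically. The following is a minimal sketch, assuming the `conntrack -L` line format shown above (original tuple first, reply tuple second); the function name `find_unnatted` is hypothetical and not part of conntrack or Neutron. It prints only those entries whose reply "src" equals the original "dst", i.e. entries for which no DNAT was recorded.

```shell
# Hypothetical helper: reads `conntrack -L` output on stdin and prints the
# entries whose reply "src" equals the original "dst" (no DNAT recorded).
# Assumes each entry lists the original tuple first and the reply tuple second.
find_unnatted() {
  awk '{
    n_src = 0; n_dst = 0
    for (i = 1; i <= NF; i++) {
      if ($i ~ /^src=/) src[++n_src] = substr($i, 5)
      if ($i ~ /^dst=/) dst[++n_dst] = substr($i, 5)
    }
    # src[1]/dst[1]: original direction, src[2]/dst[2]: reply direction
    if (n_src == 2 && n_dst == 2 && src[2] == dst[1]) print
  }'
}
```

It could be used as `sudo ip netns exec qrouter-... conntrack -L -d 192.0.2.67 | find_unnatted`. The check is only meaningful when the listing is restricted to a Floating IP, as here with `-d`, because entries for traffic that is legitimately not NATed would match as well.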

We assume that there is a race condition: When packets for a given Floating IP arrive at the router namespace before the NAT rules(?) for that Floating IP have been completely set up, conntrack creates these incorrect entries. This is likely if these packets arrive at a high rate (we have hundreds of those packets per second). And the incorrect entries will never time out if the traffic flows continuously.
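
The manual workaround described above (deleting the entries so that they are recreated correctly) can be scripted for several Floating IPs at once. This is only a sketch of that workaround, not a fix for the suspected race; the function name `flush_fip_conntrack` and the `DRY_RUN` variable are hypothetical. It uses the same `conntrack -D -d` invocation as above and therefore also deletes correct entries for the given addresses, which we would expect to be recreated by subsequent traffic.

```shell
# Hypothetical helper: delete all conntrack entries destined to each given
# Floating IP inside one router namespace.
# Usage: flush_fip_conntrack <namespace> <floating-ip>...
# Set DRY_RUN=1 to print the commands instead of running them.
flush_fip_conntrack() {
  ns=$1
  shift
  for fip in "$@"; do
    if [ "${DRY_RUN:-0}" = 1 ]; then
      echo "ip netns exec $ns conntrack -D -d $fip"
    else
      sudo ip netns exec "$ns" conntrack -D -d "$fip"
    fi
  done
}
```

For example, `DRY_RUN=1; flush_fip_conntrack qrouter-fe77a8ff-769b-4469-8490-1d37873a5671 192.0.2.67` prints the command that would be run after a router migration or restart.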

We have observed this frequently over the years, including recently after we upgraded our network nodes to Newton.
