Duplicate IP addresses on LSPs in the same LS lead to unpredictable flow output

Bug #1961046 reported by Frode Nordahl
Affects        Status  Importance  Assigned to  Milestone
neutron        New     Undecided   Unassigned
ovn (Ubuntu)   New     Undecided   Unassigned

Bug Description

In a busy CI cloud we ran into an issue that took us days to get to the bottom of.

Essentially what happens is that the CMS is sometimes unable to complete deletion of an LSP (being worked on separately [0]).

Subsequently a new LSP is created using the same IP address but a different MAC address. Duplicate IP checks exist in the ovn-nbctl administration tool [1], but the current database schema does not apply any constraint preventing a CMS from adding duplicates, and despite [1] stating that ovn-northd should log this, it does not appear to do so.

When a system is in the above described state the flow output on hypervisors is unpredictable, and this leads to instances not being able to reach the router nor the outside world.

What happens inside ovn-controller is that the function looking up installed flows [2] matches only on table, priority, and miniflow, and does not consider the possibility of other collisions.

Subsequently, an attempt to link a desired flow to an installed flow unexpectedly fails, and the flow is not processed at all [3].
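To make the collision concrete, here is a minimal Python model of a lookup keyed only on (table, priority, match) while ignoring actions. This is illustrative only, not the actual C code in [2]; the data shapes are made up for the example:

```python
# Two desired flows that differ only in their actions collide on the
# same (table, priority, match) key, mirroring the duplicate-IP case.
desired_flows = [
    {"table": 15, "priority": 100,
     "match": 'outport == "ls1" && reg0 == 192.168.1.11',
     "actions": "eth.dst = f0:00:00:01:01:01; next;"},
    {"table": 15, "priority": 100,
     "match": 'outport == "ls1" && reg0 == 192.168.1.11',
     "actions": "eth.dst = f0:00:00:01:01:02; next;"},
]

installed = {}
for flow in desired_flows:
    key = (flow["table"], flow["priority"], flow["match"])  # actions excluded
    # Only the first flow for a key "wins"; which one that is depends on
    # processing order, so the programmed MAC appears to be chosen at random.
    installed.setdefault(key, flow)

assert len(installed) == 1
```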

At the very least I think we need to investigate why this situation is not logged by ovn-northd, but I don't think logging alone is enough.

Is there something we can change in the database model that would prevent a CMS from getting itself into this situation in the first place?

Should we also consider not generating the duplicate lflows?

Example lflows:
  table=15(lr_in_arp_resolve ), priority=500 , match=(ip4.mcast || ip6.mcast), action=(next;)
  table=15(lr_in_arp_resolve ), priority=100 , match=(outport == "R1_R2" && reg0== 20.0.0.2), action=(eth.dst = 00:00:00:02:03:05; next;)
  table=15(lr_in_arp_resolve ), priority=100 , match=(outport == "R1_R2" && xxreg0 == fe80::200:ff:fe02:305), action=(eth.dst = 00:00:00:02:03:05; next;)
  table=15(lr_in_arp_resolve ), priority=100 , match=(outport == "ls1" && reg0 == 192.168.1.11), action=(eth.dst = f0:00:00:01:01:01; next;)
  table=15(lr_in_arp_resolve ), priority=100 , match=(outport == "ls1" && reg0 == 192.168.1.11), action=(eth.dst = f0:00:00:01:01:02; next;)
  table=15(lr_in_arp_resolve ), priority=100 , match=(outport == "ls1" && reg0 == 192.168.1.11), action=(eth.dst = f0:00:00:01:01:03; next;)
  table=15(lr_in_arp_resolve ), priority=100 , match=(outport == "ls1" && reg0 == 192.168.1.11), action=(eth.dst = f0:00:00:01:01:04; next;)
  table=15(lr_in_arp_resolve ), priority=100 , match=(outport == "ls1" && reg0 == 192.168.1.11), action=(eth.dst = f0:00:00:01:01:05; next;)
  table=15(lr_in_arp_resolve ), priority=100 , match=(outport == "ls1" && reg0 == 192.168.1.11), action=(eth.dst = f0:00:00:01:01:06; next;)
  table=15(lr_in_arp_resolve ), priority=100 , match=(outport == "ls1" && reg0 == 192.168.1.11), action=(eth.dst = f0:00:00:01:01:07; next;)
  table=15(lr_in_arp_resolve ), priority=100 , match=(outport == "ls1" && reg0 == 192.168.1.11), action=(eth.dst = f0:00:00:01:01:08; next;)
  table=15(lr_in_arp_resolve ), priority=100 , match=(outport == "ls1" && reg0 == 192.168.1.11), action=(eth.dst = f0:00:00:01:01:09; next;)
  table=15(lr_in_arp_resolve ), priority=0 , match=(ip4), action=(get_arp(outport, reg0); next;)
  table=15(lr_in_arp_resolve ), priority=0 , match=(ip6), action=(get_nd(outport, xxreg0); next;)
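One possible shape for the "not generating the lflows" option above is a deduplication pass that keeps the first lflow per (table, priority, match) key and logs the rest. This is a hypothetical sketch, not the actual ovn-northd implementation:

```python
import logging

def dedup_lflows(lflows):
    """Keep the first lflow per (table, priority, match) key; log duplicates."""
    seen = set()
    kept = []
    for lf in lflows:
        key = (lf["table"], lf["priority"], lf["match"])
        if key in seen:
            # This is also where the missing ovn-northd log message would go.
            logging.warning("dropping duplicate lflow %r with actions %r",
                            lf["match"], lf["actions"])
            continue
        seen.add(key)
        kept.append(lf)
    return kept

# Two of the duplicate lflows from the example above.
lflows = [
    {"table": 15, "priority": 100,
     "match": 'outport == "ls1" && reg0 == 192.168.1.11',
     "actions": "eth.dst = f0:00:00:01:01:01; next;"},
    {"table": 15, "priority": 100,
     "match": 'outport == "ls1" && reg0 == 192.168.1.11',
     "actions": "eth.dst = f0:00:00:01:01:02; next;"},
]
assert len(dedup_lflows(lflows)) == 1
```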

Resulting flow table:
 cookie=0xe683b51c, duration=0.128s, table=23, n_packets=0, n_bytes=0, idle_age=0, priority=500,ipv6,metadata=0x2,dl_dst=33:33:00:00:00:00/ff:ff:00:00:00:00,ipv6_dst=ff00::/8 actions=resubmit(,24)
 cookie=0xe683b51c, duration=0.128s, table=23, n_packets=0, n_bytes=0, idle_age=0, priority=500,ipv6,metadata=0x1,dl_dst=33:33:00:00:00:00/ff:ff:00:00:00:00,ipv6_dst=ff00::/8 actions=resubmit(,24)
 cookie=0xe683b51c, duration=0.128s, table=23, n_packets=0, n_bytes=0, idle_age=0, priority=500,ip,metadata=0x2,dl_dst=01:00:00:00:00:00/01:00:00:00:00:00,nw_dst=224.0.0.0/4 actions=resubmit(,24)
 cookie=0xe683b51c, duration=0.128s, table=23, n_packets=0, n_bytes=0, idle_age=0, priority=500,ip,metadata=0x1,dl_dst=01:00:00:00:00:00/01:00:00:00:00:00,nw_dst=224.0.0.0/4 actions=resubmit(,24)
 cookie=0xed4bbf00, duration=0.120s, table=23, n_packets=0, n_bytes=0, idle_age=0, priority=100,reg0=0xfe800000,reg1=0,reg2=0x20000ff,reg3=0xfe020304,reg15=0x2,metadata=0x2 actions=set_field:00:00:00:02:03:04->eth_dst,resubmit(,24)
 cookie=0x601769fe, duration=0.114s, table=23, n_packets=0, n_bytes=0, idle_age=0, priority=100,reg0=0xfe800000,reg1=0,reg2=0x20000ff,reg3=0xfe020305,reg15=0x2,metadata=0x1 actions=set_field:00:00:00:02:03:05->eth_dst,resubmit(,24)
 cookie=0x92b4073a, duration=0.118s, table=23, n_packets=0, n_bytes=0, idle_age=0, priority=100,reg0=0x14000001,reg15=0x2,metadata=0x2 actions=set_field:00:00:00:02:03:04->eth_dst,resubmit(,24)
 cookie=0x87dacd46, duration=0.118s, table=23, n_packets=0, n_bytes=0, idle_age=0, priority=100,reg0=0x14000002,reg15=0x2,metadata=0x1 actions=set_field:00:00:00:02:03:05->eth_dst,resubmit(,24)
 cookie=0xa7d4ecf1, duration=0.113s, table=23, n_packets=0, n_bytes=0, idle_age=0, priority=100,reg0=0xc0a8010b,reg15=0x1,metadata=0x1 actions=set_field:f0:00:00:01:01:04->eth_dst,resubmit(,24)
 cookie=0xb595291d, duration=0.128s, table=23, n_packets=0, n_bytes=0, idle_age=0, priority=0,ip,metadata=0x2 actions=push:NXM_NX_REG0[],push:NXM_NX_XXREG0[96..127],pop:NXM_NX_REG0[],set_field:00:00:00:00:00:00->eth_dst,resubmit(,66),pop:NXM_NX_REG0[],resubmit(,24)
 cookie=0xb595291d, duration=0.128s, table=23, n_packets=0, n_bytes=0, idle_age=0, priority=0,ip,metadata=0x1 actions=push:NXM_NX_REG0[],push:NXM_NX_XXREG0[96..127],pop:NXM_NX_REG0[],set_field:00:00:00:00:00:00->eth_dst,resubmit(,66),pop:NXM_NX_REG0[],resubmit(,24)
 cookie=0xb7e18e6b, duration=0.127s, table=23, n_packets=0, n_bytes=0, idle_age=0, priority=0,ipv6,metadata=0x2 actions=set_field:00:00:00:00:00:00->eth_dst,resubmit(,66),resubmit(,24)
 cookie=0xb7e18e6b, duration=0.127s, table=23, n_packets=0, n_bytes=0, idle_age=0, priority=0,ipv6,metadata=0x1 actions=set_field:00:00:00:00:00:00->eth_dst,resubmit(,66),resubmit(,24)
 cookie=0xb60abbb8, duration=0.125s, table=23, n_packets=0, n_bytes=0, idle_age=0, priority=0,metadata=0x4 actions=resubmit(,24)
 cookie=0xb60abbb8, duration=0.125s, table=23, n_packets=0, n_bytes=0, idle_age=0, priority=0,metadata=0x3 actions=resubmit(,24)

As you can see, the MAC address to program into the flow is chosen at "random":

idle_age=0, priority=100,reg0=0xc0a8010b,reg15=0x1,metadata=0x1 actions=set_field:f0:00:00:01:01:04->eth_dst,resubmit(,24)

0: https://review.opendev.org/c/openstack/neutron/+/827834
1: https://github.com/ovn-org/ovn/commit/738295605b44e878ab6ddbcf3068354988a1c82e
2: https://github.com/ovn-org/ovn/blob/ed81be75e8b3b33745eeb9b6ce2686b87ef72cd0/controller/ofctrl.c#L1506-L1523
3: https://github.com/ovn-org/ovn/blob/ed81be75e8b3b33745eeb9b6ce2686b87ef72cd0/controller/ofctrl.c#L1973-L1976

Tags: ovn
Frode Nordahl (fnordahl) wrote:

Adding the OpenStack Neutron project to this bug after a discussion with the OVN team [4], which suggests that, although the current OVN DB schema allows duplicates to occur, it is the CMS's responsibility to avoid them.

There is a related Neutron bug 1960006 that attempted to address a similar situation with stale ports in the OVN DB; however, it does not appear to be sufficient to avoid this situation.

I wonder if Neutron should do a pre-flight check along the lines of what the ``ovn-nbctl`` tool does [5].
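A pre-flight check in that spirit could scan the other LSPs on the same logical switch before committing a new port. The sketch below is hedged: the data shapes and function name are hypothetical, not the Neutron or ovn-nbctl API; it only assumes the documented format of the LSP ``addresses`` column (MAC followed by IPs):

```python
def find_conflicting_port(switch_ports, candidate_ip):
    """Return the name of an existing LSP on the switch that already
    claims candidate_ip, or None.

    Each LSP's 'addresses' column holds strings such as
    "f0:00:00:01:01:01 192.168.1.11" (MAC address followed by IPs).
    """
    for port in switch_ports:
        for addr in port.get("addresses", []):
            fields = addr.split()
            if candidate_ip in fields[1:]:  # skip the leading MAC
                return port["name"]
    return None

# A stale port left behind by the failed deletion still claims the IP,
# so creating a new port with the same IP would be rejected here.
ports = [{"name": "lsp-stale",
          "addresses": ["f0:00:00:01:01:01 192.168.1.11"]}]
assert find_conflicting_port(ports, "192.168.1.11") == "lsp-stale"
assert find_conflicting_port(ports, "192.168.1.12") is None
```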

4: https://mail.openvswitch.org/pipermail/ovs-discuss/2022-February/051738.html
5: https://github.com/ovn-org/ovn/blob/ed81be75e8b3b33745eeb9b6ce2686b87ef72cd0/utilities/ovn-nbctl.c#L1392-L1433
