Allowed address pairs and dvr routers

Bug #1998235 reported by Alexandre Perreault
This bug affects 2 people
Affects: neutron | Status: New | Importance: Undecided | Assigned to: Unassigned

Bug Description

Hi,

I would like to report an issue with neutron port allowed address pairs and DVR routers.

We are currently running Yoga and Ussuri environments.

In Yoga we noticed that if you add an allowed address pair to a neutron port, the DVR router receives a permanent ARP entry for the IP configured in the allowed address pair. This seems to make sense. In Ussuri, the DVR router did not receive an ARP entry for an allowed address pair, so this looks like an improvement.

Where it gets more complicated is if you have two neutron ports with an allowed address pair with the same IP. An example use case would be when you have a VIP.
What I have noticed is that the permanent ARP entry learned by the DVR router will be of the latest updated allowed address pair.
For example, if you add allowed address pair with IP X.X.X.X to neutron port 1, the DVR router will have a permanent ARP entry for IP X.X.X.X with the MAC address of neutron port 1.
Then, if you add the same IP X.X.X.X as an allowed address pair to neutron port 2, the DVR router will now have a permanent ARP entry for IP X.X.X.X with MAC address of neutron port 2.
In a way it makes sense, since you cannot have two ARP entries for the same IP address, but the problem is that the actual VIP could still be on neutron port 1.
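
The overwrite described above can be illustrated with a toy model. This is only a sketch of the observed behaviour, not neutron code: the class name, the MAC addresses, and the documentation IP 192.0.2.10 are made up for illustration.

```python
from typing import Dict


class ToyDvrArpTable:
    """Toy stand-in for the router's permanent ARP entries: one MAC per IP."""

    def __init__(self) -> None:
        self.entries: Dict[str, str] = {}

    def add_allowed_address_pair(self, ip: str, port_mac: str) -> None:
        # Last writer wins: a later port with the same IP replaces the entry.
        self.entries[ip] = port_mac


VIP = "192.0.2.10"               # hypothetical VIP
PORT1_MAC = "fa:16:3e:00:00:01"  # hypothetical MAC of neutron port 1
PORT2_MAC = "fa:16:3e:00:00:02"  # hypothetical MAC of neutron port 2

table = ToyDvrArpTable()
table.add_allowed_address_pair(VIP, PORT1_MAC)
table.add_allowed_address_pair(VIP, PORT2_MAC)

# The entry now points at port 2, even if the VIP is really active on port 1.
print(table.entries[VIP])
```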

This problem becomes apparent with Octavia load balancers in active_standby topology. On LB creation, the active and standby instances are created at almost the same time, so there is a 50% chance that the LB does not work, because the DVR router will have the permanent ARP entry pointing to the backup instance instead of the active one, for the reason explained above.

But I think I discovered an even worse problem. Let's say we have the situation I described above. We have neutron port 1 and neutron port 2 and both have an allowed address pair with IP X.X.X.X. Currently, the permanent ARP entry on the DVR router for X.X.X.X is pointing to the MAC of neutron port 2.
If I delete neutron port 2 or remove the allowed address pair from neutron port 2, the permanent ARP entry is erased from the DVR router. And there is no permanent ARP entry for X.X.X.X pointing to neutron port 1. This means traffic won't reach the VIP X.X.X.X located on neutron port 1. This has another impact on an important use case.

Because of the issue with the Octavia active_standby topology, I tried to work around it using the standalone topology. This means there is only one amphora instance. Octavia still uses an allowed address pair, but now it only exists on one neutron port, so the DVR router has the correct permanent ARP entry.
If I fail over the LB for any reason, a new instance is created (a new neutron port is created) and is assigned the allowed address pair. The DVR router correctly learns the new permanent ARP entry pointing to the new port. BUT then the broken instance is deleted, which means the original neutron port is deleted, AND that DELETES the permanent ARP entry on the DVR router even though it was no longer pointing to this port. At this point the LB no longer works because the DVR router does not know how to reach the LB's IP.
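
The failover sequence can be replayed in a few lines. This is a sketch that assumes the entry is removed by IP alone, which matches what I observe but is not taken from the neutron source; the function names, MACs, and IP are hypothetical.

```python
from typing import Dict

arp: Dict[str, str] = {}  # toy ARP table: IP -> MAC


def pair_added(ip: str, mac: str) -> None:
    # Adding a pair installs or overwrites the permanent entry.
    arp[ip] = mac


def pair_removed_observed(ip: str, mac: str) -> None:
    # Assumed observed behaviour: the entry is dropped by IP,
    # without checking whether it still points at `mac`.
    arp.pop(ip, None)


VIP = "192.0.2.10"             # hypothetical LB VIP
OLD_MAC = "fa:16:3e:00:00:01"  # port of the broken amphora
NEW_MAC = "fa:16:3e:00:00:02"  # port of the replacement amphora

pair_added(VIP, OLD_MAC)             # LB created
pair_added(VIP, NEW_MAC)             # failover: new port gets the pair
pair_removed_observed(VIP, OLD_MAC)  # old port deleted

print(arp.get(VIP))  # None: the router has no entry for the VIP any more
```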

I find this very problematic. It means neither Octavia topology works with DVR routers.
I think the permanent ARP entry logic needs to be revised. Maybe when deleting a permanent ARP entry for an allowed address pair IP address, neutron should double check whether that same allowed address pair IP address exists on another neutron port and, if so, update the DVR router with that ARP entry.
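
A minimal sketch of that idea, using plain dictionaries instead of neutron's real data model. The function `remove_pair`, the `pairs_by_port` mapping, and all IDs and MACs are hypothetical names introduced only for illustration.

```python
from typing import Dict, Optional

# port id -> {allowed address pair IP -> MAC}; hypothetical in-memory stand-in
PairsByPort = Dict[str, Dict[str, str]]


def remove_pair(arp: Dict[str, str], pairs_by_port: PairsByPort,
                port_id: str, ip: str) -> Optional[str]:
    """Remove `ip` from `port_id`, then re-point or delete the ARP entry.

    Returns the MAC the entry points at afterwards, or None if it is gone.
    """
    pairs_by_port.get(port_id, {}).pop(ip, None)
    # Double check: does another port still carry the same IP?
    for other_id, pairs in pairs_by_port.items():
        if other_id != port_id and ip in pairs:
            arp[ip] = pairs[ip]  # re-point instead of deleting
            return arp[ip]
    arp.pop(ip, None)            # nobody else uses the IP: really delete
    return None


VIP = "192.0.2.10"
OLD_MAC = "fa:16:3e:00:00:01"
NEW_MAC = "fa:16:3e:00:00:02"

pairs_by_port = {"port-old": {VIP: OLD_MAC}, "port-new": {VIP: NEW_MAC}}
arp = {VIP: NEW_MAC}

result = remove_pair(arp, pairs_by_port, "port-old", VIP)
print(result)  # the entry survives and still points at the new port
```

If several other ports carry the same IP, this sketch simply picks the first one it finds, so it does not address the first issue (which port actually holds the VIP).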

I am unsure how to resolve the first issue I described. There are other bugs related to it, and discussions on how to handle VRRP and similar cases going back a long time. But if we could resolve the second issue, the ARP entry deletion, we could at least use Octavia in standalone mode.

I used Octavia as the main use case, but there are others, and these issues make them hard to handle. They always require manual intervention to fix. These issues are easy to reproduce even without Octavia. Let me know if you have any questions.

Thanks

Slawek Kaplonski (slaweq) wrote :

This is a known bug (see https://bugs.launchpad.net/neutron/+bug/1774459) and it is not easy to fix. There have been attempts to solve it: https://review.opendev.org/c/openstack/neutron/+/601336 but this patch requires (again) an almost complete rewrite, and nobody currently has cycles to work on it.
One thing I can say is that the same scenario should work fine with the ML2/OVN backend, where you can also have distributed routers.

tags: added: l3-dvr-backlog
Alexandre Perreault (alexperreault) wrote :

Thank you for the response.

It's unfortunate that the problem is still not resolved.

I will do more research on ML2/OVN, but we have many production envs with ML2/OVS deployed via Kolla-Ansible and I am unsure how feasible it would be to migrate these production envs to ML2/OVN.

Regards,

Alex

Yusuf Güngör (yusuf2) wrote :

Hi, we have exactly the same Octavia scenario as @alexperreault.

@alexperreault stated 2 issues. I think only the 1st issue is a duplicate of bug #1774459. For the 1st issue, we are aware that there is no easy solution on the code side. That is OK. We can accept using Octavia without HA (Active/Standby).

But at least we hope to use it with Single Mode.

Like @alexperreault mentioned:

`I think the permanent ARP entry logic needs to be revised. Maybe when deleting a permanent ARP entry for an allowed address pair IP address, neutron should double check if that same allowed address pair IP address exists on another neutron port and update the DVR router with this ARP entry.`

This may be easier to handle. What do you think about this, @slaweq?

Now we have to run a helper cron script to repair allowed address pairs and it is not cool.
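
As an illustration only (this is not our actual script), the detection half of such a helper could look like the sketch below. It assumes port dictionaries shaped like Neutron API port responses, with `id` and `allowed_address_pairs` entries carrying `ip_address`; the function name and the sample data are made up.

```python
from collections import defaultdict
from typing import Dict, List, Set


def shared_pair_ips(ports: List[dict]) -> Dict[str, Set[str]]:
    """Map each allowed address pair IP to the port ids that carry it,
    keeping only IPs that appear on more than one port."""
    owners: Dict[str, Set[str]] = defaultdict(set)
    for port in ports:
        for pair in port.get("allowed_address_pairs", []):
            owners[pair["ip_address"]].add(port["id"])
    return {ip: ids for ip, ids in owners.items() if len(ids) > 1}


# Hypothetical sample data in the assumed shape.
ports = [
    {"id": "port-1", "allowed_address_pairs": [{"ip_address": "192.0.2.10"}]},
    {"id": "port-2", "allowed_address_pairs": [{"ip_address": "192.0.2.10"}]},
    {"id": "port-3", "allowed_address_pairs": [{"ip_address": "192.0.2.20"}]},
    {"id": "port-4", "allowed_address_pairs": []},
]

shared = shared_pair_ips(ports)
print(shared)
```

The IPs reported here are the ones whose router ARP entry may be pointing at the wrong port or may go missing after a port is deleted.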

The ML2/OVN backend is not a solution for us either. It is not easy to migrate production clusters from ML2/OVS to ML2/OVN.

Should we warn people in the documentation about these ARP problems before they use DVR with ML2/OVS?

Thanks
