[ovn] Stale ports in the OVN database at churn

Bug #1960006 reported by Daniel Alvarez
This bug affects 2 people
Affects: neutron
Status: Fix Released
Importance: Undecided
Assigned to: Unassigned
Milestone: —

Bug Description

There are situations where, under a lot of control plane activity, OVN ports become stale and won't get cleaned up (unless the neutron-ovn-db-sync tool is run manually).

A possible scenario for this is:

a) Port creation
  a.1) Port created in the Neutron DB
  a.2) Port created in the OVN Northbound (NB) database
  a.3) The NB ovsdb-server notifies all connected workers of the port creation
  a.4) Each worker eventually processes this event and updates its in-memory copy of the NB database

Immediately afterwards, the port gets deleted via the API before step a.4) has been completed by all workers, and the deletion request lands on one of the workers that have not yet updated their in-memory copy of the OVN NB database with the newly created port.

b) Port deletion
  b.1) Port deleted from Neutron DB
  b.2) Port deletion is attempted in OVN NB, but the lookup fails and the port's revision number is deleted [0]

At this point, the port stays stale in the OVN database forever, causing other issues that we have mitigated (e.g. [1]); ultimately, the number of OVN resources may grow to a point where it negatively affects the overall stability and performance of the cluster.
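
A minimal, self-contained sketch of the race described above (all names are illustrative; this is not the actual mechanism driver code):

class Worker:
    """Models an API worker with its own in-memory copy of the OVN NB DB."""

    def __init__(self):
        self.nb_cache = set()   # logical switch ports this worker knows about

    def apply_nb_update(self, port_id):
        # step a.4: the ovsdb-server notification is finally processed
        self.nb_cache.add(port_id)

    def delete_port(self, port_id, revision_numbers):
        # step b.2: look the port up in the (possibly outdated) local cache
        if port_id in self.nb_cache:
            self.nb_cache.discard(port_id)   # port removed from OVN NB
        # the revision row is dropped even though the lookup failed, so the
        # real OVN row created via another worker is never cleaned up
        revision_numbers.pop(port_id, None)


revisions = {"port-1": 1}
fast_worker, slow_worker = Worker(), Worker()
fast_worker.apply_nb_update("port-1")    # this worker saw the creation event
# the slow worker has NOT processed the event when the delete request arrives
slow_worker.delete_port("port-1", revisions)
print(revisions)   # {} -> the revision row is gone while the OVN port remains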

A potential workaround is to run the neutron-ovn-db-sync tool periodically to get rid of the stale entries, but doing so while the API is operational is not recommended.

[0] https://github.com/openstack/neutron/blob/f5030b0bc25216d80b09f7ac3938c9a902b655e3/neutron/plugins/ml2/drivers/ovn/mech_driver/ovsdb/ovn_client.py#L698
[1] https://bugs.launchpad.net/neutron/+bug/1874733

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/neutron/+/827834

Changed in neutron:
status: New → In Progress
tags: added: ovn
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.opendev.org/c/openstack/neutron/+/827834
Committed: https://opendev.org/openstack/neutron/commit/be7331c8169c53e3900c9c1a08e12808cf5ed2ec
Submitter: "Zuul (22348)"
Branch: master

commit be7331c8169c53e3900c9c1a08e12808cf5ed2ec
Author: Daniel Alvarez Sanchez <email address hidden>
Date: Fri Feb 4 11:32:47 2022 +0100

    [ovn] Prevent stale ports in the OVN database

    Under a lot of load, there can be situations where all the Neutron
    workers have not updated their in-memory copy of the NB database
    in time before certain operations.

    This scenario can lead to stale resources when a somewhat recently
    created port is attempted to be deleted, but the worker handling
    this deletion doesn't know about the OVN port yet.

    This patch detects this condition and allows some time (at least one
    maintenance task cycle) before it deletes the OVN revision number.
    If the port then shows up in the OVN database within that window, then
    it will be deleted later by the maintenance task avoiding the stale
    ports. If not, the revision number row will be deleted and we won't
    stale these entries either.

    Closes-Bug: #1960006
    Signed-off-by: Daniel Alvarez Sanchez <email address hidden>
    Change-Id: Ie4093dc6cd63b89e3a62363a4f805ef8287d15b9
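
In rough terms, the patch defers dropping the revision row when the port is not yet visible in the worker's copy of the NB database and lets the maintenance task retry; the sketch below only illustrates that idea, and the names and grace period are assumptions rather than the actual implementation:

import time

GRACE_PERIOD = 600  # assumption: roughly one maintenance task cycle, in seconds

def delete_port(port_id, nb_cache, revisions, deferred):
    if port_id in nb_cache:
        nb_cache.discard(port_id)        # normal path: port found and removed
        revisions.pop(port_id, None)
    else:
        # port not visible in this worker's copy yet: keep the revision row
        # and let the maintenance task re-check instead of dropping it now
        deferred[port_id] = time.time()

def maintenance_cycle(nb_cache, revisions, deferred):
    now = time.time()
    for port_id, first_seen in list(deferred.items()):
        if port_id in nb_cache:
            nb_cache.discard(port_id)    # the port showed up meanwhile: clean it
            revisions.pop(port_id, None)
            del deferred[port_id]
        elif now - first_seen > GRACE_PERIOD:
            revisions.pop(port_id, None) # give up after the grace window
            del deferred[port_id]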

Changed in neutron:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/xena)

Fix proposed to branch: stable/xena
Review: https://review.opendev.org/c/openstack/neutron/+/828758

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/wallaby)

Fix proposed to branch: stable/wallaby
Review: https://review.opendev.org/c/openstack/neutron/+/828796

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/victoria)

Fix proposed to branch: stable/victoria
Review: https://review.opendev.org/c/openstack/neutron/+/828797

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/ussuri)

Fix proposed to branch: stable/ussuri
Review: https://review.opendev.org/c/openstack/neutron/+/828799

Revision history for this message
Frode Nordahl (fnordahl) wrote :

We've been trying this patch to see if it helps avoid stale ports on a busy CI cloud.

Unfortunately it does not appear to resolve the issue in our environment.

As noted in [2], stale ports can have quite adverse side effects when a duplicate IP is registered in the OVN DB, i.e. newly created instances can appear to suddenly lose connectivity to the router and subsequently the outside world (North/South traffic not working).

I wonder if you would accept a patch for the alternative approach of doing a pre-flight check prior to inserting LSPs into the OVN DB, to help avoid introducing duplicate IPs? We could mimic the behavior of the ``ovn-nbctl`` tool [3].

This pre-flight check could also be an entry point for scheduling removal of stale ports when a duplicate IP is unexpectedly found.

2: https://bugs.launchpad.net/ubuntu/+source/ovn/+bug/1961046
3: https://github.com/ovn-org/ovn/blob/ed81be75e8b3b33745eeb9b6ce2686b87ef72cd0/utilities/ovn-nbctl.c#L1392-L1433
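
The check being proposed might look roughly like the following; the data layout and helper name are assumptions for illustration, not existing Neutron code:

def find_duplicate_ips(candidate_ips, existing_lsps):
    """Return LSP names whose addresses already claim any of candidate_ips.

    existing_lsps is assumed to be a mapping of LSP name to its list of
    "<mac> <ip> [<ip> ...]" address strings, as stored in the NB database.
    """
    duplicates = {}
    for name, addresses in existing_lsps.items():
        for entry in addresses:
            clash = set(entry.split()[1:]) & set(candidate_ips)  # drop the MAC
            if clash:
                duplicates.setdefault(name, set()).update(clash)
    return duplicates


lsps = {"stale-port": ["fa:16:3e:aa:bb:cc 10.0.0.5"]}
print(find_duplicate_ips(["10.0.0.5"], lsps))   # {'stale-port': {'10.0.0.5'}}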

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/xena)

Reviewed: https://review.opendev.org/c/openstack/neutron/+/828758
Committed: https://opendev.org/openstack/neutron/commit/d67b05ddea02dbe5d7a0563fa2f37f472c945e40
Submitter: "Zuul (22348)"
Branch: stable/xena

commit d67b05ddea02dbe5d7a0563fa2f37f472c945e40
Author: Daniel Alvarez Sanchez <email address hidden>
Date: Fri Feb 4 11:32:47 2022 +0100

    [ovn] Prevent stale ports in the OVN database

    Under a lot of load, there can be situations where all the Neutron
    workers have not updated their in-memory copy of the NB database
    in time before certain operations.

    This scenario can lead to stale resources when a somewhat recently
    created port is attempted to be deleted, but the worker handling
    this deletion doesn't know about the OVN port yet.

    This patch detects this condition and allows some time (at least one
    maintenance task cycle) before it deletes the OVN revision number.
    If the port then shows up in the OVN database within that window, then
    it will be deleted later by the maintenance task avoiding the stale
    ports. If not, the revision number row will be deleted and we won't
    stale these entries either.

    Closes-Bug: #1960006
    Signed-off-by: Daniel Alvarez Sanchez <email address hidden>
    Change-Id: Ie4093dc6cd63b89e3a62363a4f805ef8287d15b9
    (cherry picked from commit be7331c8169c53e3900c9c1a08e12808cf5ed2ec)

tags: added: in-stable-xena
tags: added: in-stable-wallaby
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/wallaby)

Reviewed: https://review.opendev.org/c/openstack/neutron/+/828796
Committed: https://opendev.org/openstack/neutron/commit/4b06bf4d9fb949a7d816cc619ce4bead28d2850f
Submitter: "Zuul (22348)"
Branch: stable/wallaby

commit 4b06bf4d9fb949a7d816cc619ce4bead28d2850f
Author: Daniel Alvarez Sanchez <email address hidden>
Date: Fri Feb 4 11:32:47 2022 +0100

    [ovn] Prevent stale ports in the OVN database

    Under a lot of load, there can be situations where all the Neutron
    workers have not updated their in-memory copy of the NB database
    in time before certain operations.

    This scenario can lead to stale resources when a somewhat recently
    created port is attempted to be deleted, but the worker handling
    this deletion doesn't know about the OVN port yet.

    This patch detects this condition and allows some time (at least one
    maintenance task cycle) before it deletes the OVN revision number.
    If the port then shows up in the OVN database within that window, then
    it will be deleted later by the maintenance task avoiding the stale
    ports. If not, the revision number row will be deleted and we won't
    stale these entries either.

    Conflicts:
            neutron/plugins/ml2/drivers/ovn/mech_driver/ovsdb/ovn_client.py

    Closes-Bug: #1960006
    Signed-off-by: Daniel Alvarez Sanchez <email address hidden>
    Change-Id: Ie4093dc6cd63b89e3a62363a4f805ef8287d15b9
    (cherry picked from commit be7331c8169c53e3900c9c1a08e12808cf5ed2ec)

tags: added: in-stable-victoria
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/victoria)

Reviewed: https://review.opendev.org/c/openstack/neutron/+/828797
Committed: https://opendev.org/openstack/neutron/commit/29b08d9765c79285ee899d5b0166728cec69e56a
Submitter: "Zuul (22348)"
Branch: stable/victoria

commit 29b08d9765c79285ee899d5b0166728cec69e56a
Author: Daniel Alvarez Sanchez <email address hidden>
Date: Fri Feb 4 11:32:47 2022 +0100

    [ovn] Prevent stale ports in the OVN database

    Under a lot of load, there can be situations where all the Neutron
    workers have not updated their in-memory copy of the NB database
    in time before certain operations.

    This scenario can lead to stale resources when a somewhat recently
    created port is attempted to be deleted, but the worker handling
    this deletion doesn't know about the OVN port yet.

    This patch detects this condition and allows some time (at least one
    maintenance task cycle) before it deletes the OVN revision number.
    If the port then shows up in the OVN database within that window, then
    it will be deleted later by the maintenance task avoiding the stale
    ports. If not, the revision number row will be deleted and we won't
    stale these entries either.

    Conflicts:
            neutron/plugins/ml2/drivers/ovn/mech_driver/ovsdb/ovn_client.py

    Closes-Bug: #1960006
    Signed-off-by: Daniel Alvarez Sanchez <email address hidden>
    Change-Id: Ie4093dc6cd63b89e3a62363a4f805ef8287d15b9
    (cherry picked from commit be7331c8169c53e3900c9c1a08e12808cf5ed2ec)

Revision history for this message
Daniel Alvarez (dalvarezs) wrote :

@Frode, let me try to understand why this doesn't work for you.

While it's true that the patch doesn't prevent the stale ports immediately, it ensures that they'll be cleaned up in the next maintenance task cycle(s).

* Is the problem that you're still seeing with my patch that the ports stay stale forever (i.e. don't get cleaned up)? In that case I'd say it's a bug that we should fix quickly.

* Or, more precisely, is it that while a stale port exists for X minutes it creates this issue [0], which is a problem for that period of time? Can you confirm whether this is the case and that, after that time window, everything is OK?

Now, if I understand correctly, what you're proposing to avoid duplicate IP addresses may have negative effects. From what we could see, the stale ports correspond to deleted instances, so the VMs are not running anymore; hence the ports are unbound and there are no physical flows in the hypervisor for them.

With the current code base, you can still boot instances even with stale ports, and the stale resources will eventually get cleaned up; but if you add the pre-flight check, you'll prevent new instances from being created until the maintenance task fixes the issue.

Can you please elaborate further?

Thanks!
daniel

[0] https://bugs.launchpad.net/ubuntu/+source/ovn/+bug/1961046

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/ussuri)

Reviewed: https://review.opendev.org/c/openstack/neutron/+/828799
Committed: https://opendev.org/openstack/neutron/commit/8164c27d9ad1de198a5943d75ceff473d0218b3d
Submitter: "Zuul (22348)"
Branch: stable/ussuri

commit 8164c27d9ad1de198a5943d75ceff473d0218b3d
Author: Daniel Alvarez Sanchez <email address hidden>
Date: Fri Feb 4 11:32:47 2022 +0100

    [ovn] Prevent stale ports in the OVN database

    Under a lot of load, there can be situations where all the Neutron
    workers have not updated their in-memory copy of the NB database
    in time before certain operations.

    This scenario can lead to stale resources when a somewhat recently
    created port is attempted to be deleted, but the worker handling
    this deletion doesn't know about the OVN port yet.

    This patch detects this condition and allows some time (at least one
    maintenance task cycle) before it deletes the OVN revision number.
    If the port then shows up in the OVN database within that window, then
    it will be deleted later by the maintenance task avoiding the stale
    ports. If not, the revision number row will be deleted and we won't
    stale these entries either.

    Conflicts:
            neutron/plugins/ml2/drivers/ovn/mech_driver/ovsdb/ovn_client.py

    Closes-Bug: #1960006
    Signed-off-by: Daniel Alvarez Sanchez <email address hidden>
    Change-Id: Ie4093dc6cd63b89e3a62363a4f805ef8287d15b9
    (cherry picked from commit be7331c8169c53e3900c9c1a08e12808cf5ed2ec)

tags: added: in-stable-ussuri
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 20.0.0.0rc1

This issue was fixed in the openstack/neutron 20.0.0.0rc1 release candidate.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 17.4.0

This issue was fixed in the openstack/neutron 17.4.0 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 18.3.0

This issue was fixed in the openstack/neutron 18.3.0 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 19.2.0

This issue was fixed in the openstack/neutron 19.2.0 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/networking-ovn train-eol

This issue was fixed in the openstack/networking-ovn train-eol release.

Revision history for this message
Frode Nordahl (fnordahl) wrote :

@Daniel

Thank you for responding. I'd have to check in more detail whether stale ports are left around indefinitely or not, but I know that the operators of the CI cloud in question have to periodically run a script to remove stale ports, which may suggest some resources are missed. I'll endeavor to check.

More importantly, I disagree with the premise of the current approach. You can technically boot an instance that uses a resource that currently has duplicates in the OVN DB, but the fact that the instance does not have N/S connectivity for a long period of time makes that point moot.

The stale resource most likely belonged to a deleted VM, as you point out, but that does not matter when Neutron allocates the same IP for a new instance before the stale resource has been cleaned out.

For the thing that deploys the instance, let's say a functional test executor, this just becomes a weird failure scenario: it can talk to the instance, but the instance cannot talk to the internet to install packages, download OCI images, and so on.

Even if this situation were corrected after 5 minutes, that would not solve the problem. The whole test execution has most likely stalled or failed long before those 5 minutes have passed, and as a consequence the cloud is perceived as unreliable.

I'm not quite sure which negative effects you are referring to that adding a pre-flight check would cause.

If you are referring to a negative response from the API causing issues for the caller, I actually think that would be better than provisioning something that does not actually work. The caller can always retry the API request.

I would hope for the code change to be able to remedy the situation rather than return a negative response, though.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron ussuri-eol

This issue was fixed in the openstack/neutron ussuri-eol release.
