[OVN] Stale ports can be present in OVN NB leading to metadata errors

Bug #1874733 reported by Daniel Alvarez
12
This bug affects 2 people
Affects Status Importance Assigned to Milestone
neutron
Fix Released
High
Lucas Alvares Gomes

Bug Description

Right now, there's a chance that deleting a port in Neutron with ML2/OVN actually deletes the object from Neutron DB while leaving a stale port in the OVN NB database.

This can happen when deleting a port [0] raises a RowNotFound exception. While it may look like it'd mean that the port didn't exist already in OVN NB truth is that the current port_delete function can throw that exception for different reasons (especially against OVN < 2.10 when Address Sets were used instead of Port Groups).

Such exception can be observed for example if some ACL or Address Set doesn't exist [1][2] amongst others. In this case, the revision number of the object will be deleted [3] and the port will be stale forever in OVN NB (it'll be skipped by the maintenance task).

One of the main impacts of this issue is that the OVN NB database will grow and have stale objects that are undetected (they'll be detected by the neutron-ovn-db-sync-script) but most importantly, that multiple ports in the same OVN Logical Switch may have the same IP addresses and this cause legitimate ports to be left without Metadata.

As per metadata agent code here [4] if more than one port in the same network has the same IP address, a 404 will be returned back to the instance upon requesting metadata.

The workaround is running the neutron-db-sync script in repair mode to get rid of the stale ports.

A proper fix would involve a better granularity of the exceptions that can happen around a port deletion and acting accordingly upon each of them. In the worst case, we won't be deleting the revision number if the port still exists leaving up to the Maintenance task to fix it later on (< 5 minutes). Ideally, we should identify all possible code paths and delete the port from OVN whenever possible even if some other associated operation fails (with proper logging).

Also, this scenario seems to be more likely under a high concurrency of API operations (such as heat) and possibly when Port Groups are not supported by the schema (OVN < 2.10).

Danie Alvarez

[0] https://github.com/openstack/neutron/blob/99774a0465bce893e0b7178fe83fe1985432c704/neutron/plugins/ml2/drivers/ovn/mech_driver/ovsdb/ovn_client.py#L719
[1] https://github.com/openstack/neutron/blob/99774a0465bce893e0b7178fe83fe1985432c704/neutron/plugins/ml2/drivers/ovn/mech_driver/ovsdb/ovn_client.py#L680
[2] https://github.com/openstack/neutron/blob/99774a0465bce893e0b7178fe83fe1985432c704/neutron/plugins/ml2/drivers/ovn/mech_driver/ovsdb/ovn_client.py#L690
[3] https://github.com/openstack/neutron/blob/99774a0465bce893e0b7178fe83fe1985432c704/neutron/plugins/ml2/drivers/ovn/mech_driver/ovsdb/ovn_client.py#L722
[4] https://github.com/openstack/neutron/blob/99774a0465bce893e0b7178fe83fe1985432c704/neutron/agent/ovn/metadata/server.py#L86

tags: added: ovn
Changed in neutron:
assignee: nobody → Maciej Jozefczyk (maciej.jozefczyk)
importance: Undecided → High
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.opendev.org/722789

Changed in neutron:
assignee: Maciej Jozefczyk (maciej.jozefczyk) → Lucas Alvares Gomes (lucasagomes)
status: New → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/722793

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Related fix proposed to branch: master
Review: https://review.opendev.org/723404

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.opendev.org/722789
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=ef2260441d56036b42fc71b0f03b70a4a02fd954
Submitter: Zuul
Branch: master

commit ef2260441d56036b42fc71b0f03b70a4a02fd954
Author: Lucas Alvares Gomes <email address hidden>
Date: Fri Apr 24 14:39:20 2020 +0100

    [OVN] Do not delete port's revision on RowNotFound

    The delete_port() method from OVNClient has a potential problem of
    leaving stale ports when RowNotFound is raised from the process to
    delete the port from the OVN database. Since the exception is not
    granular enough, the RowNotFound could be raised from other objects that
    are part of the same transaction (such as ACLs, DNS entries, etc...)
    resulting in the revision for the port being deleted even tho the port
    is still in the database.

    Instead of giving a pass on the RowNotFound exception, this patch is
    logging the error and re-raising it without deleting the revision.

    Change-Id: I25b93b7c080403fc38365b638e4e03298b447d0f
    Partial-Bug: #1874733
    Signed-off-by: Lucas Alvares Gomes <email address hidden>

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.opendev.org/724825

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (master)

Reviewed: https://review.opendev.org/722793
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=591adfee971906dc86e9357903f164d01657ed65
Submitter: Zuul
Branch: master

commit 591adfee971906dc86e9357903f164d01657ed65
Author: Daniel Alvarez <email address hidden>
Date: Fri Apr 24 15:33:11 2020 +0200

    [OVN][metadata] Adding ERROR trace upon unexpected data

    When a metadata request arrives to the metadata agent, OVN
    will need to figure out which port it corresponds to and it'll do
    based on the network ID and IP address which should be a unique pair.

    If it's not unique or it doesn't exist, then there is an error
    and an ERROR trace is logged.

    Change-Id: Id83c2899a03af98e32b0be7ba6ad661e646f07bc
    Related-Bug: 1874733
    Signed-off-by: Daniel Alvarez <email address hidden>

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (master)

Change abandoned by Maciej Józefczyk (<email address hidden>) on branch: master
Review: https://review.opendev.org/723404

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.opendev.org/724825
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=8b234d8786072c9bc83d19e0cde3927fc5ccf8f8
Submitter: Zuul
Branch: master

commit 8b234d8786072c9bc83d19e0cde3927fc5ccf8f8
Author: Lucas Alvares Gomes <email address hidden>
Date: Fri May 1 13:12:53 2020 +0100

    [OVN]: Make _delete_port() more error-resilent

    This patch is making the transaction from the _delete_port() method in
    OVNClient more resilient to errors where elements from that transaction
    have already been deleted by another change in the database.

    Prior to this patch, a few places could potentially raise RowNotFound
    which would abort the whole transaction and would leave the a stable
    port for being cleaned up after by the maintenance thread. This patch
    tries to catch those exceptions that could potentially fail the
    transaction.

    Change-Id: I8fd1d1485269d23529a19085bd4aa4c6c74f5f91
    Partial-Bug: #1874733
    Signed-off-by: Lucas Alvares Gomes <email address hidden>

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (stable/ussuri)

Related fix proposed to branch: stable/ussuri
Review: https://review.opendev.org/727978

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/ussuri)

Fix proposed to branch: stable/ussuri
Review: https://review.opendev.org/727979

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/ussuri)

Reviewed: https://review.opendev.org/727979
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=bc223bcb9354367d2b5da4ef5ec7a9f39ef9a3b4
Submitter: Zuul
Branch: stable/ussuri

commit bc223bcb9354367d2b5da4ef5ec7a9f39ef9a3b4
Author: Lucas Alvares Gomes <email address hidden>
Date: Fri May 1 13:12:53 2020 +0100

    [OVN]: Make _delete_port() more error-resilent

    This patch is making the transaction from the _delete_port() method in
    OVNClient more resilient to errors where elements from that transaction
    have already been deleted by another change in the database.

    Prior to this patch, a few places could potentially raise RowNotFound
    which would abort the whole transaction and would leave the a stable
    port for being cleaned up after by the maintenance thread. This patch
    tries to catch those exceptions that could potentially fail the
    transaction.

    Change-Id: I8fd1d1485269d23529a19085bd4aa4c6c74f5f91
    Partial-Bug: #1874733
    Signed-off-by: Lucas Alvares Gomes <email address hidden>
    (cherry picked from commit 8b234d8786072c9bc83d19e0cde3927fc5ccf8f8)

tags: added: in-stable-ussuri
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (stable/ussuri)

Reviewed: https://review.opendev.org/727978
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=14aea286e699ea0a2bb9aff56c76420763edadc8
Submitter: Zuul
Branch: stable/ussuri

commit 14aea286e699ea0a2bb9aff56c76420763edadc8
Author: Daniel Alvarez <email address hidden>
Date: Fri Apr 24 15:33:11 2020 +0200

    [OVN][metadata] Adding ERROR trace upon unexpected data

    When a metadata request arrives to the metadata agent, OVN
    will need to figure out which port it corresponds to and it'll do
    based on the network ID and IP address which should be a unique pair.

    If it's not unique or it doesn't exist, then there is an error
    and an ERROR trace is logged.

    Change-Id: Id83c2899a03af98e32b0be7ba6ad661e646f07bc
    Related-Bug: 1874733
    Signed-off-by: Daniel Alvarez <email address hidden>
    (cherry picked from commit 591adfee971906dc86e9357903f164d01657ed65)

tags: added: neutron-proactive-backport-potential
Changed in neutron:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/ussuri)

Fix proposed to branch: stable/ussuri
Review: https://review.opendev.org/c/openstack/neutron/+/828798

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/ussuri)

Reviewed: https://review.opendev.org/c/openstack/neutron/+/828798
Committed: https://opendev.org/openstack/neutron/commit/f016548d2edb17e531f4ee29fdeb75b123d31f27
Submitter: "Zuul (22348)"
Branch: stable/ussuri

commit f016548d2edb17e531f4ee29fdeb75b123d31f27
Author: Lucas Alvares Gomes <email address hidden>
Date: Fri Apr 24 14:39:20 2020 +0100

    [OVN] Do not delete port's revision on RowNotFound

    The delete_port() method from OVNClient has a potential problem of
    leaving stale ports when RowNotFound is raised from the process to
    delete the port from the OVN database. Since the exception is not
    granular enough, the RowNotFound could be raised from other objects that
    are part of the same transaction (such as ACLs, DNS entries, etc...)
    resulting in the revision for the port being deleted even tho the port
    is still in the database.

    Instead of giving a pass on the RowNotFound exception, this patch is
    logging the error and re-raising it without deleting the revision.

    Change-Id: I25b93b7c080403fc38365b638e4e03298b447d0f
    Partial-Bug: #1874733
    Signed-off-by: Lucas Alvares Gomes <email address hidden>

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.