[OVN] Deadlock when starting neutron server, during the OVN hash ring deletion

Bug #1990174 reported by Rodolfo Alonso
This bug affects 1 person
Affects: neutron
Status: Fix Released
Importance: Medium
Assigned to: Lucas Alvares Gomes
Milestone: (not set)

Bug Description

Related bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2125842

Description of problem:
The Neutron server often fails to start, and systemd needs to restart it. This is a problem at scale because all workers then need to reconnect to the OVN DBs.

How reproducible:
50%

Steps to Reproduce:
1. Start neutron server

Error log: https://paste.opendev.org/show/bm3jZZ1oWX7ihK8JXzdE/

Changed in neutron:
assignee: nobody → Rodolfo Alonso (rodolfo-alonso-hernandez)
importance: Undecided → Medium
Rodolfo Alonso (rodolfo-alonso-hernandez) wrote :

The "ovn_hash_ring" register deletion [1] is done during the OVN mech driver initialization. This initialization is performed before any API worker is spawned. That means the Neutron service (and its workers) cannot interfere with the database access.

However, in an HA scenario, several servers can be rebooted at the same time. If the OVN database sync is not performed quickly, the API workers will call "_load_hash_ring" [2] each time a new event is received, in order to check whether the event needs to be notified. This method executes a Neutron database (NOTE 1) read on the "ovn_hash_ring" table. In some environments, we have seen up to a quarter of a million calls to this method, from one single controller node, during the transient period while the OVN database is being locally cached.

These continuous Neutron database read calls can block a write (delete) operation, as reported. The proposed idea (that will be posted in a patch) is to move the table clean up method out of the mech driver initialization and add it as a periodic method in the "HashRingHealthCheckPeriodics" class.

The issue to be resolved is how to deal with stale/old "ovn_hash_ring" registers before the clean up method has been called.

NOTE 1: Neutron database != OVN database. The Neutron database is the SQL type database that stores the Neutron information, not related to any specific network backend.

[1] https://github.com/openstack/neutron/blob/0a6b9cc39527c13fc663709ae9da3ac033407423/neutron/plugins/ml2/drivers/ovn/mech_driver/mech_driver.py#L128
[2] https://github.com/openstack/neutron/blob/d0e33b9f3212d30654470ea5f218c6a8fb1662bf/neutron/common/ovn/hash_ring_manager.py#L75-L94
[3] https://github.com/openstack/neutron/blob/d0e33b9f3212d30654470ea5f218c6a8fb1662bf/neutron/db/ovn_hash_ring_db.py#L61
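The proposed idea (run the clean up as a periodic task that is retried until the delete goes through) can be sketched as follows. This is only an illustration under stated assumptions, not the Neutron code: sqlite3 stands in for the Neutron SQL database, the table has just two columns, and "CleanupDone", "cleanup_stale_nodes" and "run_periodic" are hypothetical names.

```python
import sqlite3


class CleanupDone(Exception):
    """Raised to tell the scheduler that the task must not run again."""


def cleanup_stale_nodes(conn, hostname):
    """Delete the hash ring rows of ``hostname``; signal completion on success."""
    with conn:  # commits on success, rolls back if the DELETE raises
        conn.execute(
            "DELETE FROM ovn_hash_ring WHERE hostname = ?", (hostname,)
        )
    raise CleanupDone()


def run_periodic(task, max_runs, *args):
    """Call ``task`` up to ``max_runs`` times and return the attempts made.

    The loop stands in for a periodic scheduler: each iteration is one
    period. The task is retried only when the database raises an error.
    """
    for attempt in range(1, max_runs + 1):
        try:
            task(*args)
        except CleanupDone:
            return attempt
        except sqlite3.OperationalError:
            # For example "database is locked": try again next period.
            continue
    return max_runs
```

With this shape a failed delete does not abort the server start; it is simply attempted again on the next period.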

OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/neutron/+/858542

Changed in neutron:
status: New → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.opendev.org/c/openstack/neutron/+/858542
Committed: https://opendev.org/openstack/neutron/commit/819a1bb3e6f3b10a1887e2ef836c138e02f8b996
Submitter: "Zuul (22348)"
Branch: master

commit 819a1bb3e6f3b10a1887e2ef836c138e02f8b996
Author: Rodolfo Alonso Hernandez <email address hidden>
Date: Tue Sep 20 13:32:04 2022 +0200

    Move the "ovn_hash_ring" clean up to maintenance worker

    The "ovn_hash_ring" procedure to clean up the stale/old registers
    is now executed in the ``HashRingHealthCheckPeriodics`` class, which
    runs in the ``MaintenanceWorker`` process.

    In an HA scenario, if several servers are rebooted at the same time,
    the "ovn_hash_ring" clean up operation can clash with the API worker
    method "_load_hash_ring", which executes a SQL read from this table.
    In some highly loaded environments, if the OVN database takes time
    to be locally cached, this read operation is executed thousands of
    times; basically any time an OVN database event occurs.

    In order to avoid/skip a deadlock when deleting the "ovn_hash_ring"
    table, this clean up is executed in a periodic task. If this task
    succeeds, the task is stopped. If the task raises a database
    exception, it is processed again.

    Now the "ovn_hash_ring" registers are retrieved using the
    "created_at" time as a filter. The initial time is taken when the
    OVN mechanism driver is initialized, before any API worker is spawned
    and any new "ovn_hash_ring" register has been created (an API
    worker, when started, will create a new "ovn_hash_ring" register).
    Any stale/old register stored in this table will be ignored; that
    means any register created before the OVN mechanism driver was
    started.

    Closes-Bug: #1990174

    Change-Id: I07c4cb6e20b8a84e4ace7a8e34555aced5b5da9f
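The "created_at" filter described in this commit message can be illustrated with a small, self-contained sketch. It is not the Neutron implementation: sqlite3 stands in for the Neutron SQL database, the table is reduced to three columns, and "active_nodes", the node names and the timestamps are made up for the example.

```python
import sqlite3
from datetime import datetime, timedelta, timezone


def active_nodes(conn, group_name, started_at):
    """Return node UUIDs of ``group_name`` created at or after ``started_at``."""
    rows = conn.execute(
        "SELECT node_uuid FROM ovn_hash_ring "
        "WHERE group_name = ? AND created_at >= ? ORDER BY node_uuid",
        (group_name, started_at.isoformat()),
    )
    return [row[0] for row in rows]


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ovn_hash_ring ("
    "node_uuid TEXT PRIMARY KEY, group_name TEXT, created_at TEXT)"
)

# Hypothetical time at which the mechanism driver was initialized.
started_at = datetime(2022, 9, 20, 12, 0, tzinfo=timezone.utc)
stale = (started_at - timedelta(hours=1)).isoformat()
fresh = (started_at + timedelta(seconds=5)).isoformat()
conn.executemany(
    "INSERT INTO ovn_hash_ring VALUES (?, ?, ?)",
    [
        ("stale-node", "mechanism_driver", stale),  # left by a previous run
        ("fresh-node", "mechanism_driver", fresh),  # created by a new worker
    ],
)

print(active_nodes(conn, "mechanism_driver", started_at))  # ['fresh-node']
```

The stale row is still present in the table, but readers never see it, so the delete can happen later without affecting the ring.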

Changed in neutron:
status: In Progress → Fix Released
Lucas Alvares Gomes (lucasagomes) wrote :

Re-opening this as the fix has been reverted at https://review.opendev.org/c/openstack/neutron/+/858908

Changed in neutron:
status: Fix Released → Confirmed
assignee: Rodolfo Alonso (rodolfo-alonso-hernandez) → Lucas Alvares Gomes (lucasagomes)
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/neutron/+/860934

Changed in neutron:
status: Confirmed → In Progress
Rodolfo Alonso (rodolfo-alonso-hernandez) wrote :

Thanks Lucas for taking care of this LP bug.

OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.opendev.org/c/openstack/neutron/+/860934
Committed: https://opendev.org/openstack/neutron/commit/b7b8f7c571440577a40aacf9d8d93abc3a5a48b3
Submitter: "Zuul (22348)"
Branch: master

commit b7b8f7c571440577a40aacf9d8d93abc3a5a48b3
Author: Lucas Alvares Gomes <email address hidden>
Date: Tue Oct 11 11:08:23 2022 +0100

    [OVN] Avoid deadlock when cleaning hash ring nodes

    This patch avoids the clash of the hash ring cleaning operation and the
    API workers by ensuring that the cleaning happens before the nodes for
    that host are added to the ring and the connections to the OVSDBs (meaning
    no events therefore no SELECTS on the hash ring table for that hostname).

    This patch does this by re-using the same hash ring lock that starts
    the probing thread. Now, the first worker that acquires the lock is
    responsible for cleaning the hash ring for its own host as well as
    starting the probing thread. Subsequent workers only need to register
    themselves to the hash ring.

    Change-Id: Iba73f7944592a003232eb397ba1d4da3dcba5c3a
    Closes-Bug: #1990174
    Signed-off-by: Lucas Alvares Gomes <email address hidden>
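The locking scheme described in this commit message can be sketched roughly as follows. This is a simplified model under stated assumptions, not the merged code: threads and a threading.Lock stand in for the API worker processes and the shared hash ring lock, a dict stands in for the "ovn_hash_ring" table, and "HashRing" and "setup_worker" are hypothetical names.

```python
import threading
import uuid


class HashRing:
    """In-memory stand-in for the "ovn_hash_ring" table plus its lock."""

    def __init__(self):
        self._lock = threading.Lock()  # plays the role of the hash ring lock
        self._cleaned_hosts = set()    # hosts whose first worker already ran
        self.nodes = {}                # node_uuid -> hostname
        self.cleanups = 0              # number of clean up passes executed

    def setup_worker(self, hostname):
        """Register one worker; the first one per host cleans up first."""
        with self._lock:
            if hostname not in self._cleaned_hosts:
                # First worker for this host: drop the leftovers of the
                # previous run before any new node is added to the ring.
                self.nodes = {
                    node: host
                    for node, host in self.nodes.items()
                    if host != hostname
                }
                self.cleanups += 1
                self._cleaned_hosts.add(hostname)
                # The probing thread would be started here.
            node_uuid = str(uuid.uuid4())
            self.nodes[node_uuid] = hostname
            return node_uuid
```

Because the clean up and the registration happen under the same lock, no worker of that host can be reading the ring while its stale rows are being deleted.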

Changed in neutron:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/zed)

Fix proposed to branch: stable/zed
Review: https://review.opendev.org/c/openstack/neutron/+/861381

OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/yoga)

Fix proposed to branch: stable/yoga
Review: https://review.opendev.org/c/openstack/neutron/+/861382

OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/xena)

Fix proposed to branch: stable/xena
Review: https://review.opendev.org/c/openstack/neutron/+/861383

OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/wallaby)

Fix proposed to branch: stable/wallaby
Review: https://review.opendev.org/c/openstack/neutron/+/861384

OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/victoria)

Fix proposed to branch: stable/victoria
Review: https://review.opendev.org/c/openstack/neutron/+/861385

OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/ussuri)

Fix proposed to branch: stable/ussuri
Review: https://review.opendev.org/c/openstack/neutron/+/861386

OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/zed)

Reviewed: https://review.opendev.org/c/openstack/neutron/+/861381
Committed: https://opendev.org/openstack/neutron/commit/3f8b96ec647eb4d660064204019f3a0a8e18e4b6
Submitter: "Zuul (22348)"
Branch: stable/zed

commit 3f8b96ec647eb4d660064204019f3a0a8e18e4b6
Author: Lucas Alvares Gomes <email address hidden>
Date: Tue Oct 11 11:08:23 2022 +0100

    [OVN] Avoid deadlock when cleaning hash ring nodes

    This patch avoids the clash of the hash ring cleaning operation and the
    API workers by ensuring that the cleaning happens before the nodes for
    that host are added to the ring and the connections to the OVSDBs (meaning
    no events therefore no SELECTS on the hash ring table for that hostname).

    This patch does this by re-using the same hash ring lock that starts
    the probing thread. Now, the first worker that acquires the lock is
    responsible for cleaning the hash ring for its own host as well as
    starting the probing thread. Subsequent workers only need to register
    themselves to the hash ring.

    Change-Id: Iba73f7944592a003232eb397ba1d4da3dcba5c3a
    Closes-Bug: #1990174
    Signed-off-by: Lucas Alvares Gomes <email address hidden>
    (cherry picked from commit b7b8f7c571440577a40aacf9d8d93abc3a5a48b3)

tags: added: in-stable-zed
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/ussuri)

Reviewed: https://review.opendev.org/c/openstack/neutron/+/861386
Committed: https://opendev.org/openstack/neutron/commit/2adb471f7575ddcd6af146f84f23bb6f916f79b9
Submitter: "Zuul (22348)"
Branch: stable/ussuri

commit 2adb471f7575ddcd6af146f84f23bb6f916f79b9
Author: Lucas Alvares Gomes <email address hidden>
Date: Tue Oct 11 11:08:23 2022 +0100

    [OVN] Avoid deadlock when cleaning hash ring nodes

    This patch avoids the clash of the hash ring cleaning operation and the
    API workers by ensuring that the cleaning happens before the nodes for
    that host are added to the ring and the connections to the OVSDBs (meaning
    no events therefore no SELECTS on the hash ring table for that hostname).

    This patch does this by re-using the same hash ring lock that starts
    the probing thread. Now, the first worker that acquires the lock is
    responsible for cleaning the hash ring for its own host as well as
    starting the probing thread. Subsequent workers only need to register
    themselves to the hash ring.

    Conflicts:
      neutron/plugins/ml2/drivers/ovn/mech_driver/mech_driver.py

    Change-Id: Iba73f7944592a003232eb397ba1d4da3dcba5c3a
    Closes-Bug: #1990174
    Signed-off-by: Lucas Alvares Gomes <email address hidden>
    (cherry picked from commit b7b8f7c571440577a40aacf9d8d93abc3a5a48b3)

tags: added: in-stable-ussuri
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/xena)

Reviewed: https://review.opendev.org/c/openstack/neutron/+/861383
Committed: https://opendev.org/openstack/neutron/commit/921713cbdcf8e96a8435349325c068df209571af
Submitter: "Zuul (22348)"
Branch: stable/xena

commit 921713cbdcf8e96a8435349325c068df209571af
Author: Lucas Alvares Gomes <email address hidden>
Date: Tue Oct 11 11:08:23 2022 +0100

    [OVN] Avoid deadlock when cleaning hash ring nodes

    This patch avoids the clash of the hash ring cleaning operation and the
    API workers by ensuring that the cleaning happens before the nodes for
    that host are added to the ring and the connections to the OVSDBs (meaning
    no events therefore no SELECTS on the hash ring table for that hostname).

    This patch does this by re-using the same hash ring lock that starts
    the probing thread. Now, the first worker that acquires the lock is
    responsible for cleaning the hash ring for its own host as well as
    starting the probing thread. Subsequent workers only need to register
    themselves to the hash ring.

    Conflicts:
      neutron/plugins/ml2/drivers/ovn/mech_driver/mech_driver.py

    Change-Id: Iba73f7944592a003232eb397ba1d4da3dcba5c3a
    Closes-Bug: #1990174
    Signed-off-by: Lucas Alvares Gomes <email address hidden>
    (cherry picked from commit b7b8f7c571440577a40aacf9d8d93abc3a5a48b3)

tags: added: in-stable-xena
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/yoga)

Reviewed: https://review.opendev.org/c/openstack/neutron/+/861382
Committed: https://opendev.org/openstack/neutron/commit/dd7fc476841649df31e868fb607475f9b2adb61a
Submitter: "Zuul (22348)"
Branch: stable/yoga

commit dd7fc476841649df31e868fb607475f9b2adb61a
Author: Lucas Alvares Gomes <email address hidden>
Date: Tue Oct 11 11:08:23 2022 +0100

    [OVN] Avoid deadlock when cleaning hash ring nodes

    This patch avoids the clash of the hash ring cleaning operation and the
    API workers by ensuring that the cleaning happens before the nodes for
    that host are added to the ring and the connections to the OVSDBs (meaning
    no events therefore no SELECTS on the hash ring table for that hostname).

    This patch does this by re-using the same hash ring lock that starts
    the probing thread. Now, the first worker that acquires the lock is
    responsible for cleaning the hash ring for its own host as well as
    starting the probing thread. Subsequent workers only need to register
    themselves to the hash ring.

    Conflicts:
      neutron/plugins/ml2/drivers/ovn/mech_driver/mech_driver.py

    Change-Id: Iba73f7944592a003232eb397ba1d4da3dcba5c3a
    Closes-Bug: #1990174
    Signed-off-by: Lucas Alvares Gomes <email address hidden>
    (cherry picked from commit b7b8f7c571440577a40aacf9d8d93abc3a5a48b3)

tags: added: in-stable-yoga
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/wallaby)

Reviewed: https://review.opendev.org/c/openstack/neutron/+/861384
Committed: https://opendev.org/openstack/neutron/commit/049435eeabaef2f2feedff6889d99ff5426a4fdf
Submitter: "Zuul (22348)"
Branch: stable/wallaby

commit 049435eeabaef2f2feedff6889d99ff5426a4fdf
Author: Lucas Alvares Gomes <email address hidden>
Date: Tue Oct 11 11:08:23 2022 +0100

    [OVN] Avoid deadlock when cleaning hash ring nodes

    This patch avoids the clash of the hash ring cleaning operation and the
    API workers by ensuring that the cleaning happens before the nodes for
    that host are added to the ring and the connections to the OVSDBs (meaning
    no events therefore no SELECTS on the hash ring table for that hostname).

    This patch does this by re-using the same hash ring lock that starts
    the probing thread. Now, the first worker that acquires the lock is
    responsible for cleaning the hash ring for its own host as well as
    starting the probing thread. Subsequent workers only need to register
    themselves to the hash ring.

    Conflicts:
      neutron/plugins/ml2/drivers/ovn/mech_driver/mech_driver.py

    Change-Id: Iba73f7944592a003232eb397ba1d4da3dcba5c3a
    Closes-Bug: #1990174
    Signed-off-by: Lucas Alvares Gomes <email address hidden>
    (cherry picked from commit b7b8f7c571440577a40aacf9d8d93abc3a5a48b3)

tags: added: in-stable-wallaby
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/victoria)

Reviewed: https://review.opendev.org/c/openstack/neutron/+/861385
Committed: https://opendev.org/openstack/neutron/commit/c336814d0dcdb6016bdd3e6f5547214ff8d2f27b
Submitter: "Zuul (22348)"
Branch: stable/victoria

commit c336814d0dcdb6016bdd3e6f5547214ff8d2f27b
Author: Lucas Alvares Gomes <email address hidden>
Date: Tue Oct 11 11:08:23 2022 +0100

    [OVN] Avoid deadlock when cleaning hash ring nodes

    This patch avoids the clash of the hash ring cleaning operation and the
    API workers by ensuring that the cleaning happens before the nodes for
    that host are added to the ring and the connections to the OVSDBs (meaning
    no events therefore no SELECTS on the hash ring table for that hostname).

    This patch does this by re-using the same hash ring lock that starts
    the probing thread. Now, the first worker that acquires the lock is
    responsible for cleaning the hash ring for its own host as well as
    starting the probing thread. Subsequent workers only need to register
    themselves to the hash ring.

    Conflicts:
      neutron/plugins/ml2/drivers/ovn/mech_driver/mech_driver.py

    Change-Id: Iba73f7944592a003232eb397ba1d4da3dcba5c3a
    Closes-Bug: #1990174
    Signed-off-by: Lucas Alvares Gomes <email address hidden>
    (cherry picked from commit b7b8f7c571440577a40aacf9d8d93abc3a5a48b3)

tags: added: in-stable-victoria
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 18.6.0

This issue was fixed in the openstack/neutron 18.6.0 release.

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/networking-ovn train-eol

This issue was fixed in the openstack/networking-ovn train-eol release.

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 19.5.0

This issue was fixed in the openstack/neutron 19.5.0 release.

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 22.0.0.0rc1

This issue was fixed in the openstack/neutron 22.0.0.0rc1 release candidate.

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 20.3.0

This issue was fixed in the openstack/neutron 20.3.0 release.

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 21.1.0

This issue was fixed in the openstack/neutron 21.1.0 release.

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron ussuri-eol

This issue was fixed in the openstack/neutron ussuri-eol release.

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron victoria-eom

This issue was fixed in the openstack/neutron victoria-eom release.
